SING: Symbol-to-Instrument Neural Generator

Authors:
Alexandre Défossez Facebook AI Research
Neil Zeghidour Facebook AI Research / École Normale Supérieure
Nicolas Usunier Facebook AI Research
Léon Bottou Facebook AI Research
Francis Bach INRIA - École Normale Supérieure

Introduction:

Recent progress in deep learning for audio synthesis opens the way to models that directly produce the waveform, shifting away from the traditional paradigm of relying on vocoders or MIDI synthesizers for speech or music generation. In this work, the authors study the more computationally efficient alternative of generating the waveform frame-by-frame with large strides. The authors present a lightweight neural audio synthesizer for the original task of generating musical notes given a desired instrument, pitch and velocity.
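As a rough illustration of what "frame-by-frame generation with large strides" means in practice, the sketch below emits a full overlapping frame of audio per decoding step, so one forward pass produces thousands of samples rather than one sample at a time as in autoregressive models. This is not the paper's actual architecture: the frame size of 1024 samples, the stride of 256 and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch, not the paper's architecture: a frame-wise decoder
# that turns one conditioning vector per frame into an overlapping
# 1024-sample waveform frame advanced by 256 samples per step. The
# transposed convolution performs the overlap-add, so a single forward
# pass emits thousands of samples at once.
class FrameWiseDecoder(nn.Module):
    def __init__(self, cond_dim=128, hidden=512, frame_size=1024, stride=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(cond_dim, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=1), nn.ReLU(),
        )
        self.to_wave = nn.ConvTranspose1d(hidden, 1,
                                          kernel_size=frame_size, stride=stride)

    def forward(self, cond):  # cond: (batch, cond_dim, n_frames)
        return self.to_wave(self.body(cond)).squeeze(1)  # (batch, n_samples)

# 100 conditioning frames -> (100 - 1) * 256 + 1024 = 26368 samples in one pass.
wav = FrameWiseDecoder()(torch.randn(2, 128, 100))
print(wav.shape)  # torch.Size([2, 26368])
```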

Abstract:

Recent progress in deep learning for audio synthesis opens the way to models that directly produce the waveform, shifting away from the traditional paradigm of relying on vocoders or MIDI synthesizers for speech or music generation. Despite their successes, current state-of-the-art neural audio synthesizers such as WaveNet and SampleRNN suffer from prohibitive training and inference times because they are based on autoregressive models that generate audio samples one at a time at a rate of 16 kHz. In this work, we study the more computationally efficient alternative of generating the waveform frame-by-frame with large strides. We present a lightweight neural audio synthesizer for the original task of generating musical notes given desired instrument, pitch and velocity. Our model is trained end-to-end to generate notes from nearly 1000 instruments with a single decoder, thanks to a new loss function that minimizes the distances between the log spectrograms of the generated and target waveforms. On the generalization task of synthesizing notes for pairs of pitch and instrument not seen during training, SING produces audio with significantly improved perceptual quality compared to a state-of-the-art autoencoder based on WaveNet as measured by a Mean Opinion Score (MOS), and is about 32 times faster for training and 2,500 times faster for inference.
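The abstract describes a loss computed between the log spectrograms of the generated and target waveforms. A minimal sketch of such a spectral loss is shown below; the FFT size, hop length, the epsilon inside the logarithm and the choice of an L1 distance are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

# Minimal sketch of a spectral loss in the spirit of the abstract: compare
# log power spectrograms of generated and target waveforms. The FFT size,
# hop length, epsilon and L1 distance are illustrative assumptions.
def log_spectrogram_loss(generated, target, n_fft=1024, hop=256, eps=1.0):
    window = torch.hann_window(n_fft, device=generated.device)

    def log_power_spec(x):  # x: (batch, n_samples)
        spec = torch.stft(x, n_fft=n_fft, hop_length=hop,
                          window=window, return_complex=True)
        return torch.log(eps + spec.abs() ** 2)

    return (log_power_spec(generated) - log_power_spec(target)).abs().mean()

# Usage: both arguments are raw waveforms of shape (batch, n_samples).
loss = log_spectrogram_loss(torch.randn(4, 16000), torch.randn(4, 16000))
```

Because the loss is computed on spectrogram magnitudes rather than raw samples, it remains differentiable with respect to the generated waveform while being insensitive to small phase shifts, which is what makes frame-wise (non-autoregressive) generation feasible here.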
