Gender classification from speech using convolutional networks augmented with synthetic spectrograms

Automatic gender classification from speech is an integral component of human-computer interfaces. Gender information is utilized in user authentication, speech recognizers, and human-centered intelligent agents. This study focuses on gender classification from speech spectrograms using AlexNet-inspired 2D convolutional neural networks (CNNs) trained on real samples augmented with synthetic spectrograms. A generative adversarial network (GAN) is trained to produce synthetic male/female-like speech spectrograms. In limited-training-data experiments on LibriSpeech, augmenting a training set of 200 real samples with 800 synthetic samples reduces the equal error rate of the classifier from 23.7% to 1.0%. To further test the ‘quality’ of the generated samples, in a subsequent experiment, the real training samples are progressively replaced (rather than augmented) with synthetic samples at various ratios from 0 (all original samples preserved) to 1 (all original samples replaced by synthetic ones). Depending on the system setup, substituting between 50% and 90% of the original samples with synthetic ones is found to have minimal impact on classifier performance. Finally, viewing the input CNN layers as filters that select salient spectrogram features, the learned convolutional kernels and filter outputs are studied to understand which spectrogram areas receive prominent attention in the classifier.
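
The replacement experiment described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `replace_with_synthetic`, the random selection of which real samples to drop, and the absence of per-class balancing are all assumptions made here for illustration.

```python
import random


def replace_with_synthetic(real, synthetic, ratio, seed=0):
    """Return a training set of len(real) samples in which a fraction
    `ratio` of the real samples is replaced by synthetic ones.

    Hypothetical helper: ratio=0 keeps every real sample, ratio=1
    replaces all of them. Samples are dropped and drawn at random,
    which is an assumption, not a detail taken from the paper.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    n_replace = round(ratio * len(real))
    if n_replace > len(synthetic):
        raise ValueError("not enough synthetic samples for this ratio")

    rng = random.Random(seed)
    kept = rng.sample(real, len(real) - n_replace)  # real samples that survive
    drawn = rng.sample(synthetic, n_replace)        # synthetic substitutes
    mixed = kept + drawn
    rng.shuffle(mixed)                              # avoid ordering effects
    return mixed
```

For example, with 200 placeholder real samples and a pool of 800 placeholder synthetic samples, `replace_with_synthetic(real, synthetic, 0.5)` returns 200 samples of which 100 are synthetic, so the training set size stays fixed while the replacement ratio varies.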