$$hf_0$$: A Hybrid Pitch Extraction Method for Multimodal Voice

2020 
Pitch, or fundamental frequency ( $$f_0$$ ), estimation is a long-studied problem with applications in speech processing and clinical practice. Existing $$f_0$$ estimation methods degrade when applied to real-time audio signals with varying $$f_0$$ modulations and in noisy (low-SNR) environments. In this work, we develop an $$f_0$$ estimation method that combines signal processing and deep learning. Specifically, we train a convolutional neural network to map a periodicity-rich input representation to pitch classes, using drastically fewer pitch classes than existing deep learning approaches. The accurate $$f_0$$ is then estimated from the nominal pitch classes using signal processing techniques. Experimental results show that the proposed method generalizes to unseen modulations of speech and to noisy signals (with various types of noise) on large-scale datasets. The proposed hybrid model also significantly reduces the number of learnable parameters compared to other methods. Furthermore, the evaluation measures show that the proposed method performs significantly better than state-of-the-art signal processing and deep learning approaches.
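The coarse-to-fine idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the classifier has already assigned a frame to a coarse pitch class, represented here as a hypothetical frequency band (`band_lo_hz`, `band_hi_hz`), and it uses autocorrelation peak picking with parabolic interpolation as a stand-in for the signal processing refinement, which the abstract does not specify. The function name `refine_f0` and all parameters are illustrative.

```python
import math


def refine_f0(frame, sample_rate, band_lo_hz, band_hi_hz):
    """Refine f0 inside a coarse pitch band (hypothetical classifier output).

    Stand-in refinement: pick the autocorrelation peak among the lags that
    correspond to the band, then apply parabolic interpolation around it.
    """
    # Higher frequency means shorter period, so the band maps to a lag range.
    lag_min = max(1, math.floor(sample_rate / band_hi_hz))
    lag_max = min(len(frame) - 2, math.ceil(sample_rate / band_lo_hz))
    if lag_min > lag_max:
        raise ValueError("frame too short for the requested band")

    def autocorr(lag):
        # Length-normalised autocorrelation at a single lag.
        n = len(frame) - lag
        return sum(frame[i] * frame[i + lag] for i in range(n)) / n

    # Evaluate one extra lag on each side so interpolation has neighbours.
    scores = {lag: autocorr(lag) for lag in range(lag_min - 1, lag_max + 2)}
    best = max(range(lag_min, lag_max + 1), key=lambda lag: scores[lag])

    # Parabolic interpolation gives a fractional lag estimate.
    a, b, c = scores[best - 1], scores[best], scores[best + 1]
    denom = a - 2.0 * b + c
    offset = 0.5 * (a - c) / denom if denom != 0 else 0.0
    return sample_rate / (best + offset)


# Usage: a synthetic 220 Hz tone, with an assumed coarse band of 200-250 Hz.
sr = 16000
tone = [math.sin(2.0 * math.pi * 220.0 * i / sr) for i in range(1024)]
estimate = refine_f0(tone, sr, 200.0, 250.0)  # expected to land near 220 Hz
```

Restricting the search to the lags of one band is what makes a small number of pitch classes workable in this sketch: the classifier only has to be right about the band, and the fine resolution comes from the refinement step.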