HANME: Hierarchical Attention Network for Singing Melody Extraction

2021 
Singing melody extraction from polyphonic musical audio is a critical and challenging task in music information retrieval (MIR). Contextual frame-level information has proven effective for this task. However, existing works assign equal weight to each contextual frame, which may hinder further improvement in performance. To this end, we propose a hierarchical attention network for singing melody extraction (HANME) to extract discriminative attention-aware features and reduce the workload of the convolutional recurrent neural network (CRNN) in extracting local spatial and temporal features. Specifically, the first attention layer learns a context vector based on local spatial features extracted by a residual convolutional neural network (CNN), and the second attention layer learns a temporal context vector based on long-term features extracted by bidirectional gated recurrent units (BiGRU). Due to the scarcity of labeled training data, we further propose a partial parameter adaptation approach to address the imbalanced label distribution in this task. We train the model on the RWC dataset and part of the vocal tracks of the MedleyDB dataset, and evaluate performance on the ADC2004, MIREX 05, and MedleyDB datasets. The experimental study demonstrates the superiority of our method compared with other state-of-the-art approaches.
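
The following is a minimal PyTorch sketch of the two-level attention pipeline the abstract describes: a residual CNN stand-in producing frame-level spatial features, a first attention layer over those features, a BiGRU for long-term temporal features, and a second attention layer before per-frame pitch classification. All layer sizes, the additive-attention formulation, and the names (HANMESketch, AdditiveAttention) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveAttention(nn.Module):
    """Scores each frame, then re-weights frames by their normalized scores.

    A common additive-attention formulation, assumed here; the paper's exact
    context-vector computation may differ.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim)
        weights = F.softmax(self.score(torch.tanh(self.proj(x))), dim=1)
        return x * weights  # attention-aware features, same shape as x


class HANMESketch(nn.Module):
    def __init__(self, n_bins: int = 256, hidden: int = 128, n_pitches: int = 361):
        super().__init__()
        # Stand-in for the residual CNN: one conv block that yields per-frame
        # "local spatial" features; the paper's feature extractor is deeper.
        self.conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        feat_dim = 32 * n_bins
        self.spatial_attn = AdditiveAttention(feat_dim)      # first attention layer
        self.bigru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.temporal_attn = AdditiveAttention(2 * hidden)   # second attention layer
        self.classifier = nn.Linear(2 * hidden, n_pitches)   # per-frame pitch logits

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, frames, freq_bins), e.g. a spectrogram patch
        h = F.relu(self.conv(spec))                       # (B, 32, T, F)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)    # frame-level features
        h = self.spatial_attn(h)                          # attend over spatial features
        h, _ = self.bigru(h)                              # long-term temporal features
        h = self.temporal_attn(h)                         # attend over temporal context
        return self.classifier(h)                         # (B, T, n_pitches)


# Usage: logits = HANMESketch()(torch.randn(2, 1, 64, 256))  -> (2, 64, 361)
```

The design point the sketch illustrates is that attention, rather than uniform weighting, decides how much each contextual frame contributes at both the CNN and BiGRU stages, so the recurrent layer is not solely responsible for separating informative frames from uninformative ones.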