Evaluation of Post-Processing Algorithms for Polyphonic Sound Event Detection

2019 
Sound event detection (SED) aims at identifying sound events in recordings (the audio tagging task) and then locating them in time (the segmentation task). The latter ends with the segmentation of the frame-level class predictions, which determines the onsets and offsets of the sound events. This step is often overlooked in scientific publications. In this paper, we focus on the post-processing algorithms used to identify sound event boundaries. Different post-processing steps are investigated: smoothing, thresholding, and optimization. In particular, we evaluate different approaches to temporal segmentation, namely statistics-based and parametric methods. Experiments were carried out on the DCASE 2018 challenge task 4 data. We compared post-processing algorithms applied to the temporal prediction curves of two models: one based on the challenge's baseline and one based on Multiple Instance Learning (MIL). Results show the crucial impact of the post-processing methods on the final detection scores. When using ground-truth audio tags to retain the final temporal predictions of interest, statistics-based methods yielded a 29.9% event-based F-score on the evaluation set with MIL. The best results, a 43.9% F-score, were obtained using class-dependent parametric methods. The post-processing methods and optimization algorithms have been compiled into a Python library named "aeseg".
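To make the smoothing-and-thresholding pipeline concrete, here is a minimal sketch of how a frame-level probability curve can be post-processed into event boundaries. This is an illustration only, not the aeseg API: the function name, the median-filter smoothing, the single threshold, and the 20 ms hop size are all assumptions for the example.

```python
import numpy as np

def segment_events(probs, threshold=0.5, win=5, hop_s=0.02):
    """Illustrative post-processing: smooth a frame-level probability
    curve, binarize it with a threshold, and return (onset, offset)
    times in seconds. Not the aeseg implementation."""
    # Median smoothing over a sliding window of `win` frames
    # (edge-padded so the output has the same length as the input).
    pad = win // 2
    padded = np.pad(probs, pad, mode="edge")
    smoothed = np.array([np.median(padded[i:i + win])
                         for i in range(len(probs))])
    # Thresholding into a binary activity curve.
    active = smoothed >= threshold
    # Rising edges are onsets, falling edges are offsets.
    edges = np.diff(active.astype(int), prepend=0, append=0)
    onsets = np.where(edges == 1)[0]
    offsets = np.where(edges == -1)[0]
    # Convert frame indices to seconds using the hop size.
    return [(on * hop_s, off * hop_s) for on, off in zip(onsets, offsets)]
```

A class-dependent parametric variant, as evaluated in the paper, would simply run this with a separate `threshold` (and possibly `win`) tuned per sound event class, e.g. by grid search on the validation F-score.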