Comparison of Convolution Types in CNN-based Feature Extraction for Sound Source Localization

2021 
This paper presents an overview of several approaches to convolutional feature extraction in the context of deep neural network (DNN) based sound source localization. Different ways of processing multichannel audio data in the time-frequency domain with convolutional neural networks (CNNs) are described and tested with the aim of providing a comparative study of their performance. In most of the considered approaches, models are trained on the phase and magnitude components of the Short-Time Fourier Transform (STFT). In addition to state-of-the-art 2D convolutional layers, we investigate several solutions for processing 3D matrices containing the multichannel complex representation of the microphone signals. The first two proposed approaches are 3D convolutions and depthwise separable convolutions, in which two types of filters are used to exploit information within and between the channels. Note that this paper presents the first application of depthwise separable convolutions to the task of sound source localization. The third approach is based on complex-valued neural networks, which allow convolutions to be performed directly on complex signal representations. Experiments are conducted on two synthetic datasets containing noise and speech signals recorded with a tetrahedral microphone array. The paper presents the results obtained with all investigated model types and discusses the resulting accuracy and computational complexity in DNN-based source localization.
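To make the compared convolution types concrete, the sketch below shows how each one could be applied to a toy multichannel STFT tensor. This is a minimal illustration in PyTorch, not the paper's actual architecture: the batch size, microphone count, feature-map widths, and kernel sizes are all assumptions chosen only to demonstrate the tensor layouts.

```python
# Minimal sketch of the convolution variants compared in the paper.
# All shapes and layer widths are illustrative assumptions.
import torch
import torch.nn as nn

B, M, T, F = 1, 4, 100, 257          # batch, microphones, time frames, frequency bins

# Magnitude and phase of each microphone stacked along the channel axis: (B, 2*M, T, F)
x_2d = torch.randn(B, 2 * M, T, F)

# (1) Standard 2D convolution over time-frequency, mixing all input channels at once.
conv2d = nn.Conv2d(2 * M, 64, kernel_size=3, padding=1)

# (2) 3D convolution: microphones form an extra depth axis, input (B, 2, M, T, F),
#     so the kernel spans neighbouring microphones as well as time and frequency.
x_3d = torch.randn(B, 2, M, T, F)
conv3d = nn.Conv3d(2, 64, kernel_size=3, padding=1)

# (3) Depthwise separable convolution: a per-channel (depthwise) filter exploits
#     information within each channel, then a 1x1 (pointwise) filter mixes
#     information between channels.
depthwise = nn.Conv2d(2 * M, 2 * M, kernel_size=3, padding=1, groups=2 * M)
pointwise = nn.Conv2d(2 * M, 64, kernel_size=1)

# (4) Complex-valued convolution expressed with two real convolutions:
#     (a + ib) * (Wr + iWi) = (a*Wr - b*Wi) + i(a*Wi + b*Wr)
class ComplexConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, **kw):
        super().__init__()
        self.wr = nn.Conv2d(in_ch, out_ch, **kw)   # real part of the kernel
        self.wi = nn.Conv2d(in_ch, out_ch, **kw)   # imaginary part of the kernel

    def forward(self, real, imag):
        return self.wr(real) - self.wi(imag), self.wi(real) + self.wr(imag)

# Real/imaginary parts of the M-channel complex STFT, each of shape (B, M, T, F)
re, im = torch.randn(B, M, T, F), torch.randn(B, M, T, F)
cconv = ComplexConv2d(M, 64, kernel_size=3, padding=1)

print(conv2d(x_2d).shape)                # torch.Size([1, 64, 100, 257])
print(conv3d(x_3d).shape)                # torch.Size([1, 64, 4, 100, 257])
print(pointwise(depthwise(x_2d)).shape)  # torch.Size([1, 64, 100, 257])
print(cconv(re, im)[0].shape)            # torch.Size([1, 64, 100, 257])
```

The depthwise/pointwise split and the two-real-convolution form of the complex layer illustrate the general idea of separating "within-channel" from "between-channel" processing; the paper's exact filter configurations may differ.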