The Role of Long-Term Dependency in Synthetic Speech Detection

2022 
Although much progress has been made in synthetic speech detection, a comprehensive analysis of the essential differences between spoofed and genuine speech is still lacking. Here, we use the supervised contrastive loss, which originates from contrastive learning, as an analytical tool to characterize the class similarity structure of the ASVspoof 2019 logical access (LA) dataset. The analysis shows that an ideal back-end classifier for synthetic speech detection should be able to capture long-term dependencies. Recently, the Transformer has been shown to excel at learning long-term dependencies in input data. We therefore propose a back-end classifier based on the Transformer Encoder for synthetic speech detection. Convolution blocks are added before the Transformer Encoder; their inductive biases improve generalization. Compared with two-dimensional convolution, one-dimensional convolution makes better architectural assumptions about the input speech features, which helps with modeling long-term dependencies and reduces the risk of overfitting. The proposed Transformer combined with one-dimensional convolution has fewer parameters than most existing back-end classifiers, and achieves an equal error rate of 1.06% and a minimum tandem detection cost function of 0.0345 when evaluated on the ASVspoof 2019 LA dataset, making it one of the best-performing models reported in the literature.
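To make the proposed architecture concrete, below is a minimal PyTorch sketch of a back-end classifier that prepends one-dimensional convolution blocks to a Transformer Encoder, as the abstract describes. This is not the authors' released code: all layer sizes, kernel sizes, the 60-dimensional frame features, and the mean-pooling head are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' implementation):
# 1-D conv blocks -> Transformer Encoder -> binary (bona fide / spoof) score.
import torch
import torch.nn as nn


class Conv1dTransformerClassifier(nn.Module):
    def __init__(self, feat_dim: int = 60, d_model: int = 128,
                 nhead: int = 4, num_layers: int = 2):
        super().__init__()
        # 1-D convolutions slide along the time axis only, treating each
        # frame-level feature vector as the channel dimension.
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, d_model, kernel_size=3, padding=1),
            nn.BatchNorm1d(d_model),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.BatchNorm1d(d_model),
            nn.ReLU(),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 2)  # bona fide vs. spoofed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) frame-level features from the front end
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, d_model)
        h = self.encoder(h)            # self-attention captures long-term dependencies
        return self.head(h.mean(dim=1))  # average-pool over time, then classify


if __name__ == "__main__":
    model = Conv1dTransformerClassifier()
    scores = model(torch.randn(8, 400, 60))  # 8 utterances, 400 frames, 60-dim features
    print(scores.shape)  # torch.Size([8, 2])
```

The design point illustrated here is that the convolutional front blocks operate only along the time axis, so the self-attention layers receive a sequence of frame embeddings and can relate frames that are far apart in the utterance.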