Neural Coupled Sequence Labeling for Heterogeneous Annotation Conversion

2022 
Supervised statistical models rely on large-scale, high-quality labeled data, which is essential for model training but expensive to construct. Therefore, instead of constructing new datasets, researchers have attempted to make full use of existing heterogeneous datasets to boost model performance, since it is common for the same task to have multiple datasets annotated under different and incompatible guidelines. Representative approaches include the guide-feature method, which uses knowledge projected from the source side to the target side as extra features to guide the target model, and the multi-task learning (MTL) method, which trains simultaneously on multiple heterogeneous annotations with shared parameters to exploit cross-resource knowledge. Though effective, the guide-feature method cannot directly use the source-side data as training data, and the MTL method ignores the implicit mappings between heterogeneous datasets. Compared with these methods, directly converting heterogeneous datasets into homogeneous ones for target model training is a more straightforward and effective way to fully exploit heterogeneous resources. In this work, we propose a neural coupled sequence labeling model for heterogeneous annotation conversion. First, for each token, we map a given one-side tag into a set of bundled tags by concatenating that tag with all possible tags on the other side. Then, we build a neural coupled model over the bundled tag space. Finally, we convert heterogeneous annotations into homogeneous annotations by performing constrained decoding on the coupled model.
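The bundled-tag construction and the constraint used during decoding can be sketched as follows. This is an illustrative toy example, not the authors' code: the tag inventories and the `&` separator are invented for demonstration.

```python
# Sketch of the bundled tag space for two heterogeneous tag sets, and the
# decoding constraint for a token annotated only on the source side.
# Tag inventories below are toy assumptions, not the paper's actual tag sets.
from itertools import product

src_tags = ["NN", "VV"]        # assumed source-side tag set
tgt_tags = ["n", "v", "a"]     # assumed target-side tag set

# Bundled tag space: the cross product of the two single-side tag sets.
bundled = [f"{s}&{t}" for s, t in product(src_tags, tgt_tags)]

def constrained_tags(observed_src_tag):
    """For a token whose gold annotation gives only a source-side tag,
    restrict decoding to bundled tags whose source half matches it."""
    return [b for b in bundled if b.split("&")[0] == observed_src_tag]

print(constrained_tags("NN"))  # ['NN&n', 'NN&v', 'NN&a']
```

Constrained decoding over the coupled model then searches only among these compatible bundled tags at each position, so the predicted other-side halves yield the converted homogeneous annotation.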
We also propose a pruning strategy to address the oversized bundled tag space, which improves efficiency without hurting model performance. Experiments on part-of-speech (POS) tagging, word segmentation (WS), and joint WS&POS tagging show that our neural coupled model consistently outperforms several benchmark models on all three tasks by a large margin.
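The abstract does not detail the pruning strategy, but one plausible form is to keep, for each token, only bundled tags whose component single-side tags score above a threshold under separate baseline taggers. The sketch below is a hypothetical illustration under that assumption; the function name, threshold, and scores are all invented.

```python
# Hypothetical pruning sketch: shrink a token's bundled tag candidates by
# filtering each side with marginal probabilities from single-side taggers.
# This is an assumed scheme for illustration, not the paper's exact method.
def prune_bundled(src_scores, tgt_scores, threshold=0.05):
    """src_scores / tgt_scores: dicts mapping one token's candidate tags
    on each side to (marginal) probabilities.  Returns the pruned list
    of bundled tags for that token."""
    keep_src = {t for t, p in src_scores.items() if p >= threshold}
    keep_tgt = {t for t, p in tgt_scores.items() if p >= threshold}
    return [f"{s}&{t}" for s in keep_src for t in keep_tgt]

# "VV" falls below the threshold, so its three bundled tags are dropped.
pruned = prune_bundled({"NN": 0.9, "VV": 0.01},
                       {"n": 0.7, "v": 0.25, "a": 0.05})
```

Because the coupled model scores only the surviving bundled tags, the per-token tag space shrinks from the full cross product to a small candidate set, which is where the efficiency gain would come from.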