Separation Inference: A Unified Framework for Word Segmentation in East Asian Languages

2022 
Existing methods treat Word Segmentation (WS) as sequence tagging, where each tag indicates the position of the current character within a segment. The exact position of a non-boundary character is unnecessary, yet any incorrect inner prediction reduces model performance. The position information also restricts tag-to-tag transitions, so extra context information and a Conditional Random Field (CRF) layer are needed to suppress unreasonable tag transitions. To steer away from this implicit restriction, we propose the Separation (Sp)-Adhesion (Ad) scheme, which targets the essential character-to-character connections and thus tackles the WS task directly. Only a bigram feature specially tailored for “Sp-Ad” is required; it serves as the processing unit for identifying the connection state of every two adjacent characters. Eliminating the position restriction makes the model independent of the CRF layer, which is widely adopted to revise unreasonable tags, so the CRF can be replaced with a classification network. We construct the Separation Inference (SpIn) framework from the bigram features and a softmax classification network to tackle the WS task. SpIn significantly reduces inference complexity, dispenses with extra context information, and boosts the accuracy of the WS task. Besides its effectiveness in Chinese Word Segmentation, performance gains on Japanese and Korean Word Segmentation further show that SpIn is universal for East Asian languages. Moreover, our extensive experiments verify the cross-domain effectiveness of SpIn, which attains state-of-the-art performance on benchmark tests of in-domain and cross-domain Chinese Word Segmentation.
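To make the labeling scheme concrete, here is a minimal sketch of how a segmented sentence maps to Sp/Ad labels on adjacent character pairs and back again. It illustrates only the label encoding described in the abstract, not the SpIn model itself: the bigram feature extraction and the softmax classifier are omitted, and the function names `words_to_spad` and `spad_to_words` are hypothetical, chosen for this example.

```python
from typing import List


def words_to_spad(words: List[str]) -> List[str]:
    """Encode a segmentation as one label per adjacent character pair.

    'Ad' (adhesion) means the two characters belong to the same word;
    'Sp' (separation) means a word boundary lies between them.
    Assumes every word is non-empty.
    """
    labels: List[str] = []
    for index, word in enumerate(words):
        # Characters inside one word adhere to each other.
        labels.extend(["Ad"] * (len(word) - 1))
        # A boundary follows every word except the last.
        if index < len(words) - 1:
            labels.append("Sp")
    return labels


def spad_to_words(text: str, labels: List[str]) -> List[str]:
    """Decode per-pair Sp/Ad labels back into a list of words."""
    if len(labels) != max(len(text) - 1, 0):
        raise ValueError("need exactly one label per adjacent character pair")
    if not text:
        return []
    words: List[str] = []
    current = text[0]
    for char, label in zip(text[1:], labels):
        if label == "Sp":
            # Boundary: close the current word and start a new one.
            words.append(current)
            current = char
        else:
            # Adhesion: the character continues the current word.
            current += char
    words.append(current)
    return words


# Hypothetical example sentence, pre-segmented into three words.
segmented = ["我", "喜欢", "自然语言"]
labels = words_to_spad(segmented)
restored = spad_to_words("".join(segmented), labels)
```

A sentence of n characters yields n - 1 labels, and each label can be decoded on its own. Under a position-based tagging scheme, by contrast, some tag sequences are invalid, which is the restriction the abstract says a CRF layer is used to enforce.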