Employing Auto-Annotated Data for Government Document Classification

2019 
In China, government documents are documents with legal effect and a standard form, produced in the course of government administration. With the continuing development of e-government in China, government databases have grown enormously. To exploit the full potential of these databases, many applications based on natural language processing (NLP) have been developed. Classification is a fundamental task for many NLP applications, such as automatic document archiving, intelligent search, and personalized recommendation. At present, the government document classification method used in China, which is based on issuing departments, has very low accuracy. Traditional text classifiers based on machine learning or deep learning models rely heavily on human-labeled training data. Because there are no open data sets of government documents, we propose a method for automatically constructing a large-scale annotated data set for government document classification based on information retrieval. Experimental results show that a supervised classification model trained on our automatically constructed data set outperforms the baseline method by 15% in F1-score.
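The abstract does not say which retrieval method is used, so the following is only a minimal sketch of the general idea of labeling documents by retrieval. It assumes each category is described by a short keyword query, scores documents against each query with a simple TF-IDF sum, and assigns the best-scoring category. The function name `auto_annotate`, the example queries, and the whitespace tokenizer are all hypothetical. Real Chinese text would need a word segmenter instead.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization; Chinese text needs a segmenter.
    return text.lower().split()


def auto_annotate(documents, category_queries, min_score=0.0):
    """Assign each document the category whose keyword query scores highest.

    Documents whose best score does not exceed min_score stay unlabeled
    and are left out of the returned list.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency of each term across the collection.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    def idf(term):
        # Smoothed inverse document frequency.
        return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0

    labeled = []
    for doc, tokens in zip(documents, tokenized):
        term_freq = Counter(tokens)
        length = len(tokens) or 1
        best_label, best_score = None, min_score
        for label, query in category_queries.items():
            score = sum(term_freq[t] / length * idf(t) for t in tokenize(query))
            if score > best_score:
                best_label, best_score = label, score
        if best_label is not None:
            labeled.append((doc, best_label))
    return labeled


# Hypothetical toy data, not taken from the paper.
documents = [
    "notice on rural school funding and teacher training",
    "regulation on highway construction and road safety",
    "annual holiday schedule",
]
category_queries = {
    "education": "school teacher education",
    "transport": "highway road transport",
}
labeled = auto_annotate(documents, category_queries)
```

The resulting `(document, label)` pairs could then serve as training data for any supervised classifier. Documents that match no query are dropped rather than given a guessed label, which trades coverage for cleaner annotations.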