Chinese News Data Extraction System Based on Readability Algorithm

2020 
In this era of data explosion, the volume of Chinese news articles has grown exponentially. An efficient way to collect this data is needed, both to support the data analysis industry and to supply training data for domain-specific artificial intelligence models. The purpose of this article is to build an efficient data extraction system for collecting Chinese news data. We first introduce the development of the field of web data collection and review earlier research routes and ideas. We then choose the Readability algorithm to extract the text data, improve some of its rules, and add a consideration of text sparsity to make it better suited to Chinese news data. The system is built on the Scrapy framework to facilitate large-scale crawling. Experimental comparison with the basic Readability algorithm shows that the improved crawling system extracts Chinese news data more accurately and efficiently.
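The abstract does not give the improved rules themselves, so the following is only a minimal sketch of the general idea of Readability-style scoring combined with a text-density term. Every name here (`DensityParser`, `score`, `extract_main_text`), the choice of candidate tags, the punctuation set, and the scoring formula are assumptions made for illustration, not the authors' method. The sketch uses only the Python standard library and does not depend on Scrapy.

```python
from html.parser import HTMLParser

# All constants below are illustrative assumptions, not taken from the paper.
CONTAINERS = {"div", "article", "section", "td"}   # candidate content blocks
VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag
SKIP = {"script", "style"}                          # non-visible text
PUNCT = "，。、；！？"                                 # common Chinese punctuation


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.text_len = 0    # visible characters in the subtree
        self.link_len = 0    # visible characters inside <a> elements
        self.tag_count = 1   # elements in the subtree, including this one
        self.punct = 0       # Chinese punctuation marks in the subtree
        self.chunks = []     # text fragments in document order


class DensityParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.nodes = []
        self.in_link = 0
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if tag in SKIP:
            self.skip += 1
        if tag == "a":
            self.in_link += 1
        # Every ancestor's subtree gains one element.
        ancestor = self.current
        while ancestor is not None:
            ancestor.tag_count += 1
            ancestor = ancestor.parent
        node = Node(tag, self.current)
        self.nodes.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag in SKIP:
            self.skip = max(0, self.skip - 1)
        if tag == "a":
            self.in_link = max(0, self.in_link - 1)
        # Mismatched end tags are ignored in this sketch.
        if self.current.tag == tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip:
            return
        marks = sum(text.count(c) for c in PUNCT)
        node = self.current
        while node is not None:  # credit the text to every ancestor
            node.text_len += len(text)
            if self.in_link:
                node.link_len += len(text)
            node.punct += marks
            node.chunks.append(text)
            node = node.parent


def score(node):
    """Assumed score: text density, discounted by link density, boosted by punctuation."""
    if node.text_len == 0:
        return 0.0
    link_density = node.link_len / node.text_len
    text_density = node.text_len / node.tag_count
    return text_density * (1.0 - link_density) * (1 + node.punct)


def extract_main_text(html):
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    candidates = [n for n in parser.nodes if n.tag in CONTAINERS]
    if not candidates:
        return ""
    return "\n".join(max(candidates, key=score).chunks)
```

In a Scrapy spider, a function of this kind could be called on `response.text` inside the parse callback; blocks made up mostly of links (navigation bars, footers) score near zero, while punctuation-rich paragraphs raise the score of the block that contains them.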