A General Web Page Extraction Method Aiming at Online Social Networks

2020 
Online social networks such as Twitter and Facebook have recently become some of the most important communication platforms, and they contain large amounts of user data that are useful to many research fields. Extracting information from online social network pages is a fundamental technique for studying these networks. In this paper, we propose a general web page extraction method based on the structural features of online social network pages. The method selects useful information in a web page by class tag frequency and then obtains the exact content through a few simple string manipulation steps. Furthermore, we propose an algorithm that automatically generates the specific regular expressions needed, so that the content extraction method can be applied to many similarly structured online social networks without rewriting the entire extraction process for each web site. The experimental results show that our approach adapts well and offers significant advantages in both precision and recall.
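The abstract does not give the algorithm's details, so the following is only a minimal Python sketch of the general idea it describes: pick the most frequent `class` value on a page, generate a regular expression for elements carrying that class, and clean the matches with simple string operations. The function names (`most_frequent_class`, `build_pattern`, `extract`) and the sample markup are hypothetical illustrations, not the method evaluated in the paper, and the regular expression assumes double-quoted attributes and no nesting of the same tag.

```python
import re
from collections import Counter


def most_frequent_class(html):
    """Return the class attribute value that occurs most often, or None."""
    # Assumption: class attributes are written with double quotes.
    classes = re.findall(r'class="([^"]+)"', html)
    if not classes:
        return None
    return Counter(classes).most_common(1)[0][0]


def build_pattern(class_name):
    """Generate a regular expression for elements that carry class_name."""
    # The back-reference \1 pairs the closing tag with the opening tag name.
    return re.compile(
        r'<(\w+)[^>]*\bclass="' + re.escape(class_name) + r'"[^>]*>(.*?)</\1>',
        re.DOTALL,
    )


def extract(html):
    """Extract the cleaned text of every element with the dominant class."""
    class_name = most_frequent_class(html)
    if class_name is None:
        return []
    results = []
    for _tag, inner in build_pattern(class_name).findall(html):
        text = re.sub(r"<[^>]+>", "", inner)   # drop nested tags
        results.append(" ".join(text.split()))  # normalize whitespace
    return results


# Hypothetical page fragment: repeated "post" elements dominate the page.
SAMPLE = """
<div class="header">My feed</div>
<p class="post">Hello <b>world</b></p>
<p class="post">Second   post</p>
<p class="post">Third post</p>
"""

if __name__ == "__main__":
    print(extract(SAMPLE))
```

Because the pattern is generated from the class name rather than written by hand, the same three functions could in principle be pointed at another page with a similar repeated structure. A production extractor would more likely rely on an HTML parser than on regular expressions alone.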