Research on complex structure-oriented accurate web information extraction rules

2010 
With the rapid growth of web information, there is an increasing need to acquire accurate information easily and efficiently from the massive and heterogeneous web. Web information extraction is a research area that addresses this need. In this paper, we analyze the shortcomings of related research and systems and find that few systems can extract accurate web information with complex structures without placing too heavy a burden on users. To overcome this pitfall, we study and propose a comprehensive model and framework that combines automatic web data analysis and extraction with semi-supervised web data extraction based on user interaction. The new model and framework strikes a good trade-off between the automatic generation of extraction rules and their expressive capability for accurate information extraction. Based on this, we further present a multi-functional data extraction rule system that uses a variety of structural and textual extraction rules with different functions to achieve powerful expressive capability. Furthermore, to offer a powerful expression mechanism for data extraction, we describe a well-designed, XML-based data extraction language that works well for rule generation based on both automatic web structure analysis and user interaction.