A Method of Eliminating Noises in Web Pages by Style Tree Model and Its Applications

2004 
A Web page typically contains many information blocks. Apart from the main content blocks, it usually has blocks such as navigation panels, copyright and privacy notices, and advertisements. We call these noisy blocks. The noise in Web pages can seriously harm Web data mining. To eliminate this noise, we introduce a new tree structure, called the Style Tree, and present an algorithm for constructing a site style tree. The Style Tree Model is then used to detect and eliminate noise in any Web page of the site. An information-based measure for determining which element nodes are noisy is also defined. In addition, the applications of this method are discussed in detail. Experimental results show that our noise elimination technique improves the mining results significantly.
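The abstract does not spell out the tree structure or the noise measure, so the following is only a minimal sketch under stated assumptions: pages are given as nested `(tag, content)` tuples, a site style tree merges pages that share the same sequence of child tags, and an element's importance is taken to be the entropy of its presentation styles, normalized by the number of pages. The names `ElementNode`, `StyleNode`, `add_page`, and `importance` are hypothetical and are not taken from the paper.

```python
import math


class ElementNode:
    """A tag in the site style tree, with every style seen beneath it."""

    def __init__(self, tag):
        self.tag = tag
        self.pages = 0    # number of pages that contain this element
        self.styles = {}  # style key -> StyleNode


class StyleNode:
    """One way of presenting an element: a fixed sequence of child tags."""

    def __init__(self, child_tags):
        self.count = 0    # number of pages that use this style
        self.elements = [ElementNode(t) for t in child_tags]


def add_page(element, dom):
    """Merge one page's DOM, given as (tag, children_or_text), into the tree."""
    _tag, content = dom
    element.pages += 1
    if isinstance(content, str):
        # Assumption: a leaf is characterized by its text content.
        key, child_tags = content, ()
    else:
        key = child_tags = tuple(child[0] for child in content)
    style = element.styles.get(key)
    if style is None:
        style = element.styles[key] = StyleNode(child_tags)
    style.count += 1
    if not isinstance(content, str):
        for child_element, child_dom in zip(style.elements, content):
            add_page(child_element, child_dom)


def importance(element):
    """Assumed measure: entropy of the style distribution, log base = page count.

    Values near 0 mean the element looks the same on every page (likely noise);
    values near 1 mean it differs from page to page (likely main content).
    """
    m = element.pages
    if m <= 1:
        return 1.0
    return -sum((s.count / m) * math.log(s.count / m, m)
                for s in element.styles.values())


# Three pages sharing a navigation bar but carrying different articles.
root = ElementNode("body")
for i in range(3):
    add_page(root, ("body", [("div", "Home | About"), ("p", f"Article {i}")]))

nav, article = root.styles[("div", "p")].elements
print(importance(nav), importance(article))
```

In this toy site the navigation `div` is identical on every page and scores close to zero, while the article `p` differs on every page and scores close to one. A noise filter built on this sketch would drop elements whose score falls below a chosen threshold; the threshold, like the measure itself, is an assumption here.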