Web data similarity detection method based on two-stage filtration of structure and content

2014 
The invention discloses a Web data similarity detection method based on two-stage filtration of structure and content. Building on traditional general-purpose similarity detection methods, it exploits the distribution characteristics of the structure and content of Web data and subjects the document set under detection to two stages of filtration. The first stage is structure similarity filtration: each Web document is modeled as a Tag tree so that structurally dissimilar documents can be removed; key content is then extracted from the remaining documents and expressed as tuple vectors, and the key information is concatenated to produce a set of character strings. The second stage builds a Trie tree over the character strings produced by the first stage, and similar strings are joined to obtain the final result. According to the inventors' experiments, the method significantly improves the efficiency of data similarity detection in the Web domain.
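The two stages described above can be illustrated with a minimal Python sketch. It is not the patented procedure. The abstract does not specify the structural similarity measure, the threshold, the key-content extraction rule, or how similar strings are found in the Trie. The sketch therefore assumes: a Tag tree approximated by its set of root-to-node tag paths, Jaccard similarity with a hypothetical threshold of 0.8, all text nodes treated as key content and joined with `|`, and strings grouped when they share a fixed-length prefix in the Trie. All names (`TagPathCollector`, `structure_filter`, `group_by_prefix`) are illustrative.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag


class TagPathCollector(HTMLParser):
    """Collects root-to-node tag paths and text nodes (a flat stand-in for a Tag tree)."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, root first
        self.paths = set() # structural signature of the document
        self.texts = []    # text nodes, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            self.paths.add("/".join(self.stack + [tag]))
            return
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating sloppy markup.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def parse(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths, tuple(collector.texts)


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def structure_filter(docs, reference, threshold=0.8):
    """Stage 1 (assumed form): keep documents structurally close to a reference page
    and return their key content as concatenated strings."""
    ref_paths, _ = parse(reference)
    kept = []
    for html in docs:
        paths, texts = parse(html)
        if jaccard(paths, ref_paths) >= threshold:
            kept.append("|".join(texts))  # tuple of fields -> one string
    return kept


class TrieNode:
    def __init__(self):
        self.children = {}
        self.ids = []  # indices of strings whose (truncated) path ends here


def group_by_prefix(strings, prefix_len):
    """Stage 2 (assumed form): insert each string's first prefix_len characters
    into a Trie and group the strings that end at the same node."""
    root = TrieNode()
    for i, s in enumerate(strings):
        node = root
        for ch in s[:prefix_len]:
            node = node.children.setdefault(ch, TrieNode())
        node.ids.append(i)
    groups, pending = [], [root]
    while pending:
        node = pending.pop()
        if len(node.ids) > 1:
            groups.append(sorted(node.ids))
        pending.extend(node.children.values())
    return sorted(groups)
```

In this sketch the cheap structural comparison runs first, so the Trie is built only over strings from documents that survived stage 1, which matches the efficiency argument in the abstract. A real system would likely replace the shared-prefix rule with an edit-distance or token-based comparison during Trie traversal.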