Approximate web document detection method based on content and position features

Shijun Li, Yueting Wu, Jian Zhang, Wei Yu, Yuxuan Li

2016

The invention provides an approximate web document detection method based on content and position features. Noise information in a page is eliminated before the page features are calculated, which effectively reduces the influence of noisy page content on the detection process. Building on a selective analysis of the page text and on key concepts, the method compares the distribution characteristics and position features of the text in each page, which increases the precision of approximate page detection. To make full use of the strengths of indexing mechanisms and retrieval systems on massive data, an inverted index serves as the storage and access medium for each page's keyword term vectors and position feature vectors, which improves the execution efficiency and feasibility of the method. Because approximate page judgement is based on page content and position feature vectors, dependence on a related corpus and a concept semantic network is greatly reduced. This enhances the applicability of the method and widens the range of web documents in which approximate pages can be detected.
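The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the patented procedure: the abstract does not specify how noise is identified, how keywords are chosen, or how the two feature types are scored and combined. The noise vocabulary `NOISE_TERMS`, the relative-position feature, the cosine score, and the weight `alpha` are all assumptions made for illustration.

```python
import math
import re
from collections import defaultdict

# Hypothetical noise vocabulary; the abstract does not say how noise is detected.
NOISE_TERMS = {"home", "login", "share", "copyright", "advertisement"}


def tokenize(text):
    """Return lowercase word tokens with the assumed noise terms dropped."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in NOISE_TERMS]


def build_index(pages):
    """Build an inverted index: term -> {doc_id: [relative positions in 0..1]}."""
    index = defaultdict(dict)
    for doc_id, text in pages.items():
        tokens = tokenize(text)
        span = max(len(tokens) - 1, 1)  # avoid division by zero on one-token pages
        for i, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(i / span)
    return index


def similarity(index, a, b, alpha=0.5):
    """Blend a content score and a position score for documents a and b.

    Content score: cosine similarity of term-frequency vectors.
    Position score: 1 minus the mean gap between the average relative
    positions of shared terms. Both choices are assumptions of this sketch.
    """
    dot = norm_a = norm_b = 0.0
    pos_gap, shared = 0.0, 0
    for postings in index.values():
        pa, pb = postings.get(a, []), postings.get(b, [])
        dot += len(pa) * len(pb)
        norm_a += len(pa) ** 2
        norm_b += len(pb) ** 2
        if pa and pb:
            pos_gap += abs(sum(pa) / len(pa) - sum(pb) / len(pb))
            shared += 1
    if not norm_a or not norm_b:
        return 0.0
    content = dot / math.sqrt(norm_a * norm_b)
    position = 1.0 - pos_gap / shared if shared else 0.0
    return alpha * content + (1 - alpha) * position
```

Two pages would then be treated as approximate copies when `similarity` exceeds a chosen threshold. The threshold is another free parameter that the abstract does not give.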
