A Novel Approach for Detecting Near-Duplicate Web Documents by Considering Images, Text, Size of the Document and Domain

M. Bhavani, V. A. Narayana, Gaddameedi Sreevani

2021

Web mining is a branch of data mining that deals with the enormous amount of data on the web. Search engines face serious problems because near-duplicate documents on the web lead to irrelevant answers, and their performance and reliability are critically affected as a result. Two attempts at detecting near-duplicate web documents are found in the literature. The former considered the domain and size of the document, and the latter considered text and images as the search parameters. This article proposes a novel approach that combines the text, images, size and domain of the document to detect near-duplicate documents. The approach extracts the keywords and images of the crawled document and compares them with those of the existing documents to measure similarity. If the similarity score measure is less than 19.5 and the image comparison value is greater than 70%, the crawled document is detected as a near duplicate.
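
The final decision rule stated in the abstract can be sketched as follows. This is only a minimal illustration, assuming the keyword-based similarity score and the image comparison percentage have already been computed by some other means. The abstract does not give the formulas for either value, nor does it say how domain and size enter the comparison, so those parts are left out. The names `Comparison`, `is_near_duplicate`, `SSM_THRESHOLD` and `IMAGE_THRESHOLD` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

# Thresholds quoted in the abstract; the constant names are illustrative.
SSM_THRESHOLD = 19.5     # similarity score measure must be strictly below this
IMAGE_THRESHOLD = 70.0   # image comparison value (percent) must be strictly above this


@dataclass
class Comparison:
    """Scores for one crawled document compared against one existing document.

    Both fields are assumed to be precomputed; how they are derived is not
    specified in the abstract.
    """
    ssm: float          # keyword-based similarity score measure
    image_match: float  # image comparison value, in percent (0-100)


def is_near_duplicate(c: Comparison) -> bool:
    """Apply the two-threshold rule: low similarity score AND high image match."""
    return c.ssm < SSM_THRESHOLD and c.image_match > IMAGE_THRESHOLD


if __name__ == "__main__":
    # Made-up scores, used only to exercise the rule.
    print(is_near_duplicate(Comparison(ssm=10.0, image_match=85.0)))
    print(is_near_duplicate(Comparison(ssm=25.0, image_match=90.0)))
```

Both conditions must hold at once, so a document whose text is close to an existing one but whose images differ would not be flagged under this reading of the rule.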
