A Novel Approach for Detecting Near-Duplicate Web Documents by Considering Images, Text, Size of the Document and Domain

2021 
Web mining is a part of data mining in which the web consists of enormous amount of data. The search engines faces large amount of problems due to the presence of Near duplicate documents in web which leads to irrelevant answers. The performance and reliability of search engines are critically affecting since the near duplicate documents present in web. For detection of near duplicate web documents two attempts are found in the literature. The former considered domain and size of the document and the later considered text and image as the search parameters. This article proposes a novel approach combining the parameters such as text, image, size and domain of the document to detect near duplicate documents. The approach extracts the keywords and images of the crawled document and compares them with the existing documents for similarity measure. If the similarity score measure value is less than 19.5 and image comparison value is greater than 70%, then it is detected as near duplicate document.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    29
    References
    0
    Citations
    NaN
    KQI
    []