Challenges and Solutions of Identifying Similarities and Duplication in Digital Libraries

2020 
Large-scale digital libraries such as the HathiTrust, which provides access to over 17 million texts, contain duplicate texts and metadata inconsistencies that impede information access and retrieval. At this scale, manually evaluating each text is untenable. The SaDDL (Similarities and Duplication in Digital Libraries) project has been developing content-based methods for identifying relationships between similar texts within a digital library. First, this framework allows us to quantify words, themes, and concepts in order to identify similarity at a scale unattainable by human effort. Second, it allows us to identify the most representative scan of a target work. This poster presents a way of reconstructing same-work relationships directly from content comparison, rather than from superficially matched metadata, in order to obtain the most representative copy, defined as the most complete, correct, and cleanest expression of the work. The philosophy behind this approach is to study the collection with an emphasis on the content itself, which is closer to the essence of the text.
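As a rough illustration of what content-based comparison can look like, the sketch below compares scans by their word counts instead of their metadata. It is not the SaDDL implementation: the function names (`word_counts`, `cosine`, `most_central`) are hypothetical, cosine similarity over raw word frequencies is an assumed stand-in for the project's features, and picking the scan most similar on average to the others is only a simple proxy for the "most complete, correct, cleanest" copy described above.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercased word-frequency vector for one scan's text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty scan is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_central(scans):
    """Return the id of the scan with the highest mean similarity to the others.

    Assumption: centrality among copies of one work is used here as a
    stand-in for "most representative"; the source does not specify this rule.
    """
    vectors = {scan_id: word_counts(text) for scan_id, text in scans.items()}

    def mean_similarity(scan_id):
        others = [v for other_id, v in vectors.items() if other_id != scan_id]
        if not others:
            return 0.0
        return sum(cosine(vectors[scan_id], v) for v in others) / len(others)

    return max(vectors, key=mean_similarity)
```

Calling `most_central` on a dictionary that maps scan identifiers to their text returns one identifier. At the scale of millions of volumes, pairwise comparison like this would not be practical without first narrowing the candidates, which this sketch does not attempt.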