An Effective and Cost-Based Framework for a Qualitative Hybrid Data Deduplication

2019 
In the real world, an entity may occur several times in a database. These duplicates may have varying keys and/or contain errors, which makes deduplication a difficult task. Deduplication cannot be solved accurately using machine-based or crowdsourcing techniques alone. Crowdsourcing was used to address the shortcomings of machine-based approaches. Compared to machines, the crowd provided relatively accurate results, but was slow and very expensive. A hybrid technique for data deduplication using Euclidean distance and a chromatic correlation clustering algorithm was presented. The technique aimed at reducing the crowdsourcing cost, reducing the time the crowd spends on deduplication, and providing higher deduplication accuracy. In the experiments, the proposed algorithm was compared with some existing techniques and outperformed some of them, offering high deduplication accuracy and efficiency while incurring low crowdsourcing cost.
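
To make the hybrid idea concrete, the following is a minimal sketch of how a machine pass based on Euclidean distance might limit the number of record pairs sent to the crowd. It is not the paper's algorithm: the function name `triage_pairs`, the thresholds `low` and `high`, and the representation of records as numeric feature vectors are all assumptions made for illustration, and the chromatic correlation clustering step is not shown.

```python
import math
from itertools import combinations


def triage_pairs(records, low=0.5, high=2.0):
    """Split record pairs into machine-decided duplicates and crowd questions.

    records: list of equal-length numeric feature vectors (an assumption;
             real records would first need to be encoded as vectors).
    low, high: hypothetical distance thresholds, not taken from the paper.
    """
    machine_duplicates = []  # pairs close enough to accept automatically
    crowd_queue = []         # uncertain pairs that would be sent to the crowd
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        distance = math.dist(a, b)  # Euclidean distance (Python 3.8+)
        if distance <= low:
            machine_duplicates.append((i, j))
        elif distance <= high:
            crowd_queue.append((i, j))
        # pairs farther apart than `high` are treated as non-duplicates
    return machine_duplicates, crowd_queue


records = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (10.0, 10.0)]
dups, crowd = triage_pairs(records)
print("machine duplicates:", dups)
print("pairs for the crowd:", crowd)
```

Under these assumptions, only the pairs in the uncertain band reach the crowd, which is the general way a hybrid approach can lower crowdsourcing cost and waiting time while keeping human judgment for the hard cases.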