Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing
2019
Given a set of records, entity resolution algorithms find all the records referring to each entity. In top-k entity resolution, the goal is to find all the records referring to the k largest (in terms of number of records) entities. Top-k entity resolution is driven by many modern applications that operate over just the few most popular entities in a dataset. In this paper we introduce the problem of top-k entity resolution and we summarize a novel approach for this problem; full details are presented in a technical report. Our approach is based on locality-sensitive hashing, and can very rapidly and accurately process massive datasets. Our key insight is to adaptively decide how much processing each record requires to ascertain if it refers to a top-k entity or not: the less likely a record is to refer to a top-k entity, the less it is processed. The heavily reduced amount of processing for the vast majority of records that do not refer to top-k entities, leads to significant speedups. Our experiments with images, web articles, and scientific publications show a 2x to 25x speedup compared to traditional approaches for high-dimensional data.
Keywords:
- Correction
- Source
- Cite
- Save
- Machine Reading By IdeaReader
32
References
4
Citations
NaN
KQI