Schema-agnostic blocking for streaming data

Tiago Brasileiro Araújo, Kostas Stefanidis, Carlos Eduardo Santos Pires, Jyrki Nummenmaa, Thiago Pereira da Nóbrega

2020

Currently, a large number of information systems continuously produce large amounts of data. Since these sources may hold overlapping knowledge, Entity Resolution (ER) emerges as a fundamental step for integrating multiple knowledge bases or identifying similarities between entities. Because ER has quadratic cost, blocking techniques are often used to improve efficiency. Such techniques face two main challenges: data volume (i.e., large data sources) and variety (i.e., heterogeneous data). They also face two further challenges: streaming data and incremental processing. To address these four challenges simultaneously, we propose PI-Block, a novel incremental schema-agnostic blocking technique that uses parallelism (through a distributed computational infrastructure) to enhance blocking efficiency. In our experimental evaluation, we use four real-world data source pairs and show that PI-Block achieves better efficiency and effectiveness than the state-of-the-art technique.
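
To make the idea of incremental schema-agnostic blocking concrete, the following is a minimal single-machine sketch of plain token blocking over two sources. It is not PI-Block itself: the abstract does not specify the algorithm, and the sketch omits parallelism and any pruning. The names `tokens`, `IncrementalTokenBlocker`, and the source labels "A" and "B" are assumptions made for illustration only.

```python
import re
from collections import defaultdict


def tokens(entity):
    """Return schema-agnostic blocking keys: every lowercase token found in
    any attribute value. Attribute names are ignored on purpose."""
    out = set()
    for value in entity.values():
        out.update(re.findall(r"\w+", str(value).lower()))
    return out


class IncrementalTokenBlocker:
    """Hypothetical helper: keeps one block per token and updates the blocks
    as entities arrive one at a time from source "A" or source "B"."""

    def __init__(self):
        # token -> entity ids seen so far, kept separately per source
        self.blocks = defaultdict(lambda: {"A": set(), "B": set()})

    def add(self, source, entity_id, entity):
        """Index a newly arrived entity and return the ids from the other
        source that share at least one token with it."""
        other = "B" if source == "A" else "A"
        candidates = set()
        for tok in tokens(entity):
            block = self.blocks[tok]
            candidates.update(block[other])  # compare only across sources
            block[source].add(entity_id)
        return candidates


blocker = IncrementalTokenBlocker()
first = blocker.add("A", "a1", {"name": "John Smith", "city": "Tampere"})
second = blocker.add("B", "b1", {"fullName": "Smith, John", "birthplace": "Helsinki"})
third = blocker.add("B", "b2", {"title": "Campina Grande"})
```

In this sketch, each arriving entity is compared only against earlier entities from the other source that share a token with it, rather than against all of them, which is how blocking avoids the quadratic comparison cost. The two sources use different attribute names ("name" versus "fullName"), yet "a1" and "b1" still land in common blocks because only the values are tokenized.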
