A Hybrid Data Cleaning Framework Using Markov Logic Networks (Extended Abstract)
2021
With the growth of dirty data, data cleaning turns into a crux of data analysis. In this paper, we propose a novel hybrid data cleaning framework, termed as MLNClean, which is capable of learning instantiated rules to supplement the insufficient integrity constraints. MLNClean consists of two steps, i.e., pre-processing and two-stage data cleaning. In the pre-processing step, MLNClean first infers a set of probable instantiated rules according to Markov logic network (MLN) and then builds a two-layer MLN index to generate multiple data versions and facilitate the cleaning process. In the two-stage data cleaning step, MLNClean first presents a concept of reliability score to clean errors within each data version separately, and then, it eliminates the conflict values among different data versions using a novel concept of fusion score. Considerable experimental results on both real and synthetic scenarios demonstrate the effectiveness of MLNClean.
Keywords:
- Correction
- Source
- Cite
- Save
- Machine Reading By IdeaReader
4
References
0
Citations
NaN
KQI