A Hybrid Data Cleaning Framework Using Markov Logic Networks (Extended Abstract)

2021 
With the growth of dirty data, data cleaning has become a crux of data analysis. In this paper, we propose a novel hybrid data cleaning framework, termed MLNClean, which is capable of learning instantiated rules to supplement insufficient integrity constraints. MLNClean consists of two steps: pre-processing and two-stage data cleaning. In the pre-processing step, MLNClean first infers a set of probable instantiated rules using a Markov logic network (MLN), and then builds a two-layer MLN index to generate multiple data versions and facilitate the cleaning process. In the two-stage data cleaning step, MLNClean first introduces the concept of a reliability score to clean errors within each data version separately, and then eliminates conflicting values among different data versions using a novel concept of a fusion score. Extensive experimental results on both real and synthetic scenarios demonstrate the effectiveness of MLNClean.
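
The two-stage idea (clean each data version separately, then reconcile the versions) can be illustrated with a small toy sketch. This is not the MLNClean algorithm: the abstract does not define the reliability score or the fusion score, and no MLN weights are learned here. As loudly stated assumptions, the sketch uses the majority share within a group as a stand-in for the reliability score and the summed shares across versions as a stand-in for the fusion score. The records, the rules, and the function names `clean_version` and `fuse` are all hypothetical.

```python
from collections import Counter, defaultdict


def clean_version(records, reason, result):
    """Stage 1 (stand-in): build one data version for a single rule.

    Records sharing the same `reason` values form a group; every member is
    assigned the group's majority `result` value, scored by that value's
    share of the group. The share is an ASSUMED proxy, not the paper's
    reliability score.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[a] for a in reason)].append(rec)

    proposals = {}  # record id -> (proposed value, score)
    for members in groups.values():
        value, count = Counter(m[result] for m in members).most_common(1)[0]
        for m in members:
            proposals[m["id"]] = (value, count / len(members))
    return proposals


def fuse(records, rules):
    """Stage 2 (stand-in): reconcile the versions cell by cell.

    Scores proposed for the same cell are summed per value and the value
    with the highest total wins. The sum is an ASSUMED proxy, not the
    paper's fusion score.
    """
    cleaned = [dict(rec) for rec in records]  # leave the input untouched
    by_id = {rec["id"]: rec for rec in cleaned}
    votes = defaultdict(Counter)  # (record id, attribute) -> value -> score

    for reason, result in rules:
        for rid, (value, score) in clean_version(records, reason, result).items():
            votes[(rid, result)][value] += score

    for (rid, attr), scores in votes.items():
        by_id[rid][attr] = scores.most_common(1)[0][0]
    return cleaned


# Hypothetical toy data: record 3 carries a wrong state.
records = [
    {"id": 1, "zip": "10001", "city": "New York", "state": "NY"},
    {"id": 2, "zip": "10001", "city": "New York", "state": "NY"},
    {"id": 3, "zip": "10001", "city": "New York", "state": "NJ"},
    {"id": 4, "zip": "07302", "city": "Jersey City", "state": "NJ"},
    {"id": 5, "zip": "07302", "city": "Jersey City", "state": "NJ"},
]

# Hypothetical rules in the form (reason attributes, result attribute).
rules = [(("zip",), "state"), (("city",), "state")]

cleaned = fuse(records, rules)
print(cleaned[2]["state"])
```

Each rule yields its own version of the data, so a cell can receive different proposals from different rules; summing the scores per value is one simple way to settle such conflicts. A majority-based proxy like this fails when the erroneous value dominates its group.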