FAHES: A Robust Disguised Missing Values Detector

Authors:
Mourad Ouzzani, Qatar Computing Research Institute, HBKU
Nan Tang, Qatar Computing Research Institute, HBKU
Ahmed Elmagarmid, Qatar Computing Research Institute, HBKU
Raul Castro Fernandez, CSAIL MIT
Abdulhakim A. Qahtan, Qatar Computing Research Institute, HBKU

Introduction:

This paper addresses disguised missing values (DMVs). The authors present FAHES, a robust system that detects DMVs from two angles: as detectable outliers and as detectable inliers.

Abstract:

Missing values are common in real-world data and may seriously affect data analytics such as simple statistics and hypothesis testing. Generally speaking, there are two types of missing values: explicitly missing values (i.e., NULL values), and implicitly missing values, also known as disguised missing values (DMVs), such as “11111111” for a phone number and “Some college” for education. While detecting explicitly missing values is trivial, detecting DMVs is not; the essential challenge is the lack of standardization in how DMVs are generated. In this paper, we present FAHES, a robust system for detecting DMVs from two angles: DMVs as detectable outliers and as detectable inliers. For DMVs as outliers, we propose a syntactic outlier detection module for categorical data, and a density-based outlier detection module for numerical values. For DMVs as inliers, we propose a method that detects DMVs which follow either missing-completely-at-random or missing-at-random models. The robustness of FAHES is achieved through an ensemble technique that is inspired by outlier ensembles. Our extensive experiments using real-world data sets show that FAHES delivers better results than existing solutions.
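
To make the "DMVs as outliers" idea for numerical values more concrete, here is a minimal sketch in Python. It is not the FAHES algorithm; it is a simplified heuristic, assumed here for illustration only, that flags a value as a candidate DMV when it is repeated several times and sits far from every other distinct value in the column. The function name `candidate_numeric_dmvs` and the parameters `min_count` and `gap_factor` are hypothetical.

```python
from collections import Counter
from statistics import median


def candidate_numeric_dmvs(values, min_count=3, gap_factor=5.0):
    """Flag repeated values that are isolated from all other distinct values.

    Illustrative heuristic only (not the method from the paper):
    a value is a candidate when it occurs at least `min_count` times and
    its nearest distinct neighbour is at least `gap_factor` times the
    median gap between neighbouring distinct values.
    """
    counts = Counter(v for v in values if v is not None)
    distinct = sorted(counts)
    if len(distinct) < 3:
        return []  # too few distinct values to estimate a typical gap

    # Gaps between neighbouring distinct values, in sorted order.
    gaps = [b - a for a, b in zip(distinct, distinct[1:])]
    typical_gap = median(gaps)

    candidates = []
    for i, v in enumerate(distinct):
        left = gaps[i - 1] if i > 0 else float("inf")
        right = gaps[i] if i < len(gaps) else float("inf")
        nearest = min(left, right)
        if counts[v] >= min_count and nearest >= gap_factor * typical_gap:
            candidates.append(v)
    return candidates


# Hypothetical "age" column in which -1 stands in for a missing value.
ages = [23, 25, 31, 34, 36, 41, 44, 47, 52, 58, -1, -1, -1, -1]
print(candidate_numeric_dmvs(ages))  # [-1]
```

A heuristic like this only covers DMVs that fall outside the normal range of the data. Values that look like legitimate entries (the inlier case described in the abstract) need a different kind of test.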

You may want to know: