Outlier and Anomaly Detection Methods with Applications to the 2021 Census

Zoheir Sabeur,Gianluca Correndo,Galina Veres,Paul A. Smith,James Dawber

2021

The Office for National Statistics (ONS) contracted the University of Southampton to research the use of statistical and data science methods for the automatic detection of outliers and anomalies in Census data. The project considered both Census 2011, which was based mostly on traditional survey methods, and Census 2021, which was mostly conducted using online surveys. The ONS has since given us permission to publish the findings of this project. This information is licensed under the Open Government Licence v3.0. To view this licence, visit http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

The research project was carried out in close collaboration with ONS and ran in two phases:

- Phase 1: Literature review and selection of methods for detection of outliers and anomalies in Census 2021 data
- Phase 2: Prototype demonstrator of outlier and anomaly detection in Census 2021 data

This document is the final report of Phase 2. It covers the testing of the statistical and data science methods selected during Phase 1 for the detection of outliers and anomalies in Census 2021 data. These methods were investigated using synthetic data perturbations that simulate anomalies likely to occur in the census data. The data perturbation strategies were decided in consultation with experts at ONS.

The second phase of the project followed these procedures, in consultation with ONS:

- The experimental data for this study were acquired by accessing the "2011 Census Microdata LA data", which were made available by the UK Data Service.
- The experimental 2011 census data had already been processed and cleaned, so no anomalies were expected to be present. To assess our anomaly detection methods on the census data, anomalies therefore had to be added synthetically.
- Several discussions with ONS experts shaped our strategy for adding anomalies synthetically to the data. Data perturbations were designed to match real errors that occurred in the previous Census.
- A substantial number of the selected candidate methods (both statistical and data science-based) for the detection of outliers and anomalies were investigated using the newly perturbed 2011 Census data, and their respective performance in detecting census data anomalies was obtained.
- The Spark implementations of the selected outlier detection methods were benchmarked. This early testbed experiment revealed scalability trends over increasing volumes and complexities of census records.
- The various outlier detection scripts were integrated into the Jupyter environment as the first prototype demonstrator for ONS.
- Three major research programmes have been identified for future studies: Methods for Census Data Perturbation, Outlier and Anomaly Detection Methods and Machine Learning Strategies, and Methods Scalability using Spark Technology. Each topic is discussed in Section 6, together with recommendations for future work.
