A Distributed Method Based on Mondrian Algorithm for Big Data Anonymization

2019 
Many techniques exist for privacy preservation in data publishing, but most were designed for traditional, small databases and cannot handle big data. The leading approach to privacy preservation in data publishing is de-identification, and one of its best-known techniques is k-anonymity. The Mondrian algorithm is among the best and fastest algorithms for implementing k-anonymity. It has been made scalable in two ways: by partitioning the data among multiple workers and running Mondrian on each node, or by parallelizing the algorithm itself. This study examines and tests a solution that uses k-means clustering to partition the data and then distributes the resulting clusters among workers. The tests measure the improvement over the serial method in terms of de-identification accuracy, memory usage, and runtime. The results show that the serial method is completely incapable of handling big data, whereas the proposed solution resolves this problem.
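The pipeline described in the abstract (k-means partitioning followed by Mondrian on each cluster) can be illustrated with a minimal single-process sketch. This is not the authors' implementation: the function names (`kmeans`, `mondrian`, `generalize`, `anonymize`), the restriction to numeric quasi-identifiers, the strict median split, and the rule that folds clusters smaller than k into the largest cluster are all assumptions made for illustration. In a distributed setting, each cluster would be sent to a separate worker instead of being processed in a loop.

```python
import random

def kmeans(rows, n_clusters, iters=20, seed=0):
    """Plain Lloyd's k-means on numeric tuples; returns the non-empty clusters.
    Features are not rescaled here, so wide-ranged columns dominate the distance."""
    rng = random.Random(seed)
    centers = rng.sample(rows, n_clusters)
    for _ in range(iters):
        clusters = [[] for _ in range(n_clusters)]
        for r in rows:
            nearest = min(
                range(n_clusters),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centers[c])),
            )
            clusters[nearest].append(r)
        # Recompute each center as the column-wise mean; keep the old center if empty.
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return [cl for cl in clusters if cl]

def mondrian(rows, k):
    """Mondrian-style partitioning: split at the median of the widest dimension
    as long as both halves keep at least k rows (assumes len(rows) >= k)."""
    def width(d):
        return max(r[d] for r in rows) - min(r[d] for r in rows)
    for d in sorted(range(len(rows[0])), key=width, reverse=True):
        median = sorted(r[d] for r in rows)[len(rows) // 2]
        left = [r for r in rows if r[d] < median]
        right = [r for r in rows if r[d] >= median]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    return [rows]  # no allowable cut: this is one equivalence class

def generalize(group):
    """Replace every row in a group by the (min, max) range of each attribute."""
    ranges = tuple((min(col), max(col)) for col in zip(*group))
    return [ranges] * len(group)

def anonymize(rows, k, n_clusters):
    """Partition with k-means, then run Mondrian on each cluster independently."""
    clusters = sorted(kmeans(rows, n_clusters), key=len, reverse=True)
    # Assumption: clusters smaller than k are folded into the largest cluster,
    # otherwise they could not form a k-anonymous group on their own.
    while len(clusters) > 1 and len(clusters[-1]) < k:
        clusters[0].extend(clusters.pop())
    groups = []
    for cluster in clusters:  # each iteration could run on a separate worker
        groups.extend(mondrian(cluster, k))
    return groups
```

A call such as `anonymize(rows, k=5, n_clusters=4)` returns a list of equivalence classes, each holding at least 5 rows when the input has at least 5 rows. Passing each class to `generalize` yields the published ranges. Because Mondrian runs on one cluster at a time, the peak working set is bounded by the largest cluster and not by the full dataset, which is the intuition behind partitioning before anonymizing.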