InPrivate Digging: Enabling Tree-based Distributed Data Mining With Differential Privacy

Lingchen Zhao Wuhan University, P.R. China
Lihao Ni Wuhan University, P.R. China
Shengshan Hu Wuhan University, P.R. China
Yanjiao Chen State Key Lab of Software Engineering, Wuhan University, P.R. China
Pan Zhou Huazhong University of Science and Technology, P.R. China
Fu Xiao Nanjing University of Posts and Telecommunications, P.R. China
Libing Wu Wuhan University, P.R. China


Data mining has heralded a major breakthrough in data analysis, serving as a "super cruncher" that discovers hidden information and valuable knowledge in big data systems. In many applications, big data is collected from multiple parties who want to pool their private data sets and jointly train machine-learning models that yield more accurate predictions. However, data owners may be unwilling to disclose their data because of privacy concerns, so collaborative data mining over distributed data sets must provide privacy guarantees. In this paper, we focus on tree-based data mining. First, we design novel privacy-preserving schemes for the two most common tasks, regression and binary classification, in which individual data owners train locally in a differentially private manner. Then, for the first time, we design and implement a privacy-preserving system for gradient boosting decision trees (GBDT), in which regression trees trained by multiple data owners are securely aggregated into an ensemble. We conduct extensive experiments to evaluate the system on multiple real-world data sets. The results demonstrate that our system provides strong privacy protection for individual data owners while maintaining the prediction accuracy of the original trained model.
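
To make the notion of differentially private local training concrete, the following is a minimal sketch of one generic building block: releasing the mean label of a single regression-tree leaf with the Laplace mechanism. The abstract does not state which mechanism the schemes use, so this is an illustration of the general technique and not the authors' method. The function names (`laplace_noise`, `dp_leaf_mean`), the clipping bounds `lower` and `upper`, and the assumptions that the leaf size is public and that neighboring data sets differ by replacing one record are all hypothetical choices made for this example.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_leaf_mean(labels, lower, upper, epsilon, rng=None):
    """Return a noisy mean of the labels that fall into one leaf.

    Assumptions (not taken from the paper): labels are clipped to
    [lower, upper], the leaf size n is treated as public, and neighboring
    data sets differ by replacing one record. Under these assumptions the
    mean has sensitivity (upper - lower) / n.
    """
    if not labels:
        raise ValueError("leaf must contain at least one label")
    if epsilon <= 0 or upper <= lower:
        raise ValueError("need epsilon > 0 and upper > lower")
    rng = rng or random.Random()

    # Clip each label so that one record can shift the mean only by a bounded amount.
    clipped = [min(max(y, lower), upper) for y in labels]
    true_mean = sum(clipped) / len(clipped)

    # Laplace mechanism: noise scale = sensitivity / epsilon.
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)
```

A smaller `epsilon` gives stronger privacy but a noisier leaf value, which is the privacy and accuracy trade-off that any system of this kind has to manage. In a full tree, the privacy budget would also have to be divided among split selection and all released leaf values, which this sketch does not cover.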
