|Lingchen Zhao||Wuhan University, P.R. China|
|Lihao Ni||Wuhan University, P.R. China|
|Shengshan Hu||Wuhan University, P.R. China|
|Yanjiao Chen||State Key Lab of Software Engineering, Wuhan University, P.R. China|
|Pan Zhou||Huazhong University of Science and Technology, P.R. China|
|Fu Xiao||Nanjing University of Posts and Telecommunications, P.R. China|
|Libing Wu||Wuhan University, P.R. China|
Data mining has heralded a major breakthrough in data analysis, serving as a "super cruncher" that discovers hidden information and valuable knowledge in big data systems. In many applications, the collection of big data involves multiple parties who are interested in pooling their private data sets to jointly train machine-learning models that yield more accurate predictions. However, data owners may be unwilling to disclose their own data due to privacy concerns, making it imperative to provide privacy guarantees in collaborative data mining over distributed data sets. In this paper, we focus on tree-based data mining. To begin with, we design novel privacy-preserving schemes for the two most common tasks, regression and binary classification, in which individual data owners can perform training locally in a differentially private manner. Then, for the first time, we design and implement a privacy-preserving system for gradient boosting decision tree (GBDT), in which regression trees trained by multiple data owners can be securely aggregated into an ensemble. We conduct extensive experiments to evaluate the performance of our system on multiple real-world data sets. The results demonstrate that our system provides strong privacy protection for individual data owners while maintaining the prediction accuracy of the original trained model.
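The abstract does not spell out how local training is made differentially private. As a generic illustration only, and not the scheme proposed in this paper, the sketch below applies the standard Laplace mechanism to the prediction value of a single regression-tree leaf. The function names, the clipping bounds `lower` and `upper`, and the assumption that the leaf's sample count is public are all hypothetical choices made for this example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    # expovariate takes a rate, so rate = 1 / scale gives mean = scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_leaf_mean(labels, lower, upper, epsilon, rng):
    """Return a noisy mean of the labels that fall into one leaf.

    Assumptions (illustrative, not taken from the paper):
    - labels are clipped to [lower, upper] before averaging;
    - the number of samples in the leaf is treated as public;
    - neighbouring data sets differ by replacing one record.
    """
    if not labels:
        raise ValueError("leaf must contain at least one label")
    if epsilon <= 0 or upper <= lower:
        raise ValueError("need epsilon > 0 and upper > lower")

    clipped = [min(max(y, lower), upper) for y in labels]
    n = len(clipped)
    true_mean = sum(clipped) / n

    # Replacing one clipped label moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    noisy = true_mean + laplace_noise(sensitivity / epsilon, rng)

    # Clamping is post-processing and does not weaken the privacy guarantee.
    return min(max(noisy, lower), upper)


if __name__ == "__main__":
    rng = random.Random(42)
    leaf_labels = [1.0, 2.0, 3.0, 4.0]
    print(private_leaf_mean(leaf_labels, 0.0, 5.0, epsilon=1.0, rng=rng))
```

A smaller `epsilon` gives stronger privacy at the cost of a noisier leaf value; in a full tree, the privacy budget would also have to be divided among split selection and the leaves, which this sketch does not address.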