Projected Clustering for Huge Data Sets in MapReduce

2014 
Fast-growing data sets with a very high number of attributes have become common in social, industrial, and scientific domains. A meaningful analysis of these data sets requires sophisticated data mining techniques, such as projected clustering, that are able to deal with such complex data. In this work, we investigate solutions for extending the state-of-the-art projected clustering algorithm P3C to large data sets in high-dimensional spaces. We show that the original model of the P3C algorithm is not suitable for huge data sets. Therefore, we propose the necessary changes to the underlying clustering model and then present an efficient MapReduce-based implementation of our novel P3C+-MR algorithm. The effectiveness of the proposed changes on large data sets and the efficiency of the P3C+-MR algorithm are comprehensively evaluated on synthetic and real-world data sets. Additionally, we propose the P3C+-MR-Light algorithm, a simplified version of P3C+-MR that shows extraordinarily good results in terms of runtime and result quality on large data sets. Finally, we compare our solutions to existing approaches.