Learning from Large Distributed Data: A Scaling Down Sampling Scheme for Efficient Data Processing

2014 
Extracting information from a training data set for predictive inference is a fundamental task in data mining and machine learning. With the exponential growth in the amount of data being generated in recent years, there is an urgent need to develop, or adapt, existing learning algorithms to learn efficiently from large data sets. This paper describes three scaling techniques that enable machine learning algorithms to learn from large distributed data sets. First, a general single-pass formula for computing the covariance matrix of large data sets using the MapReduce framework is derived. Second, two new efficient and accurate sampling schemes for scaling down large data sets for local processing are presented. The first sampling scheme uses the single-pass covariance formula to select the most informative data points based on uncertainties in the linear discriminant score. The second scheme selects informative points based on uncertainties in a logistic regression model. A series of numerical experiments demonstrates that the formula is numerically stable and that the sampling schemes are fast, accurate, and cost-effective.
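The single-pass MapReduce idea can be illustrated with the standard sufficient-statistic decomposition: each mapper emits its partition's count, feature sum, and sum of outer products, and the reducer adds them and assembles the covariance. This is a minimal sketch of that textbook approach, not the paper's exact derivation; the function names and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def map_partition(X):
    # Map step: per-partition sufficient statistics.
    # count, feature sums, and sum of outer products X^T X.
    return X.shape[0], X.sum(axis=0), X.T @ X

def reduce_stats(stats):
    # Reduce step: sufficient statistics combine by simple addition,
    # which is what makes a single distributed pass possible.
    n = sum(t[0] for t in stats)
    s = sum(t[1] for t in stats)
    S = sum(t[2] for t in stats)
    mu = s / n
    # Population covariance: E[x x^T] - mu mu^T.
    return S / n - np.outer(mu, mu)

# Usage: split a data set into "distributed" chunks and combine.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
parts = np.array_split(X, 4)
cov = reduce_stats([map_partition(p) for p in parts])
```

The combined result matches the covariance computed on the full data in one place (up to the 1/n versus 1/(n-1) normalization convention), since the partition statistics lose no information.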