Feature-Selected and -Preserved Sampling for High-Dimensional Stream Data Summary

2019 
With the growth of the Mobile Internet, a large amount of stream data has emerged. Stream data cannot be stored completely in memory because of its massive volume and continuous arrival. Moreover, it should be accessed only once and handled promptly, because multiple accesses are costly. The intrinsic nature of stream data therefore calls for a summary maintained in main memory, which enables fast incremental learning and allows processing within limited time and memory. Sampling is one of the most commonly used methods for constructing data stream summaries. Because traditional random sampling can deviate from the real data distribution and does not consider the true distribution of the stream data attributes, we propose a novel feature-selected and feature-preserved sampling algorithm. We first use matrix approximation to select important features in the stream data. Then, a feature-preserved sampling algorithm generates high-quality representative samples over a sliding window. The sampling quality of our algorithm can guarantee a high degree of consistency between the distribution of attribute values in the population (the entire data) and that in the sample. Experiments on real datasets show that the proposed algorithm can select a representative sample with high efficiency.
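The idea of keeping the sample's attribute distribution consistent with the population's over a sliding window can be illustrated with a minimal sketch. This is not the paper's algorithm: it swaps in plain proportional stratified sampling, and it assumes the important features have already been selected (the matrix-approximation step is not shown). The function name `stratified_window_sample`, the parameters `selected` and `k`, and the synthetic records are all hypothetical.

```python
import random
from collections import Counter, defaultdict, deque


def stratified_window_sample(window, selected, k, rng):
    """Draw k records from `window` (fewer if the window is smaller),
    keeping the joint distribution of the `selected` feature indices
    close to that of the window.

    Assumption: this is proportional stratified sampling, used here as a
    stand-in for a feature-preserved sampler.
    """
    # Group records by their values on the selected features.
    strata = defaultdict(list)
    for record in window:
        key = tuple(record[i] for i in selected)
        strata[key].append(record)

    n = len(window)
    k = min(k, n)
    # Largest-remainder allocation: each stratum gets floor(k * share),
    # and leftover slots go to the largest fractional parts.
    quotas = {key: k * len(rows) / n for key, rows in strata.items()}
    alloc = {key: int(q) for key, q in quotas.items()}
    leftover = k - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda key: quotas[key] - alloc[key],
                          reverse=True)
    for key in by_remainder[:leftover]:
        alloc[key] += 1

    # Uniform sampling without replacement inside each stratum.
    sample = []
    for key, rows in strata.items():
        sample.extend(rng.sample(rows, alloc[key]))
    return sample


rng = random.Random(0)
window = deque(maxlen=1000)  # sliding window: old records fall off the left
for t in range(5000):
    # Hypothetical record: (skewed categorical feature, binary feature, noise)
    window.append((rng.choice("AAAB"), t % 2, rng.random()))

sample = stratified_window_sample(window, selected=(0, 1), k=100, rng=rng)
window_counts = Counter((r[0], r[1]) for r in window)
sample_counts = Counter((r[0], r[1]) for r in sample)
for key in sorted(window_counts):
    print(key, window_counts[key] / len(window), sample_counts[key] / len(sample))
```

With this allocation, each stratum's count in the sample differs from its exact proportional share by less than one record, which is the sense in which the sketch preserves the selected features' distribution. A uniform random sample of the same size gives no such per-stratum bound.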