LOAD: LSH-Based \(\ell_0\)-Sampling over Stream Data with Near-Duplicates

2020 
The sheer volume of modern stream data makes real-time analysis nearly impossible. To cope with this volume, previous works typically use sampling to extract representatives and then analyze the sampled dataset. In this paper, we propose LOAD, a Locality-Sensitive Hashing (LSH) based \(\ell _0\)-sampling method over stream data. Instead of using the same diameter for all dimensions, LOAD uses dimension-specific diameters, which can fit the distribution of groups better. As a result, LOAD identifies representatives more accurately. To facilitate real-time analysis, we further optimize LOAD by applying LSH: since nearby items are hashed into the same bucket with high probability, distinguishing the representatives becomes much faster. Extensive experiments show that LOAD is not only more accurate than other state-of-the-art algorithms, but also faster by an order of magnitude.
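To make the idea of dimension-specific diameters concrete, the following is a minimal illustrative sketch, not the LOAD algorithm itself. It assumes a simple grid hash as a stand-in for LSH: each dimension has its own cell width, near-duplicates that fall into the same cell are treated as one group, and a representative is then drawn uniformly from the distinct groups. The names `bucket_key`, `GroupSampler`, and `widths` are hypothetical and do not come from the paper.

```python
import math
import random


def bucket_key(point, widths):
    """Map a point to a grid cell, using one cell width per dimension."""
    return tuple(math.floor(x / w) for x, w in zip(point, widths))


class GroupSampler:
    """Toy sampler over groups of near-duplicates (assumption: grid hashing)."""

    def __init__(self, widths, seed=None):
        self.widths = widths
        self.representatives = {}  # cell key -> first point seen in that cell
        self.rng = random.Random(seed)

    def add(self, point):
        # Keep only the first item of each cell as that group's representative.
        key = bucket_key(point, self.widths)
        self.representatives.setdefault(key, point)

    def sample(self):
        # Uniform over distinct groups, not over raw stream items.
        return self.rng.choice(list(self.representatives.values()))


# Usage: the second dimension has a much larger spread, so it gets a wider cell.
stream = [(0.10, 10.0), (0.12, 11.0), (0.90, 50.0), (0.11, 10.5)]
sampler = GroupSampler(widths=(0.5, 20.0), seed=0)
for item in stream:
    sampler.add(item)
chosen = sampler.sample()
```

In this toy stream, three of the four items fall into the same cell and count as one group, so the sampler holds two representatives. A plain grid can split near-duplicates that straddle a cell boundary, which this sketch does not handle.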