MELODY-JOIN: Efficient Earth Mover's Distance similarity joins using MapReduce

2014 
The Earth Mover's Distance (EMD) similarity join retrieves pairs of records with EMD below a given threshold. It has a number of important applications such as near duplicate image retrieval and pattern analysis in probabilistic datasets. However, the computational cost of EMD is super cubic to the number of bins in the histograms used to represent the data objects. Consequently, the EMD similarity join operation is prohibitive for large datasets. This is the first paper that specifically addresses the EMD similarity join and we propose to use MapReduce to approach this problem. The MapReduce algorithms designed for generic metric distance similarity joins are inefficient for the EMD similarity join because they involve a large number of distance computations and have unbalanced workloads on reducers when dealing with skewed datasets. We propose a novel framework, named Melody-Join, which transforms data into the space of EMD lower bounds and performs pruning and partitioning at a low cost because computing these EMD lower bounds has a constant complexity. Furthermore, we address two key problems, the limited pruning power and the unbalanced workloads, by enhancing each phase in the Melody-Join framework. We conduct extensive experiments on real datasets. The results show that Melody-Join outperforms the state-of-the-art technique by an order of magnitude, scales up better on large datasets than the state-of-the-art technique, and scales out well on distributed machines.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    26
    References
    15
    Citations
    NaN
    KQI
    []