Fast Similarity Sketching

Søren Dahlgaard,Mathias Bæk Tejs Knudsen,Mikkel Thorup

Fast Similarity Sketching

2017

We consider the Similarity Sketching problem: Given a universe $[u]= \{0,\ldots,u-1\}$ we want a random function $S$ mapping subsets $A\subseteq [u]$ into vectors $S(A)$ of size $t$, such that similarity is preserved. More precisely: Given sets $A,B\subseteq [u]$, define $X_i=[S(A)[i]= S(B)[i]]$ and $X=\sum_{i\in [t]}X_i$. We want to have $E[X]=t\cdot J(A,B)$, where $J(A,B)=|A\cap B|/|A\cup B|$ and furthermore to have strong concentration guarantees (i.e. Chernoff-style bounds) for $X$. This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors $S(A)$ are also called sketches. The seminal $t\times$MinHash algorithm uses $t$ random hash functions $h_1,\ldots, h_t$, and stores $\left(\min_{a\in A}h_1(A),\ldots, \min_{a\in A}h_t(A)\right)$ as the sketch of $A$. The main drawback of MinHash is, however, its $O(t\cdot |A|)$ running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. Addressing this, Li et al. [NIPS'12] introduced one permutation hashing (OPH), which creates a sketch of size $t$ in $O(t + |A|)$ time, but with the drawback that possibly some of the $t$ entries are "empty" when $|A| = O(t)$. One could argue that sketching is not necessary in this case, however the desire in most applications is to have one sketching procedure that works for sets of all sizes. Therefore, filling out these empty entries is the subject of several follow-up papers initiated by Shrivastava and Li [ICML'14]. However, these "densification" schemes fail to provide good concentration bounds exactly in the case $|A| = O(t)$, where they are needed. (continued...)

Keywords:

Correction
Cite
Save
Machine Reading By IdeaReader

References

Citations