During my work, the need arose to evaluate the similarity between text documents.

The problem had two parts. During training the documents had to be clustered and during evaluation a new document had to be “assigned” its most similar (from all the documents already on our DB).

One of the bottlenecks of the approach was the computational complexity of comparing a set of N documents between themselves for the clustering part, and also the time required to later determine, at run time of the algorithm, which document was the most similar to a new incoming document.

Comparing the similarity of every pair in N documents requires operations. If comparing one pair takes 1 microsecond, comparing 5 000 documents takes just over 10 seconds, but comparing every pair in 500 000 requires more than 1 day.

However, one might not be interested in determining the similarity between all pairs, but just between those that are above a certain threshold of similarity. **The amazing thing is, how to know which pairs are those without comparing them all to begin with?**

LSH gives us the answer, at the cost of a small probability of error.

Regarding LSH, there are two important ideas that were badly explained in most references:

– Firstly, the threshold similarity s above which you consider two documents to be similar underpins everything else.

– Secondly, LSH is a generic approach and every similarity measure has its own specific implementation ^{1} as seen on the table below.

Distance/Similarity metric | LSH implementation |
---|---|

Euclidean Distance | Random Projection |

Jaccard Similarity | MinHash |

Hamming Distance | Bit Sampling |

Cosine Similarity | SimHash |

There are even distance measures that do not have a possible implementation of LSH ^{2}.

#### Choosing the metric

Normally one chooses the metric that most suits the notion of similarity of the problem and then uses the corresponding LSH. What the appropriate metric should be will not be discussed.

Although it also sounds reasonable to disregard a specific metric if it doesn’t have a LSH or lean towards another metric if its LSH implementation is particularly faster or easier somehow.

I will discuss two specific implementations: **simHash** and **minHash**.

#### Overall view

The figure below represents a high level view of LSH implementations.

The input is a set of documents and the output are tables (only one shown) whose entries contain the documents. The documents in the same entry are supposed to have a similarity between themselves of at least a threshold value s. Documents that are in distinct entries are supposed to have a similarity below s.

LSH is not exact. It has false positives and false negatives. Hence the term ‘supposed’.

False positives occur when a pair is on the same entry but is below the threshold.

False negatives occur when a pair is above the threshold but on distinct entries.

LSH is therefore a function/algorithm/procedure that maps similar documents to the same value and not-similar documents to different values. Where similar depends both on the metric being used and on the threshold s.

Notice that the resulting structure considerably reduces the computational complexity of the problems described in the beginning of the post.

For example, given a new incoming document X, we can apply the same LSH procedure and immediately get the documents (already on our DB) that are similar to it – the ones present in the same bucket. Then it would only be required to compute the similarity between those.

## simHash

With SimHash, documents are represented as points in a multi-dimensional space. The similarity of two documents is the cosine of the angle between them – the cosine distance.

How a document is represented as a vector, while very important overall, is not relevant for understanding the LSH. In my implementations I use the tf-idf scheme to determine the weights along each dimension. That is, each dimension is a word, and the weight of a document along that axis is given by the tf-idf scheme.

So, know that the similarity metric has been explained, lets move to the actual LSH.

What is required is a function such that, when applied to all the documents, documents that are close together will have with high probability the same value, and documents that are far apart will have a different value.

Here, of course, *close together* is within the context of the similarity metric. In this case, documents are as close together the smaller the angle between them.

More formally, this is often expressed has:

if then and

if then

such that and .

On the other, for the simHash case this is achieved by dividing the space with hyperplanes.

For a given plane, each document ends up in one of the two regions which the plane separates.

Intuitively, the closest two documents are, the more likely they will be on the same region for a random plane. For example, if two documents are almost on top of each other, then they will end up on the same region for most of the planes that you can come up with.

For every random plane, all the documents on each region are assigned the value of 1 – if on the positive side – or 0 if on the negative side. This procedure is illustrated on the table below concerning the image above.

1 | 0 | 0 | |

1 | 0 | 1 | |

1 | 1 | 0 |

Each random plane is seen as representing a function, whose value applied to a document is either 0 or 1. This is a pivotal concept.

For planes drawn at random, it can be mathematically proven that two documents hash to the same value with probability:

The expression encapsulates our intuition: The closer two documents are (small ) the greater the chances they end up in the same region.

Notice that here are either 0 or 1.

#### LSH function

How to create random hyperplanes?

Better still, how to automatically compute the function values of each document for a given random hyperplane?

We select a random vector from the d-dimensional space where all the documents are represented. The weight of each dimension is drawn from a Gaussian distribution with .

This random vector represents a random hyperplane which cuts the dimensional space into two regions.

The LSH value of each document is then simply:

If you are curious why one must use a normal distribution for the coefficients of random vector check my next post.

#### So where are we?

LSH maps similar documents to the same value and different documents to different values. Right now, we have a function that kinda does that:

In particular notice that this mapping function respects the formal definition of a LSH function stated above.

If this were the end, i.e. if two documents were considered similar if they had the same value for a single function (single random hyperplane) there would be a great number of false positives and false negatives.

For example, there would be a 50% probability that a pair of documents with similarity of 0 would be considered similar.

It would be better if we could magnify the difference between and . That is, to decrease the probability that “dissimilar” documents hash to the same value, and at the same time, increase the probability that similar documents hash to the same value.

The best scenario would be if we could somehow reach a step function, on which all the pairs above and below a threshold similarity are and are not considered candidate pairs, respectively.

### LSH amplification

To magnify this difference we proceed as follows:

1 -Instead of using only one hash function from the hyperspace to the space {0,1} we apply (to every document) distinct hash functions

2. We divide this hash functions in bands/groups of .

3. A pair of documents is considered a candidate pair, if and only if they have the **same values** on **all** the hash functions in a band **for at least 1 band.**

In the example above, documents and are considered candidate pairs via bands 1 and 3. Documents and hashed to the same values on all hash functions of band 2, and so they are also candidate pairs. Document has no candidate pair for the shown bands.

#### How does this help?

Previously, the probability that two documents hashed to the same value (and thus became similar pairs) was

With magnification, the probability is computed as follows:

So now, the probability that two documents are considered similar pairs is:

, with

The improvement is illustrated on the following plots:

The new probability plot on the right is much closer to the idealized case.

It is up to you to decide what is the threshold above which you consider your documents to be similar. This similarity depends greatly on context. Then you play with parameters and such that the peek increase in probability occurs around that similarity .

If the similarity threshold was 0.4, it would make no sense choosing as parameters the ones used on the right plot above. It would lead to many False Negatives, that is, many pairs of documents whose similarity was in fact greater than 0.4 but that would be ignored by the LSH.

The plot above would be adequate for a threshold of around .

## minHashing

minHash is the LSH implementation when the similarity between documents (or anything else really) is the Jaccard Similarity. In contrast with the cosine distance, documents are here represented as sets.

There are many ways one can represent a document as a set and that is very much relevant for the notion of similarity we want to calculate:

Examples:

doc.a : “the quick brown fox jumps over the lazy dog”

1-shingles: {the, quick, brown, fox, jumps, over, the, lazy, dog}

2-shingles: {the quick, quick brown,brown fox, fox jumps, jumps over, over the, the lazy, lazy dog}

However, for understanding and applying minHash, the key note is that documents are sets of elements, regardless of what elements those are (they could be 1-shingles or 2-shingles or something else.)

The Jaccard Similarity of two sets X and Y is the ratio of the size of the intersection between X and Y to the size of their union:

As an example:

doc.b= “the silver dog hunted a brown fox”

Above, the Jaccard Similarity between documents A and B is

So, know that the similarity metric has been explained, lets move to the actual LSH.

Similarly to the simHash case above, and to any other LSH, we know require a function that is able to map each of our documents into another space, such that:

if then and

if then

with and .

I found it harder to understand and specially to implement such functions for minHashing than for simHashing.

The method for doing so can be found at [3], which I will explain next. The result is a minHash family of functions whose mapping has the property . That is, if documents dÂ¹ and dÂ² have a jaccard similarity of 0.4, then for any given minHash function of this family

Notice that such family of functions respects the aforementioned property, as the next plot illustrates:

After we have found such (family of) functions we proceed with augmentation, similarly to the simHash case. That is, we use such functions, dividing them into bands of rows each.

## How to find such a function for the minHash case?

As an example, consider documents C and D that add no new elements to documents A and B.

The procedure is as follows:

Conceptually, one builds a table where the columns are the various documents being considered and each row is an element from the set of all elements. (the union of all the document-sets).

Each of this elements is present in some documents and not in others. We populate the table with 1’s and 0’s according with whether a given element is present or absent in the document respectively.

Then:

1 – Maintain the columns fixed and randomly permute the order of the rows.

2 – For this permutation, the minHash value of a document is the 1st row for which the respective element is present in the document (i.e. it has a 1 in the cell).

In the image above , and .

It may yet not be apparent but for our purposes it is equivalent to say instead , and . There is a 1-to-1 relation in each permutation. On the actual implementation, the later form is the more practical.

Notice that, each random permutation of the rows gives us one minHash value for each document. It therefore defines a minHash function.

But we have seen that we need many minHash functions for the magnification step. Therefore:

3 – Repeat steps 1-2 as many times as the required number of minhash functions. Each permutation defines a minHash function and gives one minHash value for each document.

Notice also, that while for the simHash the possible LSH values are {0, 1}, in the minHash case, the codomain of the function is an integer from {0, Number-of-Elements}; where here the number of elements are the number of distinct words.

#### Why does this work?

For the simHash case, it was simply stated that using random hyperplanes (or equivalently random vectors) translated into:

In this case why does permuting the rows and selecting the first element present in the document for each document, translates into:

?

This is after-all, the core of LSH.

To avoid just copying the very good explanation, you can check sub-chapter 3.3.3 of Mining For Massive Datasets . The proof is simple and easy to understand.

Conceptually this is it. However, as explained in the reference, permutation of many elements becomes inconceivable as the number of rows (i.e. the universal set size) goes to millions.

In reality a trick is used to emulate the above procedure. This is where it became trickier for me:

1 – To emulate a permutation one uses an **exact hash function** that “throws” each row to a different position, the same way that in a true permutation any row can change positions.

**Side notes:**

**1.** The part ‘hashing’ in ‘minHashing’ doesn’t come from this exact hash function above.

**2.** Again, this exact hash functions are only there to emulate the permutation of rows.

**3.** Locality Sensitive Hashing and Exact Hashing are separate topics. In this case however minHashing requires the help of exact hashing for its implementation.

The references explain poorly how to come up with this exact hash functions. By understanding what is their purpose, the following solution (that I implemented and thoroughly tested) seems to be adequate. It might or not be the one actually used in production systems.

Where:

is the modulus operation

is the number of rows (i.e. the size of the universal set)

is any prime number bigger than . (I normally choose the 1st prime after )

and are random integers such that and

is the row index. One applies each function to all rows.

is defined by the problem.

depends on and remains fixed after defined.

and define a Hash Function Family.

Each random pair defines one exact hash function (from the family) that we require.

I came up with this exact hashing function by looking at [4] and [5].

Overall, if we want n minHash functions then we require n random permutations. To emulate each permutation we need an exact hash function that is defined by a pair. We need in total n pairs.

The drawback is that there is a finite probability that two different (initial) rows are “thrown” to the same index by the same exact hash function. In those cases we are not emulating a true permutation.

This probability should be inversely proportional to the size of the universal set . For problems with hundreds of thousand of elements this should of course not be a problem.

I found it instructive however to analyse what happens when is within a few hundred.

Lets see how these collisions for a given exact hash function affect the associated minHash values. To that end, I used the document dataset NEWS20.^{6}

Considering only the documents 82781, 82785, 83798 and 83900 on directory ‘/NEWS20/20news-bydate-train/talk.religion.misc’ from the NEWS20 dataset. From this 4 the universal set has elements and the pair 82781, 82785 has a Jaccard similarity of 0.17.

By running the above procedure for 200 hash functions, we have that of the minHash values agree. Therefore the estimated similarity is . Quit fair from the real value.

By adding to these 4 all the remaining documents on that path of the NEWS20 dataset, the universal set grows to elements.

Now, for another 200 hash functions, the minHash values between the same two documents agree in 36. This amounts to a similarity of . Extremely close to the actual Jaccard similarity of 0.17.

Remember that the initial purpose was finding a function h (a minHash function) that mapped documents to another space, in such a way that for two documents and , the probability for their values to be the same is: .

We summed the number of times the hash values agreed for all the minHash values at once, merely to make use of the law or large numbers. Considering just one hash function wouldn’t be sufficient to verify that in fact the procedure translated to .

### Buckets

Like the amplification, this step is common to both procedures.

So now, for both the minHash and simHash cases, we are able to build this tables which hold the LSH values for LSH functions for all documents. From them, following amplification, we can identify which documents are candidate pairs. Previously, for simHash, this was done by visually inspecting the data frame.

How could the LSH algorithm recognize which pairs are candidate-pairs? Looping on each band and comparing the value in each row seems inefficient and defeats the purpose of all this. The solution is to use exact hashing (again).

So, on each band, one uses a universal hash function and associated hash table . This hash functions map the LSH values on all rows of that band (for a given document) to an integer. This integer is the index in a hash table. Two documents that share the same LSH values on all rows are mapped to the same integer and index in the hash table. Documents in the same index in the table are candidate pairs. This procedure incorporates the previous definition of a candidate pair following the amplification step.

So the problem goes again to using an appropriate exact hash function. To avoid discussing exact hashing again, and although this step is crucial to the LSH procedure, I forward you to page 13 on [7], which explains this topic well enough in this context.

Still, the simHash case is very suitable to explaining this part. Because the LSH values are always 0 or 1, one can pretend the sequence of LSH values in a given band is a binary number, whose decimal representation is the index in the Table. This would be the same thing as a Direct-Adress Table instead of a Hash Table. But I think it would be practical only if the size of the bands were small. If we wanted = 30, we would get tables with ~ 1 billion elements. For illustrating purposes however:

### Conclusion

Imagine we have 1 million documents in a database.

For a new incoming document, we want to know which of this million is the most similar to it.

The brute force approach would be to make 1 million comparisons.

With LSH however, we would have these hash tables saved somewhere (one for each band). Their generation would have had to occur beforehand, but only once!

Then, we would apply **the same LSH functions** to this new document. For the simHash case this would require using the same random vectors ( and in the same order of course). For the minHash case this would require using the same permutations (also in the same order).

The new document would ultimately be mapped to a given index in the hash table. Finally, one would compare the new document **only** to the other documents present in that same entry of the hash table (for each hash table representing a band).

Similarly, if the objective is to compare all documents between themselves, then one would now only need to compute for each pair which shared an index in any of the hash tables (bands).

*If you have any questions or comments about the post, I would be happy to hear. Just write on the comment section.*

#### References

1. Ni, Y. & Chu, K. Locality Sensitive Hashing Design Doc

2. Charikar, M. Similarity Estimation Techniques from Rounding Algorithms

3. Mining of Massive Datasets, chapter 3

4. Coursera: Data Structures course by University of California, San Diego. Week: Hash Tables, Video: Universal Family

5. Coursera: Data Structures course by University of California, San Diego. Week: Hash Tables, Video: Hashing Integers

6. NEWS20 Dataset

7. Shrivastava, A. & Li, P. In Defense of MinHash Over SimHash

8. Andoni, A. & Indyk, P. EÂ²LSH 0.1 User Manual

The best materials to understand LSH. Thanks a lot!

LikeLike

Thanks a lot. I have been reading this topic for a month, and this are the best post about this subject.

LikeLike

Thanks, I appreciate the positive comment đŸ™‚

LikeLike

Great article!

One question though. Since after applying LSH similar docs end up in the same buckets, what’s the purpose of comparing the docs within the same bucket at a later stage? I mean you already know that they are more than (e.g. 30%) similar, so I guess you do that to determine exactly how similar are they, right?

In my specific scenario, I am only interested in determining whether some docs are similar above a specific threshold and I was a bit confused about whether I need this extra comparison afterwards.

LikeLike

Hello! Thanks for the informative article. I do have a question for you, however

In the equation for SimHash.. p = 1 – arccos(sim(x,y))/pi

We note that sim(x,y) is valid over the interval 0 -> 1 inclusively. This makes the numerator in the second term span pi/2 -> 0 inclusively. This means that the second term in “p” is at least 0 / pi (so that p = 1) or at most pi/2 / pi (p = 1 – 1/2 = 1/2). This means that the probability that two documents hash to the same bit for a single hyperplane can never be 0, which I find puzzling.

In fact if you visit this video (https://www.youtube.com/watch?v=h21irtHDsBw) explaining LSH error calculations, and jump exactly to minute 10:00, you will find that they actually use the equation p = 1 – 2*arccos(sim(x,y))/pi .. The added factor of “2” takes care of allowing p to span from 0 -> 1 instead of 0 -> 1/2, which seems strange to me.

Can you confirm or deny my claim?

LikeLike