Algorithms

MinHash

Set similarity estimation via minimum hash matching.


In plain terms

Andrei Broder at AltaVista, 1997. Compute Jaccard similarity in O(k) instead of O(|A∪B|). Used in web deduplication.

Origin

Andrei Broder at AltaVista, 1997 — "On the resemblance and containment of documents." Used to dedupe near-duplicate web pages in the early search-engine era.

Where it shows up in production
  • Web dedup Google and Bing have used MinHash variants for crawl deduplication since the early 2000s.
  • Datasketches library Apache project; MinHash + many other sketches. Used in Druid for approximate joins.
On Semicolony
Sources & further reading
Found this useful?