MinHash
Set similarity estimation via minimum hash matching.
Origin
Andrei Broder at AltaVista, 1997 — "On the resemblance and containment of documents." Used to dedupe near-duplicate web pages in the early search-engine era.
Where it shows up in production
- Web dedup Google and Bing have used MinHash variants for crawl deduplication since the early 2000s.
- Datasketches library Apache project; MinHash + many other sketches. Used in Druid for approximate joins.
On Semicolony
Sources & further reading
Found this useful?