Count-min Sketch - Nate's Notes

### Raison d'être > **Problem**: How many times does a value occur in a list? > **Solution**: A probabilistic approach that utilizes a matrix of hash functions that serve as counters. The result can only *overestimate* how many times a given value has occurred in a list, but it will never *underestimate*. ### Background ![](Screen%20Shot%202021-06-28%20at%2010.37.30%20AM.png) ![](Screen%20Shot%202021-06-28%20at%2010.37.51%20AM.png) ![](Screen%20Shot%202021-06-28%20at%2010.38.41%20AM.png) ![](Screen%20Shot%202021-06-28%20at%2010.38.53%20AM.png) ![](Screen%20Shot%202021-06-28%20at%2010.39.12%20AM.png) ![](Screen%20Shot%202021-06-28%20at%2010.39.32%20AM.png) ### Difference from [Bloom-filters 1](Bloom-filters%201.md) The main conceptual different is that CM sketch represents a multiset, and has different assumptions about the kind of updates. More formally, CM sketch summarizes a frequency distribution, while Bloom Filter is concerned with representing which elements are present in a set Count–min sketches are essentially the same data structure as the [counting Bloom filters](https://en.wikipedia.org/wiki/Counting_Bloom_filter "Counting Bloom filter"). However, they are used differently and therefore sized differently: a count-min sketch typically has a sublinear number of cells, related to the desired approximation quality of the sketch, while a counting Bloom filter is more typically sized to match the number of elements in the set. The main conceptual different is that CM sketch represents a multiset, and has different assumptions about the kind of updates. More formally, CM sketch summarizes a frequency distribution, while Bloom Filter is concerned with representing which elements are present in a set The implementation details of the countmin sketch are very similar to those of a bloom filter. The latter structure only uses bits, rather than integer-valued counters. When an object is inserted into a bloom filter, hash functions indicate bits that should be set to 1 (whether or not they were previously 0 or 1). The count-min sketch, which is responsible for keeping counts rather than just tracking membership, instead increments counters. Looking up an object in a bloom filter just involves checking the bits corresponding to that object — if any of them are still 0, then the object has not been previously inserted. Thus Lookup in a bloom filter can be thought of as taking the minimum of bits, which exactly parallels the Count operation of a countmin-sketch. That the count-min sketch only overestimates frequency counts corresponds to the bloom filter’s property that it only suffers from false positives. --- Date: 20210628 Links to: [Computer Science MOC](Computer%20Science%20MOC.md) [Algorithms and Data Structures](Algorithms%20and%20Data%20Structures.md) [Count-Sketch](Count-Sketch.md) Tags: #review References: * [Count min sketch](https://www.youtube.com/watch?v=lGoCslwItiU&t=192s) * [Count min sketches, another youtube video](https://www.youtube.com/watch?v=mPxslXpg8wA&t=288s) * [Count sketch explain](https://stackoverflow.com/questions/6811351/explaining-the-count-sketch-algorithm) * [Count min sketch slides](https://www.slideshare.net/gakhov/probabilistic-data-structures-part-3-frequency) * [Wikipedia](https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch) * [Difference from bloom filter](https://sites.google.com/site/countminsketch/home/faq) * [Web analytics data mining example](https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)