Hash Maps - Nate's Notes

# Hash Maps The key idea to keep in mind with hash maps is that under the hood they are using a **list**, but the ingenuity stems from their clever use of two things: 1. Hash functions 2. Compression functions A hash function takes in some *immutable* value (whether it be an integer, string, tuple, etc), and maps it to an *integer*. A compression function then takes this integer and maps it to some *index* of the underlying list that will contain our values. The usefulness of a hashmap comes from the fact that they are very fast, generally having lookup and insertion time of O(1). Let's think about why this is so fast. Consider a hash map that was built on top of a list containing 1000 slots. If we were looking to see if some name, say `John`, was in the list, we would need to scan the entire list, which is O(n) (where `n` is the length of the list). Where hash functions come in is they allow us to take the name `John` and map it to an index very quickly. Specifically the two steps look like: ```python my_hash_list = [] # Map 'John' to an integer >> hash('John') -1005131343229729964 # Map this integer to a bucket of our list def compress(i): a = 11 b = 13 p = 109345121 # prime number larger than N N = 1000 # num slots in list return ((a * i + b) % p) % N >> compress(-1005131343229729964) 390 my_hash_list[390] = 'John' ``` Now, we have an operation that is very fast! This ability to hash and then compression some key, such as `John`, only take several operations (for instance, our `compress` function requires `1` multiplication, `1` addition, and `2` mods, for a total of `4` operations. This is *constant* for any key that needs to be compressed). What we *must* keep in mind though is that the entire usefulness of a hash function comes from the fact that they map *uniformly random* to indices of the list. If this process is not random, then we will end up with many values being mapped to the same index. ### Meta idea One of the cool things to keep in mind here is that we do *not* need to "keep track" of where are particular key resides. In other words, we don't have some "table" that we go to an ask: "What index does `John` reside at?" and the table responds, "Ah `John` is at index `390`". In a sense, we have pushed that work off to the **mathematical process of hashing and compressing**. And ya know what, if we think about it in some ways we are "asking" what index `John` resides at, but the answer is algorithmically computable It lives in a space that our function captures. ### General Properties Hash functions typically have these key properties: * They map some type of input, such as strings or floats, to discrete values, such as integers. * They’re designed so that two inputs will result in hash outputs that are either different or the same based on key properties of the inputs. ### Why do hash functions use prime numbers? See [here](https://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/) as well as [this notebook](https://github.com/NathanielDake/intuitiveml/blob/master/notebooks/computer_science/Algorithms_and_data_structures_in_python/maps_and_dictionaries/hash_collisions_and_primes.ipynb). Note that if the data are random then we do not need to use a prime. However, if the data exhibit any sort of pattern then a prime will be very useful in spreading out collisions (see [here](https://qr.ae/pGJaSN)). --- Date: 20210813 Links to: Tags: References: * [Pythons default hash algorithm](https://andrewbrookins.com/technology/pythons-default-hash-algorithm/)