A “simplest” hash function is completely dependent on what you are using the hash function for and the guarantees you want a hash function to make. An optimal permutation of an integer is different from a small string hash is different from a block checksum. Literally, you are optimizing the algorithm for entirely different properties. No algorithm can satisfy all of them even approximately.
The full scope of things hash functions are commonly used for requires at least four algorithms if you care about performance and optimality. It is disconcertingly common to see developers using hash algorithms in contexts where they are not fit for purpose. Gotta pick the right tool for the job.
For example, when you know that your input is uniformly randomly distributed, then truncation is a perfectly good hash function. (And a special case of truncation is the identity function.)
The above condition might sound too strong to be practical, but when you are eg dealing with UUIDs it is satisfied.
Another interesting hash function: length. See https://news.ycombinator.com/item?id=6919216 for a bad example. For a good example: consider rmlint and other file system deduplicators.
These deduplicators scan your filesystem for duplicates (amongst other things). You don't want to compare every file against every other file. So as a first optimisation, you compare files only by some hash. But conventional hashes like sha256 or crc take O(n) to compute. So you compute cheaper hashes first, even if they are weaker. Truncation, ie only looking at the first few bytes is very cheap. Determining the length of a file is even cheaper.
Well, that's technically also a deterministic random number generator! (I want to say it's not a great one, but... that's apparently context-dependent!)
While I generally like to reinvent the wheel, for hash functions I strongly recommend to use a proved good one. Djb2 by the venerable Daniel Bernstein satisfies all the requirements of TFA.
h = 5381
while still has data:
h = h * 33 + next_byte()
return h
PS of course if you think the multiplication is overkill, consider that it is nothing more than a shift and an addition.
The point about hash tables using top bits instead of bottom bits is the kind of thing that feels obvious once someone says it and yet here we are. Genuine question: have you seen any real-world hash table implementations that actually do this, or is it purely "this is what we should have done 40 years ago"?
I think older processors used to have a slower implementation for shifts, which made this slower.
Nowadays swisstable and other similar hashtables use the top bits and simd/swar techniques to quickly filter out collisions after determining the starting bucket.
Sorry, I've read and reread TFA but the concept still evades me. Is it that, since it's easier for a hash function to have higher entropy in the higher bits than in the lower ones, it would be more logical for hash tables to discard the lower bits and keep the higher ones?
I'm perplexed to the claim that addition is cheaper than XOR, especially since addition is built upon XOR, am I missing anything? Is it javascript specific?
The wording was a bit unclear. The previous paragraph mentions wanting something cheaper than "those pesky XORs and multiplications". The multiplication is the expensive part; the (very cheap) XORs are just mildly annoying because you have to think about what they're doing.
At least on x86, multiple additions and multiplications can be done with a single `lea` instruction so it's preferable to XOR. Though I have no idea about other architectures, compiler implementations, any interpreters...
That only helps with multiplications by statically known word sizes (4x, 8x, etc.) and not arbitrary x·y. It can help with many smaller constant multipliers if the complete is clever, but it has to be known at compile time.
You'd probably want 'return len(str)', if this is Python?
In any case, for some applications this is indeed a great hash function. Programs like rmlint use it as part of their checks for duplicate files: if files have different lengths, they can't have the same content after all.
Any `return c` for some constant is a valid and correct hash function. It just has a lot of collisions and degenerates hash-maps to terrible performance. That was in fact my first thought when I read "simplest hash functions".
It is mathematically impossible for a proper hash function (one with an output range smaller than its input range) to not have collisions. The proof uses the Pigeon Hole Principle https://en.wikipedia.org/wiki/Pigeonhole_principle
I guess I've never actually had this problem because was always hashing things that were static, or specialty cases like password hashes where the salt obviously guarantees uniqueness.
It's very very unlikely to get collisions there, but still not impossible. Whenever you map data of arbitrary length (infinite possibilities) to a limited length collisions are possible.
padding with zeroes to a fixed length and prepending the original length would suffice, but you’d have to have a fixed length of double infinity to account for both the length information and the hash information, and the hash is less efficient than the original information.
A “simplest” hash function is completely dependent on what you are using the hash function for and the guarantees you want a hash function to make. An optimal permutation of an integer is different from a small string hash is different from a block checksum. Literally, you are optimizing the algorithm for entirely different properties. No algorithm can satisfy all of them even approximately.
The full scope of things hash functions are commonly used for requires at least four algorithms if you care about performance and optimality. It is disconcertingly common to see developers using hash algorithms in contexts where they are not fit for purpose. Gotta pick the right tool for the job.
You are right!
For example, when you know that your input is uniformly randomly distributed, then truncation is a perfectly good hash function. (And a special case of truncation is the identity function.)
The above condition might sound too strong to be practical, but when you are eg dealing with UUIDs it is satisfied.
Another interesting hash function: length. See https://news.ycombinator.com/item?id=6919216 for a bad example. For a good example: consider rmlint and other file system deduplicators.
These deduplicators scan your filesystem for duplicates (amongst other things). You don't want to compare every file against every other file. So as a first optimisation, you compare files only by some hash. But conventional hashes like sha256 or crc take O(n) to compute. So you compute cheaper hashes first, even if they are weaker. Truncation, ie only looking at the first few bytes is very cheap. Determining the length of a file is even cheaper.
This comment feels about half as long as it ought to be. Can you say more?
I just realized that a hash function is nothing less than the output of a deterministic random number generator xored with some data
Could you explain what you mean here?
Hashes are _functions_ so provide the same output given the same input.
If you don't reseed the RNG after every hash computation, then you break this vital property of hashes.
And if you do reseed, then your claim boils down to "every hash function is just an XOR against a contstant" which certainly is not true either.
Sorry, what?
That might we one very particular way to write a hash function, but it's far from the only one.
Believe it or not, for some purposes taking the first few bytes of a string or even just the length of the string are good hash functions.
Well, that's technically also a deterministic random number generator! (I want to say it's not a great one, but... that's apparently context-dependent!)
What are those purposes?
While I generally like to reinvent the wheel, for hash functions I strongly recommend to use a proved good one. Djb2 by the venerable Daniel Bernstein satisfies all the requirements of TFA.
PS of course if you think the multiplication is overkill, consider that it is nothing more than a shift and an addition.The point about hash tables using top bits instead of bottom bits is the kind of thing that feels obvious once someone says it and yet here we are. Genuine question: have you seen any real-world hash table implementations that actually do this, or is it purely "this is what we should have done 40 years ago"?
I think older processors used to have a slower implementation for shifts, which made this slower.
Nowadays swisstable and other similar hashtables use the top bits and simd/swar techniques to quickly filter out collisions after determining the starting bucket.
Sorry, I've read and reread TFA but the concept still evades me. Is it that, since it's easier for a hash function to have higher entropy in the higher bits than in the lower ones, it would be more logical for hash tables to discard the lower bits and keep the higher ones?
> Like addition
I'm perplexed to the claim that addition is cheaper than XOR, especially since addition is built upon XOR, am I missing anything? Is it javascript specific?
The wording was a bit unclear. The previous paragraph mentions wanting something cheaper than "those pesky XORs and multiplications". The multiplication is the expensive part; the (very cheap) XORs are just mildly annoying because you have to think about what they're doing.
At least on x86, multiple additions and multiplications can be done with a single `lea` instruction so it's preferable to XOR. Though I have no idea about other architectures, compiler implementations, any interpreters...
That only helps with multiplications by statically known word sizes (4x, 8x, etc.) and not arbitrary x·y. It can help with many smaller constant multipliers if the complete is clever, but it has to be known at compile time.
This is a excellent article for anyone looking for some more in-depth analysis of tabulation based hashing methods: https://arxiv.org/abs/1505.01523
If you use the identity function as your hashing function then is it O(0) because you are done before you start?
You'd probably want 'return len(str)', if this is Python?
In any case, for some applications this is indeed a great hash function. Programs like rmlint use it as part of their checks for duplicate files: if files have different lengths, they can't have the same content after all.
Re: missing return keyword
Actually, in Python, None is a valid key...
(I'm so sorry. JavaScript has ruined me.)
Any `return c` for some constant is a valid and correct hash function. It just has a lot of collisions and degenerates hash-maps to terrible performance. That was in fact my first thought when I read "simplest hash functions".
O(1) only if you're working in a language with length-stored strings (like Pascal[0]), right?
In something like C with its generic strings[1], it would surely have to be O(n) since you have to scan the entire string to calculate its length?
(I have always been terrible at big-O, mind.)
[0] There's probably more of them by now.
[1] ie. not a specific length-stored string type.
a hash function that produce hashes that are already in the hash table should, IMO, not be called a hash function.
I get why technically it is a hash function, but still, no.
It is mathematically impossible for a proper hash function (one with an output range smaller than its input range) to not have collisions. The proof uses the Pigeon Hole Principle https://en.wikipedia.org/wiki/Pigeonhole_principle
ok damn, I did not know this, obviously. Thanks.
I guess I've never actually had this problem because was always hashing things that were static, or specialty cases like password hashes where the salt obviously guarantees uniqueness.
It's very very unlikely to get collisions there, but still not impossible. Whenever you map data of arbitrary length (infinite possibilities) to a limited length collisions are possible.
Doesn't even have to be arbitrary length.
Whenever you map into a smaller space, you get collisions. The bigger space doesn't have to be infinite.
A perfect hash function https://en.wikipedia.org/wiki/Perfect_hash_function has to be specially constructed for each desired set of inputs. Generic hash functions cannot be 'perfect'.
Here is a hash function that does not have hash collisions:
That is a function, but not a hash function!
Well it no longer constrains the data in a fixed output length.
Sure, but if you constrain to fixed output length, you will definitely have collisions (Pigeon Hole Principle). There's no way around that.
padding with zeroes to a fixed length and prepending the original length would suffice, but you’d have to have a fixed length of double infinity to account for both the length information and the hash information, and the hash is less efficient than the original information.
what programming language is this?