Feature #5175
Updated by Davide Pesavento over 3 years ago
The design of [CityHash](https://github.com/google/cityhash), the hash function used in NameTree and DeadNonceList, is somewhat outdated. dated. While it's still very fast, the hashes produced, especially by the seed-less `CityHash64` variant, have a number of quality issues (see [SMhasher](https://github.com/rurban/smhasher) for details). More modern hash functions produce better quality hashes and can be are as fast as CityHash, if not faster. After a quick survey and a few some simple benchmarks, I came up with three a few candidates that I believe are the most promising potential replacements for CityHash in NFD. All three produce high quality hashes. NFD: * [**xxHash**](https://github.com/Cyan4973/xxHash), specifically the new `XXH3_64bits` variant. Good quality. Speed comparable to `CityHash64` (sometimes faster, sometimes slower, but not by much). When compiled with `-march=native` on an AVX2-capable CPU, CPU it becomes almost always faster than CityHash (but marginally so). * [**wyhash**](https://github.com/wangyi-fudan/wyhash). This seems to be the fastest in my tests. * [**t1ha**](https://github.com/erthink/t1ha), specifically the `t1ha2_atonce` variant. Good quality. Slightly, but consistently, slower than all previous functions (including CityHash). Also provides a pair of endian-specific `t1ha1_*` variants that are supposed to be faster but lower quality than `t1ha2`; however, in my tests I didn't see any measurable speed difference so I guess there's no reason to consider the `t1ha1_*` variants. I explicitly did not include [SipHash](https://github.com/veorq/SipHash). While it provides stronger security guarantees, I don't think it makes much sense to use SipHash just in a couple of places (NameTree and DNL) while everywhere else we use the standard C++ hash table implementation with its default hash function. SipHash is also much slower than the proposed alternatives. Note that I've only considered 64-bit hashes here. The above functions do work on 32-bit architectures as well, but they will typically be slower than dedicated 32-bit hash functions. In 2021 I don't think we should spend too much time optimizing for obsolete architectures such as x86 and arm, but if someone wanted to look into this, it would be trivial to select the hash function at compile time based on the target platform (we don't need stability/portability of the hash value in NFD).