Project

General

Profile

Feature #5175

Replace CityHash with a more modern hash function

Added by Davide Pesavento 3 months ago. Updated 3 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Tables
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:

Description

The design of CityHash, the hash function used in NameTree and DeadNonceList, is somewhat outdated. While it's still very fast, the hashes produced, especially by the seed-less CityHash64 variant, have a number of quality issues (see SMhasher for details). More modern hash functions produce better quality hashes and can be as fast as CityHash, if not faster.

After a quick survey and a few simple benchmarks, I came up with three candidates that I believe are the most promising potential replacements for CityHash in NFD. All three produce high quality hashes.

  • xxHash, specifically the new XXH3_64bits variant. Speed comparable to CityHash64 (sometimes faster, sometimes slower, but not by much). When compiled with -march=native on an AVX2-capable CPU, it becomes almost always faster than CityHash (but marginally so).
  • wyhash. This seems to be the fastest in my tests.
  • t1ha, specifically the t1ha2_atonce variant. Slightly, but consistently, slower than all previous functions (including CityHash). Also provides a pair of endian-specific t1ha1_* variants that are supposed to be faster but lower quality than t1ha2; however, in my tests I didn't see any measurable speed difference so I guess there's no reason to consider the t1ha1_* variants.

I explicitly did not include SipHash. While it provides stronger security guarantees, I don't think it makes much sense to use SipHash just in a couple of places (NameTree and DNL) while everywhere else we use the standard C++ hash table implementation with its default hash function. SipHash is also much slower than the proposed alternatives.

Note that I've only considered 64-bit hashes here. The above functions do work on 32-bit architectures as well, but they will typically be slower than dedicated 32-bit hash functions. In 2021 I don't think we should spend too much time optimizing for obsolete architectures such as x86 and arm, but if someone wanted to look into this, it would be trivial to select the hash function at compile time based on the target platform (we don't need stability/portability of the hash value in NFD).

#1

Updated by Davide Pesavento 3 months ago

  • Description updated (diff)

Also available in: Atom PDF