
CityHash

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
CityHash
Name: CityHash
Developer: Google
Released: 2011
Programming language: C++
Operating system: Cross-platform
License: MIT

CityHash is a family of non-cryptographic hash functions for hashing strings and byte arrays. It was developed at Google by Geoff Pike and Jyrki Alakuijala and released as open source in 2011. The functions are intended mainly for hash tables and similar data structures, where speed and good avalanche behavior on diverse inputs matter more than resistance to deliberate attack. The code is tuned primarily for little-endian 64-bit processors such as x86-64, although it also runs on other architectures, including ARM.

Overview

CityHash provides several related hash functions, selected by output width and tuned for different input lengths and processor capabilities. The reference implementation is written in C++. Code derived from CityHash also ships inside other libraries; Abseil, for example, includes a CityHash-based implementation among its internal hashing utilities. The design reflects tradeoffs common to fast non-cryptographic hashes: it favors throughput and statistical quality on ordinary inputs over resistance to adversarial ones, which suits it to hash tables and data partitioning rather than to security purposes.

Algorithm and Variants

CityHash includes functions producing 32-bit, 64-bit, and 128-bit outputs (CityHash32, CityHash64, and CityHash128), along with seeded versions of the 64-bit and 128-bit functions. Each function dispatches on input length, with separate code paths for short, medium, and long inputs. The core operations are integer multiplication, bitwise rotation, and xor-shift mixing, an approach the authors credit in part to MurmurHash; xxHash follows a broadly similar design, while the FNV family is simpler. CityHash does not use the cryptographic constructions found in SHA-1, SHA-256, or BLAKE2, and unlike SipHash it is not designed as a keyed function that resists chosen-input attacks. CityHash128 matches MD5 and RIPEMD-128 in output width but makes no claim of cryptographic security. Two further variants, CityHashCrc128 and CityHashCrc256, use the CRC32 instruction that x86 processors introduced with SSE4.2.
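The multiply, rotate, and mix operations described above can be sketched in C++ as follows. This is a minimal illustration of the style of mixing, not the full CityHash algorithm: the function names (RotateRight, ShiftMix, Mix128To64) are hypothetical, and the multiplier and shift amounts are assumed here for illustration, modeled on the Murmur-inspired helper in the published source.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 64-bit value right by `shift` bits (0 <= shift < 64).
inline std::uint64_t RotateRight(std::uint64_t value, int shift) {
  assert(shift >= 0 && shift < 64);
  // Guard shift == 0: shifting a 64-bit value by 64 bits is undefined.
  return shift == 0 ? value : (value >> shift) | (value << (64 - shift));
}

// Fold the high bits of a word into its low bits with an xor-shift.
inline std::uint64_t ShiftMix(std::uint64_t value) {
  return value ^ (value >> 47);
}

// Combine two 64-bit words into one using multiply and xor-shift rounds.
// kMul is odd, so each multiplication is invertible modulo 2^64.
// (Constant assumed for illustration; check the real source before relying on it.)
inline std::uint64_t Mix128To64(std::uint64_t low, std::uint64_t high) {
  constexpr std::uint64_t kMul = 0x9ddfea08eb382d69ULL;
  std::uint64_t a = (low ^ high) * kMul;
  a = ShiftMix(a);
  std::uint64_t b = (high ^ a) * kMul;
  b = ShiftMix(b);
  b *= kMul;
  return b;
}
```

Multiplication pushes differences in low bits toward the high bits, and the xor-shift feeds high bits back down, so alternating the two spreads a single-bit input change across the whole word.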

Performance and Implementation

CityHash is commonly compared with alternatives such as MurmurHash3, xxHash, and FNV-1a in hash function benchmarks. Results depend heavily on input length, compiler, and processor, so relative rankings vary between workloads. The implementation reads input in fixed-size words using unaligned loads, processes long inputs in 64-byte blocks with sequential, cache-friendly access, and keeps several independent operations in flight to exploit instruction-level parallelism. It is therefore fastest on 64-bit processors such as those from Intel and AMD; on 32-bit processors and some mobile SoCs, such as those from Qualcomm and Apple, its reliance on 64-bit arithmetic can be less favorable. Projects that embed CityHash build it with systems such as Bazel and CMake and typically evaluate it with standard profilers such as perf, Valgrind, and gprof, often under continuous integration services like Jenkins and Travis CI.
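A common way to read input sequentially in portable C++, without assuming that the data is aligned, is to copy each word out with memcpy; compilers generally turn this into a single load instruction. The sketch below illustrates that pattern only. The helper names Load64 and SumWords are hypothetical, and the summing loop is a stand-in for a real hash round rather than anything CityHash computes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Read 8 bytes starting at `offset`, with no alignment requirement.
// The result uses the host byte order.
inline std::uint64_t Load64(const std::vector<unsigned char>& buffer,
                            std::size_t offset) {
  assert(offset + sizeof(std::uint64_t) <= buffer.size());
  std::uint64_t word;
  std::memcpy(&word, buffer.data() + offset, sizeof(word));
  return word;
}

// Walk the buffer front to back in 8-byte steps, the sequential access
// pattern that keeps reads cache-friendly. Trailing bytes that do not
// fill a whole word are ignored in this sketch.
inline std::uint64_t SumWords(const std::vector<unsigned char>& buffer) {
  std::uint64_t total = 0;
  for (std::size_t offset = 0; offset + 8 <= buffer.size(); offset += 8) {
    total += Load64(buffer, offset);
  }
  return total;
}
```

Because the loaded value follows the host byte order, a hash built this way produces different results on big-endian and little-endian machines unless the words are byte-swapped first.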

Security and Collision Analysis

CityHash is explicitly non-cryptographic and should not be used for security-sensitive purposes such as authentication tokens or digital signatures. It is also a poor choice wherever an attacker controls the keys fed to a hash table. SipHash, by contrast, is a keyed function designed to resist hash-flooding attacks, a class of denial-of-service attack that affected the hash tables of languages such as PHP, Python, and Ruby and the web frameworks built on them before mitigations were adopted. For typical, non-adversarial datasets CityHash has low accidental collision rates, but the authors of SipHash showed that colliding inputs for CityHash64 can be constructed regardless of the seed, so seeding does not protect against crafted inputs. Systems exposed to untrusted input therefore generally prefer SipHash or another keyed hash for hash tables, and cryptographic hashes such as those provided by OpenSSL and GnuPG where integrity guarantees are needed.
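The effect of crafted collisions on a hash table can be illustrated with a deliberately weak toy hash. The sketch below does not attack CityHash; ToyHash (a sum of bytes) and LongestChain are hypothetical names chosen so that the collisions are easy to see: any two anagrams hash to the same value and land in the same bucket.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Deliberately weak toy hash: the sum of the key's bytes.
// This is NOT CityHash; it only makes collisions easy to construct.
inline std::uint64_t ToyHash(const std::string& key) {
  std::uint64_t sum = 0;
  for (unsigned char c : key) sum += c;
  return sum;
}

// Place every key in a bucket and return the size of the fullest bucket.
// A long chain means lookups in that bucket degrade toward linear time.
inline std::size_t LongestChain(const std::vector<std::string>& keys,
                                std::size_t num_buckets) {
  assert(num_buckets > 0);
  std::vector<std::size_t> counts(num_buckets, 0);
  for (const std::string& key : keys) {
    ++counts[ToyHash(key) % num_buckets];
  }
  return *std::max_element(counts.begin(), counts.end());
}
```

With the six permutations of "abc" as keys, every key falls into one bucket no matter how many buckets exist, whereas the keys "a" through "f" spread across six buckets. A keyed hash such as SipHash is meant to stop an attacker from precomputing such a colliding set, because the bucket assignment depends on a secret key.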

Uses and Applications

CityHash is used for hash tables, sharding keys, lightweight checksums, and data deduplication, where a fast non-cryptographic hash is sufficient. The same needs arise in storage engines such as LevelDB and RocksDB and in distributed data systems such as Hadoop, Cassandra, and MongoDB, although these projects do not necessarily use CityHash itself; many ship their own or other hash functions. Distributed systems such as Kubernetes, Istio, and Envoy likewise rely on fast hash functions for load balancing and partitioning. Security-focused software such as OpenSSH and Signal relies on cryptographic primitives for authentication instead.
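Using a hash for sharding keys usually amounts to reducing the hash value modulo the number of shards. The sketch below shows that idea with 64-bit FNV-1a as a self-contained stand-in, since the CityHash source is not reproduced here; in practice a call to a CityHash function would take its place. The names Fnv1a64 and ShardFor are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// 64-bit FNV-1a, used here only as a stand-in for a fast
// non-cryptographic hash such as CityHash64.
inline std::uint64_t Fnv1a64(const std::string& data) {
  std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
  for (unsigned char byte : data) {
    hash ^= byte;
    hash *= 1099511628211ULL;  // FNV prime
  }
  return hash;
}

// Map a key to one of `num_shards` shards. The same key always maps to
// the same shard as long as the hash function and shard count are fixed.
inline std::size_t ShardFor(const std::string& key, std::size_t num_shards) {
  assert(num_shards > 0);
  return static_cast<std::size_t>(Fnv1a64(key) % num_shards);
}
```

Plain modulo sharding is simple but remaps most keys whenever the shard count changes; systems that resize often tend to use consistent hashing or a similar scheme on top of the hash value.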

History and Development

CityHash was released by Google in 2011 as a fast, general-purpose hashing family. Its authors acknowledged the influence of MurmurHash, which was created by Austin Appleby rather than at Facebook. The code was first published on Google Code and is now hosted on GitHub. Google later released FarmHash, a successor to CityHash, and HighwayHash, a keyed hash that aims for stronger resistance to crafted inputs. CityHash remains part of the lineage of fast non-cryptographic hashes alongside MurmurHash3 and xxHash, while SipHash fills the related role of a fast keyed hash.

Category:Hash functions