Mar 2026 · 8 min read

Rewriting xxHash in Rust

A clean-room Rust reimplementation of xxHash: bit-exact parity across all four variants, NEON-accelerated XXH3, and comparable CLI-level throughput to the C reference on Apple Silicon.

benchmarks · rust · rewrite study · hashing

Rewrite study · First benchmark pass

Project:  xxHash
Baseline: xxhsum 0.8.3 (C reference)
Rewrite:  xxhash-rs 0.1.0
Language: Rust
CLI-level throughput across four scenarios on Apple Silicon. The Rust implementation matches or exceeds the C reference on XXH64 at 16 MiB and trails by about 8% on XXH3_128 at 1 MiB.

Parity:         508/508 tests
XXH64 16 MiB:   comparable to C
XXH3_128 1 MiB: ~8% behind C
Scenarios:      4 of 8 declared
Samples:        smoke-level (2 per tool)

  • Bit-exact output across XXH32, XXH64, XXH3_64, and XXH3_128 for all tested input lengths, seeds, and streaming patterns.
  • NEON-optimized XXH3 long-input paths on Apple Silicon stay bit-exact with the scalar reference.
  • On XXH64 at 16 MiB, the Rust and C implementations are comparable (~3,972 vs ~3,694 MB/s cross-run median CLI-level throughput).
  • On XXH3_128 at 1 MiB, the C reference leads by about 8% (~448 vs ~414 MB/s).
  • At 4 KiB payloads, all comparators converge to the same throughput floor (~2 MB/s), dominated entirely by process startup overhead.

I reimplemented the xxHash family of hash functions in Rust from the published specification, covering all four variants (XXH32, XXH64, XXH3_64, XXH3_128) plus a CLI tool with behavioral parity against the upstream xxhsum. Then I benchmarked the result against the C reference and two contrast comparators.

The short version: the Rust implementation produces bit-exact output for every variant and passes 508 parity tests. On CLI-level throughput, it matches or exceeds the C reference on XXH64 at 16 MiB and trails by about 8% on XXH3_128 at 1 MiB. At small payloads, process startup dominates and all tools converge to the same floor.
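To ground the discussion, here is a minimal sketch of the XXH64 short-input path (inputs under 32 bytes), written from the published specification. It is illustrative, not the repository's code; the long-input path with four striped accumulators is omitted.

```rust
// XXH64 for inputs shorter than 32 bytes, from the published specification.
// The five 64-bit primes are the constants the spec defines.
const PRIME64_1: u64 = 0x9E3779B185EBCA87;
const PRIME64_2: u64 = 0xC2B2AE3D27D4EB4F;
const PRIME64_3: u64 = 0x165667B19E3779F9;
const PRIME64_4: u64 = 0x85EBCA77C2B2AE63;
const PRIME64_5: u64 = 0x27D4EB2F165667C5;

fn read_u64_le(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_le_bytes(a)
}

fn read_u32_le(b: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(&b[..4]);
    u32::from_le_bytes(a)
}

fn xxh64_short(input: &[u8], seed: u64) -> u64 {
    assert!(input.len() < 32, "long-input path (four accumulators) not shown");
    let mut h = seed
        .wrapping_add(PRIME64_5)
        .wrapping_add(input.len() as u64);
    let mut p = input;
    // Consume 8-byte lanes.
    while p.len() >= 8 {
        let lane = read_u64_le(p);
        h ^= lane
            .wrapping_mul(PRIME64_2)
            .rotate_left(31)
            .wrapping_mul(PRIME64_1);
        h = h.rotate_left(27).wrapping_mul(PRIME64_1).wrapping_add(PRIME64_4);
        p = &p[8..];
    }
    // One 4-byte lane, if present.
    if p.len() >= 4 {
        h ^= (read_u32_le(p) as u64).wrapping_mul(PRIME64_1);
        h = h.rotate_left(23).wrapping_mul(PRIME64_2).wrapping_add(PRIME64_3);
        p = &p[4..];
    }
    // Trailing bytes, one at a time.
    for &b in p {
        h ^= (b as u64).wrapping_mul(PRIME64_5);
        h = h.rotate_left(11).wrapping_mul(PRIME64_1);
    }
    // Final avalanche.
    h ^= h >> 33;
    h = h.wrapping_mul(PRIME64_2);
    h ^= h >> 29;
    h = h.wrapping_mul(PRIME64_3);
    h ^ (h >> 32)
}
```

The empty-input digest with seed 0 is the spec's well-known test vector 0xEF46DB3751D8E999; vectors like this, across lengths and seeds, are what the 508-point parity suite checks at scale.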

[Figure: horizontal bar chart] CLI-level cross-run median throughput on XXH64 at 16 MiB: rust_xxhash_rs ~3,972 MB/s, c_xxhsum ~3,694 MB/s, b3sum ~3,965 MB/s, md5 ~532 MB/s. The Rust implementation, C reference, and BLAKE3 are all in the same range (~3.7–4.0 GB/s), while MD5 trails.

Correctness came first

Before touching benchmarks, I validated that the Rust hash core produces the same output as the C reference across every variant and edge case.

The parity suite covers 508 individual test points:

All 508 tests pass at the measured revision. (evidence: parity_summary.json)

[Figure: parity coverage grid] Parity coverage across all four hash variants, streaming chunk and digest-state parity, NEON-optimized paths on Apple Silicon, and the full CLI surface (algorithm selection, output formats, input flows, check modes). Every category passed at the measured revision.

SIMD parity on Apple Silicon

On AArch64, the release build exercises NEON-optimized XXH3 long-input paths. These produce bit-exact output matching the scalar fallback for both XXH3_64 and XXH3_128, covering streaming variants, derived-secret paths, and both seeded and seed-0 inputs.
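The parity pattern behind those tests is easy to sketch. The accumulator step below is a simplified stand-in, not the real XXH3 round, and every name here is illustrative; what it shows is the cfg-gated dispatch and the scalar-vs-SIMD bit-exactness check the suite relies on.

```rust
// Scalar reference for a (simplified, illustrative) accumulator step.
fn accumulate_scalar(acc: &mut [u64; 2], lanes: &[u64; 2]) {
    for i in 0..2 {
        acc[i] ^= lanes[i];
        acc[i] = acc[i].wrapping_add(lanes[i].rotate_left(31));
    }
}

// NEON version of the same step, compiled only on AArch64.
#[cfg(target_arch = "aarch64")]
fn accumulate_neon(acc: &mut [u64; 2], lanes: &[u64; 2]) {
    use core::arch::aarch64::*;
    unsafe {
        let a = vld1q_u64(acc.as_ptr());
        let l = vld1q_u64(lanes.as_ptr());
        let x = veorq_u64(a, l); // acc ^= lanes
        // rotl(lanes, 31) expressed as two shifts and an or
        let rot = vorrq_u64(vshlq_n_u64::<31>(l), vshrq_n_u64::<33>(l));
        vst1q_u64(acc.as_mut_ptr(), vaddq_u64(x, rot));
    }
}

// Dispatch: NEON on AArch64, scalar fallback elsewhere.
fn accumulate(acc: &mut [u64; 2], lanes: &[u64; 2]) {
    #[cfg(target_arch = "aarch64")]
    return accumulate_neon(acc, lanes);
    #[cfg(not(target_arch = "aarch64"))]
    accumulate_scalar(acc, lanes);
}

// Parity check: the dispatched path must be bit-exact with the scalar path.
fn simd_parity_holds() -> bool {
    let lanes = [0x0123_4567_89AB_CDEFu64, 0xFEDC_BA98_7654_3210];
    let (mut a, mut b) = ([1u64, 2], [1u64, 2]);
    accumulate(&mut a, &lanes);
    accumulate_scalar(&mut b, &lanes);
    a == b
}
```

On non-ARM hosts the dispatch collapses to the scalar path, so the check is only meaningful on AArch64; the real suite runs it over many lengths, seeds, and streaming patterns.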

CLI behavioral parity

The CLI tool achieves behavioral parity with the upstream xxhsum for the validated surface, which includes algorithm selection, seed support, file and stdin hashing, GNU and BSD output formats, little-endian output, escaped-filename handling, file-list input, and the full check-mode policy stack (--quiet, --status, --warn, --strict, --ignore-missing).

Parity is validated through direct output comparison: 31 algorithm-selection tests, 69 output-format tests, 53 input-flow tests, and 355 check-mode tests. (evidence: parity_summary.json)
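The oracle pattern for those behavioral tests can be sketched with std::process::Command. `outputs_match` is a hypothetical helper in the spirit of the harness, not the repository's actual code; in the real suite the two commands would be the upstream xxhsum and the Rust CLI invoked with the same flags.

```rust
use std::process::Command;

// Behavioral-parity oracle: run two commands and require the same exit
// outcome and byte-identical stdout. Illustrative helper, not the real harness.
fn outputs_match(cmd_a: (&str, &[&str]), cmd_b: (&str, &[&str])) -> bool {
    let run = |(prog, args): (&str, &[&str])| {
        Command::new(prog)
            .args(args)
            .output()
            .expect("failed to launch comparator")
    };
    let a = run(cmd_a);
    let b = run(cmd_b);
    a.status.success() == b.status.success() && a.stdout == b.stdout
}
```

Any pair of commands works for demonstration; the parity counts above come from running this kind of comparison across the full flag matrix.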

Benchmark methodology

The benchmarks measure end-to-end CLI throughput: each comparator is invoked as an external process that reads a payload file and produces a digest on stdout. This captures the full cost of process startup, I/O, and hashing rather than isolating the hash function in a microbenchmark loop.

Comparators

ID               Role       Version
c_xxhsum         Reference  xxhsum 0.8.3 (Yann Collet)
rust_xxhash_rs   Subject    xxhash-rs 0.1.0
b3sum            Contrast   b3sum 1.8.3
md5              Contrast   macOS system /sbin/md5

c_xxhsum and rust_xxhash_rs are parity oracles: the harness verifies that they produce the same digest before accepting timing samples. b3sum and md5 provide throughput context from different hash families.

Scenarios

Scenario      Algorithm   Payload
xxh64-4k      XXH64       4 KiB
xxh64-1m      XXH64       1 MiB
xxh64-16m     XXH64       16 MiB
xxh3-128-1m   XXH3_128    1 MiB

Each scenario uses warmup iterations (discarded) followed by measured iterations. The summary statistic is the median of measured samples. A hard correctness gate ensures c_xxhsum and rust_xxhash_rs agree on the output digest before timing results are accepted. (evidence: benchmark_summary.json)
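The sampling policy can be sketched in a few lines. This is a schematic of the warmup-then-measure loop with hypothetical names, not the harness itself; `run_once` stands in for one CLI invocation of a comparator.

```rust
use std::time::Instant;

// Schematic sampling loop: `warmup` discarded iterations, then `measured`
// timed iterations, summarized by the median. Illustrative, not the harness.
fn sample_median(mut run_once: impl FnMut(), warmup: usize, measured: usize) -> f64 {
    for _ in 0..warmup {
        run_once(); // warmup iterations are discarded
    }
    let mut secs: Vec<f64> = (0..measured)
        .map(|_| {
            let t = Instant::now();
            run_once();
            t.elapsed().as_secs_f64()
        })
        .collect();
    median(&mut secs)
}

// Median of a sample vector; average of the two middle values for even n.
fn median(xs: &mut [f64]) -> f64 {
    xs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = xs.len();
    if n % 2 == 1 {
        xs[n / 2]
    } else {
        (xs[n / 2 - 1] + xs[n / 2]) / 2.0
    }
}
```

The median is a deliberate choice here: with only two measured samples per comparator it is simply their mean, and it stays robust to a single outlier as sample counts grow.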

Results

XXH64 at 16 MiB

At this payload size, process startup is a small fraction of total time, and the numbers primarily reflect hash throughput.

Comparator       Median throughput
c_xxhsum         ~3,694 MB/s
rust_xxhash_rs   ~3,972 MB/s
b3sum            ~3,965 MB/s
md5              ~532 MB/s

The Rust implementation, C reference, and BLAKE3 all land in the same range (~3.7–4.0 GB/s) on XXH64 at 16 MiB, while MD5 trails at ~532 MB/s. The Rust and C xxHash numbers are close enough that run-to-run variance could change their relative order.

XXH3_128 at 1 MiB

Comparator       Median throughput
c_xxhsum         ~448 MB/s
rust_xxhash_rs   ~414 MB/s
b3sum            ~333 MB/s
md5              ~272 MB/s

For XXH3_128 at 1 MiB, the C reference leads the Rust implementation by about 8% (~448 vs ~414 MB/s). Both C and Rust NEON-optimized XXH3 paths are exercised on this Apple Silicon host.

XXH64 at 1 MiB

Comparator       Median throughput
c_xxhsum         ~565 MB/s
rust_xxhash_rs   ~472 MB/s
b3sum            ~424 MB/s
md5              ~306 MB/s

At 1 MiB, process startup is a larger fraction of measured time. The C reference leads the Rust implementation by about 16% (~565 vs ~472 MB/s), though some of that gap reflects startup and I/O variance rather than pure hash throughput differences.

XXH64 at 4 KiB

Comparator       Median throughput
c_xxhsum         ~2.2 MB/s
rust_xxhash_rs   ~2.0 MB/s
b3sum            ~1.7 MB/s
md5              ~2.4 MB/s

At 4 KiB, process startup overwhelms the hash computation. All comparators converge to a similar throughput floor (~2 MB/s). These numbers say nothing about hash performance and are included only to illustrate the startup-dominated regime.

[Figure: four-panel chart] CLI-level cross-run median throughput across all four measured scenarios (XXH64 at 16 MiB, XXH3_128 at 1 MiB, XXH64 at 1 MiB, XXH64 at 4 KiB), comparing c_xxhsum, rust_xxhash_rs, b3sum, and md5. At 16 MiB the three fast tools are comparable, the C reference leads by ~8% on XXH3_128 at 1 MiB, and at 4 KiB startup dominates.

Interpretation

The CLI-level benchmarks show that xxhash-rs delivers throughput in the same range as the C reference across the measured scenarios. On the largest payload (XXH64 at 16 MiB), the two are comparable. On XXH3_128 at 1 MiB and XXH64 at 1 MiB, the C reference leads by 8–16%, though process startup, file I/O, and output formatting contribute fixed overhead that compresses the apparent gap at smaller payloads.
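The compression effect is easy to quantify with a simple model: apparent throughput is payload / (startup + payload / raw_rate). The numbers below are assumed for illustration only (a ~2 ms fixed overhead and two hash cores 20% apart in raw throughput), not measured values from this study.

```rust
// Apparent CLI-level throughput in MB/s, given a fixed per-invocation startup
// cost and the hash core's raw throughput. Illustrative model only.
fn apparent_mb_s(payload_mb: f64, raw_mb_s: f64, startup_s: f64) -> f64 {
    payload_mb / (startup_s + payload_mb / raw_mb_s)
}
```

With a 2 ms startup cost, a 4 KiB payload collapses to ~2 MB/s regardless of the hash core, a 20% raw gap shrinks to roughly 11% at 16 MiB, and only as payloads grow further does the apparent ratio approach the true one.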

Applications that embed the hash library directly would see higher throughput from both implementations, with the fixed startup cost removed.

Limitations

  1. Single-platform benchmarks. All measurements were taken on a single Apple Silicon host (arm64, macOS). Performance on x86_64 may differ, particularly for SIMD-accelerated XXH3 paths where SSE2/AVX2 code paths have not been benchmarked.

  2. CLI-level measurement. Process startup overhead dominates at small payload sizes and partially masks hash throughput differences at medium sizes.

  3. Smoke-level sample counts. The pinned runs use 2 measured iterations per comparator per scenario. A production-grade study would use higher sample counts for tighter confidence intervals.

  4. Subset of declared scenarios. The evidence pack covers 4 of the 8 declared benchmark scenarios. The remaining scenarios are declared in the manifest but not included in the pinned runs.

  5. Validated CLI surface only. Features outside the validated surface (for example, the upstream --benchmark mode) are not implemented or tested.

  6. No production deployment evidence. The parity and benchmark evidence demonstrates correctness and baseline performance, not production readiness.

[Figure: limitation cards] The results are grounded but bounded: smoke-level samples (2 measured iterations), 4 of 8 declared scenarios covered, a single Apple Silicon platform, and CLI-level measurement that compresses the hash throughput signal.

Licensing and clean-room boundary

This is a clean-room reimplementation. The hash algorithms were implemented from the published xxHash specification and the BSD-2-Clause-licensed reference library material. The CLI achieves behavioral compatibility through black-box observation of the upstream xxhsum tool, without translating or copying any GPL-licensed source code.

The upstream project has two license regimes: BSD-2-Clause for the xxHash library and specification (freely usable, informed the Rust hash core), and GPLv2 for the xxhsum CLI tool (treated as an external behavioral oracle only). No xxhsum source files, help text, error messages, or implementation logic were incorporated into this repository.

xxHash was created by Yann Collet. The Rust reimplementation is released under the MIT OR Apache-2.0 dual license.

Reproducibility

The measured revision for all evidence is evidence-v1.

git clone https://github.com/sagaragas/xxhash-rs.git
cd xxhash-rs
git checkout evidence-v1

cargo build --workspace --release
cargo test --workspace --all-targets -- --test-threads=3
python3 publication/claim_map.py --verify
python3 publication/traceability_check.py

The evidence pack is committed under publication/evidence/ and includes parity test results, benchmark summaries with correctness gate outcomes, raw timing samples for three pinned claim-ready runs, and a claim-to-evidence map.

Repo and evidence