Mar 2026 · 8 min read

Rewriting xxHash in Rust

A clean-room Rust reimplementation of xxHash: bit-exact parity across all four variants, NEON-accelerated XXH3, and comparable CLI-level throughput to the C reference on Apple Silicon.

benchmarks · rust · rewrite study · hashing

Rewrite study · First benchmark pass

Project:  xxHash
Baseline: xxhsum 0.8.3 (C reference)
Rewrite:  xxhash-rs 0.1.0
Language: Rust
CLI-level throughput across four scenarios on Apple Silicon. The Rust implementation matches or exceeds the C reference on XXH64 at 16 MiB and trails by about 8% on XXH3_128 at 1 MiB.

Parity:         508/508 tests
XXH64 16 MiB:   comparable to C
XXH3_128 1 MiB: ~8% behind C
Scenarios:      4 of 8 declared
Samples:        smoke-level (2 per tool)

  • Bit-exact output across XXH32, XXH64, XXH3_64, and XXH3_128 for all tested input lengths, seeds, and streaming patterns.
  • NEON-optimized XXH3 long-input paths on Apple Silicon stay bit-exact with the scalar reference.
  • On XXH64 at 16 MiB, the Rust and C implementations are comparable (~3,972 vs ~3,694 MB/s cross-run median CLI-level throughput).
  • On XXH3_128 at 1 MiB, the C reference leads by about 8% (~448 vs ~414 MB/s).
  • At 4 KiB payloads, all comparators converge to the same throughput floor (~2 MB/s), dominated entirely by process startup overhead.

I reimplemented the xxHash family of hash functions in Rust from the published specification, covering all four variants (XXH32, XXH64, XXH3_64, XXH3_128) plus a CLI tool with behavioral parity against the upstream xxhsum. Then I benchmarked the result against the C reference and two contrast comparators.

The short version: the Rust implementation produces bit-exact output for every variant and passes 508 parity tests. On CLI-level throughput, it matches or exceeds the C reference on XXH64 at 16 MiB and trails by about 8% on XXH3_128 at 1 MiB. At small payloads, process startup dominates and all tools converge to the same floor.
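To ground the discussion, here is a minimal sketch of the XXH64 short-input path (inputs under 32 bytes), written from the published specification. It is illustrative, not the repository's code; the long-input path with four striped accumulators is omitted.

```rust
// XXH64 for inputs shorter than 32 bytes, from the published specification.
// The five 64-bit primes are the constants the spec defines.
const PRIME64_1: u64 = 0x9E3779B185EBCA87;
const PRIME64_2: u64 = 0xC2B2AE3D27D4EB4F;
const PRIME64_3: u64 = 0x165667B19E3779F9;
const PRIME64_4: u64 = 0x85EBCA77C2B2AE63;
const PRIME64_5: u64 = 0x27D4EB2F165667C5;

fn read_u64_le(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_le_bytes(a)
}

fn read_u32_le(b: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(&b[..4]);
    u32::from_le_bytes(a)
}

fn xxh64_short(input: &[u8], seed: u64) -> u64 {
    assert!(input.len() < 32, "long-input path (four accumulators) not shown");
    let mut h = seed
        .wrapping_add(PRIME64_5)
        .wrapping_add(input.len() as u64);
    let mut p = input;
    // Consume 8-byte lanes.
    while p.len() >= 8 {
        let lane = read_u64_le(p);
        h ^= lane
            .wrapping_mul(PRIME64_2)
            .rotate_left(31)
            .wrapping_mul(PRIME64_1);
        h = h.rotate_left(27).wrapping_mul(PRIME64_1).wrapping_add(PRIME64_4);
        p = &p[8..];
    }
    // One 4-byte lane, if present.
    if p.len() >= 4 {
        h ^= (read_u32_le(p) as u64).wrapping_mul(PRIME64_1);
        h = h.rotate_left(23).wrapping_mul(PRIME64_2).wrapping_add(PRIME64_3);
        p = &p[4..];
    }
    // Trailing bytes, one at a time.
    for &b in p {
        h ^= (b as u64).wrapping_mul(PRIME64_5);
        h = h.rotate_left(11).wrapping_mul(PRIME64_1);
    }
    // Final avalanche.
    h ^= h >> 33;
    h = h.wrapping_mul(PRIME64_2);
    h ^= h >> 29;
    h = h.wrapping_mul(PRIME64_3);
    h ^ (h >> 32)
}
```

The empty-input digest with seed 0 is the spec's well-known test vector 0xEF46DB3751D8E999; vectors like this, across lengths and seeds, are what the 508-point parity suite checks at scale.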

[Figure: horizontal bar chart] CLI-level cross-run median throughput on XXH64 at 16 MiB: rust_xxhash_rs ~3,972 MB/s, c_xxhsum ~3,694 MB/s, b3sum ~3,965 MB/s, md5 ~532 MB/s. The Rust implementation, C reference, and BLAKE3 are all in the same range (~3.7–4.0 GB/s), while MD5 trails.

Correctness came first

Before touching benchmarks, I validated that the Rust hash core produces the same output as the C reference across every variant and edge case.

The parity suite covers 508 individual test points:

All 508 tests pass at the measured revision. (evidence: parity_summary.json)

[Figure: parity coverage grid] Parity coverage across all four hash variants, streaming chunk and digest-state parity, NEON-optimized paths on Apple Silicon, and the full CLI surface (algorithm selection, output formats, input flows, check modes). Every category passed at the measured revision.

SIMD parity on Apple Silicon

On AArch64, the release build exercises NEON-optimized XXH3 long-input paths. These produce bit-exact output matching the scalar fallback for both XXH3_64 and XXH3_128, covering streaming variants, derived-secret paths, and both seeded and seed-0 inputs.
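The parity pattern behind those tests is easy to sketch. The accumulator step below is a simplified stand-in, not the real XXH3 round, and every name here is illustrative; what it shows is the cfg-gated dispatch and the scalar-vs-SIMD bit-exactness check the suite relies on.

```rust
// Scalar reference for a (simplified, illustrative) accumulator step.
fn accumulate_scalar(acc: &mut [u64; 2], lanes: &[u64; 2]) {
    for i in 0..2 {
        acc[i] ^= lanes[i];
        acc[i] = acc[i].wrapping_add(lanes[i].rotate_left(31));
    }
}

// NEON version of the same step, compiled only on AArch64.
#[cfg(target_arch = "aarch64")]
fn accumulate_neon(acc: &mut [u64; 2], lanes: &[u64; 2]) {
    use core::arch::aarch64::*;
    unsafe {
        let a = vld1q_u64(acc.as_ptr());
        let l = vld1q_u64(lanes.as_ptr());
        let x = veorq_u64(a, l); // acc ^= lanes
        // rotl(lanes, 31) expressed as two shifts and an or
        let rot = vorrq_u64(vshlq_n_u64::<31>(l), vshrq_n_u64::<33>(l));
        vst1q_u64(acc.as_mut_ptr(), vaddq_u64(x, rot));
    }
}

// Dispatch: NEON on AArch64, scalar fallback elsewhere.
fn accumulate(acc: &mut [u64; 2], lanes: &[u64; 2]) {
    #[cfg(target_arch = "aarch64")]
    return accumulate_neon(acc, lanes);
    #[cfg(not(target_arch = "aarch64"))]
    accumulate_scalar(acc, lanes);
}

// Parity check: the dispatched path must be bit-exact with the scalar path.
fn simd_parity_holds() -> bool {
    let lanes = [0x0123_4567_89AB_CDEFu64, 0xFEDC_BA98_7654_3210];
    let (mut a, mut b) = ([1u64, 2], [1u64, 2]);
    accumulate(&mut a, &lanes);
    accumulate_scalar(&mut b, &lanes);
    a == b
}
```

On non-ARM hosts the dispatch collapses to the scalar path, so the check is only meaningful on AArch64; the real suite runs it over many lengths, seeds, and streaming patterns.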

CLI behavioral parity

The CLI tool achieves behavioral parity with the upstream xxhsum for the validated surface, which includes algorithm selection, seed support, file and stdin hashing, GNU and BSD output formats, little-endian output, escaped-filename handling, file-list input, and the full check-mode policy stack (--quiet, --status, --warn, --strict, --ignore-missing).

Parity is validated through direct output comparison: 31 algorithm-selection tests, 69 output-format tests, 53 input-flow tests, and 355 check-mode tests. (evidence: parity_summary.json)
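The oracle pattern for those behavioral tests can be sketched with std::process::Command. `outputs_match` is a hypothetical helper in the spirit of the harness, not the repository's actual code; in the real suite the two commands would be the upstream xxhsum and the Rust CLI invoked with the same flags.

```rust
use std::process::Command;

// Behavioral-parity oracle: run two commands and require the same exit
// outcome and byte-identical stdout. Illustrative helper, not the real harness.
fn outputs_match(cmd_a: (&str, &[&str]), cmd_b: (&str, &[&str])) -> bool {
    let run = |(prog, args): (&str, &[&str])| {
        Command::new(prog)
            .args(args)
            .output()
            .expect("failed to launch comparator")
    };
    let a = run(cmd_a);
    let b = run(cmd_b);
    a.status.success() == b.status.success() && a.stdout == b.stdout
}
```

Any pair of commands works for demonstration; the parity counts above come from running this kind of comparison across the full flag matrix.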

Benchmark methodology

The benchmarks measure end-to-end CLI throughput: each comparator is invoked as an external process that reads a payload file and produces a digest on stdout. This captures the full cost of process startup, I/O, and hashing rather than isolating the hash function in a microbenchmark loop.

Comparators

ID               Role       Version
c_xxhsum         Reference  xxhsum 0.8.3 (Yann Collet)
rust_xxhash_rs   Subject    xxhash-rs 0.1.0
b3sum            Contrast   b3sum 1.8.3
md5              Contrast   macOS system /sbin/md5

c_xxhsum and rust_xxhash_rs are parity oracles: the harness verifies that they produce the same digest before accepting timing samples. b3sum and md5 provide throughput context from different hash families.

Scenarios

Scenario      Algorithm   Payload
xxh64-4k      XXH64       4 KiB
xxh64-1m      XXH64       1 MiB
xxh64-16m     XXH64       16 MiB
xxh3-128-1m   XXH3_128    1 MiB

Each scenario uses warmup iterations (discarded) followed by measured iterations. The summary statistic is the median of measured samples. A hard correctness gate ensures c_xxhsum and rust_xxhash_rs agree on the output digest before timing results are accepted. (evidence: benchmark_summary.json)
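The sampling policy can be sketched in a few lines. This is a schematic of the warmup-then-measure loop with hypothetical names, not the harness itself; `run_once` stands in for one CLI invocation of a comparator.

```rust
use std::time::Instant;

// Schematic sampling loop: `warmup` discarded iterations, then `measured`
// timed iterations, summarized by the median. Illustrative, not the harness.
fn sample_median(mut run_once: impl FnMut(), warmup: usize, measured: usize) -> f64 {
    for _ in 0..warmup {
        run_once(); // warmup iterations are discarded
    }
    let mut secs: Vec<f64> = (0..measured)
        .map(|_| {
            let t = Instant::now();
            run_once();
            t.elapsed().as_secs_f64()
        })
        .collect();
    median(&mut secs)
}

// Median of a sample vector; average of the two middle values for even n.
fn median(xs: &mut [f64]) -> f64 {
    xs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = xs.len();
    if n % 2 == 1 {
        xs[n / 2]
    } else {
        (xs[n / 2 - 1] + xs[n / 2]) / 2.0
    }
}
```

The median is a deliberate choice here: with only two measured samples per comparator it is simply their mean, and it stays robust to a single outlier as sample counts grow.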

Results

XXH64 at 16 MiB

At this payload size, process startup is a small fraction of total time, and the numbers primarily reflect hash throughput.

Comparator       Median throughput
c_xxhsum         ~3,694 MB/s
rust_xxhash_rs   ~3,972 MB/s
b3sum            ~3,965 MB/s
md5              ~532 MB/s

The Rust implementation, C reference, and BLAKE3 all land in the same range (~3.7–4.0 GB/s) on XXH64 at 16 MiB, while MD5 trails at ~532 MB/s. The Rust and C xxHash numbers are close enough that run-to-run variance could change their relative order.

XXH3_128 at 1 MiB

Comparator       Median throughput
c_xxhsum         ~448 MB/s
rust_xxhash_rs   ~414 MB/s
b3sum            ~333 MB/s
md5              ~272 MB/s

For XXH3_128 at 1 MiB, the C reference leads the Rust implementation by about 8% (~448 vs ~414 MB/s). Both C and Rust NEON-optimized XXH3 paths are exercised on this Apple Silicon host.

XXH64 at 1 MiB

Comparator       Median throughput
c_xxhsum         ~565 MB/s
rust_xxhash_rs   ~472 MB/s
b3sum            ~424 MB/s
md5              ~306 MB/s

At 1 MiB, process startup is a larger fraction of measured time. The C reference leads the Rust implementation by about 16% (~565 vs ~472 MB/s), though some of that gap reflects startup and I/O variance rather than pure hash throughput differences.

XXH64 at 4 KiB

Comparator       Median throughput
c_xxhsum         ~2.2 MB/s
rust_xxhash_rs   ~2.0 MB/s
b3sum            ~1.7 MB/s
md5              ~2.4 MB/s

At 4 KiB, process startup overwhelms the hash computation. All comparators converge to a similar throughput floor (~2 MB/s). These numbers say nothing about hash performance and are included only to illustrate the startup-dominated regime.

[Figure: four-panel chart] CLI-level cross-run median throughput across all four measured scenarios (XXH64 at 16 MiB, XXH3_128 at 1 MiB, XXH64 at 1 MiB, XXH64 at 4 KiB), comparing c_xxhsum, rust_xxhash_rs, b3sum, and md5. At 16 MiB the three fast tools are comparable, the C reference leads by ~8% on XXH3_128 at 1 MiB, and at 4 KiB startup dominates.

Interpretation

The CLI-level benchmarks show that xxhash-rs delivers throughput in the same range as the C reference across the measured scenarios. On the largest payload (XXH64 at 16 MiB), the two are comparable. On XXH3_128 at 1 MiB and XXH64 at 1 MiB, the C reference leads by 8–16%, though process startup, file I/O, and output formatting contribute fixed overhead that compresses the apparent gap at smaller payloads.
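The compression effect is easy to quantify with a simple model: apparent throughput is payload / (startup + payload / raw_rate). The numbers below are assumed for illustration only (a ~2 ms fixed overhead and two hash cores 20% apart in raw throughput), not measured values from this study.

```rust
// Apparent CLI-level throughput in MB/s, given a fixed per-invocation startup
// cost and the hash core's raw throughput. Illustrative model only.
fn apparent_mb_s(payload_mb: f64, raw_mb_s: f64, startup_s: f64) -> f64 {
    payload_mb / (startup_s + payload_mb / raw_mb_s)
}
```

With a 2 ms startup cost, a 4 KiB payload collapses to ~2 MB/s regardless of the hash core, a 20% raw gap shrinks to roughly 11% at 16 MiB, and only as payloads grow further does the apparent ratio approach the true one.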

Applications that embed the hash library directly would see higher throughput from both implementations, with the fixed startup cost removed.

Limitations

  1. Single-platform benchmarks. All measurements were taken on a single Apple Silicon host (arm64, macOS). Performance on x86_64 may differ, particularly for SIMD-accelerated XXH3 paths where SSE2/AVX2 code paths have not been benchmarked.

  2. CLI-level measurement. Process startup overhead dominates at small payload sizes and partially masks hash throughput differences at medium sizes.

  3. Smoke-level sample counts. The pinned runs use 2 measured iterations per comparator per scenario. A production-grade study would use higher sample counts for tighter confidence intervals.

  4. Subset of declared scenarios. The evidence pack covers 4 of the 8 declared benchmark scenarios. The remaining scenarios are declared in the manifest but not included in the pinned runs.

  5. Validated CLI surface only. Features outside the validated surface (for example, the upstream --benchmark mode) are not implemented or tested.

  6. No production deployment evidence. The parity and benchmark evidence demonstrates correctness and baseline performance, not production readiness.

[Figure: limitation cards] The results are grounded but bounded: smoke-level samples (2 measured iterations), 4 of 8 declared scenarios covered, a single Apple Silicon platform, and CLI-level measurement that compresses the hash throughput signal.

Licensing and clean-room boundary

This is a clean-room reimplementation. The hash algorithms were implemented from the published xxHash specification and the BSD-2-Clause-licensed reference library material. The CLI achieves behavioral compatibility through black-box observation of the upstream xxhsum tool, without translating or copying any GPL-licensed source code.

The upstream project has two license regimes: BSD-2-Clause for the xxHash library and specification (freely usable, informed the Rust hash core), and GPLv2 for the xxhsum CLI tool (treated as an external behavioral oracle only). No xxhsum source files, help text, error messages, or implementation logic were incorporated into this repository.

xxHash was created by Yann Collet. The Rust reimplementation is released under the MIT OR Apache-2.0 dual license.

Reproducibility

The measured revision for all evidence is evidence-v1.

git clone https://github.com/sagaragas/xxhash-rs.git
cd xxhash-rs
git checkout evidence-v1

cargo build --workspace --release
cargo test --workspace --all-targets -- --test-threads=3
python3 publication/claim_map.py --verify
python3 publication/traceability_check.py

The evidence pack is committed under publication/evidence/ and includes parity test results, benchmark summaries with correctness gate outcomes, raw timing samples for three pinned claim-ready runs, and a claim-to-evidence map.

Repo and evidence