Purpose
rand_chacha lacked support for zeroize and did not include NEON SIMD optimizations. While exploring ways to contribute, I began by adding zeroize support to its dependency, ppv-lite86, which had an open issue requesting this feature. During this work, I also noticed a long-standing issue proposing that rand_chacha transition to the RustCrypto chacha20 implementation.
ppv-lite86 was significantly behind in maintenance and had an open "mono-trait refactor" issue that would require substantial restructuring. Given the scope of that work, extending the RustCrypto chacha20 implementation with RNG functionality appeared to be the more sustainable path. The issue had been open for over a year without a PR, so I began implementing the necessary features and optimizations.
Optimizations
Eliminating redundant copies
The original RNG workflow involved multiple buffer allocations and copies:
1) Receive a request to fill a buffer.
2) Initialize backend struct, allocate a new buffer, and fill it with the backend.
3) Copy data from the backend's buffer to the RNG's internal buffer.
4) Copy data from the RNG's internal buffer to the user-provided buffer.
5) Repeat steps 2-4 until the provided buffer is filled.
This design introduced unnecessary overhead. I implemented a revised fill_bytes() method that, once the internal buffer is exhausted, calculates how many 4-block chunks remain and writes them directly into the user-provided buffer via the SIMD backends. After all chunks are written, the remainder of the buffer is filled using the RNG's internal buffer without reinitializing the backend.
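The revised flow can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: SketchRng, backend_fill, and the constants are hypothetical names, and backend_fill is a toy stand-in that produces a deterministic byte pattern rather than real ChaCha20 output.

```rust
const BLOCK_BYTES: usize = 64; // one ChaCha block
const CHUNK_BYTES: usize = 4 * BLOCK_BYTES; // four blocks per backend call

/// Toy stand-in for a SIMD backend (NOT ChaCha): fills `out`, whose length
/// must be a multiple of BLOCK_BYTES, with blocks derived from a counter.
fn backend_fill(counter: &mut u32, out: &mut [u8]) {
    debug_assert!(out.len() % BLOCK_BYTES == 0);
    for block in out.chunks_exact_mut(BLOCK_BYTES) {
        for (i, byte) in block.iter_mut().enumerate() {
            *byte = counter.wrapping_mul(31).wrapping_add(i as u32) as u8;
        }
        *counter = counter.wrapping_add(1);
    }
}

/// Hypothetical RNG wrapper with a 4-block internal buffer.
struct SketchRng {
    counter: u32,
    buffer: [u8; CHUNK_BYTES],
    index: usize, // next unread byte in `buffer`; CHUNK_BYTES means empty
}

impl SketchRng {
    fn new() -> Self {
        Self { counter: 0, buffer: [0; CHUNK_BYTES], index: CHUNK_BYTES }
    }

    fn fill_bytes(&mut self, dest: &mut [u8]) {
        // 1. Drain whatever is still unread in the internal buffer.
        let take = (CHUNK_BYTES - self.index).min(dest.len());
        dest[..take].copy_from_slice(&self.buffer[self.index..self.index + take]);
        self.index += take;
        let dest = &mut dest[take..];

        // 2. Write every whole 4-block chunk straight into the caller's buffer.
        let direct = dest.len() - dest.len() % CHUNK_BYTES;
        let (chunks, tail) = dest.split_at_mut(direct);
        backend_fill(&mut self.counter, chunks);

        // 3. Refill the internal buffer once and serve the tail from it.
        if !tail.is_empty() {
            backend_fill(&mut self.counter, &mut self.buffer);
            tail.copy_from_slice(&self.buffer[..tail.len()]);
            self.index = tail.len();
        }
    }
}
```

In this sketch only the tail of a request passes through the internal buffer; every whole 4-block chunk is copied zero times, which is where the saving over the original workflow would come from.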
Performance impact:
Benchmarking with AVX2 on a Raptor Lake Refresh i9 CPU yielded these results:
9% improvement on optimized builds without -C target-cpu
9.4% improvement on builds optimized for the target CPU
The Rust‑Random maintainers ultimately decided not to adopt this change due to the added complexity and their preference for a simpler, more conservative implementation strategy.
SIMD buffer width improvements
The SSE2 backend originally used a 1‑block buffer, while other SIMD backends used 4‑block buffers. Increasing the SSE2 buffer to 4 blocks improved throughput for filling larger buffers and aligned the behavior across SIMD backends. This gives developers using the library clearer expectations when tuning buffer sizes for performance.
Flexible input formats for stream_id and block_pos
I added support for setting stream_id and block_pos from any supported 64-bit format, using a generic type implementing From<T>, where T could be:
[u8; 8]
[u32; 2]
u64
This provided two benefits:
More domain-separation options for isolating keystreams
No unnecessary conversions, especially when users start with u32 values that would otherwise have to be converted to a u64, only for the RNG to split them back into u32s internally
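A minimal sketch of how such conversions could look is shown below. StreamId, SketchState, and set_stream are hypothetical names, not the API that was merged or removed, and both the little-endian word order and the placement of the stream id in state words 14 and 15 are assumptions made for the illustration.

```rust
/// Hypothetical wrapper: a 64-bit stream id stored as two u32 state words.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct StreamId([u32; 2]);

impl From<[u32; 2]> for StreamId {
    // Already in the internal representation: no conversion needed.
    fn from(words: [u32; 2]) -> Self {
        Self(words)
    }
}

impl From<u64> for StreamId {
    // Assumption: low half first (little-endian word order).
    fn from(value: u64) -> Self {
        Self([value as u32, (value >> 32) as u32])
    }
}

impl From<[u8; 8]> for StreamId {
    // Assumption: bytes are interpreted as two little-endian u32 words.
    fn from(b: [u8; 8]) -> Self {
        Self([
            u32::from_le_bytes([b[0], b[1], b[2], b[3]]),
            u32::from_le_bytes([b[4], b[5], b[6], b[7]]),
        ])
    }
}

/// Hypothetical 16-word state, used only to show the setter.
struct SketchState {
    words: [u32; 16],
}

impl SketchState {
    /// Accepts any of the three formats through a single generic setter.
    fn set_stream<T: Into<StreamId>>(&mut self, id: T) {
        let StreamId(words) = id.into();
        // Assumption: the stream id occupies state words 14 and 15.
        self.words[14..16].copy_from_slice(&words);
    }
}
```

With this shape, a caller holding two u32 values passes them as [u32; 2] and they reach the state without the round trip through u64.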
This feature was later removed during API simplification, but the work demonstrated a viable approach for ergonomic, flexible input handling. I later reintroduced this capability in the production branch of my fork, replacing conversions with a pointer-based approach that reads 128 bits from memory so that any properly aligned 128-bit data layout can be used to set the final 128 bits of the ChaCha state.
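The pointer-based idea might look roughly like the sketch below. Tail128 and set_tail are hypothetical names and this is not the code from the fork; it also assumes the final 128 bits are state words 12 through 15 and reads them in native byte order, so the result matches ChaCha's little-endian word layout only on little-endian targets.

```rust
use core::mem::{align_of, size_of};

/// Hypothetical marker for plain-data types that are exactly 16 bytes wide
/// with no padding. It is an unsafe trait because each implementer vouches
/// for that layout.
unsafe trait Tail128: Copy {}
unsafe impl Tail128 for [u32; 4] {}
unsafe impl Tail128 for [u64; 2] {}
unsafe impl Tail128 for u128 {}

/// Overwrites the last four state words with the 128 bits stored in `value`.
fn set_tail<T: Tail128>(state: &mut [u32; 16], value: &T) {
    // Guard the layout assumptions the unsafe read relies on.
    assert_eq!(size_of::<T>(), 16);
    assert!(align_of::<T>() >= align_of::<u32>());
    // SAFETY: `value` is a valid reference to 16 initialized bytes, and the
    // assertions above guarantee the size and alignment needed for [u32; 4].
    let words = unsafe { core::ptr::read(value as *const T as *const [u32; 4]) };
    // Assumption: the final 128 bits of the state are words 12..16.
    state[12..16].copy_from_slice(&words);
}
```

A byte array such as [u8; 16] is deliberately left out of this sketch because its alignment is 1; supporting it would need an unaligned read instead.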
In-place writes with NEON
I added support for in‑place writes in the NEON backend when the destination buffer is aligned to a 16‑byte boundary. This allows the backend to operate directly on the destination buffer without allocating a separate results buffer.
Performance impact: ~0.5% improvement
Security benefit: reduces the amount of temporary buffer space that must be zeroized
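The alignment check and the two write paths can be illustrated with a portable sketch. It uses no NEON intrinsics: compute_block is a toy stand-in for the backend's block function, and write_block and WritePath are hypothetical names introduced only for this example.

```rust
const BLOCK_BYTES: usize = 64;

/// Toy stand-in for the backend's block function (NOT ChaCha).
fn compute_block(counter: u32, out: &mut [u8]) {
    for (i, byte) in out.iter_mut().enumerate() {
        *byte = counter.wrapping_mul(17).wrapping_add(i as u32) as u8;
    }
}

/// Which path a write took; returned only so the sketch can be inspected.
#[derive(Debug, PartialEq, Eq)]
enum WritePath {
    InPlace,
    ViaScratch,
}

/// Writes one block into `dest`, in place when `dest` is 16-byte aligned.
fn write_block(counter: u32, dest: &mut [u8]) -> WritePath {
    assert_eq!(dest.len(), BLOCK_BYTES);
    if dest.as_ptr() as usize % 16 == 0 {
        // Aligned: the backend can store results directly into `dest`.
        compute_block(counter, dest);
        WritePath::InPlace
    } else {
        // Unaligned: compute into a temporary buffer, then copy out.
        let mut scratch = [0u8; BLOCK_BYTES];
        compute_block(counter, &mut scratch);
        dest.copy_from_slice(&scratch);
        // A real implementation would zeroize `scratch` here.
        WritePath::ViaScratch
    }
}
```

Both paths produce the same bytes; the in-place path simply has no scratch buffer to copy from or clear afterwards.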
Testing
To ensure that all modifications preserve the exact ChaCha20 output, the implementation is validated through a series of deterministic and backend‑specific tests.
The pointer‑based fill_bytes() implementation is first tested against a reference chacha20 implementation to confirm that both produce identical keystreams for incrementally increasing buffer sizes and alignment conditions.
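A comparison loop of this kind could be sketched as below. find_mismatch is a hypothetical helper, not the project's actual test; it assumes each closure starts a fresh keystream from the same key on every call, and it varies the offset into a backing allocation to cover all sixteen alignment residues.

```rust
/// Compares two fill functions over every length up to `max_len` and every
/// start offset from 0 to 15. Returns the first (len, offset) pair at which
/// the outputs differ, or None if they always agree.
fn find_mismatch(
    max_len: usize,
    fill_under_test: impl Fn(&mut [u8]),
    fill_reference: impl Fn(&mut [u8]),
) -> Option<(usize, usize)> {
    // Extra 16 bytes so every offset leaves room for `max_len` bytes.
    let mut backing = vec![0u8; max_len + 16];
    let mut expected = vec![0u8; max_len];
    for len in 0..=max_len {
        for offset in 0..16 {
            fill_under_test(&mut backing[offset..offset + len]);
            fill_reference(&mut expected[..len]);
            if backing[offset..offset + len] != expected[..len] {
                return Some((len, offset));
            }
        }
    }
    None
}
```

Reporting the first failing length and offset narrows a failure down to a specific chunk boundary or alignment case.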
A diagnostic test was later added to verify correctness across SIMD backends. Each backend computes multiple ChaCha20 blocks in parallel, and the test compares every block against the expected output. The diagnostic reports:
the number of incorrect words in the block
the index of the first incorrect word
whether errors are localized or distributed
This information makes it possible to identify the source of an error precisely. For example:
errors in the entire block for parallel blocks 2-4 indicate a 32-bit counter addition early in the rounds function for that block
1-2 incorrect words starting at the 12th or 13th word suggest that the counter was added using 32-bit arithmetic at the end of the rounds function
widespread errors in the first parallel block indicate an incorrect ChaCha20 implementation for that backend
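The per-block report can be sketched as a small comparison function. BlockReport and diagnose_block are hypothetical names, word indices are zero-based here, and the rule used for "localized" (at most two incorrect words, adjacent to each other) is an assumption chosen for this example rather than the diagnostic's actual definition.

```rust
/// Hypothetical summary of how one computed block differs from the expected one.
#[derive(Debug, PartialEq, Eq)]
struct BlockReport {
    /// Number of incorrect 32-bit words in the block.
    wrong_words: usize,
    /// Zero-based index of the first incorrect word.
    first_wrong: usize,
    /// True when the errors are confined to one small adjacent group.
    localized: bool,
}

/// Compares one 16-word block against its expected value.
/// Returns None when the block is fully correct.
fn diagnose_block(actual: &[u32; 16], expected: &[u32; 16]) -> Option<BlockReport> {
    let wrong: Vec<usize> = (0..16).filter(|&i| actual[i] != expected[i]).collect();
    let first = *wrong.first()?;
    let last = *wrong.last()?;
    Some(BlockReport {
        wrong_words: wrong.len(),
        first_wrong: first,
        // Assumption: "localized" means at most two wrong words, adjacent.
        localized: wrong.len() <= 2 && last - first < 2,
    })
}
```

Running this for each parallel block and printing the reports side by side gives the kind of pattern described above, for example a localized report at the counter words versus sixteen wrong words in a block.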
This testing strategy ensures that optimizations such as pointer‑based writes, widened SIMD buffers, and in‑place NEON operations do not alter the cryptographic correctness of the ChaCha20 core.