I manually unrolled a byte-counting loop with four independent accumulators — the textbook ILP optimization — and it ran 2.08x slower than the plain loop. The plain loop that GCC had quietly autovectorized behind my back.
That result sent me down a path that ended with hand-coded AVX2 intrinsics hitting 16.86 GB/s on a 64 MB buffer. 3.23x faster than what the compiler generated on its own. Here’s every decision along the way.
The function is trivial. Count occurrences of a byte in a buffer:
size_t count_bytes_scalar(const uint8_t* data, size_t n, uint8_t target) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        count += (data[i] == target);
    }
    return count;
}
Four lines. The kind of thing you write, ship, and forget about. On an i7-4790 (Haswell) with GCC 15.2.1 at -O2, it processes 5.22 GB/s. Not bad. But not what the hardware can do.
Check what you’re beating
Before writing a single intrinsic, I looked at what -O2 actually produces. The assembly has 47 references to xmm registers and zero ymm references. GCC autovectorized it — but only to 128-bit SSE2.
The inner loop is pcmpeqb for 16-byte comparisons and paddq for accumulation. Competent SSE2 code. But GCC chose not to touch AVX2’s 256-bit registers, even on a CPU that supports them. At -O2, neither GCC nor Clang assumes your binary will only ever run on AVX2 hardware. Fair enough.
Push to -O3 -march=native and GCC emits 35 ymm references — full AVX2 with an alignment prologue, a 256-bit main loop, and a scalar tail.
Clang gets there too at -O3 -march=native, but picks different instructions: vpbroadcastb to splat the target byte (GCC uses a punpcklbw/punpcklwd/pshufd chain), vpermq for lane shuffling (GCC uses vperm2i128), and vpmovzxbq for zero-extension. Same width, different strategy. Both correct.
So why bother with intrinsics? Two reasons. First, -march=native bakes the instruction set into your binary. If your deployment targets vary — and they usually do — you either ship multiple builds, use runtime dispatch, or settle for SSE2. Second, autovectorizers are pattern-matchers. They handle count += (data[i] == target) beautifully. Anything more complex — histogram bins, multi-condition predicates, cross-lane reductions — and you’re writing intrinsics anyway. This scalar loop is the best case for autovectorization. Once you need anything beyond byte equality and summation, you’re on your own.
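Runtime dispatch is less work than it sounds. Here's a minimal sketch using GCC/Clang's __builtin_cpu_supports; the function names are mine, and the AVX2 slot holds a scalar stand-in so the sketch compiles anywhere (a real build would put intrinsics behind the target attribute):

```cpp
#include <cstddef>
#include <cstdint>

// Baseline implementation: always available.
static size_t count_scalar(const uint8_t* data, size_t n, uint8_t target) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) count += (data[i] == target);
    return count;
}

// AVX2 slot. The target attribute lets GCC/Clang compile this function
// with AVX2 enabled even when the rest of the binary is built without
// -mavx2. Scalar stand-in body here, purely for illustration.
__attribute__((target("avx2")))
static size_t count_avx2(const uint8_t* data, size_t n, uint8_t target) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) count += (data[i] == target);
    return count;
}

// Resolve once on first call, then go through a function pointer.
size_t count_dispatch(const uint8_t* data, size_t n, uint8_t target) {
    using fn = size_t (*)(const uint8_t*, size_t, uint8_t);
    static const fn impl =
        __builtin_cpu_supports("avx2") ? count_avx2 : count_scalar;
    return impl(data, n, target);
}
```

GCC also offers function multiversioning (the target_clones attribute) that generates this dispatch automatically; the hand-rolled pointer above just makes the mechanism visible.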
The unrolling trap
Before reaching for intrinsics, I tried the obvious thing — four independent accumulators to exploit instruction-level parallelism:
size_t count_bytes_scalar_unrolled(const uint8_t* data, size_t n, uint8_t target) {
    size_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c0 += (data[i + 0] == target);
        c1 += (data[i + 1] == target);
        c2 += (data[i + 2] == target);
        c3 += (data[i + 3] == target);
    }
    for (; i < n; ++i) {
        c0 += (data[i] == target);
    }
    return c0 + c1 + c2 + c3;
}
2.51 GB/s. Less than half the throughput of the plain loop.
This looks like a benchmark bug until you read the assembly. GCC’s autovectorizer recognizes the simple count += (x == y) reduction and converts it to SSE2 vector code. The four-accumulator version, with its more complex loop structure, defeats the vectorizer’s pattern matching. It compiles to actual scalar code — four cmp/sete/movzx/add sequences per iteration — while the “unoptimized” loop gets 16 bytes of comparison per cycle via pcmpeqb.
I’ve seen this bite people in production code reviews. Someone “optimizes” a loop, the autovectorizer backs off, and throughput drops. Manual microoptimization and compiler autovectorization occupy the same niche. Pick one. Doing both frequently produces something worse than either.
The AVX2 version
The hand-coded version processes 32 bytes per iteration. Four phases.
Alignment prologue
AVX2 aligned loads (_mm256_load_si256) compile to vmovdqa, which faults on any address that isn't 32-byte aligned. The unaligned form (_mm256_loadu_si256, vmovdqu) never faults, and on Skylake and later it carries little penalty even when a load crosses a cache-line boundary, so there you could skip the prologue entirely. On Haswell, cache-line-crossing loads still cost extra, so whether the prologue matters depends on your target microarchitecture. I kept it because I was running on Haswell:
while (p < end && (reinterpret_cast<uintptr_t>(p) & 31u) != 0) {
    count += (*p == target);
    ++p;
}
At most 31 bytes in scalar. Negligible.
The inner loop
__m256i vtarget = _mm256_set1_epi8(static_cast<char>(target));
__m256i vzero = _mm256_setzero_si256();
__m256i vacc = _mm256_setzero_si256();
while (p + 32 <= end) {
    __m256i chunk = _mm256_load_si256(reinterpret_cast<const __m256i*>(p));
    __m256i matches = _mm256_cmpeq_epi8(chunk, vtarget);
    vacc = _mm256_sub_epi8(vacc, matches);
    p += 32;
}
__m256i sad = _mm256_sad_epu8(vacc, vzero);
Three instructions do the real work. _mm256_cmpeq_epi8 (vpcmpeqb) compares all 32 bytes against the target in a single cycle, producing 0xFF for matches and 0x00 for mismatches. _mm256_sub_epi8 (vpsubb) subtracts that mask from a running accumulator: subtracting 0xFF adds 1 in two's complement, so each of vacc's 32 byte lanes counts the matches seen at its position. After the loop, _mm256_sad_epu8 (vpsadbw) sums the bytes within each 8-byte group into a 64-bit lane — four lanes total across the 256-bit register.
The vpsadbw trick is the heart of this. Sum-of-absolute-differences was designed for motion estimation in video codecs, but when one operand is zero, it becomes a horizontal byte sum. That’s exactly what you need after accumulating matches as per-byte counts. I didn’t invent this — it’s been floating around the SIMD folklore for decades — but it’s the kind of trick you only find by reading other people’s intrinsics code or spending quality time with the Intel Intrinsics Guide.
The overflow guard
There’s a subtle bug in the loop above, and it took me longer than I’d like to admit to spot it.
vpsadbw widens to 64-bit lanes, but until it runs, the accumulation in vacc happens in 8-bit arithmetic. Each byte lane gains at most 1 per iteration, and after 256 all-match iterations a lane wraps to zero. So the limit is 255 iterations before the worst case (every byte matches) overflows.
The fix is boring but necessary — count iterations and flush the vector accumulator to a scalar counter every 255 iterations, inside the loop:
if (++iters == 255) {
    __m256i sad = _mm256_sad_epu8(vacc, vzero);
    __m128i hi = _mm256_extracti128_si256(sad, 1);
    __m128i lo = _mm256_castsi256_si128(sad);
    __m128i sum128 = _mm_add_epi64(lo, hi);
    count += static_cast<size_t>(_mm_extract_epi64(sum128, 0))
           + static_cast<size_t>(_mm_extract_epi64(sum128, 1));
    vacc = _mm256_setzero_si256();
    iters = 0;
}
The same widen-and-extract sequence runs once more after the loop to drain whatever accumulated since the last flush. 255 iterations × 32 bytes = 8,160 bytes between flushes — roughly 8,200 flush cycles in a 64 MB buffer. The cost is a vpsadbw, a vextracti128, a vpaddq, two vpextrq instructions, and a branch. Negligible against 255 iterations of actual work.
I tested this with three cases: 8,160 bytes (exactly at the limit), 8,192 bytes (one block past), and 1 MB (many flush cycles). All correct.
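The SAD-as-horizontal-sum trick is easy to verify in isolation. A minimal sketch using the 128-bit SSE2 form (psadbw), which needs no special compile flags on x86-64 — the function name is mine:

```cpp
#include <immintrin.h>
#include <cstdint>

// psadbw against an all-zero operand turns sum-of-absolute-differences
// into a horizontal byte sum: each 8-byte half of the vector collapses
// into one 64-bit lane.
uint64_t hsum16_u8(const uint8_t bytes[16]) {
    __m128i v   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(bytes));
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());
    __m128i hi  = _mm_unpackhi_epi64(sad, sad);  // move the upper lane down
    return static_cast<uint64_t>(
        _mm_cvtsi128_si64(_mm_add_epi64(sad, hi)));
}
```

Sticking to _mm_unpackhi_epi64 and _mm_cvtsi128_si64 keeps the whole thing within SSE2, so it runs on any x86-64 machine.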
Scalar tail
After the main loop, any remaining bytes (fewer than 32) are handled in scalar:
while (p < end) {
    count += (*p == target);
    ++p;
}
At most 31 iterations. Nothing clever.
What the compiler makes of the intrinsics
The hand-coded version compiles to exactly what you’d expect under GCC 15.2.1 at -O2 -mavx2. The hot loop is just vpcmpeqb and vpsubb; the flush path has the vpsadbw, vextracti128, and the extracts. The compiler preserved the overflow guard branch — it didn’t try to be clever about it.
This is one advantage of intrinsics over inline assembly: the compiler still handles register allocation, instruction scheduling, and loop alignment. You specify what to compute. It decides when to issue each instruction within the pipeline.
The numbers
All benchmarks: GCC 15.2.1, -O2 -mavx2, 64 MB buffer, Intel i7-4790 (Haswell, 3.60 GHz), five repetitions, median reported.
| Version | Throughput | Relative |
|---|---|---|
| Scalar (plain loop, SSE2 autovectorized) | 5.22 GB/s | 1.00x |
| Scalar (4x unrolled) | 2.51 GB/s | 0.48x |
| Hand-coded AVX2 | 16.86 GB/s | 3.23x |
That 0.48x for the unrolled version still makes me wince. A “performance optimization” that halves your throughput.
16.86 GB/s on a Haswell part with DDR3 is close to the practical memory bandwidth limit. This algorithm is memory-bound, not compute-bound: the CPU can process 32 bytes in fewer cycles than it takes to fetch the next cache line. On newer DDR5 hardware with AVX-512, the ratio would narrow — bandwidth scales, but so does the autovectorized version’s register width.
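The actual numbers came from Google Benchmark, but the shape of the measurement is easy to reproduce with nothing but std::chrono. A rough sketch (buffer size, fill byte, and the single-pass timing are my simplifications; real runs should repeat and take the median):

```cpp
#include <chrono>
#include <cstdint>
#include <cstddef>
#include <vector>

// Scalar reference, repeated here so the sketch is self-contained.
static size_t count_bytes_scalar(const uint8_t* data, size_t n, uint8_t target) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) count += (data[i] == target);
    return count;
}

// Time one pass over a buffer and report GB/s.
double measure_gbps(size_t bytes) {
    std::vector<uint8_t> buf(bytes, 0x41);
    auto t0 = std::chrono::steady_clock::now();
    volatile size_t sink = count_bytes_scalar(buf.data(), buf.size(), 0x41);
    auto t1 = std::chrono::steady_clock::now();
    (void)sink;  // volatile sink keeps the call from being optimized away
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return (static_cast<double>(bytes) / 1e9) / secs;
}
```

Google Benchmark adds the parts that matter for publishable numbers — warm-up, repetition, aggregate reporting — but a sketch like this is enough to see whether a change moved throughput at all.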
Correctness
The AVX2 implementation was tested against the scalar reference across all sizes from 0 to 512 bytes and four target byte values (0x00, 0x41, 0x42, 0xFF). 2,052 test cases. Zero failures.
Those sizes cover what matters: zero-length input, lengths shorter than one vector (under 32 bytes), exact multiples of 32, and tails of 1–31 bytes. The target values exercise unsigned comparison edge cases at 0x00 and 0xFF.
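For reference, the four phases assemble into something like the following. This is my reconstruction, not the author's verbatim source: the function names, the target attribute (so it compiles without -mavx2; call it only after a runtime AVX2 check), and the hsum_sad helper are mine.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Scalar reference, repeated here so the sketch is self-contained.
size_t count_bytes_scalar(const uint8_t* data, size_t n, uint8_t target) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) count += (data[i] == target);
    return count;
}

// Widen vpsadbw's output and sum its four 64-bit lanes into a scalar.
__attribute__((target("avx2")))
static size_t hsum_sad(__m256i vacc, __m256i vzero) {
    __m256i sad = _mm256_sad_epu8(vacc, vzero);
    __m128i s = _mm_add_epi64(_mm256_castsi256_si128(sad),
                              _mm256_extracti128_si256(sad, 1));
    return static_cast<size_t>(_mm_cvtsi128_si64(s)) +
           static_cast<size_t>(_mm_cvtsi128_si64(_mm_unpackhi_epi64(s, s)));
}

__attribute__((target("avx2")))
size_t count_bytes_avx2(const uint8_t* data, size_t n, uint8_t target) {
    const uint8_t* p = data;
    const uint8_t* end = data + n;
    size_t count = 0;
    // Phase 1: alignment prologue, at most 31 scalar bytes.
    while (p < end && (reinterpret_cast<uintptr_t>(p) & 31u) != 0) {
        count += (*p == target);
        ++p;
    }
    const __m256i vtarget = _mm256_set1_epi8(static_cast<char>(target));
    const __m256i vzero = _mm256_setzero_si256();
    __m256i vacc = _mm256_setzero_si256();
    unsigned iters = 0;
    // Phase 2: 32 bytes per iteration, matches accumulated per byte lane.
    while (static_cast<size_t>(end - p) >= 32) {
        __m256i chunk = _mm256_load_si256(reinterpret_cast<const __m256i*>(p));
        vacc = _mm256_sub_epi8(vacc, _mm256_cmpeq_epi8(chunk, vtarget));
        p += 32;
        // Phase 3: flush before any 8-bit lane can wrap.
        if (++iters == 255) {
            count += hsum_sad(vacc, vzero);
            vacc = _mm256_setzero_si256();
            iters = 0;
        }
    }
    count += hsum_sad(vacc, vzero);  // drain since the last flush
    // Phase 4: scalar tail, at most 31 bytes.
    while (p < end) {
        count += (*p == target);
        ++p;
    }
    return count;
}
```

Checking it against the scalar reference over every length from 0 to 512 exercises the same edge cases the article describes: empty input, sub-vector lengths, exact multiples of 32, and 1–31-byte tails.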
Was it worth it?
For byte counting specifically? Depends on your deployment. If you can ship with -O3 -march=native, the autovectorizer gets you AVX2 without the maintenance burden. The hand-coded version is faster — the autovectorizer generates more complex loop bookkeeping — but the gap isn’t dramatic for this particular pattern.
What I actually got out of this was the vocabulary. vpcmpeqb, vpsadbw, vpaddq — these three instructions are a pattern that shows up in string search, histogram construction, population count. Once you understand SAD-as-horizontal-sum and the flush-every-255 guard, you recognize it in production code, and you know exactly where the overflow bugs hide.
The best intrinsics code is the code that teaches you enough about the hardware that you stop writing intrinsics. Learn the instruction set. Understand what the autovectorizer can and can’t handle. Then make the call — usually -march=native, sometimes not. Knowing when is the whole point.
Tested on Intel i7-4790 (Haswell, AVX2), GCC 15.2.1, Clang 21.1.8, Fedora 43 (kernel 6.17.6). All source code compiled with -std=c++17 -O2 -mavx2. Benchmarks used Google Benchmark 1.9.1 with 5 repetitions.