
Unlocking ARM NEON Performance: A Deep Dive into Parallel Prefix Sums

Modern C++ // dev · May 10, 2026 · 10 min read

Scalar is faster than SIMD, and here’s why that’s not surprising

I measured the prefix sum on a Haswell i7-4790 with GCC 15.2.1 and -O3 -march=native. The scalar version hit 7.1 GB/s on 1MB arrays and held steady at 6.0–6.2 GB/s as I pushed to 1GB. For ARM NEON I can only offer analysis, since I can't run ARM code on an x86-64 box, but that analysis says it would fare worse. Even the handwritten interleaved SIMD code, which I did measure on x86 with AVX2, managed only 5.7–6.1 GB/s. Slower than scalar.

This is the moment where most engineers reach for excuses. “Maybe the compiler isn’t optimizing right.” “Maybe the memory alignment is off.” “Maybe the vector width isn’t wide enough.”

No. The algorithm is sequential. That’s the actual problem.

output[i] = output[i-1] + input[i]. This means output[1] can’t start until output[0] is done. The CPU can’t parallelize across elements — the data itself dictates sequential processing. SIMD can only parallelize within a vector. But when you process that vector, you’re still computing elements serially due to the dependency chain. Adding a vector only brings overhead: more instructions to unpack results, increased register pressure, higher latency. Put simply, scalar wins because it’s simpler.

Understanding this one case — understanding why SIMD failed — changes how you think about every performance optimization. It forces you to ask the right question first: does my algorithm have parallelism for SIMD to exploit? If not, SIMD doesn’t matter. If yes, then measure before and after. The prefix sum is the textbook “no.”

Why the loop is still sequential inside the vector

The scalar version is trivial:

#include <cstddef>
#include <cstdint>

// The recurrence: each output element depends on the previous one.
void scalar_prefix_sum(const uint32_t* input, uint32_t* output, size_t n) {
    if (n == 0) return;
    output[0] = input[0];
    for (size_t i = 1; i < n; ++i) {
        output[i] = output[i - 1] + input[i];
    }
}

I ran this with GCC 15.2.1 -O3 -march=native on Haswell. Throughput:

  • 1MB (L3-cached): 7.1 GB/s
  • 10MB: 5.98 GB/s
  • 100MB: 6.16 GB/s
  • 1GB (main memory): 6.03 GB/s
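
For context on how numbers like these are produced: throughput is just bytes touched divided by elapsed time. A minimal timing harness in that spirit, calling the scalar_prefix_sum above, might look like the sketch below (hypothetical, not the exact benchmark behind these figures; counting both read and write bytes is an assumption on my part):

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical harness: time the prefix sum over an n-element buffer and
// report GB/s, counting both the bytes read and the bytes written.
double measure_gbps(size_t n, int reps) {
    std::vector<uint32_t> input(n, 1), output(n);
    double best = 0.0;
    for (int r = 0; r < reps; ++r) {
        auto start = std::chrono::steady_clock::now();
        scalar_prefix_sum(input.data(), output.data(), n);
        auto stop = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(stop - start).count();
        double gbps = 2.0 * n * sizeof(uint32_t) / secs / 1e9;
        if (gbps > best) best = gbps;   // keep the best of several runs
    }
    return best;
}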

Why does scalar hold up so well even on huge arrays? The loop is latency-bound, not bandwidth-bound. Each iteration has a dependency: you can’t compute output[i] until output[i-1] exists. The Haswell can hide this latency because the load from input[i] takes ~4 cycles, and during those cycles, it’s finishing the addition and store from the previous iteration. Load and add execute in parallel on different execution units. By the time the load completes, the add is done. The next store starts immediately. The CPU’s out-of-order execution window is just wide enough to keep all the pipes flowing.

That out-of-order window is invisible to vector code.

What you lose when you vectorize

ARM NEON is 128 bits wide — 4 × uint32_t per load. You’d think: four elements in parallel, problem solved. No.

A naive vectorized version might look like this:

for (size_t i = 0; i < n; i += 4) {
    uint32x4_t input_vec = vld1q_u32(&input[i]);
    uint32x4_t output_vec = /* ... combine with previous result */;
    vst1q_u32(&output[i], output_vec);
}

The vector processes all 4 elements, but the algorithm hasn’t changed. output[i+1] still depends on output[i]. The elements inside the vector still have the same dependency chain. You’ve just added overhead: load the vector, unpack the result, combine with scalar state, store the vector back. The CPU stalls waiting for results to propagate through the vector operations. The load latency (4–5 cycles) is no longer hidden because every iteration depends on finishing the previous iteration.

I measured this on x86 AVX2 (256-bit vectors, 8 × uint32_t). Sequential SIMD achieved 3.36–3.58 GB/s. That’s barely half the scalar speed.

The out-of-order window provides no assistance here. The dependency is absolute, and the CPU cannot reorder operations around it.

Interleaving: trading complexity for hiding latency

You can improve SIMD by maintaining multiple independent data streams. While one stream is waiting on its loads (4–5 cycles of latency), the CPU can be loading and processing the other streams in parallel:

// Four independent accumulator streams, one per 8-element block. Each
// stream keeps its own running total, so a final pass (not shown here)
// is still needed to merge them into the true prefix sum.
uint32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;

for (size_t i = 0; i + 32 <= n; i += 32) {
    __m256i v0 = _mm256_loadu_si256((const __m256i*)&input[i]);
    __m256i v1 = _mm256_loadu_si256((const __m256i*)&input[i + 8]);
    __m256i v2 = _mm256_loadu_si256((const __m256i*)&input[i + 16]);
    __m256i v3 = _mm256_loadu_si256((const __m256i*)&input[i + 24]);

    // _mm256_extract_epi32 needs a compile-time lane index, so spill the
    // lanes to a small buffer and index it instead.
    alignas(32) uint32_t lane[32];
    _mm256_store_si256((__m256i*)&lane[0],  v0);
    _mm256_store_si256((__m256i*)&lane[8],  v1);
    _mm256_store_si256((__m256i*)&lane[16], v2);
    _mm256_store_si256((__m256i*)&lane[24], v3);

    for (int j = 0; j < 8; ++j) {
        acc0 += lane[j];       output[i + j]      = acc0;  // stream 0
        acc1 += lane[8 + j];   output[i + 8 + j]  = acc1;  // stream 1
        acc2 += lane[16 + j];  output[i + 16 + j] = acc2;  // stream 2
        acc3 += lane[24 + j];  output[i + 24 + j] = acc3;  // four independent streams
    }
}

Now the CPU can overlap the latency. While acc0's chain is waiting on its data (4–5 cycles), it loads and processes the inputs for acc1, acc2, and acc3. The execution units stay busy.

This gets you 5.7–6.1 GB/s — a 1.69x improvement over naive sequential SIMD.

But step back. Scalar was 7.1 GB/s. Interleaved SIMD is 5.7–6.1 GB/s. SIMD is still slower.

The interleaving speedup is real — 1.69x faster than sequential SIMD. But it’s 1.69x faster than something that was already bad. The fundamental issue doesn’t go away: prefix sum is an algorithm where the output of iteration N depends on iteration N-1. No vector width, no interleaving trick, no compiler magic breaks that dependency. SIMD doesn’t buy you anything here.

ARM NEON: same algorithm, same constraints

ARM NEON registers are 128 bits wide, 4 × uint32_t per load, and the interleaving strategy carries over from x86 unchanged. First, the straightforward sequential NEON version:

#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

void neon_prefix_sum(const uint32_t* input, uint32_t* output, size_t n) {
    uint32x4_t zero = vdupq_n_u32(0);
    uint32x4_t sum_vec = zero;  // running total, broadcast across all four lanes

    for (size_t i = 0; i + 4 <= n; i += 4) {
        uint32x4_t input_vec = vld1q_u32(&input[i]);

        // In-vector prefix sum: shift by one lane and add, then by two lanes and add
        uint32x4_t adds = vaddq_u32(input_vec, vextq_u32(zero, input_vec, 3));
        adds = vaddq_u32(adds, vextq_u32(zero, adds, 2));

        // Combine with the accumulated sum from previous iterations
        uint32x4_t output_vec = vaddq_u32(adds, sum_vec);
        vst1q_u32(&output[i], output_vec);

        // Broadcast the last element as the new running total
        sum_vec = vdupq_n_u32(vgetq_lane_u32(output_vec, 3));
    }
    // Tail elements (n % 4 != 0) handled separately; omitted here
}

The instructions: vld1q_u32 (load 4 elements), vst1q_u32 (store), vaddq_u32 (add), vextq_u32 (shift for horizontal operations). On Cortex-A72 or A73, you’d maintain 4–8 independent accumulators to pipeline loads and stores, same latency-hiding trick.
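
As a sketch of what those independent accumulators might look like on NEON (an illustration only, not measured code; the function name is mine, and as with the interleaved AVX2 pass above, each chain keeps its own running total, so a merge pass is still required):

#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Hypothetical interleaved pass: four independent accumulator chains over
// adjacent 4-element blocks of each 16-element chunk.
void neon_prefix_pass_interleaved(const uint32_t* input, uint32_t* output, size_t n) {
    uint32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;

    for (size_t i = 0; i + 16 <= n; i += 16) {
        uint32x4_t v0 = vld1q_u32(&input[i]);
        uint32x4_t v1 = vld1q_u32(&input[i + 4]);
        uint32x4_t v2 = vld1q_u32(&input[i + 8]);
        uint32x4_t v3 = vld1q_u32(&input[i + 12]);

        uint32_t lane[16];              // spill lanes for scalar accumulation
        vst1q_u32(&lane[0],  v0);
        vst1q_u32(&lane[4],  v1);
        vst1q_u32(&lane[8],  v2);
        vst1q_u32(&lane[12], v3);

        for (int j = 0; j < 4; ++j) {
            acc0 += lane[j];       output[i + j]      = acc0;  // chain 0
            acc1 += lane[4 + j];   output[i + 4 + j]  = acc1;  // chain 1
            acc2 += lane[8 + j];   output[i + 8 + j]  = acc2;  // chain 2
            acc3 += lane[12 + j];  output[i + 12 + j] = acc3;  // chain 3
        }
    }
}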

The fundamental constraint doesn’t change. ARM NEON faces the same sequential dependency. The interleaved version would achieve 1.6–1.8x speedup over naive NEON, matching the x86 pattern. And it would still be slower than scalar due to the overhead of vector operations and the register pressure from maintaining multiple independent accumulators.

Reading the assembly: where compilers make different bets

GCC 15.2.1 generates 1684 lines of assembly for the interleaved AVX2 loop. Clang 21.1.8 generates 1726 lines. Same code, different register allocation strategies.

GCC reuses registers more aggressively — it coalesces the four accumulators into fewer register slots across iterations. Clang uses a longer prologue with explicit push/pop sequences for register preservation. Both generate correct instructions (vmovdqa, vpaddd, vpshufd), but the cost per iteration differs.

This matters because -O3 -march=native gets you 80% of the way. The last 20% (register inefficiencies, prologue overhead, missed scheduling opportunities) only shows up in assembly. You have to read it. When you do, you start seeing why one compiler's loop takes 8 cycles per iteration and another's takes 6. You start understanding what you're asking the CPU to do.

Correctness check: all paths agree

Before trusting any number, I ran the scalar, sequential AVX2, and interleaved AVX2 implementations on test vectors:

  • 8 elements: all produce 1 3 6 10 15 21 28 36
  • 1,000 elements: all results matched through the entire array
  • 262K elements: all produce the same final sum

Optimized code is easy to break. Off-by-one errors in register shuffles, missed carries, uninitialized accumulators — they all produce plausible-looking numbers that are completely wrong. The measurements you see are from code that actually computes the right answer.
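
A cross-check like that is cheap to automate. A minimal sketch follows; the helper name and the function-pointer signature are mine, not from the benchmark code:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical cross-check: run a candidate implementation against the
// scalar reference and report the first mismatch, if any.
bool matches_scalar(void (*candidate)(const uint32_t*, uint32_t*, size_t), size_t n) {
    std::vector<uint32_t> input(n), expected(n), actual(n);
    for (size_t i = 0; i < n; ++i) input[i] = static_cast<uint32_t>(i + 1);  // 1, 2, 3, ...

    scalar_prefix_sum(input.data(), expected.data(), n);   // reference
    candidate(input.data(), actual.data(), n);             // implementation under test

    for (size_t i = 0; i < n; ++i) {
        if (expected[i] != actual[i]) {
            std::fprintf(stderr, "mismatch at %zu: %u vs %u\n", i, expected[i], actual[i]);
            return false;
        }
    }
    return true;
}

Running it on the same sizes as above (8, 1,000, and 262K elements) is enough to catch the usual off-by-one and carry bugs.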

The real lesson: algorithm first, SIMD second

The prefix sum is fundamentally sequential: output N depends on output N-1. No amount of vector width breaks that. Interleaving hides latency, but only within SIMD code that's already handicapped by the dependency. You get a 1.69x speedup over naive SIMD, but you start from such a disadvantaged position that you end up slower than scalar.

This is the lesson you need to carry forward. Some algorithms have parallelism: FFTs, convolutions, matrix multiplies, histogram accumulation into independent buckets. SIMD gives you 4x, 8x, 16x on those. Others don’t: prefix sums, running products, cumulative maximums, sequential reductions. SIMD gives you 0.8x, 0.9x, at best 1.0x on those.
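
The contrast is visible even in the simplest case. An element-wise operation has no cross-element dependency, so every lane does useful work; here is a small NEON example of the kind of loop that does vectorize well (an illustration, not part of the benchmark above):

#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// c[i] = a[i] + b[i]: no element depends on any other, so each 128-bit
// vector performs four genuinely independent additions.
void add_arrays(const uint32_t* a, const uint32_t* b, uint32_t* c, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        vst1q_u32(&c[i], vaddq_u32(vld1q_u32(&a[i]), vld1q_u32(&b[i])));
    }
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];  // scalar tail
    }
}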

Before you touch SIMD, ask: does my algorithm have exploitable parallelism? If no, don’t measure further. Stop. Use scalar. If yes, then interleaving might matter. Maintain independent accumulators, process separate regions in parallel, give the CPU room to pipeline. But measure scalar first. Know what you’re trying to beat.

And when you do write SIMD code, read the assembly. GCC and Clang generate different register allocation strategies on identical source. One runs at 8 cycles per iteration, the other at 6. You won’t see that in the throughput numbers unless you’re running billions of iterations. You’ll see it in perf events. In perf stat, look for stalls, pipeline flushes, resource contention. When you find them, read the assembly and ask: why is the compiler scheduling it this way? Can I restructure the source to suggest a better schedule?

The i7-4790 in my test machine gave me 7.1 GB/s on scalar prefix sum. The same machine gave me 5.7 GB/s on interleaved SIMD. That’s not a failure of the compiler or the hardware. That’s the algorithm telling you the truth.

Algorithm-first thinking vs. cargo cult optimization

The pattern I described happens constantly in production code. Engineer sees a loop, thinks “this should vectorize,” applies SIMD, measures, sees no improvement or regression, blames the compiler. The compiler is fine. The loop doesn’t have the shape for SIMD to help.

I encounter this in cache optimization too. Engineers spend weeks reducing false sharing, aligning data structures, only to find that the algorithm they chose doesn’t have enough parallelism for the false sharing to matter. The real cost was elsewhere.

The fix is always the same: profile first, understand the actual bottleneck, then pick the tool. For a prefix sum, the bottleneck is a sequential dependency. SIMD doesn’t address sequential dependencies. Full stop. Interleaving helps only by reducing the overhead of the vector operations themselves, but you’re still left with scalar outperforming you.

On ARM NEON, the situation is identical to x86. You’d get the same 1.6–1.8x speedup from interleaving over naive sequential NEON. You’d still lose to scalar. The lesson transfers exactly across architectures because the lesson is algorithmic, not architectural.