Llama.cpp’s PR #22423 landed a kernel fusion for RMS_NORM + MUL in the ggml CPU backend a few weeks ago. The speedup: 1.60×. Consistently. Across dimension sizes, thread counts, even hardware variations.
Not huge. Not a 10× breakthrough. But reliable—and it emerges from doing exactly what GPU programmers have done for years: recognizing that two operations reading the same data can be restructured into one. On CPU, you’re fighting memory bandwidth, not register pressure. The payoff is identical. The techniques are transportable.
What struck me: this isn’t a GPU trick adapted to CPU. It’s a CPU-specific optimization, though GPU fusion offers the lesson. I spent time with the PR and the investigation, and the pattern here is worth dissecting.
The Memory Bandwidth Bottleneck
Memory-bandwidth-bound. That’s the phrase everyone uses for CPU inference. But what does it actually mean?
A typical consumer CPU — Intel i7, AMD Ryzen — connects to main memory through a finite-bandwidth pipe. DDR4-3200 dual-channel: roughly 50 GB/s. DDR5: 80 GB/s. That’s the ceiling the memory controller enforces, indifferent to core count.
LLM inference involves two activities: reading weights and activations, then computing on them. The ratio is lopsided: a transformer layer reads 2–4× more bytes than it performs floating-point operations. Saturate the memory bus and adding faster CPUs or wider SIMD does nothing. You wait for data.
The roofline model formalizes this constraint: attainable throughput = min(peak_compute, peak_bandwidth × arithmetic_intensity). For element-wise operations like normalization, arithmetic intensity is roughly 0.25 operations per byte (about one FP op per float, and floats are 4 bytes each). At 50 GB/s, that's a ceiling of 12.5 billion operations per second. Modern CPU cores retire multiple SIMD operations per cycle, which at 4 GHz puts peak compute one to two orders of magnitude above that ceiling. Bandwidth wins. Optimization targets memory, not computation.
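The arithmetic fits in a few lines. A minimal sketch, using the 50 GB/s and 0.25 FLOP/byte figures from above plus an assumed, purely illustrative 100 GFLOP/s compute peak:

#include <algorithm>
#include <cstdio>

int main() {
    // Illustrative figures, not measurements.
    const double peak_bandwidth = 50e9;   // bytes/s: dual-channel DDR4-3200, roughly
    const double peak_compute   = 100e9;  // FLOP/s: assumed multi-core SIMD peak
    const double intensity      = 0.25;   // FLOP/byte: ~1 op per 4-byte float

    // Roofline: attainable = min(peak_compute, peak_bandwidth * arithmetic_intensity)
    const double attainable = std::min(peak_compute, peak_bandwidth * intensity);
    std::printf("attainable: %.1f GFLOP/s\n", attainable / 1e9);  // prints 12.5: bandwidth-bound
    return 0;
}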
RMS normalization followed by a learned scale replaced LayerNorm in modern LLMs. Llama 2, Phi, Llama 3 all use it. RMS is cheaper: no centering, just scale. The operation is:
sum = 0
for each element x[i]:
    sum += x[i]²
rms = sqrt(sum / dimension + eps)

for each element x[i]:
    normalized[i] = x[i] / rms

for each element i:
    output[i] = weight[i] * normalized[i]
Three loops. Three memory passes. The compiler sees function boundaries and stops there.
Why the Compiler Won’t Auto-Fuse
This is where the conservative nature of compilers matters. Modern C++ compilers are aggressive: link-time optimization, aggressive inlining, auto-vectorization. Yet they will not merge operations across function boundaries without explicit refactoring.
GCC 15.2.1 with -O3 produces two distinct symbol entries:
_Z8rms_normRKSt6vectorIfSaIfEEffRSt6_vector
_Z3mulRKSt6vectorIfSaIfEES3_RSt6_vector
Two entry points. Call. Return. Call again. No merged loop, even with __attribute__((noinline)) removed and LTO enabled. The compiler respects that boundary because correctness comes first. It cannot assume intermediate results are safe to skip—what if one function logs, profiles, or caches based on that intermediate? What if the vector is aliased? The compiler must be conservative. Prove correctness by refactoring. Then it sees the fusion.
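For concreteness, here is a minimal sketch of what the unfused pair might look like (signatures simplified and hypothetical, not the PR's actual code):

#include <cmath>
#include <cstddef>
#include <vector>

// Unfused: two separate functions, two symbols, three passes over N floats.
void rms_norm(const std::vector<float>& x, float eps, std::vector<float>& out) {
    float sum_sq = 0.0f;
    for (float v : x) {                      // pass 1: read x
        sum_sq += v * v;
    }
    const float rms = std::sqrt(sum_sq / x.size() + eps);
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = x[i] / rms;                 // pass 2: read x again, write the intermediate
    }
}

void mul(const std::vector<float>& a, const std::vector<float>& b, std::vector<float>& out) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] * b[i];                // pass 3: read the intermediate back, read weight
    }
}

// Typical call site: rms_norm(x, eps, tmp); mul(tmp, weight, y);
// The intermediate tmp goes out through the cache hierarchy and comes back.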
The Refactoring: Two Passes Instead of Three
PR #22423 applies the manual refactoring: the normalize and multiply loops merge into one, cutting three passes over the data down to two.
#include <cmath>
#include <cstddef>
#include <vector>

void rms_norm_mul_fused(const std::vector<float>& x, float eps,
                        const std::vector<float>& weight, std::vector<float>& out) {
    // Pass 1: compute RMS
    float sum_sq = 0.0f;
    for (float val : x) {
        sum_sq += val * val;
    }
    float rms = std::sqrt(sum_sq / x.size() + eps);

    // Pass 2: normalize and multiply, same loop
    for (std::size_t i = 0; i < x.size(); ++i) {
        float normalized = x[i] / rms;
        out[i] = normalized * weight[i];
    }
}
The first pass is unavoidable—you need the scalar RMS before normalizing. The second pass merges normalization and multiplication. Each output depends only on the corresponding input, weight, and scalar RMS, so aliasing and hidden dependencies vanish.
Now the compiler sees both operations in the same loop structure and vectorizes, prefetches, unrolls accordingly. The code structure proves it’s safe.
Memory traffic drops from 3N floats read to 2N: one full read of the intermediate vector is eliminated per fused pair. A 32-layer model with 8 RMS_NORM + MUL pairs per layer saves 256 vector reads per forward pass. At dimension 4096 that is about 4 MB, roughly 80 µs of bus time per forward pass at 50 GB/s, and milliseconds over a multi-token generation.
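A back-of-the-envelope check of those numbers, using the dimension-4096 case and the same illustrative 50 GB/s figure:

#include <cstdio>

int main() {
    // Back-of-the-envelope, using the example figures from the text.
    const double layers          = 32;
    const double pairs_per_layer = 8;       // RMS_NORM + MUL pairs per layer
    const double dim             = 4096;    // floats per vector
    const double bandwidth       = 50e9;    // bytes/s

    const double bytes_saved = layers * pairs_per_layer * dim * 4;  // one 4-byte read each
    std::printf("saved per forward pass: %.2f MB, ~%.0f us at 50 GB/s\n",
                bytes_saved / 1e6, bytes_saved / bandwidth * 1e6);
    // Prints ~4.19 MB and ~84 us; over a few hundred generated tokens that is milliseconds.
    return 0;
}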
The Data
Benchmarks ran on an Intel Core i7-4790 under Fedora 43, compiled with GCC 15.2.1 at -O3. High-precision timing, multiple repetitions, a controlled environment. Two dimensions (4096 and 8192 floats) and three thread counts (1, 4, and 8 threads).
At dimension 4096:
| Configuration | Unfused | Fused | Speedup |
|---|---|---|---|
| 1 thread | 132.86 µs | 82.76 µs | 1.61× |
| 4 threads | 139.54 µs | 88.78 µs | 1.57× |
| 8 threads | 248.63 µs | 155.16 µs | 1.60× |
At dimension 8192:
| Configuration | Unfused | Fused | Speedup |
|---|---|---|---|
| 1 thread | 270.29 µs | 169.17 µs | 1.60× |
| 4 threads | 278.26 µs | 177.50 µs | 1.57× |
| 8 threads | 498.97 µs | 314.41 µs | 1.59× |
1.57–1.61× in every configuration, with variance under 2% across the six setups. That consistency is a tell: you're seeing a structural reduction in memory traffic, not a cache artifact that evaporates at a different problem size. Absolute times scale linearly with dimension because both versions read O(n) data; the unfused one simply reads 50% more, so the ratio holds.
Thread contention is real. At 8 threads the total time climbs: eight threads on four physical cores compete for the same 50 GB/s bus. At dimension 8192, unfused rises to roughly 500 µs and fused to 314 µs. You don't get anywhere near linear scaling from threads; the bus is the limit, not core count. But fusion maintains the 1.6× ratio regardless. The savings are proportional and repeatable.
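For readers who want to reproduce this kind of comparison, a minimal harness might look like the sketch below. It assumes the unfused and fused functions shown earlier are in scope, times with std::chrono, and is not the PR's actual benchmark; cache effects and timer noise mean exact numbers will differ.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Assumes rms_norm, mul, and rms_norm_mul_fused (defined earlier) are in scope.
template <typename Fn>
double avg_us(Fn&& fn, int iters) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) fn();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}

int main() {
    const std::size_t n = 4096;
    std::vector<float> x(n, 1.0f), weight(n, 0.5f), tmp(n), out(n);
    const float eps = 1e-6f;
    const int iters = 10000;

    const double unfused = avg_us([&] { rms_norm(x, eps, tmp); mul(tmp, weight, out); }, iters);
    const double fused   = avg_us([&] { rms_norm_mul_fused(x, eps, weight, out); }, iters);

    // Print an output element so the compiler cannot discard the work.
    std::printf("unfused %.2f us, fused %.2f us, speedup %.2fx (sink %.3f)\n",
                unfused, fused, unfused / fused, out[0]);
    return 0;
}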
The Pattern and Implications
This is not a one-off. CPU kernel fusion follows one principle: identify operations that read overlapping data, restructure into a single loop, let the compiler optimize.
Other transformer pipeline pairs are candidates:
- Activation + Scale: ReLU or GELU, then scaling. Two reads of the same vector; fuse into one loop (see the sketch after this list).
- Matrix Multiply + Bias + Activation: the matmul produces an intermediate, bias is added, GELU is applied. All three touch the same data. Fusion keeps it in cache.
- Attention + Softmax: Logits computed, softmax applied, intermediate written and read back. Fuse them.
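The first candidate is easy to picture. A minimal sketch of the fused activation-plus-scale pass, with hypothetical names and signature:

#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative only: ReLU followed by a scalar scale, fused into one pass.
// Unfused, this is two loops and two full reads of x.
void relu_scale_fused(const std::vector<float>& x, float scale, std::vector<float>& out) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::max(x[i], 0.0f) * scale;  // activation and scale on one read of x[i]
    }
}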
Ggml already has a fusion abstraction at the operation level, but recognition is the bottleneck. Developers implement specific fused kernels like ggml_rms_norm_mul or rewrite the compute graph. Auto-fusion on CPU is largely missing from open-source ML frameworks. GPU fusion gets the research attention. CPU fusion is for those focused on edge inference and cost-constrained deployments.
But the gains are real. The technique is straightforward. It scales. A 13B model on CPU could see the difference between 5 seconds per forward pass and 3 seconds if this pattern applies across natural fusion points.
Refactoring is trivial: move the loop boundary. Benchmarking is the hard part. The technique is not exotic. The payoff compounds across layers.
Profile with perf stat or perf record. If memory bandwidth is the bottleneck, not CPU cycles, this pattern is worth exploring in your code. Find consecutive operations where one's output feeds the next. Fuse them. Benchmark. If the speedup appears, and it usually does, push the change upstream or keep it in your own tree. The pattern works.