
Profile-Guided Optimization Made Our Code Slower

Modern C++ · dev · April 20, 2026 · 8 min read
Configuration                 Throughput        vs. Baseline
GCC baseline (-O2)            198.1 Mitems/s    —
GCC instrumented              98.3 Mitems/s     −50.4% (profiling cost)
GCC PGO (skewed training)     179.5 Mitems/s    −9.4%
GCC PGO (uniform training)    191.9 Mitems/s    −3.1%
Clang baseline (-O2)          183.7 Mitems/s    —
Clang PGO                     167.7 Mitems/s    −8.7%
Clang + BOLT                  183.2 Mitems/s    −0.3% (noise)
Clang + AutoFDO               188.1 Mitems/s    +2.4%

That’s the whole story. I took a virtual-dispatch interpreter loop — the textbook PGO target — instrumented it, trained it on a representative workload, and recompiled. Both GCC 15.2.1 and Clang 21.1.8 produced binaries that were slower than -O2 alone. Not by a little: GCC lost 9.4%, Clang lost 8.7%.

Everything ran on an i7-4790 (Haswell, 4 cores at 3.6 GHz), Fedora 43 container, Google Benchmark v1.9.1 with 5 repetitions and 2-second minimum run time. Both compilers at -O2. The CVs are under 0.6% — this isn’t noise.

The speculative devirtualization trap

PGO’s pitch is simple enough: instrument, profile, recompile, collect your free 10–20%. I’m not going to walk you through the three-phase workflow — if you’re reading this, you already know it. The part that matters is speculative devirtualization, because that’s the optimization that backfired.

When PGO sees that a virtual call site dispatches to LoadHandler::execute 60% of the time, it rewrites the indirect call into a type guard:

// What PGO generates (conceptually):
if (handler->vptr == &LoadHandler::vtable) {
    // Direct call, inlineable
    acc = static_cast<LoadHandler*>(handler)->execute(operand, acc);
} else {
    // Fallback: original indirect call
    acc = handler->execute(operand, acc);
}

A direct call has its target baked into the instruction stream. No memory load, no BTB lookup, no misprediction. That’s the theory. On a CPU where indirect calls are genuinely expensive, you win.

On Haswell, indirect calls aren’t expensive. The BTB tracks patterns of indirect branch targets indexed by branch history. With a skewed dispatch distribution — 60% one handler, the rest spread across three others — the predictor nails it on most iterations. The cost of a correctly-predicted indirect call is one or two extra cycles, hidden by the pipeline.

PGO’s type guard replaces that single predicted indirect call with a cmp + jne pair that wasn’t there before, plus a conditional branch that mispredicts 40% of the time (every dispatch that isn’t LoadHandler). Those mispredictions cause partial pipeline flushes. The BTB was handling the original dispatch at near-zero marginal cost. PGO replaced it with something worse.

PGO’s speculative devirtualization assumes indirect calls are expensive. On CPUs with strong indirect branch predictors, they aren’t. That’s the entire finding.

The benchmark

Four virtual handler types (LoadHandler, AddHandler, JumpHandler, OtherHandler), skewed distribution: 60% LOAD, 25% ADD, 10% JUMP, 5% OTHER. Each handler does trivial arithmetic so dispatch overhead dominates. 100,000 instructions per iteration, dispatched through a virtual execute method.

This is structurally identical to the hot loop in bytecode interpreters, packet processors, and event-driven state machines.
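The loop below references Handler and Instruction without showing them; the post doesn't include those definitions, so this is a reconstruction consistent with its description, with names and bodies that are my assumptions.

```cpp
#include <cstdint>

// Assumed reconstruction of the benchmark's types (not shown in the post).
enum class Op : uint8_t { LOAD, ADD, JUMP, OTHER };

struct Instruction {
    Op op;
    int64_t operand;
};

struct Handler {
    virtual ~Handler() = default;
    virtual int64_t execute(int64_t operand, int64_t acc) const = 0;
};

// Trivial arithmetic bodies so virtual dispatch dominates the cost.
struct LoadHandler : Handler {
    int64_t execute(int64_t o, int64_t) const override { return o; }
};
struct AddHandler : Handler {
    int64_t execute(int64_t o, int64_t a) const override { return a + o; }
};
struct JumpHandler : Handler {
    int64_t execute(int64_t o, int64_t a) const override { return a ^ o; }
};
struct OtherHandler : Handler {
    int64_t execute(int64_t o, int64_t a) const override { return a - o; }
};
```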

inline int64_t run_workload(
    const std::array<std::unique_ptr<Handler>, 4>& handlers,
    const std::vector<Instruction>& insns)
{
    int64_t acc = 0;
    for (const auto& i : insns) {
        acc = handlers[static_cast<uint8_t>(i.op)]->execute(i.operand, acc);
    }
    return acc;
}

GCC: −9.4%

GCC’s baseline produces 198.1 Mitems/s on the skewed workload. The instrumented build (-fprofile-generate) drops to 98.3 Mitems/s — a 2.01x slowdown, within the expected range. You pay that during profiling only.

After training on the skewed workload and recompiling with -fprofile-use: 179.5 Mitems/s. CV under 0.4%. I ran it multiple times because I didn’t believe it.
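For reference, GCC's side of the workflow is shorter than Clang's because there is no separate profile-merge step; the filenames here mirror the Clang example and are placeholders.

```shell
# 1. Instrument
g++ -fprofile-generate -O2 -o bench-instr bench-pgo.cpp

# 2. Run the training workload; GCC writes .gcda profile data on exit
./bench-instr --benchmark_min_time=2s

# 3. Recompile against the profile (no merge step needed)
g++ -fprofile-use -O2 -o bench-pgo bench-pgo.cpp
```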

Clang: −8.7%

Clang’s PGO workflow has more steps:

# 1. Instrument
clang++ -fprofile-instr-generate -O2 -o bench-instr bench-pgo.cpp

# 2. Run
./bench-instr --benchmark_min_time=2s

# 3. Merge profiles
llvm-profdata merge -output=default.profdata default.profraw

# 4. Recompile
clang++ -fprofile-instr-use=default.profdata -O2 -o bench-pgo bench-pgo.cpp

167.7 Mitems/s versus a 183.7 baseline — an 8.7% regression.

Two compilers, independently, same outcome: not a compiler bug, a hardware interaction.

The training data surprise

I expected the “wrong training data” result to go one way: train on uniform data, measure on skewed data, get worse numbers. That’s the conventional wisdom.

It went the other way. The uniform-trained binary (25% per handler) runs at 191.9 Mitems/s on the skewed workload — 3.1% below the no-PGO baseline, but faster than the correctly-trained PGO build at 179.5.

Skewed training data makes GCC more confident that LoadHandler is the hot target. Higher confidence means more aggressive speculative devirtualization and more type-guard overhead per dispatch. The “wrong” training data produces less aggressive speculation, which happens to be closer to optimal when the BTB is already handling the dispatch.

So yes, representative training data matters. But on this hardware, “more representative” means “more confidently wrong.”

AutoFDO: +2.4%

AutoFDO samples a production run via hardware performance counters instead of instrumenting the binary. No 2x profiling slowdown. Different profile format — and that format difference is why it works.

# Record with Last Branch Records
perf record -e cycles:u -b -o perf.data -- ./bench-baseline-clang

# Convert to LLVM profile
llvm-profgen --binary=./bench-baseline-clang \
             --perfdata=perf.data \
             --output=autofdo.profdata

# Recompile with sample profile
clang++ -fprofile-sample-use=autofdo.profdata -O2 -o bench-autofdo bench-pgo.cpp

188.1 Mitems/s versus the 183.7 Clang baseline. +2.4%.

Instrumented PGO records exact call targets per site — “60,000 calls to LoadHandler::execute here” — and that per-site type histogram triggers speculative devirtualization. AutoFDO records something different: LBR samples aggregated to control flow graph edges. Hot edge, cold edge. It carries function-level and edge-level hotness, which drives inlining and code layout decisions. It doesn’t carry the per-call-site type breakdown that triggers type guards.

AutoFDO sidesteps the devirtualization trap entirely. The profile data drives code layout — hot basic blocks contiguous, cold error paths pushed out — and inlining decisions the default heuristics would miss. None of that conflicts with what the BTB is already doing.

With 510,096 LBR samples, those layout and inlining improvements are worth 2.4%. Not dramatic. But it’s the only number in this investigation with a plus sign.

BOLT: don’t bother in a container

BOLT takes a compiled binary and a perf profile, then reorders functions and basic blocks for instruction cache utilization. I ran the full pipeline — perf2bolt, then llvm-bolt.
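The post doesn't show the exact invocation; a typical pipeline looks like this, with flag choices that are mine. Note that BOLT can only rewrite binaries linked with relocations preserved.

```shell
# BOLT needs relocations kept in the input binary
clang++ -O2 -Wl,--emit-relocs -o bench-baseline-clang bench-pgo.cpp

# Convert the perf profile into BOLT's fdata format
perf2bolt -p perf.data -o perf.fdata ./bench-baseline-clang

# Rewrite the binary with profile-driven layout
llvm-bolt ./bench-baseline-clang -o bench-bolt -data=perf.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions
```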

It reordered 3 functions, shortened 21 instructions, applied hot/cold splitting (1,142 hot bytes, 636 cold). Throughput: 183.2 Mitems/s versus 183.7 baseline. 0.997x. Nothing.

The problem: no hardware LBR in the container. perf2bolt fell back to basic sample mode — function-level hotness without edge-level branch frequencies. BOLT’s block reordering, its primary optimization, had nothing to work with. You get pretty dyno-stats output and no measurable improvement.

If you have bare-metal perf access with LBR, BOLT might be worth trying as a post-link pass on top of AutoFDO. In a container or VM without hardware counter passthrough, skip it.

What I’d actually do

Don’t assume PGO helps. I’ve now seen it hurt by 9.4% on a workload that should have been its best case.

If your hot path is virtual-dispatch-heavy and your target hardware is Haswell or newer Intel (or Zen 2+ AMD), the indirect branch predictor is probably already doing the job PGO’s devirtualization is trying to do. Start with perf stat — look at your branch-miss rate on the dispatch site. If it’s low, speculative devirtualization is solving a problem you don’t have.
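That check takes one command; the binary name here is a placeholder.

```shell
# Before reaching for PGO: is the dispatch branch actually mispredicting?
# A branch-miss rate in the low single digits means the BTB already has it.
perf stat -e instructions,branches,branch-misses ./bench-baseline
```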

Try AutoFDO over instrumented PGO. The profile format avoids the devirtualization trap, the profiling overhead is negligible, and it integrates with production monitoring. The gains are modest — 2.4% on my workload — but at least they’re gains.

And if someone tells you PGO is a free 15%, ask them what CPU they measured on.