Profile-Guided Optimization Made Our Code Slower
That's the whole story. I took a virtual-dispatch interpreter loop — the textbook PGO target — instrumented it, trained it on a representative workload, and recompiled. Both GCC 15.2.1 and Clang 21.1.
2 articles
That's the whole story. I took a virtual-dispatch interpreter loop — the textbook PGO target — instrumented it, trained it on a representative workload, and recompiled. Both GCC 15.2.1 and Clang 21.1.
False sharing is a measurable, fixable performance bug that hides in struct layouts. Two atomic counters in the same cache line can cost you 6x throughput — and perf c2c finds it in seconds.