Topic hub

Performance Engineering

Cache lines, SIMD intrinsics, profile-guided optimisation, lock-free data structures, and memory ordering. Every number comes from real measurements on real hardware — assembly output included.

When JSON Schema Crashes Your Inference Server: Regex DoS in C++

Feed `std::regex` a pathological pattern and a crafted input, and watch it spiral. Input length 16 characters: 4.48 milliseconds. Input length 18: 18 milliseconds. Input length 20: over a second. The

Modern C++ // dev May 8, 2026 9 min read

Compilers Performance Data Structures LLVM

LLVM's Flat-Buffer Tree for IR Dominators: O(1) Reads vs O(n) Moves

Compiler optimization passes live and die on tree traversal. LLVM's dominator analysis alone queries ancestor relationships thousands of times per function. A real C++ translation unit with heavy temp

Modern C++ // dev May 5, 2026 9 min read

C++26 Performance Safety Standards

Compile-Time Unsigned Overflow Detection in C++: From `if`-checks to `constexpr` Assertions

Unsigned overflow wraps. That's the contract. C++ won't catch it, the CPU won't trap it. But your size calculation that wraps to a smaller buffer? That's a memory corruption vulnerability, and it's yo

Modern C++ // dev May 1, 2026 9 min read

SIMD Performance Algorithms Benchmarks

Beyond Binary Search: When Interpolation Beats Quaternary, Radix, and SIMD

You have a sorted array. Need to search it. `std::lower_bound`: O(log n), predictable, robust. This is the default solution. But my industry experience breeds skepticism. Every few months, a new paper

Modern C++ // dev Apr 28, 2026 10 min read

C++26 Concurrency Performance Standards

Parallel execution for loops: C++26's work-stealing scheduler under the hood

I've spent enough time staring at `perf stat` output to recognize a pattern: OpenMP's dynamic scheduler manages **55,593 ops/sec** on Zipf-distributed task costs with 512 tasks. That's about 18 micro

Modern C++ // dev Apr 24, 2026 9 min read

Performance AI/ML SIMD Optimization

Kernel Fusion on CPU: What llama.cpp's RMS_NORM + MUL Fusion Teaches Us About LLM Performance

Llama.cpp's PR #22423 landed a kernel fusion for RMS_NORM + MUL in the ggml CPU backend a few weeks ago. The speedup: 1.60×. Consistently. Across dimension sizes, thread counts, even hardware variatio

Modern C++ // dev Apr 21, 2026 7 min read

C++26 Concurrency Standards I/O

P3373R2: The Case for a Standardized Low-Latency I/O API

Here's the uncomfortable truth: modern C++ standard library I/O becomes a bottleneck at scale. Traditional POSIX APIs introduce 1–10 microseconds of latency per operation due to syscall overhead and k

Modern C++ // dev Apr 17, 2026 10 min read

C++26 Move Semantics Standards Performance

C++26 Move Semantics: What's New Since CppCon 2025 Basics Talk

If you watched Ben Saks's CppCon 2025 'Back to Basics: Move Semantics' talk, you know what moves are and why the compiler calls them. That talk is solid. C++26 doesn't contradict it. What it does is t

Modern C++ // dev Apr 10, 2026 8 min read

SIMD Performance x86 Optimization

Designing a SIMD Algorithm from Scratch

I manually unrolled a byte-counting loop with four independent accumulators — the textbook ILP optimization — and it ran 2.08x *slower* than the plain loop. The plain loop that GCC had quietly autovec

Modern C++ // dev Mar 31, 2026 10 min read

Embedded SIMD ARM Performance

SIMD-accelerated computer vision on a $2 microcontroller

The RP2350 has a feature most embedded developers ignore. Two Cortex-M33 cores at 150 MHz, 520 KB of SRAM, $0.80 in quantity — and buried in the ISA, packed arithmetic instructions that process four 8

Modern C++ // dev Mar 24, 2026 13 min read

Performance PGO Compilers Optimization

Profile-Guided Optimization Made Our Code Slower

That's the whole story. I took a virtual-dispatch interpreter loop — the textbook PGO target — instrumented it, trained it on a representative workload, and recompiled. Both GCC 15.2.1 and Clang 21.1.

Modern C++ // dev Mar 10, 2026 8 min read

Concurrency Lock-Free Queues Benchmarks

Lock-Free Queue Implementations Compared: Correctness, Performance, and the Bugs You'll Ship

A `std::mutex`-protected `std::deque` is 12% faster than `moodycamel::ConcurrentQueue` when contention is low.

Modern C++ // dev Mar 6, 2026 12 min read

Performance Cache Concurrency

Cache-Line Archaeology: Finding and Fixing False Sharing in Production

Your threads are doing independent work on independent data, and yet adding a second thread makes everything six times slower. This is false sharing, and it hides in struct layouts and thread-local counters across more production codebases than anyone wants to admit.

Modern C++ // dev Feb 27, 2026 8 min read