Beyond Binary Search: When Interpolation Beats Quaternary, Radix, and SIMD
You have a sorted array. Need to search it. `std::lower_bound`: O(log n), predictable, robust. This is the default solution. But my industry experience breeds skepticism. Every few months, a new paper
Modern C++ // dev Apr 28, 2026 10 min read
Kernel Fusion on CPU: What llama.cpp's RMS_NORM + MUL Fusion Teaches Us About LLM Performance
PR #22423 in llama.cpp landed a kernel fusion for RMS_NORM + MUL in the ggml CPU backend a few weeks ago. The speedup: 1.60×. Consistently. Across dimension sizes, thread counts, even hardware variatio
Modern C++ // dev Apr 21, 2026 7 min read
Designing a SIMD Algorithm from Scratch
I manually unrolled a byte-counting loop with four independent accumulators (the textbook ILP optimization) and it ran 2.08× *slower* than the plain loop. The plain loop that GCC had quietly autovec
Modern C++ // dev Mar 31, 2026 10 min read
SIMD-accelerated computer vision on a $2 microcontroller
The RP2350 has a feature most embedded developers ignore. Two Cortex-M33 cores at 150 MHz, 520 KB of SRAM, $0.80 in quantity — and buried in the ISA, packed arithmetic instructions that process four 8
Modern C++ // dev Mar 24, 2026 13 min read