Designing a SIMD Algorithm from Scratch

I manually unrolled a byte-counting loop with four independent accumulators — the textbook ILP optimization — and it ran 2.08x *slower* than the plain loop. The plain loop that GCC had quietly autovec

Modern C++ // dev Mar 31, 2026 10 min read

#x86
