Two atomic counters in a struct. Two threads, each incrementing its own counter.
Adding the second thread made everything six times slower. Not twice as fast.
Not the same speed. Six times slower. The profiler showed no lock contention,
no syscall overhead, nothing obviously wrong. I stared at perf stat output for
longer than I’d like to admit before remembering to check where the data
actually lived.
False sharing. Two logically independent variables sitting in the same 64-byte cache line, bouncing between cores on every write. It hides in struct layouts and thread-local counters across more production codebases than anyone wants to admit.
The Benchmark That Started This
Here’s the struct:
```cpp
struct SharedCounters {
    std::atomic<std::uint64_t> a{0};  // offset 0
    std::atomic<std::uint64_t> b{0};  // offset 8
};
static_assert(sizeof(SharedCounters) <= 64, "Both fit in one cache line");
```
Both atomics sit within one cache line. Every fetch_add from one thread
invalidates the other thread’s copy. The coherence protocol — MESI on single
socket, MESIF/MOESI across sockets — doesn’t know bytes 0 and 8 are
independent. It only sees the line.
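For reference, a minimal two-thread version of the benchmark loop. This is my sketch, not the original harness; the function name and timing approach are illustrative. Each thread hammers fetch_add on its own counter, and the only sharing is accidental, through the cache line:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

struct SharedCounters {
    // Same layout as above, repeated so the sketch compiles standalone.
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Time `iters` relaxed increments per thread; return nanoseconds per op.
// No locks, no logical sharing — any slowdown comes from coherence traffic.
double bench_ns_per_op(std::size_t iters) {
    SharedCounters c;
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] {
        for (std::size_t i = 0; i < iters; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (std::size_t i = 0; i < iters; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    return static_cast<double>(ns) / static_cast<double>(iters);
}
```

Swapping `SharedCounters` for the padded variant below is the only change needed to reproduce both columns of the table.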
Measured on an Intel i7-4790 at 3.60 GHz:
| Threads | Shared (ns/op) | Padded (ns/op) | Slowdown |
|---|---|---|---|
| 1 | 6.56 | 6.56 | 1.0x |
| 2 | 40.00 | 6.65 | 6.0x |
| 4 | 78.25 | 57.32 | 1.4x |
| 8 | 156.33 | 97.66 | 1.6x |
That 6x gap at two threads is not a contrived worst case. It’s two counters in a struct, the kind of thing that shows up in connection managers, stats collectors, and lock-free queues everywhere. Single-threaded, they’re identical. The cache line never bounces when there’s nobody to bounce it to.
At 4 and 8 threads the i7-4790’s 4 physical cores saturate. Even the padded version slows down from core contention, but the false-sharing version is consistently worse.
perf c2c Finds It in Seconds
I wasted time guessing before I learned this tool existed. On Linux, perf c2c
points directly at the offending cache line:
```shell
perf c2c record -- ./your-binary
perf c2c report --stdio
```
I ran it against our test binary with 8 threads, 50 million iterations each on the shared counters:
```
=================================================
     Global Shared Cache Line Event Information
=================================================
  Total Shared Cache Lines   :      27
  Load HITs on shared lines  :  112055
  Load Local HITM            :    7171
=================================================
          Shared Data Cache Line Table
=================================================
# ----------- Cacheline ----------      Tot
# Index          Address  Node  PA cnt  Hitm
      0         0x405080     0  177441  99.61%
```
One cache line. 99.6% of all HITM events. 7,143 cross-core invalidations. That’s your signal.
The pareto breakdown names the exact accesses:
```
  0     0   7143   214344     0     0   0x405080
  ---------------------------------------------------------------
     0.00%  49.10%  ...  0x0  [.] writer_a  atomic_base.h:631
     0.00%  50.90%  ...  0x8  [.] writer_b  atomic_base.h:631
```
Offset 0x0 and offset 0x8, split nearly 50/50. That’s the two
std::atomic members. Symbol name, source file, byte offset. It tells you
everything. On Intel VTune, the equivalent is the “Memory Access” viewpoint
with “Contested Accesses” highlighted. Same data, friendlier GUI.
Alignment Padding: The Obvious Fix
Force each counter onto its own cache line:
```cpp
struct PaddedCounters {
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};
static_assert(sizeof(PaddedCounters) >= 128);
```
You waste 56 bytes per counter. For a handful of hot counters, that’s nothing. For an array of ten thousand objects, you’ve just burned 560 KB on padding. Whether that matters depends on whether those objects are hot enough to false-share in the first place. In my experience, the ones worth padding are almost always in small, long-lived structs, not arrays.
At two threads: 6.65 ns/op versus 40.00 ns/op. Problem gone.
Hot/Cold Restructuring: The Underrated Fix
This one gets less attention than it deserves. When a struct mixes frequently-written fields with read-only configuration, every write invalidates cache lines that readers need. Separate them:
```cpp
struct GroupedData {
    // Hot partition — writers touch this line
    alignas(64) std::atomic<std::uint64_t> counter{0};
    std::atomic<std::uint64_t> counter2{0};

    // Cold partition — readers touch this line
    alignas(64) std::uint64_t config_a{42};
    std::uint64_t config_b{99};
};
```
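A quick sanity check for this kind of layout — my sketch, not part of the original benchmark: take the addresses of a hot member and a cold member and confirm they land on different 64-byte lines, while the two hot members share one.

```cpp
#include <atomic>
#include <cstdint>

struct GroupedData {
    // Hot partition — writers touch this line
    alignas(64) std::atomic<std::uint64_t> counter{0};
    std::atomic<std::uint64_t> counter2{0};
    // Cold partition — readers touch this line
    alignas(64) std::uint64_t config_a{42};
    std::uint64_t config_b{99};
};

// True if the hot and cold partitions occupy distinct 64-byte cache lines
// and both hot fields still share one line.
bool partitions_separated() {
    GroupedData g;
    auto line = [](const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) / 64;
    };
    return line(&g.counter) != line(&g.config_a) &&
           line(&g.counter)  == line(&g.counter2);
}
```

This catches the classic regression where someone later inserts a field between the partitions and silently pushes the cold data back onto the hot line.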
One writer, one reader, two threads:
| Layout | 2 Threads (ns/op) | 8 Threads (ns/op) |
|---|---|---|
| Interleaved | 32.90 | 134.03 |
| Grouped | 7.35 | 82.65 |
| Speedup | 4.5x | 1.6x |
4.5x from reordering fields. No algorithmic change, no new data structure. Just moving a declaration four lines down in the struct definition. I’ve seen this pattern save more real-world throughput than alignment padding, because it also improves cache utilization on the read path: readers stop pulling in dirty data they never touch.
Thread-Local Accumulation: The Right Answer
For stats counters specifically, the best fix is structural: don’t share at all.
```cpp
// Each thread owns a counter in its own cache line
struct alignas(64) ThreadLocalCounter {
    std::uint64_t value{0};
};
std::array<ThreadLocalCounter, MAX_THREADS> per_thread_counts;

// Writer: zero contention
per_thread_counts[my_thread_id].value++;

// Reader: sum all thread slots (stale by one increment, usually fine)
auto total = std::accumulate(per_thread_counts.begin(),
                             per_thread_counts.end(), 0ULL,
                             [](auto sum, auto& c) { return sum + c.value; });
```
Each thread accumulates locally; readers merge on demand. Slightly stale reads,
zero coherence traffic. This is the pattern behind jemalloc’s per-thread
arenas and most high-performance counters in production. I didn’t benchmark it
separately because it sidesteps the problem rather than demonstrating it, but
for stats collection, it’s almost always what you actually want.
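To make the pattern concrete, here is a self-contained sketch under my own assumptions (the thread cap and function name are illustrative, and the reader sums only after the writers join, so the plain non-atomic stores are race-free):

```cpp
#include <array>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

constexpr std::size_t MAX_THREADS = 8;  // illustrative cap

struct alignas(64) ThreadLocalCounter {
    std::uint64_t value{0};
};

// Each worker bumps only its own slot; the caller merges after joining.
std::uint64_t count_events(std::size_t n_threads, std::uint64_t per_thread) {
    std::array<ThreadLocalCounter, MAX_THREADS> slots{};
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < n_threads; ++t)
        workers.emplace_back([&slots, t, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i)
                slots[t].value++;  // plain store: no other thread writes this slot
        });
    for (auto& w : workers) w.join();
    return std::accumulate(slots.begin(), slots.end(), std::uint64_t{0},
                           [](std::uint64_t sum, const ThreadLocalCounter& c) {
                               return sum + c.value;
                           });
}
```

A live stats endpoint that reads while writers run would need the slots to be atomics with relaxed loads, but the layout and the zero-contention write path stay the same.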
About std::hardware_destructive_interference_size
C++17 gave us a portable way to say “separate cache line, please”:
```cpp
#include <new>

// std::hardware_destructive_interference_size == 64 on both
// GCC 15.2.1 and Clang 21.1.8 (x86_64, Fedora 43)
```
Both GCC 15 and Clang 21 define __cpp_lib_hardware_interference_size to
201703L and report 64. Its complement,
std::hardware_constructive_interference_size (also 64), tells you the line
size for data you want colocated.
```cpp
struct StdPadded {
    alignas(std::hardware_destructive_interference_size)
        std::atomic<std::uint64_t> a{0};
    alignas(std::hardware_destructive_interference_size)
        std::atomic<std::uint64_t> b{0};
};
```
Performance is identical to alignas(64):
| Variant | 2 Threads (ns/op) | 8 Threads (ns/op) |
|---|---|---|
| No alignment | 38.82 | 143.56 |
| alignas(64) | 6.66 | 96.96 |
| alignas(std::h…) | 6.90 | 92.34 |
Within measurement noise. On x86_64 the constant is 64 on every implementation I’ve tested.
The catch with std::hardware_destructive_interference_size: it’s a compile-time
constant baked into the binary, not a runtime query. Compile on a machine with
64-byte lines, deploy to a machine with 128-byte lines (some ARM server cores),
and the constant is wrong. Apple set it to 128 on Apple Silicon specifically to
dodge that trap. For x86_64 targets, 64 has been correct for over two decades.
For portable libraries targeting ARM, make the alignment a build-system
parameter instead.
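That build-system parameter can be as simple as a macro with a sane default. A sketch — the name CACHE_LINE_SIZE is my choice, not a standard one:

```cpp
#include <atomic>
#include <cstdint>

// Override from the build system, e.g. -DCACHE_LINE_SIZE=128 for ARM
// server targets; the default matches two decades of x86_64.
#ifndef CACHE_LINE_SIZE
#define CACHE_LINE_SIZE 64
#endif

struct PaddedCounter {
    alignas(CACHE_LINE_SIZE) std::atomic<std::uint64_t> value{0};
};

// Each counter occupies exactly one configured line.
static_assert(alignof(PaddedCounter) == CACHE_LINE_SIZE);
static_assert(sizeof(PaddedCounter) == CACHE_LINE_SIZE);
```

The static_asserts mean a mismatched override fails at compile time rather than silently reintroducing false sharing.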
Real Numbers: A Stats Collector
A web server tracking requests, errors, bytes sent, bytes received. Four atomic counters, four threads, each updating its own:
| Layout | 1T (ns/op) | 2T (ns/op) | 4T (ns/op) | 8T (ns/op) |
|---|---|---|---|---|
| Packed | 6.81 | 37.54 | 77.02 | 149.37 |
| Padded | 6.80 | 6.96 | 7.16 | 57.71 |
| Speedup | 1.0x | 5.4x | 10.8x | 2.6x |
At four threads — each counter on its own physical core — the padded version
delivers 139.6 million operations per second versus 13.0 million for packed.
10.8x. From alignas(64) on four struct members. Total memory cost: 192
extra bytes.
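For concreteness, the padded layout behind those numbers would look something like this — field names are assumed from the counter descriptions above, not taken from the actual server code:

```cpp
#include <atomic>
#include <cstdint>

// Four counters, one cache line each, so four threads can update
// them concurrently without invalidating each other's lines.
struct ServerStats {
    alignas(64) std::atomic<std::uint64_t> requests{0};
    alignas(64) std::atomic<std::uint64_t> errors{0};
    alignas(64) std::atomic<std::uint64_t> bytes_sent{0};
    alignas(64) std::atomic<std::uint64_t> bytes_received{0};
};

static_assert(sizeof(ServerStats) == 256);  // 64 bytes per counter, padding included
```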
At eight threads on four physical cores, hyperthreading muddies things. Two logical threads per core share L1, so even the padded version takes a hit. But packed is still 2.6x worse.
That 10.8x is the number I show people when they ask whether struct layout matters.
When Not to Pad
Measure first. Run perf c2c record on Linux or VTune’s Memory Access analysis
anywhere else. If HITM counts are low, you don’t have false sharing. You have
a different problem.
Only pad fields that get written concurrently. Read-only data shared across threads doesn’t false-share; contention requires concurrent writes to the same line. And watch the memory cost: padding a 16-million-element array of 8-byte counters to 64-byte alignment turns 128 MB into 1 GB. Restructure before you pad. Hot/cold separation often eliminates the problem without wasting memory and improves cache utilization for readers as a bonus.
If the data is only aggregated periodically, per-thread accumulation eliminates coherence traffic entirely. That’s usually the right call for stats counters.
The tooling finds false sharing in seconds. The fixes are mechanical. The only
hard part is knowing to look. Now you know to run perf c2c before you
spend a week staring at flamegraphs.