Two atomic counters in a struct. Two threads, each incrementing its own counter.
Adding the second thread made everything six times slower. Not twice as fast.
Not the same speed. Six times slower. The profiler showed no lock contention,
no syscall overhead, nothing obviously wrong. I stared at perf stat output for
longer than I’d like to admit before remembering to check where the data
actually lived.
False sharing. Two logically independent variables sitting in the same 64-byte cache line, bouncing between cores on every write. It hides in struct layouts and thread-local counters across more production codebases than anyone wants to admit.
The Benchmark That Started This
Here’s the struct:
```cpp
struct SharedCounters {
    std::atomic<std::uint64_t> a{0};  // offset 0
    std::atomic<std::uint64_t> b{0};  // offset 8
};
static_assert(sizeof(SharedCounters) <= 64, "Both fit in one cache line");
```
Both atomics sit within one cache line. Every fetch_add from one thread
invalidates the other thread’s copy. The coherence protocol — MESI on single
socket, MESIF/MOESI across sockets — doesn’t know bytes 0 and 8 are
independent. It only sees the line.
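For reference, a minimal two-thread version of the benchmark loop. This is my sketch, not the original harness; the function name and timing approach are illustrative. Each thread hammers fetch_add on its own counter, and the only sharing is accidental, through the cache line:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

struct SharedCounters {
    // Same layout as above, repeated so the sketch compiles standalone.
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Time `iters` relaxed increments per thread; return nanoseconds per op.
// No locks, no logical sharing — any slowdown comes from coherence traffic.
double bench_ns_per_op(std::size_t iters) {
    SharedCounters c;
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] {
        for (std::size_t i = 0; i < iters; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (std::size_t i = 0; i < iters; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    return static_cast<double>(ns) / static_cast<double>(iters);
}
```

Swapping `SharedCounters` for the padded variant below is the only change needed to reproduce both columns of the table.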
Measured on an Intel i7-4790 at 3.60 GHz:
| Threads | Shared (ns/op) | Padded (ns/op) | Slowdown |
|---|---|---|---|
| 1 | 6.56 | 6.56 | 1.0x |
| 2 | 40.00 | 6.65 | 6.0x |
| 4 | 78.25 | 57.32 | 1.4x |
| 8 | 156.33 | 97.66 | 1.6x |
That 6x gap at two threads is not a contrived worst case. It’s two counters in a struct, the kind of thing that shows up in connection managers, stats collectors, and lock-free queues everywhere. Single-threaded, they’re identical. The cache line never bounces when there’s nobody to bounce it to.
At 4 and 8 threads the i7-4790’s 4 physical cores saturate. Even the padded version slows down from core contention, but the false-sharing version is consistently worse.
perf c2c Finds It in Seconds
I wasted time guessing before I learned this tool existed. On Linux, perf c2c
points directly at the offending cache line:
```shell
perf c2c record -- ./your-binary
perf c2c report --stdio
```
I ran it against our test binary with 8 threads, 50 million iterations each on the shared counters:
```
=================================================
     Global Shared Cache Line Event Information
=================================================
  Total Shared Cache Lines   :      27
  Load HITs on shared lines  :  112055
  Load Local HITM            :    7171
=================================================
          Shared Data Cache Line Table
=================================================
# ----------- Cacheline ----------      Tot
# Index          Address  Node  PA cnt  Hitm
      0         0x405080     0  177441  99.61%
```
One cache line. 99.6% of all HITM events. 7,143 cross-core invalidations. That’s your signal.
The pareto breakdown names the exact accesses:
```
  0     0   7143   214344     0     0   0x405080
  ---------------------------------------------------------------
     0.00%  49.10%  ...  0x0  [.] writer_a  atomic_base.h:631
     0.00%  50.90%  ...  0x8  [.] writer_b  atomic_base.h:631
```
Offset 0x0 and offset 0x8, split nearly 50/50. That’s the two
std::atomic members. Symbol name, source file, byte offset. It tells you
everything. On Intel VTune, the equivalent is the “Memory Access” viewpoint
with “Contested Accesses” highlighted. Same data, friendlier GUI.
Alignment Padding: The Obvious Fix
Force each counter onto its own cache line:
```cpp
struct PaddedCounters {
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};
static_assert(sizeof(PaddedCounters) >= 128);
```
You waste 56 bytes per counter. For a handful of hot counters, that’s nothing. For an array of ten thousand objects, you’ve just burned 560 KB on padding. Whether that matters depends on whether those objects are hot enough to false-share in the first place. In my experience, the ones worth padding are almost always in small, long-lived structs, not arrays.
At two threads: 6.65 ns/op versus 40.00 ns/op. Problem gone.
Hot/Cold Restructuring: The Underrated Fix
This one gets less attention than it deserves. When a struct mixes frequently-written fields with read-only configuration, every write invalidates cache lines that readers need. Separate them:
```cpp
struct GroupedData {
    // Hot partition — writers touch this line
    alignas(64) std::atomic<std::uint64_t> counter{0};
    std::atomic<std::uint64_t> counter2{0};

    // Cold partition — readers touch this line
    alignas(64) std::uint64_t config_a{42};
    std::uint64_t config_b{99};
};
```
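A quick sanity check for this kind of layout — my sketch, not part of the original benchmark: take the addresses of a hot member and a cold member and confirm they land on different 64-byte lines, while the two hot members share one.

```cpp
#include <atomic>
#include <cstdint>

struct GroupedData {
    // Hot partition — writers touch this line
    alignas(64) std::atomic<std::uint64_t> counter{0};
    std::atomic<std::uint64_t> counter2{0};
    // Cold partition — readers touch this line
    alignas(64) std::uint64_t config_a{42};
    std::uint64_t config_b{99};
};

// True if the hot and cold partitions occupy distinct 64-byte cache lines
// and both hot fields still share one line.
bool partitions_separated() {
    GroupedData g;
    auto line = [](const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) / 64;
    };
    return line(&g.counter) != line(&g.config_a) &&
           line(&g.counter)  == line(&g.counter2);
}
```

This catches the classic regression where someone later inserts a field between the partitions and silently pushes the cold data back onto the hot line.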
One writer, one reader, two threads:
| Layout | 2 Threads (ns/op) | 8 Threads (ns/op) |
|---|---|---|
| Interleaved | 32.90 | 134.03 |
| Grouped | 7.35 | 82.65 |
| Speedup | 4.5x | 1.6x |
4.5x from reordering fields. No algorithmic change, no new data structure. Just moving a declaration four lines down in the struct definition. I’ve seen this pattern save more real-world throughput than alignment padding, because it also improves cache utilization on the read path: readers stop pulling in dirty data they never touch.
Thread-Local Accumulation: The Right Answer
For stats counters specifically, the best fix is structural: don’t share at all.
```cpp
// Each thread owns a counter in its own cache line
struct alignas(64) ThreadLocalCounter {
    std::uint64_t value{0};
};
std::array<ThreadLocalCounter, MAX_THREADS> per_thread_counts;

// Writer: zero contention
per_thread_counts[my_thread_id].value++;

// Reader: sum all thread slots (stale by one increment, usually fine)
auto total = std::accumulate(per_thread_counts.begin(),
                             per_thread_counts.end(), 0ULL,
                             [](auto sum, auto& c) { return sum + c.value; });
```
Each thread accumulates locally; readers merge on demand. Slightly stale reads,
zero coherence traffic. This is the pattern behind jemalloc’s per-thread
arenas and most high-performance counters in production. I didn’t benchmark it
separately because it sidesteps the problem rather than demonstrating it, but
for stats collection, it’s almost always what you actually want.
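To make the pattern concrete, here is a self-contained sketch under my own assumptions (the thread cap and function name are illustrative, and the reader sums only after the writers join, so the plain non-atomic stores are race-free):

```cpp
#include <array>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

constexpr std::size_t MAX_THREADS = 8;  // illustrative cap

struct alignas(64) ThreadLocalCounter {
    std::uint64_t value{0};
};

// Each worker bumps only its own slot; the caller merges after joining.
std::uint64_t count_events(std::size_t n_threads, std::uint64_t per_thread) {
    std::array<ThreadLocalCounter, MAX_THREADS> slots{};
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < n_threads; ++t)
        workers.emplace_back([&slots, t, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i)
                slots[t].value++;  // plain store: no other thread writes this slot
        });
    for (auto& w : workers) w.join();
    return std::accumulate(slots.begin(), slots.end(), std::uint64_t{0},
                           [](std::uint64_t sum, const ThreadLocalCounter& c) {
                               return sum + c.value;
                           });
}
```

A live stats endpoint that reads while writers run would need the slots to be atomics with relaxed loads, but the layout and the zero-contention write path stay the same.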
About std::hardware_destructive_interference_size
C++17 gave us a portable way to say “separate cache line, please”:
```cpp
#include <new>

// std::hardware_destructive_interference_size == 64 on both
// GCC 15.2.1 and Clang 21.1.8 (x86_64, Fedora 43)
```
Both GCC 15 and Clang 21 define __cpp_lib_hardware_interference_size to
201703L and report 64. Its complement,
std::hardware_constructive_interference_size (also 64), tells you the line
size for data you want colocated.
```cpp
struct StdPadded {
    alignas(std::hardware_destructive_interference_size)
        std::atomic<std::uint64_t> a{0};
    alignas(std::hardware_destructive_interference_size)
        std::atomic<std::uint64_t> b{0};
};
```
Performance is identical to alignas(64):
| Variant | 2 Threads (ns/op) | 8 Threads (ns/op) |
|---|---|---|
| No alignment | 38.82 | 143.56 |
| alignas(64) | 6.66 | 96.96 |
| alignas(std::h…) | 6.90 | 92.34 |
Within measurement noise. On x86_64 the constant is 64 on every implementation I’ve tested.
The catch with std::hardware_destructive_interference_size: it’s a compile-time
constant baked into the binary, not a runtime query. Compile on a machine with
64-byte lines, deploy to a machine with 128-byte lines (some ARM server cores),
and the constant is wrong. Apple set it to 128 on Apple Silicon specifically to
dodge that trap. For x86_64 targets, 64 has been correct for over two decades.
For portable libraries targeting ARM, make the alignment a build-system
parameter instead.
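That build-system parameter can be as simple as a macro with a sane default. A sketch — the name CACHE_LINE_SIZE is my choice, not a standard one:

```cpp
#include <atomic>
#include <cstdint>

// Override from the build system, e.g. -DCACHE_LINE_SIZE=128 for ARM
// server targets; the default matches two decades of x86_64.
#ifndef CACHE_LINE_SIZE
#define CACHE_LINE_SIZE 64
#endif

struct PaddedCounter {
    alignas(CACHE_LINE_SIZE) std::atomic<std::uint64_t> value{0};
};

// Each counter occupies exactly one configured line.
static_assert(alignof(PaddedCounter) == CACHE_LINE_SIZE);
static_assert(sizeof(PaddedCounter) == CACHE_LINE_SIZE);
```

The static_asserts mean a mismatched override fails at compile time rather than silently reintroducing false sharing.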
Real Numbers: A Stats Collector
A web server tracking requests, errors, bytes sent, bytes received. Four atomic counters, four threads, each updating its own:
| Layout | 1T (ns/op) | 2T (ns/op) | 4T (ns/op) | 8T (ns/op) |
|---|---|---|---|---|
| Packed | 6.81 | 37.54 | 77.02 | 149.37 |
| Padded | 6.80 | 6.96 | 7.16 | 57.71 |
| Speedup | 1.0x | 5.4x | 10.8x | 2.6x |
At four threads — each counter on its own physical core — the padded version
delivers 139.6 million operations per second versus 13.0 million for packed.
10.8x. From alignas(64) on four struct members. Total memory cost: 192
extra bytes.
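For concreteness, the padded layout behind those numbers would look something like this — field names are assumed from the counter descriptions above, not taken from the actual server code:

```cpp
#include <atomic>
#include <cstdint>

// Four counters, one cache line each, so four threads can update
// them concurrently without invalidating each other's lines.
struct ServerStats {
    alignas(64) std::atomic<std::uint64_t> requests{0};
    alignas(64) std::atomic<std::uint64_t> errors{0};
    alignas(64) std::atomic<std::uint64_t> bytes_sent{0};
    alignas(64) std::atomic<std::uint64_t> bytes_received{0};
};

static_assert(sizeof(ServerStats) == 256);  // 64 bytes per counter, padding included
```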
At eight threads on four physical cores, hyperthreading muddies things. Two logical threads per core share L1, so even the padded version takes a hit. But packed is still 2.6x worse.
That 10.8x is the number I show people when they ask whether struct layout matters.
When Not to Pad
Measure first. Run perf c2c record on Linux or VTune’s Memory Access analysis
anywhere else. If HITM counts are low, you don’t have false sharing. You have
a different problem.
Only pad fields that get written concurrently. Read-only data shared across threads doesn’t false-share; contention requires concurrent writes to the same line. And watch the memory cost: padding a 16-million-element array of 8-byte counters to 64-byte alignment turns 128 MB into 1 GB. Restructure before you pad. Hot/cold separation often eliminates the problem without wasting memory and improves cache utilization for readers as a bonus.
If the data is only aggregated periodically, per-thread accumulation eliminates coherence traffic entirely. That’s usually the right call for stats counters.
The tooling finds false sharing in seconds. The fixes are mechanical. The only
hard part is knowing to look. Now you know to run perf c2c before you
spend a week staring at flamegraphs.