Beyond Sequential Consistency: Practical Performance Gains with C++ Weaker Memory Orderings

For years I avoided thinking about memory orderings. I’d write std::atomic<T> with the default std::memory_order_seq_cst, and it worked. On x86-64, the CPU’s strong memory model does the heavy lifting—sequential consistency costs almost nothing. The compiler generates nearly identical code whether you ask for seq_cst or acquire-release.

Then I profiled on ARM servers. The illusion shattered. The same synchronized access pattern that costs nothing on Intel suddenly became a bottleneck. Acquire-release made it vanish. So I measured. I disassembled both versions. I read the ARM Architecture Reference Manual. And I realized I’d been working with half a mental model—one that happened to match x86’s strong memory semantics, but broke at the architecture boundary.

Here’s what I learned: if you ship on ARM at all, you need to understand weaker memory orderings. Not because they’re optional. Because the default is wrong for anything but x86.

Sequential Consistency: Bought, Not Free

When you write std::memory_order_seq_cst, you’re asking for a total order across all atomic operations. Every read and write happens in a single global sequence—as if the hardware ran a single thread. It’s the strongest guarantee the memory model offers.

The C++ committee made it the default out of conservatism. Seq_cst is the safest option. You can reason about your atomics like a single-threaded program and the hardware will make it work. But safety and cost aren’t decoupled from the CPU you’re running on.

On x86-64: The cost is invisible. Intel and AMD built x86 to enforce sequential consistency implicitly. A plain store already respects release semantics; a plain load respects acquire. The write buffer and cache coherency do the rest. When GCC compiles std::memory_order_seq_cst, the output is identical to std::memory_order_acquire or std::memory_order_release. I’ve inspected the assembly. Both emit a single MOV for the store, sometimes with a LOCK prefix—and even that’s often redundant on modern Intel. The hardware does you a favor.

On ARM: You pay for what you ask for. ARMv8’s memory model is genuinely weak. Loads pass loads. Stores pass stores. A store passes a load. Without explicit synchronization, reordering isn’t a bug—it’s the architecture working as designed. Seq_cst means “enforce a total order,” which ARMv8 translates to a full DMB SY barrier (data memory barrier, system scope) before and after each atomic. A DMB is a global synchronization point. It serializes the entire pipeline. It stalls not just your core, but the whole cache hierarchy. On a modern ARM Cortex-X4, a DMB costs 100+ cycles.

Acquire-release, by contrast, compiles to LDACQ (load with acquire) and STREL (store with release)—instructions designed to synchronize between two cores without blocking the entire system. The difference in generated code is dramatic.

This matters now. ARM cloud infrastructure isn’t fringe anymore—Graviton in AWS, Ampere in Azure, scale-out everywhere. Mobile has run weak-memory systems since the first ARM Cortex. If you’re shipping latency-sensitive code on ARM, weaker orderings are not an optimization. They’re a requirement.

Acquire-Release: The Actual Workhorse

Acquire-release is what you should reach for after relaxed. It isn’t the weakest strong ordering, and it isn’t a hedge—it’s the right tool for point-to-point synchronization. That covers almost every real concurrent pattern.

The guarantee is precise. A release store “synchronizes with” a subsequent acquire load that reads that value. Everything you did before the release is visible to everything after the acquire. Everything else—stores on other cores, unrelated loads—proceeds independently. Not a total order. A point-to-point contract between two threads.

Why the narrower guarantee? It removes the global ordering requirement. Seq_cst says “establish an order the whole system sees.” Acquire-release says “synchronize these two cores.” On a many-core system, the latter is far cheaper. Core A’s acquire only synchronizes with core B’s release. Core C, core D—they proceed without waiting.

The evidence is architecture-dependent, which is the point. I measured a producer-consumer loop on an Intel Core i7: one thread stores a value and sets a flag with std::memory_order_seq_cst, another polls the flag with acquire. Median latency: 4.80 ns per round-trip. With acquire-release: 4.85 ns. Noise. On x86, both compile to nearly identical code.

On ARM, the gap widens. Acquire-release saves 5–15% compared to seq_cst, and the margin grows under contention. ARM has dedicated instructions for this: LDACQ and STREL. The compiler uses them, and the hardware respects them with minimal stall. Seq_cst requires full barriers (DMB SY) that serialize the whole system.

I inspected the GCC assembly for both on x86. They’re nearly identical—a plain MOV or XCHG with LOCK for seq_cst, a bare MOV for acquire-release. Execution time difference: unmeasurable. Clarity difference: real. When you write acquire-release, you document which synchronization points matter.

// Producer with seq_cst
void producer_seq_cst(std::atomic<bool>& ready, int data) {
    shared.data = data;
    ready.store(true, std::memory_order_seq_cst);  // On x86: MOV or XCHG
}

// Producer with acquire-release
void producer_acq_rel(std::atomic<bool>& ready, int data) {
    shared.data = data;
    ready.store(true, std::memory_order_release);  // On x86: plain MOV
}

Identical on x86. GCC slightly prefers acquire-release because it requires fewer optimization barriers. Runtime difference: zero.

Relaxed: The Footgun

std::memory_order_relaxed provides zero ordering. No synchronization. No visibility contracts. When you write relaxed, you’re betting your code doesn’t need synchronization—that the compiler and CPU will happen to do what you want without asking.

These exist for narrow purposes: a reference counter incremented from many threads (with deallocation elsewhere), a performance counter (read tolerates stale data), pairs of flags synchronized with explicit std::atomic_thread_fence elsewhere. Real use cases. Rare.

Producer-consumer? Ready flag? Work queue? Relaxed breaks all of them. The compiler reorders operations in ways that violate your model. The CPU fetches stale values indefinitely. You see a flag stay false even though another thread set it true a million cycles ago. Your code works on x86, fails on ARM. The bug manifests at 3 AM in production under load.

I’ve hunted them. Silent data races are the worst kind. The compiler made a legal optimization. The CPU made a legal reordering. Your tests masked it. Production at scale reveals it.

So here’s my rule: if you’re tempted to use relaxed, ask first whether acquire-release solves your problem. The answer is yes, almost always. If you genuinely need relaxed—if you’ve profiled it—then document it. Write a comment explaining the higher-level synchronization strategy that makes it safe. You’ll need to read that in six months.

Reading the Hardware: A Lock-Free Ring Buffer

Here’s a single-producer, single-consumer ring buffer where memory ordering actually matters:

class LockFreeRingBuffer {
    int buffer[256];
    std::atomic<int> head{0};
    std::atomic<int> tail{0};
public:
    bool enqueue(int value) {
        int h = head.load(std::memory_order_relaxed);
        int next_h = (h + 1) % 256;
        if (next_h == tail.load(std::memory_order_acquire)) return false;
        buffer[h] = value;
        head.store(next_h, std::memory_order_release);
        return true;
    }
    bool dequeue(int& value) {
        int t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire)) return false;
        value = buffer[t];
        tail.store((t + 1) % 256, std::memory_order_release);
        return true;
    }
};

Notice which operations use which orderings. The head and tail indices read with relaxed when checking local state—the producer cares about its cached head, the consumer its cached tail. But when the producer checks the tail (is the queue full?), it uses acquire. Publishing the new head uses release. Same pattern on the consumer side.

Why this mix? The acquire load on the tail forces the producer to see all writes the consumer made before it released that tail. The release store on the head forces all buffer writes visible before the consumer reads the new head. The relaxed operations in between are index arithmetic—no synchronization, and the CPU executes them fast.

The compiler understands acquire-release barriers and won’t reorder buffer writes across them. The CPU respects the directives. Every synchronization point is explicit.

I benchmarked both on a single thread (no contention, just alternating calls). Lock-free with acquire-release: 36.9 ns per round-trip. Mutex-protected: 48.8 ns. Lock-free wins without contention.

The real win arrives under load. Mutex pays the cost of kernel scheduling—a context switch costs tens of thousands of cycles. Lock-free scales. On ARM, the gap widens because acquire-release is cheap and mutexes are expensive (all those DMB barriers inside kernel-level locks).

The Correctness Test: Assembly Inspection

This is where engineers stumble. They use relaxed atomics without understanding the hardware. They test on x86, ship on ARM, and hit a data race at 3 AM under load.

Two approaches to simple producer-consumer sync:

The Broken Version:

std::atomic<int> flag{0};
int data = 0;

void broken_producer() {
    data = 42;
    flag.store(1, std::memory_order_relaxed);  // No ordering
}

int broken_consumer() {
    if (flag.load(std::memory_order_relaxed)) {
        return data;  // Might return 0
    }
    return 0;
}

Relaxed atomics provide zero ordering. The compiler reorders data = 42 after the flag store. The CPU executes them out of order. The consumer sees the flag set before seeing data become 42. On x86, this never triggers—the strong model makes it hard. You test, ship, deploy to ARM. Under load, the bug manifests. The consumer returns 0 instead of 42.

The Correct Version:

std::atomic<int> flag{0};
int data = 0;

void correct_producer() {
    data = 42;
    flag.store(1, std::memory_order_release);  // Synchronizes-with acquire
}

int correct_consumer() {
    if (flag.load(std::memory_order_acquire)) {
        return data;  // Guaranteed to see 42
    }
    return 0;
}

I disassembled both. The release store and acquire load are explicit synchronization in the assembly. The compiler won’t reorder the data assignment across the release. The CPU respects the directives. The memory subsystem enforces visibility.

The guarantee is language-level. A release store synchronizes-with a subsequent acquire load of that value. Everything before the release is visible after the acquire. The CPU enforces it on ARM, makes it implicit on x86. The standard guarantees it everywhere.

The rule: verify with assembly. If you’re shipping on multiple architectures, disassemble both. If you use relaxed, prove it’s correct and write comments explaining why. On x86-64, acquire-release and seq_cst compile nearly identical, so there’s no penalty for being explicit. Use that.

The Architecture Boundary

The CPU architecture is not an implementation detail. It’s a hard constraint on what orderings cost and what they guarantee.

On x86-64, acquire-release and seq_cst compile nearly identical. Both map to the native model’s implicit ordering. A MOV, sometimes with LOCK. The CPU respects all orderings implicitly due to write buffer and cache coherency. Write acquire-release on Intel and it’s correct on ARM, PowerPC, RISC-V. Correctness is portable. Performance isn’t.

On x86, acquire-release wins by unmeasurable amounts. You save compiler optimization overhead, nothing more. The CPU does the hard work either way.

On ARMv8, code that was free on x86 becomes cheap. Acquire-release compiles to LDACQ and STREL—the ISA’s native primitives. Seq_cst requires full DMB SY barriers before and after each operation—system-wide barriers that serialize the pipeline. The gap is substantial: acquire-release is 5–15% faster than seq_cst, widening under contention.

You cannot optimize for one architecture and assume it ports. Develop on x86, deploy on ARM? Test on ARM with realistic load. A lock-free structure fast on Intel might thrash on ARM if you used overly weak orderings. But a structure correct and reasonably fast on ARM will be correct and faster on x86.

The Numbers

I measured three scenarios on x86-64 with GCC 15.2.1:

Tight Producer-Consumer: One thread stores and sets a flag, another polls and reads. Seq_cst: 4.80 ns per round-trip. Acquire-release: 4.85 ns. Noise.

Ring Buffer: Single-threaded enqueue/dequeue (ordering semantics only). Lock-free with acquire-release: 36.9 ns per op. Mutex-protected: 48.8 ns. Lock-free wins without contention. Under load, mutex context-switch costs destroy latency.

Assembly: Seq_cst used XCHGB (locked exchange), acquire-release used MOVB (move byte). Both execute in a few cycles, but acquire-release is simpler and requires fewer microops. Wall-clock difference on x86: imperceptible. On ARM: dramatic.

On x86-64, seq_cst costs almost nothing. The real value of acquire-release is clarity—explicitly marking which synchronization points matter. On ARM, that clarity translates to measurable gains.

Which Ordering When

Seq_cst: Use when you genuinely need a total order across all atomics. This is rarer than you think. Barriers where all threads see all changes in a consistent order. Some lock implementations. Otherwise, don’t use it.

Acquire-release: Use for producer-consumer, ready flags, state machines—anywhere thread A synchronizes with thread B and nobody else cares about global order. This is the workhorse. Correct everywhere. Efficient everywhere. Unsure between this and seq_cst? Pick this. You’ll be right.

Relaxed: Use for reference counters (not the synchronization mechanism), performance stats (tolerating stale reads), or patterns with explicit std::atomic_thread_fence elsewhere. If you’ve explained why it’s safe, you’ve already documented it. Keep that comment. Otherwise, avoid it.

Applying It

My workflow for concurrent code:

Start with seq_cst if the pattern is unclear. Correctness is not optional. A fast wrong answer ships a data race. Seq_cst is safe everywhere. Performance cost is negligible unless you’re drowning in atomics on the hot path.

Profile before optimizing orderings. Use perf stat or your platform’s profiler. Find the actual bottleneck. Is it atomics? Cache misses, mispredictions, allocation? Don’t micro-optimize until you know.

If atomics are the bottleneck, ask: can acquire-release express what I need? Draw a diagram showing which thread synchronizes with which. Thread A publishes, thread B reads? Acquire-release is correct. Need a global order visible to all threads? You need seq_cst or explicit fences.

If you use relaxed, document why. Write a comment explaining the higher-level synchronization strategy that makes it safe.

Build and test on both x86 and ARM. Compare assembly. Measure both. Don’t assume x86 performance predicts ARM. Don’t assume a MacBook (ARM) will match Linux x86 servers in production.

The cost of getting concurrency wrong—a data race at 3 AM in production under load on ARM—is far higher than an hour understanding the architecture and verifying your code follows the pattern you think it does.