
Lock-Free Queue Implementations Compared: Correctness, Performance, and the Bugs You'll Ship

Modern C++ · dev · April 20, 2026 · 12 min read

A std::mutex-protected std::deque is 12% faster than moodycamel::ConcurrentQueue when contention is low.

I’ll let that sit for a second, because it contradicts the advice I’ve been hearing for a decade. The canonical lock-free queue comparison dates to 2014 and targets C++11. I reran the numbers on GCC 15.2.1 and Clang 21.1.8, targeting an Intel Haswell i7-4790 inside a Fedora 43 container. The results are less flattering for lock-free than I expected.

The mutex is faster than you think

Here’s the number that surprised me most. Under low contention with 100 ns of think-time between operations — one producer and one consumer, each on its own thread — the mutex queue beat moodycamel:

Queue                        Throughput (median)   Per-op time
std::mutex + deque           7.75M items/s         129 ns
moodycamel                   6.79M items/s         147 ns

An uncontended std::mutex on Linux is a single atomic compare-and-swap on the futex word. No syscall. No context switch. Moodycamel’s internal token management, sub-queue selection, and CAS retry loops cost more than that fast path even when nobody else is contending.

The practical consequence: if your system has a producer and consumer running at moderate rates — a network thread enqueuing packets for a processing thread at 1M packets/second — you’re adding complexity and debugging difficulty for worse performance by reaching for a lock-free queue. Profile first. Measure your actual contention rate. If your std::mutex acquisitions are uncontended more than 90% of the time (check /proc/lock_stat or perf lock), the lock-free queue won’t help you.

There’s a memory cost people forget, too. Moodycamel allocates internal sub-queues per producer thread — multiple cache-line-aligned blocks each, plus a reclamation buffer. For dozens of producers, that’s real memory. The mutex queue uses a std::deque, which grows and shrinks dynamically with almost no waste.
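For reference, the mutex baseline is only a few dozen lines. A minimal sketch in that spirit (my naming, not the benchmark source; a blocking pop is included to show where the futex wait happens):

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>

// Minimal mutex-protected queue: unbounded, since std::deque grows
// and shrinks with demand.
template <typename T>
class MutexQueue {
public:
    void push(T v) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push_back(std::move(v));
        }
        cv_.notify_one();
    }

    // Non-blocking pop: returns std::nullopt when the queue is empty.
    std::optional<T> try_pop() {
        std::lock_guard<std::mutex> lk(m_);
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop_front();
        return v;
    }

    // Blocking pop: sleeps on the condvar (and thus the futex) until
    // a producer publishes an item.
    T pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop_front();
        return v;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<T> q_;
};
```

When the queue is uncontended, both `push` and `try_pop` stay on the futex fast path described above — one CAS in, one store out.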

Where lock-free pays: tail latency

So when should you use a lock-free queue? Not for throughput. For tail latency.

I measured 1 million ping-pong latency samples on a custom SPSC ring buffer versus a mutex+condvar queue. The SPSC queue’s latency distribution was tight: 103 µs at p50, 126 µs at p99. A 22% spread. The mutex queue’s median was lower — 5.6 µs — but its p99 exploded to 661 µs. p99.9 hit 881 µs. That’s a 157× spread from median to p99.9.

The absolute SPSC median was higher because the benchmark’s fast producer filled the 65,536-slot ring buffer, creating queuing backpressure. In production, where producer rate matches consumer rate more closely, the SPSC median drops. The point isn’t the absolute numbers — it’s the shape of the distribution. One gives you a narrow, predictable band. The other gives you low medians with catastrophic tails whenever the OS scheduler decides to intervene.

The kernel is the culprit. A mutex queue blocks on a futex when contended, and the OS scheduler gets involved. Scheduler decisions happen on millisecond timescales. A lock-free queue spins or retries a CAS, which stays in userspace. The p99 improvement isn’t because lock-free is “faster” — it’s because lock-free keeps the kernel out of the hot path.

If your SLA is “p50 must be fast,” use the mutex. If your SLA is “p99.9 must be bounded,” you need lock-free.

MPMC throughput: the numbers

Four implementations, tested at 1, 2, and 4 threads: std::mutex + std::deque as the baseline; moodycamel::ConcurrentQueue (header-only, wait-free producers); boost::lockfree::queue (fixed-capacity, Michael & Scott design); and Meta’s folly::MPMCQueue with per-slot turn sequencing.

All compiled with -O2 -std=c++23, 5 repetitions, Google Benchmark v1.9.1. Moodycamel is a single header drop-in. Folly required building Meta’s full library stack, which is not a weekend project.

Queue                        1 thread        2 threads      4 threads
std::mutex + deque           64.1M items/s   4.2M items/s   1.9M items/s
moodycamel                   35.8M items/s   9.1M items/s   3.2M items/s
boost::lockfree              23.9M items/s   3.3M items/s   1.6M items/s
folly::MPMCQueue             19.1M items/s   7.4M items/s   4.2M items/s

At 1 thread, the mutex crushes everything. 64M items/s versus moodycamel’s 36M. An uncontended mutex on Linux is a single futex fast-path — the lock-free queues pay for their CAS loops and internal bookkeeping even when nobody else is contending.

At 2 threads, the picture flips. Mutex throughput collapses to 4.2M items/s — a 15× drop. Moodycamel holds at 9.1M (2.2× faster than mutex). Folly reaches 7.4M.

Push to 4 threads and Folly takes the lead at 4.2M items/s. Moodycamel at 3.2M. Mutex at 1.9M. Coefficient of variation across 5 runs was under 5% for all implementations, so these differences are real.

Here’s what I didn’t expect: moodycamel’s advantage over mutex narrows from 2.2× at 2 threads to 1.7× at 4 threads. The “lock-free scales better” narrative doesn’t hold here. Folly’s per-slot turn sequencing is the only implementation that improved its relative position from 2T to 4T. If you’re scaling past 4 threads and can stomach the Folly dependency, that’s the one to bet on.

SPSC: don’t roll your own

For single-producer, single-consumer, I compared a hand-rolled ring buffer against boost::lockfree::spsc_queue:

Queue                        Throughput (median)
Custom SPSC                  56.5M items/s
boost::lockfree::spsc_queue  56.3M items/s

0.4% difference. Both use the same technique: power-of-two ring buffer with release/acquire index updates. On x86, those release stores are plain mov instructions. There’s no room for a clever custom implementation to find free performance.

Use boost::lockfree::spsc_queue. It’s correct, it’s tested, and it matches custom code to within noise. Roll your own only if you need batch operations, peek, or you’re on an embedded platform where Boost isn’t available.
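The shared technique is small enough to show in full. A toy sketch of the idea — power-of-two ring, release store to publish the producer's index, acquire load to observe it — not the benchmarked implementation:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Toy SPSC ring buffer. N must be a power of two so that index
// wrap-around is a single AND with (N - 1).
template <std::size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "capacity must be a power of two");
public:
    bool push(std::uint64_t v) {
        const std::size_t w = write_.load(std::memory_order_relaxed);
        const std::size_t r = read_.load(std::memory_order_acquire);
        if (w - r >= N) return false;               // full
        buf_[w & (N - 1)] = v;                      // write payload first...
        write_.store(w + 1, std::memory_order_release);  // ...then publish
        return true;
    }

    bool pop(std::uint64_t& v) {
        const std::size_t r = read_.load(std::memory_order_relaxed);
        const std::size_t w = write_.load(std::memory_order_acquire);
        if (r == w) return false;                   // empty
        v = buf_[r & (N - 1)];                      // safe: acquire saw the publish
        read_.store(r + 1, std::memory_order_release);   // free the slot
        return true;
    }

private:
    // Indices on separate cache lines to avoid false sharing.
    alignas(64) std::atomic<std::size_t> write_{0};  // producer-owned
    alignas(64) std::atomic<std::size_t> read_{0};   // consumer-owned
    std::uint64_t buf_[N]{};
};
```

On x86, both release stores in this sketch compile to plain `mov` instructions, which is why the custom and Boost numbers land within noise of each other.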

Correctness: what release/acquire costs on x86-64

Before you roll your own anything, you need to understand what you’re paying for — and on x86, the answer is “almost nothing,” which is exactly the trap.

Both GCC 15 and Clang 21 emit a plain movq for a release store:

; store_release — GCC 15.2.1, -O2
movq    %rdi, _ZL5g_var(%rip)    ; plain store, no fence

A memory_order_seq_cst store emits xchgq, which carries an implicit LOCK prefix — a full memory barrier:

; store_seq_cst — GCC 15.2.1, -O2
xchgq   _ZL5g_var(%rip), %rdi   ; locked exchange = full barrier

On the load side, both memory_order_acquire and memory_order_seq_cst compile to an identical plain movq. x86 loads are acquire-by-default under TSO.
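For reference, the C++ behind those listings looks roughly like this (a sketch; `static` gives the variable internal linkage, which is where the `_ZL5g_var` mangling in the assembly comes from):

```cpp
#include <atomic>

static std::atomic<long> g_var{0};  // internal linkage -> _ZL5g_var

// Compiles to a plain movq on x86-64 at -O2: no fence needed under TSO.
void store_release(long v) {
    g_var.store(v, std::memory_order_release);
}

// Compiles to xchgq: the implicit LOCK prefix is a full barrier.
void store_seq_cst(long v) {
    g_var.store(v, std::memory_order_seq_cst);
}

// Both loads compile to the same plain movq: x86 loads are
// acquire-by-default under TSO.
long load_acquire() { return g_var.load(std::memory_order_acquire); }
long load_seq_cst() { return g_var.load(std::memory_order_seq_cst); }
```

Dropping these four functions into a compiler explorer with `-O2` reproduces the listings above.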

This is why SPSC queues on x86 are so cheap: the release store is one mov with no fence. Producer and consumer synchronize through index updates, and on x86 that costs almost nothing in hardware.

The trap: this lulls you into thinking memory ordering doesn’t matter. It matters enormously on ARM and RISC-V, where a release store emits stlr or an explicit dmb barrier. Code that “works” on x86 with memory_order_relaxed will silently corrupt data on ARM when the store to the payload and the store to the index get reordered by the hardware. The C++ memory model exists precisely because x86’s strong ordering is the exception.

TSan catches what x86 hides

Here’s a deliberately broken SPSC queue. The indices are plain volatile integers instead of std::atomic:

#include <cstddef>
#include <cstdint>

constexpr std::size_t kN = 1024;        // capacity, power of two
constexpr std::size_t kMask = kN - 1;

struct BuggySpsc {
    volatile std::size_t write_idx = 0;  // BUG: not atomic
    volatile std::size_t read_idx  = 0;  // BUG: not atomic
    uint64_t data[kN]{};

    bool push(uint64_t v) {
        std::size_t w = write_idx;
        std::size_t r = read_idx;
        if (w - r >= kN) return false;
        data[w & kMask] = v;          // non-atomic write
        write_idx = w + 1;            // BUG: should be release store
        return true;
    }

    bool pop(uint64_t& v) {
        std::size_t r = read_idx;
        std::size_t w = write_idx;    // BUG: should be acquire load
        if (r == w) return false;
        v = data[r & kMask];          // non-atomic read
        read_idx = r + 1;
        return true;
    }
};

On x86, this will often appear to work. TSO hides the race. Compile with -fsanitize=thread and TSan reports two data races: one on the payload array data[] in pop(), one on read_idx in pop().

TSan is conservative with relaxed atomics. Replace volatile with std::atomic using memory_order_relaxed, and TSan may not fire — it creates implicit happens-before edges when a relaxed load observes a relaxed store’s value. That’s stronger than the C++ standard requires. The non-atomic version above simulates what a weakly-ordered architecture actually sees: a real data race with no synchronization.
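For comparison, here is that relaxed-atomic variant (my reconstruction, not the benchmark source). TSan tends to stay quiet on it, yet nothing orders the payload write against the index publish, so it is still broken on weakly ordered hardware:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kN = 1024;        // capacity, power of two
constexpr std::size_t kMask = kN - 1;

// Same shape as BuggySpsc, indices now std::atomic with relaxed ordering.
// No data race in the C++ sense, but still wrong: the payload store in
// push() is not ordered before the index store.
struct RelaxedSpsc {
    std::atomic<std::size_t> write_idx{0};
    std::atomic<std::size_t> read_idx{0};
    std::uint64_t data[kN]{};

    bool push(std::uint64_t v) {
        const std::size_t w = write_idx.load(std::memory_order_relaxed);
        const std::size_t r = read_idx.load(std::memory_order_relaxed);
        if (w - r >= kN) return false;
        data[w & kMask] = v;            // unordered vs. the publish below
        write_idx.store(w + 1, std::memory_order_relaxed);  // should be release
        return true;
    }

    bool pop(std::uint64_t& v) {
        const std::size_t r = read_idx.load(std::memory_order_relaxed);
        const std::size_t w = write_idx.load(std::memory_order_relaxed);  // should be acquire
        if (r == w) return false;
        v = data[r & kMask];            // may read a stale slot on ARM
        read_idx.store(r + 1, std::memory_order_relaxed);
        return true;
    }
};
```

On ARM, the hardware is free to make the `write_idx` store visible before the `data[]` store, so the consumer can observe the new index and read garbage.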

Run TSan on your lock-free code. Then run it on ARM. Then file the bugs.

CppMem and the herd7 memory model checker can verify your ordering choices against the C++ memory model formally. They’re slow on large programs, but for a 50-line queue, they’ll enumerate every possible execution and tell you whether your release/acquire pairs establish the happens-before edges you think they do.

Benchmark methodology

Microbenchmarks of concurrent data structures lie in predictable ways. The main failure mode: zero think-time, every thread hammering the queue nonstop, measuring contention behavior rather than queuing behavior.

I ran two configurations:

  1. Saturated throughput — pure push+pop in a tight loop. The queue’s raw ceiling under maximum contention.
  2. Low-contention throughput — 100 ns of simulated work between operations. Closer to real producer/consumer patterns.

Both used Google Benchmark with --benchmark_repetitions=5 and real-time measurement. Container ran with 8 vCPUs (Intel Haswell, 3.6 GHz, 32 KB L1d, 8 MB L3).

The latency benchmark is separate — not Google Benchmark. Latency measurement requires timestamping each individual item, which Google Benchmark’s statistical model doesn’t support. The producer enqueues steady_clock::now(), the consumer dequeues and measures the delta, 1M samples sorted for percentile extraction. Closer to how you’d measure latency in production.
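In sketch form, the timestamp-per-item approach looks like this (illustrative only: a mutex+condvar channel stands in for whichever queue is under test, and `measure_latency_ns` is my name, not the benchmark's):

```cpp
#include <algorithm>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

// One-way latency probe: the producer enqueues Clock::now(), the
// consumer records now() - timestamp on dequeue. Returns the sorted
// samples (nanoseconds) so a percentile is a single index.
std::vector<double> measure_latency_ns(int samples) {
    std::mutex m;
    std::condition_variable cv;
    std::deque<Clock::time_point> q;   // the queue under test
    std::vector<double> deltas;
    deltas.reserve(samples);

    std::thread producer([&] {
        for (int i = 0; i < samples; ++i) {
            { std::lock_guard<std::mutex> lk(m); q.push_back(Clock::now()); }
            cv.notify_one();
        }
    });

    for (int i = 0; i < samples; ++i) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty(); });
        const auto sent = q.front();
        q.pop_front();
        lk.unlock();                   // measure outside the lock
        deltas.push_back(std::chrono::duration<double, std::nano>(
            Clock::now() - sent).count());
    }
    producer.join();

    std::sort(deltas.begin(), deltas.end());
    return deltas;
}
```

With the samples sorted, p99 is simply `deltas[static_cast<std::size_t>(0.99 * (deltas.size() - 1))]`, and likewise for any other percentile.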

What I’d actually use

After running all of this, my recommendations are blunt.

std::mutex + std::deque for most things. It’s the fastest at 1 thread, competitive under low contention, and every tool supports it — TSan, Helgrind, gdb, your colleagues’ ability to read the code. It handles unbounded growth gracefully. Start here and move away only when you have profiling data showing contention is your bottleneck.

moodycamel::ConcurrentQueue when you’ve measured contention as the bottleneck at ≥2 threads. Header-only, well-tested, leads at 2T. Watch its memory model — it allocates sub-queues per producer and uses a complex reclamation scheme.

folly::MPMCQueue when scaling to ≥4 threads and you can tolerate the Folly dependency. Its per-slot turn sequencing scaled better than anything else I tested. The build system cost is real.

boost::lockfree::spsc_queue for single-producer, single-consumer. Matches hand-rolled performance, correct, in a library you probably already link.

Skip boost::lockfree::queue for MPMC. It was the slowest lock-free option at every thread count. The Michael & Scott design uses a linked list with CAS on head/tail pointers — every enqueue allocates a node, every dequeue frees one. That allocation traffic thrashes the allocator and pollutes L1 with pointer-chasing. Moodycamel avoids this with pre-allocated sub-queue blocks; Folly avoids it with a flat array of per-slot atomic state. The linked-list lock-free queue is a textbook algorithm, not a production one.

The correctness tax

Every lock-free queue here requires you to understand memory ordering, ABA problems, and reclamation. Get any of those wrong and you’ll ship a bug that works on x86, passes your tests, and explodes on ARM or under high load.

The mutex version requires you to understand std::unique_lock.

If you’re reaching for a lock-free queue because you read that it’s faster, I’d ask: faster at what? In two of the three scenarios I tested, the mutex won — on this hardware, with these compilers.

The strongest argument for lock-free isn’t performance — it’s progress guarantees. A mutex holder that gets preempted blocks every other thread. A lock-free queue guarantees some thread makes progress even when others stall. In real-time audio, market data handlers, anything where priority inversion kills you — that guarantee matters more than throughput numbers. But most server-side C++ doesn’t have real-time constraints. It has throughput targets and latency SLAs, and for those, the numbers above tell the story.

These are Haswell numbers from a containerized workload. Newer CPUs with larger store buffers and faster CAS operations — Zen 5, Raptor Lake — will shift the absolute numbers. The relative story should hold across modern x86: mutex wins at low contention, lock-free wins at tail latency, Folly scales best at high thread counts. ARM is a different conversation, and one I want to run separately.

All benchmarks ran on Intel i7-4790 (Haswell, 4C/8T, 3.6 GHz), Fedora 43, GCC 15.2.1, Clang 21.1.8, Google Benchmark v1.9.1.