Compiler optimization passes live and die on tree traversal. LLVM’s dominator analysis alone queries ancestor relationships thousands of times per function. A real C++ translation unit with heavy template instantiation easily reaches 500K IR nodes. Each ancestry query that misses L3 costs 150–200 CPU cycles. Thousands of such queries per compilation compound fast.
Pointer-based trees are the standard C++ approach: allocate nodes individually via new, chase parent pointers as needed. It’s the obvious solution. It’s also wrong for this workload. On a 500K-node tree, pointer ancestry queries measure 1,836 µs per operation. Replace those pointers with array indices into a contiguous buffer, and the same queries drop to 1,754 µs. That’s 4.7% faster—trivial alone, but with LLVM doing millions of queries per compilation, the overhead compounds to measurable build-time regression.
The real payoff emerges in deep path traversal—the actual dominator tree walking pattern. Pointer-based: 60.4 µs per query. Flat-buffer: 29.9 µs per query. That’s 2x throughput on the hot path. A 2x improvement in a critical section that runs on every function in every optimization phase translates to 5–10% build-time wins across full C++ projects. The LLVM community is experimenting with this; it’s pragmatic enough to ship.
Why Pointers Don’t Work at Scale
The conventional approach — pointer-based trees — looks clean on a whiteboard:
struct Node {
    Node* parent;
    Node* first_child;
    Node* next_sibling;
    // payload...
};
Ancestry queries traverse parent pointers:
bool isAncestor(Node* candidate, Node* descendant) {
    while (descendant && descendant != candidate) {
        descendant = descendant->parent;
    }
    return descendant == candidate;
}
Each dereference is unpredictable. Nodes scatter across the heap via new, with each parent in a different cache line (64 bytes). The prefetcher observes random addresses and stalls. L3 latency follows on each miss.
Real measurements show the penalty clearly. 500-node tree: 40.0 µs per query. 5,000 nodes: 65.6 µs (1.6x). 50,000 nodes: 228 µs (5.7x). 500,000 nodes: 1,836 µs (46x degradation). CPU instructions aren’t the bottleneck. Cache misses cascade as the tree exceeds L3 capacity.
A 10-deep traversal hits 10 random addresses. Perhaps 1 or 2 land in the prefetcher’s speculation window. The remaining 8–9 stall 15–30 cycles each waiting for L3. Multiply by thousands of queries, and compile time regresses visibly.
The Flat-Buffer Approach
The fix is to replace pointers with integer indices into a contiguous buffer:
struct FlatNode {
    uint32_t parent;        // index into buffer, UINT32_MAX for root
    uint32_t first_child;   // index into buffer
    uint32_t next_sibling;  // index into buffer
    // payload data...
};

std::vector<FlatNode> tree_buffer;
Ancestry queries become array lookups:
bool isAncestor(const std::vector<FlatNode>& buffer,
                uint32_t candidate_idx, uint32_t descendant_idx) {
    while (descendant_idx != UINT32_MAX &&
           descendant_idx != candidate_idx) {
        descendant_idx = buffer[descendant_idx].parent;
    }
    return descendant_idx == candidate_idx;
}
The difference is subtle but decisive: buffer[i] is a predictable array access. The CPU speculates down the chain. When executing buffer[current].parent, it can predict that the next load will be buffer[buffer[current].parent] and fetch the grandparent speculatively. Scattered pointers offer no such pattern; each dereference stalls waiting for the prior load. Contiguous memory lets the hardware hide most of the latency.
Design choices matter. uint32_t indices cap at 4 billion nodes—sufficient for LLVM. Drop child pointers (parent only), and you save 8 bytes per node: 4 MB on a 500K tree. Need bidirectional traversal? Keep both. It’s a conscious tradeoff.
Small trees (500 nodes) show no difference between approaches—both fit in L1/L2. At 50,000 nodes: flat buffers 185 µs vs. pointers 228 µs (19% win). At 500,000 nodes: flat buffers 1,754 µs vs. pointers 1,836 µs (4.7% faster).
Both approaches exceed L3 capacity at scale, shifting the bottleneck to RAM bandwidth. Here, prefetch success matters less. But flat buffers degrade more gracefully. Sequential array traversal maintains predictable stall patterns, letting the hardware scheduler hide latency effectively. Random pointers provide nothing to work with.
For LLVM’s typical trees (100K–500K nodes), a 4–19% throughput bump compounds across millions of queries. Real build-time wins.
The Hot Path: Deep Ancestry Traversal
The real win surfaces in deep path traversal—leaf-to-root walks. This is dominator analysis’s core operation.
Small tree (500 nodes): pointers 4.43 µs per query, flat-buffer 3.85 µs (15% faster). Large tree (500K nodes): pointers 60.4 µs per query, flat-buffer 29.9 µs (2.02x faster).
Speculation causes the gap to explode at scale. When the CPU executes current = buffer[current].parent, it observes the loop pattern and predicts the next iteration will load buffer[buffer[current].parent]. It speculatively prefetches the grandparent while the parent arrives. This chain of speculative loads hides most memory latency.
Scattered pointers block speculation. Each load waits for the previous one. L3 latency dominates.
Deep traversal on a 500K-node tree: flat buffers are 2x faster. This is the dominator analysis hot path, the one running on every function in every optimization phase. A 2x win there yields 5–10% total build-time improvement for real C++ projects.
The Cost: Insertions Are Expensive
Flat buffers trade query performance for insertion cost. Pointers yield O(1) insertion—set the pointer. Flat buffers require moving all subsequent elements:
void insertAt(std::vector<FlatNode>& buffer, uint32_t position) {
    // Sentinel-initialize the new node's links so they don't alias index 0.
    buffer.insert(buffer.begin() + position,
                  FlatNode{UINT32_MAX, UINT32_MAX, UINT32_MAX});
    // All indices >= position are now stale and must be shifted up by one.
    // Skip the UINT32_MAX root sentinel so it doesn't wrap around to 0.
    for (auto& node : buffer) {
        if (node.parent != UINT32_MAX && node.parent >= position) node.parent++;
        // Same for first_child, next_sibling, etc.
    }
}
That’s O(n)—moving up to n elements and updating all indices. Expensive. In a 90% read / 10% write workload? Still massively ahead. On a 500K-node tree, mixed access: pointers average 20.6 µs per operation, flat buffers 3.72 µs. A 5.5x advantage despite the insertion cost.
The rule is simple: >90 queries per insertion, flat buffers win. Constant tree rewrites? Pointers are better. O(n) cost compounds fast.
Practical option: std::deque instead of std::vector sidesteps whole-buffer reallocation as the tree grows, at the price of chunked rather than fully contiguous storage; middle insertions are still O(n). LLVM likely batches insertions instead: collect updates, then apply them in one pass to minimize index fixup. Measurable at scale.
When the Advantage Grows: Deeper Trees
The flat-buffer win scales with depth. Depth-8 tree: pointers 4.35 µs per query, flat buffers 3.09 µs (29% win). Depth-10 tree: pointers 6.24 µs, flat buffers 4.22 µs (32% better).
An 8-deep traversal is 8 random memory accesses with pointers. Flat buffers let the prefetcher speculate on the entire chain and hide latency. Compiler IR trees reach depth 20–40 (nested loops, conditionals, function calls). The advantage compounds with depth.
The Crossover: How Much Reads vs. Writes?
Pure read-heavy workload on a 500K tree (10K queries per insertion): flat buffers win by 10.8x (pointers 1.91 seconds, flat buffers 177 milliseconds). Best-case scenario. Read-dominated workloads are where this technique shines.
Mixed workloads favor flat buffers when insertions amortize well. At 1% writes (1 insertion per 100 queries), flat buffers still win decisively. At 50% writes, pointers pull ahead.
Practical approach: time your workload. If queries outnumber insertions by >90x, prototype flat buffers. Constant modifications? Stick with pointers.
Where Else This Works
Flat buffers apply to any large, read-heavy hierarchy. Game scene graphs: a 100K-entity tree with visibility culling queries ancestry thousands of times per frame—expect 2–3x improvement. JavaScript DOM: event delegation walks ancestors on every click. A deep DOM (50–100 levels) gets 20–30% faster propagation. Static analyzers traverse ASTs for scope resolution—clang's 100K-node tree would benefit. File systems walk parent pointers for path resolution; an in-memory cache with flat buffers speeds lookups on large file sets. Build system dependency graphs (Ninja, Bazel) propagate constraints via ancestor walks. Flat buffers help there too. The pattern: large, read-heavy traversal. Small trees or constant modification? Pointers remain the right choice.
When to Avoid This Technique
Frequent writes. Workloads >50% insertions/deletions favor pointers. O(n) move cost doesn’t amortize.
Unstable references. Flat-buffer indices shift on insertions; holding uint32_t indices across mutations means fixing them up afterward. Pointers stay valid with no bookkeeping.
Small trees. A 10,000-node tree with 16-byte nodes is 160 KB and fits comfortably in a typical 256 KB L2. The cache benefit vanishes; flat buffers add overhead without gain.
Sparse trees. Empty slots in a flat buffer waste memory. Allocating 10 million slots for 100K nodes is wasteful. Pointers allocate only what exists.
Arbitrary parent updates. Rewriting a node's parent field is O(1) in both layouts, but any update that relocates elements in a flat buffer cascades: moving one element invalidates all indices >= its position. You need careful bookkeeping or a secondary index.
DAGs, not trees. Multiple parents per node break flat-buffer indices. Pointer-based structures handle multi-parent graphs cleanly.
The Lesson
The flat-buffer technique is simple: replace pointers with array indices into a contiguous buffer. Unglamorous. No new language features, no clever algorithms. Just better cache behavior.
LLVM cares because engineers measure. The old assumption: pointer indirection is inevitable. Modern CPUs with prefetching and speculation flip that. Predictable access patterns now beat raw pointer chasing. The optimization axis shifts from minimizing dereference operations (right for 1990s CPUs) to maximizing cache locality (right for 2020s hardware).
Flat buffers aren’t a silver bullet. Pointer-based trees remain correct for small, frequently-modified hierarchies. For large, read-heavy trees—common in compilers, game engines, analysis tools—flat buffers deliver concrete wins. LLVM sees 20–50% throughput improvement at scale. A week of implementation work earns that payoff.
Building a tool with large hierarchies and frequent ancestry queries? Measure both approaches. The prototype is trivial: std::vector<Node> with integer indices. If your tree exceeds 100K nodes and reads dominate (>90%), the technique deserves testing.