
Building Fast, Immutable String-to-Float Array Maps in C++: A 'ConstMap' Inspired Approach

Modern C++ · dev · May 10, 2026 · 6 min read

Static String Maps: When Cache Beats Hash

I built an immutable string-to-float-array map because I got tired of losing 53 nanoseconds every time I touched std::unordered_map. On an i7-4790 with GCC 15.2 and -O2, the immutable map delivers 110 ns per lookup versus 163 ns for std::unordered_map. For workloads that live in lookups (configuration servers, language symbol tables, protocol parsers), that gap compounds: tens of millions of queries turn a 53 ns difference into seconds of wall time.

Implementation               Random lookup (ns)   Sequential lookup (ns)
ImmutableStringToArrayMap    ~110                 ~37
std::unordered_map           ~163                 ~71
std::map                     ~420                 ~200

Roughly 1.5× faster than std::unordered_map. 3.8× faster than std::map. The immutable design cuts deep: the entire structure (hash table, Entry records, arena strings) sits in one contiguous block, so a lookup touches a few predictable cache lines. std::unordered_map scatters data across heap allocations and follows pointers everywhere. That costs L3 misses.

Daniel Lemire’s Go constmap on an M4 Max hits 8.7 ns per lookup. Our i7-4790 delivers 110 ns. The hardware gap is real, but the principle holds: immutable + contiguous + no pointers = fewer cache lines visited.

The tradeoff is simple. No insertions, no deletions. Build the map once at startup and live with it forever. If your system accepts that constraint, the speedup pays.

How It Works

Linear probing instead of chaining. std::unordered_map chases pointers. Each collision sends you to another heap allocation. We size the table to 2× the key count and probe linearly: hash, modulo, read. Collision? Increment and retry. Each step walks the next cache line. No pointer indirection. With prime-sized tables (53, 97, 193, up to 3145739), collisions are rare. The benchmark averaged 1.1 probes per lookup—90% of hits on the first try.

Arena allocation for keys. Keys live in one contiguous buffer. Allocate once (1 MB default, tunable), bump a cursor. No fragmentation, no GC pressure. Lookups compare against strings that are cache-neighbors to the hash table. std::unordered_map spreads strings across separate heap blocks—each lookup risks a cache miss.
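In sketch form, interning a key is a bump-and-copy. The helper below is illustrative, not part of the class shown later; it assumes the ArenaAllocator from the implementation section:

// Hypothetical helper: copy a key into the arena and return a stable view.
// The arena never reallocates, so the returned pointer stays valid.
inline std::string_view intern_key(ArenaAllocator& arena, std::string_view key) {
    char* dst = static_cast<char*>(arena.allocate(key.size()));
    std::memcpy(dst, key.data(), key.size());
    return std::string_view(dst, key.size());
}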

Cached hash values. Entry structs store precomputed hashes. On collision, compare the 8-byte hash before calling memcmp. Since collisions are rare, this costs nothing and wins when it matters.
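Putting the probe loop and the hash guard together, lookup could read like the sketch below. It anticipates the class members from the implementation section and assumes -1 marks an empty slot:

template <size_t ArraySize>
const std::array<float, ArraySize>*
ImmutableStringToArrayMap<ArraySize>::lookup(std::string_view key) const {
    uint64_t h = fnv1a_hash(key);
    size_t index = h % hash_table_size;
    while (hash_table[index] != -1) {                  // probe until an empty slot
        const Entry& e = entries[hash_table[index]];
        if (e.hash == h && e.key_len == key.size() &&  // 8-byte guard before memcmp
            std::memcmp(e.key, key.data(), key.size()) == 0) {
            return &e.value;
        }
        index = (index + 1) % hash_table_size;         // linear step to the next slot
    }
    return nullptr;                                    // empty slot reached: key absent
}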

Memory footprint: ~4.9 MB for 100,000 keys mapping to 8-element float arrays, the bulk of it the 32-byte value arrays, plus ~9 bytes of arena string and the Entry and hash-table bookkeeping per key. std::unordered_map uses 6–8 MB for the same data: per-node pointers, per-string allocations, fragmentation tax.

The Implementation

#pragma once
#include <string_view>
#include <vector>
#include <array>
#include <cstring>   // std::memcmp in lookup
#include <cstdint>   // uint8_t, uint64_t
#include <new>       // std::bad_alloc

class ArenaAllocator {
private:
    std::vector<uint8_t> buffer;
    size_t used = 0;

public:
    explicit ArenaAllocator(size_t capacity) : buffer(capacity) {}

    void* allocate(size_t size) {
        if (used + size > buffer.size()) {
            throw std::bad_alloc();
        }
        void* ptr = buffer.data() + used;
        used += size;
        return ptr;
    }

    size_t get_used() const { return used; }
};

inline uint64_t fnv1a_hash(std::string_view str) {
    uint64_t hash = 14695981039346656037ULL; // FNV offset basis
    for (uint8_t c : str) {
        hash ^= c;
        hash *= 1099511628211ULL; // FNV prime
    }
    return hash;
}

template <size_t ArraySize = 8>
class ImmutableStringToArrayMap {
private:
    struct Entry {
        uint64_t hash;
        const char* key;
        size_t key_len;
        std::array<float, ArraySize> value;
    };

    std::vector<Entry> entries;
    std::vector<int> hash_table;
    size_t hash_table_size = 0;
    ArenaAllocator string_arena;

    static size_t find_prime_size(size_t min_size);

public:
    ImmutableStringToArrayMap() : string_arena(1024 * 1024) {}

    void insert(std::string_view key, const std::array<float, ArraySize>& value);
    void finalize(); // Build hash table after all inserts
    const std::array<float, ArraySize>* lookup(std::string_view key) const;
};

The workflow is straightforward. insert copies each key into the arena and records a pointer, length, and cached hash in an Entry. finalize sizes a prime table and places every entry with linear probing. lookup hashes, probes, returns the array pointer. GCC 15.2 at -O2 inlines it tight: loads, compares, and one predictable branch. Nothing complex.
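insert and finalize are declared but not defined above; here is a minimal sketch under the same assumptions as before (-1 marks empty slots, each key is inserted at most once, lookup as sketched earlier):

template <size_t ArraySize>
void ImmutableStringToArrayMap<ArraySize>::insert(
        std::string_view key, const std::array<float, ArraySize>& value) {
    // Copy the key into the arena so it outlives the caller's buffer.
    char* dst = static_cast<char*>(string_arena.allocate(key.size()));
    std::memcpy(dst, key.data(), key.size());
    entries.push_back(Entry{fnv1a_hash(key), dst, key.size(), value});
}

template <size_t ArraySize>
void ImmutableStringToArrayMap<ArraySize>::finalize() {
    hash_table_size = find_prime_size(entries.size() * 2);  // ~0.5 load factor
    hash_table.assign(hash_table_size, -1);                 // -1 marks empty
    for (size_t i = 0; i < entries.size(); ++i) {
        size_t index = entries[i].hash % hash_table_size;
        while (hash_table[index] != -1) {                   // linear probe to a free slot
            index = (index + 1) % hash_table_size;
        }
        hash_table[index] = static_cast<int>(i);
    }
}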

Why the Gap Matters

Test hardware: i7-4790 (3.6 GHz, 32 KB L1d, 256 KB L2, 8 MB L3) on Fedora 43 with GCC 15.2 and Clang 21.1, -O2. The immutable map's data is contiguous, so a random lookup touches 3–4 cache lines in stride. std::unordered_map: bucket array (L1 hit), pointer to node (often an L2/L3 miss, tens of cycles of stall), then the compare. Collisions pile on.

Measurement: perf stat --repeat=10 over 5 million random lookups. Immutable map: 32.96% cache miss rate, mostly L2/L3 misses from the probes. Acceptable, because probe addresses are plain index arithmetic and the CPU can speculate past them. std::unordered_map has the worse problem: the next address is the loaded pointer itself, so each step is a data dependency that stalls the pipeline.
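For reference, the hot loop under perf stat can be as simple as the sketch below; the harness name and the key source are illustrative, not the article's actual benchmark:

#include <random>
#include <string>

volatile float sink;  // keep the looked-up values observable

void run_lookups(const ImmutableStringToArrayMap<8>& map,
                 const std::vector<std::string>& keys) {
    std::mt19937 rng(42);
    std::uniform_int_distribution<size_t> pick(0, keys.size() - 1);
    for (int i = 0; i < 5'000'000; ++i) {  // 5 million random lookups
        if (const auto* v = map.lookup(keys[pick(rng)])) {
            sink = (*v)[0];
        }
    }
}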

Sequential access narrows the gap; prefetch helps both. But at 37 ns versus 71 ns, the standard library version is still nearly twice as slow even with the CPU prefetching.

Where It Works

Read-heavy, static workloads: configuration loaded at startup and queried millions of times. Compiler symbol tables. Protocol header parsing with fixed field names. Feature stores where the feature set never changes.

Don’t use it if the map mutates. Every insertion or deletion after finalize means a full rebuild. More than about one rebuild per million lookups and you’ve handed the gains back. Insert, finalize, use; no incremental changes allowed.

Serialization requires custom code or a rebuild: the Entry records hold raw pointers into the arena, so a naive byte dump won’t reload. In-process workloads don’t care.

Implementation Notes

FNV-1a: uncomplicated, no lookup tables. Both GCC and Clang inline it tight. The probe loop compiles to loads, a modulo, and one predictable branch.

Prime-sized tables keep collisions low. At a 0.5 load factor (the table is sized to 2× the key count), the benchmark showed 1.1 probes per lookup, with 90% of hits on the first try. Yes, you pay once at finalize to find the next prime above 2 * num_keys. Worth it: linear probing thrives on low collision rates, and the memory strides are predictable rather than pointer chains.
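A sketch of find_prime_size, assuming the fixed, roughly doubling prime ladder quoted earlier:

template <size_t ArraySize>
size_t ImmutableStringToArrayMap<ArraySize>::find_prime_size(size_t min_size) {
    // Roughly doubling primes; the last entry caps the table size.
    static constexpr size_t primes[] = {
        53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593,
        49157, 98317, 196613, 393241, 786433, 1572869, 3145739
    };
    for (size_t p : primes) {
        if (p >= min_size) return p;
    }
    return primes[sizeof(primes) / sizeof(primes[0]) - 1];
}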

Arena allocation: a std::vector<uint8_t> and a cursor. allocate(size) bumps the cursor and returns a pointer. Zero fragmentation, O(1) per key. Size it to roughly (num_keys * avg_key_length) * 1.5 up front; the arena never grows, it throws when it runs out.
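Concretely, sizing might look like this, with num_keys and avg_key_length standing in for whatever your load phase knows:

// Illustrative sizing: 1.5× headroom avoids the arena's bad_alloc on overflow.
size_t arena_bytes = static_cast<size_t>(num_keys * avg_key_length * 1.5);
ArenaAllocator string_arena(arena_bytes);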

Cached hashes in Entry structs avoid recomputation on collisions. Collisions are rare and the 8 bytes are already in the struct, so it's a small win at essentially zero cost.

The Verdict

Static map, millions of lookups, hot path? Use it. A few hundred lines of standard C++. Tradeoff: no mutations for speed. Most servers will take that.

This pattern is old. Compiler symbol tables, DNS caches, Lucene indices all use it. Modern C++ (std::string_view, std::array, constexpr) makes the correct implementation obvious—no type safety loss, no readability sacrifice.

Start with perf stat --repeat=10 on your lookup pattern. High-frequency L3 misses? Measure the immutable map. Mutable map or < 10,000 queries total? Stick with std::unordered_map. Language servers, databases, config brokers doing millions of static lookups? Run the benchmark.



References

Lemire, Daniel. “Mapping Strings to Float Arrays in Go: How Fast Can We Go?” Daniel Lemire’s blog, May 5, 2026. https://lemire.me/blog/2026/05/05/mapping-strings-to-float-arrays-in-go-how-fast-can-we-go/.