The -fno-exceptions build flag and int return codes. Every embedded C++ codebase I’ve worked on has both, and the pattern is always the same: return an error code, take an output pointer, hope the caller checks. The compiler won’t stop them if they don’t.
std::expected<T, E> is supposed to fix this. I wanted to know if it actually works on a Cortex-M4 with 256 KB of flash — or if it drags in exception tables and bloats .text.
So I compiled the same sensor pipeline both ways and examined every byte. Computation identical, monadic chains gone at -O2, runtime overhead in the noise — and .text grows 29%. Whether that matters depends on your flash budget.
## The type
std::expected<int, int> stores the value and the error in a union, plus a discriminant byte. On ARM that’s 8 bytes: 4 for the payload, 1 for the flag, 3 for padding. The API biases toward the success case:
```cpp
std::expected<int, int> parse_sensor(int raw) {
    if (raw < 0 || raw > 4095)
        return std::unexpected(-1);
    return raw * 3300 / 4095;
}
```
has_value(), value(), error(), value_or() — none require exceptions, none allocate. The monadic operations — transform(), and_then(), or_else() — chain computations without the if (!result) return staircase. Whether the compiler sees through them is a separate question.
## -fno-exceptions: does it actually compile?
Both GCC 15 and Clang 21 compile the full std::expected API — value(), error(), value_or(), has_value(), transform(), and_then(), or_else() — cleanly with -fno-exceptions -fno-rtti -std=c++23.
Neither object file contains __gxx_personality_v0, _Unwind_Resume, or __cxa_throw. No exception-handling runtime.
One wrinkle worth knowing about: Clang emits three weak std::exception class symbols — vtable, constructor, destructor — from the bad_expected_access<E> class hierarchy. The class is defined in <expected> even though it’s never instantiated under -fno-exceptions. GCC elides these entirely. Neither compiler emits actual unwinding infrastructure; the linker discards the dead class metadata on a bare-metal target.
## ARM codegen: same arithmetic, different ABI
I compiled an ADC sensor parser both ways for Cortex-M4 (arm-linux-gnueabihf-gcc -mcpu=cortex-m4 -mthumb -O2 -fno-exceptions -fno-rtti -std=c++23). The std::expected version:
```cpp
[[gnu::noinline]]
std::expected<int, int> expected_parse_sensor(int raw) {
    if (raw < 0 || raw > 4095)
        return std::unexpected(-1);
    return raw * 3300 / 4095;
}
```
The error-code equivalent:
```cpp
[[gnu::noinline]]
int errcode_parse_sensor(int raw, int* out) {
    if (raw < 0 || raw > 4095)
        return -1;
    *out = raw * 3300 / 4095;
    return 0;
}
```
The arithmetic is byte-for-byte identical. Both emit umull + lsrs #11 — the compiler’s magic-number division for * 3300 / 4095. The computation doesn’t change.
What changes is the calling convention. std::expected<int, int> is an 8-byte aggregate, so ARM returns it via a hidden pointer in r0. The success path writes the value with str and the discriminant with strb; the error path collapses both fields into a single strd. The error-code version returns int directly in r0 and writes through the explicit pointer.
At the call site, the check is structurally identical but encoded differently:
```asm
@ std::expected version
ldrb    r3, [sp, #4]    @ load has_value flag
cbz     r3, .error      @ branch if no value

@ error-code version
cbnz    r0, .error      @ branch if return != 0
```
One tests a byte in memory, the other tests a register. On a Cortex-M4 with zero-wait-state SRAM, that’s a cycle or two. Measurable in code size, not in runtime.
## Monadic chains compile away
The manual version:
```cpp
auto r0 = read_raw(input);
if (!r0) return std::unexpected(r0.error());
auto r1 = scale_value(*r0);
if (!r1) return std::unexpected(r1.error());
auto r2 = clamp_output(*r1);
if (!r2) return std::unexpected(r2.error());
return *r2;
```
The monadic version:
```cpp
return read_raw(input)
    .and_then(scale_value)
    .and_then(clamp_output);
```
On GCC 15 with -O2 -std=c++23, these produce structurally equivalent x86-64 assembly. Both call the same three leaf functions in the same order. The discriminant check — shrq $32, %rax; testb %al, %al; branch on error — is identical in both. The only differences are branch polarity and return-value packing — btsq vs salq+orq — both stuffing the value and discriminant into a single 64-bit register.
The compiler sees through and_then completely — the monadic syntax is a source-level improvement that costs nothing at runtime.
## Code size: where “zero overhead” breaks down
I compiled a four-stage sensor pipeline — ADC read, noise filter, calibration, millivolt conversion — both ways for Cortex-M4 and measured .text with arm-linux-gnueabihf-size:
| Metric | std::expected | Error codes | Delta |
|---|---|---|---|
| .text | 264 bytes | 204 bytes | +60 bytes (+29%) |
| .bss | 24 bytes | 24 bytes | 0 |
60 bytes more. About 15 bytes per function.
The reason is mechanical. std::expected<int, int> is a two-word aggregate: each function takes a hidden pointer, writes both the value and discriminant byte, then returns. The error-code version puts a single int in r0. The extra code is strd instructions for both aggregate fields, ldrb to check the discriminant instead of cbnz on a register, and 16 extra stack bytes per call frame for the temporary struct.
On a Cortex-M4 with 256 KB of flash, 60 bytes is a rounding error. On a Cortex-M0+ with 16 KB, it starts to bite when you have dozens of these pipelines — the cost scales linearly with std::expected-returning functions in the call graph, predictable enough to measure once and budget for.
## Runtime: within noise
Google Benchmark on x86-64 (GCC 15, -O2, i7-4790 @ 3.60 GHz, 5 repetitions):
| Path | std::expected (median) | Error codes (median) | Delta |
|---|---|---|---|
| Happy | 1.854 ns | 1.769 ns | +4.8% |
| Error | 1.541 ns | 1.716 ns | −10.2% |
The happy-path difference is within the coefficient of variation (CV 1.0–1.4%). Noise. The error-path advantage for std::expected is outside the noise band — the error case skips the out-pointer write, just sets the discriminant and returns.
Containerized benchmark on an i7-4790 — Cortex-M4 timings from flash with wait states will look different. The relative ordering holds: the two patterns produce equivalent machine work.
## Patterns I’d use in production
value_or() over value(). With -fno-exceptions, calling .value() on an error-state expected can’t throw bad_expected_access — libstdc++ turns the throw into an abort, and other implementations may do worse. Guard with has_value() first, or use value_or(sentinel):
```cpp
int mv = read_sensor(channel).value_or(-1);
if (mv < 0) enter_safe_state();
```
Typed errors. std::expected<int, int> works, but std::expected<Millivolts, SensorError> catches bugs at compile time. The discriminant byte costs the same.
and_then for multi-stage pipelines. The assembly is equivalent to manual checks, and the data flow is visible in the source:
```cpp
auto result = read_adc(channel)
    .and_then(filter_noise)
    .and_then(calibrate)
    .and_then(to_millivolts);
return result.value_or(-1);
```
Watch aggregate sizes. std::expected<T, E> stores both types plus padding. A 64-byte T means you’re copying 64 bytes through every and_then link. For large payloads, use std::expected<T*, E> with a static buffer, or restructure to operate in place.
## What you get for 60 bytes
The compiler tracks error state in the type system — you can’t silently ignore a failed std::expected the way you ignore a returned -1. The monadic API compiles away entirely, and the runtime cost is in the noise.
The cost is 29% more .text for small functions, driven entirely by the two-word aggregate ABI. On a 256 KB part, I wouldn’t think twice. On a 16 KB part, I’d measure my specific call graph first — but I’d still reach for std::expected before going back to output pointers.
*All assembly and benchmarks: GCC 15.2.1 / Clang 21.1.8 on Fedora 43. ARM cross-compilation via gcc-c++-arm-linux-gnu targeting Cortex-M4 (-mcpu=cortex-m4 -mthumb). x86-64 benchmarks on i7-4790 @ 3.60 GHz, Google Benchmark v1.9.1, 5 repetitions.*