The -fno-exceptions build flag and int return codes. Every embedded C++ codebase I’ve worked on has both, and the pattern is always the same: return an error code, take an output pointer, hope the caller checks. The compiler won’t stop them if they don’t.
std::expected<T, E> is supposed to fix this. I wanted to know if it actually works on a Cortex-M4 with 256 KB of flash — or if it drags in exception tables and bloats .text.
So I compiled the same sensor pipeline both ways and examined every byte. Computation identical, monadic chains gone at -O2, runtime overhead in the noise — and .text grows 29%. Whether that matters depends on your flash budget.
## The type
std::expected<int, int> stores the value and the error in a union, plus a discriminant byte. On ARM that’s 8 bytes: 4 for the payload, 1 for the flag, 3 for padding. The API biases toward the success case:
```cpp
std::expected<int, int> parse_sensor(int raw) {
    if (raw < 0 || raw > 4095)
        return std::unexpected(-1);
    return raw * 3300 / 4095;
}
```
has_value(), value(), error(), value_or() — none require exceptions, none allocate. The monadic operations — transform(), and_then(), or_else() — chain computations without the if (!result) return staircase. Whether the compiler sees through them is a separate question.
## -fno-exceptions: does it actually compile?
Both GCC 15 and Clang 21 compile the full std::expected API — value(), error(), value_or(), has_value(), transform(), and_then(), or_else() — cleanly with -fno-exceptions -fno-rtti -std=c++23.
Neither object file contains __gxx_personality_v0, _Unwind_Resume, or __cxa_throw. No exception-handling runtime.
One wrinkle worth knowing about: Clang emits three weak std::exception class symbols — vtable, constructor, destructor — from the bad_expected_access<E> class hierarchy. The class is defined in <expected> even though it’s never instantiated under -fno-exceptions. GCC elides these entirely. Neither compiler emits actual unwinding infrastructure; the linker discards the dead class metadata on a bare-metal target.
## ARM codegen: same arithmetic, different ABI
I compiled an ADC sensor parser both ways for Cortex-M4 (arm-linux-gnueabihf-gcc -mcpu=cortex-m4 -mthumb -O2 -fno-exceptions -fno-rtti -std=c++23). The std::expected version:
```cpp
[[gnu::noinline]]
std::expected<int, int> expected_parse_sensor(int raw) {
    if (raw < 0 || raw > 4095)
        return std::unexpected(-1);
    return raw * 3300 / 4095;
}
```
The error-code equivalent:
```cpp
[[gnu::noinline]]
int errcode_parse_sensor(int raw, int* out) {
    if (raw < 0 || raw > 4095)
        return -1;
    *out = raw * 3300 / 4095;
    return 0;
}
```
The arithmetic is byte-for-byte identical. Both emit umull + lsrs #11 — the compiler’s magic-number division for * 3300 / 4095. The computation doesn’t change.
What changes is the calling convention. std::expected<int, int> is an 8-byte aggregate, so ARM returns it via a hidden pointer in r0. The success path writes the value with str and the discriminant with strb; the error path collapses both fields into a single strd. The error-code version returns int directly in r0 and writes through the explicit pointer.
At the call site, the check is structurally identical but encoded differently:
```asm
@ std::expected version
ldrb    r3, [sp, #4]    @ load has_value flag
cbz     r3, .error      @ branch if no value

@ error-code version
cbnz    r0, .error      @ branch if return != 0
```
One tests a byte in memory, the other tests a register. On a Cortex-M4 with zero-wait-state SRAM, that’s a cycle or two. Measurable in code size, not in runtime.
## Monadic chains compile away
The manual version:
```cpp
auto r0 = read_raw(input);
if (!r0) return std::unexpected(r0.error());
auto r1 = scale_value(*r0);
if (!r1) return std::unexpected(r1.error());
auto r2 = clamp_output(*r1);
if (!r2) return std::unexpected(r2.error());
return *r2;
```
The monadic version:
```cpp
return read_raw(input)
    .and_then(scale_value)
    .and_then(clamp_output);
```
On GCC 15 with -O2 -std=c++23, these produce structurally equivalent x86-64 assembly. Both call the same three leaf functions in the same order. The discriminant check — shrq $32, %rax; testb %al, %al; branch on error — is identical in both. The only differences are branch polarity and return-value packing — btsq vs salq+orq — both stuffing the value and discriminant into a single 64-bit register.
The compiler sees through and_then completely — the monadic syntax is a source-level improvement that costs nothing at runtime.
## Code size: where “zero overhead” breaks down
I compiled a four-stage sensor pipeline — ADC read, noise filter, calibration, millivolt conversion — both ways for Cortex-M4 and measured .text with arm-linux-gnueabihf-size:
| Metric | std::expected | Error codes | Delta |
|---|---|---|---|
| .text | 264 bytes | 204 bytes | +60 bytes (+29%) |
| .bss | 24 bytes | 24 bytes | 0 |
60 bytes more. About 15 bytes per function.
The reason is mechanical. std::expected<int, int> is a two-word aggregate: each function takes a hidden pointer, writes both the value and discriminant byte, then returns. The error-code version puts a single int in r0. The extra code is strd instructions for both aggregate fields, ldrb to check the discriminant instead of cbnz on a register, and 16 extra stack bytes per call frame for the temporary struct.
On a Cortex-M4 with 256 KB of flash, 60 bytes is a rounding error. On a Cortex-M0+ with 16 KB, it starts to bite when you have dozens of these pipelines — the cost scales linearly with std::expected-returning functions in the call graph, predictable enough to measure once and budget for.
## Runtime: within noise
Google Benchmark on x86-64 (GCC 15, -O2, i7-4790 @ 3.60 GHz, 5 repetitions):
| Path | std::expected (median) | Error codes (median) | Delta |
|---|---|---|---|
| Happy | 1.854 ns | 1.769 ns | +4.8% |
| Error | 1.541 ns | 1.716 ns | −10.2% |
The happy-path difference is within the coefficient of variation (CV 1.0–1.4%). Noise. The error-path advantage for std::expected is outside the noise band — the error case skips the out-pointer write, just sets the discriminant and returns.
Containerized benchmark on an i7-4790 — Cortex-M4 timings from flash with wait states will look different. The relative ordering holds: the two patterns produce equivalent machine work.
## Patterns I’d use in production
value_or() over value(). With -fno-exceptions, calling .value() on an error-state expected can’t throw bad_expected_access — libstdc++ turns the throw into an abort, and other implementations may do worse. Guard with has_value() first, or use value_or(sentinel):
```cpp
int mv = read_sensor(channel).value_or(-1);
if (mv < 0) enter_safe_state();
```
Typed errors. std::expected<int, int> works, but std::expected<Millivolts, SensorError> catches bugs at compile time. The discriminant byte costs the same.
and_then for multi-stage pipelines. The assembly is equivalent to manual checks, and the data flow is visible in the source:
```cpp
auto result = read_adc(channel)
    .and_then(filter_noise)
    .and_then(calibrate)
    .and_then(to_millivolts);
return result.value_or(-1);
```
Watch aggregate sizes. std::expected<T, E> stores both types plus padding. A 64-byte T means you’re copying 64 bytes through every and_then link. For large payloads, use std::expected<T*, E> with a static buffer, or restructure to operate in place.
## What you get for 60 bytes
The compiler tracks error state in the type system — you can’t silently ignore a failed std::expected the way you ignore a returned -1. The monadic API compiles away entirely, and the runtime cost is in the noise.
The cost is 29% more .text for small functions, driven entirely by the two-word aggregate ABI. On a 256 KB part, I wouldn’t think twice. On a 16 KB part, I’d measure my specific call graph first — but I’d still reach for std::expected before going back to output pointers.
*All assembly and benchmarks: GCC 15.2.1 / Clang 21.1.8 on Fedora 43. ARM cross-compilation via gcc-c++-arm-linux-gnu targeting Cortex-M4 (-mcpu=cortex-m4 -mthumb). x86-64 benchmarks on i7-4790 @ 3.60 GHz, Google Benchmark v1.9.1, 5 repetitions.*