The RP2350 has a feature most embedded developers ignore. Two Cortex-M33 cores at 150 MHz, 520 KB of SRAM, $0.80 in quantity — and buried in the ISA, packed arithmetic instructions that process four 8-bit values in a single cycle. Not AVX2. Not even NEON. Four bytes in a 32-bit general-purpose register. That’s your vector width.
I spent a week finding out how far you can push those instructions on a real workload: Sobel edge detection on live QVGA camera frames. The answer was further than I expected.
The instruction set you’re not using
ARMv8-M DSP extensions on the Cortex-M33. The ones that matter for byte-parallel image work:
- UADD8 / USUB8 — unsigned add/subtract of four packed bytes
- UHADD8 — unsigned halving add (averages without overflow)
- USAD8 — sum of absolute differences across four byte pairs
- SMLAD — dual signed multiply-accumulate on packed 16-bit halfwords
- SEL — byte-wise select based on GE flags (set by the packed arithmetic)
No dedicated vector register file. No separate execution unit. Same r0–r12 as everything else. The modesty is the point — on a machine with 520 KB of SRAM, you don’t get NEON.
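You can exercise the lane semantics without a board. A plain C++ reference model of UADD8, four independent modulo-256 adds sharing one word, is a handful of lines (`uadd8_ref` is an illustrative name, not an ACLE intrinsic):

```cpp
#include <cstdint>

// Reference model of UADD8: per-byte-lane add, each lane wraps mod 256.
// Host-side stand-in for testing; on target you'd call __uadd8.
uint32_t uadd8_ref(uint32_t a, uint32_t b) {
    uint32_t r = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint8_t la = (a >> (8 * lane)) & 0xFF;
        uint8_t lb = (b >> (8 * lane)) & 0xFF;
        r |= uint32_t(uint8_t(la + lb)) << (8 * lane);  // wrap, no cross-lane carry
    }
    return r;
}
```

The key property the model captures: an overflowing lane wraps in place and never carries into its neighbor, which is exactly what distinguishes UADD8 from an ordinary 32-bit ADD.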
A QVGA grayscale frame is 320×240 = 76,800 bytes. Four pixels per instruction means 19,200 operations for a full-frame pass. At 150 MHz with single-cycle packed arithmetic, the theoretical floor is ~128 µs per pass. Reality is worse — memory access patterns, pipeline stalls, loop overhead — but it sets the scale. Microseconds, not milliseconds.
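The arithmetic behind that floor, written as checkable constants (names are mine):

```cpp
// Back-of-envelope for the theoretical floor: packed ops per QVGA frame
// and the single-cycle time at 150 MHz. Pure arithmetic from the text.
constexpr double kQvgaBytes = 320.0 * 240.0;      // grayscale, 1 byte per pixel
constexpr double kPackedOps = kQvgaBytes / 4.0;   // 4 pixels per packed op
constexpr double kFloorUs   = kPackedOps / 150.0e6 * 1.0e6;  // ~128 us
```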
Scalar: the version everyone writes first
The obvious 3×3 Sobel. Horizontal and vertical gradients, L1 magnitude approximation:
```cpp
void sobel_scalar(const uint8_t* src, uint8_t* dst,
                  int width, int height, int stride) {
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            const uint8_t* row0 = src + (y - 1) * stride + x;
            const uint8_t* row1 = src + y * stride + x;
            const uint8_t* row2 = src + (y + 1) * stride + x;
            int gx = -row0[-1] + row0[1]
                     - 2 * row1[-1] + 2 * row1[1]
                     - row2[-1] + row2[1];
            int gy = -row0[-1] - 2 * row0[0] - row0[1]
                     + row2[-1] + 2 * row2[0] + row2[1];
            int mag = (gx < 0 ? -gx : gx) + (gy < 0 ? -gy : gy);
            dst[y * stride + x] = mag > 255 ? 255 : static_cast<uint8_t>(mag);
        }
    }
}
```
Compiles to straightforward ldrb / mul / add sequences. On a Cortex-M33 at 150 MHz, a QVGA frame takes [BENCH: C1, time, sobel_scalar_qvga] ms.
At 15 fps, the frame budget is 66 ms. The scalar version fits — but it eats [BENCH: C1, pct, sobel_scalar_pct_budget]% of that budget on edge detection alone. Nothing left for the rest of the pipeline. That’s not a solution; it’s a dead end with good intentions.
Four pixels per cycle
The DSP extension’s packed byte operations let us process the Sobel kernel four pixels wide. Load four consecutive pixels into a 32-bit register, and UADD8 applies the operation to all four byte lanes simultaneously.
The complication is column offsets. A Sobel kernel reads pixels at x-1, x, and x+1. For a 4-wide pass, that means three overlapping loads per row, each shifted by one byte:
```cpp
#include <arm_acle.h> // __uadd8, __usub8, __uqadd8, __sel
#include <cstring>    // memcpy

void sobel_dsp(const uint8_t* src, uint8_t* dst,
               int width, int height, int stride) {
    for (int y = 1; y < height - 1; ++y) {
        const uint8_t* r0 = src + (y - 1) * stride;
        const uint8_t* r1 = src + y * stride;
        const uint8_t* r2 = src + (y + 1) * stride;
        for (int x = 1; x < width - 3; x += 4) {
            // Load 4 pixels from three column positions per row
            uint32_t r0_l, r0_c, r0_r;
            uint32_t r1_l, r1_r;
            uint32_t r2_l, r2_c, r2_r;
            memcpy(&r0_l, r0 + x - 1, 4);
            memcpy(&r0_c, r0 + x, 4);
            memcpy(&r0_r, r0 + x + 1, 4);
            memcpy(&r1_l, r1 + x - 1, 4);
            memcpy(&r1_r, r1 + x + 1, 4);
            memcpy(&r2_l, r2 + x - 1, 4);
            memcpy(&r2_c, r2 + x, 4);
            memcpy(&r2_r, r2 + x + 1, 4);
            // Horizontal gradient: Gx = (right column) - (left column),
            // with the center row weighted 2x. Note the packed sums wrap
            // mod 256, so extreme-contrast neighborhoods can alias.
            uint32_t gx_pos = __uadd8(r0_r, __uadd8(r2_r, __uadd8(r1_r, r1_r)));
            uint32_t gx_neg = __uadd8(r0_l, __uadd8(r2_l, __uadd8(r1_l, r1_l)));
            // Vertical gradient: Gy = (bottom row) - (top row)
            uint32_t gy_pos = __uadd8(r2_l, __uadd8(r2_r, __uadd8(r2_c, r2_c)));
            uint32_t gy_neg = __uadd8(r0_l, __uadd8(r0_r, __uadd8(r0_c, r0_c)));
            // Per-lane |Gx|: compute both differences in a defined order so
            // the *last* USUB8 sets the GE flags that SEL reads
            uint32_t gx_ba = __usub8(gx_neg, gx_pos);
            uint32_t gx_ab = __usub8(gx_pos, gx_neg); // GE[i] = gx_pos >= gx_neg
            uint32_t abs_gx = __sel(gx_ab, gx_ba);
            // Per-lane |Gy|, same pattern
            uint32_t gy_ba = __usub8(gy_neg, gy_pos);
            uint32_t gy_ab = __usub8(gy_pos, gy_neg); // GE[i] = gy_pos >= gy_neg
            uint32_t abs_gy = __sel(gy_ab, gy_ba);
            // Approximate magnitude: |Gx| + |Gy|, saturating to 255 per lane
            uint32_t mag = __uqadd8(abs_gx, abs_gy);
            memcpy(dst + y * stride + x, &mag, 4);
        }
    }
}
```
Every __uadd8 processes four pixel-pairs in one cycle. The __usub8 / __sel pair computes per-byte absolute values without branching — USUB8 sets the GE flags per byte lane, SEL picks from the positive or negative result. Four conditional operations, zero branches.
__uqadd8 at the end is saturating unsigned addition — clamps each byte lane to 255, replacing the scalar’s mag > 255 ? 255 : mag conditional. One instruction, four pixels, no branch.
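The clamp is easy to model on the host (`uqadd8_ref` is an illustrative name, not an intrinsic):

```cpp
#include <cstdint>
#include <algorithm>

// Reference model of UQADD8: per-lane unsigned saturating add.
// Each byte lane clamps at 255 instead of wrapping.
uint32_t uqadd8_ref(uint32_t a, uint32_t b) {
    uint32_t r = 0;
    for (int lane = 0; lane < 4; ++lane) {
        unsigned la = (a >> (8 * lane)) & 0xFF;
        unsigned lb = (b >> (8 * lane)) & 0xFF;
        r |= std::min(la + lb, 255u) << (8 * lane);  // saturate, don't wrap
    }
    return r;
}
```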
What arm-none-eabi-g++ -mcpu=cortex-m33 -mthumb -O2 actually emits
This is the part that convinced me the approach works. The inner loop:
```asm
.L4:
    ldr    r4, [r0, r3]     @ r0_l: 4 bytes from row0, col x-1
    ldr    r5, [r0, r6]     @ r0_r: 4 bytes from row0, col x+1
    ldr    r7, [r8, r3]     @ r1_l: 4 bytes from row1, col x-1
    ldr    r9, [r8, r6]     @ r1_r: 4 bytes from row1, col x+1
    uadd8  r7, r7, r7       @ r1_l * 2 (packed)
    uadd8  r4, r4, r7       @ accumulate into gx_neg
    uadd8  r9, r9, r9       @ r1_r * 2 (packed)
    uadd8  r5, r5, r9       @ accumulate into gx_pos
    @ ... vertical gradient similarly ...
    usub8  r4, r5, r4       @ gx_pos - gx_neg, sets GE flags
    sel    r10, r4, r11     @ |Gx| per byte lane
    uqadd8 r10, r10, r12    @ saturating |Gx| + |Gy|
    str    r10, [r2, r3]    @ store 4 output pixels
    adds   r3, r3, #4
    cmp    r3, lr
    bne    .L4
```
~20 arithmetic/logic instructions plus 8 loads and 1 store per iteration, processing 4 pixels. The scalar version needs ~15 instructions per pixel. Roughly 3x fewer instructions. On the M33, instruction count maps directly to cycle count because there’s no out-of-order execution to blur it.
The memcpy calls in the source look expensive. They’re not. With -O2, GCC recognizes the 4-byte memcpy and lowers it to a single ldr: an unaligned 32-bit load. No function call, no byte-by-byte copy. This is why I use memcpy instead of reinterpret_cast<uint32_t*>: no alignment requirements, no strict-aliasing violations, and identical codegen.
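The pattern is worth wrapping once (`load_u32` is my name for it; any 4-byte memcpy gets the same codegen):

```cpp
#include <cstdint>
#include <cstring>

// Unaligned 32-bit load via memcpy. GCC and Clang at -O2 lower this to a
// single ldr on Cortex-M33 (which permits unaligned word loads): no call,
// no byte loop, and no strict-aliasing violation.
static inline uint32_t load_u32(const uint8_t* p) {
    uint32_t v;
    memcpy(&v, p, 4);
    return v;
}
```

The round trip is endianness-agnostic by construction: whatever byte order the platform uses to store a uint32_t is the order load_u32 reads back.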
The GE flags trick
If you’re coming from x86 SIMD, this is the alien part.
On x86, you’d reach for _mm_abs_epi8 or compute max(a-b, b-a). ARM’s DSP extension doesn’t have a packed absolute-difference instruction. It has USAD8, but that collapses the differences into a single scalar: useful for motion estimation SAD, useless for per-pixel output.
Instead, we exploit the GE (Greater-or-Equal) flags. Unlike the main condition flags (N, Z, C, V), the GE flags have four independent bits — one per byte lane. USUB8 Rd, Rn, Rm computes Rn[i] - Rm[i] for each byte i, stores the unsigned result in Rd, and sets GE[i] if Rn[i] >= Rm[i].
SEL Rd, Rn, Rm then picks byte-by-byte: Rd[i] = GE[i] ? Rn[i] : Rm[i].
So:
```cpp
uint32_t neg = __usub8(b, a);      // b - a per lane (used where b > a)
uint32_t pos = __usub8(a, b);      // a - b per lane; GE[i] = (a[i] >= b[i])
uint32_t result = __sel(pos, neg); // |a - b| in every byte lane
```

Two subtractions and a select. Three instructions for four parallel absolute values. The ordering is load-bearing: SEL reads whatever the most recent flag-setting instruction left in the GE bits, so the subtraction whose flags you want must be sequenced last. Resist the urge to nest both __usub8 calls inside the __sel argument list; C leaves argument evaluation order unspecified, and the compiler is free to set the GE flags from the wrong subtraction.
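The whole mechanism (wrap-around difference, per-lane GE bits, per-lane select) can be mirrored on the host for unit-testing the DSP path. The names below (`usub8_ref`, `sel_ref`, `abs8_ref`) are mine, not ACLE:

```cpp
#include <cstdint>

// Models USUB8: per-lane wrap-around subtraction plus per-lane GE bits.
struct Usub8 { uint32_t diff; uint8_t ge; };  // ge holds one bit per lane

Usub8 usub8_ref(uint32_t a, uint32_t b) {
    Usub8 out{0, 0};
    for (int lane = 0; lane < 4; ++lane) {
        uint8_t la = (a >> (8 * lane)) & 0xFF;
        uint8_t lb = (b >> (8 * lane)) & 0xFF;
        out.diff |= uint32_t(uint8_t(la - lb)) << (8 * lane);
        if (la >= lb) out.ge |= 1u << lane;
    }
    return out;
}

// Models SEL: pick each byte lane from rn where GE[i] is set, else rm.
uint32_t sel_ref(uint32_t rn, uint32_t rm, uint8_t ge) {
    uint32_t r = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t src = (ge & (1u << lane)) ? rn : rm;
        r |= src & (0xFFu << (8 * lane));
    }
    return r;
}

// |a - b| per lane, computed exactly as the intrinsic sequence does
uint32_t abs8_ref(uint32_t a, uint32_t b) {
    uint32_t b_minus_a = usub8_ref(b, a).diff;
    Usub8 a_minus_b = usub8_ref(a, b);  // these GE bits drive the select
    return sel_ref(a_minus_b.diff, b_minus_a, a_minus_b.ge);
}
```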
Memory is the actual constraint
A QVGA grayscale frame is 76,800 bytes. That’s fine — the RP2350 has 520 KB of SRAM, and two full frames (input + output) fit with 368 KB to spare. Comfortable.
VGA (640×480) breaks the budget: 307,200 bytes per frame, two frames is 614 KB. Over.
The fix is line buffering. Process three rows at a time (the minimum for a 3×3 kernel) and slide a 3-row window through the frame. Working set drops to 3 * width bytes: 960 for QVGA, 1,920 for VGA.
```cpp
// Line-buffered Sobel: 3-row sliding window
// Assumes camera DMA delivers rows into a circular 3-line buffer
struct LineBuffer {
    uint8_t rows[3][640];  // Max width 640
    int current = 0;       // Slot holding the newest row
    // offset: 0 = oldest (top row), 1 = middle, 2 = newest (bottom row)
    const uint8_t* row(int offset) const {
        return rows[(current + 1 + offset) % 3];
    }
    void advance() { current = (current + 1) % 3; }
};

void sobel_line_buffered(LineBuffer& buf, uint8_t* dst_row,
                         int width) {
    const uint8_t* r0 = buf.row(0);  // top of the 3x3 window
    const uint8_t* r1 = buf.row(1);  // center
    const uint8_t* r2 = buf.row(2);  // bottom, just delivered by DMA
    for (int x = 1; x < width - 3; x += 4) {
        // Same DSP inner loop as before, operating on the 3-row window
        uint32_t r0_l, r0_r, r1_l, r1_r, r2_l, r2_r, r0_c, r2_c;
        memcpy(&r0_l, r0 + x - 1, 4);
        memcpy(&r0_r, r0 + x + 1, 4);
        // ... identical packed arithmetic ...
    }
}
```
Throughput is identical — the DSP instructions don’t care where the bytes came from. The modular indexing in row() compiles to a subtract-and-compare that gets hoisted out of the inner loop. I checked.
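Ring indexing is easy to get wrong by one, so it's worth a host-side check. A minimal simulation, assuming the convention 0 = oldest (top) through 2 = newest (bottom):

```cpp
#include <cstdint>
#include <cstring>

// Minimal host-side model of the 3-row ring, fed with synthetic rows.
// push() stands in for the DMA row delivery (advance, then overwrite).
struct Ring3 {
    uint8_t rows[3][8];
    int current = 0;  // slot holding the newest row
    const uint8_t* row(int offset) const {  // 0 = oldest .. 2 = newest
        return rows[(current + 1 + offset) % 3];
    }
    void push(const uint8_t* src) {
        current = (current + 1) % 3;
        memcpy(rows[current], src, 8);
    }
};

// Feed n rows (row y filled with the value y) and verify the window order
bool ring3_window_ok(int n) {
    Ring3 buf;
    uint8_t line[8];
    for (int y = 0; y < n; ++y) {
        memset(line, y, sizeof line);
        buf.push(line);
        if (y >= 2) {  // window is valid once three rows have arrived
            if (buf.row(0)[0] != y - 2) return false;  // top
            if (buf.row(1)[0] != y - 1) return false;  // center
            if (buf.row(2)[0] != y) return false;      // bottom
        }
    }
    return true;
}
```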
DMA double-buffering and the per-row budget
Camera peripherals on Cortex-M33 boards (RP2350 with an external OV2640 or OV7670 over PIO) deliver pixels via DMA. Standard pattern: double-buffer the DMA target so the camera writes row N+1 while the CPU processes row N.
The DSP inner loop takes [BENCH: C3, time, sobel_dsp_per_row] µs per QVGA row (320 pixels, 80 iterations of the 4-wide loop). At 15 fps, a QVGA row streams in over ~21 µs of active video, a rate set by the sensor's pixel clock rather than spread evenly across the frame period. Headroom. The CPU finishes each row before the next one lands.
At 30 fps the budget halves to ~10.5 µs per row. DSP still fits. Scalar needs [BENCH: C4, time, scalar_per_row] µs per row — over budget, and no headroom trick saves you.
The compiler won’t help you here
With -O2 -mcpu=cortex-m33, the scalar Sobel compiles to byte-at-a-time ldrb / strb sequences — the auto-vectorizer doesn’t touch DSP packed instructions. No uadd8. No usub8. Not even uqadd8 for the saturation.
This isn’t a GCC bug. The vector width is four bytes, the GE flags are per-lane shared state that doesn’t compose across arbitrary expressions, and the USUB8/SEL absolute-value pattern requires instruction-selection matching no general-purpose vectorizer attempts.
LLVM does slightly better. Clang 19+ emits uadd8 for straightforward packed additions with -mcpu=cortex-m33 -O2, but won’t generate the USUB8/SEL absolute-value pattern.
So you write intrinsics. <arm_acle.h> provides the full set. They’re portable across GCC, Clang, and ARM Compiler (all support ACLE), and they compile to exactly the instructions you specify. Not pretty. Predictable.
The numbers
Pico 2 (RP2350, 150 MHz, Cortex-M33), 320×240 grayscale frame:
| Variant | Per-frame time | FPS capacity | Instructions/pixel |
|---|---|---|---|
| Scalar | [BENCH: C1, time, sobel_scalar_qvga] ms | [BENCH: C1, fps, scalar_fps] | ~15 |
| DSP (4-wide) | [BENCH: C3, time, sobel_dsp_qvga] ms | [BENCH: C3, fps, dsp_fps] | ~5 |
| Speedup | [BENCH: C7, ratio, dsp_vs_scalar]x | — | — |
[BENCH: C7, ratio, dsp_vs_scalar]x matches the instruction-count prediction. With 4-wide operations and fewer instructions per iteration, the theoretical limit is 3–4x. The measured result lands there because the M33’s pipeline is simple enough that instruction count dominates. No reorder buffer, no speculative execution, no surprises. I find that refreshing.
What else fits in four bytes
The packed byte operations generalize beyond Sobel. Anything that works on 8-bit grayscale or interleaved color data:
Thresholding. USUB8 against a packed threshold constant, then SEL between 0xFF and 0x00. Four pixels, three instructions. An entire QVGA frame thresholds in under [BENCH: C8, time, threshold_qvga] ms.
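A host-side sketch of that threshold behavior (`threshold4_ref` is an illustrative stand-in for the USUB8 + SEL pair):

```cpp
#include <cstdint>

// Models the packed threshold: USUB8 sets GE[i] wherever pixel >= threshold,
// then SEL picks 0xFF or 0x00 per byte lane.
uint32_t threshold4_ref(uint32_t px, uint8_t thresh) {
    uint32_t out = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint8_t p = (px >> (8 * lane)) & 0xFF;
        if (p >= thresh) out |= 0xFFu << (8 * lane);
    }
    return out;
}
```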
Box blur. UHADD8 computes the average of two packed values without overflow. Chain two for a 3×1 horizontal blur. The approximation introduces ±1 LSB error per pixel — acceptable for preprocessing before edge detection, and nobody will notice on an 8-bit grayscale frame from a $1.20 camera.
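UHADD8's halving add is likewise a three-line model (`uhadd8_ref`, illustrative name):

```cpp
#include <cstdint>

// Reference model of UHADD8: per-lane (a + b) / 2, truncating.
// The sum is at most 510, so the average can never overflow a byte.
uint32_t uhadd8_ref(uint32_t a, uint32_t b) {
    uint32_t r = 0;
    for (int lane = 0; lane < 4; ++lane) {
        unsigned la = (a >> (8 * lane)) & 0xFF;
        unsigned lb = (b >> (8 * lane)) & 0xFF;
        r |= ((la + lb) >> 1) << (8 * lane);  // truncation is the +/-1 LSB error
    }
    return r;
}
```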
SAD template matching. USAD8 was designed for exactly this. Sum of absolute differences across four byte pairs, accumulated into a scalar. A 16×16 template match against a 320×240 search region: [BENCH: C9, time, sad_match] ms with the DSP instruction, [BENCH: C9, time, sad_scalar] ms scalar.
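And a reference model of USAD8 itself (`usad8_ref`, illustrative name): four absolute differences collapsed into one scalar, which is why it suits SAD accumulation but not per-pixel output:

```cpp
#include <cstdint>
#include <cstdlib>

// Reference model of USAD8: sum of absolute differences across the four
// byte lanes, returned as a single scalar.
uint32_t usad8_ref(uint32_t a, uint32_t b) {
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; ++lane) {
        int la = (a >> (8 * lane)) & 0xFF;
        int lb = (b >> (8 * lane)) & 0xFF;
        sum += std::abs(la - lb);
    }
    return sum;
}
```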
Histograms. No DSP speedup. Histograms are scatter operations — increment hist[pixel_value] — and packed arithmetic can’t help. Still ~0.8 ms for QVGA. Sometimes a problem just doesn’t decompose the way you want.
The $2 BOM
The RP2350 in QFN-60 runs about $0.80 at quantity 1000. Add a $1.20 OV7670 camera module, a VGA sensor from 2008 that still sells because it’s cheap and the datasheet is public. Total BOM for the vision subsystem: $2.
That gets you real-time edge detection at 30 fps, thresholding, template matching with small templates, and basic blob detection. It doesn’t get you neural-network inference, feature descriptors, or anything floating-point-heavy. The M33 has a single-precision FPU, but one lane of it. You’re not running a CNN.
The code is C++ with ACLE intrinsics. Compiles with arm-none-eabi-g++ or Clang, -mcpu=cortex-m33 -mthumb -O2. The intrinsics are standardized, the assembly output is predictable, and the performance model is: count instructions, multiply by clock period. On a chip this constrained, you can actually hold the entire execution model in your head. I wish more platforms let you do that.