A minimal, end-to-end tutorial for deploying a neural network on the RP2350 microcontroller using PyTorch and ExecuTorch. This project bridges the gap between cloud-based ML frameworks and bare-metal embedded systems.
By completing this tutorial, you will understand:
- PyTorch Fundamentals — How tensors flow through a computational graph
- Model Export — Converting dynamic Python to a static, inspectable graph
- ExecuTorch Lowering — Transforming that graph for embedded execution
- Memory Planning — Pre-calculating every byte of RAM before runtime
- Embedded Deployment — Running inference without an OS or heap allocator
Most ML tutorials stop at model.predict(). On embedded systems, we must go much further. Here's the complete journey:
┌─────────────────────────────────────────────────────────────────────────────┐
│ DEVELOPMENT MACHINE (Python) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ PyTorch │ │ Export │ │ ExecuTorch │ │
│ │ Model │ ──▶ │ Graph │ ──▶ │ Lowering │ │
│ │ (Dynamic) │ │ (Static) │ │ (Optimize) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │
│ │ nn.Module │ .pte file │
│ │ Python code │ (Flatbuffer) │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Training │ │ C Header │ │
│ │ Loop │ │ (bytes[]) │ │
│ └──────────────┘ └──────────────┘ │
│ │ │
└─────────────────────────────────────────────────────│───────────────────────┘
│
│ Cross-compile
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PICO 2 W (Bare Metal C++) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ ExecuTorch │ │ Method │ │ Output │ │
│ │ Runtime │ ──▶ │ Execute │ ──▶ │ (sin(x)) │ │
│ │ (~50 KB) │ │ (Kernels) │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
PyTorch operates in two fundamentally different modes. Understanding when and why to use each is essential.
Eager Mode (default) — Operations execute immediately as Python runs:
import torch

x = torch.tensor([1.0, 2.0])
y = torch.tensor([3.0, 4.0])
z = x + y  # Addition happens RIGHT NOW, result stored in z

Graph Mode — Operations are recorded first, then executed later:
Eager Mode Graph Mode
┌───────────────┐ ┌───────────────────────┐
│ Python │ │ Python │
│ x = ... │ │ x = placeholder │
│ y = ... │ │ y = placeholder │
│ z = x + y │◀─ executes │ z = add(x, y) │◀─ records
│ print(z) │ instantly │ return z │ graph
└───────────────┘ └───────────────────────┘
│
▼ export
┌───────────────────────┐
│ Graph IR │
│ ┌───┐ ┌───┐ │
│ │ x │───▶│add│──▶ z │
│ └───┘ ▲ └───┘ │
│ ┌───┐ │ │
│ │ y │──┘ │
│ └───┘ │
└───────────────────────┘
Why Graph Mode for Embedded?
- No Python interpreter needed at runtime
- The graph can be analyzed, optimized, and memory-planned ahead of time
- All operations are known before execution begins
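The same contrast, shown with a minimal torch.export sketch (the module and values here are illustrative, not the project's model):

```python
import torch

class AddModule(torch.nn.Module):
    def forward(self, x, y):
        return x + y

x = torch.tensor([1.0, 2.0])
y = torch.tensor([3.0, 4.0])

# Eager mode: the addition runs immediately.
print(AddModule()(x, y))            # tensor([4., 6.])

# Graph mode: record the computation first, execute it later.
exported = torch.export.export(AddModule(), (x, y))
print(exported.graph)               # placeholders -> aten.add.Tensor -> output
print(exported.module()(x, y))      # tensor([4., 6.]), now replayed from the graph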
A tensor is a multi-dimensional array with metadata. Every tensor carries:
┌─────────────────────────────────────────────────────────────┐
│ Tensor │
├─────────────────────────────────────────────────────────────┤
│ Data Buffer: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0] │
│ ▲ │
│ │ (contiguous memory) │
├─────────────────┼───────────────────────────────────────────┤
│ Shape: │ (2, 3) ─▶ 2 rows, 3 columns │
│ Stride: │ (3, 1) ─▶ jump 3 to next row │
│ Dtype: │ float32 ─▶ 4 bytes per element │
│ Device: │ cpu ─▶ where data lives │
└─────────────────┴───────────────────────────────────────────┘
Logical View (2x3 matrix): Memory Layout (flat):
┌─────┬─────┬─────┐ ┌───┬───┬───┬───┬───┬───┐
│ 1.0 │ 2.0 │ 3.0 │ row 0 │1.0│2.0│3.0│4.0│5.0│6.0│
├─────┼─────┼─────┤ └───┴───┴───┴───┴───┴───┘
│ 4.0 │ 5.0 │ 6.0 │ row 1 0 1 2 3 4 5
└─────┴─────┴─────┘
Stride determines memory traversal. To access element [1, 2] (value 6.0):
- Row offset: 1 × stride[0] = 1 × 3 = 3
- Column offset: 2 × stride[1] = 2 × 1 = 2
- Final offset: 3 + 2 = 5 → buffer[5] = 6.0
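The same bookkeeping, verified in PyTorch (a small sketch; the values match the figure above):

```python
import torch

t = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

print(t.shape)     # torch.Size([2, 3])
print(t.stride())  # (3, 1)
print(t.dtype)     # torch.float32

# Manual offset computation for element [1, 2]:
row, col = 1, 2
offset = row * t.stride()[0] + col * t.stride()[1]   # 1*3 + 2*1 = 5
print(t.flatten()[offset])                            # tensor(6.), same as t[1, 2]
```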
torch.export traces your model by running it with example inputs and recording every operation:
torch.export.export(model, example_inputs)
│
▼
┌────────────────────────────────────────────────────────────────────────────┐
│ ATen Dialect │
│ (All 2000+ PyTorch operators available) │
├────────────────────────────────────────────────────────────────────────────┤
│ graph(): │
│ %x : Tensor = placeholder[target=x] │
│ %weight : Tensor = get_attr[target=fc1.weight] │
│ %bias : Tensor = get_attr[target=fc1.bias] │
│ %linear : Tensor = call_function[target=aten.linear](%x, %weight, %bias)
│ %relu : Tensor = call_function[target=aten.relu](%linear) │
│ return (%relu,) │
└────────────────────────────────────────────────────────────────────────────┘
What gets captured:
- The exact sequence of tensor operations
- All weight tensors and their values (frozen at export time)
- Shape information for every intermediate tensor
What does NOT get captured:
- Python control flow (if/else based on tensor values)
- Side effects (print statements, file I/O)
- Dynamic Python logic
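A minimal sketch of what the capture looks like in practice (the module here is illustrative, not the project's model):

```python
import torch

class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(1, 16)

    def forward(self, x):
        return torch.relu(self.fc1(x))

ep = torch.export.export(TinyNet(), (torch.randn(1, 1),))
print(ep.graph)            # placeholder -> aten.linear -> aten.relu -> output
print(ep.graph_signature)  # fc1.weight / fc1.bias lifted as frozen parameters
```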
ATen (the "A TENsor" library) has thousands of operators. Many are "convenience" operators built from simpler primitives. Decomposition breaks them down:
Before Decomposition: After Decomposition:
┌─────────────────────┐ ┌─────────────────────────────────┐
│ aten.linear │ ──▶ │ aten.t (transpose) │
│ (W, x, bias) │ │ aten.mm (matrix multiply) │
│ │ │ aten.add (add bias) │
└─────────────────────┘ └─────────────────────────────────┘
Operator Count: ~2000 Operator Count: ~180 (Core ATen)
Why decompose?
- Smaller operator set = smaller runtime binary
- Backend developers implement fewer kernels
- More optimization opportunities (fuse patterns across decomposed ops)
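You can watch the decomposition happen on an ExportedProgram (a sketch continuing the ep from the capture example above; exact op names vary across PyTorch versions):

```python
# The default decomposition table targets the Core ATen operator set.
core_ep = ep.run_decompositions()

before = {str(n.target) for n in ep.graph.nodes if n.op == "call_function"}
after = {str(n.target) for n in core_ep.graph.nodes if n.op == "call_function"}
print(before)  # e.g. {'aten.linear.default', 'aten.relu.default'}
print(after)   # e.g. {'aten.permute.default', 'aten.addmm.default', 'aten.relu.default'}
```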
to_edge() converts the graph for edge devices. Key changes:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Edge Dialect │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. DTYPE SPECIALIZATION │
│ ───────────────────── │
│ Operators become dtype-specific: │
│ │
│ aten.add(Tensor, Tensor) ──▶ aten.add.float32(Tensor, Tensor) │
│ │
│ 2. SCALAR TO TENSOR │
│ ──────────────────── │
│ All scalars become 0-dimensional tensors: │
│ │
│ x + 2.0 ──▶ x + tensor([2.0]) │
│ │
│ 3. OUT-VARIANT CONVERSION │
│ ───────────────────────── │
│ Operators take pre-allocated output buffers: │
│ │
│ z = add(x, y) ──▶ add(x, y, out=z_buffer) │
│ (allocates z) (writes to existing buffer) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
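In code, this lowering is a single call (a sketch assuming the executorch Python package, with ep being the ExportedProgram from the export step):

```python
from executorch.exir import to_edge

edge = to_edge(ep)                     # ATen dialect -> Edge dialect
print(edge.exported_program().graph)   # ops now appear as edge-dialect variants
```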
This is where ExecuTorch differs fundamentally from PyTorch. Before any code runs on the device, the memory planner:
- Analyzes the lifetime of every tensor (when it's created, when it's last used)
- Calculates the size of every tensor
- Assigns offsets into a pre-allocated arena
Timeline of Tensor Lifetimes:
Operation: Linear1 ReLU1 Linear2 ReLU2 Output
────────────────────────────────────────────▶ time
│ │ │ │
Tensor t1: [=========] │ │ created by Linear1
Tensor t2: [=========] │ created by ReLU1
Tensor t3: [=========] created by Linear2
Tensor t4: [=========] created by ReLU2
Memory Arena Layout (offsets pre-computed):
Offset: 0 256 512 768 1024
├────────┼────────┼────────┼────────┤
│ t1 │ t2 │ t3 │ t4 │
│ reuse │ reuse │ │ │
└────────┴────────┴────────┴────────┘
│
▼
t1's and t2's slots can be reused by t3 and t4
(non-overlapping lifetimes shrink the arena)
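The core idea can be sketched in a few lines of Python (conceptual only; this is not the ExecuTorch planner's actual algorithm):

```python
# Conceptual sketch: tensors whose lifetimes do not overlap may share an offset.
def plan_memory(tensors):
    """tensors: list of (name, size_bytes, first_use, last_use)."""
    placements = {}      # name -> arena offset
    live = []            # (last_use, offset, size) of tensors already placed
    arena_size = 0
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        # Reuse the slot of a tensor that is already dead when this one is created.
        slot = next((s for s in live if s[0] < first and s[2] >= size), None)
        if slot:
            live.remove(slot)
            offset = slot[1]
        else:
            offset = arena_size
            arena_size += size
        placements[name] = offset
        live.append((last, offset, size))
    return placements, arena_size

# Lifetimes from the timeline above (time steps 0..4):
plan, total = plan_memory([("t1", 256, 0, 1), ("t2", 256, 1, 2),
                           ("t3", 256, 2, 3), ("t4", 256, 3, 4)])
print(plan, total)   # {'t1': 0, 't2': 256, 't3': 0, 't4': 256} 512
```

With these lifetimes, t3 reuses t1's slot and t4 reuses t2's, so 512 bytes suffice instead of 1024.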
The Memory Plan is Embedded in the .pte File:
┌─────────────────────────────────────────────────────────────┐
│ .pte File Contents │
├─────────────────────────────────────────────────────────────┤
│ • Instructions (which operators to call, in what order) │
│ • Weight data (frozen model parameters) │
│ • Memory map (offset assignments for every tensor) │
│ • Kernel references (which C++ functions to invoke) │
└─────────────────────────────────────────────────────────────┘
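The final lowering and serialization steps look like this (a sketch; edge is the EdgeProgramManager from the to_edge() sketch earlier):

```python
executorch_program = edge.to_executorch()    # runs memory planning, out-variant conversion, etc.

with open("sine_model.pte", "wb") as f:
    f.write(executorch_program.buffer)       # flatbuffer: instructions, weights, memory map
```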
The RP2350 has severe constraints. Understanding them is essential:
┌─────────────────────────────────────────────────────────────────────────────┐
│ RP2350 MEMORY MAP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ FLASH (4MB) SRAM (520KB) │
│ ──────────── ──────────── │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ Program Code │ │ Stack │ ← grows down │
│ │ (.text section) │ │ │ │
│ │ ~50KB runtime │ ├────────────────────┤ │
│ ├────────────────────┤ │ Heap (UNUSED) │ ← we avoid this │
│ │ Read-Only Data │ │ │ │
│ │ (.rodata) │ ├────────────────────┤ │
│ │ - Model weights │─────────────│►Planned Memory │ ← our arena │
│ │ - .pte payload │ XIP │ Arena │ │
│ │ ~3KB for this model│ (execute │ ~2KB activations │ │
│ └────────────────────┘ in place) ├────────────────────┤ │
│ │ Input/Output │ │
│ │ Buffers │ │
│ │ (provided by app) │ │
│ └────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
XIP = Execute In Place: Code runs directly from Flash, no copy to RAM needed
ExecuTorch uses a hierarchical memory allocation system:
┌─────────────────────────────────┐
│ MemoryManager │
└─────────────────────────────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Method │ │ Planned │ │ Temporary │
│ Allocator │ │ Memory │ │ Allocator │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
▼ ▼ ▼
Metadata & Activation Scratch space
control flow tensors for kernels
In practice (main.cpp):

// Fixed-size buffers, no malloc() at runtime
static uint8_t method_pool[32 * 1024];      // 32 KB for method metadata
static uint8_t activation_pool[16 * 1024];  // 16 KB for tensor activations

MemoryAllocator method_alloc(sizeof(method_pool), method_pool);
HierarchicalAllocator planned({activation_pool, sizeof(activation_pool)});
MemoryManager memory_manager(&method_alloc, &planned);

The runtime must know how to execute each operator. This mapping is the Kernel Registry:
┌────────────────────────────────────────────────────────────────────────────┐
│ Kernel Registry │
├──────────────────────┬─────────────────────────────────────────────────────┤
│ Operator Name │ C++ Implementation │
├──────────────────────┼─────────────────────────────────────────────────────┤
│ aten.linear.out │ → torch::executor::linear_out(...) │
│ aten.relu.out │ → torch::executor::relu_out(...) │
│ aten.add.out │ → torch::executor::add_out(...) │
│ ... │ ... │
└──────────────────────┴─────────────────────────────────────────────────────┘
Selective Build Problem:
- Full kernel library = megabytes of code
- Your model uses 3 operators = you only need 3 kernels
- CMake's EXECUTORCH_SELECT_OPS_LIST builds only what's needed
# CMakeLists.txt
set(EXECUTORCH_SELECT_OPS_LIST "aten::linear.out,aten::relu.out,aten::add.out")

Our model learns to approximate sin(x) for x ∈ [0, 2π]:
Input Hidden Layer 1 Hidden Layer 2 Output
│ (16 neurons) (16 neurons) │
│ │
x ──┬──▶ [Linear] ──▶ [ReLU] ──▶ [Linear] ──▶ [ReLU] ──▶ [Linear] ──▶ ŷ
│ │ │ │
│ ▼ ▼ ▼
│ W₁: (1,16) W₂: (16,16) W₃: (16,1)
│ b₁: (16,) b₂: (16,) b₃: (1,)
│ │
└──────────────────────────────────────────────────────┘
Total: 321 parameters
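A sketch of that architecture in PyTorch (the layer layout is illustrative; see train_and_export.py for the actual definition):

```python
import torch

class SineNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(1, 16), torch.nn.ReLU(),
            torch.nn.Linear(16, 16), torch.nn.ReLU(),
            torch.nn.Linear(16, 1),
        )

    def forward(self, x):
        return self.net(x)

# (1*16 + 16) + (16*16 + 16) + (16*1 + 1) = 321
print(sum(p.numel() for p in SineNet().parameters()))   # 321
```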
Universal Approximation Theorem: A neural network with at least one hidden layer and enough neurons can approximate any continuous function on a bounded interval to arbitrary precision. We're using this to learn sin(x) on [0, 2π].
ReLU (Rectified Linear Unit) outputs max(0, x). A piecewise linear function:
ReLU(x): sin(x): Approximation:
│ / │ ╭──╮ │ /\
│ / │ / \ │ / \
────┼──/─── ─┼──/──────\── ─┼──/────\──
│ │ \ │ \
│ │ ╰── │ \/
With multiple ReLUs and learned weights, we can construct a piecewise linear approximation that closely follows the sine curve.
executorch-sine-pico2w/
├── pyproject.toml # Python dependencies (managed by uv)
├── README.md # This document
├── bootstrap.py # Downloads ARM toolchain, sets up environment
├── setup_toolchain.py # Configures cross-compilation paths
├── train_and_export.py # Defines model, trains, exports to .pte
├── build_firmware.py # Cross-compiles C++ runtime + model
├── main.cpp # Pico 2 W entry point, runs inference loop
├── CMakeLists.txt # Build configuration for embedded target
├── pte_to_header.py # Converts .pte binary to C byte array
└── serial_plotter.py # Visualizes predictions over USB serial
train_and_export.py build_firmware.py
│ │
▼ ▼
┌───────────────┐ ┌──────────────────┐
│ 1. Define MLP │ │ 4. CMake config │
│ 2. Train loop │ │ (ARM target) │
│ 3. Export: │ │ 5. Compile: │
│ - capture │ │ - runtime │
│ - to_edge │ │ - kernels │
│ - to_exec │ │ - main.cpp │
│ 4. Serialize │ │ 6. Link .uf2 │
└───────┬───────┘ └────────┬─────────┘
│ │
▼ ▼
sine_model.pte ──▶ pte_to_header.py ──▶ model_data.h
│ │
│ ▼
│ sine_predictor.uf2
│ │
└─────────────────────────────────────┘
│
▼
Flash to Pico 2 W
- Python 3.12+ with uv package manager
- Raspberry Pi Pico 2 W
- USB cable
# Step 1: Bootstrap environment and toolchain
uv run python bootstrap.py

# Step 2: Train model and export to .pte
uv run python train_and_export.py

# Step 3: Build firmware
uv run python build_firmware.py

# Step 4: Flash to device
#   Hold BOOTSEL button, connect USB, release
#   Copy build/sine_predictor.uf2 to the RPI-RP2 drive

# Step 5: View results
uv run python serial_plotter.py

Symptom: The firmware compiles, but crashes with a missing operator error.
Cause: The C++ kernel for that operator wasn't included in the selective build.
Fix: Add the operator to CMakeLists.txt:
set(EXECUTORCH_SELECT_OPS_LIST
    "aten::linear.out,aten::relu.out,aten::missing_op.out"
)

How to find the missing operator: Run train_and_export.py with verbose logging to see the full operator list, or dump it from Python as sketched below.
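One way to dump that list (a sketch; assumes edge is the EdgeProgramManager produced by to_edge() in train_and_export.py):

```python
ops = sorted({
    str(node.target)
    for node in edge.exported_program().graph.nodes
    if node.op == "call_function"
})
print("\n".join(ops))   # compare against EXECUTORCH_SELECT_OPS_LIST
```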
Symptom: Build succeeds, but device hangs or crashes immediately.
Cause: Activation memory exceeds the 520KB SRAM budget.
Diagnostic Questions:
- What is the activation memory requirement? (Check the to_executorch() output)
- Can you reduce hidden layer sizes?
- Can you use int8 quantization?
Symptom: The device runs, but nothing appears on the serial console.
Cause: USB CDC (serial) initialization takes time.
Fix in main.cpp:
stdio_init_all();
sleep_ms(2000); // Wait for USB enumeration

Check: Use the correct serial port (/dev/ttyACM0 on Linux, /dev/cu.usbmodem* on macOS).
If your model uses an operator not in the default kernel library:
- Check whether it is in the Core ATen op set
- If yes, add it to EXECUTORCH_SELECT_OPS_LIST
- If no, you'll need to write a custom kernel (advanced)
For larger models, 8-bit quantization can cut memory use roughly 4x. The hook for a quantization-capable backend (XNNPACK shown here) sits between to_edge() and to_executorch():

from executorch.exir.backend.backend_api import to_backend
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# After to_edge(), before to_executorch()
edge_program = to_backend(edge_program, XnnpackPartitioner())

ExecuTorch supports multiple embedded targets:
| Target | CPU | Memory | Toolchain |
|---|---|---|---|
| Pico 2 W | Cortex-M33 | 520KB SRAM | arm-none-eabi-gcc |
| STM32H7 | Cortex-M7 | 1MB SRAM | arm-none-eabi-gcc |
| ESP32-S3 | Xtensa LX7 | 512KB SRAM | xtensa-esp32s3-elf-gcc |
| Concept | Definition | Why It Matters |
|---|---|---|
| Eager Mode | Operations execute immediately | Development/debugging |
| Graph Mode | Operations recorded, executed later | Deployment/optimization |
| ATen | PyTorch's tensor operation library | Foundation for all ops |
| Core ATen | Minimal subset of ATen (~180 ops) | What backends must implement |
| Edge Dialect | Graph specialized for edge devices | Dtype-specific, out-variants |
| Out-Variant | Operator writes to pre-allocated buffer | Enables memory planning |
| Memory Plan | Pre-computed tensor offset assignments | Zero allocation at runtime |
| Selective Build | Include only needed kernels | Minimize binary size |
| .pte File | Serialized ExecuTorch program | Flatbuffer format, portable |
| XIP | Execute In Place (from Flash) | Saves RAM on embedded |
Use this to estimate if your model fits on Pico 2 W:
Available SRAM: 520 KB
- ExecuTorch runtime: - 50 KB (approx)
- Method allocator: - 32 KB (configurable)
- Stack + globals: - 20 KB (approx)
────────
Remaining for activations: ~418 KB
Your model's activation size: ______ KB
(from to_executorch() output)
Fits? [ ] Yes [ ] No
If No, consider:
[ ] Reduce hidden layer sizes
[ ] Apply int8 quantization
[ ] Use operator fusion
[ ] Stream inputs in chunks
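The same estimate as a quick script (a sketch; the numbers mirror the worksheet above, so replace the activation figure with your model's):

```python
SRAM_KB = 520
runtime_kb = 50         # approximate ExecuTorch runtime footprint
method_pool_kb = 32     # matches method_pool[] in main.cpp
stack_globals_kb = 20   # approximate

remaining_kb = SRAM_KB - runtime_kb - method_pool_kb - stack_globals_kb
activation_kb = 2       # read this from the to_executorch() / memory planning output
print(f"Remaining: {remaining_kb} KB -> fits: {activation_kb <= remaining_kb}")
```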