A minimal, end-to-end tutorial for deploying a neural network on the RP2350 microcontroller using PyTorch and ExecuTorch. This project bridges the gap between cloud-based ML frameworks and bare-metal embedded systems.
By completing this tutorial, you will understand:
- PyTorch Fundamentals — How tensors flow through a computational graph
- Model Export — Converting dynamic Python to a static, inspectable graph
- ExecuTorch Lowering — Transforming that graph for embedded execution
- Memory Planning — Pre-calculating every byte of RAM before runtime
- Embedded Deployment — Running inference without an OS or heap allocator
Most ML tutorials stop at model.predict(). On embedded systems, we must go much further. Here's the complete journey:
┌─────────────────────────────────────────────────────────────────────────────┐
│ DEVELOPMENT MACHINE (Python) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ PyTorch │ │ Export │ │ ExecuTorch │ │
│ │ Model │ ──▶ │ Graph │ ──▶ │ Lowering │ │
│ │ (Dynamic) │ │ (Static) │ │ (Optimize) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │
│ │ nn.Module │ .pte file │
│ │ Python code │ (Flatbuffer) │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Training │ │ C Header │ │
│ │ Loop │ │ (bytes[]) │ │
│ └──────────────┘ └──────────────┘ │
│ │ │
└─────────────────────────────────────────────────────│───────────────────────┘
│
│ Cross-compile
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PICO 2 W (Bare Metal C++) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ ExecuTorch │ │ Method │ │ Output │ │
│ │ Runtime │ ──▶ │ Execute │ ──▶ │ (sin(x)) │ │
│ │ (~50 KB) │ │ (Kernels) │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
PyTorch operates in two fundamentally different modes. Understanding when and why to use each is essential.
Eager Mode (default) — Operations execute immediately as Python runs:
import torch

x = torch.tensor([1.0, 2.0])
y = torch.tensor([3.0, 4.0])
z = x + y  # Addition happens RIGHT NOW, result stored in z

Graph Mode — Operations are recorded first, then executed later:
Eager Mode Graph Mode
┌───────────────┐ ┌───────────────────────┐
│ Python │ │ Python │
│ x = ... │ │ x = placeholder │
│ y = ... │ │ y = placeholder │
│ z = x + y │◀─ executes │ z = add(x, y) │◀─ records
│ print(z) │ instantly │ return z │ graph
└───────────────┘ └───────────────────────┘
│
▼ export
┌───────────────────────┐
│ Graph IR │
│ ┌───┐ ┌───┐ │
│ │ x │───▶│add│──▶ z │
│ └───┘ ▲ └───┘ │
│ ┌───┐ │ │
│ │ y │──┘ │
│ └───┘ │
└───────────────────────┘
Why Graph Mode for Embedded?
- No Python interpreter needed at runtime
- The graph can be analyzed, optimized, and memory-planned ahead of time
- All operations are known before execution begins
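The same contrast, shown with a minimal torch.export sketch (the module and values here are illustrative, not the project's model):

```python
import torch

class AddModule(torch.nn.Module):
    def forward(self, x, y):
        return x + y

x = torch.tensor([1.0, 2.0])
y = torch.tensor([3.0, 4.0])

# Eager mode: the addition runs immediately.
print(AddModule()(x, y))            # tensor([4., 6.])

# Graph mode: record the computation first, execute it later.
exported = torch.export.export(AddModule(), (x, y))
print(exported.graph)               # placeholders -> aten.add.Tensor -> output
print(exported.module()(x, y))      # tensor([4., 6.]), now replayed from the graph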
A tensor is a multi-dimensional array with metadata. Every tensor carries:
┌─────────────────────────────────────────────────────────────┐
│ Tensor │
├─────────────────────────────────────────────────────────────┤
│ Data Buffer: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0] │
│ ▲ │
│ │ (contiguous memory) │
├─────────────────┼───────────────────────────────────────────┤
│ Shape: │ (2, 3) ─▶ 2 rows, 3 columns │
│ Stride: │ (3, 1) ─▶ jump 3 to next row │
│ Dtype: │ float32 ─▶ 4 bytes per element │
│ Device: │ cpu ─▶ where data lives │
└─────────────────┴───────────────────────────────────────────┘
Logical View (2x3 matrix): Memory Layout (flat):
┌─────┬─────┬─────┐ ┌───┬───┬───┬───┬───┬───┐
│ 1.0 │ 2.0 │ 3.0 │ row 0 │1.0│2.0│3.0│4.0│5.0│6.0│
├─────┼─────┼─────┤ └───┴───┴───┴───┴───┴───┘
│ 4.0 │ 5.0 │ 6.0 │ row 1 0 1 2 3 4 5
└─────┴─────┴─────┘
Stride determines memory traversal. To access element [1, 2] (value 6.0):
- Row offset: 1 × stride[0] = 1 × 3 = 3
- Column offset: 2 × stride[1] = 2 × 1 = 2
- Final offset: 3 + 2 = 5 → buffer[5] = 6.0
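The same bookkeeping, verified in PyTorch (a small sketch; the values match the figure above):

```python
import torch

t = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

print(t.shape)     # torch.Size([2, 3])
print(t.stride())  # (3, 1)
print(t.dtype)     # torch.float32

# Manual offset computation for element [1, 2]:
row, col = 1, 2
offset = row * t.stride()[0] + col * t.stride()[1]   # 1*3 + 2*1 = 5
print(t.flatten()[offset])                            # tensor(6.), same as t[1, 2]
```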
torch.export traces your model by running it with example inputs and recording every operation:
torch.export.export(model, example_inputs)
│
▼
┌────────────────────────────────────────────────────────────────────────────┐
│ ATen Dialect │
│ (All 2000+ PyTorch operators available) │
├────────────────────────────────────────────────────────────────────────────┤
│ graph(): │
│ %x : Tensor = placeholder[target=x] │
│ %weight : Tensor = get_attr[target=fc1.weight] │
│ %bias : Tensor = get_attr[target=fc1.bias] │
│ %linear : Tensor = call_function[target=aten.linear](%x, %weight, %bias)
│ %relu : Tensor = call_function[target=aten.relu](%linear) │
│ return (%relu,) │
└────────────────────────────────────────────────────────────────────────────┘
What gets captured:
- The exact sequence of tensor operations
- All weight tensors and their values (frozen at export time)
- Shape information for every intermediate tensor
What does NOT get captured:
- Python control flow (if/else based on tensor values)
- Side effects (print statements, file I/O)
- Dynamic Python logic
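A minimal sketch of what the capture looks like in practice (the module here is illustrative, not the project's model):

```python
import torch

class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(1, 16)

    def forward(self, x):
        return torch.relu(self.fc1(x))

ep = torch.export.export(TinyNet(), (torch.randn(1, 1),))
print(ep.graph)            # placeholder -> aten.linear -> aten.relu -> output
print(ep.graph_signature)  # fc1.weight / fc1.bias lifted as frozen parameters
```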
ATen (the "A TENsor" library) has thousands of operators. Many are "convenience" operators built from simpler primitives. Decomposition breaks them down:
Before Decomposition: After Decomposition:
┌─────────────────────┐ ┌─────────────────────────────────┐
│ aten.linear │ ──▶ │ aten.t (transpose) │
│ (W, x, bias) │ │ aten.mm (matrix multiply) │
│ │ │ aten.add (add bias) │
└─────────────────────┘ └─────────────────────────────────┘
Operator Count: ~2000 Operator Count: ~180 (Core ATen)
Why decompose?
- Smaller operator set = smaller runtime binary
- Backend developers implement fewer kernels
- More optimization opportunities (fuse patterns across decomposed ops)
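You can watch the decomposition happen on an ExportedProgram (a sketch continuing the ep from the capture example above; exact op names vary across PyTorch versions):

```python
# The default decomposition table targets the Core ATen operator set.
core_ep = ep.run_decompositions()

before = {str(n.target) for n in ep.graph.nodes if n.op == "call_function"}
after = {str(n.target) for n in core_ep.graph.nodes if n.op == "call_function"}
print(before)  # e.g. {'aten.linear.default', 'aten.relu.default'}
print(after)   # e.g. {'aten.permute.default', 'aten.addmm.default', 'aten.relu.default'}
```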
to_edge() converts the graph for edge devices. Key changes:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Edge Dialect │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. DTYPE SPECIALIZATION │
│ ───────────────────── │
│ Operators become dtype-specific: │
│ │
│ aten.add(Tensor, Tensor) ──▶ aten.add.float32(Tensor, Tensor) │
│ │
│ 2. SCALAR TO TENSOR │
│ ──────────────────── │
│ All scalars become 0-dimensional tensors: │
│ │
│ x + 2.0 ──▶ x + tensor([2.0]) │
│ │
│ 3. OUT-VARIANT CONVERSION │
│ ───────────────────────── │
│ Operators take pre-allocated output buffers: │
│ │
│ z = add(x, y) ──▶ add(x, y, out=z_buffer) │
│ (allocates z) (writes to existing buffer) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
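In code, this lowering is a single call (a sketch assuming the executorch Python package, with ep being the ExportedProgram from the export step):

```python
from executorch.exir import to_edge

edge = to_edge(ep)                     # ATen dialect -> Edge dialect
print(edge.exported_program().graph)   # ops now appear as edge-dialect variants
```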
This is where ExecuTorch differs fundamentally from PyTorch. Before any code runs on the device, the memory planner:
- Analyzes the lifetime of every tensor (when it's created, when it's last used)
- Calculates the size of every tensor
- Assigns offsets into a pre-allocated arena
Timeline of Tensor Lifetimes:
Operation: Linear1 ReLU1 Linear2 ReLU2 Output
────────────────────────────────────────────▶ time
│ │ │ │
Tensor t1: [=========] │ │ created by Linear1
Tensor t2: [=========] │ created by ReLU1
Tensor t3: [=========] created by Linear2
Tensor t4: [=========] created by ReLU2
Memory Arena Layout (offsets pre-computed):
Offset: 0 256 512 768 1024
├────────┼────────┼────────┼────────┤
│ t1 │ t2 │ t3 │ t4 │
│ reuse │ reuse │ │ │
└────────┴────────┴────────┴────────┘
│
▼
t1's and t2's slots can be reused by t3 and t4
(non-overlapping lifetimes shrink the arena)
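The core idea can be sketched in a few lines of Python (conceptual only; this is not the ExecuTorch planner's actual algorithm):

```python
# Conceptual sketch: tensors whose lifetimes do not overlap may share an offset.
def plan_memory(tensors):
    """tensors: list of (name, size_bytes, first_use, last_use)."""
    placements = {}      # name -> arena offset
    live = []            # (last_use, offset, size) of tensors already placed
    arena_size = 0
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        # Reuse the slot of a tensor that is already dead when this one is created.
        slot = next((s for s in live if s[0] < first and s[2] >= size), None)
        if slot:
            live.remove(slot)
            offset = slot[1]
        else:
            offset = arena_size
            arena_size += size
        placements[name] = offset
        live.append((last, offset, size))
    return placements, arena_size

# Lifetimes from the timeline above (time steps 0..4):
plan, total = plan_memory([("t1", 256, 0, 1), ("t2", 256, 1, 2),
                           ("t3", 256, 2, 3), ("t4", 256, 3, 4)])
print(plan, total)   # {'t1': 0, 't2': 256, 't3': 0, 't4': 256} 512
```

With these lifetimes, t3 reuses t1's slot and t4 reuses t2's, so 512 bytes suffice instead of 1024.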
The Memory Plan is Embedded in the .pte File:
┌─────────────────────────────────────────────────────────────┐
│ .pte File Contents │
├─────────────────────────────────────────────────────────────┤
│ • Instructions (which operators to call, in what order) │
│ • Weight data (frozen model parameters) │
│ • Memory map (offset assignments for every tensor) │
│ • Kernel references (which C++ functions to invoke) │
└─────────────────────────────────────────────────────────────┘
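The final lowering and serialization steps look like this (a sketch; edge is the EdgeProgramManager from the to_edge() sketch earlier):

```python
executorch_program = edge.to_executorch()    # runs memory planning, out-variant conversion, etc.

with open("sine_model.pte", "wb") as f:
    f.write(executorch_program.buffer)       # flatbuffer: instructions, weights, memory map
```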
The RP2350 has severe constraints. Understanding them is essential:
┌─────────────────────────────────────────────────────────────────────────────┐
│ RP2350 MEMORY MAP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ FLASH (4MB) SRAM (520KB) │
│ ──────────── ──────────── │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ Program Code │ │ Stack │ ← grows down │
│ │ (.text section) │ │ │ │
│ │ ~50KB runtime │ ├────────────────────┤ │
│ ├────────────────────┤ │ Heap (UNUSED) │ ← we avoid this │
│ │ Read-Only Data │ │ │ │
│ │ (.rodata) │ ├────────────────────┤ │
│ │ - Model weights │─────────────│►Planned Memory │ ← our arena │
│ │ - .pte payload │ XIP │ Arena │ │
│ │ ~3KB for this model│ (execute │ ~2KB activations │ │
│ └────────────────────┘ in place) ├────────────────────┤ │
│ │ Input/Output │ │
│ │ Buffers │ │
│ │ (provided by app) │ │
│ └────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
XIP = Execute In Place: Code runs directly from Flash, no copy to RAM needed
ExecuTorch uses a hierarchical memory allocation system:
┌─────────────────────────────────┐
│ MemoryManager │
└─────────────────────────────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Method │ │ Planned │ │ Temporary │
│ Allocator │ │ Memory │ │ Allocator │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
▼ ▼ ▼
Metadata & Activation Scratch space
control flow tensors for kernels
In practice (main.cpp):

// Fixed-size buffers, no malloc() at runtime
static uint8_t method_pool[32 * 1024];      // 32 KB for method metadata
static uint8_t activation_pool[16 * 1024];  // 16 KB for tensor activations

MemoryAllocator method_alloc(sizeof(method_pool), method_pool);
HierarchicalAllocator planned({activation_pool, sizeof(activation_pool)});
MemoryManager memory_manager(&method_alloc, &planned);

The runtime must know how to execute each operator. This mapping is the Kernel Registry:
┌────────────────────────────────────────────────────────────────────────────┐
│ Kernel Registry │
├──────────────────────┬─────────────────────────────────────────────────────┤
│ Operator Name │ C++ Implementation │
├──────────────────────┼─────────────────────────────────────────────────────┤
│ aten.linear.out │ → torch::executor::linear_out(...) │
│ aten.relu.out │ → torch::executor::relu_out(...) │
│ aten.add.out │ → torch::executor::add_out(...) │
│ ... │ ... │
└──────────────────────┴─────────────────────────────────────────────────────┘
Selective Build Problem:
- Full kernel library = megabytes of code
- Your model uses 3 operators = you only need 3 kernels
- CMake's EXECUTORCH_SELECT_OPS_LIST builds only what's needed
# CMakeLists.txt
set(EXECUTORCH_SELECT_OPS_LIST "aten::linear.out,aten::relu.out,aten::add.out")

Our model learns to approximate sin(x) for x ∈ [0, 2π]:
Input Hidden Layer 1 Hidden Layer 2 Output
│ (16 neurons) (16 neurons) │
│ │
x ──┬──▶ [Linear] ──▶ [ReLU] ──▶ [Linear] ──▶ [ReLU] ──▶ [Linear] ──▶ ŷ
│ │ │ │
│ ▼ ▼ ▼
│ W₁: (1,16) W₂: (16,16) W₃: (16,1)
│ b₁: (16,) b₂: (16,) b₃: (1,)
│ │
└──────────────────────────────────────────────────────┘
Total: 321 parameters
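A sketch of that architecture in PyTorch (the layer layout is illustrative; see train_and_export.py for the actual definition):

```python
import torch

class SineNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(1, 16), torch.nn.ReLU(),
            torch.nn.Linear(16, 16), torch.nn.ReLU(),
            torch.nn.Linear(16, 1),
        )

    def forward(self, x):
        return self.net(x)

# (1*16 + 16) + (16*16 + 16) + (16*1 + 1) = 321
print(sum(p.numel() for p in SineNet().parameters()))   # 321
```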
Universal Approximation Theorem: A neural network with at least one hidden layer and enough neurons can approximate any continuous function on a bounded interval to arbitrary precision. We're using this to learn sin(x) on [0, 2π].
ReLU (Rectified Linear Unit) outputs max(0, x). A piecewise linear function:
ReLU(x): sin(x): Approximation:
│ / │ ╭──╮ │ /\
│ / │ / \ │ / \
────┼──/─── ─┼──/──────\── ─┼──/────\──
│ │ \ │ \
│ │ ╰── │ \/
With multiple ReLUs and learned weights, we can construct a piecewise linear approximation that closely follows the sine curve.
executorch-sine-pico2w/
├── pyproject.toml # Python dependencies (managed by uv)
├── README.md # This document
├── bootstrap.py # Downloads ARM toolchain, sets up environment
├── setup_toolchain.py # Configures cross-compilation paths
├── train_and_export.py # Defines model, trains, exports to .pte
├── build_firmware.py # Cross-compiles C++ runtime + model
├── main.cpp # Pico 2 W entry point, runs inference loop
├── CMakeLists.txt # Build configuration for embedded target
├── pte_to_header.py # Converts .pte binary to C byte array
└── serial_plotter.py # Visualizes predictions over USB serial
train_and_export.py build_firmware.py
│ │
▼ ▼
┌───────────────┐ ┌──────────────────┐
│ 1. Define MLP │ │ 4. CMake config │
│ 2. Train loop │ │ (ARM target) │
│ 3. Export: │ │ 5. Compile: │
│ - capture │ │ - runtime │
│ - to_edge │ │ - kernels │
│ - to_exec │ │ - main.cpp │
│ 4. Serialize │ │ 6. Link .uf2 │
└───────┬───────┘ └────────┬─────────┘
│ │
▼ ▼
sine_model.pte ──▶ pte_to_header.py ──▶ model_data.h
│ │
│ ▼
│ sine_predictor.uf2
│ │
└─────────────────────────────────────┘
│
▼
Flash to Pico 2 W
- Python 3.12+ with uv package manager
- Raspberry Pi Pico 2 W
- USB cable
# Step 1: Bootstrap environment and toolchain
uv run python bootstrap.py

# Step 2: Train model and export to .pte
uv run python train_and_export.py

# Step 3: Build firmware
uv run python build_firmware.py

# Step 4: Flash to device
#   Hold BOOTSEL button, connect USB, release
#   Copy build/sine_predictor.uf2 to the RPI-RP2 drive

# Step 5: View results
uv run python serial_plotter.py

Symptom: The firmware compiles, but crashes with a missing operator error.
Cause: The C++ kernel for that operator wasn't included in the selective build.
Fix: Add the operator to CMakeLists.txt:
set(EXECUTORCH_SELECT_OPS_LIST
    "aten::linear.out,aten::relu.out,aten::missing_op.out"
)

How to find the missing operator: Run train_and_export.py with verbose logging to see the full operator list, or dump it from Python as sketched below.
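One way to dump that list (a sketch; assumes edge is the EdgeProgramManager produced by to_edge() in train_and_export.py):

```python
ops = sorted({
    str(node.target)
    for node in edge.exported_program().graph.nodes
    if node.op == "call_function"
})
print("\n".join(ops))   # compare against EXECUTORCH_SELECT_OPS_LIST
```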
Symptom: Build succeeds, but device hangs or crashes immediately.
Cause: Activation memory exceeds the 520KB SRAM budget.
Diagnostic Questions:
- What is the activation memory requirement? (Check the to_executorch() output)
- Can you reduce hidden layer sizes?
- Can you use int8 quantization?
Symptom: The device runs, but nothing appears on the serial console.
Cause: USB CDC (serial) initialization takes time.
Fix in main.cpp:
stdio_init_all();
sleep_ms(2000); // Wait for USB enumeration

Check: Use the correct serial port (/dev/ttyACM0 on Linux, /dev/cu.usbmodem* on macOS).
If your model uses an operator not in the default kernel library:
- Check whether it is in the Core ATen op set
- If yes, add it to EXECUTORCH_SELECT_OPS_LIST
- If no, you'll need to write a custom kernel (advanced)
For larger models, 8-bit quantization can cut memory use roughly 4x. The hook for a quantization-capable backend (XNNPACK shown here) sits between to_edge() and to_executorch():

from executorch.exir.backend.backend_api import to_backend
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# After to_edge(), before to_executorch()
edge_program = to_backend(edge_program, XnnpackPartitioner())

ExecuTorch supports multiple embedded targets:
| Target | CPU | Memory | Toolchain |
|---|---|---|---|
| Pico 2 W | Cortex-M33 | 520KB SRAM | arm-none-eabi-gcc |
| STM32H7 | Cortex-M7 | 1MB SRAM | arm-none-eabi-gcc |
| ESP32-S3 | Xtensa LX7 | 512KB SRAM | xtensa-esp32s3-elf-gcc |
| Concept | Definition | Why It Matters |
|---|---|---|
| Eager Mode | Operations execute immediately | Development/debugging |
| Graph Mode | Operations recorded, executed later | Deployment/optimization |
| ATen | PyTorch's tensor operation library | Foundation for all ops |
| Core ATen | Minimal subset of ATen (~180 ops) | What backends must implement |
| Edge Dialect | Graph specialized for edge devices | Dtype-specific, out-variants |
| Out-Variant | Operator writes to pre-allocated buffer | Enables memory planning |
| Memory Plan | Pre-computed tensor offset assignments | Zero allocation at runtime |
| Selective Build | Include only needed kernels | Minimize binary size |
| .pte File | Serialized ExecuTorch program | Flatbuffer format, portable |
| XIP | Execute In Place (from Flash) | Saves RAM on embedded |
Use this to estimate if your model fits on Pico 2 W:
Available SRAM: 520 KB
- ExecuTorch runtime: - 50 KB (approx)
- Method allocator: - 32 KB (configurable)
- Stack + globals: - 20 KB (approx)
────────
Remaining for activations: ~418 KB
Your model's activation size: ______ KB
(from to_executorch() output)
Fits? [ ] Yes [ ] No
If No, consider:
[ ] Reduce hidden layer sizes
[ ] Apply int8 quantization
[ ] Use operator fusion
[ ] Stream inputs in chunks
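The same estimate as a quick script (a sketch; the numbers mirror the worksheet above, so replace the activation figure with your model's):

```python
SRAM_KB = 520
runtime_kb = 50         # approximate ExecuTorch runtime footprint
method_pool_kb = 32     # matches method_pool[] in main.cpp
stack_globals_kb = 20   # approximate

remaining_kb = SRAM_KB - runtime_kb - method_pool_kb - stack_globals_kb
activation_kb = 2       # read this from the to_executorch() / memory planning output
print(f"Remaining: {remaining_kb} KB -> fits: {activation_kb <= remaining_kb}")
```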