Run the smallest llama2.c model (stories260K) inside Scratch/TurboWarp by compiling C inference code to Scratch blocks with llvm2scratch.
If everything is working, the sprite will start generating the familiar opening:
Once upon a time, ... (streamed into the speech bubble token-by-token).
- Scratch project: https://scratch.mit.edu/projects/1277883263
This repo vendors two upstream projects in-tree for reproducibility:
- `llama2.c` by Andrej Karpathy (MIT). Source: `llama2.c/` and `llama2.c/LICENSE`.
- `llvm2scratch` by Classfied3D (MIT). Source: `llvm2scratch/` and `llvm2scratch/LICENSE`.
The model/tokenizer artifacts in `artifacts/` come from the llama2.c ecosystem.
High-level pipeline:
- `scratch_llama2/build_stories260k_sprite3.py` reads:
  - `artifacts/stories260K.bin` (the smallest llama2.c checkpoint)
  - `artifacts/tok512.bin` (tokenizer vocabulary)
- It quantizes the weight matrices to Q8_0 (group size 4) and packs 4 signed int8 values into one u32.
- It lays out everything into a single Scratch list `!stack`:
  - packed weights + per-group scales
  - RMSNorm weights
  - RoPE cos/sin tables (for a reduced `SEQ_LEN`)
  - runtime buffers (x/xb/hb/q/att + KV cache)
- It writes `scratch_llama2/generated_layout.h` with 1-indexed addresses into `!stack`.
- It compiles `scratch_llama2/llama2_scratch.c` to LLVM IR (`scratch_llama2/llama2_scratch.ll`) using `clang --target=i386-none-elf` (keeps pointers as 32-bit ints).
- It runs `llvm2scratch` to turn the LLVM IR into Scratch blocks, then exports `.sprite3` and `.sb3` outputs.
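The Q8_0 step above can be sketched in a few lines. This is an illustrative re-implementation (not the build script's actual code); the group size of 4 and the 4-int8-per-u32 packing match the layout described above:

```python
import struct

GROUP_SIZE = 4  # Q8_0 group size used by the build script

def quantize_q8_0(weights):
    """Quantize floats to int8 with one scale per group of 4,
    then pack each group into a single little-endian u32."""
    packed, scales = [], []
    for i in range(0, len(weights), GROUP_SIZE):
        group = weights[i:i + GROUP_SIZE]
        amax = max(abs(w) for w in group)
        scale = amax / 127.0 if amax > 0 else 1.0
        q = [max(-127, min(127, round(w / scale))) for w in group]
        # Reinterpret 4 signed int8 bytes as one unsigned 32-bit value.
        (u32,) = struct.unpack("<I", struct.pack("<4b", *q))
        packed.append(u32)
        scales.append(scale)
    return packed, scales

# One group of 4 floats -> one packed u32 + one scale.
packed, scales = quantize_q8_0([0.5, -0.25, 0.125, 1.0])
```

Dequantization is then just `int8 * scale` per group, which keeps the per-weight storage at one byte plus a small per-group overhead.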
Runtime UI:
- `!!output` (list) stores generated token IDs.
- `!!vocab` (list) stores token pieces (strings).
- `!!text` (variable) accumulates decoded text; the sprite says it continuously.
- `!!resets` (variable) increments when the compiler triggers a broadcast-based "stack reset" (a progress indicator that also avoids JS call stack blowups).
- `!!status` (variable) shows a high-level state machine (`Edit params...` -> `Running...` -> `Done.`).
- `ui_*` variables let you adjust sampling/generation settings from the TurboWarp/Scratch UI.
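The token-streaming behavior behind these lists is conceptually simple. A minimal Python sketch, using a toy vocabulary (not the real tok512 contents):

```python
# Toy vocabulary standing in for the !!vocab list (illustrative only).
vocab = ["<bos>", "Once", " upon", " a", " time", ","]

def stream_tokens(token_ids):
    """Mimic the sprite: log each token id (-> !!output), decode it
    through the vocab (-> !!vocab), and grow the text (-> !!text)."""
    output, text = [], ""
    for tok in token_ids:
        output.append(tok)   # debugging trail of raw token IDs
        text += vocab[tok]   # accumulated text shown in the speech bubble
    return output, text

out, text = stream_tokens([1, 2, 3, 4, 5])
# text == "Once upon a time,"
```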
Requires:
- `clang`
- `uv` (and Python >= 3.12; `llvm2scratch` requires it)
Command:
```shell
# If you don't have a usable Python yet:
#   uv python install 3.12
#
# Optional env vars:
#   MAX_BRANCH_RECURSION (default 200): stack reset frequency for TurboWarp
#     stability/perf. Lower = more stable (less likely to hit "Maximum call
#     stack size exceeded"), but slower. Higher = faster, but can crash in
#     TurboWarp.
#   GEN_STEPS (default 20): upper bound on tokens to generate
#     (must be <= SEQ_LEN, currently 32).
#
# llvm2scratch requires Python >= 3.12; pin via --python to avoid uv
# picking an older system Python.
MAX_BRANCH_RECURSION=200 \
GEN_STEPS=20 \
uv run --python 3.12 --no-project --with-editable ./llvm2scratch \
  python scratch_llama2/build_stories260k_sprite3.py
```

Outputs:
- `scratch_llama2/stories260k_inference.sprite3`: sprite, blocks hidden (fast editor/import)
- `scratch_llama2/stories260k_inference_visible.sprite3`: sprite, blocks visible (debug)
- `scratch_llama2/stories260k_inference_visible.sb3`: standalone project wrapper around the visible sprite
- `scratch_llama2/stories260k_inference_visible_scratch.sprite3`: Scratch-compatible sprite (no TurboWarp-only blocks)
- `scratch_llama2/stories260k_inference_visible_scratch.sb3`: Scratch-compatible standalone project
Sprite workflow:
- Import `scratch_llama2/stories260k_inference_visible.sprite3` into TurboWarp (File -> Upload sprite, or drag/drop).
- Select the sprite.
- Click the green flag.
- Edit `ui_*` variables (Variables panel).
- Press space (or click the sprite) to start.
Project workflow:
- Open `scratch_llama2/stories260k_inference_visible.sb3` in TurboWarp (File -> Load from your computer).
- Click the green flag.
- Use the sliders/monitors on the stage to edit params.
- Press space (or click the sprite) to start.
What you should see:
- `!!status` updates: `Edit params...` -> `Running...` -> `Done.`
- `!!resets` increments periodically (a "still alive" indicator during long runs).
- As tokens are generated, the sprite streams decoded text into its speech bubble (`!!text`).
- For debugging, generated token IDs are appended to the `!!output` list.
Sampling UI:
- `ui_steps`: max tokens to generate (<= 32).
- `ui_temperature`: `0` => greedy; `>0` => sampling.
- `ui_top_k`: `1` => greedy; `>1` => top-k sampling.
- `ui_top_p`: nucleus cutoff in `(0, 1]` (use `1` to disable).
- `ui_seed`: nonzero => deterministic; `0` => pick a random seed at start.
- `ui_prompt_preset`: `0` => start from BOS; `1` => force the token prefix `Once upon a time,` (demo).
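How these knobs interact can be sketched roughly as follows. This is an illustrative Python re-implementation under the semantics listed above, not the actual C code in `llama2_scratch.c`:

```python
import math
import random

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random.random):
    """Pick a token id from raw logits (toy sampler, illustrative only)."""
    if temperature == 0 or top_k == 1:
        # Greedy: highest logit wins.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax (subtract max for numerical stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = sorted(((i, e / total) for i, e in enumerate(exps)),
                   key=lambda p: p[1], reverse=True)
    if top_k > 1:
        probs = probs[:top_k]  # keep only the k most likely tokens
    if top_p < 1.0:
        kept, cum = [], 0.0
        for i, p in probs:     # nucleus: smallest prefix with mass >= top_p
            kept.append((i, p))
            cum += p
            if cum >= top_p:
                break
        probs = kept
    # Renormalize over the survivors and draw.
    mass = sum(p for _, p in probs)
    r = rng() * mass
    for i, p in probs:
        r -= p
        if r <= 0:
            return i
    return probs[-1][0]
```

Passing a seeded `rng` makes the draw deterministic, which mirrors the nonzero-`ui_seed` behavior.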
Use the `*_scratch.*` outputs:

- `scratch_llama2/stories260k_inference_visible_scratch.sb3` (recommended)
Scratch is significantly slower than TurboWarp, and does not support TurboWarp-only “hacked counter” blocks.
- `scratch_llama2/llama2_scratch.c` is inference-only and uses a reduced `SEQ_LEN` for Scratch feasibility.
- `llvm2scratch` is vendored here and patched to support pre-seeding `!stack` and a few extra IR patterns.
- Official Scratch does not support TurboWarp's hacked counter opcodes. Use the `*_scratch.*` outputs for scratch.mit.edu.
These are the key changes that made `llama2_scratch.c` viable:

- Preseeded memory: skip generating huge "initializer" scripts by directly injecting `!stack` at export time.
- i8 pointer arithmetic fix: clang emits `getelementptr i8` using byte offsets (4/8/12/...), but our "memory" is list-indexed; we scale i8 GEP indices back into 32-bit cells (`i8_gep_div=4`).
- Stack reset progress: an optional `!!resets` counter confirms the VM is still working during long runs (the speech bubble is kept for generated text).
- Token streaming: `SB3_emit_token_dbl` logs token IDs to `!!output`, decodes them through `!!vocab`, appends them to `!!text`, and continuously updates the sprite's speech bubble.
- Added intrinsic support: clang can emit `llvm.umin`/`llvm.umax`/`llvm.smin`/`llvm.smax`; llvm2scratch now translates these so `-O2` IR compiles.
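The i8 GEP fix amounts to dividing clang's byte offsets by the cell width. A tiny sketch of the idea (the function name is illustrative, not llvm2scratch's actual API):

```python
CELL_BYTES = 4  # each !stack entry holds one 32-bit cell

def scale_i8_gep(byte_offset):
    """Convert a clang-emitted i8 byte offset into a list (cell) index.
    Offsets must be cell-aligned, matching i8_gep_div=4."""
    assert byte_offset % CELL_BYTES == 0, "misaligned i8 GEP"
    return byte_offset // CELL_BYTES

# clang's byte offsets 0, 4, 8, 12 map to !stack cells 0, 1, 2, 3
```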
```bibtex
@misc{andrews2026llm_from_scratch,
  author = {Andrews, David},
  title = {llm\_from\_scratch},
  year = {2026},
  howpublished = {\url{https://github.com/broyojo/llm_from_scratch}}
}
```