
LLM From Scratch

Run the smallest llama2.c model (stories260K) inside Scratch/TurboWarp by compiling C inference code to Scratch blocks with llvm2scratch.

If everything is working, the sprite will start generating the familiar opening: Once upon a time, ... (streamed into the speech bubble token-by-token).

(Screenshot: TurboWarp running stories260K inference)

Live Demo

Credits (Upstream)

This repo vendors two upstream projects in-tree for reproducibility:

  • llama2.c by Andrej Karpathy (MIT). Source: llama2.c/ and llama2.c/LICENSE.
  • llvm2scratch by Classfied3D (MIT). Source: llvm2scratch/ and llvm2scratch/LICENSE.

The model/tokenizer artifacts in artifacts/ come from the llama2.c ecosystem.

How It Works

High-level pipeline:

  1. scratch_llama2/build_stories260k_sprite3.py reads:
    • artifacts/stories260K.bin (the smallest llama2.c checkpoint)
    • artifacts/tok512.bin (tokenizer vocabulary)
  2. It quantizes the weight matrices to Q8_0 (group size 4) and packs 4 signed int8 values into one u32.
  3. It lays out everything into a single Scratch list !stack:
    • packed weights + per-group scales
    • RMSNorm weights
    • RoPE cos/sin tables (for a reduced SEQ_LEN)
    • runtime buffers (x/xb/hb/q/att + KV cache)
  4. It writes scratch_llama2/generated_layout.h with 1-indexed addresses into !stack.
  5. It compiles scratch_llama2/llama2_scratch.c to LLVM IR (scratch_llama2/llama2_scratch.ll) using:
    • clang --target=i386-none-elf (keeps pointers as 32-bit ints)
  6. It runs llvm2scratch to turn LLVM IR into Scratch blocks, then exports .sprite3 and .sb3 outputs.
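Step 2 above (Q8_0 quantization with group size 4, then packing 4 signed int8 values into one u32) can be sketched as follows. This is a minimal illustration of the standard Q8_0 convention (per-group scale = max|w| / 127); function names are hypothetical and the actual build script may differ in details:

```python
import struct

def quantize_q8_0(weights, group_size=4):
    """Quantize floats to Q8_0: int8 values plus one float scale per group,
    where scale = max(abs(w)) / 127 within each group."""
    qs, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 127.0 or 1.0  # avoid div-by-zero
        scales.append(scale)
        qs.extend(max(-128, min(127, round(w / scale))) for w in group)
    return qs, scales

def pack_int8x4(qs):
    """Pack each run of 4 signed int8 values into one little-endian u32,
    the shape in which weights are stored in the !stack list."""
    packed = []
    for i in range(0, len(qs), 4):
        b = struct.pack("<4b", *qs[i:i + 4])
        packed.append(struct.unpack("<I", b)[0])
    return packed
```

Dequantization at inference time is then just `w ≈ q * scale` per group, which is what the C code reads back out of `!stack`.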

Runtime UI:

  • !!output (list) stores generated token IDs.
  • !!vocab (list) stores token pieces (strings).
  • !!text (variable) accumulates decoded text; the sprite says it continuously.
  • !!resets (variable) increments when the compiler triggers a broadcast-based “stack reset” (progress indicator + avoids JS call stack blowups).
  • !!status (variable) shows a high-level state machine (Edit params... -> Running... -> Done.).
  • ui_* variables let you adjust sampling/generation settings from TurboWarp/Scratch UI.
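The token-streaming path through `!!output`, `!!vocab`, and `!!text` amounts to the following sketch (Python stand-ins for Scratch lists/variables; names are illustrative, and note that Scratch lists are 1-indexed where Python is 0-indexed):

```python
def emit_token(token_id, output, vocab, text):
    """Sketch of the per-token UI update: log the raw id, decode it through
    the vocab list, and return the accumulated text shown in the speech
    bubble. Not the exact generated Scratch code."""
    output.append(token_id)   # !!output: raw token ids (debugging)
    piece = vocab[token_id]   # !!vocab: token pieces (strings)
    return text + piece       # !!text: accumulated decoded string
```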

Build

Requires:

  • clang
  • uv (plus Python >= 3.12, which llvm2scratch requires)

Command:

# If you don't have a usable Python yet:
# uv python install 3.12
#
# MAX_BRANCH_RECURSION tunes stack-reset frequency for TurboWarp stability/perf.
# Lower = more stable (less likely to hit "Maximum call stack size exceeded"),
# but slower. Higher = faster, but can crash in TurboWarp. 200 is the default.
#
# GEN_STEPS is the maximum number of tokens to generate. Defaults to 20.
# (Must be <= SEQ_LEN, currently 32.)
#
# llvm2scratch requires Python >= 3.12; pin via --python to avoid uv picking
# an older system Python.
MAX_BRANCH_RECURSION=200 \
GEN_STEPS=20 \
uv run --python 3.12 --no-project --with-editable ./llvm2scratch \
  python scratch_llama2/build_stories260k_sprite3.py

Outputs:

  • scratch_llama2/stories260k_inference.sprite3: sprite, blocks hidden (fast editor/import)
  • scratch_llama2/stories260k_inference_visible.sprite3: sprite, blocks visible (debug)
  • scratch_llama2/stories260k_inference_visible.sb3: standalone project wrapper around the visible sprite
  • scratch_llama2/stories260k_inference_visible_scratch.sprite3: Scratch-compatible sprite (no TurboWarp-only blocks)
  • scratch_llama2/stories260k_inference_visible_scratch.sb3: Scratch-compatible standalone project

Run (TurboWarp)

Sprite workflow:

  1. Import scratch_llama2/stories260k_inference_visible.sprite3 into TurboWarp (File -> Upload sprite or drag/drop).
  2. Select the sprite.
  3. Click the green flag.
  4. Edit ui_* variables (Variables panel).
  5. Press space (or click the sprite) to start.

Project workflow:

  1. Open scratch_llama2/stories260k_inference_visible.sb3 in TurboWarp (File -> Load from your computer).
  2. Click the green flag.
  3. Use the sliders/monitors on the stage to edit params.
  4. Press space (or click the sprite) to start.

What you should see:

  • !!status updates: Edit params... -> Running... -> Done.
  • !!resets increments periodically (a "still alive" indicator during long runs).
  • As tokens are generated, the sprite streams decoded text into its speech bubble (!!text).
  • For debugging, generated token IDs are appended to the !!output list.

Sampling UI:

  • ui_steps: max tokens to generate (<= 32).
  • ui_temperature: 0 => greedy; >0 => sampling.
  • ui_top_k: 1 => greedy; >1 => top-k sampling.
  • ui_top_p: nucleus cutoff in (0, 1] (use 1 to disable).
  • ui_seed: nonzero => deterministic; 0 => pick a random seed at start.
  • ui_prompt_preset: 0 => start from BOS; 1 => force the token prefix Once upon a time, (demo).
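The sampling settings above combine in the usual way: greedy when temperature is 0 (or top_k is 1), otherwise temperature-scaled softmax with optional top-k and nucleus (top-p) truncation. A minimal sketch of that logic (illustrative, not the exact C implementation):

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Greedy when temperature == 0 or top_k == 1; otherwise sample from a
    temperature-scaled softmax, optionally truncated by top-k and top-p."""
    if temperature == 0 or top_k == 1:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    # Softmax with temperature (subtract max for numerical stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k > 1:
        probs = probs[:top_k]  # keep only the k most likely tokens
    if top_p < 1.0:            # nucleus: smallest prefix with mass >= top_p
        cum, kept = 0.0, []
        for p, i in probs:
            kept.append((p, i))
            cum += p
            if cum >= top_p:
                break
        probs = kept
    # Renormalize over the surviving tokens and draw one.
    total = sum(p for p, _ in probs)
    r = rng.random() * total
    for p, i in probs:
        r -= p
        if r <= 0:
            return i
    return probs[-1][1]
```

Passing `rng=random.Random(seed)` with a nonzero seed reproduces the deterministic behavior of `ui_seed`.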

Run (Scratch)

Use the *_scratch.* outputs:

  • scratch_llama2/stories260k_inference_visible_scratch.sb3 (recommended)

Scratch is significantly slower than TurboWarp, and does not support TurboWarp-only “hacked counter” blocks.

Notes

  • scratch_llama2/llama2_scratch.c is inference-only and uses a reduced SEQ_LEN for Scratch feasibility.
  • llvm2scratch is vendored here and patched to support pre-seeding !stack and a few extra IR patterns.
  • Official Scratch does not support TurboWarp's hacked counter opcodes. Use the *_scratch.* outputs for scratch.mit.edu.

Notable llvm2scratch Patches (For This Project)

These are the key changes that made llama2_scratch.c viable:

  • Preseeded memory: skip generating huge “initializer” scripts by directly injecting !stack at export time.
  • i8 pointer arithmetic fix: clang emits getelementptr i8 using byte offsets (4/8/12/...), but our “memory” is list-indexed; we scale i8 GEP indices back into 32-bit cells (i8_gep_div=4).
  • Stack reset progress: optional !!resets counter to confirm the VM is still working during long runs (we keep the speech bubble for generated text).
  • Token streaming: SB3_emit_token_dbl logs token IDs to !!output, decodes through !!vocab, appends into !!text, and continuously updates the sprite speech bubble.
  • Added intrinsic support: clang can emit llvm.umin/umax/smin/smax; llvm2scratch now translates these so -O2 IR compiles.
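The i8-GEP fix above boils down to rescaling byte offsets into list-cell offsets. A sketch, assuming the i8_gep_div=4 behavior described (the helper name is hypothetical):

```python
def i8_gep_to_cell(base_cell, byte_offset, i8_gep_div=4):
    """clang emits getelementptr i8 with byte offsets (4/8/12/...), but
    memory here is a Scratch list of 32-bit cells, so byte offsets are
    divided by i8_gep_div (4 bytes per cell) to get a cell index."""
    assert byte_offset % i8_gep_div == 0, "unaligned i8 access not representable"
    return base_cell + byte_offset // i8_gep_div
```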

Citation

@misc{andrews2026llm_from_scratch,
  author       = {Andrews, David},
  title        = {llm\_from\_scratch},
  year         = {2026},
  howpublished = {\url{https://github.com/broyojo/llm_from_scratch}}
}
