# EdgeRunner

Fast, local LLM inference for Apple Silicon. Built in Swift/Metal from the ground up.
```swift
import EdgeRunner

let runner = try await EdgeRunner(modelPath: "Qwen3-0.6B-Q8_0.gguf")

// Streaming
for try await token in runner.stream("Once upon a time") {
    print(token, terminator: "")
}

// Or one-shot
let text = try await runner.generate("What is Swift?", maxTokens: 100)
```

## Features

- ~230+ tok/s median decode on Qwen3-0.6B (Apple M3 Max, publishable benchmark)
- Metal-native: Custom compute kernels optimized for Apple Silicon
- GGUF support: Load quantized models directly (Q8_0, Q4_0, Q4_K_M, Q2_K, Q3_K, Q5_0, Q5_1, Q5_K, Q6_K)
- Memory-mapped loading: Instant startup, minimal memory pressure
- Private: Runs entirely on-device, no network required
- iOS + macOS: Drop into any Swift project as a package dependency
## Requirements

- macOS 26.0+ or iOS 26.0+
- Apple Silicon (M1 or later)
- Swift 6.2+
- Xcode 26 beta or newer
## Installation

Add EdgeRunner to your `Package.swift`:
```swift
dependencies: [
    .package(url: "https://github.com/christopherkarani/EdgeRunner.git", from: "0.1.0")
]
```

Or add it in Xcode: File → Add Package Dependencies...
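The package product also needs to appear in your target's dependencies. A minimal executable manifest might look like the sketch below; the target name is illustrative, and the `.v26` platform case assumes the Xcode 26 toolchain:

```swift
// swift-tools-version: 6.2
import PackageDescription

let package = Package(
    name: "MyApp",  // illustrative name, not part of EdgeRunner
    platforms: [.macOS(.v26)],
    dependencies: [
        .package(url: "https://github.com/christopherkarani/EdgeRunner.git", from: "0.1.0")
    ],
    targets: [
        .executableTarget(
            name: "MyApp",
            dependencies: [
                // Link the EdgeRunner library product into the target
                .product(name: "EdgeRunner", package: "EdgeRunner")
            ]
        )
    ]
)
```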
## Models

EdgeRunner uses GGUF-format models. Download a compatible model:
```sh
# Qwen3 0.6B (recommended for testing)
curl -L -o Qwen3-0.6B-Q8_0.gguf \
  https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf
```

Supported models:
- Qwen3 0.6B, 1.7B, 4B (Q8_0)
- More models coming soon (Llama 3, Mistral, Phi-3)
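After the download finishes, it is worth a quick sanity check before loading: every GGUF file begins with the four ASCII bytes `GGUF`, so a truncated download or an HTML error page from the server is easy to spot.

```shell
# Print the file's first four bytes: a valid GGUF model starts with "GGUF".
head -c 4 Qwen3-0.6B-Q8_0.gguf; echo
```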
## Quick Start

```swift
import EdgeRunner

@main
struct MyApp {
    static func main() async throws {
        let runner = try await EdgeRunner(
            modelPath: "Qwen3-0.6B-Q8_0.gguf"
        )

        // Streaming output
        for try await token in runner.stream("Hello, world!", maxTokens: 50) {
            print(token, terminator: "")
        }
    }
}
```

## Architecture

EdgeRunner is organized into layered modules:

```
EdgeRunnerMetal   # Low-level Metal compute kernels
EdgeRunnerIO      # Model loading (GGUF, SafeTensors)
EdgeRunnerCore    # Sampling, tokenization, graph execution
EdgeRunner        # Public API facade
```
| Component | Purpose |
|---|---|
| `EdgeRunner` | Simple entry point: load, stream, generate |
| `LlamaLanguageModel` | Full inference engine (advanced use) |
| `KVCache` | Efficient autoregressive generation |
| `GGUFLoader` | Memory-mapped model loading |
| `SamplingPipeline` | Temperature, top-p, top-k, repetition penalty |
## Performance

Benchmarks on Apple M3 Max (128-token decode):

| Model | EdgeRunner | llama.cpp | MLX |
|---|---|---|---|
| Qwen3-0.6B-Q8_0 | ~230+ tok/s | ~198 tok/s | ~286 tok/s |
| Time to first token | 3.5 ms | n/a in llama-bench | n/a in mlx_lm.benchmark |

Memory usage:

- Qwen3-0.6B: ~700MB (with memory mapping)
- Qwen3-4B: ~4.5GB
## Advanced Usage

```swift
let text = try await runner.generate(
    "Write a story about",
    maxTokens: 200,
    sampling: SamplingConfiguration(
        temperature: 0.7,
        topP: 0.9,
        topK: 40,
        repetitionPenalty: 1.1
    )
)
```

For fine-grained control over the inference loop, use `LlamaLanguageModel` directly:
```swift
let model = try await LlamaLanguageModel.load(
    from: modelURL,
    configuration: ModelConfiguration(contextWindowSize: 2048)
)

var tokens = [model.bosTokenID ?? 1]
for _ in 0..<100 {
    let next = try await model.nextToken(
        for: tokens,
        sampling: SamplingConfiguration(temperature: 0)  // temperature 0 = greedy decoding
    )
    guard next != model.eosTokenID else { break }
    tokens.append(next)
}
print(model.detokenize(tokens))
```

## Tool Calling

Tool calling is available through concrete `EdgeRunnerTool` conformers and streaming/generation sessions. See `Sources/EdgeRunner/ToolCalling/EdgeRunnerTool.swift` for the protocol surface and the streaming APIs for orchestration.
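The exact protocol requirements live in that file; as a rough sketch only, a conformer pairs a tool description with an async execution method. Every name and signature below is an assumption for illustration, not the actual API:

```swift
// Hypothetical sketch: the real requirements are defined in
// Sources/EdgeRunner/ToolCalling/EdgeRunnerTool.swift.
struct WeatherTool: EdgeRunnerTool {
    let name = "get_weather"                              // assumed property
    let description = "Returns current weather for a city" // assumed property

    // Assumed signature: receives model-produced arguments, returns a result
    // string that is fed back into the generation session.
    func call(arguments: String) async throws -> String {
        "sunny, 22°C"  // stub result for illustration
    }
}
```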
## Building from Source

```sh
git clone https://github.com/christopherkarani/EdgeRunner.git
cd EdgeRunner
swift build

# Run tests
swift test

# Run benchmark (publishable, release build)
swift test -c release --filter "PublishableBenchmark/fullBenchmark"
```

## Status

EdgeRunner is in beta. Core features work well:
- ✅ Fast inference (~230+ tok/s publishable decode on Qwen3-0.6B)
- ✅ GGUF Q8_0 quantization
- ✅ KV cache for efficient generation
- ⚠️ Tokenizer (BPE): basic implementation
- 🚧 Multi-model support (Qwen only for now)
See ROADMAP.md for planned features.
See TROUBLESHOOTING.md for common issues and solutions.
## License

MIT License. See LICENSE for details.