# EdgeRunner

Fast, local LLM inference for Apple Silicon. Built in Swift/Metal from the ground up.
```swift
import EdgeRunner

let runner = try await EdgeRunner(modelPath: "Qwen3-0.6B-Q8_0.gguf")

// Streaming
for try await token in runner.stream("Once upon a time") {
    print(token, terminator: "")
}

// Or one-shot
let text = try await runner.generate("What is Swift?", maxTokens: 100)
```

## Features

- ~230+ tok/s median decode on Qwen3-0.6B (Apple M3 Max, publishable benchmark)
- Metal-native: Custom compute kernels optimized for Apple Silicon
- GGUF support: Load quantized models directly (Q8_0, Q4_0, Q4_K_M, Q2_K, Q3_K, Q5_0, Q5_1, Q5_K, Q6_K)
- Memory-mapped loading: Instant startup, minimal memory pressure
- Private: Runs entirely on-device, no network required
- iOS + macOS: Drop into any Swift project as a package dependency
## Requirements

- macOS 26.0+ or iOS 26.0+
- Apple Silicon (M1 or later)
- Swift 6.2+
- Xcode 26 beta or newer
## Installation

Add EdgeRunner to your `Package.swift`:
```swift
dependencies: [
    .package(url: "https://github.com/christopherkarani/EdgeRunner.git", from: "0.1.0")
]
```

Or add it in Xcode: File → Add Package Dependencies...
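The package product also needs to appear in your target's dependencies. A minimal executable manifest might look like the sketch below; the target name is illustrative, and the `.v26` platform case assumes the Xcode 26 toolchain:

```swift
// swift-tools-version: 6.2
import PackageDescription

let package = Package(
    name: "MyApp",  // illustrative name, not part of EdgeRunner
    platforms: [.macOS(.v26)],
    dependencies: [
        .package(url: "https://github.com/christopherkarani/EdgeRunner.git", from: "0.1.0")
    ],
    targets: [
        .executableTarget(
            name: "MyApp",
            dependencies: [
                // Link the EdgeRunner library product into the target
                .product(name: "EdgeRunner", package: "EdgeRunner")
            ]
        )
    ]
)
```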
## Models

EdgeRunner uses GGUF-format models. Download a compatible model:
```sh
# Qwen3 0.6B (recommended for testing)
curl -L -o Qwen3-0.6B-Q8_0.gguf \
  https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf
```

Supported models:
- Qwen3 0.6B, 1.7B, 4B (Q8_0)
- More models coming soon (Llama 3, Mistral, Phi-3)
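After the download finishes, it is worth a quick sanity check before loading: every GGUF file begins with the four ASCII bytes `GGUF`, so a truncated download or an HTML error page from the server is easy to spot.

```shell
# Print the file's first four bytes: a valid GGUF model starts with "GGUF".
head -c 4 Qwen3-0.6B-Q8_0.gguf; echo
```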
## Quick Start

```swift
import EdgeRunner

@main
struct MyApp {
    static func main() async throws {
        let runner = try await EdgeRunner(
            modelPath: "Qwen3-0.6B-Q8_0.gguf"
        )

        // Streaming output
        for try await token in runner.stream("Hello, world!", maxTokens: 50) {
            print(token, terminator: "")
        }
    }
}
```

## Architecture

EdgeRunner is organized into layered modules:

```
EdgeRunnerMetal   # Low-level Metal compute kernels
EdgeRunnerIO      # Model loading (GGUF, SafeTensors)
EdgeRunnerCore    # Sampling, tokenization, graph execution
EdgeRunner        # Public API facade
```
| Component | Purpose |
|---|---|
| `EdgeRunner` | Simple entry point: load, stream, generate |
| `LlamaLanguageModel` | Full inference engine (advanced use) |
| `KVCache` | Efficient autoregressive generation |
| `GGUFLoader` | Memory-mapped model loading |
| `SamplingPipeline` | Temperature, top-p, top-k, repetition penalty |
## Performance

Benchmarks on Apple M3 Max (128-token decode):

| Model | EdgeRunner | llama.cpp | MLX |
|---|---|---|---|
| Qwen3-0.6B-Q8_0 | ~230+ tok/s | ~198 tok/s | ~286 tok/s |
| Time to first token | 3.5 ms | n/a in llama-bench | n/a in mlx_lm.benchmark |

Memory usage:

- Qwen3-0.6B: ~700MB (with memory mapping)
- Qwen3-4B: ~4.5GB
## Advanced Usage

```swift
let text = try await runner.generate(
    "Write a story about",
    maxTokens: 200,
    sampling: SamplingConfiguration(
        temperature: 0.7,
        topP: 0.9,
        topK: 40,
        repetitionPenalty: 1.1
    )
)
```

For fine-grained control over the inference loop, use `LlamaLanguageModel` directly:
```swift
let model = try await LlamaLanguageModel.load(
    from: modelURL,
    configuration: ModelConfiguration(contextWindowSize: 2048)
)

var tokens = [model.bosTokenID ?? 1]
for _ in 0..<100 {
    let next = try await model.nextToken(
        for: tokens,
        sampling: SamplingConfiguration(temperature: 0)  // temperature 0 = greedy decoding
    )
    guard next != model.eosTokenID else { break }
    tokens.append(next)
}
print(model.detokenize(tokens))
```

## Tool Calling

Tool calling is available through concrete `EdgeRunnerTool` conformers and streaming/generation sessions. See `Sources/EdgeRunner/ToolCalling/EdgeRunnerTool.swift` for the protocol surface and the streaming APIs for orchestration.
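The exact protocol requirements live in that file; as a rough sketch only, a conformer pairs a tool description with an async execution method. Every name and signature below is an assumption for illustration, not the actual API:

```swift
// Hypothetical sketch: the real requirements are defined in
// Sources/EdgeRunner/ToolCalling/EdgeRunnerTool.swift.
struct WeatherTool: EdgeRunnerTool {
    let name = "get_weather"                              // assumed property
    let description = "Returns current weather for a city" // assumed property

    // Assumed signature: receives model-produced arguments, returns a result
    // string that is fed back into the generation session.
    func call(arguments: String) async throws -> String {
        "sunny, 22°C"  // stub result for illustration
    }
}
```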
## Building from Source

```sh
git clone https://github.com/christopherkarani/EdgeRunner.git
cd EdgeRunner
swift build

# Run tests
swift test

# Run benchmark (publishable, release build)
swift test -c release --filter "PublishableBenchmark/fullBenchmark"
```

## Status

EdgeRunner is in beta. Core features work well:
- ✅ Fast inference (~230+ tok/s publishable decode on Qwen3-0.6B)
- ✅ GGUF Q8_0 quantization
- ✅ KV cache for efficient generation
- ⚠️ Tokenizer (BPE): basic implementation
- 🚧 Multi-model support (Qwen only for now)
See ROADMAP.md for planned features.
See TROUBLESHOOTING.md for common issues and solutions.
## License

MIT License. See LICENSE for details.