higgs

A model router and inference server for Apple Silicon that optimizes model serving using unified memory. Serve local MLX models and proxy to remote providers (OpenAI, Anthropic, Ollama, etc.) through a single endpoint with automatic format translation. Single static Rust binary, no Python runtime. Built on mlx-rs.

Install

brew install panbanda/brews/higgs

Or build from source (Rust 1.87.0+, Xcode CLI Tools):

cargo build --release

Quick Start

Simple mode (no config file)

higgs serve --model mlx-community/Llama-3.2-1B-Instruct-4bit
higgs serve --model mlx-community/Llama-3.2-1B-Instruct-4bit --model mlx-community/Qwen3-1.7B-4bit

Accepts Hugging Face model IDs (resolved from ~/.cache/huggingface/hub/) or local paths, and prompts to download a model that is not cached. Models must be in MLX safetensors format, as published by mlx-community.

Gateway mode (config file)

higgs init        # create ~/.config/higgs/config.toml
higgs serve       # start with config
higgs start       # start as background daemon
higgs attach      # attach TUI dashboard to running daemon
higgs stop        # stop daemon

Profiles

Named profiles let you maintain multiple configurations and run multiple instances simultaneously:

higgs init --profile dev              # create config.dev.toml
higgs init --profile prod             # create config.prod.toml
higgs serve --profile dev             # foreground with dev config
higgs start --profile dev             # daemon with dev config (separate PID/log)
higgs start --profile prod            # daemon with prod config (different port)
higgs attach --profile dev            # attach TUI to dev instance
higgs stop --profile dev              # stop only the dev instance
higgs doctor --profile prod           # validate prod config

Each profile gets isolated runtime files (higgs.<profile>.pid, higgs.<profile>.log, metrics.<profile>.jsonl). Profiles must use different ports (configured in each profile's config file). --profile and --config are mutually exclusive.
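As a small illustration of the naming scheme above, this Python sketch maps a profile name to the file names mentioned in the text. The function name `profile_files` is hypothetical, and only base names are returned because the directories for the runtime files are not stated here.

```python
def profile_files(profile: str) -> dict[str, str]:
    """Return the per-profile file names described above (base names only)."""
    if not profile:
        raise ValueError("profile name must be non-empty")
    return {
        "config": f"config.{profile}.toml",
        "pid": f"higgs.{profile}.pid",
        "log": f"higgs.{profile}.log",
        "metrics": f"metrics.{profile}.jsonl",
    }


if __name__ == "__main__":
    # List the files a "dev" profile would own.
    for kind, name in profile_files("dev").items():
        print(f"{kind}: {name}")
```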

Features

Local inference

OpenAI + Anthropic APIs -- chat completions, text completions, embeddings, messages
Structured output -- json_schema response format (100% schema compliance on 10 test prompts)
Reasoning models -- <think> tag extraction to reasoning_content
Continuous batching -- 755 tok/s aggregate at 8 concurrent requests (Llama-1B on M4 Max)
Radix tree prefix cache -- shared prefix reuse across requests
Vision -- multimodal image+text (LLaVA-Qwen2)
11 architectures -- LLaMA, Mistral, Qwen2/3, Qwen3-MoE, Qwen3-Next, Gemma 2, Phi-3, Starcoder2, DeepSeek-V2, LLaVA-Qwen2
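To make the reasoning-model item concrete, here is a minimal Python sketch of splitting `<think>` blocks out of raw model output. It is not higgs's Rust implementation: the function name `split_reasoning` is hypothetical, it only handles complete (closed) tags in non-streaming output, and joining multiple blocks with a newline is an assumption.

```python
import re

# Matches one <think>...</think> span, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning_content, content)."""
    # Collect the text inside every think block.
    reasoning = "\n".join(block.strip() for block in THINK_RE.findall(text))
    # Whatever remains after removing the blocks is the visible answer.
    content = THINK_RE.sub("", text).strip()
    return reasoning, content


if __name__ == "__main__":
    raw = "<think>2 + 2 is 4</think>The answer is 4."
    reasoning, content = split_reasoning(raw)
    print({"reasoning_content": reasoning, "content": content})
```

Output without any `<think>` tag passes through unchanged, with an empty reasoning string.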

Gateway

Remote providers -- proxy requests to OpenAI, Anthropic, Ollama, or any OpenAI-compatible API
Format translation -- send OpenAI requests to Anthropic providers (and vice versa) with automatic conversion of request/response formats, including streaming
Pattern routing -- regex-based model name matching to route requests to the right provider
Model rewriting -- map model aliases to upstream model names
Auto-router -- classify requests using a local LLM to pick the best provider
Metrics dashboard -- TUI with live request rates, latency, token throughput, and error tracking
Daemon mode -- higgs start/stop/attach for background operation
Config management -- higgs config get/set, higgs doctor for validation
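The format translation item can be illustrated with a simplified Python sketch of the request direction (OpenAI chat completions to Anthropic Messages). This is an assumption-laden outline, not higgs's actual code: `openai_to_anthropic` and `default_max_tokens` are hypothetical names, and only plain string message content is handled. It relies on two known differences between the formats: Anthropic takes the system prompt as a top-level `system` field, and requires `max_tokens`.

```python
def openai_to_anthropic(request: dict, default_max_tokens: int = 1024) -> dict:
    """Convert an OpenAI chat-completions body into an Anthropic Messages body."""
    system_parts = []
    messages = []
    for msg in request["messages"]:
        if msg["role"] == "system":
            # Anthropic takes the system prompt as a top-level field.
            system_parts.append(msg["content"])
        else:
            messages.append({"role": msg["role"], "content": msg["content"]})

    body = {
        "model": request["model"],
        "messages": messages,
        # max_tokens is required by the Anthropic Messages API;
        # the fallback value here is an arbitrary assumption.
        "max_tokens": request.get("max_tokens", default_max_tokens),
    }
    if system_parts:
        body["system"] = "\n\n".join(system_parts)
    # Copy over sampling and streaming options that share a name in both formats.
    for key in ("temperature", "top_p", "stream"):
        if key in request:
            body[key] = request[key]
    return body


if __name__ == "__main__":
    print(openai_to_anthropic({
        "model": "claude-sonnet-4-6",
        "messages": [
            {"role": "system", "content": "Be brief."},
            {"role": "user", "content": "Hello!"},
        ],
    }))
```

A real translator also has to map responses, streaming events, tool calls, and multimodal content, which this sketch leaves out.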

Configuration

Simple mode (CLI flags)

CLI Flag	Env Variable	Default	Description
`--model`	`HIGGS_MODELS`	(required)	Model path or HF ID (repeatable)
`--host`	`HIGGS_HOST`	`0.0.0.0`	Bind address
`--port`	`HIGGS_PORT`	`8000`	Bind port
`--max-tokens`	`HIGGS_MAX_TOKENS`	`32768`	Max generation tokens
`--api-key`	`HIGGS_API_KEY`	(none)	Bearer token for auth
`--rate-limit`	`HIGGS_RATE_LIMIT`	`0`	Requests/min per client
`--timeout`	`HIGGS_TIMEOUT`	`300`	Request timeout (seconds)
`--batch`	--	`false`	Enable continuous batching

Gateway mode (config file)

Run higgs init to create ~/.config/higgs/config.toml:

[server]
host = "0.0.0.0"
port = 8000
# max_tokens = 32768
# timeout = 300.0
# api_key = "sk-..."

# --- Local models ---
[[models]]
path = "mlx-community/Llama-3.2-1B-Instruct-4bit"
# name = "llama"     # optional friendly name (used as engine key and for auto_router lookup)
# batch = false

# --- Remote providers ---
[provider.anthropic]
url = "https://api.anthropic.com"
format = "anthropic"

[provider.openai]
url = "https://api.openai.com"
format = "openai"

[provider.ollama]
url = "http://localhost:11434"
strip_auth = true

# --- Routes ---
# First regex match wins. Requests matching a local model name are served locally.

[[routes]]
pattern = "claude-.*"
provider = "anthropic"

[[routes]]
pattern = "gpt-.*"
provider = "openai"

# Model rewriting: requests for "my-alias" are sent to the provider as "actual-model-name"
# [[routes]]
# pattern = "my-alias"
# provider = "openai"
# model = "gpt-4o"

# --- Default route ---
[default]
provider = "higgs"   # "higgs" = local models only; set to a provider name to proxy unmatched requests

# --- Auto router (optional) ---
# Classify requests with a local LLM to pick the best provider automatically.
# The model field can reference a model by name or path.
# [auto_router]
# enabled = true
# model = "llama"    # matches [[models]] name or path
# timeout_ms = 2000

# --- Metrics & dashboard ---
[retention]
enabled = true
minutes = 60

[logging.metrics]
enabled = true
# path = "~/.config/higgs/logs/metrics.jsonl"
# max_size_mb = 50
# max_files = 5

Provider options

Field	Type	Default	Description
`url`	string	(required)	Base URL of the upstream API
`format`	`"openai"` or `"anthropic"`	`"openai"`	API format the provider speaks
`api_key`	string	(none)	API key to inject into proxied requests
`strip_auth`	bool	`false`	Remove the client's Authorization header before proxying
`stub_count_tokens`	bool	`false`	Return a stub for `/v1/messages/count_tokens`

Route options

Field	Type	Description
`pattern`	regex	Match against the `model` field in requests
`provider`	string	Provider name to forward to
`model`	string	Rewrite the model field before forwarding
`name`	string	Human label (used by auto-router)
`description`	string	Route description (used by auto-router for classification)
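The routing rules described in the config comments (local model names served locally, first regex match wins, optional model rewrite, then the default provider) can be sketched in Python as follows. The `Route` class and `resolve` function are hypothetical names. Two details are assumptions rather than documented behavior: that local model names are checked before routes, and that patterns must match the whole model name (`re.fullmatch`).

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    pattern: str                 # regex matched against the request's model field
    provider: str                # provider name to forward to
    model: Optional[str] = None  # optional rewrite of the model field


def resolve(model, local_models, routes, default_provider="higgs"):
    """Return (provider, upstream_model) for a requested model name."""
    # Assumption: a local model name takes priority over any route.
    if model in local_models:
        return ("higgs", model)
    # First matching route wins; later routes are never consulted.
    for route in routes:
        if re.fullmatch(route.pattern, model):
            return (route.provider, route.model or model)
    # Nothing matched: fall back to the [default] provider.
    return (default_provider, model)


if __name__ == "__main__":
    routes = [
        Route("claude-.*", "anthropic"),
        Route("gpt-.*", "openai"),
        Route("my-alias", "openai", model="gpt-4o"),
    ]
    for name in ("claude-sonnet-4-6", "my-alias", "llama", "unknown"):
        print(name, "->", resolve(name, {"llama"}, routes))
```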

API

OpenAI: /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models
Anthropic: /v1/messages, /v1/messages/count_tokens
Metrics: /metrics (JSON)
Health: /health

Format translation works transparently: send an OpenAI-format request to higgs and it will translate to Anthropic format if the matched route points to an Anthropic provider (and vice versa), including streaming responses.
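For clients that prefer Python over curl, a request to the chat completions endpoint can be built with the standard library alone. This is a minimal sketch: the helper names `chat_request` and `send` are hypothetical, and the base URL assumes the default port of 8000.

```python
import json
import urllib.request


def chat_request(model, prompt, base_url="http://localhost:8000", api_key=None):
    """Build a POST request for /v1/chat/completions (OpenAI format)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Forwarded as a bearer token, as in the curl examples.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers=headers,
        method="POST",
    )


def send(req, timeout=300):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    req = chat_request("mlx-community/Llama-3.2-1B-Instruct-4bit", "Hello!")
    print(req.get_method(), req.full_url)
    # With a server running: print(send(req))
```

The same helper works for a proxied model: pass a model name that matches a route (for example one matching `claude-.*`) and the upstream provider's key as `api_key`.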

# Local model
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
       "messages": [{"role": "user", "content": "Hello!"}]}'

# Proxied to Anthropic (translated from OpenAI format automatically)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ANTHROPIC_API_KEY" \
  -d '{"model": "claude-sonnet-4-6",
       "messages": [{"role": "user", "content": "Hello!"}]}'

Shell integration

Add to your shell profile to point AI tools at higgs:

eval "$(higgs shellenv)"
# Exports ANTHROPIC_BASE_URL and OPENAI_BASE_URL when the server is reachable

Or run a single command with the env vars set:

higgs exec -- claude
higgs exec -- aider --model openai/gpt-4o

higgs exec verifies the server is reachable, sets ANTHROPIC_BASE_URL and OPENAI_BASE_URL, then execs the command.

CLI Commands

Command	Description
`higgs serve`	Start the server in the foreground
`higgs start`	Start as a background daemon
`higgs stop`	Stop a running daemon
`higgs attach`	Attach TUI dashboard to a running daemon
`higgs init`	Create default config at `~/.config/higgs/config.toml`
`higgs shellenv`	Print `export` lines for `ANTHROPIC_BASE_URL` / `OPENAI_BASE_URL`
`higgs exec -- <cmd>`	Set env vars and exec a command (alternative to `eval "$(higgs shellenv)"`)
`higgs config get <key>`	Read a config value (dot-separated key)
`higgs config set <key> <value>`	Write a config value
`higgs config path`	Print the resolved config file path
`higgs doctor`	Validate config, check model paths, probe providers

Global flags

Flag	Description
`--config <FILE>`	Path to config file (conflicts with `--profile`)
`--profile <NAME>`	Named profile, resolves to `config.<NAME>.toml` (conflicts with `--config`)
`--verbose`	Enable debug logging

Supported Architectures

Architecture	`model_type`	Examples
LLaMA	`llama`	Llama 3/3.1, CodeLlama
Mistral	`mistral`	Mistral 7B
Qwen2	`qwen2`	Qwen2, Qwen2.5
Qwen3	`qwen3`	Qwen3
Qwen3-Next	`qwen3_next`	Qwen3-Coder (SSM hybrid)
Qwen3-MoE	`qwen3_moe`	Qwen3-30B-A3B (sparse MoE)
Gemma 2	`gemma2`	Gemma 2 2B/9B/27B
Phi-3	`phi3`	Phi-3 Mini/Small/Medium
Starcoder2	`starcoder2`	Starcoder2 3B/7B/15B
DeepSeek-V2	`deepseek_v2`	DeepSeek-V2-Lite (MLA + MoE)
LLaVA-Qwen2	`llava-qwen2`	nanoLLaVA-1.5 (vision)

Performance

All benchmarks on M4 Max 128GB. Temperature=0, warmup pass excluded.

Decode throughput (tok/s)

Single request, 500 generated tokens, median of 3 runs.

Model	higgs	mlx_lm	vllm-mlx	llama.cpp	Ollama
Llama-3.2-1B-4bit	448	421	433	314	305
Mistral-7B-v0.3-4bit	103	103	--	87	85
Qwen3-1.7B-4bit	305	293	300	216	183
Qwen3-30B-A3B-8bit	75	86	87	83	73
Gemma-2-2B-4bit	163	185	91	--	--
Phi-3-mini-4bit	171	170	95	--	--
Starcoder2-3B-4bit	107	176	165	--	--
DeepSeek-V2-Lite-4bit	140	174	99	--	--

MLX models use 4-bit (8-bit for MoE). llama.cpp/Ollama use Q4_K_M (Q8_0 for MoE).

Continuous batching (Llama-1B)

Concurrent requests	higgs tok/s	vllm-mlx tok/s
1	280	250
2	585	459
4	698	510
8	755	646
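To read the table in terms of scaling rather than raw throughput, the short Python sketch below derives two figures from it: the speedup of aggregate tok/s relative to a single request, and the average tok/s each request sees. The numbers are copied from the table; the helper names are hypothetical.

```python
# Aggregate tok/s from the continuous batching table (Llama-1B).
higgs = {1: 280, 2: 585, 4: 698, 8: 755}
vllm_mlx = {1: 250, 2: 459, 4: 510, 8: 646}


def scaling(results):
    """Aggregate throughput relative to the single-request figure."""
    base = results[1]
    return {n: round(tps / base, 2) for n, tps in results.items()}


def per_request(results):
    """Average tok/s seen by each of the n concurrent requests."""
    return {n: round(tps / n, 1) for n, tps in results.items()}


if __name__ == "__main__":
    for name, results in (("higgs", higgs), ("vllm-mlx", vllm_mlx)):
        print(name, "scaling:", scaling(results))
        print(name, "per request:", per_request(results))
```

At 8 concurrent requests this gives roughly 2.7x the single-request aggregate for higgs and about 2.58x for vllm-mlx, with each higgs request averaging about 94 tok/s.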

Memory (RSS in MB)

Model	higgs	mlx_lm	vllm-mlx
Llama-3.2-1B-4bit	974	1,356	1,380
Mistral-7B-v0.3-4bit	3,965	4,384	--
Qwen3-1.7B-4bit	1,127	1,609	1,641
Qwen3-30B-A3B-8bit	31,139	31,640	31,658
Gemma-2-2B-4bit	1,645	2,329	2,350
Phi-3-mini-4bit	2,126	2,548	2,573
DeepSeek-V2-Lite-4bit	8,528	8,972	8,998

Feature comparison

	higgs	vllm-mlx
Structured output (10 prompts, JSON schema)	100%	0%
Reasoning extraction (5 questions, Qwen3)	5/5	4/5
All architectures produce coherent output	Yes	Yes

Development

cargo test -- --test-threads=1
cargo clippy
cargo fmt --check

License

MIT
