A custom CrewAI LLM provider that speaks native Ollama protocol —
POST /api/chat with NDJSON streaming. No OpenAI shim, no LiteLLM, no proxy needed.
Works with local Ollama, self-hosted instances, and ollama.com Cloud API.
CrewAI's built-in Ollama support routes through the OpenAI-compatible shim
(/v1/chat/completions). This provider talks the real Ollama protocol —
/api/chat with native JSON, NDJSON streaming, and Ollama's native tool
calling and thinking formats.
If you're running Ollama Cloud models (gpt-oss:120b-cloud, kimi-k2.6-cloud,
etc.) or just want direct API access without translation layers, this is for you.
| Feature | Support |
|---|---|
Native /api/chat |
✅ real Ollama protocol, not OpenAI-compatible |
| NDJSON streaming | ✅ token-by-token, thinking/reasoning tokens |
| Tool calling | ✅ native Ollama tool calls (v0.3+) |
| Structured output | ✅ JSON schema via format parameter |
| Thinking models | ✅ think parameter for DeepSeek-R1, Kimi, etc. |
| Cloud auth | ✅ Authorization: Bearer for ollama.com |
| Model discovery | ✅ list_ollama_models() |
| Config overrides | ✅ runtime temperature, max_tokens, etc. |
| Context windows | ✅ auto-detection for popular models |
| Stop words | ✅ options.stop |
| Keep alive | ✅ keep_alive parameter |
| Multimodal | ✅ image support for vision models |
| CrewAI events | ✅ full observability integration |
pip install crewai-ollama-cloudRequires: Python ≥3.10, CrewAI ≥0.80.0, httpx ≥0.25.0
# Optional: set your Ollama Cloud API key
export OLLAMA_API_KEY="sk-xxxx"For local Ollama, no API key is needed.
from crewai import Agent, Task, Crew
from crewai_ollama_cloud import OllamaCloudProvider
# Ollama Cloud
llm = OllamaCloudProvider(
model="deepseek-v4-flash",
base_url="https://ollama.com",
api_key="sk-xxxx", # or set OLLAMA_API_KEY env var
temperature=0.7,
stream=True,
)
# Or local Ollama
# llm = OllamaCloudProvider(model="llama3.1:8b", base_url="http://localhost:11434")
agent = Agent(role="Analyst", goal="Analyze data", backstory="Expert", llm=llm)
task = Task(description="Summarize Q1 report", expected_output="Summary")
crew = Crew(agents=[agent], tasks=[task])
result = crew.kickoff()
print(result)| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str |
(required) | Ollama model name (e.g. "llama3.1:8b", "deepseek-v4-flash") |
base_url |
str |
"http://localhost:11434" |
Ollama host URL (no trailing /v1) |
api_key |
str or None |
env OLLAMA_API_KEY |
API key for cloud instances |
temperature |
float or None |
None |
Sampling temperature (0–2) |
max_tokens |
int or None |
None |
Max tokens to generate |
top_p |
float or None |
None |
Nucleus sampling |
top_k |
int or None |
None |
Top-k sampling |
stop |
list[str] |
[] |
Stop sequences |
stream |
bool |
False |
Enable NDJSON streaming |
timeout |
float |
120.0 |
HTTP timeout in seconds |
keep_alive |
str |
"5m" |
Model keep-alive duration |
think |
bool |
False |
Enable thinking/reasoning tokens |
additional_params |
dict |
{} |
Extra parameters merged into request body |
When calling the API, CrewAI parameters are mapped to Ollama's native format:
| CrewAI field | Ollama request field |
|---|---|
temperature |
options.temperature |
max_tokens |
options.num_predict |
top_p |
options.top_p |
top_k |
options.top_k |
stop |
options.stop |
think |
think (top-level) |
response_model |
format (JSON schema) |
keep_alive |
keep_alive (top-level) |
All configuration fields can be changed at runtime between calls:
llm = OllamaCloudProvider(model="llama3.1:8b", temperature=0.3)
# Warm up: creative mode
llm.temperature = 0.9
result = llm.call("Write a poem")
# Switch to precise mode for next call
llm.temperature = 0.1
llm.top_p = 0.95
result = llm.call("Calculate 2+2")from crewai_ollama_cloud import list_ollama_models, OllamaModelInfo
# List models on a local GPU rig
models = list_ollama_models("http://localhost:11434")
# List cloud models
models = list_ollama_models("https://ollama.com", api_key="sk-xxxx")
for m in models:
print(f"{m.name:35s} | {m.parameter_size:6s} | {m.family:10s} | {m.size_gb:5.1f} GB")
# Output:
# llama3.1:8b | 8b | llama | 4.7 GB
# mistral:7b | 7b | mistral | 4.1 GB
# deepseek-v4-flash | 70b | deepseek | 40.5 GBThe OllamaModelInfo object contains:
| Attribute | Type | Description |
|---|---|---|
name |
str |
Full model name |
digest |
str |
SHA256 digest |
size |
int |
Size in bytes |
modified_at |
str or None |
Last modified timestamp |
family |
str |
Inferred model family |
parameter_size |
str |
Parameter count (e.g. "8b", "70b") |
size_gb |
float |
Size in gigabytes |
| Variable | Description |
|---|---|
OLLAMA_API_KEY |
API key for authenticated Ollama instances (e.g. cloud) |
When stream=True, the provider uses Ollama's native NDJSON streaming.
Tokens are emitted via CrewAI's LLMStreamChunkEvent:
llm = OllamaCloudProvider(model="llama3.1:8b", stream=True)
# Each token triggers a stream chunk event
result = llm.call("Tell me about black holes")
# Events:
# chunk: "Black"
# chunk: " holes"
# chunk: " are"
# ...For thinking models (think=True, like deepseek-r1), reasoning tokens
are separated from final output and emitted as thinking chunk events.
Ollama v0.3+ supports native tool calling. The provider converts CrewAI
BaseTool objects to Ollama's native tool format:
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"}
},
"required": ["city"]
}
}
}Tool execution results are returned directly.
To get JSON responses, use response_model:
from pydantic import BaseModel
class Summary(BaseModel):
key_points: list[str]
sentiment: str
llm = OllamaCloudProvider(model="llama3.1:8b", temperature=0)
result = llm.call("Analyze Q3 results", response_model=Summary)
# result.key_points = ["Revenue up 15%", ...]
# result.sentiment = "positive"The provider auto-detects context window sizes for known models:
| Model | Context Size |
|---|---|
| llama3:70b | 8,192 |
| llama3.1:8b | 131,072 |
| llama3.1:70b | 131,072 |
| llama3.1:405b | 131,072 |
| llama3.2:1b/3b | 131,072 |
| llama3.3:70b | 131,072 |
| mistral:7b | 8,192 |
| mixtral:8x7b | 32,768 |
| qwen2.5:7b/32b | 32,768 |
| deepseek-r1:7b/8b | 131,072 |
| Unknown models | 4,096 (default) |
| Error | Provider Behavior |
|---|---|
| HTTP 4xx/5xx | HTTPStatusError → LLMCallFailedEvent |
| Context overflow | LLMContextLengthExceededError (CrewAI native) |
| Connection failure | Exception → LLMCallFailedEvent |
┌────────────────┐
│ CrewAI Agent │
└───────┬────────┘
│ Agent.llm.call(messages, tools, ...)
▼
┌─────────────────────────────┐
│ OllamaCloudProvider │
│ (extends BaseLLM) │
│ │
│ call() / acall() │
│ ├─ _format_messages() │
│ ├─ _build_body() │
│ ├─ BEFORE hooks │
│ ├─ httpx POST /api/chat │───────┐
│ ├─ _process_response() │ │
│ ├─ AFTER hooks │ │
│ └─ event emission │ │
└─────────────────────────────┘ │
▼
┌─────────────────┐
│ Ollama Instance │
│ (local/remote) │
│ │
│ POST /api/chat │
│ ← JSON / NDJSON │
└─────────────────┘
Zero translation layers. httpx → /api/chat → Ollama. That's the whole call path.
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v39 tests cover: initialization, capabilities, request body building, non-streaming calls, streaming calls with thinking tokens, tool calls, stop words, context overflow handling, auth headers, async call delegation, model discovery.
MIT — see LICENSE file.