A minimal, end‑to‑end demonstration of modern agent architectures for runtime and infrastructure diagnostics, built around real system logs, OpenStack control‑plane events, and swappable LLM backends.
This repository is a reference implementation showing how agents can reason over operational evidence.
Many “AI troubleshooting” tools hallucinate answers when evidence is missing.
This project does the opposite:
- Enforces evidence scope
- Detects evidence gaps
- Separates baseline vs incident behavior
- Guides the next diagnostic step
- Refuses to guess
The result is an agent that behaves like a senior infrastructure engineer, not a chatbot.
- Single Agent – plain LLM reasoning (no tools)
- ReAct Agent – reasoning + tool use
- RAG Agent – retrieval‑augmented reasoning
- Multi‑Agent – orchestration patterns
- Linux boot and system logs
- OpenStack control‑plane and service logs
- Baseline vs abnormal comparison
- Subsystem‑aware diagnostics (API, compute, MQ, DB)
- Cross‑layer reasoning (host ↔ control plane)
- arXiv research corpus
- Isolated from runtime evidence
- Explicit domain selection
Agents reason only from retrieved runtime evidence. If data is missing, the agent says so.
Every answer declares:
- what evidence was used
- what evidence is missing
- what would be needed next
If logs do not support a conclusion, no conclusion is drawn.
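As a minimal sketch of the declaration rule above, an answer can be wrapped in a small envelope that always lists the evidence used, the evidence missing, and the next step, and that carries a conclusion only when at least one log line supports it. The names `EvidenceReport` and `build_report` are hypothetical and chosen for illustration; they are not taken from this repository's code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvidenceReport:
    """Hypothetical answer envelope; field names are illustrative only."""
    evidence_used: List[str] = field(default_factory=list)
    evidence_missing: List[str] = field(default_factory=list)
    needed_next: List[str] = field(default_factory=list)
    conclusion: Optional[str] = None  # stays None when logs do not support a claim

    def render(self) -> str:
        def section(title: str, items: List[str]) -> str:
            body = "\n".join(f"  - {item}" for item in items) or "  - (none)"
            return f"{title}:\n{body}"

        return "\n".join([
            section("Evidence used", self.evidence_used),
            section("Evidence missing", self.evidence_missing),
            section("Needed next", self.needed_next),
            f"Conclusion: {self.conclusion or 'none drawn (insufficient evidence)'}",
        ])


def build_report(supporting_lines, missing, needed_next, claim) -> EvidenceReport:
    """Attach the claim only if at least one log line supports it."""
    supporting = list(supporting_lines)
    return EvidenceReport(
        evidence_used=supporting,
        evidence_missing=list(missing),
        needed_next=list(needed_next),
        conclusion=claim if supporting else None,
    )


report = build_report(
    supporting_lines=[],
    missing=["nova-compute logs"],
    needed_next=["collect nova-compute.log from the affected host"],
    claim="RabbitMQ is down",
)
print(report.render())
```

With no supporting lines, the claim is dropped and the report still states what is missing and what to collect next.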
Operational logs and research papers are treated as separate epistemic domains.
Agents/
├── core/                 # Agent kernel, ReAct logic, tool routing
├── rag/                  # Retrieval layers (Linux, OpenStack, arXiv, comparison)
├── sources/              # Ingestion & normalization code (Linux / OpenStack)
│   ├── linux/
│   └── openstack/
├── data/                 # Runtime artifacts & evidence (gitignored)
│   ├── arxiv_index/      # FAISS index for research knowledge
│   ├── linux_index/      # FAISS index for Linux runtime logs
│   └── sources/
│       ├── linux/        # Raw Linux logs
│       └── openstack/    # Raw OpenStack logs (normal / abnormal)
├── execution/            # LLM runtimes (local / OpenAI)
├── examples/             # Runnable agent demos (single / ReAct / RAG / multi)
├── orchestration/        # Multi-agent / graph experiments
├── tests/                # Kernel and routing tests
├── run.py                # Unified CLI entrypoint
└── README.md
Direct question → LLM answer.
python run.py --mode single --query "What is OpenStack?"
Used only as a baseline.
A reasoning loop that:
- thinks
- retrieves evidence
- observes
- reasons again
python run.py --mode react --domain runtime \
--query "Why is OpenStack unstable?"
This is where most of the interesting work happens.
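The think / retrieve / observe / reason cycle of the ReAct agent can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `core/`: the `SEARCH:` / `FINAL:` text protocol, the `react_loop` function, and the stub `fake_think` and `fake_retrieve` callables are all hypothetical stand-ins for the real LLM and retriever.

```python
from typing import Callable, List


def react_loop(
    question: str,
    think: Callable[[str, List[str]], str],
    retrieve: Callable[[str], List[str]],
    max_steps: int = 3,
) -> dict:
    """Alternate thinking and retrieval until `think` emits a final answer.

    Assumed protocol (for this sketch only): `think` returns either
    "SEARCH: <query>" or "FINAL: <answer>".
    """
    observations: List[str] = []
    trace: List[str] = []
    for _ in range(max_steps):
        thought = think(question, observations)          # think / reason again
        trace.append(thought)
        if thought.startswith("FINAL:"):
            return {"answer": thought[len("FINAL:"):].strip(), "trace": trace}
        if thought.startswith("SEARCH:"):
            query = thought.split(":", 1)[1].strip()
        else:
            query = thought                               # fall back to the raw thought
        hits = retrieve(query)                            # retrieve evidence
        observations.extend(hits)                         # observe
        trace.append(f"OBSERVED: {len(hits)} line(s)")
    # Step budget exhausted: refuse to guess.
    return {"answer": "Insufficient evidence within step budget.", "trace": trace}


# Stub components so the sketch runs without a model or an index.
LOGS = ["nova-api: request ok", "rabbitmq: connection refused"]


def fake_think(question: str, observations: List[str]) -> str:
    if not observations:
        return "SEARCH: rabbitmq"
    return f"FINAL: {len(observations)} matching log line(s) found"


def fake_retrieve(query: str) -> List[str]:
    return [line for line in LOGS if query in line]


result = react_loop("Why is OpenStack unstable?", fake_think, fake_retrieve)
print(result["answer"])
print(result["trace"])
```

The returned `trace` keeps every thought and observation in order, so a bad answer can be traced back to the step that produced it.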
Classic retrieval-augmented generation over a document corpus (e.g. arXiv).
python run.py --mode rag --domain knowledge \
--query "Why do distributed systems fail?"
Used to explain why a pattern is known, never to invent runtime facts.
Combines runtime + knowledge agents.
Runtime answers what is happening. Knowledge explains why this pattern is known.
python run.py --mode react --domain runtime \
--compare normal abnormal \
--query "Why is OpenStack unstable?"
This forces the agent to:
- retrieve normal OpenStack behavior
- retrieve abnormal OpenStack behavior
- reason only over differences
Environmental facts shared by both baselines are not allowed as causes.
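One way to picture the differences-only rule is a set difference over normalized log signatures: anything that appears in both baselines drops out before reasoning starts. The sketch below assumes simple regex normalization of timestamps, UUIDs, and numbers; the `signature` and `candidate_causes` helpers and the sample log lines are hypothetical and are not taken from the repository.

```python
import re
from typing import Iterable, Set

TIMESTAMP = re.compile(r"\b\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(\.\d+)?\b")
UUID = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"
)
NUMBER = re.compile(r"\b\d+\b")


def signature(line: str) -> str:
    """Replace volatile tokens so identical events compare equal."""
    line = TIMESTAMP.sub("<ts>", line)
    line = UUID.sub("<uuid>", line)   # before NUMBER, since UUIDs contain digits
    line = NUMBER.sub("<n>", line)
    return line.strip()


def candidate_causes(normal: Iterable[str], abnormal: Iterable[str]) -> Set[str]:
    """Signatures seen only in the abnormal baseline."""
    normal_sigs = {signature(line) for line in normal}
    return {signature(line) for line in abnormal} - normal_sigs


normal = [
    "2024-01-01 10:00:00 kernel: low memory warning",
    "2024-01-01 10:00:01 nova-api: GET /servers 200",
]
abnormal = [
    "2024-01-02 11:00:00 kernel: low memory warning",
    "2024-01-02 11:00:05 nova-compute: AMQP server unreachable",
]

print(candidate_causes(normal, abnormal))
```

The low-memory warning occurs in both baselines, so it is excluded; only the AMQP line survives as a candidate cause.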
python run.py --mode react --domain runtime \
--compare normal abnormal \
--current-logs data/incidents/current.log \
--query "What is wrong with my cloud-init?"
Key design rule:
Current logs are ephemeral context, not indexed knowledge.
They are injected once, reasoned over, and discarded.
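A minimal sketch of that rule, assuming the caller has already read the file passed via `--current-logs` into a list of lines: the current logs are appended to the prompt context for a single call and are never written to an index. The `build_context` function, its parameters, and the section labels are hypothetical.

```python
from typing import Iterable, List, Optional


def build_context(
    indexed_evidence: Iterable[str],
    current_lines: Optional[Iterable[str]] = None,
    max_lines: int = 200,
) -> str:
    """Assemble prompt context for one query.

    Current logs are only concatenated into the returned string; nothing
    here persists them, so they vanish once the call returns.
    """
    sections: List[str] = ["## Indexed evidence", *indexed_evidence]
    if current_lines is not None:
        tail = list(current_lines)[-max_lines:]   # keep the most recent lines
        sections += ["## Current logs (ephemeral, not indexed)", *tail]
    return "\n".join(sections)


context = build_context(
    indexed_evidence=["cloud-init: normal boot reaches 'final' stage"],
    current_lines=["cloud-init: stage init-local", "cloud-init: datasource not found"],
)
print(context)
```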
This project enforces several non-negotiable rules:
If the logs don’t show it, the agent won’t invent it.
If a subsystem should emit logs but doesn’t, the agent can conclude:
“This likely never ran.”
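The absence-of-evidence rule can be illustrated with a check for subsystems that are expected to log but never appear. This is a sketch only: the `EXPECTED_MARKERS` table and the substring matching are assumptions made for the example, and a real check would need per-subsystem knowledge of what a healthy run emits.

```python
from typing import Dict, Iterable, List

# Hypothetical mapping: subsystem name -> substring expected in its log lines.
EXPECTED_MARKERS: Dict[str, str] = {
    "cloud-init": "cloud-init",
    "nova-compute": "nova-compute",
    "rabbitmq": "rabbitmq",
}


def silent_subsystems(
    lines: Iterable[str],
    expected: Dict[str, str] = EXPECTED_MARKERS,
) -> List[str]:
    """Return expected subsystems that emitted no log line at all."""
    lines = list(lines)  # allow a generator; we scan once per subsystem
    seen = {
        name
        for name, marker in expected.items()
        if any(marker in line for line in lines)
    }
    return sorted(set(expected) - seen)


logs = ["kernel: boot complete", "nova-compute: service started"]
for name in silent_subsystems(logs):
    print(f"{name}: no log lines found; this likely never ran")
```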
In comparison mode:
Only differences between normal and abnormal baselines may be causal.
Old hardware, low RAM, or kernel quirks shared by both baselines are suppressed.
When evidence is missing, the agent says so and lists what would be needed next.
This is the opposite of hallucination.
The CLI exposes capability tiers, not model internals:
python run.py --llm tiny # very fast smoke tests
python run.py --llm mistral # log summarisation
python run.py --llm llama3 # decent local reasoning
python run.py --llm phi3 # strong local diagnostics
python run.py --llm qwen14 # stronger local diagnostics
python run.py --llm openai # strongest reasoning
Under the hood (via Ollama):
| CLI flag | Actual model |
|---|---|
| tiny | phi3:mini |
| mistral | mistral |
| llama3 | llama3:8b |
| phi3 | phi3:medium |
| qwen14 | qwen2.5:14b |
Missing local models are downloaded automatically by Ollama.
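The tier table above amounts to a small lookup. A sketch of how a tier name could be resolved to a backend follows; the `resolve_backend` function and its return shape are hypothetical, and the OpenAI model name is left unset because this README does not state it.

```python
from typing import Optional, Tuple

# Taken from the table above: CLI tier -> Ollama model tag.
OLLAMA_MODELS = {
    "tiny": "phi3:mini",
    "mistral": "mistral",
    "llama3": "llama3:8b",
    "phi3": "phi3:medium",
    "qwen14": "qwen2.5:14b",
}


def resolve_backend(tier: str) -> Tuple[str, Optional[str]]:
    """Map a --llm tier to (runtime, model tag)."""
    if tier == "openai":
        return ("openai", None)  # model name not specified in this README
    try:
        return ("ollama", OLLAMA_MODELS[tier])
    except KeyError:
        valid = ", ".join(sorted([*OLLAMA_MODELS, "openai"]))
        raise ValueError(f"unknown tier {tier!r}; expected one of: {valid}") from None


print(resolve_backend("qwen14"))
```

Keeping the mapping in one place is what lets agent logic stay unchanged when a tier is pointed at a different model.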
Small local models are great for:
- speed
- summaries
- iteration
They are not good at:
- causal reasoning
- absence-of-evidence inference
- epistemic constraints
This project is designed to expose those limits, not hide them.
You can route serious diagnostic questions to stronger models without changing agent logic.
- No hidden state
- No silent retries
- No background learning
- Every decision is inspectable
If the agent gives a bad answer, you can trace why.
MIT