A compact, runnable reference architecture for a central GenAI platform. It is intentionally small and readable rather than a framework, and demonstrates the core capabilities a platform team has to get right:
- Provider abstraction & routing - one neutral interface over OpenAI, Anthropic, and open-weights; a transparent policy that picks a model from a registry given capability/cost/latency/residency/eval constraints.
- Hybrid RAG - sentence-aware chunking, BM25 + dense retrieval fused with Reciprocal Rank Fusion, optional rerank.
- Agentic loop - a thin, fully traced plan/act loop with bounded steps, tool use, a reflection pass, and guardrails on the edges.
- Guardrails - input (prompt-injection), retrieved-context scanning, and output (PII redaction, length caps).
- Observability - nested spans with latency, token usage, and USD cost rolled up per run and per tenant; exportable as JSON.
- Evaluation - a golden-set harness computing recall@k / precision@k / MRR for retrieval and key-term coverage for generation, with a CI pass/fail gate.
- MCP server - exposes platform tools over the Model Context Protocol.
Everything runs offline with zero API keys via a deterministic mock provider, so you can clone, run, and read it immediately. Drop in real keys to swap in GPT or Claude with no code changes.
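To make the "neutral interface plus deterministic mock" idea concrete, here is a minimal sketch. The names `Provider`, `Completion`, `MockProvider`, and `answer` are illustrative assumptions, not the repo's actual API, and the whitespace token count is only a stand-in for a real tokenizer.

```python
import hashlib
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Completion:
    text: str
    input_tokens: int
    output_tokens: int


class Provider(Protocol):
    """Hypothetical neutral interface; the repo's real protocol may differ."""

    name: str

    def complete(self, prompt: str) -> Completion: ...


class MockProvider:
    """Offline stand-in: the same prompt always yields the same reply."""

    name = "mock"

    def complete(self, prompt: str) -> Completion:
        # Derive the reply from a hash of the prompt so runs are repeatable.
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:8]
        text = f"[mock:{digest}] {prompt[:40]}"
        # Whitespace token counts are a rough stand-in for a real tokenizer.
        return Completion(text, len(prompt.split()), len(text.split()))


def answer(provider: Provider, prompt: str) -> str:
    # Callers depend only on the protocol, so backends can be swapped freely.
    return provider.complete(prompt).text
```

Because `answer` only sees the protocol, a real backend that implements `complete` could replace `MockProvider` without touching the caller.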
The platform has two lanes: an ingestion lane that turns source documents into an indexed knowledge base, and a serving lane that answers queries with routing, retrieval, agents, guardrails and evaluation. Cross-cutting concerns (observability, governance) wrap both. Solid nodes are implemented in this repo; dashed nodes marked (target) are where production infrastructure plugs in.
flowchart TB
subgraph INGEST["Ingestion lane (genai_platform/ingestion.py)"]
direction LR
SRC["Source docs<br/>(arXiv / S3 / CMS) (target)"]:::target
DISC[discover] --> PARSE[parse] --> CHK[chunk] --> IDX[index] --> RPT[report] --> CLN[cleanup]
ORCH["Airflow DAG orchestration (target)"]:::target
SRC --> DISC
ORCH -.schedules.-> DISC
end
subgraph STORE["Knowledge base"]
HS["HybridStore<br/>BM25 + dense + RRF (in-memory)"]
PG["Postgres + pgvector<br/>FTS + vector + RRF"]
RD["Redis cache"]
OS["OpenSearch lexical-at-scale (target)"]:::target
end
subgraph SERVE["Serving lane"]
APIL["FastAPI (api/main.py)<br/>/rag/answer · /agent/stream · /route · /evals"]
RTR["Model router<br/>registry + policy"]
RAG["Grounded answer<br/>cite + abstain + faithfulness"]
AGT["Agent loop<br/>tools · reflection · SSE stream"]
GR["Guardrails<br/>input · context · output"]
end
subgraph PROV["Model backends"]
MK[Mock offline]
OAI[OpenAI]
ANT[Anthropic]
OW["open-weights / Ollama / Bedrock (target)"]:::target
end
subgraph XCUT["Cross-cutting"]
OBS["Observability<br/>spans · cost · latency"]
OTEL["Langfuse / OTel export (target)"]:::target
EVAL["Evaluations<br/>recall · faithfulness · CI gate"]
end
IDX --> HS
HS -.scale out.-> PG
APIL --> RAG --> HS
APIL --> AGT --> HS
APIL --> RTR --> PROV
RAG --> RTR
AGT --> GR
RAG --> GR
APIL -.cache.-> RD
AGT --> OBS
RAG --> OBS
OBS -.export.-> OTEL
EVAL --> HS
classDef target stroke-dasharray: 4 3,opacity:0.75;
The agent loop, drawn out:
flowchart LR
T[task] --> IG{input<br/>guardrail}
IG -->|blocked| X[stop]
IG -->|ok| L[LLM step]
L -->|tool calls| TC[execute tools] --> L
L -->|final| S[stream answer tokens]
S --> OG[output guardrail] --> ANS[answer + trace]
L -.span.-> OBS[(tracer)]
TC -.span.-> OBS
S -.span.-> OBS
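The diagram above can be sketched as a plain bounded loop. This is an illustrative approximation under stated assumptions, not the repo's implementation: `run_agent`, `Trace`, `looks_like_injection`, and the tuple-shaped LLM steps are hypothetical, streaming and the reflection pass are omitted, and the output guardrail is reduced to a length cap.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    spans: list = field(default_factory=list)

    def span(self, kind: str, detail: str) -> None:
        self.spans.append((kind, detail))


def looks_like_injection(task: str) -> bool:
    # Toy input guardrail; a real check would be far more thorough.
    return "ignore previous instructions" in task.lower()


def run_agent(task, llm, tools, max_steps=4):
    """llm(history) returns ("tool", name, arg) or ("final", text)."""
    trace = Trace()
    if looks_like_injection(task):
        trace.span("guardrail", "blocked")
        return None, trace
    history = [("task", task)]
    for _ in range(max_steps):  # the step bound keeps the loop finite
        step = llm(history)
        trace.span("llm", step[0])
        if step[0] == "final":
            answer = step[1][:500]  # output guardrail: length cap only
            trace.span("output", "ok")
            return answer, trace
        _, name, arg = step
        result = tools[name](arg)
        trace.span("tool", name)
        history.append(("tool", name, result))
    trace.span("limit", "max_steps reached")
    return None, trace


def scripted_llm(history):
    # Deterministic stand-in for a model: call one tool, then finish.
    if len(history) == 1:
        return ("tool", "add_one", 41)
    return ("final", f"The result is {history[-1][2]}")


answer, trace = run_agent("compute", scripted_llm, {"add_one": lambda x: x + 1})
```

Keeping the loop explicit means the step bound, the guardrail placement, and every span are visible in one function.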
The demo is deliberately dependency-light so it runs offline in one process. The table shows what each piece would become at production scale; the interfaces are designed so these are swaps, not rewrites.
| Concern | In this repo | Production target |
|---|---|---|
| Orchestration | in-process ingest() stages | Airflow / Dagster DAG per stage |
| Document parsing | .txt / .md loader | Docling / Unstructured for PDFs |
| Vector + lexical store | in-memory HybridStore or Postgres + pgvector (Docker) | OpenSearch for lexical at scale |
| Embeddings | deterministic hash embed (mock) | Jina / OpenAI / Voyage |
| Generation | Mock / OpenAI / Anthropic | add Ollama (local) / Bedrock (open-weights) |
| Cache | none or Redis (Docker) | Redis cluster + semantic cache |
| Tracing | in-process span tree + cost | Langfuse / OpenTelemetry export |
| Serving | FastAPI single process | FastAPI behind a gateway, autoscaled |
Two of the dashed boxes are now solid: a Postgres + pgvector store and a Redis cache ship in the repo behind the same interfaces, switched on by environment variables, and wired into a Docker Compose stack (see "Run with Docker" below).
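One way such an environment-variable switch might look is sketched below. `GENAI_STORE` and `GENAI_PG_DSN` are the variable names used in this README; the class names, the `make_store` factory, and the `memory` default are assumptions made for illustration, and no database connection is opened.

```python
import os


class InMemoryStore:
    """Placeholder for the in-memory hybrid store."""

    backend = "memory"


class PgVectorStore:
    """Placeholder for a Postgres + pgvector store; nothing connects here."""

    backend = "pgvector"

    def __init__(self, dsn: str):
        self.dsn = dsn


def make_store(env=None):
    # Accept a mapping so the factory can be exercised without touching os.environ.
    env = os.environ if env is None else env
    kind = env.get("GENAI_STORE", "memory")  # the default value is an assumption
    if kind == "pgvector":
        return PgVectorStore(env["GENAI_PG_DSN"])
    if kind == "memory":
        return InMemoryStore()
    raise ValueError(f"unknown GENAI_STORE: {kind!r}")
```

Failing loudly on an unknown value avoids silently falling back to the in-memory store in a deployment that expected Postgres.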
The console is framed as a central GenAI platform serving multiple professional divisions (Health, Tax, Legal, Compliance) - opinionated defaults, configurable edges. Pick a division in the top bar; it sets the active knowledge base and the trust posture. The defining feature is the Grounded Q&A module: answers are drawn only from curated sources and cite them, and when grounding confidence falls below threshold the system abstains instead of guessing - because in regulated domains a confident wrong answer is worse than "I don't know."
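The cite-or-abstain behaviour can be illustrated with a small sketch. Everything here is hypothetical: `Hit`, `GroundedAnswer`, and `grounded_answer` are invented names, scores are assumed to be normalised to [0, 1], and confidence is simply the best hit score, whereas a real system would likely combine several signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    source_id: str
    text: str
    score: float  # assumed to be a relevance score normalised to [0, 1]


@dataclass(frozen=True)
class GroundedAnswer:
    text: str
    citations: tuple
    confidence: float
    abstained: bool


def grounded_answer(hits, draft, threshold=0.5):
    """Return the draft with citations, or abstain when grounding is weak."""
    # Confidence is the best hit score; an empty result set scores 0.0.
    confidence = max((h.score for h in hits), default=0.0)
    if confidence < threshold:
        return GroundedAnswer(
            "I don't know based on the curated sources.", (), confidence, True
        )
    # Cite only the sources that individually clear the threshold.
    cited = tuple(h.source_id for h in hits if h.score >= threshold)
    return GroundedAnswer(draft, cited, confidence, False)
```

An in-scope question with a strong hit returns the draft plus its citations; an out-of-scope question with only weak hits returns the abstention text and no citations.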
All seeded content is synthetic and illustrative - not medical, legal, tax, or financial advice, and not affiliated with any company or product.
A ~3-minute walkthrough:
- Overview - frame it: one platform, many divisions, six capabilities.
- Grounded Q&A (Health) - ask an in-scope question; the answer cites its sources with a grounding-confidence meter. Then ask an out-of-scope question and watch it abstain. This is the trust money-shot.
- Model Compare / Router - show an EU-residency constraint eliminating US-only models; switch to on-prem for sensitive data. Model pluralism, made explainable axis by axis.
- Agent - run a task; the answer streams while the trace tree fills in with per-division cost and latency.
- Evaluations - run the trust gate (retrieval recall + coverage) that decides whether a workflow ships.
- Overview - the scenario framing and the six platform pillars.
- Grounded Q&A - cited answers from curated sources; abstains below a grounding threshold.
- Model Router - set hard constraints and ranking weights, see the chosen model, the ranked survivors, and the rejection reasons.
- Model Compare - put two models head to head with a per-axis (capability / cost / latency / eval) score decomposition, plus a stacked-bar view of the whole registry.
- Retrieval - query the indexed corpus and inspect each hit's lexical (BM25) and dense rank before Reciprocal Rank Fusion.
- Agent - run a task and watch the answer stream token by token over SSE while the span tree fills in live, with latency / token / cost rollups.
- Guardrails - scan text as input, retrieved context, and output; see findings and the PII-redacted form.
- Evaluations - run the golden RAG set for recall@k / precision@k / MRR / coverage and a CI pass-fail gate.
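For readers new to the fusion step that the Retrieval module exposes, here is a minimal sketch of Reciprocal Rank Fusion over a lexical and a dense ranking. The function name and document ids are illustrative, and `k=60` is the constant commonly used for RRF; the value this repo uses is not stated here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each list is ordered best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every doc it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_rank = ["doc_a", "doc_b", "doc_c"]
dense_rank = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_rank, dense_rank])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear in both lists (`doc_b`, `doc_a`) outrank those found by only one retriever, which is the main reason to fuse ranks instead of raw scores.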
python run.py
That installs the couple of things it needs (once), serves the console at
http://localhost:8000, and opens your browser. No virtualenv or extra steps.
On Windows you can also just double-click run.bat; on macOS/Linux, ./run.sh.
It runs fully offline on the mock provider; set GENAI_PROVIDER plus an API key
to use real models.
This brings up the production-shaped stack: the API backed by a real pgvector store and a Redis cache. Requires Docker.
docker compose up --build -d # app + postgres(pgvector) + redis
curl -s localhost:8000/api/health
curl -s -X POST localhost:8000/api/ingest -H 'content-type: application/json' -d '{}'
# then open http://localhost:8000 and use Grounded Q&A (now served from pgvector)
The api service runs with GENAI_STORE=pgvector and GENAI_CACHE=redis, so
/api/rag/answer retrieves from Postgres (FTS + vector, fused with RRF) and
caches grounded answers in Redis. Add GENAI_PROVIDER and an API key to the
api service environment to use real models. OpenSearch is available under an
optional profile: docker compose --profile search up -d.
Switch backends without Docker too, against your own services:
export GENAI_STORE=pgvector GENAI_PG_DSN="host=localhost dbname=genai user=genai password=genai"
export GENAI_CACHE=redis REDIS_URL=redis://localhost:6379/0
pip install -e ".[api,infra]"
python -m examples.ingest_to_store # populate pgvector
python -m uvicorn api.main:app --port 8000
Note: the Postgres/Redis paths were import- and schema-validated but not run against live services in the authoring sandbox (no Docker there). They are straightforward to bring up locally with the commands above.
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
python -m examples.run_router # model selection under different constraints
python -m examples.run_ingestion # ingestion pipeline: discover/parse/chunk/index/report
python -m examples.run_rag # hybrid retrieval ranks
python -m examples.run_agent # agent + tools + tracing + guardrails
python -m examples.run_evals # RAG eval report + CI gate
pytest # test suite
A polished UI drives the real platform over HTTP - it is a thin client, not a reimplementation. Five modules: Model Router, Retrieval, Agent (with a live trace tree), Guardrails, and Evaluations.
# 1. backend - exposes genai_platform over HTTP on :8000
pip install -e ".[api]"
uvicorn api.main:app --reload --port 8000
# 2. frontend - Vite dev server on :5173, proxies /api to :8000
cd web
npm install
npm run dev # open http://localhost:5173
Single-process production mode: build the UI and let FastAPI serve it.
cd web && npm install && npm run build && cd ..
uvicorn api.main:app --port 8000 # UI + API both at http://localhost:8000
cp .env.example .env # then edit, or just export the vars
export GENAI_PROVIDER=anthropic
export ANTHROPIC_API_KEY=sk-ant-...
export GENAI_EMBED_PROVIDER=openai # Anthropic has no embedding endpoint
export OPENAI_API_KEY=sk-...
pip install -e ".[anthropic,openai]"
python -m examples.run_agent
genai_platform/
providers/ base protocol, mock, openai, anthropic, router (model registry)
retrieval/ chunking, hybrid store (BM25 + dense + RRF + rerank)
agent/ tool registry, sample tools, inspectable loop
guardrails/ input / context / output checks
observability/ spans, latency, token + cost accounting
evals/ harness, judges, golden dataset
api/ FastAPI backend exposing the platform over HTTP
web/ React + TypeScript console (Vite)
mcp_server/ MCP server over stdio
examples/ runnable demos for each subsystem
tests/ pytest suite
- The router encodes a decision, not a favorite: hard constraints filter the registry, soft weights rank the survivors, and both the ranking and the rejection reasons are returned so any choice can be explained.
- Retrieval is evaluated separately from generation. Naive fixed-size chunking and top-k-only retrieval are the usual causes of bad RAG; this code shows sentence-aware chunking and hybrid fusion as saner defaults.
- The agent loop is a loop on purpose. Frameworks hide control flow; a platform team usually needs to see and bound it.
- Guardrails return findings instead of throwing, so policy (block / redact / warn) lives with the orchestrator, not the checker.
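The first note above (hard constraints filter, soft weights rank, reasons are returned) can be sketched as follows. This is an assumption-laden illustration: `ModelSpec`, `route`, the field names, the pre-normalised 0..1 axes, and the registry entries are all made up for the example and do not describe the repo's registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    capability: float  # 0..1, higher is better
    cost: float        # 0..1, assumed pre-normalised; lower is better
    latency: float     # 0..1, assumed pre-normalised; lower is better
    regions: frozenset


def route(registry, required_region, min_capability, weights):
    survivors, rejections = [], {}
    for m in registry:
        # Hard constraints: any failure removes the model and records why.
        if required_region not in m.regions:
            rejections[m.name] = f"not available in {required_region}"
        elif m.capability < min_capability:
            rejections[m.name] = f"capability {m.capability} < {min_capability}"
        else:
            survivors.append(m)

    def score(m):
        # Soft weights: reward capability, penalise cost and latency.
        return (weights["capability"] * m.capability
                + weights["cost"] * (1 - m.cost)
                + weights["latency"] * (1 - m.latency))

    ranked = sorted(survivors, key=score, reverse=True)
    chosen = ranked[0].name if ranked else None
    return chosen, [m.name for m in ranked], rejections


registry = [  # fictional models, for illustration only
    ModelSpec("large-us", 0.95, 0.9, 0.7, frozenset({"us"})),
    ModelSpec("mid-eu", 0.80, 0.4, 0.4, frozenset({"eu", "us"})),
    ModelSpec("small-eu", 0.55, 0.1, 0.1, frozenset({"eu"})),
]
chosen, ranked, rejected = route(
    registry, "eu", 0.5, {"capability": 0.6, "cost": 0.2, "latency": 0.2}
)
```

With an EU residency requirement, `large-us` is rejected with a stated reason, and the two remaining models are ranked by the weighted score, so the caller can explain both the winner and the losers.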