genai-platform

A compact, runnable reference architecture for a central GenAI platform. It is intentionally small and readable rather than a framework, and demonstrates the core capabilities a platform team has to get right:

Provider abstraction & routing - one neutral interface over OpenAI, Anthropic, and open-weights; a transparent policy that picks a model from a registry given capability/cost/latency/residency/eval constraints.
Hybrid RAG - sentence-aware chunking, BM25 + dense retrieval fused with Reciprocal Rank Fusion, optional rerank.
Agentic loop - a thin, fully traced plan/act loop with bounded steps, tool use, a reflection pass, and guardrails on the edges.
Guardrails - checks on input (prompt injection), on retrieved context, and on output (PII redaction, length caps).
Observability - nested spans with latency, token usage, and USD cost rolled up per run and per tenant; exportable as JSON.
Evaluation - a golden-set harness computing recall@k / precision@k / MRR for retrieval and key-term coverage for generation, with a CI pass/fail gate.
MCP server - exposes platform tools over the Model Context Protocol.
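
The fusion step named in the Hybrid RAG item can be illustrated with a minimal sketch of Reciprocal Rank Fusion. The function name, the document ids, and the constant `k=60` (a commonly used default) are assumptions for illustration, not the repo's actual code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; doc id breaks ties deterministically.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from a lexical (BM25) and a dense retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because RRF uses only ranks, the BM25 and dense scores never need to be put on a common scale; a document that ranks well in both lists rises to the top.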

Everything runs offline with zero API keys via a deterministic mock provider, so you can clone, run, and read it immediately. Drop in real keys to swap in GPT or Claude with no code changes.
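
As a rough illustration of how a neutral interface makes the offline mock interchangeable with real backends, here is a minimal sketch. `Provider`, `Completion`, `MockProvider`, and `answer` are hypothetical names, not the repo's actual classes.

```python
import hashlib
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Completion:
    text: str
    input_tokens: int
    output_tokens: int


class Provider(Protocol):
    """Neutral interface: callers depend on this, never on a vendor SDK."""
    name: str

    def complete(self, prompt: str) -> Completion: ...


class MockProvider:
    """Deterministic stand-in: the same prompt always yields the same text."""
    name = "mock"

    def complete(self, prompt: str) -> Completion:
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:8]
        text = f"[mock:{digest}] {prompt[:40]}"
        # Whitespace word counts stand in for real tokenizer counts.
        return Completion(text, len(prompt.split()), len(text.split()))


def answer(provider: Provider, prompt: str) -> str:
    # Works with any object that satisfies the Provider protocol.
    return provider.complete(prompt).text


first = answer(MockProvider(), "What is hybrid retrieval?")
second = answer(MockProvider(), "What is hybrid retrieval?")
print(first)
```

A real backend would implement the same `complete` method around its SDK call, so calling code would not change when the provider does.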

Architecture

The platform has two lanes: an ingestion lane that turns source documents into an indexed knowledge base, and a serving lane that answers queries with routing, retrieval, agents, guardrails and evaluation. Cross-cutting concerns (observability, governance) wrap both. Solid nodes are implemented in this repo; dashed nodes marked (target) are where production infrastructure plugs in.

flowchart TB
    subgraph INGEST["Ingestion lane (genai_platform/ingestion.py)"]
        direction LR
        SRC["Source docs<br/>(arXiv / S3 / CMS) (target)"]:::target
        DISC[discover] --> PARSE[parse] --> CHK[chunk] --> IDX[index] --> RPT[report] --> CLN[cleanup]
        ORCH["Airflow DAG orchestration (target)"]:::target
        SRC --> DISC
        ORCH -.schedules.-> DISC
    end

    subgraph STORE["Knowledge base"]
        HS["HybridStore<br/>BM25 + dense + RRF (in-memory)"]
        PG["Postgres + pgvector<br/>FTS + vector + RRF"]
        RD["Redis cache"]
        OS["OpenSearch lexical-at-scale (target)"]:::target
    end

    subgraph SERVE["Serving lane"]
        APIL["FastAPI (api/main.py)<br/>/rag/answer · /agent/stream · /route · /evals"]
        RTR["Model router<br/>registry + policy"]
        RAG["Grounded answer<br/>cite + abstain + faithfulness"]
        AGT["Agent loop<br/>tools · reflection · SSE stream"]
        GR["Guardrails<br/>input · context · output"]
    end

    subgraph PROV["Model backends"]
        MK[Mock offline]
        OAI[OpenAI]
        ANT[Anthropic]
        OW["open-weights / Ollama / Bedrock (target)"]:::target
    end

    subgraph XCUT["Cross-cutting"]
        OBS["Observability<br/>spans · cost · latency"]
        OTEL["Langfuse / OTel export (target)"]:::target
        EVAL["Evaluations<br/>recall · faithfulness · CI gate"]
    end

    IDX --> HS
    HS -.scale out.-> PG
    APIL --> RAG --> HS
    APIL --> AGT --> HS
    APIL --> RTR --> PROV
    RAG --> RTR
    AGT --> GR
    RAG --> GR
    APIL -.cache.-> RD
    AGT --> OBS
    RAG --> OBS
    OBS -.export.-> OTEL
    EVAL --> HS

    classDef target stroke-dasharray: 4 3,opacity:0.75;

The agent loop, drawn out:

flowchart LR
    T[task] --> IG{input<br/>guardrail}
    IG -->|blocked| X[stop]
    IG -->|ok| L[LLM step]
    L -->|tool calls| TC[execute tools] --> L
    L -->|final| S[stream answer tokens]
    S --> OG[output guardrail] --> ANS[answer + trace]
    L -.span.-> OBS[(tracer)]
    TC -.span.-> OBS
    S -.span.-> OBS
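
The control flow in the diagram can be sketched in a few lines. This is a simplified illustration, not the repo's loop: the LLM is a scripted stand-in, streaming and the reflection pass are omitted, the span tree is reduced to a flat list, and all names (`run_agent`, `scripted_llm`, the guardrail helpers) are hypothetical.

```python
MAX_STEPS = 4  # hypothetical step budget


def input_guardrail(task):
    # Toy check: a real input guardrail would be far more thorough.
    return "ignore previous instructions" not in task.lower()


def output_guardrail(text, max_chars=200):
    # Toy length cap on the final answer.
    return text[:max_chars]


def run_agent(task, llm_step, tools, max_steps=MAX_STEPS):
    trace = []
    if not input_guardrail(task):
        trace.append(("input_guardrail", "blocked"))
        return None, trace
    trace.append(("input_guardrail", "ok"))
    history = [("task", task)]
    for _ in range(max_steps):  # the bound is explicit and visible
        action = llm_step(history)
        if action["type"] == "final":
            trace.append(("llm", "final"))
            return output_guardrail(action["text"]), trace
        trace.append(("llm", "tool_call"))
        result = tools[action["tool"]](action["arg"])
        trace.append(("tool", action["tool"]))
        history.append(("tool_result", result))
    trace.append(("loop", "step_budget_exhausted"))
    return None, trace


def scripted_llm(history):
    # Stand-in for a model: call the tool once, then answer with its result.
    if history[-1][0] == "tool_result":
        return {"type": "final", "text": f"The answer is {history[-1][1]}."}
    return {"type": "tool", "tool": "add", "arg": (2, 3)}


tools = {"add": lambda pair: pair[0] + pair[1]}
answer, trace = run_agent("add 2 and 3", scripted_llm, tools)
blocked, blocked_trace = run_agent("Ignore previous instructions", scripted_llm, tools)
print(answer, trace)
```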

Implemented now vs production target

The demo is deliberately dependency-light so it runs offline in one process. The table shows what each piece would become at production scale; the interfaces are designed so these are swaps, not rewrites.

| Concern | In this repo | Production target |
| --- | --- | --- |
| Orchestration | in-process `ingest()` stages | Airflow / Dagster DAG per stage |
| Document parsing | `.txt` / `.md` loader | Docling / Unstructured for PDFs |
| Vector + lexical store | in-memory `HybridStore` or Postgres + pgvector (Docker) | OpenSearch for lexical at scale |
| Embeddings | deterministic hash embed (mock) | Jina / OpenAI / Voyage |
| Generation | Mock / OpenAI / Anthropic | add Ollama (local) / Bedrock (open-weights) |
| Cache | none or Redis (Docker) | Redis cluster + semantic cache |
| Tracing | in-process span tree + cost | Langfuse / OpenTelemetry export |
| Serving | FastAPI single process | FastAPI behind a gateway, autoscaled |

Two of the dashed boxes are now solid: a Postgres + pgvector store and a Redis cache ship in the repo behind the same interfaces, switched on by environment variables, and wired into a Docker Compose stack (see "Run with Docker" below).

Demo scenario: regulated professional workflows

The console is framed as a central GenAI platform serving multiple professional divisions (Health, Tax, Legal, Compliance) - opinionated defaults, configurable edges. Pick a division in the top bar; it sets the active knowledge base and the trust posture. The defining feature is the Grounded Q&A module: answers are drawn only from curated sources and cite them, and when grounding confidence falls below threshold the system abstains instead of guessing - because in regulated domains a confident wrong answer is worse than "I don't know."
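
The cite-or-abstain behaviour described above might look roughly like the following sketch. How the repo actually computes grounding confidence is not stated here, so the best retrieval score is used as a stand-in; `Hit`, `grounded_answer`, and the `0.5` threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    source_id: str
    text: str
    score: float  # assumed to be normalized to [0, 1]


ABSTAIN_THRESHOLD = 0.5  # hypothetical value


def grounded_answer(hits, threshold=ABSTAIN_THRESHOLD):
    # Stand-in confidence: the best retrieval score among the hits.
    confidence = max((h.score for h in hits), default=0.0)
    if confidence < threshold:
        # Abstain rather than guess when grounding is weak.
        return {"abstained": True, "answer": None, "citations": [], "confidence": confidence}
    supporting = [h for h in hits if h.score >= threshold]
    answer = " ".join(f"{h.text} [{h.source_id}]" for h in supporting)
    return {
        "abstained": False,
        "answer": answer,
        "citations": [h.source_id for h in supporting],
        "confidence": confidence,
    }


# Synthetic hits, in the same spirit as the seeded demo content.
in_scope = grounded_answer([Hit("kb-1", "Policy X applies to case Y.", 0.82), Hit("kb-7", "Unrelated note.", 0.20)])
out_of_scope = grounded_answer([Hit("kb-3", "Unrelated note.", 0.12)])
print(in_scope["answer"], out_of_scope["abstained"])
```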

All seeded content is synthetic and illustrative - not medical, legal, tax, or financial advice, and not affiliated with any company or product.

A ~3-minute walkthrough:

Overview - frame it: one platform, many divisions, six capabilities.
Grounded Q&A (Health) - ask an in-scope question; the answer cites its sources with a grounding-confidence meter. Then ask an out-of-scope question and watch it abstain. This is the trust money-shot.
Model Compare / Router - show an EU-residency constraint eliminating US-only models; switch to on-prem for sensitive data. Model pluralism, made explainable axis by axis.
Agent - run a task; the answer streams while the trace tree fills in with per-division cost and latency.
Evaluations - run the trust gate (retrieval recall + coverage) that decides whether a workflow ships.
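
The residency step of the walkthrough can be illustrated with a minimal routing sketch: hard constraints filter the registry, soft weights rank the survivors, and rejection reasons are returned alongside the choice. The registry entries, field names, and weights below are invented for illustration and do not reflect the repo's registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    regions: frozenset
    quality: float      # hypothetical 0..1 capability score
    cost_per_1k: float  # hypothetical USD per 1k tokens


REGISTRY = [
    Model("us-large", frozenset({"us"}), 0.9, 0.030),
    Model("eu-medium", frozenset({"eu", "us"}), 0.7, 0.010),
    Model("onprem-small", frozenset({"onprem"}), 0.5, 0.002),
]


def route(registry, required_region, quality_weight=1.0, cost_weight=5.0):
    ranked, rejected = [], {}
    for m in registry:
        # Hard constraint: residency either holds or the model is out.
        if required_region not in m.regions:
            rejected[m.name] = f"not available in region '{required_region}'"
            continue
        # Soft ranking: reward quality, penalize cost.
        score = quality_weight * m.quality - cost_weight * m.cost_per_1k
        ranked.append((score, m.name))
    ranked.sort(key=lambda item: (-item[0], item[1]))
    chosen = ranked[0][1] if ranked else None
    return chosen, [name for _, name in ranked], rejected


chosen, ranking, rejected = route(REGISTRY, "eu")
print(chosen, ranking, rejected)
```

Returning `rejected` with the result is what makes the choice explainable: every model that did not survive carries the reason it was eliminated.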

Modules in the console

Overview - the scenario framing and the six platform pillars.
Grounded Q&A - cited answers from curated sources; abstains below a grounding threshold.
Model Router - set hard constraints and ranking weights, see the chosen model, the ranked survivors, and the rejection reasons.
Model Compare - put two models head to head with a per-axis (capability / cost / latency / eval) score decomposition, plus a stacked-bar view of the whole registry.
Retrieval - query the indexed corpus and inspect each hit's lexical (BM25) and dense rank before Reciprocal Rank Fusion.
Agent - run a task and watch the answer stream token by token over SSE while the span tree fills in live, with latency / token / cost rollups.
Guardrails - scan text as input, retrieved context, and output; see findings and the PII-redacted form.
Evaluations - run the golden RAG set for recall@k / precision@k / MRR / coverage and a CI pass-fail gate.
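
For reference, the retrieval metrics named in the Evaluations module can be computed as in the sketch below. The golden cases, function names, and the `min_recall` gate threshold are hypothetical, and the repo's harness may define the gate differently.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k that is relevant."""
    return len(set(retrieved[:k]) & relevant) / k if k else 0.0


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(cases, k, min_recall):
    n = len(cases)
    report = {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in cases) / n,
        "precision@k": sum(precision_at_k(r, rel, k) for r, rel in cases) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in cases) / n,
    }
    # CI gate: fail the run when mean recall drops below the threshold.
    report["passed"] = report["recall@k"] >= min_recall
    return report


# Each case: (retrieved doc ids in rank order, set of relevant doc ids).
golden = [
    (["a", "b", "c"], {"a", "c"}),
    (["x", "y", "z"], {"y"}),
]
report = evaluate(golden, k=2, min_recall=0.7)
print(report)
```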

Run it (one command)

python run.py

On first run this installs the few dependencies it needs, then serves the console at http://localhost:8000 and opens your browser. No virtualenv or extra steps are required. On Windows you can also double-click run.bat; on macOS/Linux, run ./run.sh. It runs fully offline on the mock provider; set GENAI_PROVIDER plus an API key to use real models.

Run with Docker (Postgres + pgvector + Redis)

This brings up the production-shaped stack: the API backed by a real pgvector store and a Redis cache. Requires Docker.

docker compose up --build -d        # app + postgres(pgvector) + redis
curl -s localhost:8000/api/health
curl -s -X POST localhost:8000/api/ingest -H 'content-type: application/json' -d '{}'
# then open http://localhost:8000 and use Grounded Q&A (now served from pgvector)

The api service runs with GENAI_STORE=pgvector and GENAI_CACHE=redis, so /api/rag/answer retrieves from Postgres (FTS + vector, fused with RRF) and caches grounded answers in Redis. Add GENAI_PROVIDER and an API key to the api service environment to use real models. OpenSearch is available under an optional profile: docker compose --profile search up -d.
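
Selecting a backend from environment variables can be sketched as a small factory. The class names and the `"memory"` default are assumptions for illustration; only the variable names `GENAI_STORE` and `GENAI_PG_DSN` come from this README.

```python
import os


class InMemoryStore:
    """Placeholder for the in-memory hybrid store."""
    kind = "memory"


class PgVectorStore:
    """Placeholder for a Postgres + pgvector store; no connection is opened here."""
    kind = "pgvector"

    def __init__(self, dsn):
        self.dsn = dsn


def make_store(env=None):
    # Accept a mapping so the factory can be exercised without touching os.environ.
    env = os.environ if env is None else env
    kind = env.get("GENAI_STORE", "memory")  # assumed default
    if kind == "memory":
        return InMemoryStore()
    if kind == "pgvector":
        # Fail fast with a KeyError if the DSN is missing.
        return PgVectorStore(env["GENAI_PG_DSN"])
    raise ValueError(f"unknown GENAI_STORE value: {kind!r}")


default_store = make_store({})
pg_store = make_store({"GENAI_STORE": "pgvector", "GENAI_PG_DSN": "host=localhost dbname=genai"})
print(default_store.kind, pg_store.kind)
```

Keeping the choice in one factory means the rest of the code sees only the store interface, which is what makes the backend a swap rather than a rewrite.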

Switch backends without Docker too, against your own services:

export GENAI_STORE=pgvector GENAI_PG_DSN="host=localhost dbname=genai user=genai password=genai"
export GENAI_CACHE=redis REDIS_URL=redis://localhost:6379/0
pip install -e ".[api,infra]"
python -m examples.ingest_to_store      # populate pgvector
python -m uvicorn api.main:app --port 8000

Note: the Postgres/Redis paths were import- and schema-validated but not run against live services in the authoring sandbox (no Docker there). They are straightforward to bring up locally with the commands above.

Quick start

python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"

python -m examples.run_router    # model selection under different constraints
python -m examples.run_ingestion # ingestion pipeline: discover/parse/chunk/index/report
python -m examples.run_rag       # hybrid retrieval ranks
python -m examples.run_agent     # agent + tools + tracing + guardrails
python -m examples.run_evals     # RAG eval report + CI gate

pytest                           # test suite

Web console (React + FastAPI)

A React UI drives the real platform over HTTP; it is a thin client, not a reimplementation. Its modules are the ones listed under "Modules in the console", including Model Router, Retrieval, Agent (with a live trace tree), Guardrails, and Evaluations.

# 1. backend - exposes genai_platform over HTTP on :8000
pip install -e ".[api]"
uvicorn api.main:app --reload --port 8000

# 2. frontend - Vite dev server on :5173, proxies /api to :8000
cd web
npm install
npm run dev        # open http://localhost:5173

Single-process production mode: build the UI and let FastAPI serve it.

cd web && npm install && npm run build && cd ..
uvicorn api.main:app --port 8000      # UI + API both at http://localhost:8000

Using real models

cp .env.example .env             # then edit, or just export the vars
export GENAI_PROVIDER=anthropic
export ANTHROPIC_API_KEY=sk-ant-...
export GENAI_EMBED_PROVIDER=openai   # Anthropic has no embedding endpoint
export OPENAI_API_KEY=sk-...
pip install -e ".[anthropic,openai]"
python -m examples.run_agent

Layout

genai_platform/
  providers/     base protocol, mock, openai, anthropic, router (model registry)
  retrieval/     chunking, hybrid store (BM25 + dense + RRF + rerank)
  agent/         tool registry, sample tools, inspectable loop
  guardrails/    input / context / output checks
  observability/ spans, latency, token + cost accounting
  evals/         harness, judges, golden dataset
api/             FastAPI backend exposing the platform over HTTP
web/             React + TypeScript console (Vite)
mcp_server/      MCP server over stdio
examples/        runnable demos for each subsystem
tests/           pytest suite

Design notes

The router encodes a decision, not a favorite: hard constraints filter the registry, soft weights rank the survivors, and both the ranking and the rejection reasons are returned so any choice can be explained.
Retrieval is evaluated separately from generation. Naive fixed-size chunking and top-k-only retrieval are common causes of poor RAG quality; this code shows sentence-aware chunking and hybrid fusion as saner defaults.
The agent loop is a loop on purpose. Frameworks hide control flow; a platform team usually needs to see and bound it.
Guardrails return findings instead of throwing, so policy (block / redact / warn) lives with the orchestrator, not the checker.
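
The last note, checkers that report while the orchestrator decides, can be sketched as follows. `Finding`, `check_output`, `apply_policy`, and the simple email pattern are hypothetical and far cruder than a real PII detector.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    kind: str
    detail: str


# Deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def check_output(text):
    """Checker: reports what it sees and never raises or mutates the text."""
    return [Finding("pii_email", m.group()) for m in EMAIL.finditer(text)]


def apply_policy(text, findings, policy):
    """Orchestrator: maps each finding kind to block / redact / warn."""
    for finding in findings:
        action = policy.get(finding.kind, "warn")
        if action == "block":
            return None
        if action == "redact":
            text = text.replace(finding.detail, "[REDACTED]")
        # "warn" leaves the text untouched; the finding is still available for logging.
    return text


text = "Contact jane@example.com for details"
findings = check_output(text)
redacted = apply_policy(text, findings, {"pii_email": "redact"})
blocked = apply_policy(text, findings, {"pii_email": "block"})
print(redacted, blocked)
```

The same findings produce different outcomes under different policies, so a division with a stricter trust posture can block where another only redacts, without changing the checker.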
