A production-grade multi-agent AI system for e-commerce customer service, built with LangGraph, Claude, MCP (Model Context Protocol), and RAG.
The system routes customer conversations to three specialized sub-agents via a Supervisor, with full hybrid search, human-in-the-loop escalation, guardrails, and RAGAS evaluation.
```mermaid
graph TD
    A[Customer Chat UI\nStreamlit] -->|HTTP/WebSocket| B[FastAPI + WebSocket\nlocalhost:8000]
    B --> C{Supervisor Agent\nLangGraph}
    C -->|product_query| D[Shop Agent\nAgentic RAG loop]
    C -->|order_query| E[Order Agent\nguardrails + identity check]
    C -->|support| F[Support Agent\nFAQ + policies RAG]
    C -->|chitchat| G[Chitchat Node\nHaiku]
    C -->|escalate| H[Human-in-the-Loop\ninterrupt + resume]
    D -->|MCP| I[Product Catalog MCP\nlocalhost:8001]
    E -->|MCP| J[Order Management MCP\nlocalhost:8002]
    F -->|MCP| K[Knowledge Base MCP\nlocalhost:8003]
    I --> L[(Qdrant\nvector DB)]
    I --> M[(PostgreSQL\nproducts)]
    J --> M
    K --> L
    C --> N[evaluate_response_node\nquality gate]
    N -->|retry| C
    N -->|accept| O[final_response]
```
The MCP boundary separates agent reasoning from data access. Agents speak only MCP — they have no direct database connections. Adding a new retail client means deploying a new set of MCP servers, not touching the agent code.
- Supervisor receives the user message, classifies intent (Haiku — cheap), routes to a sub-agent.
- The sub-agent executes a ReAct loop against its MCP tools, returns a draft response.
- evaluate_response_node judges quality (Haiku). If insufficient and under 3 retries, loops back.
- Accepted response passes through output guardrails (SQL/PII/stack-trace filter) before returning.
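The route → draft → gate → retry control flow above can be condensed into a few lines of plain Python. This is an illustrative sketch only: the stub `classify`, `agents`, and `evaluate` callables stand in for the real LangGraph nodes and Haiku calls.

```python
MAX_RETRIES = 3

def run_turn(message, classify, agents, evaluate):
    """Classify intent, route to a sub-agent, and retry drafts that fail the gate."""
    intent = classify(message)              # Haiku: cheap intent classification
    draft = agents[intent](message)         # sub-agent ReAct loop -> draft response
    for _ in range(MAX_RETRIES):
        if evaluate(draft):                 # Haiku: quality gate
            break
        draft = agents[intent](message)     # loop back for another draft
    return draft

# Stubs standing in for the LangGraph subgraphs (hypothetical, for illustration).
calls = {"n": 0}
def order_agent(msg):
    calls["n"] += 1
    return f"draft {calls['n']}"

result = run_turn(
    "Where is my order?",
    classify=lambda m: "order_query",
    agents={"order_query": order_agent},
    evaluate=lambda draft: draft == "draft 2",  # accept only the second draft
)
# result == "draft 2" — one retry was needed
```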
| Layer | Technology |
|---|---|
| Orchestration | LangGraph (stateful graphs, supervisor pattern) |
| Chains & Parsing | LangChain LCEL |
| Agent protocol | MCP + langchain-mcp-adapters |
| Primary LLM | Claude Sonnet (reasoning + tool calls) |
| Cheap LLM | Claude Haiku (intent classification, eval, chitchat) |
| Embeddings | BAAI/bge-small-en-v1.5 via FastEmbed (384-dim) |
| Sparse retrieval | Qdrant/BM25 via FastEmbed |
| Reranking | cross-encoder/ms-marco-MiniLM-L-6-v2 |
| Vector DB | Qdrant (Docker) |
| Relational DB | PostgreSQL |
| API | FastAPI + WebSocket streaming |
| Tracing | LangSmith |
| Eval | RAGAS (faithfulness, answer_relevancy, context_recall) |
| Guardrails | Custom regex (prompt injection, PII, output safety) |
| Frontend | Streamlit |
| Package manager | uv |
- Docker + Docker Compose
- uv (`curl -LsSf https://astral.sh/uv/install.sh | sh`)
- An Anthropic API key
```bash
git clone https://github.com/<your-username>/shopagent.git
cd shopagent
cp .env.example .env
# Edit .env and fill in ANTHROPIC_API_KEY (and optionally LANGCHAIN_API_KEY)
docker compose up --build
```

This starts: Qdrant, PostgreSQL, Product Catalog MCP, Order Management MCP, Knowledge Base MCP, FastAPI API, and Streamlit frontend. All services have health checks — the API will not start until all MCP servers are ready.
```bash
uv sync --group dev
uv run python scripts/seed_catalog.py        # ~500 products → Qdrant + Postgres
uv run python scripts/seed_knowledge_base.py # FAQ + policies → Qdrant
uv run python scripts/seed_orders.py         # Sample orders → Postgres

curl http://localhost:8000/health
# {"status":"ok","mcp_connected":true}
```

Navigate to http://localhost:8501
```bash
# Unit tests — no Docker needed, all I/O mocked
uv run pytest tests/unit/ -v

# Integration tests — requires Docker Compose services running
uv run pytest tests/integration/ -v -m integration

# Guardrails tests specifically
uv run pytest tests/unit/test_guardrails.py -v

# RAGAS eval — requires all services + seeded data (~10-20 min, real LLM calls)
uv run pytest tests/eval/test_ragas.py::test_ragas_full_suite_and_save_baseline \
    -v -m integration -s
```

Products have both semantic data (descriptions, reviews) and structured data (price, stock, attributes). Three-level retrieval:
```
User query: "wireless headphones under 100 euros in stock"
              │
              ▼
┌─────────────────────────────┐
│ 1. Self-query decomposition │  semantic_query="wireless headphones"
│    (LLM → QueryDecomposition)│ max_price=100.0, in_stock=True
└─────────────────────────────┘
              │
              ▼
┌─────────────────────────────┐
│ 2. Hybrid search            │  Dense (cosine) + Sparse (BM25) via
│    Qdrant RRF fusion        │  Qdrant prefetch + Reciprocal Rank Fusion
└─────────────────────────────┘
              │ top-20 candidates
              ▼
┌─────────────────────────────┐
│ 3. Cross-encoder reranking  │  ms-marco-MiniLM-L-6-v2 scores each
│    + PostgreSQL enrichment  │  (query, product) pair; top-5 returned
└─────────────────────────────┘
              │ top-5 enriched products
              ▼
         LLM context
```
Dual storage: semantic data in Qdrant (for search), price/stock/rating in PostgreSQL (authoritative, updated frequently without re-indexing).
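The Reciprocal Rank Fusion in step 2 happens inside Qdrant, but the algorithm itself is easy to sketch in plain Python (`k=60` is the conventional RRF constant; document IDs are made up):

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked ID lists: score(id) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense = ["p3", "p1", "p7"]    # dense (cosine) top hits
sparse = ["p1", "p9", "p3"]   # sparse (BM25) top hits
fused = rrf_fuse([dense, sparse])
# → ['p1', 'p3', 'p9', 'p7'] — p1 wins by appearing high in both lists
```

Items ranked well by both retrievers float to the top without any score normalization, which is why RRF is a robust default for hybrid search.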
Evaluation dataset: 55 questions across product search, support/FAQ, and order queries.
| Metric | Score | Description |
|---|---|---|
| `faithfulness` | — | Are claims in the answer grounded in retrieved context? |
| `answer_relevancy` | — | Does the answer address what was asked? |
| `context_recall` | — | Does the context contain the ground truth? |
Run `uv run pytest tests/eval/test_ragas.py::test_ragas_full_suite_and_save_baseline -v -m integration -s` to generate baseline scores and populate this table. Results are saved to `tests/eval/results/baseline.json`.
Three independent layers protect the system:
| Layer | What it catches | Action |
|---|---|---|
| Prompt injection (input) | "ignore previous instructions", DAN mode, `<system>` tags, role-hijack phrases | Raises `GuardrailViolation` → canned refusal, no LLM call |
| PII redaction (output) | Emails, credit cards, IBANs, Dutch BSNs, phone numbers | Surgical token replacement before response is returned |
| Output safety (output) | Raw SQL, Python tracebacks, internal file paths, exposed API keys | Entire response replaced with safe fallback message |
All guardrails are pure synchronous functions with 39 unit tests (tests/unit/test_guardrails.py).
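A minimal sketch of the output-side PII pass. The two patterns below are illustrative only and far narrower than production patterns (no IBANs, BSNs, or credit cards):

```python
import re

# Illustrative patterns — real PII detection needs many more cases and tighter regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d \-]{8,}\d"),
}

def redact_pii(text: str) -> str:
    """Surgically replace each PII match with a placeholder, keeping the rest intact."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

redact_pii("Mail jan@example.com or call +31 6 1234 5678")
# → 'Mail [EMAIL] or call [PHONE]'
```

Because the functions are pure and synchronous, each pattern can be unit-tested with plain string assertions, which is what makes a large guardrail test suite cheap to maintain.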
| Server | Port | Tools exposed |
|---|---|---|
| Product Catalog | 8001 | search_products, get_product_details, compare_products, get_recommendations |
| Order Management | 8002 | check_order_status, get_tracking_info, get_order_history, initiate_return |
| Knowledge Base | 8003 | search_knowledge_base, get_policy_details |
- Multi-tenant: new client = new MCP server set, zero changes to agent logic
- Testable: agents and MCP servers tested independently
- Replaceable: swap PostgreSQL for MongoDB by changing only the MCP server
ShopAgent is designed so that onboarding a new retail client requires no changes to agent code. The steps below take approximately 1-2 hours for a new client.
Copy the three MCP server templates:

```bash
cp -r src/mcp_servers/product_catalog src/mcp_servers/client_b_products
cp -r src/mcp_servers/order_management src/mcp_servers/client_b_orders
cp -r src/mcp_servers/knowledge_base src/mcp_servers/client_b_kb
```

Update each server's database connection strings to point to the client's databases. The tool names and schemas stay the same — only the data source changes.
```yaml
client-b-product-mcp:
  build:
    context: .
    dockerfile: Dockerfile
  command: ["uv", "run", "python", "-m", "src.mcp_servers.client_b_products.server", "--transport", "streamable-http"]
  ports:
    - "8011:8011"
  environment:
    - QDRANT_URL=http://qdrant:6333
    - POSTGRES_DSN=postgresql://...  # client B's database
```

Add similar services for orders and knowledge base.
In `src/api/main.py`, the `MultiServerMCPClient` config is the only thing that changes:

```python
mcp_client = MultiServerMCPClient({
    "product-catalog": {"url": "http://client-b-product-mcp:8011/mcp", "transport": "streamable_http"},
    "order-management": {"url": "http://client-b-order-mcp:8012/mcp", "transport": "streamable_http"},
    "knowledge-base": {"url": "http://client-b-kb-mcp:8013/mcp", "transport": "streamable_http"},
})
```

For a multi-tenant API, read the MCP URLs from request headers or JWT claims so a single API instance can serve multiple tenants.
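One way to make that config tenant-aware is a small resolver keyed by tenant ID. This is a hedged sketch, not the project's code: the registry dict, tenant names, and helper are all hypothetical, and in practice the registry would come from a database or JWT claims.

```python
# Hypothetical tenant registry — in production this would be backed by a
# database or derived from JWT claims, not a hard-coded dict.
TENANT_MCP_URLS = {
    "client-a": {"product-catalog": "http://product-mcp:8001/mcp"},
    "client-b": {"product-catalog": "http://client-b-product-mcp:8011/mcp"},
}

def mcp_config_for(tenant_id: str) -> dict:
    """Resolve the MultiServerMCPClient config for one tenant's request."""
    urls = TENANT_MCP_URLS.get(tenant_id)
    if urls is None:
        raise KeyError(f"unknown tenant: {tenant_id}")
    return {
        name: {"url": url, "transport": "streamable_http"}
        for name, url in urls.items()
    }

cfg = mcp_config_for("client-b")
```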
```bash
uv run python scripts/seed_catalog.py         # with client B's products
uv run python scripts/seed_knowledge_base.py  # with client B's FAQ/policies
uv run python scripts/seed_orders.py          # with client B's order history
```

The agents never change. The Supervisor graph, intent classifier, ReAct loops, and guardrails are all tenant-agnostic.
```
shopagent/
├── src/
│   ├── agents/
│   │   ├── supervisor.py        # Supervisor graph + intent routing + guardrail wiring
│   │   ├── shop_agent.py        # Product search subgraph (agentic RAG loop)
│   │   ├── order_agent.py       # Order management subgraph
│   │   ├── support_agent.py     # FAQ/knowledge base subgraph
│   │   └── state.py             # ShopAgentState TypedDict + Pydantic models
│   ├── guardrails.py            # Prompt injection, PII, output safety
│   ├── mcp_servers/
│   │   ├── product_catalog/     # MCP: search, details, compare, recommend
│   │   ├── order_management/    # MCP: status, tracking, returns
│   │   └── knowledge_base/      # MCP: FAQ search, policy details
│   ├── rag/
│   │   ├── ingestion.py         # Chunking + embedding + Qdrant upsert
│   │   ├── retrieval.py         # Hybrid search + reranking + PG enrichment
│   │   └── self_query.py        # NL → semantic query + structured filters
│   └── api/
│       ├── main.py              # FastAPI lifespan, MCP client, agent assembly
│       ├── routes.py            # /health, /chat, /chat/{id}/resume, WS
│       └── schemas.py           # Pydantic request/response models
├── tests/
│   ├── unit/                    # 254 tests, no external services
│   │   └── test_guardrails.py   # 39 guardrail tests (passing/failing cases)
│   ├── integration/             # Require Docker Compose services
│   └── eval/
│       ├── datasets/eval_questions.json  # 55 Q&A pairs
│       ├── test_ragas.py        # faithfulness + relevancy + recall
│       └── results/baseline.json # generated after first eval run
├── scripts/
│   ├── seed_catalog.py          # Generate + ingest 500 products (with retry)
│   ├── seed_knowledge_base.py   # Ingest FAQ + 4 policy docs
│   └── seed_orders.py           # Seed sample orders into PostgreSQL
├── data/
│   ├── faq.md
│   └── policies/                # return, shipping, privacy, terms
├── frontend/app.py              # Streamlit chat UI
├── docker-compose.yml           # All 7 services with health checks
└── pyproject.toml               # uv project (Python 3.12+)
```
Copy `.env.example` and fill in the required values:

```bash
# Required
ANTHROPIC_API_KEY=sk-ant-...

# LangSmith tracing (optional but recommended)
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=ls__...
LANGCHAIN_PROJECT=shopagent-dev

# Set automatically by docker-compose.yml (override only for non-Docker use)
QDRANT_URL=http://localhost:6333
POSTGRES_DSN=postgresql://shopagent:shopagent@localhost:5432/shopagent
PRODUCT_MCP_URL=http://localhost:8001/mcp
ORDER_MCP_URL=http://localhost:8002/mcp
KNOWLEDGE_MCP_URL=http://localhost:8003/mcp
```

| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Liveness probe + MCP connection status |
| `POST` | `/chat` | Synchronous chat — returns final response |
| `WS` | `/chat/{session_id}` | Streaming WebSocket — yields per-node events |
| `POST` | `/chat/{session_id}/resume` | Resume after human escalation interrupt |
| `GET` | `/graph/{agent_id}` | Mermaid diagram for supervisor/shop/order/support |
Interactive docs: http://localhost:8000/docs
Every conversation creates one LangSmith trace with per-node spans. Model costs are tagged at the client level:
- `tier:primary` / `model:claude-sonnet-4-6` — sub-agent reasoning calls
- `tier:cheap` / `model:claude-haiku-4-5-...` — intent classification, quality eval, chitchat
Filter by tag in LangSmith to see the Haiku/Sonnet cost split across a session.
| Step | What was built |
|---|---|
| 01 | RAG pipeline: hybrid search, self-query decomposition, cross-encoder reranking |
| 02 | Single agent with LangGraph ReAct loop and product search tools |
| 03 | MCP servers, multi-tenant data layer, langchain-mcp-adapters integration |
| 04 | Multi-agent Supervisor, Order/Support sub-agents, HiTL escalation, FastAPI + Streamlit |
| 05 | Guardrails, RAGAS eval dataset, model fallback chain, LangSmith cost tags, README |