A research prototype implementing a retrieval-augmented generation (RAG) workflow for classifying Singapore trade-related documents. The project contains a lightweight RAG tool, a simple "naive" classification agent, prompt & model utilities, and focused evaluation utilities to measure classifier performance.
- Modular Architecture: A production-ready Python codebase with clear separation of concerns (Agents, Tools, Parsers, Evaluation).
- Containerized Workflow: A fully working Dockerfile and docker-compose setup for reproducible testing.
- Evaluation-First: Integrated promptfoo configuration for rigorous testing of prompts against trade scenarios.
- Auditor’s Log: A trace of the agent’s Chain of Thought (CoT) for the test cases, showing how it handled ambiguity.
- src/sg_trade_ragbot/agents/naive_agent.py — A simple classification agent used as a baseline.
- src/sg_trade_ragbot/tools/RAGTool.py — RAG tooling: document ingestion, retrieval, and context assembly.
- src/sg_trade_ragbot/utils/prompts/prompts.py — Prompt templates used by agents and tools.
- src/sg_trade_ragbot/utils/models/models.py — Model wrappers and helpers.
- src/sg_trade_ragbot/utils/evals/ — Evaluation configuration (e.g., bare_config.yaml) and utilities.
- tests/ — Unit tests for agents, tools, and utils.
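The retrieval and context-assembly steps attributed to RAGTool.py above can be pictured with a minimal, dependency-free sketch. It is not the project's implementation: the function names (`tokenize`, `cosine`, `retrieve`, `assemble_context`) are hypothetical, and bag-of-words cosine similarity stands in for whatever embedding model the real tool uses.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (hypothetical helper)."""
    q = Counter(tokenize(query))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q, Counter(tokenize(c))),
        reverse=True,
    )
    return ranked[:k]


def assemble_context(chunks: list[str]) -> str:
    """Join retrieved chunks into a numbered context block for a prompt."""
    return "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
```

Calling `assemble_context(retrieve(query, chunks))` would produce the text that gets placed in front of the classification prompt. A real pipeline would likely replace `cosine` over token counts with similarity over dense embeddings.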
- Reproducible Pipeline: Provide a containerized RAG pipeline for classifying trade text.
- Baseline Benchmarks: Offer simple agents to evaluate different retrieval and prompting strategies.
- Iteration: Make it easy to swap prompts, plug in different LLMs (OpenAI, Groq, etc.), and measure impact.
- Docker Desktop (Recommended)
- uv (Optional, for local dependency management)
- Python 3.13+ (If running locally without Docker)
This project is containerized to ensure consistent evaluations across different machines. It uses uv for dependency management and mounts configuration files so you can edit test cases without rebuilding the container.
The container requires API keys to function.
- Copy the example environment file:
  cp .env.example .env
- Open .env and add your keys (e.g., OPENAI_API_KEY, GROQ_API_KEY).
  Note: Do not add file paths to .env. The container handles paths automatically.
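Before starting a run, it can help to confirm that the keys are actually set. The check below is a small sketch, not part of the repository: `missing_keys` and `EXAMPLE_KEYS` are hypothetical names, and the key list simply reuses the two examples above. Trim it to the providers you use.

```python
import os
from collections.abc import Mapping

# Assumed key names, copied from the examples above; adjust as needed.
EXAMPLE_KEYS = ("OPENAI_API_KEY", "GROQ_API_KEY")


def missing_keys(env: Mapping[str, str], required=EXAMPLE_KEYS) -> list[str]:
    """Return the required keys that are absent or blank in env."""
    return [key for key in required if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All expected API keys are set.")
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the check easy to unit test.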
To run the promptfoo evaluation against the default configuration:
docker compose up --build
This will:
- Build the image (installing all dependencies from uv.lock).
- Run the evaluation script.
- Print the results to your terminal.
You do not need to rebuild the container to modify prompts or test cases.
- Open src/sg_trade_ragbot/utils/evals/eval_configs/bare_config.yaml in your local editor.
- Modify your prompts, test cases, or variables.
- Save the file.
- Run docker compose up again. The container picks up your changes immediately through the mounted Docker volumes.
If you add a new library (e.g., spacy), you must rebuild the image so that the dependency is installed inside the container:
# 1. Update lockfile locally
uv add spacy
# 2. Rebuild container
docker compose up --build
The system generates a trace of the agent’s Chain of Thought (CoT) to show how it handles ambiguity in trade documents.
- Current Status: These traces are visible in the promptfoo debug logs and the container output.
- Todo: Implement a structured export or cleaner visualization for the Auditor's Log in the final report.
- Current Status: The Docker configuration relies on external APIs (OpenAI, Groq). In this release, Ollama instances running on the host machine cannot yet be reached from the container network.
- Todo: Add a dedicated Ollama service to docker-compose.yml for fully offline, local model evaluation.
- Prompts: Tweak templates in utils/prompts to change agent behavior.
- Models: Wrappers in utils/models abstract the LLM/embedding implementations. You can swap in your preferred LLM client by implementing the required interface.
- Testing: Tests live in tests/ and use pytest. Run them frequently during development.
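The "required interface" mentioned in the Models note is not spelled out here, so the following is only a sketch of what such a contract could look like. `ChatModel`, `complete`, `KeywordStubModel`, `classify`, and the labels "controlled" and "general" are all hypothetical and do not come from utils/models.

```python
from typing import Protocol


class ChatModel(Protocol):
    """Hypothetical interface: any object with complete(prompt) -> str."""

    def complete(self, prompt: str) -> str: ...


class KeywordStubModel:
    """Offline stand-in for a real LLM client, handy in pytest runs."""

    def complete(self, prompt: str) -> str:
        # Illustrative rule only; a real client would call an LLM API here.
        return "controlled" if "permit" in prompt.lower() else "general"


def classify(text: str, model: ChatModel) -> str:
    """Build a prompt, call the model, and normalise the returned label."""
    prompt = f"Classify the following trade text with a single label:\n{text}"
    return model.complete(prompt).strip().lower()
```

Because `ChatModel` is a structural `Protocol`, a wrapper around OpenAI, Groq, or a test stub satisfies it without inheriting from a shared base class, which is one way to keep model swaps cheap.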
- [ ] Fix retrieval JSON parsing
- [ ] Fix chunking to be smaller and more efficient