GA4GH-RegBot: Compliance Assistant Status: MVP available — ingest, hybrid retrieval, local-first LLM (Ollama / Llama 3 by default) or optional OpenAI, programmatic citation checks, CLI, Streamlit, and a small PDF eval harness. Ongoing work: real-corpus evaluation, stricter evidence objects, and contributor tooling.
Overview RegBot is an LLM-powered tool designed to help researchers map their consent forms against GA4GH regulatory frameworks. It uses RAG (Retrieval-Augmented Generation) to flag compliance gaps automatically.
What works today
- Ingest policy PDFs or
.txtfiles into a local Chroma store plus a JSON manifest (chunk ids, page hints, source metadata). - Hybrid retrieval: embedding search + BM25, merged with reciprocal rank fusion.
- Compliance pass: JSON-mode LLM via Ollama by default (e.g.
llama3, configurable withREGBOT_OLLAMA_MODEL). SetREGBOT_LLM_PROVIDER=openaiandOPENAI_API_KEYto use OpenAI instead. If no LLM is reachable (or on API failure), a keyword heuristic fallback still returns grounded chunk ids. - Streamlit UI for upload + paste flows (
src/streamlit_app.py). - CLI:
python -m src.main …(see below). - Citation grounding (programmatic): Each
recommendations[]item must be{ "text": "...", "evidence_chunk_ids": ["..."] }with ids taken only from retrieved chunks; optionalcitations[]must also respect the same allow-list. Failed checks trigger automatic rewrite requests with the allow-list; optional token-overlap filtering on the LLM path (REGBOT_MIN_TOKEN_OVERLAP). - PDF eval harness:
evalsubcommand ingests a real GA4GH PDF and prints retrieval hits for built-in or custom queries (for manual review / building a gold set later).
Quickstart (Development)
- Prerequisites: Python 3.10–3.12 (CI uses 3.11). Python 3.14 is not supported yet for the full stack (native wheels for parts of the ML/Chroma toolchain often lag).
- Create a virtual environment and install dependencies:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt-
Configure environment variables:
- Export variables in your shell (recommended)
- If you use a local
.env, keep it private and do not commit it
-
LLM (default: local Ollama)
Install Ollama, runollama pull llama3(or another tag you set inREGBOT_OLLAMA_MODEL), and keep the daemon running (ollama serveorbrew services start ollamaon macOS). NoOPENAI_API_KEYis required for this path. -
Embeddings (first ingest)
The embedding model is downloaded from Hugging Face on first use. If downloads are slow or fail, try a longer timeout (HF_HUB_DOWNLOAD_TIMEOUT, seconds) or a mirror (REGBOT_HF_ENDPOINT=https://hf-mirror.com— setsHF_ENDPOINTfor the Hub client). -
Ingest a policy file into
./data/regbot_store(use--resetwhen reloading the same corpus):
python -m src.main ingest --path path/to/policy.pdf --reset- Check a consent / data-use text file:
python -m src.main check --consent path/to/consent.txt- Run the Streamlit UI from the repo root:
python -m streamlit run src/streamlit_app.py- End-to-end sample (synthetic policy + consent under
examples/):
python examples/run_demo.pyMore detail: examples/DEMO.md.
Evaluate retrieval on a real GA4GH PDF (resets the store by default if you pass --reset):
python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --top-k 8Use your own query list (one line per query):
python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --queries-file examples/eval/queries_ga4gh.txtOptionally append a full compliance JSON report for a consent file:
python -m src.main eval --pdf path/to/ga4gh_policy.pdf --reset --consent path/to/consent.txtRun tests
python -m unittest discover -s tests -p "test*.py" -vEnvironment Variables
REGBOT_LLM_PROVIDER:ollama(default) — local LLM via Ollama’s OpenAI-compatible HTTP API (no OpenAI key). Set toopenaito use OpenAI’s hosted API instead.OPENAI_API_KEY: Required only whenREGBOT_LLM_PROVIDER=openai. Model:REGBOT_LLM_MODEL(defaultgpt-4o-mini).REGBOT_OLLAMA_MODEL: Tag known to Ollama (defaultllama3). Examples:llama3,mistral,mistral:latest.REGBOT_OLLAMA_BASE_URL: Ollama HTTP host only (defaulthttp://127.0.0.1:11434);/v1is appended automatically for the OpenAI-compatible routes.REGBOT_OLLAMA_API_KEY: Sent as the Bearer/API key to Ollama’s shim (defaultollama; ignored by Ollama).REGBOT_STORE: On-disk store directory (default./data/regbot_store).REGBOT_EMBEDDING_MODEL: SentenceTransformers model id (defaultsentence-transformers/all-MiniLM-L6-v2).HF_HUB_DOWNLOAD_TIMEOUT: Hugging Face Hub download timeout in seconds (embedding model on first use). The app sets a higher default when unset; increase if you see read timeouts.REGBOT_HF_ENDPOINT: If set, copied toHF_ENDPOINT(e.g.https://hf-mirror.comwhere Hub mirrors are used).REGBOT_MIN_TOKEN_OVERLAP: On the LLM path, minimum token recall between each recommendation and cited chunk texts (default0.06). Set to0to disable dropping low-overlap rows.REGBOT_CHROMA_ANONYMIZED_TELEMETRY: Set to1to enable Chroma client telemetry; default is off (0).REGBOT_OPENAI_MAX_RETRIES: Retries for the OpenAI Python client (used for both OpenAI API and Ollama’s compatible endpoint; default3).
Architecture (implemented vs planned)
- Core: Python 3, package under
src/regbot/(ingest, hybrid retrieval, compliance, optional local embedding download helpers). - Embeddings:
sentence-transformers+ Hugging Face Hub (minimal file set; ONNX-heavy artifacts skipped where possible). - Vector store: Chroma persistent files under
REGBOT_STORE/chromaplusmanifest.jsonfor BM25 text. - Retrieval: cosine similarity in Chroma +
rank-bm25, fused via reciprocal rank fusion; optional metadata category filter. - LLM: Default: Ollama (
llama3orREGBOT_OLLAMA_MODEL) via OpenAI-compatible chat completions + JSON parsing. Optional:REGBOT_LLM_PROVIDER=openaiwithOPENAI_API_KEY. Fallback: keyword heuristic if OpenAI is selected without a key, or after LLM errors (e.g. Ollama not running). - UI: Streamlit (
src/streamlit_app.py). - Optional / roadmap: LangChain or LlamaIndex adapters on top of the same stores (not required by the current code); richer offline evaluation (Ragas, human labels); structured per-recommendation evidence (e.g. quotes).
Next steps (suggested priorities)
- Real GA4GH corpus: ingest official PDFs, tune chunk size/overlap and hybrid fusion weights using
eval+ a small gold query → chunk_id list (manual or semi-automated). - Richer evidence: optional quoted spans, stricter refusal when retrieved excerpts are insufficient (grounding and token-overlap checks are already in place for the LLM path).
- Contributor experience: Done in-repo: separate Lint workflow (Ruff check + format check),
CONTRIBUTING.md,.pre-commit-config.yaml,pyproject.toml,requirements-dev.txt. Still open: optional CImypy, broader type hints, Black-only rules if the team wants them. - Operational hardening: Done in-repo: Chroma telemetry off by default (
REGBOT_CHROMA_ANONYMIZED_TELEMETRY), clientmax_retries(REGBOT_OPENAI_MAX_RETRIES), clearValueErrorwhen a PDF yields no extractable text. Next: optional request timeouts, observability hooks.
Contributing
- See
CONTRIBUTING.mdfor venv setup, Ruff lint/format, optional pre-commit, and tests. - Open PRs against the upstream repo; keep changes scoped and tested (
python -m unittest discover -s tests -p "test*.py" -v). Do not commit.env, API keys, or localdata/regbot_store/.