
feat: Build document ingestion pipeline for GA4GH policy compliance checks #12

Open
ReemHamraz wants to merge 3 commits into ga4gh:main from ReemHamraz:feature/document-ingestion-pipeline

ReemHamraz commented Mar 6, 2026

Description

Closes #11
Implements the Phase 1 document ingestion pipeline for GA4GH-RegBot. Delivers a working end-to-end pipeline: load documents → chunk text → generate embeddings → store in ChromaDB → retrieve with rich citations. Includes 12 GA4GH policy documents with enriched metadata (category, subcategory, section headings) for citation grounding.

Documents Ingested

The following 12 GA4GH policy documents were compiled, structured, and successfully embedded:

| # | Document | Category |
|---|----------|----------|
| 1 | Framework for Responsible Sharing of Genomic Data | framework |
| 2 | GA4GH Consent Policy (POL 002 v2.0) | consent_requirements |
| 3 | GA4GH Data Privacy and Security Policy (POL 001 v2.0) | privacy_security |
| 4 | Machine Readable Consent Guidance (MRCG) | duo_mapping |
| 5 | Consent Clauses for Genomic Research (2020) | consent_requirements |
| 6 | Consent Toolkit: Clinical Genomic (D015 v6.0) | consent_requirements |
| 7 | Consent Toolkit: Rare Disease | consent_requirements |
| 8 | Familial Consent Clauses (D011 v1.0) | consent_requirements |
| 9 | Consent Clauses for Large Scale Initiatives (D014 v1.0) | consent_requirements |
| 10 | Pediatric Consent to Genetic Research (D012a v1.0) | consent_requirements |
| 11 | Data Use Ontology (DUO) Reference (all data use codes + matching algorithm) | duo_mapping |
| 12 | GA4GH Ethics Review Recognition Policy (mutual recognition framework for RECs) | ethics_review |

Note on file structure: Consent Toolkit documents 5–10 are bundled into a single structured text file (GA4GH_Consent_Toolkit_Clauses.txt) with clearly separated sections per toolkit, rather than stored as 6 individual files. This keeps the data/frameworks/ directory clean while preserving all content. The chunker splits them into individual chunks with section-level metadata regardless of file structure.
Documents 11 and 12 (DUO Reference + Ethics Review Recognition Policy) go beyond the standard 10 consent/privacy documents and provide additional coverage for data use ontology codes and ethics committee mutual recognition, both of which are critical for the compliance assistant's use case.
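
The bundling note above depends on the chunker recovering section headings from inside a single file. A minimal sketch of that idea follows. It assumes headings are numbered lines such as `3.2 Types of Consent`; the regex, the `chunk_with_sections` name, and the naive fixed-size split are illustrative assumptions, not the code in src/ingestion/chunker.py (which, per the commit notes, uses LangChain's RecursiveCharacterTextSplitter).

```python
import re

# Assumed heading shape: a numbered line such as "3.2 Types of Consent".
# Body lines that happen to start with a number would also match; a real
# detector would need a stricter rule.
HEADING_RE = re.compile(r"^\d+(?:\.\d+)*\s+\S.*$")


def chunk_with_sections(text: str, chunk_size: int = 200) -> list[dict]:
    """Split text into chunks, tagging each with the most recent heading."""
    chunks: list[dict] = []
    section = "Unknown section"
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered body text under the current section heading.
        body = " ".join(buffer).strip()
        buffer.clear()
        # Naive fixed-size split; it ignores sentence boundaries.
        for start in range(0, len(body), chunk_size):
            chunks.append({
                "text": body[start:start + chunk_size],
                "section": section,
                "chunk_index": len(chunks),
            })

    for line in text.splitlines():
        stripped = line.strip()
        if HEADING_RE.match(stripped):
            flush()  # close out the previous section before switching
            section = stripped
        elif stripped:
            buffer.append(stripped)
    flush()
    return chunks
```

Because the heading is tracked while scanning, each chunk keeps its section label whether its toolkit lives in its own file or in a bundled one.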

Metadata Schema Per Chunk

Every chunk stored in ChromaDB carries:

  • source — raw filename
  • display_name — human-readable document name for citations
  • category — document role (consent_requirements, duo_mapping, privacy_security, ethics_review, framework)
  • subcategory — topic area (general, toolkit_clauses, data_use_codes, mutual_recognition, etc.)
  • section — detected section heading for citation grounding
  • chunk_index — position within the source document
  • type — file format (text or pdf)

This enables citations like:
  [GA4GH Consent Policy (POL 002 v2.0), 3.2 Types of Consent, Chunk 14]
instead of returning anonymous text.
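
To make the schema concrete, here is a small sketch of how the seven fields could be held together and turned into the citation string shown above. The `ChunkMetadata` class, the `format_citation` helper, and the example filename are hypothetical names chosen for illustration; the retriever in this PR may structure this differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkMetadata:
    """Hypothetical container mirroring the per-chunk metadata fields."""
    source: str
    display_name: str
    category: str
    subcategory: str
    section: str
    chunk_index: int
    type: str  # "text" or "pdf"


def format_citation(meta: ChunkMetadata) -> str:
    """Render '[display_name, section, Chunk N]', skipping an empty section."""
    parts = [meta.display_name]
    if meta.section:
        parts.append(meta.section)
    parts.append(f"Chunk {meta.chunk_index}")
    return "[" + ", ".join(parts) + "]"


meta = ChunkMetadata(
    source="GA4GH_Consent_Policy.txt",  # illustrative filename, not from the PR
    display_name="GA4GH Consent Policy (POL 002 v2.0)",
    category="consent_requirements",
    subcategory="general",
    section="3.2 Types of Consent",
    chunk_index=14,
    type="text",
)
print(format_citation(meta))
# prints: [GA4GH Consent Policy (POL 002 v2.0), 3.2 Types of Consent, Chunk 14]
```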

Changes

New Modules

| File | Purpose |
|------|---------|
| src/config.py | Centralized settings + DOCUMENT_REGISTRY with category mappings |
| src/ingestion/loader.py | PDF/text loading + registry-based metadata enrichment + section detection + load_all_frameworks() |
| src/ingestion/chunker.py | Text splitting + section heading detection per chunk |
| src/ingestion/embedder.py | Sentence-transformer embeddings → ChromaDB storage |
| src/retrieval/retriever.py | Similarity search + enriched citations (display_name, section) |
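
As a rough illustration of registry-based metadata enrichment, the sketch below keys a registry on filename and falls back to defaults for unregistered files. The two filenames appear elsewhere in this PR, but the entry contents, the `enrich_metadata` helper, and the `uncategorized` fallback are assumptions rather than the actual contents of src/config.py or src/ingestion/loader.py.

```python
from pathlib import Path

# Illustrative entries only; the real DOCUMENT_REGISTRY may differ.
DOCUMENT_REGISTRY = {
    "GA4GH_Consent_Toolkit_Clauses.txt": {
        "display_name": "GA4GH Consent Toolkit Clauses",
        "category": "consent_requirements",
        "subcategory": "toolkit_clauses",
    },
    "GA4GH_Framework_Responsible_Sharing.txt": {
        "display_name": "Framework for Responsible Sharing of Genomic Data",
        "category": "framework",
        "subcategory": "general",
    },
}


def enrich_metadata(path: str) -> dict:
    """Look up a file in the registry; fall back to defaults if it is unknown."""
    p = Path(path)
    entry = DOCUMENT_REGISTRY.get(p.name, {})
    return {
        "source": p.name,
        # Unknown files get a display name derived from the filename.
        "display_name": entry.get("display_name", p.stem.replace("_", " ")),
        "category": entry.get("category", "uncategorized"),  # assumed fallback
        "subcategory": entry.get("subcategory", "general"),
        "type": "pdf" if p.suffix.lower() == ".pdf" else "text",
    }
```

Keying on the bare filename means a renamed file silently falls back to the defaults, which is presumably why the registry key is updated along with the new Framework filename in the commit notes.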

CLI Commands

| Command | Description |
|---------|-------------|
| python -m src.main ingest <file> | Ingest a single document |
| python -m src.main ingest-all | Bulk ingest all documents from data/frameworks/ |
| python -m src.main query "<question>" | Query with enriched citations |
| python -m src.main status | Show document names and categories |
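
For readers unfamiliar with subcommand-style CLIs, the four commands above map naturally onto `argparse` subparsers. The sketch below shows only the argument parsing under that assumption; `build_parser` is a hypothetical name, and src/main.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing ingest, ingest-all, query, and status."""
    parser = argparse.ArgumentParser(prog="python -m src.main")
    sub = parser.add_subparsers(dest="command", required=True)

    ingest = sub.add_parser("ingest", help="Ingest a single document")
    ingest.add_argument("file", help="Path to a PDF or text file")

    sub.add_parser("ingest-all", help="Bulk ingest data/frameworks/")

    query = sub.add_parser("query", help="Query with enriched citations")
    query.add_argument("question", help="Natural-language question")

    sub.add_parser("status", help="Show document names and categories")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)  # a real entry point would dispatch on args.command here
```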

Supporting Files

  • data/frameworks/ — 6 structured GA4GH policy text files (covering all 12 documents)
  • data/ga4gh_framework_excerpt.txt — Sample Framework excerpt
  • scripts/download_documents.py — Downloads additional GA4GH PDFs from Google Drive
  • tests/ with 22 unit tests (loader, chunker, retriever)
  • .gitignore, .env.example, README.md

Key Design Decisions

  1. Local embeddings (all-MiniLM-L6-v2) — no API key needed, lowering the barrier for contributors
  2. ChromaDB with persistence — vector store survives restarts, no re-ingestion needed
  3. Enriched metadata — every chunk carries category, subcategory, display name, and section heading for precise citation grounding
  4. Bundled toolkit files — related consent toolkit documents are grouped by topic in structured text files with clear section markers, keeping the directory clean while the chunker handles granular splitting
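
The retrieval side of these decisions comes down to nearest-neighbor search over embedding vectors, with the enriched metadata carried alongside each vector. The sketch below shows that mechanism with plain-Python cosine similarity and toy 3-dimensional vectors. It is a stand-in for illustration only: the PR uses sentence-transformers for the embeddings and ChromaDB for storage and search, and the `cosine` and `top_k` helpers and the sample records are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], records: list[tuple], k: int = 2) -> list[dict]:
    """records holds (vector, metadata) pairs; return metadata of the k best."""
    scored = sorted(records, key=lambda r: cosine(query_vec, r[0]), reverse=True)
    return [meta for _, meta in scored[:k]]


# Toy vectors standing in for real embeddings.
records = [
    ([1.0, 0.0, 0.0], {"display_name": "Consent Policy", "section": "3.2"}),
    ([0.0, 1.0, 0.0], {"display_name": "DUO Reference", "section": "Codes"}),
    ([0.9, 0.1, 0.0], {"display_name": "Consent Toolkit", "section": "Clauses"}),
]

for hit in top_k([1.0, 0.0, 0.0], records):
    print(hit["display_name"], "/", hit["section"])
```

Because the metadata travels with each vector, a hit can be formatted as a citation without a second lookup.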

Testing

  • 22 unit tests passing
  • E2E verified: ingest-all loaded all documents → query for consent requirements returns correct sections with enriched citations

How to test

python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
python -m pytest tests/ -v
python -m src.main ingest-all
python -m src.main query "What are the consent requirements for genomic data?"

The activation command above is for Windows; on macOS/Linux, use source venv/bin/activate instead.

- Add modular project structure (src/ingestion/, src/retrieval/)
- Implement PDF/text document loader with source metadata
- Add text chunking with LangChain RecursiveCharacterTextSplitter
- Integrate sentence-transformers (all-MiniLM-L6-v2) for local embeddings
- Set up ChromaDB for persistent vector storage
- Add retriever with similarity search and citation formatting
- Create CLI with ingest, query, and status commands
- Include sample GA4GH Framework excerpt for testing
- Add 22 unit tests (loader, chunker, retriever)
- Update README with setup, usage, and architecture docs
- Add .gitignore, .env.example, and centralized config

ReemHamraz commented Mar 6, 2026

Hi @dedyli
I've been working on this PR for quite some time. Kindly review it and let me know if anything needs to be changed.
I plan to open a few more PRs before turning in my proposal (as you asked) so that I have a decent list of contributions to this repo. My proposal is also almost done, so I'll be sending it over by Sunday!

Thank you!!

- Add 5 comprehensive policy text documents in data/frameworks/:
  - DUO Data Use Ontology Reference (all data use codes + matching algo)
  - GA4GH Consent Policy Reference (informed consent, donor rights)
  - GA4GH Data Privacy & Security Policy Reference (13 privacy areas)
  - GA4GH Ethics Review Recognition Policy (mutual recognition)
  - GA4GH Consent Toolkit Clauses (genomic, rare disease, pediatric, familial)
- Enhance loader with category/subcategory metadata via DOCUMENT_REGISTRY
- Add section heading detection to chunker for richer citations
- Update retriever citations to show display_name and section
- Add ingest-all CLI command for bulk loading data/frameworks/
- Add download script for fetching GA4GH PDFs (scripts/download_documents.py)
- Add GA4GH_Framework_Responsible_Sharing.txt to data/frameworks/
- Update DOCUMENT_REGISTRY key to match new filename
- Add MRCG registry entry for DUO-bundled content
- Enhance cmd_status() to show document count, names, and categories
- Add get_collection_stats() to Retriever for richer metadata queries
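
As a hedged illustration of what a stats helper behind the status command might compute, the sketch below summarizes a list of chunk metadata dicts. The `collection_stats` function and its return shape are assumptions for illustration; they are not the actual `get_collection_stats()` added to the Retriever, and the input is assumed to be metadata already fetched from the vector store.

```python
from collections import Counter


def collection_stats(metadatas: list[dict]) -> dict:
    """Summarize chunk metadata: totals, distinct documents, chunks per category."""
    # Map each distinct document name to its category.
    docs = {m["display_name"]: m["category"] for m in metadatas}
    return {
        "chunk_count": len(metadatas),
        "document_count": len(docs),
        "documents": dict(sorted(docs.items())),
        "chunks_per_category": dict(Counter(m["category"] for m in metadatas)),
    }


# Toy metadata standing in for what the vector store would return.
sample = [
    {"display_name": "GA4GH Consent Policy (POL 002 v2.0)", "category": "consent_requirements"},
    {"display_name": "GA4GH Consent Policy (POL 002 v2.0)", "category": "consent_requirements"},
    {"display_name": "Data Use Ontology (DUO) Reference", "category": "duo_mapping"},
]
stats = collection_stats(sample)
print(stats["document_count"], stats["chunks_per_category"])
```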

Development

Successfully merging this pull request may close these issues.

Build document ingestion pipeline for GA4GH policy documents
