`feat: Build document ingestion pipeline for GA4GH policy compliance checks` by ReemHamraz · Pull Request #12 · ga4gh/GA4GH-RegBot

ReemHamraz · 2026-03-06T07:22:08Z

Description

Closes #11
Implements the Phase 1 document ingestion pipeline for GA4GH-RegBot. Delivers a working end-to-end pipeline: load documents → chunk text → generate embeddings → store in ChromaDB → retrieve with rich citations. Includes 12 GA4GH policy documents with enriched metadata (category, subcategory, section headings) for citation grounding.

Documents Ingested

The following 12 GA4GH policy documents were compiled, structured, and successfully embedded:

#	Document	Category
1	Framework for Responsible Sharing of Genomic Data	`framework`
2	GA4GH Consent Policy (POL 002 v2.0)	`consent_requirements`
3	GA4GH Data Privacy and Security Policy (POL 001 v2.0)	`privacy_security`
4	Machine Readable Consent Guidance (MRCG)	`duo_mapping`
5	Consent Clauses for Genomic Research (2020)	`consent_requirements`
6	Consent Toolkit: Clinical Genomic (D015 v6.0)	`consent_requirements`
7	Consent Toolkit: Rare Disease	`consent_requirements`
8	Familial Consent Clauses (D011 v1.0)	`consent_requirements`
9	Consent Clauses for Large Scale Initiatives (D014 v1.0)	`consent_requirements`
10	Pediatric Consent to Genetic Research (D012a v1.0)	`consent_requirements`
11	Data Use Ontology (DUO) Reference — all data use codes + matching algorithm	`duo_mapping`
12	GA4GH Ethics Review Recognition Policy — mutual recognition framework for RECs	`ethics_review`

Note on file structure: Consent Toolkit documents 5–10 are bundled into a single structured text file (GA4GH_Consent_Toolkit_Clauses.txt) with clearly separated sections per toolkit, rather than stored as 6 individual files. This keeps the data/frameworks/ directory clean while preserving all content. The chunker splits them into individual chunks with section-level metadata regardless of file structure.
Documents 11 and 12 (DUO Reference + Ethics Review Recognition Policy) go beyond the standard 10 consent/privacy documents and provide additional coverage for data use ontology codes and ethics committee mutual recognition, both of which are critical for the compliance assistant's use case.
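
As a rough illustration of how a bundled file can still yield section-level metadata, the sketch below tracks the most recent heading while splitting. It is not the code in src/ingestion/chunker.py: the heading pattern (numbered headings such as "3.2 Types of Consent"), the function name `chunk_with_sections`, and the `max_chars` limit are assumptions made for this example, and the line-based splitting is a simple stand-in for LangChain's RecursiveCharacterTextSplitter.

```python
import re

# Assumption: section headings are numbered lines such as "3.2 Types of Consent".
# This is a heuristic; a body line that starts with a number could match too.
HEADING_RE = re.compile(r"^\d+(?:\.\d+)*\s+[A-Z].*$")


def chunk_with_sections(text, max_chars=200):
    """Split text into chunks and tag each one with the latest heading seen."""
    chunks = []
    section = "Unknown"
    buffer = []

    def flush():
        # Emit the buffered lines as one chunk under the current section.
        body = " ".join(buffer).strip()
        if body:
            chunks.append(
                {"text": body, "section": section, "chunk_index": len(chunks)}
            )
        buffer.clear()

    for line in text.splitlines():
        line = line.strip()
        if HEADING_RE.match(line):
            flush()  # close the chunk that belongs to the previous section
            section = line
        elif line:
            buffered = sum(len(part) + 1 for part in buffer)
            if buffered + len(line) > max_chars:
                flush()  # start a new chunk inside the same section
            buffer.append(line)
    flush()
    return chunks
```

Because the section is carried per chunk rather than per file, bundling several toolkits into one file does not change what a citation can point to.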

Metadata Schema Per Chunk

Every chunk stored in ChromaDB carries:

source — raw filename
display_name — human-readable document name for citations
category — document role (consent_requirements, duo_mapping, privacy_security, ethics_review, framework)
subcategory — topic area (general, toolkit_clauses, data_use_codes, mutual_recognition, etc.)
section — detected section heading for citation grounding
chunk_index — position within the source document
type — file format (text or pdf)
This enables citations like:
[GA4GH Consent Policy (POL 002 v2.0), 3.2 Types of Consent, Chunk 14]
instead of returning anonymous text.
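
A minimal sketch of how a citation string in that shape could be assembled from the per-chunk metadata. The function name `format_citation` and the fallback behaviour for missing fields are assumptions for illustration; the retriever in this PR may build the string differently.

```python
def format_citation(metadata):
    """Build a '[name, section, Chunk N]' citation from chunk metadata."""
    # Prefer the human-readable name; fall back to the raw filename.
    name = metadata.get("display_name") or metadata.get("source", "Unknown source")
    parts = [name]

    # The section is optional: heading detection may not find one.
    section = metadata.get("section")
    if section:
        parts.append(section)

    if "chunk_index" in metadata:
        parts.append(f"Chunk {metadata['chunk_index']}")

    return "[" + ", ".join(parts) + "]"


example = {
    "source": "consent_policy.txt",  # hypothetical filename
    "display_name": "GA4GH Consent Policy (POL 002 v2.0)",
    "section": "3.2 Types of Consent",
    "chunk_index": 14,
}
print(format_citation(example))
```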

Changes

New Modules

File	Purpose
src/config.py	Centralized settings + `DOCUMENT_REGISTRY` with category mappings
src/ingestion/loader.py	PDF/text loading + registry-based metadata enrichment + section detection + load_all_frameworks()
src/ingestion/chunker.py	Text splitting + section heading detection per chunk
src/ingestion/embedder.py	Sentence-transformer embeddings → ChromaDB storage
src/retrieval/retriever.py	Similarity search + enriched citations (display_name, section)
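
To make the registry-based enrichment concrete, here is a hedged sketch of the idea. The two filenames come from this PR, but the registry layout, the toolkit display name, the subcategory assignments, the `enrich_metadata` helper, and the `"uncategorized"` fallback are all assumptions; the real `DOCUMENT_REGISTRY` in src/config.py may be shaped differently.

```python
from pathlib import Path

# Hypothetical registry layout: filename -> citation and category metadata.
DOCUMENT_REGISTRY = {
    "GA4GH_Consent_Toolkit_Clauses.txt": {
        "display_name": "GA4GH Consent Toolkit Clauses",
        "category": "consent_requirements",
        "subcategory": "toolkit_clauses",
    },
    "GA4GH_Framework_Responsible_Sharing.txt": {
        "display_name": "Framework for Responsible Sharing of Genomic Data",
        "category": "framework",
        "subcategory": "general",
    },
}


def enrich_metadata(path):
    """Return base metadata for a file, enriched from the registry if listed."""
    file_path = Path(path)
    entry = DOCUMENT_REGISTRY.get(file_path.name, {})
    return {
        "source": file_path.name,
        # Unregistered files fall back to the filename and a generic category.
        "display_name": entry.get("display_name", file_path.name),
        "category": entry.get("category", "uncategorized"),
        "subcategory": entry.get("subcategory", "general"),
        "type": "pdf" if file_path.suffix.lower() == ".pdf" else "text",
    }
```

Keying the registry on the filename keeps the loader free of per-document branching; the cost is that a renamed file silently loses its metadata unless the key is updated too.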

CLI Commands

Command	Description
`python -m src.main ingest <file>`	Ingest a single document
`python -m src.main ingest-all`	Bulk ingest all documents from `data/frameworks/`
`python -m src.main query "<question>"`	Query with enriched citations
`python -m src.main status`	Show document names and categories
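
For reviewers who want a picture of the command surface, the following is a sketch of how these four subcommands could be declared with `argparse`. Only the command names and their arguments are taken from the table above; `build_parser` and the argument names `file` and `question` are assumptions, and the actual src/main.py may be organised differently.

```python
import argparse


def build_parser():
    """Declare the ingest, ingest-all, query, and status subcommands."""
    parser = argparse.ArgumentParser(
        prog="python -m src.main",
        description="GA4GH-RegBot document ingestion and retrieval",
    )
    sub = parser.add_subparsers(dest="command", required=True)

    ingest = sub.add_parser("ingest", help="Ingest a single document")
    ingest.add_argument("file", help="Path to a PDF or text file")

    sub.add_parser("ingest-all", help="Bulk ingest data/frameworks/")

    query = sub.add_parser("query", help="Query with enriched citations")
    query.add_argument("question", help="Natural-language question")

    sub.add_parser("status", help="Show document names and categories")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.command)  # dispatch to the matching handler would go here
```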

Supporting Files

data/frameworks/ — 6 structured GA4GH policy text files (covering all 12 documents)
data/ga4gh_framework_excerpt.txt — Sample Framework excerpt
scripts/download_documents.py — Downloads additional GA4GH PDFs from Google Drive
tests/ — 22 unit tests (loader, chunker, retriever)
.gitignore, .env.example, README.md

Key Design Decisions

Local embeddings (all-MiniLM-L6-v2) — no API key needed, lowering the barrier for contributors
ChromaDB with persistence — vector store survives restarts, no re-ingestion needed
Enriched metadata — every chunk carries category, subcategory, display name, and section heading for precise citation grounding
Bundled toolkit files — related consent toolkit documents are grouped by topic in structured text files with clear section markers, keeping the directory clean while the chunker handles granular splitting
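
The retrieval step these decisions support can be pictured with a toy, dependency-free sketch. It substitutes hand-written 2-D vectors for sentence-transformer embeddings and a plain list for ChromaDB, and it assumes cosine similarity as the ranking metric, which this PR does not state. The names `cosine_similarity`, `top_k`, and the sample store are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the k stored chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, item["embedding"]), item) for item in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]


# Toy store: each entry keeps its embedding next to the citation metadata.
store = [
    {"id": "a", "embedding": [1.0, 0.0], "category": "consent_requirements"},
    {"id": "b", "embedding": [0.0, 1.0], "category": "privacy_security"},
    {"id": "c", "embedding": [0.7, 0.7], "category": "duo_mapping"},
]
results = top_k([1.0, 0.1], store, k=2)
print([item["id"] for item in results])
```

Storing metadata alongside each vector is what lets a hit be turned into a citation without a second lookup.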

Testing

22 unit tests passing
E2E verified: ingest-all loaded all documents → query for consent requirements returns correct sections with enriched citations

How to test

python -m venv venv
venv\Scripts\activate  # Windows; on macOS/Linux use: source venv/bin/activate
pip install -r requirements.txt
python -m pytest tests/ -v
python -m src.main ingest-all
python -m src.main query "What are the consent requirements for genomic data?"

- Add modular project structure (src/ingestion/, src/retrieval/)
- Implement PDF/text document loader with source metadata
- Add text chunking with LangChain RecursiveCharacterTextSplitter
- Integrate sentence-transformers (all-MiniLM-L6-v2) for local embeddings
- Set up ChromaDB for persistent vector storage
- Add retriever with similarity search and citation formatting
- Create CLI with ingest, query, and status commands
- Include sample GA4GH Framework excerpt for testing
- Add 22 unit tests (loader, chunker, retriever)
- Update README with setup, usage, and architecture docs
- Add .gitignore, .env.example, and centralized config

ReemHamraz · 2026-03-06T07:26:28Z

Hi @dedyli
I've been working on this PR for quite some time. Please review it and let me know if anything needs to change.
I plan to open a few more PRs before turning in my proposal (as you asked) so that I have a decent list of contributions to this repo. My proposal is almost done, so I'll be sending it over by Sunday!

Thank you!!

- Add 5 comprehensive policy text documents in data/frameworks/:
  - DUO Data Use Ontology Reference (all data use codes + matching algo)
  - GA4GH Consent Policy Reference (informed consent, donor rights)
  - GA4GH Data Privacy & Security Policy Reference (13 privacy areas)
  - GA4GH Ethics Review Recognition Policy (mutual recognition)
  - GA4GH Consent Toolkit Clauses (genomic, rare disease, pediatric, familial)
- Enhance loader with category/subcategory metadata via DOCUMENT_REGISTRY
- Add section heading detection to chunker for richer citations
- Update retriever citations to show display_name and section
- Add ingest-all CLI command for bulk loading data/frameworks/
- Add download script for fetching GA4GH PDFs (scripts/download_documents.py)

- Add GA4GH_Framework_Responsible_Sharing.txt to data/frameworks/
- Update DOCUMENT_REGISTRY key to match new filename
- Add MRCG registry entry for DUO-bundled content
- Enhance cmd_status() to show document count, names, and categories
- Add get_collection_stats() to Retriever for richer metadata queries

ReemHamraz added 2 commits March 6, 2026 13:28

ReemHamraz mentioned this pull request Apr 7, 2026

Astropy: Hardening Astropy's Core Stability (Reem Hamraz) OpenAstronomy/gsoc-proposals#24

ReemHamraz wants to merge 3 commits into ga4gh:main from ReemHamraz:feature/document-ingestion-pipeline
