feat: Build document ingestion pipeline for GA4GH policy compliance checks #12
Open
ReemHamraz wants to merge 3 commits into
- Add modular project structure (src/ingestion/, src/retrieval/)
- Implement PDF/text document loader with source metadata
- Add text chunking with LangChain RecursiveCharacterTextSplitter
- Integrate sentence-transformers (all-MiniLM-L6-v2) for local embeddings
- Set up ChromaDB for persistent vector storage
- Add retriever with similarity search and citation formatting
- Create CLI with ingest, query, and status commands
- Include sample GA4GH Framework excerpt for testing
- Add 22 unit tests (loader, chunker, retriever)
- Update README with setup, usage, and architecture docs
- Add .gitignore, .env.example, and centralized config
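For reviewers who want a picture of the CLI shape described in this commit, here is a minimal sketch using `argparse` subcommands. It is not the code in `src/main.py`: the program name `regbot` and the handler return strings are placeholders invented for illustration, and only the three command names (ingest, query, status) come from the commit message.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with ingest, query, and status subcommands."""
    parser = argparse.ArgumentParser(prog="regbot")  # hypothetical program name
    sub = parser.add_subparsers(dest="command", required=True)

    ingest = sub.add_parser("ingest", help="Ingest a single document")
    ingest.add_argument("file", help="Path to a PDF or text file")

    query = sub.add_parser("query", help="Ask a question against the store")
    query.add_argument("question", help="Natural-language question")

    sub.add_parser("status", help="Show what is in the vector store")
    return parser


def main(argv=None) -> str:
    """Dispatch on the chosen subcommand and return a description of the action."""
    args = build_parser().parse_args(argv)
    # Placeholder handlers: a real CLI would call the loader and retriever here.
    if args.command == "ingest":
        return f"would ingest {args.file}"
    if args.command == "query":
        return f"would search for: {args.question}"
    return "would print store status"


print(main(["query", "What is broad consent?"]))
```

Returning a string from `main` rather than printing inside each handler keeps the dispatch logic easy to unit test.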
Author
Hi @dedyli, thank you!!
- Add 5 comprehensive policy text documents in data/frameworks/:
  - DUO Data Use Ontology Reference (all data use codes + matching algo)
  - GA4GH Consent Policy Reference (informed consent, donor rights)
  - GA4GH Data Privacy & Security Policy Reference (13 privacy areas)
  - GA4GH Ethics Review Recognition Policy (mutual recognition)
  - GA4GH Consent Toolkit Clauses (genomic, rare disease, pediatric, familial)
- Enhance loader with category/subcategory metadata via DOCUMENT_REGISTRY
- Add section heading detection to chunker for richer citations
- Update retriever citations to show display_name and section
- Add ingest-all CLI command for bulk loading data/frameworks/
- Add download script for fetching GA4GH PDFs (scripts/download_documents.py)
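The section heading detection mentioned above could work roughly as follows. This is a sketch under assumptions, not the chunker in this PR: it assumes headings are either numbered lines such as "3.2 Types of Consent" or short ALL-CAPS lines, and the names `is_heading` and `tag_sections` are hypothetical.

```python
import re

# Assumed heading shapes: "3.2 Types of Consent" or a short ALL-CAPS line.
NUMBERED = re.compile(r"^\d+(?:\.\d+)*\.?\s+\S.*$")
ALL_CAPS = re.compile(r"^[A-Z][A-Z0-9 &/\-]{3,}$")


def is_heading(line: str) -> bool:
    """Heuristically decide whether a single line looks like a section heading."""
    line = line.strip()
    if not line or len(line) > 80:
        return False
    return bool(NUMBERED.match(line) or ALL_CAPS.match(line))


def tag_sections(text: str) -> list[tuple[str, str]]:
    """Return (section, paragraph) pairs, carrying the latest heading forward."""
    section = ""
    tagged = []
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        first, _, rest = block.partition("\n")
        if is_heading(first):
            section = first.strip()
            block = rest.strip()
            if not block:
                continue  # heading stood alone; it applies to later paragraphs
        tagged.append((section, block))
    return tagged
```

The heuristic will misfire on body paragraphs that begin with a number (for example "12 documents were ingested."), so a real chunker would likely need stricter patterns tuned to the source documents.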
- Add GA4GH_Framework_Responsible_Sharing.txt to data/frameworks/
- Update DOCUMENT_REGISTRY key to match new filename
- Add MRCG registry entry for DUO-bundled content
- Enhance cmd_status() to show document count, names, and categories
- Add get_collection_stats() to Retriever for richer metadata queries
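As an illustration of the kind of summary `cmd_status()` could print, the sketch below aggregates per-chunk metadata into document counts, names, and categories. It is an assumption about the approach, not the `get_collection_stats()` implementation in this PR: the function name `collection_stats` and the returned keys are hypothetical, and the input is a plain list of metadata dicts rather than a ChromaDB collection.

```python
from collections import Counter


def collection_stats(metadatas: list[dict]) -> dict:
    """Summarise per-chunk metadata into document-level counts."""
    chunks_per_doc = Counter(m.get("display_name", "unknown") for m in metadatas)

    # Each document's category is taken from the first chunk seen for it.
    category_of: dict[str, str] = {}
    for m in metadatas:
        name = m.get("display_name", "unknown")
        category_of.setdefault(name, m.get("category", "unknown"))

    return {
        "chunk_count": len(metadatas),
        "document_count": len(chunks_per_doc),
        "chunks_per_document": dict(chunks_per_doc),
        "documents_per_category": dict(Counter(category_of.values())),
    }
```

Counting categories per document rather than per chunk keeps long documents from dominating the category totals.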
Description
Closes #11
Implements the Phase 1 document ingestion pipeline for GA4GH-RegBot. Delivers a working end-to-end pipeline: load documents → chunk text → generate embeddings → store in ChromaDB → retrieve with rich citations. Includes 12 GA4GH policy documents with enriched metadata (category, subcategory, section headings) for citation grounding.
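To make the load → chunk → embed → store → retrieve flow concrete, here is a dependency-free toy version. Every component is a deliberate stand-in and not what this PR ships: overlapping word windows replace RecursiveCharacterTextSplitter, word-count vectors replace the all-MiniLM-L6-v2 embeddings, and a Python list replaces ChromaDB. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (stand-in for the real splitter)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts instead of a dense vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def ingest(store: list, source: str, text: str) -> None:
    """Chunk a document, embed each chunk, and append records to the store."""
    for i, piece in enumerate(chunk(text)):
        store.append({"text": piece, "vector": embed(piece), "source": source, "chunk_index": i})


def retrieve(store: list, question: str, k: int = 3) -> list[dict]:
    """Return the k stored chunks most similar to the question."""
    q = embed(question)
    return sorted(store, key=lambda rec: cosine(q, rec["vector"]), reverse=True)[:k]
```

Swapping any stand-in for the real component should leave the shape of the flow unchanged: `ingest` writes records with metadata, and `retrieve` ranks them by similarity to the question.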
Documents Ingested
The following 12 GA4GH policy documents were compiled, structured, and successfully embedded:
Categories recorded in the table, in row order: consent_requirements, privacy_security, duo_mapping, consent_requirements, consent_requirements, consent_requirements, consent_requirements, consent_requirements, consent_requirements, duo_mapping, ethics_review

Metadata Schema Per Chunk
Every chunk stored in ChromaDB carries:
- display_name: human-readable document name for citations
- category: document role (consent_requirements, duo_mapping, privacy_security, ethics_review, framework)
- subcategory: topic area (general, toolkit_clauses, data_use_codes, mutual_recognition, etc.)
- chunk_index: position within the source document
- type: file format (text or pdf)

This enables citations like:
[GA4GH Consent Policy (POL 002 v2.0), 3.2 Types of Consent, Chunk 14]

instead of returning anonymous text.
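A citation of that shape can be assembled directly from the chunk metadata. The helper below is a sketch, not the retriever code in this PR: the function name `format_citation` is hypothetical, and the `section` and `source` keys are assumed field names (the commit notes mention section headings and source metadata, but the exact keys are not shown here).

```python
def format_citation(meta: dict) -> str:
    """Build '[display_name, section, Chunk N]' from one chunk's metadata."""
    # Fall back to the raw source name (assumed key) when no display name exists.
    parts = [meta.get("display_name") or meta.get("source", "Unknown source")]
    if meta.get("section"):  # assumed key holding the detected section heading
        parts.append(meta["section"])
    if "chunk_index" in meta:
        parts.append(f"Chunk {meta['chunk_index']}")
    return "[" + ", ".join(parts) + "]"


print(format_citation({
    "display_name": "GA4GH Consent Policy (POL 002 v2.0)",
    "section": "3.2 Types of Consent",
    "chunk_index": 14,
}))
```

Skipping missing fields instead of printing empty slots keeps citations readable for chunks where no heading was detected.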
Changes
New Modules
DOCUMENT_REGISTRY with category mappings

CLI Commands
- python -m src.main ingest <file>
- python -m src.main ingest-all (bulk-loads data/frameworks/)
- python -m src.main query "<question>"
- python -m src.main status

Supporting Files
- data/frameworks/: 6 structured GA4GH policy text files (covering all 12 documents)
- tests/: 22 unit tests (loader, chunker, retriever)

Key Design Decisions
- Local embeddings via sentence-transformers (all-MiniLM-L6-v2): no API key needed, lowering the barrier for contributors

Testing
- ingest-all loaded all documents → query for consent requirements returns correct sections with enriched citations

How to test