A biomedical context engineering system. An agent uses Qdrant vector search engine tools (hybrid retrieval, recommendations) and Neo4j graph database tools (graph enrichment) to gather context, then fuses it into a single biomedical answer.
Originally forked from benitomartin/biomedical-graphrag.
References:
- Video: PubMed Navigator
- Article: Building a Biomedical GraphRAG: When Knowledge Graphs Meet Vector Search
Key Features:
- Context Engineering: Agent orchestrates Qdrant and Neo4j tools, fusing results into a single answer
- Qdrant Vector Search Engine: Hybrid retrieval (dense + BM25 with reranking) and constraint-based recommendations
- Neo4j Graph Database: Graph enrichment via ontology-based tools (collaborator networks, MeSH relations, gene co-mentions)
- Data Integration: Processes PubMed papers, gene data, and research citations
- Biomedical Schema: Specialized graph schema for papers, authors, institutions, genes, and MeSH terms
- Async Processing: High-performance async data collection and processing
```
biomedical-graphrag/
├── .github/                    # GitHub workflows and templates
├── data/                       # Dataset storage (PubMed, Gene data)
├── src/
│   └── biomedical_graphrag/
│       ├── api/                # FastAPI server
│       │   └── server.py       # PubMed Navigator API endpoints
│       ├── application/        # Application layer
│       │   ├── cli/            # Command-line interfaces
│       │   └── services/       # Business logic services
│       ├── config.py           # Configuration management
│       ├── data_sources/       # Data collection modules
│       ├── domain/             # Domain models and entities
│       ├── infrastructure/     # Database and external service adapters
│       └── utils/              # Utility functions
├── static/                     # Static assets (images, etc.)
├── tests/                      # Test suite
├── Dockerfile                  # Docker build configuration
├── LICENSE                     # MIT License
├── Makefile                    # Build and development commands
├── pyproject.toml              # Project configuration and dependencies
├── README.md                   # This file
└── uv.lock                     # Dependency lock file
```
| Requirement | Description |
|---|---|
| Python 3.13+ | Programming language |
| uv | Package and dependency manager |
| Neo4j | Graph database for knowledge graphs |
| Qdrant | Vector search engine for embeddings |
| OpenAI | LLM provider for queries and embeddings |
| PubMed | Biomedical literature database |
- Clone the repository:

  ```bash
  git clone git@github.com:thierrypdamiba/biomedical-graphrag.git
  cd biomedical-graphrag
  ```

- Create a virtual environment:

  ```bash
  uv venv
  ```

- Activate the virtual environment:

  ```bash
  source .venv/bin/activate
  ```

- Install the required packages:

  ```bash
  uv sync --all-groups --all-extras
  ```
1. Create a `.env` file in the root directory:

   ```bash
   cp .env.example .env
   ```

2. Configure API keys, model names, and other settings by editing the `.env` file:

   ```bash
   # OpenAI Configuration
   OPENAI__API_KEY=your_openai_api_key_here
   OPENAI__MODEL=gpt-4o-mini
   OPENAI__TEMPERATURE=0.0
   OPENAI__MAX_TOKENS=1500

   # Neo4j Configuration
   NEO4J__URI=bolt://localhost:7687
   NEO4J__USERNAME=neo4j
   NEO4J__PASSWORD=your_neo4j_password
   NEO4J__DATABASE=neo4j

   # Qdrant Configuration
   QDRANT__URL=http://localhost:6333
   QDRANT__API_KEY=your_qdrant_api_key
   QDRANT__COLLECTION_NAME=biomedical_papers
   QDRANT__EMBEDDING_MODEL=text-embedding-3-large
   QDRANT__EMBEDDING_DIMENSION=1536
   QDRANT__RERANKER_EMBEDDING_DIMENSION=3072
   QDRANT__ESTIMATE_BM25_AVG_LEN_ON_X_DOCS=300
   QDRANT__CLOUD_INFERENCE=false

   # PubMed Configuration (optional)
   PUBMED__EMAIL=your_email@example.com
   PUBMED__API_KEY=your_pubmed_api_key

   # Data Paths
   JSON_DATA__PUBMED_JSON_PATH=data/pubmed_dataset.json
   JSON_DATA__GENE_JSON_PATH=data/gene_dataset.json
   ```

The system includes data collectors for biomedical and gene datasets:
```bash
# Collect PubMed papers and metadata
make pubmed-data-collector-run

# Override defaults (optional)
make pubmed-data-collector-run QUERY="cancer immunotherapy" MAX_RESULTS=200
```

```bash
# Collect gene information related to the PubMed dataset
make gene-data-collector-run
```

```bash
# Enrich PubMed dataset with related papers
make enrich-pubmed-dataset

# Skip the first N papers as sources (optional)
make enrich-pubmed-dataset START_INDEX=1000
```

```bash
# Create the knowledge graph from datasets
make create-graph

# Delete all graph data (clean slate)
make delete-graph
```

```bash
# Create vector collection for embeddings
make create-qdrant-collection

# Ingest embeddings into Qdrant
make ingest-qdrant-data

# Delete vector collection
make delete-qdrant-collection
```

Notes:
- Embeddings are built from PubMed paper abstracts.
- This project uses the Matryoshka Representation Learning (MRL) feature of OpenAI embeddings:
  - `QDRANT__EMBEDDING_DIMENSION` is the prefix dimension used for retrieval (stored in Qdrant as the `Dense` vector).
  - `QDRANT__RERANKER_EMBEDDING_DIMENSION` is the (larger) prefix dimension used for reranking (stored in Qdrant as the `Reranker` vector).
- `make ingest-qdrant-data` currently recreates the collection on each run (see `qdrant_ingestion.py`). To avoid that, change `recreate=True` to `False`. There is also an `only_new` parameter, which defaults to `True`, so only papers whose PMID is not already present in the collection are ingested. Set `only_new=False` if you'd prefer to overwrite existing points, or when ingesting into a clean collection (where it should always be `False`).
- The collection is configured by default with scalar quantization (compressed dense vectors).
- `QDRANT__CLOUD_INFERENCE=true` enables Qdrant Cloud Inference, for setups where embeddings are computed by Qdrant Cloud.
- `QDRANT__ESTIMATE_BM25_AVG_LEN_ON_X_DOCS` controls how many documents are sampled to estimate the average abstract length used by BM25. This helps calibrate BM25-based scoring when using dense + BM25 hybrid retrieval.
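To make the MRL note above more concrete, here is a minimal sketch of how a single full-size embedding could yield both a shorter retrieval vector and a longer reranking vector: keep a prefix of the components and L2-normalize it again. The helper name `mrl_prefix` and the toy 4-dimensional vector are illustrative assumptions, not code from this repository; the project's actual logic lives in `qdrant_ingestion.py` and may differ.

```python
import math


def mrl_prefix(embedding: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and re-normalize them to unit length.

    Hypothetical helper for illustration only.
    """
    if not 0 < dim <= len(embedding):
        raise ValueError(f"dim must be in 1..{len(embedding)}, got {dim}")
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    # An all-zero prefix cannot be normalized, so it is returned unchanged.
    return prefix if norm == 0.0 else [x / norm for x in prefix]


# Toy stand-in for a full embedding; a real vector would have thousands of values.
full = [3.0, 4.0, 0.0, 12.0]

dense = mrl_prefix(full, 2)     # plays the role of the shorter `Dense` vector
reranker = mrl_prefix(full, 4)  # plays the role of the longer `Reranker` vector
```

Re-normalizing matters because a truncated prefix is generally no longer unit length, which would skew cosine or dot-product scores.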
```bash
# Run a custom natural language query
make custom-graph-query QUESTION="What are the latest research trends in cancer immunotherapy?"

# Or run directly with the CLI (positional args)
uv run src/biomedical_graphrag/application/cli/fusion_query.py "What are the latest research trends in cancer immunotherapy?"
```

The hybrid query system combines the vector search engine (Qdrant) with graph enrichment (Neo4j):
- Author collaboration networks
- Citation analysis and paper relationships
- Gene-paper associations
- MeSH term relationships
- Institution affiliations
LLM-powered tool selection & fusion:
- Runs one Qdrant tool, either hybrid retrieval (dense + BM25 with reranking) or constraint-based recommendations, to fetch relevant papers.
- Calls Neo4j enrichment tools for graph evidence.
- Produces one fused answer from both sources.
Output:
- The hybrid CLI prints a final section titled "=== Unified Biomedical Answer ===".
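The three steps above can be sketched as a small control flow. This is a simplified illustration under stated assumptions, not the repository's implementation: every function here is a hypothetical stub, and the keyword rule in `select_qdrant_tool` merely stands in for the LLM's tool choice.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Context:
    """Evidence gathered for one question."""
    papers: list[str] = field(default_factory=list)
    graph_facts: list[str] = field(default_factory=list)


# Hypothetical stand-ins for the two Qdrant tools.
def hybrid_retrieval(question: str) -> list[str]:
    return [f"paper matching '{question}'"]


def recommend_with_constraints(question: str) -> list[str]:
    return [f"paper recommended for '{question}'"]


# Hypothetical stand-in for the Neo4j enrichment tools.
def graph_enrichment(papers: list[str]) -> list[str]:
    return [f"graph evidence for {paper}" for paper in papers]


def select_qdrant_tool(question: str) -> Callable[[str], list[str]]:
    """Placeholder for the LLM's decision: exactly one Qdrant tool per query."""
    wants_similar = "similar to" in question.lower()
    return recommend_with_constraints if wants_similar else hybrid_retrieval


def answer(question: str) -> str:
    tool = select_qdrant_tool(question)                 # step 1: pick one Qdrant tool
    ctx = Context(papers=tool(question))
    ctx.graph_facts = graph_enrichment(ctx.papers)      # step 2: graph evidence
    lines = ["=== Unified Biomedical Answer ===", *ctx.papers, *ctx.graph_facts]
    return "\n".join(lines)                             # step 3: one fused answer
```

In the real pipeline the fusion step is performed by the LLM rather than by joining strings; the sketch only shows the order of operations.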
Example questions:

- Who collaborates with Jennifer Doudna on CRISPR research?
- Which genes are frequently co-mentioned with TP53?
The project includes a FastAPI server (PubMed Navigator) that exposes the context engineering pipeline via HTTP endpoints:
```bash
# Start the API server (runs on port 8765)
make run-api
```

Available Endpoints:
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check |
| GET | `/api/neo4j/stats` | Neo4j graph statistics (node/relationship counts) |
| POST | `/api/graphrag-query` | Context engineering search (Qdrant + Neo4j) |
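The POST endpoint in the table can also be called from Python. The sketch below uses only the standard library and mirrors the request body (`query`, `limit`) from this README's curl example. The helper names `build_query_request` and `run_query` are hypothetical, and the response is assumed to be a JSON body, since its schema is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8765"  # port used by `make run-api`


def build_query_request(query: str, limit: int = 5) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/graphrag-query."""
    payload = json.dumps({"query": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/graphrag-query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query: str, limit: int = 5):
    """Send the request and decode the response.

    Requires the API server to be running; assumes the response is JSON.
    """
    request = build_query_request(query, limit)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

With the server running, `run_query("What genes are associated with breast cancer?")` returns the decoded response. Building the request separately from sending it makes the payload easy to inspect without a live backend.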
Search Request Example:

```bash
curl -X POST http://localhost:8765/api/graphrag-query \
  -H "Content-Type: application/json" \
  -d '{"query": "What genes are associated with breast cancer?", "limit": 5}'
```

The frontend is maintained in a separate repository: biomedical-graphrag-frontend
The quickest way to run it locally:
```bash
# Auto-clones the frontend repo and starts it (requires pnpm)
make run-frontend
```

Or manually:

```bash
git clone https://github.com/thierrypdamiba/biomedical-graphrag-frontend.git
cd biomedical-graphrag-frontend
pnpm install
pnpm dev
```

The frontend connects to the hosted backend at https://biomedical-graphrag-9qqm.onrender.com by default, or you can point it to a local backend via `GRAPHRAG_API_URL=http://localhost:8765`.
Build and run the API server with Docker:
```bash
# Build the image
docker build -t biomedical-graphrag:latest .

# Run the container
docker run --rm -p 8765:8765 --env-file .env biomedical-graphrag:latest

# Test health endpoint
curl http://localhost:8765/health
```
- Make fails immediately with ".env file is missing"
  - Create it with `cp .env.example .env` and fill in required values.
- Qdrant ingestion/query fails
  - Confirm Qdrant is running and `QDRANT__URL` points to it.
  - If you changed the embedding model, recreate the collection.
- Hybrid queries return errors about Neo4j
  - Make sure Neo4j is running and the credentials in `NEO4J__*` are correct.
  - Ensure the graph exists (`make create-graph`) before asking graph enrichment questions.
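Several of the issues above come down to a missing or incomplete `.env` file. As a rough preflight check, the sketch below parses `KEY=VALUE` lines and reports which keys are absent or empty. The choice of `REQUIRED_KEYS` is an assumption made for illustration (the project's own validation may require a different set), and `parse_env` and `missing_keys` are hypothetical helpers that handle only plain unquoted values.

```python
# Assumed minimum set of settings; adjust to what your setup actually needs.
REQUIRED_KEYS = (
    "OPENAI__API_KEY",
    "NEO4J__URI",
    "NEO4J__USERNAME",
    "NEO4J__PASSWORD",
    "QDRANT__URL",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first "=" only
        values[key.strip()] = value.strip()
    return values


def missing_keys(text: str, required: tuple[str, ...] = REQUIRED_KEYS) -> list[str]:
    """Return required keys that are absent or set to an empty value."""
    env = parse_env(text)
    return [key for key in required if not env.get(key)]
```

A typical call would be `missing_keys(pathlib.Path(".env").read_text())`; an empty list means every key in the assumed set has a value.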
Run all tests:

```bash
make tests
```

Run all quality checks (lint, format, type check, clean):

```bash
make all-check
make all-fix
```

Individual Commands:

- Display all available commands:

  ```bash
  make help
  ```

- Check code static typing:

  ```bash
  make mypy
  ```

- Clean cache and build files:

  ```bash
  make clean
  ```
This project is licensed under the MIT License - see the LICENSE file for details.
