Biomedical GraphRAG

Overview

A biomedical context engineering system. An agent uses Qdrant vector search engine tools (hybrid retrieval, recommendations) and Neo4j graph database tools (graph enrichment) to gather context, then fuses it into a single biomedical answer.

Originally forked from benitomartin/biomedical-graphrag.

References:

Video: PubMed Navigator
Article: Building a Biomedical GraphRAG: When Knowledge Graphs Meet Vector Search

Key Features:

Context Engineering: Agent orchestrates Qdrant and Neo4j tools, fusing results into a single answer
Qdrant Vector Search Engine: Hybrid retrieval (dense + BM25 with reranking) and constraint-based recommendations
Neo4j Graph Database: Graph enrichment via ontology-based tools (collaborator networks, MeSH relations, gene co-mentions)
Data Integration: Processes PubMed papers, gene data, and research citations
Biomedical Schema: Specialized graph schema for papers, authors, institutions, genes, and MeSH terms
Async Processing: High-performance async data collection and processing

Project Structure

biomedical-graphrag/
├── .github/                    # GitHub workflows and templates
├── data/                       # Dataset storage (PubMed, Gene data)
├── src/
│   └── biomedical_graphrag/
│       ├── api/                # FastAPI server
│       │   └── server.py       # PubMed Navigator API endpoints
│       ├── application/        # Application layer
│       │   ├── cli/            # Command-line interfaces
│       │   └── services/       # Business logic services
│       ├── config.py           # Configuration management
│       ├── data_sources/       # Data collection modules
│       ├── domain/             # Domain models and entities
│       ├── infrastructure/     # Database and external service adapters
│       └── utils/              # Utility functions
├── static/                     # Static assets (images, etc.)
├── tests/                      # Test suite
├── Dockerfile                  # Docker build configuration
├── LICENSE                     # MIT License
├── Makefile                    # Build and development commands
├── pyproject.toml              # Project configuration and dependencies
├── README.md                   # This file
└── uv.lock                     # Dependency lock file

Prerequisites

| Requirement | Description |
| --- | --- |
| Python 3.13+ | Programming language |
| uv | Package and dependency manager |
| Neo4j | Graph database for knowledge graphs |
| Qdrant | Vector search engine for embeddings |
| OpenAI | LLM provider for queries and embeddings |
| PubMed | Biomedical literature database |

Installation

Clone the repository:

git clone git@github.com:thierrypdamiba/biomedical-graphrag.git
cd biomedical-graphrag

Create a virtual environment:
```
uv venv
```
Activate the virtual environment:
```
source .venv/bin/activate
```
Install the required packages:

uv sync --all-groups --all-extras


Create a `.env` file in the root directory:

```bash
cp .env.example .env
```

Usage

Configuration

Configure API keys, model names, and other settings by editing the .env file:

# OpenAI Configuration
OPENAI__API_KEY=your_openai_api_key_here
OPENAI__MODEL=gpt-4o-mini
OPENAI__TEMPERATURE=0.0
OPENAI__MAX_TOKENS=1500

# Neo4j Configuration
NEO4J__URI=bolt://localhost:7687
NEO4J__USERNAME=neo4j
NEO4J__PASSWORD=your_neo4j_password
NEO4J__DATABASE=neo4j

# Qdrant Configuration
QDRANT__URL=http://localhost:6333
QDRANT__API_KEY=your_qdrant_api_key
QDRANT__COLLECTION_NAME=biomedical_papers
QDRANT__EMBEDDING_MODEL=text-embedding-3-large
QDRANT__EMBEDDING_DIMENSION=1536
QDRANT__RERANKER_EMBEDDING_DIMENSION=3072
QDRANT__ESTIMATE_BM25_AVG_LEN_ON_X_DOCS=300
QDRANT__CLOUD_INFERENCE=false

# PubMed Configuration (optional)
PUBMED__EMAIL=your_email@example.com
PUBMED__API_KEY=your_pubmed_api_key

# Data Paths
JSON_DATA__PUBMED_JSON_PATH=data/pubmed_dataset.json
JSON_DATA__GENE_JSON_PATH=data/gene_dataset.json
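
The double-underscore names read as nested settings groups (`OPENAI__MODEL` as group `OPENAI`, field `MODEL`). How `config.py` actually loads them is not shown here, so the following is only a minimal standard-library sketch of that mapping. `parse_env` is a hypothetical helper, not part of the project.

```python
def parse_env(text: str, delimiter: str = "__") -> dict[str, dict[str, str]]:
    """Group KEY=VALUE lines into {section: {field: value}} by splitting on the delimiter."""
    settings: dict[str, dict[str, str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments such as "# OpenAI Configuration".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep or delimiter not in key:
            continue  # not a SECTION__FIELD=value line
        section, field = key.strip().split(delimiter, 1)
        settings.setdefault(section.lower(), {})[field.lower()] = value.strip()
    return settings


example = """
# Qdrant Configuration
QDRANT__URL=http://localhost:6333
QDRANT__EMBEDDING_DIMENSION=1536
NEO4J__URI=bolt://localhost:7687
"""

config = parse_env(example)
print(config["qdrant"]["url"])
```

Values stay as strings in this sketch; a real settings loader would typically convert fields such as the embedding dimension to integers.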

Data Collection

The system includes data collectors for biomedical and gene datasets:

# Collect PubMed papers and metadata
make pubmed-data-collector-run

# Override defaults (optional)
make pubmed-data-collector-run QUERY="cancer immunotherapy" MAX_RESULTS=200

# Collect gene information related to the pubmed dataset
make gene-data-collector-run

# Enrich PubMed dataset with related papers
make enrich-pubmed-dataset

# Skip the first N papers as sources (optional)
make enrich-pubmed-dataset START_INDEX=1000
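
The Overview describes data collection as async. The collectors' code is not reproduced in this README, so the sketch below only illustrates the general pattern of bounded concurrent fetching with `asyncio`. `fetch_record`, `collect`, the `max_concurrency` value, and the placeholder PMIDs are all assumptions for illustration.

```python
import asyncio


async def fetch_record(pmid: str) -> dict[str, str]:
    """Stand-in for a network call; a real collector would query PubMed here."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return {"pmid": pmid, "title": f"placeholder title for {pmid}"}


async def collect(pmids: list[str], max_concurrency: int = 3) -> list[dict[str, str]]:
    """Fetch all records concurrently, never running more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(pmid: str) -> dict[str, str]:
        async with semaphore:
            return await fetch_record(pmid)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(p) for p in pmids))


records = asyncio.run(collect(["1", "2", "3", "4", "5"]))
print(len(records))
```

The semaphore caps how many requests are in flight at once, which keeps a large collection run from opening hundreds of connections simultaneously.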

Infrastructure Setup

Neo4j Graph Database

# Create the knowledge graph from datasets
make create-graph

# Delete all graph data (clean slate)
make delete-graph

Qdrant Vector Search Engine

# Create vector collection for embeddings
make create-qdrant-collection

# Ingest embeddings into Qdrant
make ingest-qdrant-data

# Delete vector collection
make delete-qdrant-collection

Notes:

Embeddings are built from PubMed paper abstracts.
This project relies on the Matryoshka Representation Learning (MRL) property of OpenAI embeddings, which lets a prefix of the full vector serve as a smaller embedding:
- QDRANT__EMBEDDING_DIMENSION is the prefix dimension used for retrieval (stored in Qdrant as the Dense vector).
- QDRANT__RERANKER_EMBEDDING_DIMENSION is the (larger) prefix dimension used for reranking (stored in Qdrant as the Reranker vector).
make ingest-qdrant-data currently recreates the collection on each run (see qdrant_ingestion.py). To keep the existing collection, change recreate=True to recreate=False. The only_new parameter defaults to True, so only papers whose PMID is not already present in the collection are ingested. Set only_new=False to overwrite existing points; it should also be False for a clean ingestion.
The collection is configured by default with scalar quantization (compressed dense vectors).
QDRANT__CLOUD_INFERENCE=true enables Qdrant Cloud Inference, in which case embeddings are computed by Qdrant Cloud.
QDRANT__ESTIMATE_BM25_AVG_LEN_ON_X_DOCS controls how many documents are sampled to estimate the average abstract length used by BM25. This helps calibrate BM25-based scoring when using dense+BM25 hybrid retrieval.
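
To make the MRL note above concrete: with such embeddings, a shorter vector can be obtained by keeping a prefix of the longer one and rescaling it to unit length. Whether this project truncates client-side or requests the shorter vector from the API (for example via the embeddings `dimensions` parameter) is not stated here, so treat this as an illustrative sketch. `truncate_embedding` is a hypothetical helper and the four-element vector is toy data.

```python
import math


def truncate_embedding(vector: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    prefix = vector[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # nothing to normalize
    return [x / norm for x in prefix]


full = [0.5, 0.5, 0.5, 0.5]          # toy stand-in for the larger reranker vector
dense = truncate_embedding(full, 2)  # toy stand-in for the smaller dense vector
print(dense)
```

With the default settings, the same idea would apply to a 3072-dimension reranker vector and its 1536-dimension prefix.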

Query Commands

Hybrid Neo4j + Qdrant Queries

# Run a custom natural language query
make custom-graph-query QUESTION="What are the latest research trends in cancer immunotherapy?"

# Or run directly with the CLI (positional args)
uv run src/biomedical_graphrag/application/cli/fusion_query.py "What are the latest research trends in cancer immunotherapy?"

The hybrid query system combines the vector search engine (Qdrant) with graph enrichment (Neo4j):

Author collaboration networks
Citation analysis and paper relationships
Gene-paper associations
MeSH term relationships
Institution affiliations

LLM-powered tool selection & fusion:

Runs one Qdrant tool, either hybrid retrieval (BM25 + dense, with reranking) or recommendations with constraints, to fetch relevant papers.
Calls Neo4j enrichment tools for graph evidence.
Produces one fused answer from both sources.

Output:

The hybrid CLI prints a final section titled "=== Unified Biomedical Answer ===".
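
The three steps above can be sketched as a small pipeline. Every function here is a hypothetical stub with made-up return values, not the project's tool code. In the real system an LLM chooses the Qdrant tool; here a boolean flag stands in for that decision.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    papers: list[str] = field(default_factory=list)       # from the Qdrant step
    graph_facts: list[str] = field(default_factory=list)  # from the Neo4j step


def qdrant_hybrid_search(question: str) -> list[str]:
    """Hypothetical stub for hybrid retrieval; returns paper identifiers."""
    return ["paper-A", "paper-B"]


def qdrant_recommend(question: str) -> list[str]:
    """Hypothetical stub for constraint-based recommendations."""
    return ["paper-C"]


def neo4j_enrich(papers: list[str]) -> list[str]:
    """Hypothetical stub for graph enrichment of the retrieved papers."""
    return [f"graph evidence for {p}" for p in papers]


def answer(question: str, use_recommendations: bool = False) -> str:
    # Step 1: exactly one Qdrant tool fetches candidate papers.
    tool = qdrant_recommend if use_recommendations else qdrant_hybrid_search
    ctx = Context(papers=tool(question))
    # Step 2: Neo4j tools add graph evidence for those papers.
    ctx.graph_facts = neo4j_enrich(ctx.papers)
    # Step 3: fuse both sources into one answer (a real system would ask the LLM).
    body = "\n".join(ctx.papers + ctx.graph_facts)
    return f"=== Unified Biomedical Answer ===\n{body}"


print(answer("Who collaborates with Jennifer Doudna on CRISPR research?"))
```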

Sample Queries

Who collaborates with Jennifer Doudna on CRISPR research?
Which genes are frequently co-mentioned with TP53?
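
As an illustration of what a co-mention question asks for, the sketch below counts which genes appear in the same papers as a target gene. The project answers such questions through Neo4j graph tools; this plain-Python version with made-up toy data only shows the underlying counting logic, and `co_mentions` is a hypothetical helper.

```python
from collections import Counter

# Made-up toy data: paper id -> genes mentioned in that paper.
paper_genes = {
    "p1": {"TP53", "MDM2", "BRCA1"},
    "p2": {"TP53", "MDM2"},
    "p3": {"BRCA1", "BRCA2"},
    "p4": {"TP53", "BRCA1"},
}


def co_mentions(target: str, mentions: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Count how often each other gene appears in the same paper as `target`."""
    counts = Counter()
    for genes in mentions.values():
        if target in genes:
            counts.update(genes - {target})
    # Sort by count (descending), then by name, so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(co_mentions("TP53", paper_genes))
```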

API Server

The project includes a FastAPI server (PubMed Navigator) that exposes the context engineering pipeline via HTTP endpoints:

# Start the API server (runs on port 8765)
make run-api

Available Endpoints:

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | `/health` | Health check |
| GET | `/api/neo4j/stats` | Neo4j graph statistics (node/relationship counts) |
| POST | `/api/graphrag-query` | Context engineering search (Qdrant + Neo4j) |

Search Request Example:

curl -X POST http://localhost:8765/api/graphrag-query \
  -H "Content-Type: application/json" \
  -d '{"query": "What genes are associated with breast cancer?", "limit": 5}'
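
The same request can be issued from Python with only the standard library. This is a minimal sketch: the URL and payload fields come from the curl example, while the response is assumed to be a JSON document whose fields are not documented here. `build_request` and `search` are hypothetical helper names.

```python
import json
import urllib.request

API_URL = "http://localhost:8765/api/graphrag-query"


def build_request(query: str, limit: int = 5) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    payload = json.dumps({"query": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, limit: int = 5):
    """Send the request and decode the JSON body (assumes the server is running)."""
    with urllib.request.urlopen(build_request(query, limit), timeout=60) as response:
        return json.load(response)


request = build_request("What genes are associated with breast cancer?")
print(request.get_method(), request.full_url)
```

Call `search(...)` once the API server is up; building the request separately makes it easy to inspect what will be sent without touching the network.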

Frontend

The frontend is maintained in a separate repository: biomedical-graphrag-frontend

The quickest way to run it locally:

# Auto-clones the frontend repo and starts it (requires pnpm)
make run-frontend

Or manually:

git clone https://github.com/thierrypdamiba/biomedical-graphrag-frontend.git
cd biomedical-graphrag-frontend
pnpm install
pnpm dev

The frontend connects to the hosted backend at https://biomedical-graphrag-9qqm.onrender.com by default, or you can point it to a local backend via GRAPHRAG_API_URL=http://localhost:8765.

Docker

Build and run the API server with Docker:

# Build the image
docker build -t biomedical-graphrag:latest .

# Run the container
docker run --rm -p 8765:8765 --env-file .env biomedical-graphrag:latest

# Test health endpoint
curl http://localhost:8765/health
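
A container can take a moment to start, so a script may want to poll the health endpoint rather than call it once. The sketch below is one way to do that. It assumes an HTTP 200 response means the server is healthy, and `http_probe` and `wait_until_healthy` are hypothetical helpers.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(
    url: str = "http://localhost:8765/health",
    attempts: int = 10,
    delay_seconds: float = 1.0,
    probe: Callable[[str], bool] = http_probe,
) -> bool:
    """Poll the health endpoint until it responds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # wait before the next try
    return False
```

The `probe` parameter exists so the polling logic can be exercised without a running server.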

Troubleshooting

Make fails immediately with ".env file is missing"
- Create it with cp .env.example .env and fill in the required values.
Qdrant ingestion/query fails
- Confirm Qdrant is running and QDRANT__URL points to it.
- If you changed the embedding model, recreate the collection.
Hybrid queries return errors about Neo4j
- Make sure Neo4j is running and credentials in NEO4J__* are correct.
- Ensure the graph exists (make create-graph) before asking graph enrichment questions.
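
Several of these problems come down to a missing or incomplete `.env`. A small preflight check can catch that before running any make target. This is a sketch under assumptions: the list of required keys is a guess at a minimum for a local run (compare it with `.env.example`), and `missing_keys` and `check_env_file` are hypothetical helpers.

```python
from pathlib import Path

# Assumed minimum for a local run; adjust to match .env.example.
REQUIRED_KEYS = ("OPENAI__API_KEY", "NEO4J__URI", "NEO4J__PASSWORD", "QDRANT__URL")


def missing_keys(env_text: str, required: tuple[str, ...] = REQUIRED_KEYS) -> list[str]:
    """Return required keys that are absent or empty in the given .env content."""
    present = {}
    for raw in env_text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            present[key.strip()] = value.strip()
    return [key for key in required if not present.get(key)]


def check_env_file(path: str = ".env") -> list[str]:
    """Report every required key as missing when the file itself does not exist."""
    env_path = Path(path)
    if not env_path.is_file():
        return list(REQUIRED_KEYS)
    return missing_keys(env_path.read_text(encoding="utf-8"))


print(missing_keys("OPENAI__API_KEY=sk-test\nQDRANT__URL=\n"))
```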

Testing

Run all tests:

make tests

Quality Checks

Run all quality checks (lint, format, type check, clean):

make all-check
make all-fix

Individual Commands:

Display all available commands:
```
make help
```
Run static type checks:
```
make mypy
```
Clean cache and build files:
```
make clean
```

License

This project is licensed under the MIT License - see the LICENSE file for details.
