Build a biomedical research assistant that answers questions over ~27k PubMed papers using Qdrant vector search engine tooling and LLM-based (OpenAI) tool routing.
This workshop is a simplified version of the PubMed Navigator project, covering its vector search component. The goal is to learn the basics of context engineering with Qdrant: best practices, available features, and choices that ensure scalability.
- Designing LLM (OpenAI) tool definitions with clear routing signals
- Writing routing prompts based on best prompt engineering practices
Infrastructure and scalability:
- Cloud Inference through external providers (OpenAI): offload embedding inference to the Qdrant server side to save on latency.
- Efficient batch upload with the Python client's `upload_points`.
- Conditional uploads/updates to save on latency when broadening or adapting the dataset.
- Scalar quantization. Compress vectors used for retrieval, reducing memory usage ~4x. Quantized vectors stay in RAM for fast search; originals live on disk for rescoring.
Capabilities important in high-precision domains:
- Hybrid retrieval. Combine dense semantic search with keyword-based search in one call. Keyword search in Qdrant is implemented through sparse vectors. Dense vectors catch meaning; BM25 formula-based sparse vectors catch exact matches (gene names, drug names) that dense embeddings might miss.
- Multistage retrieval. Use prefetch functionality to retrieve candidates cheaply, then rescore and rerank for precision. In this workshop we'll set up multistage retrieval based on the Matryoshka Representation Learning (MRL) feature of OpenAI embeddings.
Capabilities for meaningful context engineering:
- Recommendation API. Discover papers based on positive/negative constraints in the user's query. Qdrant computes a target vector close to positives and far from negatives, using vector arithmetic.
- Python 3.13+
- uv package manager
- ~200 MB of free space: the dataset in the `data` folder is stored compressed on disk (~42 MB), but is decompressed in memory (~163 MB) during ingestion
- A Qdrant Cloud Free (Forever) Tier cluster. The UI will guide you through creating a cluster and obtaining your endpoint URL and API key. No credit card needed. Video walkthrough.
- An OpenAI API key. You'll need it for both embedding inference (`text-embedding-3-small`) and LLM-based tool routing (`gpt-4o-mini`).
Embedding all 26,788 abstracts from `data/pubmed_dataset.json.gz` with `text-embedding-3-small` costs approximately $0.15. Search queries use `gpt-4o-mini` for tool routing and summarization, costing fractions of a cent per query. Expect around $0.17 for running the entire pipeline.
- Install dependencies: `uv sync`
- Copy the environment template and fill in your keys: `cp .env.example .env`
- For the Qdrant Cloud API key & cluster URL, the UI guides you through obtaining them after creating a cluster. Video walkthrough.
- For the OpenAI API key, obtain it here
```shell
QDRANT_URL=https://your-cluster.cloud.qdrant.io
QDRANT_API_KEY=your-qdrant-api-key
QDRANT_COLLECTION_NAME=pubmed_papers
OPENAI_API_KEY=your-openai-api-key
OPENAI_MODEL=gpt-4o-mini
PUBMED_JSON_PATH=data/pubmed_dataset.json.gz
```

You're all set. The dataset is already included at `data/pubmed_dataset.json.gz`.
The included dataset contains 26,788 PubMed papers, each with:
- PMID (unique PubMed identifier)
- Title and abstract
- Authors with affiliations
- MeSH terms (Medical Subject Headings — the controlled vocabulary used by NLM to index articles)
- Journal, publication date, and DOI
pubmed-navigator-workshop/
├── data/
│ └── pubmed_dataset.json.gz # PubMed papers dataset (compressed)
├── workshop/
│ ├── config.py # Environment variables
│ ├── cli.py # CLI entry point
│ ├── infrastructure/
│ │ ├── search_engine.py # Qdrant infrastructure
│ │ └── ingestion.py # Data loading & ingestion orchestration
│ └── context_engineering/
│ ├── prompts.py # LLM prompt templates
│ ├── tools.py # Tool definitions for LLM routing
│ ├── search_engine_query.py # Qdrant tools execution
│ └── context.py # Orchestration pipeline
├── .env.example # Environment template
├── Makefile # Commands
└── pyproject.toml
We'll create a Qdrant collection and populate it with PubMed papers.
The paper abstracts get embedded and indexed in Qdrant with OpenAI's text-embedding-3-small. All other fields will be stored as payload metadata alongside the vectors.
```shell
make create-qdrant-collection
```

This creates a configured, empty collection (named as set in `.env`) with three named vectors — Dense, Reranker, and Lexical:
| Vector | Type | Dimensions | Role |
|---|---|---|---|
| Dense | dense, `text-embedding-3-small` | 1024 | Semantic retrieval with scalar quantization. Quantized vectors are used for retrieval; stored in RAM, originals on disk. |
| Reranker | dense, `text-embedding-3-small` | 1536 | Reranking only. Stored on disk; no vector index is built for it. |
| Lexical | sparse | varies | BM25 keyword matching. |
All collection configuration lives in `search_engine.py`.
The Dense vectors are configured to be quantized using scalar quantization. Only the lighter quantized vectors are kept in RAM (`always_ram=True`). Each float32 in the original vector is compressed to 8 bits, reducing memory usage ~4x. The original vectors stay on disk for rescoring at query time, so search accuracy is preserved despite compression.
Scalar quantization is a safe default choice. Other quantization methods are also available; see our documentation on quantization in Qdrant. It's important to keep quantization in mind early, at configuration time: at scale (well beyond 30k papers), it saves significantly on latency and RAM costs.
Experiment: The `quantile` parameter (default: `0.99`) in `search_engine.py` → `create_collection()` controls how aggressively outlier values are clipped during quantization.
Qdrant supports sparse vectors alongside dense vectors. This lets us add BM25-based retrieval, well known in information retrieval, to the search pipeline.
In the biomedical domain, keyword matching matters: specific terms like gene names ("TP53") or drug names ("metformin") carry precise meaning that dense embeddings sometimes dilute.
The `Lexical` vector is configured with `modifier=IDF`, which enables server-side inverse document frequency weighting from the BM25 formula. This is set in `search_engine.py` → `create_collection()` via `SparseVectorParams`.
```shell
make ingest-data-to-qdrant
```

To recreate the collection from scratch before ingestion:

```shell
make ingest-data-to-qdrant RECREATE=1
```

With Cloud Inference (`CLOUD_INFERENCE = True` in `search_engine.py`), embedding happens server-side: the client sends raw text, and Qdrant calls OpenAI.
With it enabled, ingestion should take around 5 minutes. You can speed it up further by tuning the parameters of `upload_points` in `search_engine.py`.
Without Cloud Inference, ingestion will take longer.
Both dense vectors (Dense and Reranker) are derived from `text-embedding-3-small` embeddings. This model supports Matryoshka Representation Learning (MRL) (OpenAI MRL), a training approach that front-loads the most important information into the earliest dimensions of an embedding. As a result, a single embedding can be truncated to any shorter prefix and still be used meaningfully.
We use two truncation levels from one model:
- 1024 dims → Dense, fast retrieval with lower memory footprint.
- 1536 dims → Reranker, higher precision for rescoring retrieved candidates.
MRL truncation requests for the same text in Qdrant Cloud Inference are deduplicated into one API call, to save costs and latency.
Experiment: You can change `OPENAI_RETRIEVER_EMBEDDING_DIMENSION` and `OPENAI_RERANKER_EMBEDDING_DIMENSION` in `search_engine.py` to see how dimensionality affects resource usage and retrieval quality. 1536 is the maximum for `text-embedding-3-small`. You can switch `OPENAI_EMBEDDING_MODEL` to `text-embedding-3-large` for higher dimensions, but it is significantly more expensive.
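Since MRL prefixes are themselves valid embeddings, truncation is just slicing plus re-normalization. Below is a hypothetical helper illustrating the mechanics; in the workshop, Cloud Inference requests the desired dimensionality server-side instead.

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` MRL dimensions, then re-normalize to unit length
    so cosine similarity on the truncated vector stays well-behaved."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# One full 1536-dim embedding could yield both named vectors:
#   Dense    -> truncate_embedding(full_vec, 1024)
#   Reranker -> full_vec
```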
Efficient batch upload. Papers are streamed via a generator into `upload_points` in batches (default: 32). For lazy batching, automatic retries, and parallelism, see the `batch_size`, `parallel`, and `max_retries` parameters in `search_engine.py` → `upsert_points()`.
Conditional uploads. By default, points are overwritten during upsert if a point with the same ID already exists in the collection. Pass `ONLY_NEW=1` to switch to `INSERT_ONLY` mode, which skips existing points and saves on latency: useful if you want to expand the dataset efficiently or resume an upload that was interrupted midway. Read more on Conditional Upserts/Updates here.
Note: `ONLY_NEW=1` only speeds up the Qdrant upsert step. Embeddings are still computed for every paper, so it does not reduce inference costs.
BM25 sparse vector inference. Qdrant supports BM25-based sparse vector inference on the server side.
BM25 usage in Qdrant requires an `avg_len` parameter, the average document length used in the formula. We estimate this by sampling the first 300 abstracts (see `_estimate_avg_abstract_len()` in `search_engine.py`). You can adjust `ESTIMATE_BM25_AVG_LEN_ON_X_DOCS` if needed.
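A sketch of that estimation, assuming the dataset is a gzipped JSON array of papers with an `abstract` field (the workshop's `_estimate_avg_abstract_len()` may differ in details):

```python
import gzip
import json

ESTIMATE_BM25_AVG_LEN_ON_X_DOCS = 300  # sample size, matching the workshop default

def estimate_avg_abstract_len(path: str,
                              sample_size: int = ESTIMATE_BM25_AVG_LEN_ON_X_DOCS) -> float:
    """Mean whitespace-token count over the first `sample_size` abstracts,
    used as the avg_len (average document length) term of the BM25 formula."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        papers = json.load(f)
    sample = papers[:sample_size]
    return sum(len(p["abstract"].split()) for p in sample) / len(sample)
```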
Open your Qdrant Cluster Dashboard (Cluster UI). You should see your collection populated with all papers. Read more about Qdrant WebUI.
Query the collection using a context engineering pipeline that routes natural language questions to the right Qdrant tool via an LLM agent.
Two phases (see `context.py`):
- Tool routing — the LLM reads the question and calls one of two Qdrant tools:
  - `retrieve_papers_based_on_query` — hybrid search (dense + lexical) fused by reranking.
  - `recommend_papers_based_on_constraints` — based on the Recommendation API, for queries with negative constraints ("papers about X but not Y").
- Summarization — the LLM summarizes retrieved papers into key findings.
`retrieve_papers_based_on_query` (see `search_engine_query.py`) runs a single Qdrant query that combines hybrid retrieval with multistage reranking via prefetch:
- Prefetch (hybrid) — two parallel candidate retrievals:
  - Dense semantic search on the 1024-dim vector, with quantization `oversampling` + `rescore` to keep precision high (read more on searching with quantization)
  - BM25 keyword search on the `Lexical` sparse vector
- Rerank (multistage) — fused candidates are rescored using the full 1536-dim `Reranker` vector.
`recommend_papers_based_on_constraints` (see `search_engine_query.py`) uses Qdrant's Recommendation API with the `AVERAGE_VECTOR` strategy. Qdrant finds points close to the average of the positive vectors and far from the average of the negative vectors.
Tool definitions (`tools.py`) — each definition tells the LLM what the tool does, when to use it, and how to construct arguments.
See best practices: Anthropic "Best practices for tool definitions" | OpenAI function calling tools
Tool calling (`context.py`) with key parameters of the OpenAI Responses API call:
- `instructions` — system prompt with routing rules and few-shot examples. Separated from `input` (the user's question) so the LLM has a clear boundary between instructions and user input.
- `tool_choice="required"` — the LLM must call a tool.
- `parallel_tool_calls=False` — no more than one tool call per response.
Experiment: You can change `OPENAI_MODEL` in `.env` to try different models (e.g., `gpt-4o`) and see how tool routing accuracy and summary quality change.
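The routing call can be sketched against the Responses API as follows. The tool schemas here are abbreviated, hypothetical stand-ins for the real definitions in `tools.py`, and `route_question` is defined but not invoked, since calling it requires a live `OPENAI_API_KEY`.

```python
# Abbreviated, hypothetical tool schemas in the flat Responses API format.
tools = [
    {
        "type": "function",
        "name": "retrieve_papers_based_on_query",
        "description": "Hybrid dense + keyword search for coherent topical questions.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "type": "function",
        "name": "recommend_papers_based_on_constraints",
        "description": "Recommendation search for queries with negative constraints.",
        "parameters": {
            "type": "object",
            "properties": {
                "positive": {"type": "array", "items": {"type": "string"}},
                "negative": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["positive", "negative"],
        },
    },
]

def route_question(question: str, instructions: str) -> str:
    """Return the name of the tool the LLM chose for this question."""
    from openai import OpenAI  # imported lazily; requires OPENAI_API_KEY

    response = OpenAI().responses.create(
        model="gpt-4o-mini",
        instructions=instructions,   # routing rules + few-shot examples
        input=question,              # kept separate from the instructions
        tools=tools,
        tool_choice="required",      # the model must call a tool
        parallel_tool_calls=False,   # at most one tool call per response
    )
    call = next(item for item in response.output if item.type == "function_call")
    return call.name
```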
Routing prompt with few-shot examples (prompts.py): three examples covering the key scenarios:
- Coherent query, no negative constraints → retrieval (shows filler removal, term preservation)
- One positive topic, two independent negative constraints → recommendation
- Two separable positive constraints, one negative one → recommendation
Each example includes a brief explanation so the LLM learns the reasoning, not just the pattern.
See best practices: Anthropic "Prompting best practices" | OpenAI "Prompt engineering"
```shell
make context-engineering-qdrant QUESTION="your question here"
```

Control how many papers Qdrant returns (default: 5):

```shell
make context-engineering-qdrant QUESTION="your question here" LIMIT=10
```

Watch the trace output to see which tool gets called and how arguments are constructed.
These should trigger `retrieve_papers_based_on_query`:

```shell
make context-engineering-qdrant QUESTION="What are the mechanisms of CRISPR-Cas9 delivery using lipid nanoparticles for in vivo genome editing?"

# Smaller limit — see how the summary changes
make context-engineering-qdrant QUESTION="What is the role of autophagy in neurodegenerative diseases?" LIMIT=3
```

These should trigger `recommend_papers_based_on_constraints`:

```shell
make context-engineering-qdrant QUESTION="Find papers on photodynamic therapy for cancer treatment, but not studies focused on skin cancer."

# One positive, two independent negative constraints (should split into separate negative examples)
make context-engineering-qdrant QUESTION="Papers about TP53 mutations in cancer prognosis, but not animal studies and nothing like generic overviews of tumor suppressors."

# Two separable positives + exclusion (positives should split too)
make context-engineering-qdrant QUESTION="Research on BRCA1 DNA repair mechanisms and PD-L1 immunotherapy response biomarkers, excluding pediatric populations."
```

Things to experiment with:
- Toggle `CLOUD_INFERENCE` in `search_engine.py` to compare server-side vs. client-side embedding.
- Try different `batch_size` and `parallel` values in `search_engine.py` → `upsert_points()` to optimize ingestion speed.
- Change embedding dimensions (`OPENAI_RETRIEVER_EMBEDDING_DIMENSION`, `OPENAI_RERANKER_EMBEDDING_DIMENSION` in `search_engine.py`) and observe how retrieval quality and resource usage change.
- Tune quantization search parameters (`oversampling`, `rescore`) in `search_engine_query.py` → `retrieve_papers_based_on_query()`.
- Change the `limit` on individual prefetch stages in `search_engine_query.py` to control how many candidates each retrieval method (dense vs. BM25) contributes before fusion and reranking.
- Adjust `ESTIMATE_BM25_AVG_LEN_ON_X_DOCS` or the default `avg_len` in `search_engine.py` to see how BM25 scoring changes with different average document length estimates.
- Adjust `LIMIT` to see how the number of retrieved papers affects summarization.
- Modify the prompt in `prompts.py` or tool descriptions in `tools.py` and see how the agent's behavior changes.
Check further resources for context engineering with Agents and Qdrant: