Build a biomedical research assistant that answers questions over ~27k PubMed papers using Qdrant vector search engine tooling and LLM-based (OpenAI) tool routing.
This workshop is a simplified version of the PubMed Navigator project, covering its vector search component. The goal is to learn the basics of context engineering with Qdrant: best practices, available features, and choices that ensure scalability.
- Designing LLM (OpenAI) tool definitions with clear routing signals
- Writing routing prompts based on best prompt engineering practices
Infrastructure and scalability:
- Cloud Inference through external providers (OpenAI): offload embedding inference to the Qdrant server side to save on latency.
- Efficient batch upload with the Python client's `upload_points`.
- Conditional uploads/updates to save on latency when broadening or adapting the dataset.
- Scalar quantization. Compress vectors used for retrieval, reducing memory usage ~4x. Quantized vectors stay in RAM for fast search; originals live on disk for rescoring.
Capabilities important in high-precision domains:
- Hybrid retrieval. Combine dense semantic search with keyword-based search in one call. Keyword search in Qdrant is implemented through sparse vectors. Dense vectors catch meaning; BM25 formula-based sparse vectors catch exact matches (gene names, drug names) that dense embeddings might miss.
- Multistage retrieval. Use prefetch functionality to retrieve candidates cheaply, then rescore and rerank for precision. In this workshop we'll set up multistage retrieval based on the Matryoshka Representation Learning (MRL) feature of OpenAI embeddings.
Capabilities for meaningful context engineering:
- Recommendation API. Discover papers based on positive/negative constraints in the user's query. Qdrant computes a target vector close to positives and far from negatives, using vector arithmetic.
- Python 3.13+
- uv package manager
- ~200 MB of free space: the dataset in the `data` folder is stored compressed on disk (~42 MB), but is decompressed in memory (~163 MB) during ingestion
- A Qdrant Cloud Free (Forever) Tier cluster. The UI will guide you through creating a cluster and obtaining your endpoint URL and API key. No credit card needed. Video walkthrough.
- An OpenAI API key. You'll need it for both embedding inference (`text-embedding-3-small`) and LLM-based tool routing (`gpt-4o-mini`).
Embedding all 26,788 abstracts from `data/pubmed_dataset.json.gz` with `text-embedding-3-small` costs approximately $0.15. Search queries use `gpt-4o-mini` for tool routing and summarization, costing fractions of a cent per query. Expect around $0.17 for running the entire pipeline.
- Install dependencies: `uv sync`
- Copy the environment template and fill in your keys: `cp .env.example .env`
- For the Qdrant Cloud API key & cluster URL, the UI guides you through obtaining them after creating a cluster. Video walkthrough.
- For the OpenAI API key, obtain it here
```shell
QDRANT_URL=https://your-cluster.cloud.qdrant.io
QDRANT_API_KEY=your-qdrant-api-key
QDRANT_COLLECTION_NAME=pubmed_papers
OPENAI_API_KEY=your-openai-api-key
OPENAI_MODEL=gpt-4o-mini
PUBMED_JSON_PATH=data/pubmed_dataset.json.gz
```

You're all set. The dataset is already included at `data/pubmed_dataset.json.gz`.
The included dataset contains 26,788 PubMed papers, each with:
- PMID (unique PubMed identifier)
- Title and abstract
- Authors with affiliations
- MeSH terms (Medical Subject Headings — the controlled vocabulary used by NLM to index articles)
- Journal, publication date, and DOI
pubmed-navigator-workshop/
├── data/
│ └── pubmed_dataset.json.gz # PubMed papers dataset (compressed)
├── workshop/
│ ├── config.py # Environment variables
│ ├── cli.py # CLI entry point
│ ├── infrastructure/
│ │ ├── search_engine.py # Qdrant infrastructure
│ │ └── ingestion.py # Data loading & ingestion orchestration
│ └── context_engineering/
│ ├── prompts.py # LLM prompt templates
│ ├── tools.py # Tool definitions for LLM routing
│ ├── search_engine_query.py # Qdrant tools execution
│ └── context.py # Orchestration pipeline
├── .env.example # Environment template
├── Makefile # Commands
└── pyproject.toml
We'll create a Qdrant collection and populate it with PubMed papers.
The paper abstracts get embedded and indexed in Qdrant with OpenAI's text-embedding-3-small. All other fields will be stored as payload metadata alongside the vectors.
```shell
make create-qdrant-collection
```

This creates a configured, empty collection (named as set in `.env`) with three named vectors — Dense, Reranker, and Lexical:
| Vector | Type | Dimensions | Role |
|---|---|---|---|
| Dense | dense, `text-embedding-3-small` | 1024 | Semantic retrieval with scalar quantization. Quantized vectors are used for retrieval; stored in RAM, originals on disk. |
| Reranker | dense, `text-embedding-3-small` | 1536 | Reranking only. Stored on disk; no vector index is built for it. |
| Lexical | sparse | varies | BM25 keyword matching. |
All collection configuration lives in `search_engine.py`.
The Dense vectors are configured to be quantized using scalar quantization. Only the lighter quantized vectors are kept in RAM (`always_ram=True`). Each float32 in the original vector is compressed to 8 bits, reducing memory usage ~4x. The original vectors stay on disk for rescoring at query time, so search accuracy is preserved despite compression.
Scalar quantization is a safe default choice. Other quantization methods are also available; see our documentation on quantization in Qdrant. It's important to keep quantization in mind early, at configuration time: at scale (well beyond 30k papers), it saves significantly on latency and RAM costs.
Experiment: The `quantile` parameter (default: `0.99`) in `search_engine.py` → `create_collection()` controls how aggressively outlier values are clipped during quantization.
Qdrant supports sparse vectors alongside dense vectors. This lets us add BM25-based retrieval, well known in information retrieval, to the search pipeline.
In the biomedical domain, keyword matching matters: specific terms like gene names ("TP53") or drug names ("metformin") carry precise meaning that dense embeddings sometimes dilute.
The `Lexical` vector is configured with `modifier=IDF`, which enables server-side inverse document frequency weighting from the BM25 formula. This is set in `search_engine.py` → `create_collection()` via `SparseVectorParams`.
```shell
make ingest-data-to-qdrant
```

To recreate the collection from scratch before ingestion:

```shell
make ingest-data-to-qdrant RECREATE=1
```

With Cloud Inference (`CLOUD_INFERENCE = True` in `search_engine.py`), embedding happens server-side: the client sends raw text, and Qdrant calls OpenAI.
With it enabled, ingestion should take around 5 minutes. You can speed it up further by tuning the parameters of `upload_points` in `search_engine.py`.
Without Cloud Inference, ingestion will take longer.
Both dense vectors (Dense and Reranker) are derived from `text-embedding-3-small` embeddings. This model supports Matryoshka Representation Learning (MRL) (OpenAI MRL), a training approach that front-loads the most important information into the earliest dimensions of an embedding. As a result, a single embedding can be truncated to any shorter prefix and still be used meaningfully.
We use two truncation levels from one model:
- 1024 dims → Dense, fast retrieval with lower memory footprint.
- 1536 dims → Reranker, higher precision for rescoring retrieved candidates.
MRL truncation requests for the same text in Qdrant Cloud Inference are deduplicated into one API call, to save costs and latency.
Experiment: You can change `OPENAI_RETRIEVER_EMBEDDING_DIMENSION` and `OPENAI_RERANKER_EMBEDDING_DIMENSION` in `search_engine.py` to see how dimensionality affects resource usage and retrieval quality. 1536 is the maximum for `text-embedding-3-small`. You can switch `OPENAI_EMBEDDING_MODEL` to `text-embedding-3-large` for higher dimensions, but it is significantly more expensive.
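Since MRL prefixes are themselves valid embeddings, truncation is just slicing plus re-normalization. Below is a hypothetical helper illustrating the mechanics; in the workshop, Cloud Inference requests the desired dimensionality server-side instead.

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` MRL dimensions, then re-normalize to unit length
    so cosine similarity on the truncated vector stays well-behaved."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# One full 1536-dim embedding could yield both named vectors:
#   Dense    -> truncate_embedding(full_vec, 1024)
#   Reranker -> full_vec
```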
Efficient batch upload. Papers are streamed via a generator into `upload_points` in batches (default: 32). For lazy batching, automatic retries, and parallelism, see the `batch_size`, `parallel`, and `max_retries` parameters in `search_engine.py` → `upsert_points()`.
Conditional uploads. By default, points are overwritten during upsert if a point with the same ID already exists in the collection. Pass `ONLY_NEW=1` to switch to `INSERT_ONLY` mode, which skips existing points and saves on latency: useful if you want to expand the dataset efficiently or resume an upload that was interrupted midway. Read more on Conditional Upserts/Updates here.
Note: `ONLY_NEW=1` only speeds up the Qdrant upsert step. Embeddings are still computed for every paper, so it does not reduce inference costs.
BM25 sparse vector inference. Qdrant supports BM25-based sparse vector inference on the server side.
BM25 usage in Qdrant requires an `avg_len` parameter, the average document length used in the formula. We estimate this by sampling the first 300 abstracts (see `_estimate_avg_abstract_len()` in `search_engine.py`). You can adjust `ESTIMATE_BM25_AVG_LEN_ON_X_DOCS` if needed.
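A sketch of that estimation, assuming the dataset is a gzipped JSON array of papers with an `abstract` field (the workshop's `_estimate_avg_abstract_len()` may differ in details):

```python
import gzip
import json

ESTIMATE_BM25_AVG_LEN_ON_X_DOCS = 300  # sample size, matching the workshop default

def estimate_avg_abstract_len(path: str,
                              sample_size: int = ESTIMATE_BM25_AVG_LEN_ON_X_DOCS) -> float:
    """Mean whitespace-token count over the first `sample_size` abstracts,
    used as the avg_len (average document length) term of the BM25 formula."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        papers = json.load(f)
    sample = papers[:sample_size]
    return sum(len(p["abstract"].split()) for p in sample) / len(sample)
```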
Open your Qdrant Cluster Dashboard (Cluster UI). You should see your collection populated with all papers. Read more about Qdrant WebUI.
Query the collection using a context engineering pipeline that routes natural language questions to the right Qdrant tool via an LLM agent.
Two phases (see `context.py`):
- Tool routing — the LLM reads the question and calls one of two Qdrant tools:
  - `retrieve_papers_based_on_query` — hybrid search (dense + lexical) fused by reranking.
  - `recommend_papers_based_on_constraints` — based on the Recommendation API, for queries with negative constraints ("papers about X but not Y").
- Summarization — the LLM summarizes retrieved papers into key findings.
`retrieve_papers_based_on_query` (see `search_engine_query.py`) runs a single Qdrant query that combines hybrid retrieval with multistage reranking via prefetch:
- Prefetch (hybrid) — two parallel candidate retrievals:
  - Dense semantic search on the 1024-dim vector, with quantization `oversampling` + `rescore` to keep precision high (read more on searching with quantization)
  - BM25 keyword search on the `Lexical` sparse vector
- Rerank (multistage) — fused candidates are rescored using the full 1536-dim `Reranker` vector.
`recommend_papers_based_on_constraints` (see `search_engine_query.py`) uses Qdrant's Recommendation API with the `AVERAGE_VECTOR` strategy. Qdrant finds points close to the average of the positive vectors and far from the average of the negative vectors.
Tool definitions (`tools.py`) — each definition tells the LLM what the tool does, when to use it, and how to construct arguments.
See best practices: Anthropic "Best practices for tool definitions" | OpenAI function calling tools
Tool calling (`context.py`) with key parameters of the OpenAI Responses API call:
- `instructions` — system prompt with routing rules and few-shot examples. Separated from `input` (the user's question) so the LLM has a clear boundary between instructions and user input.
- `tool_choice="required"` — the LLM must call a tool.
- `parallel_tool_calls=False` — no more than one tool call per response.
Experiment: You can change `OPENAI_MODEL` in `.env` to try different models (e.g., `gpt-4o`) and see how tool routing accuracy and summary quality change.
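The routing call can be sketched against the Responses API as follows. The tool schemas here are abbreviated, hypothetical stand-ins for the real definitions in `tools.py`, and `route_question` is defined but not invoked, since calling it requires a live `OPENAI_API_KEY`.

```python
# Abbreviated, hypothetical tool schemas in the flat Responses API format.
tools = [
    {
        "type": "function",
        "name": "retrieve_papers_based_on_query",
        "description": "Hybrid dense + keyword search for coherent topical questions.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "type": "function",
        "name": "recommend_papers_based_on_constraints",
        "description": "Recommendation search for queries with negative constraints.",
        "parameters": {
            "type": "object",
            "properties": {
                "positive": {"type": "array", "items": {"type": "string"}},
                "negative": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["positive", "negative"],
        },
    },
]

def route_question(question: str, instructions: str) -> str:
    """Return the name of the tool the LLM chose for this question."""
    from openai import OpenAI  # imported lazily; requires OPENAI_API_KEY

    response = OpenAI().responses.create(
        model="gpt-4o-mini",
        instructions=instructions,   # routing rules + few-shot examples
        input=question,              # kept separate from the instructions
        tools=tools,
        tool_choice="required",      # the model must call a tool
        parallel_tool_calls=False,   # at most one tool call per response
    )
    call = next(item for item in response.output if item.type == "function_call")
    return call.name
```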
Routing prompt with few-shot examples (prompts.py): three examples covering the key scenarios:
- Coherent query, no negative constraints → retrieval (shows filler removal, term preservation)
- One positive topic, two independent negative constraints → recommendation
- Two separable positive constraints, one negative one → recommendation
Each example includes a brief explanation so the LLM learns the reasoning, not just the pattern.
See best practices: Anthropic "Prompting best practices" | OpenAI "Prompt engineering"
```shell
make context-engineering-qdrant QUESTION="your question here"
```

Control how many papers Qdrant returns (default: 5):

```shell
make context-engineering-qdrant QUESTION="your question here" LIMIT=10
```

Watch the trace output to see which tool gets called and how arguments are constructed.
These should trigger `retrieve_papers_based_on_query`:

```shell
make context-engineering-qdrant QUESTION="What are the mechanisms of CRISPR-Cas9 delivery using lipid nanoparticles for in vivo genome editing?"

# Smaller limit — see how the summary changes
make context-engineering-qdrant QUESTION="What is the role of autophagy in neurodegenerative diseases?" LIMIT=3
```

These should trigger `recommend_papers_based_on_constraints`:

```shell
make context-engineering-qdrant QUESTION="Find papers on photodynamic therapy for cancer treatment, but not studies focused on skin cancer."

# One positive, two independent negative constraints (should split into separate negative examples)
make context-engineering-qdrant QUESTION="Papers about TP53 mutations in cancer prognosis, but not animal studies and nothing like generic overviews of tumor suppressors."

# Two separable positives + exclusion (positives should split too)
make context-engineering-qdrant QUESTION="Research on BRCA1 DNA repair mechanisms and PD-L1 immunotherapy response biomarkers, excluding pediatric populations."
```

Things to experiment with:
- Toggle `CLOUD_INFERENCE` in `search_engine.py` to compare server-side vs. client-side embedding.
- Try different `batch_size` and `parallel` values in `search_engine.py` → `upsert_points()` to optimize ingestion speed.
- Change embedding dimensions (`OPENAI_RETRIEVER_EMBEDDING_DIMENSION`, `OPENAI_RERANKER_EMBEDDING_DIMENSION` in `search_engine.py`) and observe how retrieval quality and resource usage change.
- Tune quantization search parameters (`oversampling`, `rescore`) in `search_engine_query.py` → `retrieve_papers_based_on_query()`.
- Change the `limit` on individual prefetch stages in `search_engine_query.py` to control how many candidates each retrieval method (dense vs. BM25) contributes before fusion and reranking.
- Adjust `ESTIMATE_BM25_AVG_LEN_ON_X_DOCS` or the default `avg_len` in `search_engine.py` to see how BM25 scoring changes with different average document length estimates.
- Adjust `LIMIT` to see how the number of retrieved papers affects summarization.
- Modify the prompt in `prompts.py` or tool descriptions in `tools.py` and see how the agent's behavior changes.
Check further resources for context engineering with Agents and Qdrant: