A fully working Retrieval-Augmented Generation (RAG) pipeline that lets you ask natural-language questions and get answers grounded in content from 33 podcast episode transcripts about vector search and AI.
Your Question
│
▼
[Embed with sentence-transformers]
│
▼
[k-NN search on OpenSearch] ← 33 podcast episodes, chunked + embedded
│
▼
[Generate answer with Ollama]
│
▼
Grounded Answer + Episode Citations| Layer | Technology | Purpose |
|---|---|---|
| Vector store | OpenSearch 2.13 (k-NN plugin) | Store & search 384-dim embeddings |
| Embeddings | all-MiniLM-L6-v2 (sentence-transformers) |
Convert text → vectors |
| Index type | HNSW with cosine similarity | Approximate nearest-neighbour search |
| LLM | Ollama (local, e.g. gemma4) |
Generate grounded answers |
| Data | 33 Vector Podcast transcripts (Whisper) | Knowledge base |
cd workshop/
docker compose up -dWait ~30 seconds, then verify: http://localhost:9200
You should see {"status": "green" ...}.
OpenSearch Dashboards (optional UI): http://localhost:5601
uv syncollama pull gemma4You can use any Ollama model. Set OLLAMA_MODEL to override the default:
export OLLAMA_MODEL=gemma4 # default
export OLLAMA_HOST=http://localhost:11434 # defaultjupyter notebook notebooks/| Step | What happens |
|---|---|
| Connect | Verify OpenSearch is running |
| Create index | k-NN index with HNSW + cosine similarity, 384-dim |
| Parse | Load 33 .md files, extract frontmatter + transcript body |
| Chunk | Split transcripts into ~400-word windows with 50-word overlap |
| Embed | Generate sentence embeddings with all-MiniLM-L6-v2 |
| Index | Bulk-ingest all chunks into OpenSearch |
| Verify | Run a test k-NN query |
| Step | What happens |
|---|---|
| Semantic search | k-NN query finds meaning, not just keywords |
| Hybrid search | BM25 + k-NN combined for better term coverage |
| RAG pipeline | Retrieve → format context → generate with Ollama |
| Inspect context | See exactly what goes into the LLM prompt |
| Explore | Ask your own questions! |
Two-stage retrieval: bi-encoder retrieves top-20 candidates, then a cross-encoder
(ms-marco-MiniLM-L-6-v2) re-scores them for much higher precision before sending to the LLM.
Side-by-side RAG comparison shows the impact on answer quality.
Scope k-NN search by episode title, date range, or any structured field. Covers OpenSearch post-filtering, pre-filtering trade-offs, and the over-fetch pattern.
Swap all-MiniLM-L6-v2 (384d) for all-mpnet-base-v2 (768d) and compare:
search quality, encoding speed, and index size side by side on the same data.
Connect to an Aiven for OpenSearch cloud cluster over TLS. Create the same index, bulk-index podcast chunks, and run RAG queries against a fully managed service.
Launch streamlit_app.py. A streaming chat interface with source citations,
sidebar controls (search mode, top-k), and persistent chat history.
cd workshop/
streamlit run streamlit_app.py # opens http://localhost:8501workshop/
├── README.md
├── docker-compose.yml # OpenSearch + Dashboards
├── requirements.txt
├── streamlit_app.py # Streamlit chat UI
├── notebooks/
│ ├── 01_index_podcasts.ipynb # Core: parse, embed, index
│ ├── 02_rag_pipeline.ipynb # Core: search + RAG
│ ├── 03_reranking.ipynb # Advanced: cross-encoder re-ranking
│ ├── 04_metadata_filtering.ipynb # Advanced: filter by episode/date
│ ├── 05_larger_embeddings.ipynb # Advanced: 768-dim embeddings
│ └── 06_aiven_opensearch.ipynb # Advanced: managed cloud cluster
└── src/
├── parser.py # Markdown → PodcastChunk objects
├── opensearch_client.py # Index management, search, Aiven client
├── rag.py # RAG pipeline using Ollama SDK
└── reranker.py # Cross-encoder re-ranking{
"settings": { "index.knn": true },
"mappings": {
"properties": {
"embedding": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name": "hnsw",
"space_type": "cosinesimil",
"engine": "nmslib"
}
}
}
}
}HNSW builds a multi-layer graph where each node connects to its nearest neighbours. Search starts at the top (sparse) layer and zooms into the bottom (dense) layer, pruning irrelevant branches early, achieving sub-millisecond search over millions of vectors.
Parameters:
ef_construction: graph quality during indexing (higher = better recall, slower build)m: max connections per node (higher = better recall, more memory)ef_search: candidates explored at query time (higher = better recall, slower search)
Long podcast transcripts are split into overlapping windows:
- chunk_size = 400 words enough context for a coherent idea
- overlap = 50 words ensures ideas spanning chunk boundaries are captured
The system prompt constrains the LLM to:
- Answer only from provided context (no hallucination)
- Cite the episode title for specific claims
- Acknowledge when context is insufficient
- "What is HNSW and why did Yury Malkov invent it?"
- "How do wormhole vectors differ from standard hybrid search?"
- "What are the main challenges of running vector search in production?"
- "How does Pinecone's architecture differ from Weaviate's?"
- "What advice do guests give for evaluating embedding model quality?"
- "What is the role of sparse vectors in hybrid search?"
docker compose down # stop containers
docker compose down -v # stop + delete the index data