RAG chatbot for DataCommit podcast episodes with Haystack & Chroma & Gemini

enesmanan/DataCommit

DataCommit RAG Chatbot

A RAG (Retrieval-Augmented Generation) system for DataCommit podcast episodes. It downloads audio from YouTube, transcribes it with Whisper, and answers questions about the episodes using Haystack, ChromaDB, and Gemini.

DataCommit is a Turkish podcast series where data science experts share their career journeys, technical knowledge, and industry experiences. 🎙️ Watch all episodes on YouTube


Tech Stack

Audio to Text Pipeline

  • Audio Download: yt-dlp
  • Speech-to-Text: Local Whisper-Turbo
  • Audio Processing: FFmpeg, librosa, K-Means
  • Text Correction: Gemini 2.5 Flash Agent
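
As a rough illustration of the first stage above, the audio download can be expressed as a yt-dlp command line. This is only a sketch: the helper name `build_download_command`, the `audio` output folder, and the choice of WAV are assumptions for illustration, and the actual scripts in preprocessing/ may work differently.

```python
from pathlib import Path


def build_download_command(video_url: str, out_dir: str = "audio") -> list[str]:
    """Build a yt-dlp command that extracts only the audio track.

    Hypothetical helper: the output folder and audio format are assumptions.
    """
    # %(id)s and %(ext)s are yt-dlp output-template fields.
    output_template = str(Path(out_dir) / "%(id)s.%(ext)s")
    return [
        "yt-dlp",
        "-x",                     # extract audio only (requires FFmpeg)
        "--audio-format", "wav",  # assumed target format
        "-o", output_template,
        video_url,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once yt-dlp and FFmpeg are installed.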

RAG Pipeline

  • Backend: Python, Flask
  • RAG Framework: Haystack 2.22
  • Vector Database: ChromaDB
  • LLM: Google Gemini 3 Flash
  • Embeddings: Sentence Transformers (all-MiniLM-L6-v2)
  • Frontend: HTML, CSS, JavaScript
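
To show how these pieces fit together conceptually, here is a minimal, dependency-free sketch of the retrieve-then-prompt flow. It is not the project's code: word counts stand in for the all-MiniLM-L6-v2 embeddings, a plain list stands in for ChromaDB, and the function names and prompt wording are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Assemble the text that would be sent to the LLM (assumed wording)."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

In the real stack, the embedding and similarity search are handled by Sentence Transformers and ChromaDB, and the assembled prompt is answered by Gemini.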

Preprocessing Architecture

[Diagram: preprocessing architecture]


Setup

Prerequisites

1. Clone & Setup Environment

git clone https://github.com/enesmanan/DataCommit.git
cd DataCommit
python -m venv venv
venv\Scripts\activate  # Windows (on macOS/Linux: source venv/bin/activate)
pip install -r requirements.txt

2. Configure Environment

Create a .env file in the project root:

GEMINI_API_KEY=your_gemini_api_key_here
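
For reference, this is one way the key could be read at startup using only the standard library. It is a sketch under assumptions: the helper names `load_env_file` and `get_gemini_api_key` are hypothetical, and the project itself may rely on a package such as python-dotenv instead.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_gemini_api_key(env_path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API_KEY")
    if not key and Path(env_path).exists():
        key = load_env_file(env_path).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set")
    return key
```

Failing early with a clear error is usually preferable to discovering a missing key on the first Gemini request.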

3. Create Vector Database

python create_database.py

This will:

  • Load all episode transcripts from data/Final/
  • Split them into chunks with metadata
  • Create embeddings and store them in ChromaDB
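
The chunking step listed above can be sketched as follows. This is an illustration only: the function name, the word-based window, the default sizes, and the metadata fields are all assumptions, and create_database.py may split text differently (for example with a Haystack splitter component).

```python
def split_into_chunks(text: str, episode: str,
                      chunk_size: int = 200, overlap: int = 40) -> list[dict]:
    """Split a transcript into overlapping word windows with metadata.

    Hypothetical helper: sizes and metadata keys are illustrative assumptions.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: list[dict] = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append({
            "content": " ".join(window),
            "meta": {
                "episode": episode,
                "chunk_index": len(chunks),
                "start_word": start,
            },
        })
        # Stop once this window has reached the end of the transcript.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Overlapping windows help keep an answer retrievable when it straddles a chunk boundary, and the metadata lets a response point back to its source episode.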

To rebuild the database, delete the chroma_db/ folder and run the script again.
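
The rebuild can also be scripted. The helper below is a hypothetical convenience wrapper, not part of the repository; it assumes it is run from the project root, where chroma_db/ and create_database.py live.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def rebuild_database(db_dir: str = "chroma_db",
                     script: str = "create_database.py",
                     run_script: bool = True) -> bool:
    """Delete the vector store folder, then optionally re-run the build script.

    Returns True if an existing folder was removed.
    """
    db_path = Path(db_dir)
    existed = db_path.exists()
    if existed:
        shutil.rmtree(db_path)
    if run_script:
        # Use the current interpreter so the active virtual environment is used.
        subprocess.run([sys.executable, script], check=True)
    return existed
```

Calling `rebuild_database()` with the defaults mirrors the manual steps: remove the folder, then run `python create_database.py`.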

4. Run the Application

python app.py

Open your browser at: http://localhost:5000
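
To confirm from a script that the server is reachable, a small check like the one below can be used. It is a sketch that only requests the root URL given above; the function name is hypothetical, and nothing is assumed about the app's other routes.

```python
import urllib.error
import urllib.request


def is_app_up(url: str = "http://localhost:5000", timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status."""
    # An empty ProxyHandler bypasses any configured proxy, since the target is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```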

For audio preprocessing (YouTube to transcript), see the preprocessing/ directory.


Project Structure

DataCommit/
├── app.py                 # Flask web server
├── rag_pipeline.py        # RAG pipeline & Gemini integration
├── create_database.py     # Vector database creation
├── data/                  # Episode transcripts
├── chroma_db/             # Vector database (auto-generated)
├── static/                # Frontend assets (CSS, JS, images)
├── templates/             # HTML templates
└── preprocessing/         # Audio-to-text scripts

📬 Contact

Enes Fehmi Manan

Made with ❤️ for the Turkish Data Science Community
