DataCommit RAG Chatbot

A RAG (Retrieval-Augmented Generation) system for DataCommit podcast episodes. It downloads audio from YouTube, transcribes it with Whisper, and answers questions about the episodes using Haystack, ChromaDB, and Gemini.

DataCommit is a Turkish podcast series where data science experts share their career journeys, technical knowledge, and industry experiences. 🎙️ Watch all episodes on YouTube

Tech Stack

Audio to Text Pipeline

Audio Download: yt-dlp
Speech-to-Text: Local Whisper-Turbo
Audio Processing: FFmpeg, librosa, K-Means
Text Correction: Gemini 2.5 Flash Agent

RAG Pipeline

Backend: Python, Flask
RAG Framework: Haystack 2.22
Vector Database: ChromaDB
LLM: Google Gemini 3 Flash
Embeddings: Sentence Transformers (all-MiniLM-L6-v2)
Frontend: HTML, CSS, JavaScript
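
The retrieval step that these components perform together can be illustrated with a minimal, library-free sketch. This is not the project's code: the real pipeline uses Haystack, ChromaDB, and Sentence Transformers, while the example below substitutes hand-written toy vectors and plain cosine similarity. The names `cosine_similarity`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, store, top_k=2):
    """Return the top_k (score, text) pairs from store, ranked by similarity to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


def build_prompt(question, passages):
    """Assemble the retrieved passages and the question into a single LLM prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Toy store: (text, embedding) pairs. A real embedding model produces the vectors.
store = [
    ("Episode A transcript chunk", [1.0, 0.0]),
    ("Episode B transcript chunk", [0.0, 1.0]),
    ("Episode C transcript chunk", [1.0, 1.0]),
]
hits = retrieve([1.0, 0.0], store, top_k=2)
prompt = build_prompt("What was discussed?", [text for _, text in hits])
```

In the actual stack, the vector database handles storage and ranking, and the prompt is sent to Gemini. The sketch only shows the order of operations: embed, rank, assemble the prompt.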

Preprocessing Architecture

Setup

Prerequisites

Python 3.10+
Google Gemini API Key

1. Clone & Setup Environment

git clone https://github.com/enesmanan/DataCommit.git
cd DataCommit
python -m venv venv
venv\Scripts\activate  # Windows (on macOS/Linux: source venv/bin/activate)
pip install -r requirements.txt

2. Configure Environment

Create a .env file in the project root:

GEMINI_API_KEY=your_gemini_api_key_here
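
As a minimal sketch of how the key might be read at startup, the .env format above can be parsed with the standard library alone. The project's actual loading code is not shown in this README, so `parse_env` and `load_api_key` are hypothetical helpers, not functions from the repository.

```python
import os


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_api_key(env_text: str) -> str:
    """Prefer a real environment variable, then fall back to the .env contents."""
    key = os.environ.get("GEMINI_API_KEY") or parse_env(env_text).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set")
    return key
```

A package such as python-dotenv is the more common choice for this job. Whether this project relies on it is an assumption to check against requirements.txt.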

3. Create Vector Database

python create_database.py

This will:

Load all episode transcripts from data/Final/
Split them into chunks with metadata
Create embeddings and store in ChromaDB
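
The splitting step above can be sketched in plain Python. This is an illustration under stated assumptions, not the contents of create_database.py: the function name `chunk_transcript`, the word-based window, the default sizes, and the metadata fields are all hypothetical.

```python
def chunk_transcript(text: str, episode: str, chunk_size: int = 200, overlap: int = 40) -> list[dict]:
    """Split text into overlapping word windows, each tagged with metadata."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_size]
        chunks.append({
            "content": " ".join(window),
            # Metadata lets an answer point back to its source episode.
            "meta": {"episode": episode, "chunk_index": index, "start_word": start},
        })
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Overlapping windows keep a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some words twice.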

To rebuild the database, delete the chroma_db/ folder and run the script again.

4. Run the Application

python app.py

Open your browser at: http://localhost:5000

For audio preprocessing (YouTube to transcript), see the preprocessing/ directory.

Project Structure

DataCommit/
├── app.py                 # Flask web server
├── rag_pipeline.py        # RAG pipeline & Gemini integration
├── create_database.py     # Vector database creation
├── data/                  # Episode transcripts
├── chroma_db/             # Vector database (auto-generated)
├── static/                # Frontend assets (CSS, JS, images)
├── templates/             # HTML templates
└── preprocessing/         # Audio-to-text scripts

📬 Contact

Enes Fehmi Manan

🔗 LinkedIn: linkedin.com/in/enesfehmimanan
🐙 GitHub: github.com/enesmanan
📧 Email: enesmanan768@gmail.com

Made with ❤️ for the Turkish Data Science Community
