A RAG (Retrieval-Augmented Generation) system for DataCommit podcast episodes. Downloads audio from YouTube, transcribes with Whisper, and enables Q&A using Haystack, ChromaDB, and Gemini.
DataCommit is a Turkish podcast series where data science experts share their career journeys, technical knowledge, and industry experiences. 🎙️ Watch all episodes on YouTube
- Audio Download: yt-dlp
- Speech-to-Text: Local Whisper-Turbo
- Audio Processing: FFmpeg, librosa, K-Means
- Text Correction: Gemini 2.5 Flash Agent
- Backend: Python, Flask
- RAG Framework: Haystack 2.22
- Vector Database: ChromaDB
- LLM: Google Gemini 3 Flash
- Embeddings: Sentence Transformers (all-MiniLM-L6-v2)
- Frontend: HTML, CSS, JavaScript
- Python 3.10+
- Google Gemini API Key
git clone https://github.com/enesmanan/DataCommit.git
cd DataCommit
python -m venv venv
venv\Scripts\activate # Windows
pip install -r requirements.txt
Create a .env file in the project root:
GEMINI_API_KEY=your_gemini_api_key_here
python create_database.py
This will:
- Load all episode transcripts from data/Final/
- Split them into chunks with metadata
- Create embeddings and store in ChromaDB
To rebuild the database, delete the chroma_db/ folder and run python create_database.py again.
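The chunking step listed above can be pictured with a minimal, dependency-free sketch. It is not the code in create_database.py: the word-window strategy, the chunk_size and overlap values, and the metadata fields (episode, chunk_index) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    episode: str      # hypothetical metadata field: name of the source episode
    chunk_index: int  # hypothetical metadata field: position within the episode


def split_transcript(text: str, episode: str,
                     chunk_size: int = 200, overlap: int = 40) -> list[Chunk]:
    """Split a transcript into overlapping word windows tagged with metadata."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_size]
        chunks.append(Chunk(" ".join(window), episode, index))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the transcript
    return chunks
```

Overlapping windows keep a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.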
python app.py
Open your browser at: http://localhost:5000
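In a RAG system, a question is typically answered by embedding it, retrieving the most similar transcript chunks, and passing them to the LLM as context. The toy sketch below illustrates that flow in plain Python with hand-written vectors. It is not the code in rag_pipeline.py, which relies on Haystack, ChromaDB, and Gemini; the function names, the prompt wording, and the value of k are assumptions made for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float],
          indexed_chunks: list[tuple[str, list[float]]],
          k: int = 2) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(indexed_chunks,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a prompt that places the retrieved chunks before the question."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

In the real application the vectors would come from the embedding model and the similarity search would be handled by the vector database, so only the overall shape of the flow carries over.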
For audio preprocessing (YouTube to transcript), see /preprocessing
DataCommit/
├── app.py # Flask web server
├── rag_pipeline.py # RAG pipeline & Gemini integration
├── create_database.py # Vector database creation
├── data/ # Episode transcripts
├── chroma_db/ # Vector database (auto-generated)
├── static/ # Frontend assets (CSS, JS, images)
├── templates/ # HTML templates
└── preprocessing/ # Audio-to-text scripts
Enes Fehmi Manan
- 🔗 LinkedIn: linkedin.com/in/enesfehmimanan
- 🐙 GitHub: github.com/enesmanan
- 📧 Email: enesmanan768@gmail.com
Made with ❤️ for the Turkish Data Science Community

