Commit 7e3ede6

Expand README with EduViQA dataset details
Added dataset section with statistics and topics covered.
1 parent 090506d commit 7e3ede6

1 file changed

Lines changed: 33 additions & 7 deletions

README.md

@@ -41,13 +41,6 @@ We present **Video-RAC**, an adaptive chunking methodology for lecture videos wi
 
 Alongside the method, we release **EduViQA**, a slide-centric, bilingual (Persian/English) lecture dataset containing 20 videos from 5 professors across STEM and education topics. Each lecture is paired with 50 synthetic QA items and categorized by duration (40% mid-length, ~20–40 minutes) to support controlled RAG benchmarking.
 
-**Key Highlights:**
--**Adaptive chunking** using CLIP embeddings and SSIM for semantic segmentation
-- 🌍 **Bilingual dataset** (Persian & English) with 20 educational videos
-- 📊 **50 QA pairs per video** for comprehensive evaluation
-- 🎯 **RAGAS evaluation** showing +12-15% improvement over baseline methods
-- 🔥 **Multimodal (image+text)** retrieval achieves best performance
-
 This repository is the **official implementation** of the CSICC 2025 paper by *Hemmat et al.*
 
 > **Hemmat, A., Vadaei, K., Shirian, M., Heydari, M.H., Fatemi, A.**
@@ -56,6 +49,39 @@ This repository is the **official implementation** of the CSICC 2025 paper by *H
 
 ---
 
+## 📊 Dataset
+
+### EduViQA: Bilingual Educational Video QA Dataset
+
+![Dataset Composition](src/assets/fig-1.png)
+*Dataset composition highlighting topic distribution and lecture duration proportions.*
+
+### Dataset Statistics
+
+| Metric | Value |
+|--------|-------|
+| **Total Videos** | 20 (10 Persian, 10 English) |
+| **Professors** | 5 |
+| **Duration Focus** | 40% mid-length (20–40 minutes) |
+| **QA Pairs per Video** | 50 synthetic QA pairs |
+| **Format** | JSON annotations |
+
+### Topics Covered
+- Computer Architecture
+- Data Structures
+- System Dynamics and Control
+- Teaching Skills
+- Descriptive Research
+- Regions in Human Geography
+- Differentiated Instruction
+- Business
+
+The dataset also captures slide transitions and keyframes extracted via CLIP+SSIM chunking, enabling multimodal retrieval experiments with aligned visuals and transcripts.
+
+**📥 Access Dataset:** [Hugging Face - EduViQA](https://huggingface.co/datasets/UIAIC/EduViQA)
+
+---
+
 ## 🧠 Research Background
 
 This framework underpins the **EduViQA bilingual dataset**, designed for evaluating lecture-based RAG systems in both Persian and English. The dataset and code form a unified ecosystem for multimodal question generation and retrieval evaluation.
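
The added README text mentions slide transitions extracted via CLIP+SSIM chunking. As a rough illustration only, and not the repository's implementation, the sketch below flags a transition whenever a simplified single-window SSIM between consecutive grayscale frames drops below a threshold. The function names, the 0.8 threshold, and the use of a global SSIM in place of the usual windowed variant are all assumptions made for this example; the CLIP half of the method is omitted.

```python
import numpy as np


def global_ssim(a: np.ndarray, b: np.ndarray, data_range: float = 255.0) -> float:
    """Single-window SSIM over two equally sized grayscale frames.

    Assumption: this is a simplification of the usual sliding-window SSIM;
    statistics are computed once over the whole frame.
    """
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    # Standard SSIM stabilising constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)


def slide_boundaries(frames, threshold: float = 0.8):
    """Return each index i where frame i differs enough from frame i-1
    to be treated as the start of a new chunk (hypothetical helper)."""
    return [
        i
        for i in range(1, len(frames))
        if global_ssim(frames[i - 1], frames[i]) < threshold
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    slide_a = rng.integers(0, 256, size=(32, 32))  # stand-in for one slide
    slide_b = rng.integers(0, 256, size=(32, 32))  # stand-in for the next slide
    frames = [slide_a, slide_a, slide_b, slide_b, slide_b]
    print(slide_boundaries(frames))
```

In practice a windowed SSIM (for example from an image-processing library) and a threshold tuned on real lecture footage would likely behave better than this global version.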
