The student mentorship process is often time-consuming for students and repetitive for university staff due to the high volume of inquiries regarding university regulations. This project introduces a chatbot that utilizes a Knowledge Graph constructed from unstructured text (specifically university PDF regulations). The system employs KG embedding models to predict missing links and infer relationships, enabling it to answer complex student queries with high accuracy.
For a deep dive into the methodology, architectural design, and evaluation metrics of this project, please refer to the following documents:
- Full Thesis PDF – A comprehensive breakdown of the research, implementation, and results.
- Presentation Slides – The final defense deck used for the graduation committee.
Experience the chatbot in action by viewing the recorded demonstration:
Click here to watch the Demo Video
(The demo showcases the end-to-end pipeline from processing a student's natural language query to the generation of a factual response based on the Knowledge Graph.)
The chatbot pipeline consists of four sequential stages:
- Pre-processing User Input: Normalizing text via spell-checking, grammar correction, and lemmatization.
- Input Comprehension: Extracting subjects and predicates using spaCy for dependency parsing and Fuzzywuzzy for entity mapping.
- Knowledge Graph Embedding Model: Utilizing trained embeddings (TransE, DistMult, or ComplEx) to predict the missing head or tail of a triplet.
- Response Generation: Converting predicted triplets back into natural language using NLTK and the Pattern library for grammatical conjugation.
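The input-comprehension stage can be illustrated with a minimal sketch. It uses only the Python standard library, so `difflib` stands in for Fuzzywuzzy here. The entity list and the helper names `map_entity` and `build_query` are hypothetical and are not taken from the project's code.

```python
from difflib import get_close_matches

# Hypothetical entity vocabulary; in practice this would come from the KG.
KG_ENTITIES = ["probation", "bachelor thesis", "makeup exam", "attendance"]

def map_entity(mention, entities=KG_ENTITIES, cutoff=0.6):
    """Return the KG entity closest to a user mention, or None if nothing is close."""
    matches = get_close_matches(mention.lower(), entities, n=1, cutoff=cutoff)
    return matches[0] if matches else None

def build_query(subject_mention, predicate):
    """Build a (head, relation, tail) triple whose tail is still unknown."""
    head = map_entity(subject_mention)
    if head is None:
        return None
    return (head, predicate, None)  # None marks the slot the embedding model predicts

# A misspelled mention is still mapped onto a known entity.
print(build_query("bachelor thesiss", "require"))
```

The resulting triple, with its empty tail, is the kind of input that the embedding stage would complete by ranking candidate entities.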
The project is built using Python 3.10 and the following core libraries:
| Library | Version | Purpose |
|---|---|---|
| AmpliGraph | 2.0.0 | KG Embedding and Link Prediction |
| spaCy | 3.5.1 | NER, POS tagging, and Coreference Resolution |
| Stanford-OpenIE | 1.3.1 | Information extraction of (Subject-Relation-Object) triplets |
| Flask | 2.2.2 | Web framework for the chatbot interface |
| NLTK / Pattern | 3.6.3 / 3.6 | Natural Language Generation and text processing |
| PyPDF2 | 3.0.1 | Extracting raw text from university regulation PDFs |
├── data/
│   ├── raw_pdfs/             # University regulation documents
│   └── triplets.csv          # Extracted Subject-Relation-Object data
├── src/
│   ├── preprocessing.py      # Text cleaning and normalization
│   ├── kg_construction.py    # Triple extraction and KG building
│   ├── embedding_model.py    # Model training (TransE, ComplEx, etc.)
│   └── app.py                # Flask application for user interaction
├── docs/
│   └── Thesis_Full_PDF.pdf   # Full academic documentation
└── README.md
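As a rough illustration of how the extracted triples might be consumed, the sketch below reads Subject-Relation-Object rows and collects the entity vocabulary. The header-less three-column layout, the sample rows, and the `load_triplets` helper are assumptions made for this example; the actual format of `triplets.csv` may differ.

```python
import csv
import io

def load_triplets(handle):
    """Read (subject, relation, object) rows, skipping blank or malformed ones."""
    triplets = []
    for row in csv.reader(handle):
        if len(row) != 3:
            continue  # ignore empty lines and rows with the wrong column count
        s, r, o = (cell.strip() for cell in row)
        if s and r and o:
            triplets.append((s, r, o))
    return triplets

# In-memory stand-in for data/triplets.csv (made-up rows, assumed layout).
sample = io.StringIO("student,must attend,lectures\n\nthesis,requires,supervisor\n")
triplets = load_triplets(sample)

# Every subject and object becomes a node in the graph.
entities = sorted({e for s, _, o in triplets for e in (s, o)})
```

To read the real file, the `io.StringIO` object would be replaced with `open("data/triplets.csv", newline="")`.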
- Clone the repository:
git clone https://github.com/davidsamy1/Thesis-Chatbot.git
cd Thesis-Chatbot
- Install dependencies:
pip install -r requirements.txt
- Run the Application:
python src/app.py
The system was tested using three primary KG embedding algorithms to predict missing academic facts:
- ComplEx: Captured anti-symmetric relations and complex interactions.
- TransE: Provided efficient distance-based reasoning.
- DistMult: Used for semantic matching energy modeling.

The experimental results demonstrated that the models successfully captured semantic relationships and structural properties of the university KG.
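For reference, the scoring functions behind these three models can be written in a few lines. This is a plain-Python sketch of the standard formulations with toy embeddings. It is not AmpliGraph's API and not the project's training code, and TransE is shown with the L2 norm, although L1 is also common.

```python
import math

def transe_score(h, r, t):
    """TransE: negative L2 distance between h + r and t (higher is more plausible)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def distmult_score(h, r, t):
    """DistMult: trilinear dot product, which is symmetric in h and t."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

def complex_score(h, r, t):
    """ComplEx: real part of <h, r, conj(t)> over complex-valued embeddings."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real

# Toy real-valued embeddings (made up): t equals h + r exactly.
h, r, t = [1.0, 0.0], [0.5, 0.5], [1.5, 0.5]

# Toy complex-valued embeddings (made up) for ComplEx.
hc, rc, tc = [1 + 2j], [1j], [3 - 1j]

forward = complex_score(hc, rc, tc)   # score of (h, r, t)
backward = complex_score(tc, rc, hc)  # score of (t, r, h)
```

Swapping head and tail leaves the DistMult score unchanged, whereas the ComplEx score can differ between the two directions. That difference is what lets ComplEx model anti-symmetric relations.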
If you use this work in your research, please cite it as follows:
@bachelorthesis{Samy2023,
author = {David Samy},
title = {Building a Chatbot for Students Mentorship based on Extracted Knowledge Graphs},
school = {German University in Cairo (GUC)},
faculty = {Media Engineering and Technology},
year = {2023},
month = {June}
}