A Streamlit-based chatbot that processes text files, clusters sentences by topic using K-means, and answers queries with relevant sentences using the all-MiniLM-L6-v2 sentence transformer model. Built in Google Colab with ngrok for public access, this project showcases NLP, topic clustering, and web app development.
- File Upload: Upload `.txt` files (100–1000 words) with single or multiple topics.
- Query Processing: Enter queries (e.g., "How is AI used in healthcare?") with abbreviation support (e.g., "AI" → "artificial intelligence").
- Topic Clustering: Groups sentences into topics (2–5 clusters) using K-means on sentence embeddings.
- Top-K Results: Returns 1–5 most relevant sentences with similarity scores and explanations.
- Robustness: Handles edge cases (empty files, vague queries) with clear error messages.
- Web Interface: Streamlit UI with file uploader, query input, and sliders for customization.
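
The abbreviation support mentioned above could look roughly like the following. This is a minimal sketch, not the app's actual code: the `ABBREVIATIONS` table and the `expand_abbreviations` helper are hypothetical names, and the "RE" entry is an assumption based on the renewable-energy sample files.

```python
import re

# Hypothetical abbreviation table; the app's real mapping may differ.
ABBREVIATIONS = {
    "AI": "artificial intelligence",
    "RE": "renewable energy",
}

# Match only whole, upper-case tokens so words like "AIR" are left alone.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b")


def expand_abbreviations(query: str) -> str:
    """Replace each known abbreviation in the query with its long form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], query)
```

Expanding the query before embedding it gives the sentence model the full phrase to match against the document text.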
- Python: Core programming language.
- Streamlit: Web app framework for the UI.
- sentence-transformers: `all-MiniLM-L6-v2` for sentence embeddings.
- scikit-learn: K-means clustering for topic grouping.
- NLTK: Text tokenization and query preprocessing (lemmatization).
- PyTorch: Backend for model computations.
- ngrok: Public URL tunneling for Colab.
- Google Colab: Development and testing environment.
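
As a rough illustration of how these libraries might fit together, the sketch below embeds sentences, clusters them with K-means, and ranks them against a query by cosine similarity. The function names (`cosine`, `top_k`, `cluster_and_rank`) and the `n_init` and `random_state` settings are assumptions for illustration; the app's own implementation may be organised differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, sentence_vecs, k=3):
    """Return (index, score) pairs for the k sentences most similar to the query."""
    scores = [(i, cosine(query_vec, v)) for i, v in enumerate(sentence_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


def cluster_and_rank(sentences, query, n_clusters=4, k=3):
    """Embed, cluster, and rank sentences (hypothetical helper, not the app's code)."""
    # Imported lazily so the pure-Python helpers above work without these packages.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences)  # one 384-dimensional vector per sentence
    # n_clusters must not exceed the number of sentences.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    query_vec = model.encode([query])[0]
    return [(sentences[i], int(labels[i]), score)
            for i, score in top_k(query_vec, embeddings, k)]
```

Calling `cluster_and_rank(sentences, "How is AI used in healthcare?")` would return up to three `(sentence, cluster, score)` tuples; the first call downloads the model weights.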
- Open the Notebook:
  - Download `document_query_chatbot.ipynb` from this repository.
  - Upload it to Colab (set sharing to "Anyone with the link can view").
  - Use a Python 3 runtime (GPU optional for faster processing).
- Prepare Text Files:
  - Place `sample.txt`, `sample2.txt`, `mixed_sample.txt`, and `complex_sample.txt` in `/content/drive/MyDrive/`.
  - Alternatively, upload files via Colab’s sidebar (folder icon > Upload).
- Run Cells Sequentially:
  - Cell 1: Installs libraries and ngrok.
  - Cell 2: Mounts Google Drive and downloads NLTK resources.
  - Cell 3: Configures ngrok with an authtoken.
  - Cell 4: Saves the Streamlit app (`app.py`).
  - Cell 5: Runs the app and provides a public ngrok URL.
  - Cell 6: Fallback for testing without Streamlit/ngrok.
- Access the App:
  - Click the ngrok URL from Cell 5 (e.g., `https://abc123.ngrok.io`).
  - Upload a text file, enter a query, set clusters/top-k, and view results.
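
For readers curious what Cells 3 and 5 might do, here is a minimal sketch. It assumes the `pyngrok` wrapper is used to open the tunnel, which the notebook may or may not do, and the helper names `streamlit_command` and `launch` are hypothetical. Port 8501 is Streamlit's default.

```python
import subprocess


def streamlit_command(app_path="app.py", port=8501):
    """Build the command line that serves the app on a local port."""
    return ["streamlit", "run", app_path, "--server.port", str(port)]


def launch(authtoken, app_path="app.py", port=8501):
    """Start Streamlit in the background and return a public tunnel URL."""
    # Imported lazily: pyngrok is an assumed dependency, not confirmed by the notebook.
    from pyngrok import ngrok

    ngrok.set_auth_token(authtoken)                      # Cell 3: configure the authtoken
    subprocess.Popen(streamlit_command(app_path, port))  # Cell 5: run the app
    tunnel = ngrok.connect(str(port))                    # Cell 5: open the public tunnel
    return tunnel.public_url
```

In a Colab cell, `print(launch("YOUR_AUTHTOKEN"))` would print the URL to open in a browser, with the placeholder replaced by a real ngrok authtoken.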
- Upload a File:
  - Use `complex_sample.txt` (~800 words, covers AI, renewable energy, quantum computing, biotech).
- Enter a Query:
  - Examples: "How is AI used in healthcare?", "What’s RE’s role in climate change?", "How does quantum computing help AI?".
- Set Parameters:
  - Number of clusters (2–5): Try 4 for multi-topic files.
  - Number of sentences (1–5): Try 3 for detailed results.
- View Results:
  - Top Matching Sentences: Relevant sentences with topics, scores, and explanations.
  - Expander: All sentences with their clusters and scores.
- Test Edge Cases:
  - Empty query: Should show "Error: Please enter a query!".
  - Single-sentence file: Should show "Error: File must contain at least two sentences.".
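
The two edge-case checks above could be implemented along these lines. This is a sketch under stated assumptions: `split_sentences` is a naive regex stand-in for the NLTK tokenizer the app uses, `validate_inputs` is a hypothetical helper name, and only the two error messages quoted above are reproduced.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def validate_inputs(text, query):
    """Return an error message, or None when the inputs are usable."""
    if not query.strip():
        return "Error: Please enter a query!"
    if len(split_sentences(text)) < 2:
        return "Error: File must contain at least two sentences."
    return None
```

Running validation before embedding avoids calling K-means with fewer sentences than clusters, which scikit-learn rejects.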
- `sample.txt`: AI-focused (~100 words).
- `sample2.txt`: Renewable energy (~100 words).
- `mixed_sample.txt`: AI and renewable energy (~200 words).
- `complex_sample.txt`: AI, renewable energy, quantum computing, biotechnology (~800 words).
- Colab Demo: Run `document_query_chatbot.ipynb` in Colab to test with ngrok.
- Live Demo (Planned): To be deployed on Streamlit Community Cloud (link TBD).
- Clone the repository: `git clone https://github.com/your-username/document-query-chatbot.git`, then `cd document-query-chatbot`.