VECsearch (CS2010 Data Mining CIA-2 Project for text processing)

A lightweight vector database implemented from scratch with NumPy and optional GPU parallelism, featuring custom tokenization, lemmatization, and stop-word removal. It is extended with retrieval-augmented generation (RAG) for context-aware document question answering and an interactive CLI built with prompt-toolkit.

Features

Lightweight vector database with sentence- and paragraph-level indexing
Custom tokenizer, lemmatizer, and stop-word remover
GPU parallelism using CuPy for performance (optional)
RAG-based question answering using Google Gemini
Interactive terminal interface with arrow-key navigation
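
The indexing and retrieval idea behind these features can be sketched in a few lines. The example below is a minimal, dependency-free illustration and not the project's actual code: it assumes simple term-count vectors and cosine similarity, uses a hypothetical hard-coded stop-word list, skips lemmatization, and uses plain Python where the project uses NumPy arrays. All function names are made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the project reads its list from a file instead.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "their", "own"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(text, stopwords=STOPWORDS):
    """Tokenize and drop stop words (lemmatization is omitted in this sketch)."""
    return [tok for tok in tokenize(text) if tok not in stopwords]


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


def search(query, passages, k=3):
    """Return up to k (score, passage) pairs, best match first."""
    query_vec = Counter(preprocess(query))
    scored = [(cosine_similarity(query_vec, Counter(preprocess(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Sentence-level index: each sentence is one searchable passage.
sentences = [
    "Plants make their own food through photosynthesis.",
    "The moon orbits the Earth.",
    "Sunlight helps plants produce food.",
]
results = search("Plants making food?", sentences, k=2)
for score, sentence in results:
    print(f"{score:.3f}  {sentence}")
```

In this sketch "making" and "make" count as different terms, which is the kind of mismatch a lemmatizer is meant to remove. Paragraph-level indexing would follow the same pattern with paragraphs as the passages.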

Setup Instructions

Step 1: Clone the Repository

Clone the repository to your local machine:

git clone https://github.com/Sathya4683/VECsearch.git
cd VECsearch

Step 2: Install Dependencies

pip install -r requirements.txt

This installs torch, langchain, langchain-community, langchain_google_genai, NumPy, pandas, and prompt_toolkit.
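
To confirm that the installation succeeded, a short check like the following may help. It is only a sketch: the import names are assumed to be the usual ones for the packages listed above, and `missing_modules` is a hypothetical helper, not part of the repository.

```python
import importlib.util

# Import names assumed to correspond to the packages in requirements.txt.
REQUIRED = [
    "torch",
    "langchain",
    "langchain_community",
    "langchain_google_genai",
    "numpy",
    "pandas",
    "prompt_toolkit",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
print("Missing modules:", ", ".join(missing) if missing else "none")
```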

Step 3: Set Up Google Gemini API Key

Obtain a Google Gemini API key from Google Gemini.
Paste the API key into a .env file (see demo.env for the expected format).
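
For readers unfamiliar with .env files, the sketch below shows one way such a file could be read into the process environment using only the standard library. The variable name `GOOGLE_API_KEY` is an assumption (check demo.env for the name the project actually expects), and `load_env_file` and `require_api_key` are hypothetical helpers, not functions from this repository.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path):
    """Read KEY=VALUE lines from a .env-style file into os.environ.

    Blank lines, comment lines starting with '#', and lines without '='
    are skipped. Variables that are already set are left untouched.
    """
    loaded = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


def require_api_key(name="GOOGLE_API_KEY"):
    """Return the key or fail early. The variable name is an assumption."""
    key = os.environ.get(name, "")
    if not key:
        raise RuntimeError(f"{name} is not set; see demo.env for the expected format")
    return key


# Demonstration with a throwaway file and a placeholder value.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text('# demo\nGOOGLE_API_KEY="not-a-real-key"\n', encoding="utf-8")
    values = load_env_file(env_path)

print(sorted(values))
```

Failing early with a clear message when the key is missing tends to be easier to debug than an authentication error surfacing later from the API client.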

Step 4: Run the Application

To run the application, use the following command:

python main.py

Step 5: Using the Application

Enter a Query
- Enter the path to the source text file (type "assets/sample/sample.txt" to use the provided sample document).
- Enter the path to the stop-words file (type "assets/sample/stopwords.txt" to use the provided sample list).
- Enter your search query (e.g., "Plants making food?").
- Enter the value of "k" (the number of top results) when prompted. Type "all" to list every match in descending order of similarity score.

Choose a Retrieval Mode
- Select whether to view sentence-wise, paragraph-wise, or RAG-based results when prompted.

View Top Matches
- The application displays the top matching sentences or paragraphs, ranked by semantic similarity to your query.

Navigate the Results
- Use the arrow keys in the interactive CLI to highlight and scroll through results. See CONTROLS.md for the key bindings.

Generate a RAG Response
- If you choose the RAG option, the application combines the top retrieved sentences and paragraphs and sends them, along with your question, to the Google Gemini API for a concise, context-aware answer.
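
The handling of "k" and the RAG step described above can be sketched as follows. This is an illustration under stated assumptions, not the code in rag.py: `parse_k` and `build_rag_prompt` are hypothetical names, the prompt wording is invented, the similarity scores are made-up sample values, and the call to the Gemini API is deliberately left out.

```python
def parse_k(raw, total):
    """Turn the user's k input into a result count; 'all' means every match."""
    text = raw.strip().lower()
    if text == "all":
        return total
    k = int(text)  # raises ValueError for non-numeric input
    if k < 1:
        raise ValueError("k must be a positive integer or 'all'")
    return min(k, total)


def build_rag_prompt(question, ranked_passages, k):
    """Combine the top-k retrieved passages and the question into one prompt.

    ranked_passages is a list of (score, text) pairs sorted best first.
    """
    context = "\n".join(
        f"[{i}] {text}"
        for i, (_score, text) in enumerate(ranked_passages[:k], start=1)
    )
    return (
        "Answer the question using only the context below. Be concise.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Illustrative retrieval output; the scores are sample values, not real results.
ranked = [
    (0.52, "Plants make their own food through photosynthesis."),
    (0.41, "Sunlight helps plants produce food."),
    (0.00, "The moon orbits the Earth."),
]
k = parse_k("2", total=len(ranked))
prompt = build_rag_prompt("Plants making food?", ranked, k)
print(prompt)
```

The resulting string is what would be sent to the language model. Keeping only the top-k passages in the context keeps the prompt short and focused on the most relevant text.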

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributors

Developed by Sathya and Pranav.
