Spillage Search

A high-performance search engine for Medium articles, modeled on the architecture described in Google's foundational search paper, The Anatomy of a Large-Scale Hypertextual Web Search Engine. It features AI-powered summaries, real-time indexing, and BM25 relevance scoring across 190,000+ articles.

🚀 Features

Advanced Search: BM25 scoring algorithm with multi-field search (title, content, tags, authors)
AI Summaries: Google Gemini-powered article summarization
Real-time Indexing: Add new Medium articles instantly
Smart Filtering: Sort by relevancy, date, and more
Rich Results: Thumbnails, descriptions, author info, and tags
Live Status Updates: Real-time progress for article uploads
Intelligent Caching: Query caching for improved performance
Members-Only Content: View members-only content for free via Freedium

🏗️ Architecture

Backend (FastAPI)

Search Engine: Custom implementation based on Google's foundational research
Inverted Index: Barrel-based storage system for efficient retrieval
BM25 Scoring: Advanced relevance ranking with configurable parameters
Multi-threading: Parallel processing for index operations and scoring
RESTful API: Clean endpoints for search, upload, and summarization
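
To make the barrel idea concrete, here is a minimal sketch of an inverted index split across barrels. It is an illustration under stated assumptions, not the project's actual code: the barrel count, the hash-based partitioning, and the names `barrel_id`, `build_barrels`, and `lookup` are all hypothetical. (The original Google paper partitions barrels by wordID range; hashing is swapped in here for brevity.)

```python
import zlib
from collections import defaultdict
from typing import Dict, List

NUM_BARRELS = 4  # hypothetical value; the real barrel count is not stated in this README


def barrel_id(term: str, num_barrels: int = NUM_BARRELS) -> int:
    """Pick a barrel with a stable hash (crc32 gives the same result on every run)."""
    return zlib.crc32(term.encode("utf-8")) % num_barrels


def build_barrels(docs: Dict[int, List[str]], num_barrels: int = NUM_BARRELS):
    """Return a list of barrels; each barrel maps term -> {doc_id: term_frequency}."""
    barrels = [defaultdict(dict) for _ in range(num_barrels)]
    for doc_id, tokens in docs.items():
        for token in tokens:
            postings = barrels[barrel_id(token, num_barrels)][token]
            postings[doc_id] = postings.get(doc_id, 0) + 1
    return barrels


def lookup(barrels, term: str) -> Dict[int, int]:
    """Fetch the postings for a term by opening only the barrel that can contain it."""
    return dict(barrels[barrel_id(term, len(barrels))].get(term, {}))
```

Because a query term maps to exactly one barrel, a lookup only has to touch that barrel rather than the whole index, which is what makes this layout friendly to parallel loading and scoring.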

Frontend (Next.js)

Modern UI: Responsive design with smooth animations
Real-time Updates: Live search status and progress indicators
Interactive Features: AI summary toggle, sorting controls
Optimized Performance: Client-side caching and efficient rendering

📊 Dataset

Source: Kaggle Medium Articles Dataset
Volume: 190,000+ articles
Coverage: Diverse topics across Medium's ecosystem
Preprocessing: Cleaned, tokenized, and indexed using NLTK

🛠️ Installation

Prerequisites

Python 3.8+
Node.js 16+
Google Gemini API key

Backend Setup

Clone the repository:

git clone https://github.com/bonevane/spillage-search
cd spillage-search/backend-python

Install Python dependencies:

pip install -r requirements.txt

Note: You may also need to install Sentence Transformers (the sentence-transformers package) for semantic search.

Set up environment variables:

cp .env.example .env
# Then edit .env and set your Google Gemini API key:
GEMINI_API_KEY=your_api_key_here

Download NLTK resources:

python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt'); nltk.download('wordnet')"

Run the FastAPI server:

uvicorn backend:app --reload --host 0.0.0.0 --port 8000

Frontend Setup

Navigate to frontend directory:

cd ../frontend-next

Install dependencies:

npm install

Set up environment variables:

cp .env.example .env.local
# Then edit .env.local and set the backend API URL:
NEXT_PUBLIC_API_URL=http://localhost:8000

Start the development server:

npm run dev

🔧 Configuration

BM25 Parameters

k = 1.5          # Term frequency saturation parameter
b = 0.75         # Field length normalization
TITLE_VAR = 12   # Title field boost
AUTHOR_VAR = 6   # Author field boost
TAG_VAR = 8      # Tag field boost

Search Features

Multi-field Search: Searches across title, content, tags, and authors
Query Preprocessing: Tokenization, lemmatization, and stop word removal
Result Ranking: Intersection boosting and field-specific scoring
Performance: Thread-based parallel processing
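
The query preprocessing step can be sketched roughly as follows. This is a simplified stand-in, not the project's implementation: the regex tokenizer, the tiny `DEMO_STOP_WORDS` set, and the `preprocess` function are assumptions made so the example runs without downloading NLTK data.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical, deliberately tiny stop word list used only for this demo.
DEMO_STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "on", "for", "and", "to", "is"})


def preprocess(
    query: str,
    stop_words: Iterable[str] = DEMO_STOP_WORDS,
    lemmatize: Callable[[str], str] = lambda token: token,
) -> List[str]:
    """Lowercase and tokenize the query, drop stop words, then lemmatize each token."""
    stop_set = set(stop_words)
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [lemmatize(token) for token in tokens if token not in stop_set]
```

With the NLTK resources from the installation steps downloaded, you could pass `set(nltk.corpus.stopwords.words("english"))` as `stop_words` and `nltk.stem.WordNetLemmatizer().lemmatize` as `lemmatize` to get closer to the pipeline described above.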

Scoring Formula

BM25 = IDF × (TF × (k + 1)) / (TF + k × (1 - b + b × (|d| / avgdl)))

With additional boosters for:

Query term intersection
Title matches
Author relevance
Tag matches
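
As a rough illustration of the formula, the sketch below scores a single term and then layers field boosts on top. In the formula, TF is the term's frequency in the document, |d| is the document length, and avgdl is the average document length in the collection. Several parts are assumptions, since this README does not document them: the smoothed IDF variant, the additive way the boosts are applied, and the names `idf`, `bm25_term`, and `boosted_score`. Intersection boosting is left out entirely.

```python
import math

K = 1.5   # values taken from the BM25 Parameters section
B = 0.75
TITLE_VAR, AUTHOR_VAR, TAG_VAR = 12, 6, 8


def idf(num_docs, doc_freq):
    # Assumption: a common smoothed IDF form; the project's exact variant is not documented.
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term(tf, doc_len, avgdl, num_docs, doc_freq, k=K, b=B):
    """Score one query term against one document with the BM25 formula."""
    if tf == 0:
        return 0.0
    length_norm = 1 - b + b * (doc_len / avgdl)
    return idf(num_docs, doc_freq) * (tf * (k + 1)) / (tf + k * length_norm)


def boosted_score(query_terms, doc, stats):
    """Sum BM25 over query terms, adding a fixed boost per field hit (an assumption)."""
    total = 0.0
    for term in query_terms:
        tf = doc["tf"].get(term, 0)
        doc_freq = stats["doc_freq"].get(term, 0)
        total += bm25_term(tf, doc["length"], stats["avgdl"], stats["num_docs"], doc_freq)
        total += TITLE_VAR * (term in doc["title"])
        total += AUTHOR_VAR * (term in doc["authors"])
        total += TAG_VAR * (term in doc["tags"])
    return total
```

Note that for a document of exactly average length and TF = 1, the fraction reduces to 1 and the term score equals the IDF. As TF grows, the score saturates below IDF × (k + 1), which is why k is described as the saturation parameter.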

📚 API Endpoints

Search

POST /search
Content-Type: application/json

{
  "query": "machine learning"
}
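
For reference, a minimal Python client for this endpoint might look like the sketch below. It uses only the standard library. The helper names `build_search_request` and `search` are hypothetical, the base URL assumes the local server from the setup steps, and the response shape is not documented here, so the parsed JSON is returned as-is.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # assumes the local FastAPI server from Backend Setup


def build_search_request(query, base_url=API_URL):
    """Build (but do not send) the POST /search request."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, base_url=API_URL, timeout=10):
    """Send the request and return the decoded JSON response, whatever its shape."""
    request = build_search_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running, `search("machine learning")` would send the same payload as the example above.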

Add Article

POST /upload-url
Content-Type: application/json

{
  "url": "https://medium.com/article-url"
}

Generate Summary

POST /summarize
Content-Type: application/json

{
  "wait_for_results": true,
  "max_wait_seconds": 30,
  "summary_length": "short"
}

Summarize Specific Article

POST /summarize-article
Content-Type: application/json

{
  "url": "https://medium.com/article-url",
  "summary_length": "medium"
}
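
When building this payload from code, it can help to validate `summary_length` before sending. The sketch below is one way to do that. The function name `summarize_article_payload` is hypothetical, and accepting "long" is an assumption inferred from the Summary Types list rather than from documented API behavior.

```python
import json

# "short" and "medium" appear in the examples; "long" is inferred from Summary Types.
ALLOWED_LENGTHS = ("short", "medium", "long")


def summarize_article_payload(url, summary_length="medium"):
    """Return the JSON body for POST /summarize-article, rejecting unknown lengths."""
    if summary_length not in ALLOWED_LENGTHS:
        raise ValueError(
            f"summary_length must be one of {ALLOWED_LENGTHS}, got {summary_length!r}"
        )
    return json.dumps({"url": url, "summary_length": summary_length})
```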

🤖 AI Integration

Gemini RAG Module

Context Processing: Intelligent content extraction
Summary Generation: Configurable length summaries
Error Handling: Graceful fallbacks

Summary Types

Short: Concise overview (1-2 paragraphs)
Medium: Detailed analysis (3-4 paragraphs)
Long: Comprehensive summary (5+ paragraphs)

📈 Performance

Search Speed: Sub-second response times
Concurrent Users: Multi-threaded request handling
Index Size: Optimized barrel-based storage
Memory Usage: Efficient data structures and caching

🤝 Contributing

All contributions are greatly appreciated!

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Project contributors Ahmad Shahmeer & Sikander Hayat Khan for their help in developing v1 of the search engine
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Kaggle Medium Articles Dataset
FastAPI and Next.js communities
Google Gemini AI for summarization

Built with ❤️ for the Medium community
