A high-performance search engine for Medium articles built on Google's foundational search architecture research. Features AI-powered summaries, real-time indexing, and lightning-fast BM25 search scoring across 190k+ articles.
- Advanced Search: BM25 scoring algorithm with multi-field search (title, content, tags, authors)
- AI Summaries: Google Gemini-powered article summarization
- Real-time Indexing: Add new Medium articles instantly
- Smart Filtering: Sort by relevancy, date, and more
- Rich Results: Thumbnails, descriptions, author info, and tags
- Live Status Updates: Real-time article uploads
- Intelligent Caching: Query caching for improved performance
- Members-only Content: Read members-only articles for free via Freedium
- Search Engine: Custom implementation based on Google's foundational research
- Inverted Index: Barrel-based storage system for efficient retrieval
- BM25 Scoring: Advanced relevance ranking with configurable parameters
- Multi-threading: Parallel processing for index operations and scoring
- RESTful API: Clean endpoints for search, upload, and summarization
- Modern UI: Responsive design with smooth animations
- Real-time Updates: Live search status and progress indicators
- Interactive Features: AI summary toggle, sorting controls
- Optimized Performance: Client-side caching and efficient rendering
- Source: Kaggle Medium Articles Dataset
- Volume: 190,000+ articles
- Coverage: Diverse topics across Medium's ecosystem
- Preprocessing: Cleaned, tokenized, and indexed using NLTK
- Python 3.8+
- Node.js 16+
- Google Gemini API key
- Clone the repository:
git clone https://github.com/bonevane/spillage-search
cd spillage-search/backend-python
- Install Python dependencies:
pip install -r requirements.txt
Note: You may also need to install Sentence Transformers for semantic search
- Set up environment variables:
cp .env.example .env
# Add your Google Gemini API key to .env
GEMINI_API_KEY=your_api_key_here
- Download NLTK resources:
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt'); nltk.download('wordnet')"
- Run the FastAPI server:
uvicorn backend:app --reload --host 0.0.0.0 --port 8000
- Navigate to frontend directory:
cd ../frontend-next
- Install dependencies:
npm install
- Set up environment variables:
cp .env.example .env.local
# Configure API URL
NEXT_PUBLIC_API_URL=http://localhost:8000
- Start the development server:
npm run dev
k = 1.5 # Term frequency saturation parameter
b = 0.75 # Field length normalization
TITLE_VAR = 12 # Title field boost
AUTHOR_VAR = 6 # Author field boost
TAG_VAR = 8 # Tag field boost
- Multi-field Search: Searches across title, content, tags, and authors
- Query Preprocessing: Tokenization, lemmatization, and stop word removal
- Result Ranking: Intersection boosting and field-specific scoring
- Performance: Thread-based parallel processing
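The ranking features above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the in-memory `INDEX`, the `STOP_WORDS` set, the `preprocess` and `rank` helpers, `CONTENT_WEIGHT`, and the `INTERSECTION_BOOST` value are all hypothetical, and a plain regex tokenizer stands in for the NLTK tokenization and lemmatization the project uses. Only `TITLE_VAR`, `AUTHOR_VAR`, and `TAG_VAR` come from the configuration above.

```python
import re
from collections import defaultdict

# Field boosts taken from the configuration above.
TITLE_VAR, AUTHOR_VAR, TAG_VAR = 12, 6, 8
CONTENT_WEIGHT = 1          # assumption: unboosted weight for body text
INTERSECTION_BOOST = 2.0    # assumption: multiplier when every query term matches

STOP_WORDS = {"a", "an", "the", "of", "in", "for", "and", "to"}  # tiny stand-in list

FIELD_WEIGHTS = {"title": TITLE_VAR, "authors": AUTHOR_VAR,
                 "tags": TAG_VAR, "content": CONTENT_WEIGHT}

# Hypothetical inverted index: term -> {doc_id: set of fields containing the term}
INDEX = {
    "machine": {1: {"title", "content"}, 2: {"content"}},
    "learning": {1: {"title", "tags"}, 3: {"content"}},
}

def preprocess(query):
    """Lowercase, split on non-letters, and drop stop words (no lemmatization)."""
    tokens = re.findall(r"[a-z]+", query.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def rank(query):
    """Return (doc_id, score) pairs, best first."""
    terms = preprocess(query)
    scores = defaultdict(float)
    matched = defaultdict(set)
    for term in terms:
        for doc_id, fields in INDEX.get(term, {}).items():
            # Each field containing the term contributes its boost.
            scores[doc_id] += sum(FIELD_WEIGHTS[f] for f in fields)
            matched[doc_id].add(term)
    for doc_id in scores:
        # Intersection boosting: reward documents that match every query term.
        if len(matched[doc_id]) == len(set(terms)):
            scores[doc_id] *= INTERSECTION_BOOST
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In this toy index, `rank("the machine learning")` places document 1 first because it matches both terms in boosted fields; a real implementation would use BM25 term scores rather than flat field weights.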
BM25 = IDF × (TF × (k + 1)) / (TF + k × (1 - b + b × (|d| / avgdl)))
With additional boosters for:
- Query term intersection
- Title matches
- Author relevance
- Tag matches
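As a rough illustration of the formula, the sketch below computes the score of a single term for a single document using the `k` and `b` values from the configuration. The function names `idf` and `bm25_term_score`, their arguments, and the particular IDF variant are assumptions; the README does not say which IDF definition the project uses, and the boosters listed above are left out.

```python
import math

K = 1.5   # term frequency saturation (from the configuration above)
B = 0.75  # field length normalization (from the configuration above)

def idf(num_docs, doc_freq):
    """Smoothed IDF, one common variant; assumed, not taken from the project."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)

def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k=K, b=B):
    """BM25 = IDF * (TF * (k + 1)) / (TF + k * (1 - b + b * |d| / avgdl))."""
    # Length normalization: 1.0 for an average-length document.
    norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf(num_docs, doc_freq) * (tf * (k + 1)) / (tf + k * norm)
```

With `doc_len == avg_doc_len` and `tf = 1`, the fraction reduces to 1 and the score equals the IDF. For the same term frequency, a longer-than-average document scores lower, and the score never exceeds `IDF * (k + 1)` however large `tf` grows.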
POST /search
Content-Type: application/json
{
"query": "machine learning"
}
POST /upload-url
Content-Type: application/json
{
"url": "https://medium.com/article-url"
}
POST /summarize
Content-Type: application/json
{
"wait_for_results": true,
"max_wait_seconds": 30,
"summary_length": "short"
}
POST /summarize-article
Content-Type: application/json
{
"url": "https://medium.com/article-url",
"summary_length": "medium"
}
- Context Processing: Intelligent content extraction
- Summary Generation: Configurable length summaries
- Error Handling: Graceful fallbacks
- Short: Concise overview (1-2 paragraphs)
- Medium: Detailed analysis (3-4 paragraphs)
- Long: Comprehensive summary (5+ paragraphs)
- Search Speed: Sub-second response times
- Concurrent Users: Multi-threaded request handling
- Index Size: Optimized barrel-based storage
- Memory Usage: Efficient data structures and caching
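For reference, the endpoints and summary lengths documented above could be exercised from Python with only the standard library, along these lines. The helper names (`build_request`, `search_request`, `summarize_article_request`, `send`) are hypothetical, the lowercase `"long"` value is inferred from the list of summary lengths, and the response format is not documented here, so `send` simply assumes the server replies with JSON.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the uvicorn command above
SUMMARY_LENGTHS = {"short", "medium", "long"}  # "long" is an inferred value

def build_request(path, payload, base_url=BASE_URL):
    """Build a JSON POST request for one of the documented endpoints."""
    return urllib.request.Request(
        url=base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def search_request(query):
    """Request for POST /search."""
    return build_request("/search", {"query": query})

def summarize_article_request(url, summary_length="medium"):
    """Request for POST /summarize-article, with client-side length validation."""
    if summary_length not in SUMMARY_LENGTHS:
        raise ValueError(f"summary_length must be one of {sorted(SUMMARY_LENGTHS)}")
    return build_request("/summarize-article",
                         {"url": url, "summary_length": summary_length})

def send(request, timeout=30):
    """Send the request; assumes the server replies with a JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running locally, `send(search_request("machine learning"))` would issue the same request as the `POST /search` example.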
All contributions are greatly appreciated!
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Project contributors Ahmad Shahmeer & Sikander Hayat Khan for their help in developing v1 of the search engine
- The Anatomy of a Large-Scale Hypertextual Web Search Engine
- Kaggle Medium Articles Dataset
- FastAPI and Next.js communities
- Google Gemini AI for summarization
Built with ❤️ for the Medium community