AI-Powered Semantic Search Engine for GitHub Repositories
Developer Archives is a backend system that enables semantic search across GitHub repositories using AI embeddings and vector similarity. The system automatically discovers, processes, and indexes repositories, making them searchable through natural language queries. For the frontend code, see https://github.com/sarpbilgic/developer-archives-frontend
Developer Archives consists of three main components:
- Discoverer (Lambda): Continuously discovers new repositories from GitHub and queues them for processing
- Processor (Lambda): Processes repositories by extracting README content, generating embeddings, and storing metadata
- API (Lambda/API Gateway): Provides a REST API for semantic search and repository details
    GitHub API → Discoverer → SQS Queue → Processor → PostgreSQL (pgvector) → API → Frontend
                      ↓                       ↓
                  DynamoDB           S3 (README Storage)
              (State Tracking)     (Embedding Generation)
- Discovery Phase: The Discoverer Lambda fetches repositories from the GitHub API, tracks its state in DynamoDB, and sends the IDs of saved repositories to SQS
- Processing Phase: The Processor Lambda reads from SQS, extracts each README, and generates embeddings using Sentence Transformers
- Storage Phase: Embeddings are stored in PostgreSQL with the pgvector extension; README content is stored in S3
- Search Phase: The API uses vector similarity search to find relevant repositories based on semantic meaning
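To make the search phase concrete, here is a minimal pure-Python sketch of ranking by cosine similarity. It is only an illustration: in this system the comparison is done inside PostgreSQL by pgvector, and the function names, repository names, and 3-dimensional toy vectors below are assumptions (real all-mpnet-base-v2 embeddings have 768 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def rank_repositories(query_vec, repo_vecs, top_k=3):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in repo_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings, for illustration only.
repos = {
    "user/vision-lib": [0.9, 0.1, 0.0],
    "user/web-framework": [0.0, 0.2, 0.9],
    "user/ml-toolkit": [0.7, 0.6, 0.1],
}
ranking = rank_repositories([1.0, 0.2, 0.0], repos, top_k=2)
print(ranking)
```

A repository whose embedding points in nearly the same direction as the query embedding scores close to 1, regardless of vector length, which is why cosine similarity suits comparing texts of different sizes.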
    ┌─────────────────────────────────────────────────────────────┐
    │                          AWS Cloud                          │
    │                                                             │
    │  ┌──────────────┐      ┌──────────────┐                     │
    │  │  Discoverer  │─────▶│     SQS      │                     │
    │  │    Lambda    │      │    Queue     │                     │
    │  └──────┬───────┘      └──────┬───────┘                     │
    │         │                     │                             │
    │         ▼                     ▼                             │
    │  ┌──────────────┐      ┌──────────────┐                     │
    │  │   DynamoDB   │      │  Processor   │                     │
    │  │ (State Store)│      │    Lambda    │                     │
    │  └──────────────┘      └──────┬───────┘                     │
    │                               │                             │
    │                               ▼                             │
    │                        ┌──────────────┐                     │
    │                        │  PostgreSQL  │                     │
    │                        │  (pgvector)  │                     │
    │                        └──────┬───────┘                     │
    │                               │                             │
    │  ┌──────────────┐             │                             │
    │  │      S3      │◄────────────┘                             │
    │  │   (README)   │                                           │
    │  └──────────────┘                                           │
    │         ▲                                                   │
    │         │                                                   │
    │  ┌──────┴───────┐                                           │
    │  │  API Lambda  │                                           │
    │  │ (API Gateway)│                                           │
    │  └──────────────┘                                           │
    └─────────────────────────────────────────────────────────────┘
| Technology | Purpose | Version |
|---|---|---|
| Python | Primary language | 3.11+ |
| FastAPI | REST API framework | Latest |
| PostgreSQL | Primary database | 16+ |
| pgvector | Vector similarity search | Latest |
| SQLAlchemy | ORM | 2.0+ |
| Alembic | Database migrations | Latest |
| Technology | Purpose |
|---|---|
| Sentence Transformers | Generate text embeddings |
| all-mpnet-base-v2 | Embedding model (768 dimensions) |
| Service | Purpose |
|---|---|
| Lambda | Serverless compute (API, Discoverer, Processor) |
| API Gateway | HTTP API endpoint |
| SQS | Message queue for processing |
| DynamoDB | State tracking for discovery |
| S3 | README content storage |
| RDS PostgreSQL | Primary database with pgvector |
| Parameter Store | Configuration management |
| CloudWatch | Logging and monitoring |
- boto3: AWS SDK
- httpx: Async HTTP client
- beautifulsoup4: HTML parsing
- pydantic: Data validation
- mangum: ASGI adapter for Lambda
- Natural language queries (e.g., "machine learning frameworks for computer vision")
- Vector similarity search using cosine similarity
- AI-powered understanding of repository content
- Filter by programming language
- Minimum star count
- Repository topics
- Combine multiple filters
- Repository statistics (stars, forks, watchers, issues)
- Language breakdown
- Topics and tags
- Owner information
- Last update timestamps
- Serverless components (auto-scaling)
- Asynchronous processing
- Efficient vector indexing (IVFFlat)
- S3 for large content storage
- Automatic repository discovery
- README extraction and cleaning
- Embedding generation
- Incremental updates
- Error handling and retry logic
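The README cleaning step is not specified in detail here. As a rough sketch only, assuming the goal is to strip HTML tags, badges, and link URLs before embedding, it could look like the following. It uses the standard library's `html.parser` rather than beautifulsoup4 so that it runs without dependencies; `clean_readme` and `max_chars` are hypothetical names, not the project's actual API.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_readme(raw, max_chars=5000):
    """Reduce a README to plain text suitable for embedding (a sketch)."""
    # Drop markdown images and badges: ![alt](url)
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", " ", raw)
    # Keep link text but drop the URL: [text](url) -> text
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Strip any inline HTML, keeping only the text nodes
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flushes text that is not followed by a tag
    text = " ".join(parser.parts)
    # Collapse runs of whitespace and truncate
    text = re.sub(r"\s+", " ", text).strip()
    return text[:max_chars]


sample = (
    '<h1 align="center">My Lib</h1>\n'
    "![build](https://example.com/badge.svg)\n"
    "A [fast](https://example.com) parser."
)
cleaned = clean_readme(sample)
print(cleaned)
```

Truncating the text is a design choice for this sketch: embedding models accept a limited input length, so very long READMEs are cut off rather than rejected.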
- Python 3.11+
- PostgreSQL 16+ with pgvector extension
- AWS Account (for deployment)
- GitHub Personal Access Token
- Clone the repository

      git clone <repository-url>
      cd developer-archives

- Create virtual environment

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install dependencies

      pip install -r requirements.txt

- Set up PostgreSQL with pgvector

      CREATE DATABASE developer_archives;
      \c developer_archives
      CREATE EXTENSION vector;

- Create .env file

      DB_HOST=localhost
      DB_PORT=5432
      DB_NAME=db_name
      DB_USERNAME=your_username
      DB_PASSWORD=your_password
      GITHUB_API_TOKEN=your_github_token
      AWS_REGION=your_aws_region
      SQS_URL=your_sqs_url
      DYNAMODB_TABLE_NAME=your_table_name
      S3_README_BUCKET=your_bucket_name

- Run database migrations

      alembic upgrade head

- Start the API server

      python -m app.main
      # or
      uvicorn app.main:app --reload

The API will be available at http://localhost:8000
- DB_HOST: PostgreSQL host
- DB_PORT: PostgreSQL port (default: 5432)
- DB_NAME: Database name
- DB_USERNAME: Database username
- DB_PASSWORD: Database password
- AWS_REGION: AWS region (e.g., eu-central-1)
- SQS_URL: SQS queue URL for processing
- DYNAMODB_TABLE_NAME: DynamoDB table for state tracking
- S3_README_BUCKET: S3 bucket for README storage
- GITHUB_API_TOKEN: GitHub Personal Access Token for API access
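As an illustration of how the database variables above might be consumed, the sketch below reads them into a small settings object and builds a PostgreSQL connection URL. The `Settings` class and `load_settings` function are hypothetical and not taken from the project; the only behaviour drawn from this document is the variable names and the 5432 default for DB_PORT.

```python
import os
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class Settings:
    db_host: str
    db_port: int
    db_name: str
    db_username: str
    db_password: str

    @property
    def database_url(self) -> str:
        # Percent-encode credentials so characters like '@' or '/' stay valid in a URL.
        user = quote(self.db_username, safe="")
        password = quote(self.db_password, safe="")
        return (
            f"postgresql://{user}:{password}"
            f"@{self.db_host}:{self.db_port}/{self.db_name}"
        )


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    required = ("DB_HOST", "DB_NAME", "DB_USERNAME", "DB_PASSWORD")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return Settings(
        db_host=env["DB_HOST"],
        db_port=int(env.get("DB_PORT", "5432")),  # documented default
        db_name=env["DB_NAME"],
        db_username=env["DB_USERNAME"],
        db_password=env["DB_PASSWORD"],
    )


settings = load_settings({
    "DB_HOST": "localhost",
    "DB_NAME": "developer_archives",
    "DB_USERNAME": "dev",
    "DB_PASSWORD": "p@ss/word",
})
print(settings.database_url)
```

Failing fast on missing variables surfaces configuration mistakes at startup instead of at the first database call.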
Production: https://your-api-gateway-url.amazonaws.com/
Local: http://localhost:8000
Semantic search for repositories
Query Parameters:
- query (required): Search query
- language (optional): Filter by programming language
- min_stars (optional): Minimum star count
- topics (optional): Filter by topics
- page (optional): Page number (default: 1)
- page_size (optional): Results per page (default: 12)
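One way these parameters could translate into a pgvector query is sketched below. It only builds a SQL string and a parameter dictionary, so it runs without a database. The table name `repositories`, the `embedding` column, and storing `topics` as a PostgreSQL array are assumptions made for illustration; `<=>` is pgvector's cosine distance operator, so `1 - distance` gives cosine similarity. The caller would still need to supply the `query_embedding` parameter.

```python
def build_search_query(language=None, min_stars=None, topics=None,
                       page=1, page_size=12):
    """Return (sql, params) for a filtered, paginated similarity search."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")

    conditions = []
    # Page numbers are 1-based, so page 1 starts at offset 0.
    params = {"limit": page_size, "offset": (page - 1) * page_size}

    if language:
        conditions.append("primary_language = :language")
        params["language"] = language
    if min_stars is not None:
        conditions.append("stars >= :min_stars")
        params["min_stars"] = min_stars
    if topics:
        # Assumes an array column; && is PostgreSQL's array-overlap operator.
        conditions.append("topics && :topics")
        params["topics"] = list(topics)

    where = ("WHERE " + " AND ".join(conditions) + " ") if conditions else ""
    sql = (
        "SELECT id, full_name, "
        "1 - (embedding <=> :query_embedding) AS similarity "
        "FROM repositories "
        f"{where}"
        "ORDER BY embedding <=> :query_embedding "
        "LIMIT :limit OFFSET :offset"
    )
    return sql, params


sql, params = build_search_query(language="Go", min_stars=10000, page=2)
print(sql)
print(params)
```

Using named placeholders instead of string interpolation keeps user input out of the SQL text, which matters because every filter value comes straight from the query string.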
Example:
curl "https://api-url/api/v1/search?query=cloud+native+grpc+service&language=Go&min_stars=10000&page=1"

Get detailed information about a specific repository
Response:
{
"id": 1,
"github_id": 123456,
"full_name": "user/repo",
"description": "A machine learning library",
"github_url": "https://github.com/user/repo",
"stars": 5000,
"forks": 1000,
"watchers": 500,
"open_issues": 50,
"primary_language": "Python",
"topics": ["machine-learning", "deep-learning"],
"languages_breakdown": {
"Python": 150000,
"JavaScript": 50000
},
"owner_login": "user",
"owner_avatar_url": "https://avatars.githubusercontent.com/...",
"owner_url": "https://github.com/user",
"created_at_github": "2020-01-01T00:00:00Z",
"pushed_at_github": "2024-01-01T00:00:00Z",
"updated_at_github": "2024-01-01T00:00:00Z"
}

Get README content for a repository
Response:
# Repository Title
Repository README content in markdown format...
- API Lambda + API Gateway
  - FastAPI application wrapped with Mangum
  - HTTP API Gateway for routing
  - Deployed using Docker container
- Discoverer Lambda
  - Scheduled execution (CloudWatch Events)
  - Discovers new repositories
  - Updates DynamoDB state
  - Sends messages to SQS
- Processor Lambda
  - Triggered by SQS messages
  - Processes repositories
  - Generates embeddings
  - Stores in PostgreSQL and S3
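For the SQS-triggered Processor, a handler skeleton might look like the sketch below. The `Records`, `body`, and `messageId` fields and the `batchItemFailures` return shape follow the standard Lambda SQS integration (partial batch responses need `ReportBatchItemFailures` enabled on the event source mapping). The JSON message format with a `github_id` key and the `process_repository` placeholder are assumptions, not the project's actual code.

```python
import json


def process_repository(github_id):
    """Placeholder for the real work: fetch the README, embed it, store results."""
    if not isinstance(github_id, int):
        raise ValueError(f"unexpected github_id: {github_id!r}")


def handler(event, context):
    """Process each SQS record; report failures so only those are retried."""
    failures = []
    for record in event.get("Records", []):
        try:
            body = json.loads(record["body"])
            process_repository(body["github_id"])
        except Exception:
            # A real handler would also log the error before reporting it.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


# Simulated SQS event with one valid and one malformed message.
event = {
    "Records": [
        {"messageId": "m1", "body": json.dumps({"github_id": 123456})},
        {"messageId": "m2", "body": "not json"},
    ]
}
result = handler(event, None)
print(result)
```

Reporting failures per message means one bad repository does not force the whole batch back onto the queue.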
- Lambda Functions: 3 (API, Discoverer, Processor)
- API Gateway: HTTP API
- RDS: PostgreSQL with pgvector
- SQS: Standard queue
- DynamoDB: On-demand table
- S3: Bucket for README storage
- Parameter Store: Configuration storage
- CloudWatch: Logs and monitoring
- IAM: Roles and policies
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Sentence Transformers: For the embedding model
- pgvector: For efficient vector similarity search
- FastAPI: For the excellent API framework
- AWS: For serverless infrastructure
For questions or support, please open an issue on GitHub.
Built using Python, FastAPI, and AWS