This document outlines the production-grade architecture for AutoDataLab, a system for automated data cleaning, EDA, and ML model training.
- **Frontend**
  - Technology: React + TypeScript (Vite)
  - Hosting: Vercel / Netlify / Cloud provider static hosting
  - Communication: HTTPS REST API (authenticated with JWT)
- **Backend API**
  - Technology: FastAPI (Python)
  - Authentication: Auth0 or AWS Cognito
  - Database: PostgreSQL (AWS RDS / DigitalOcean / Render)
  - Job Queue: Celery + Redis
  - Storage: S3 for artifacts (MinIO for local dev)
  - ML Libraries: pandas, scikit-learn, ydata-profiling, shap, joblib
  - Report Generation: Jinja2 templates, optional WeasyPrint for PDF
- **Background Workers**
  - Technology: Celery workers in separate containers
  - Tasks: Data cleaning, EDA, model training, report generation
- **Infrastructure & DevOps**
  - Local Dev: Docker + Docker Compose
  - CI/CD: GitHub Actions
  - Production Deploy: AWS ECS (Fargate), Render, or Google Cloud Run
  - Monitoring: Sentry + Prometheus + Grafana
  - Secrets: AWS Secrets Manager / HashiCorp Vault
```shell
# From project root
docker-compose build
docker-compose up
```

This starts:
- PostgreSQL (port 5432)
- Redis (port 6379)
- MinIO (ports 9000, 9001)
- FastAPI Backend (port 8000)
- Celery Worker
The frontend is already configured to connect to the backend at http://localhost:8000.
To use MSW (Mock Service Worker) for development without backend:
- MSW is already set up in the frontend
- To switch between the real backend and MSW, update `VITE_API_URL` in `project-frontend/.env`
- Open frontend: http://localhost:5174
- Open backend docs: http://localhost:8000/docs
- Upload a CSV file through the frontend
- Watch the worker logs: `docker-compose logs -f worker`
- Check job status and view results
- **AWS Account Setup**
  - Create S3 bucket for artifacts
  - Set up RDS PostgreSQL instance
  - Launch ElastiCache Redis cluster
  - Create IAM roles with appropriate permissions
- **Container Registry**
  - Set up ECR (Elastic Container Registry)
  - Configure GitHub Actions to build and push images
- **Secrets Management**
  - Store credentials in AWS Secrets Manager
  - Configure environment variables
- **ECS/Fargate Setup**
  - Create ECS cluster
  - Define task definitions for backend and worker
  - Configure load balancer
  - Set up auto-scaling
- **Database Migration**
  - Set up Alembic migrations
  - Run initial migration on RDS
- **Auth0 Integration**
  - Create Auth0 account and application
  - Configure social login providers
  - Implement JWT validation in FastAPI
  - Add protected routes
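The JWT validation step can be sketched as below. Auth0 actually issues RS256-signed tokens that should be verified against the tenant's JWKS (typically with PyJWT or python-jose); this stdlib-only HS256 version just illustrates the checks the FastAPI dependency has to enforce, and every name in it is hypothetical:

```python
# Stdlib sketch of JWT validation: signature, audience, and expiry checks.
# Real Auth0 tokens are RS256 and must be verified against the JWKS.
import base64, hashlib, hmac, json, time

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def _unb64url(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def make_token(claims: dict, secret: str) -> str:
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    sig = hmac.new(secret.encode(), f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(sig)}"

def verify_token(token: str, secret: str, audience: str) -> dict:
    header, payload, sig = token.split(".")
    expected = hmac.new(secret.encode(), f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _unb64url(sig)):
        raise ValueError("invalid signature")
    claims = json.loads(_unb64url(payload))
    if claims.get("aud") != audience:
        raise ValueError("wrong audience")
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
    return claims
```

In FastAPI this logic would live in a dependency that reads the `Authorization: Bearer` header and raises `HTTPException(status_code=401)` on any of these failures.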
- **Frontend Auth**
  - Install Auth0 React SDK
  - Add login/logout flows
  - Attach JWT tokens to API requests
- **Monitoring**
  - Set up Sentry for error tracking
  - Configure CloudWatch logs
  - Add health check endpoints
- **CI/CD Pipeline**
  - GitHub Actions for automated testing
  - Automated deployment to staging
  - Manual approval for production
- **LLM Backend Service**
  - Create microservice for LLM calls
  - Implement rate limiting
  - Add cost tracking per user
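Per-user rate limiting and cost tracking could be sketched with a token bucket. Bucket sizes and the price constant are made-up illustration values, not real provider pricing:

```python
# Sketch of per-user rate limiting + cost tracking for the LLM microservice.
import time
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    capacity: float = 10.0     # max burst of requests
    refill_rate: float = 0.5   # requests regained per second
    tokens: float = 10.0
    last: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class LLMGateway:
    PRICE_PER_1K_TOKENS = 0.002  # assumed flat price, for illustration only

    def __init__(self) -> None:
        self.buckets: dict[str, TokenBucket] = defaultdict(TokenBucket)
        self.spend: dict[str, float] = defaultdict(float)

    def call(self, user_id: str, prompt_tokens: int) -> bool:
        if not self.buckets[user_id].allow():
            return False  # would map to HTTP 429 in the real service
        self.spend[user_id] += prompt_tokens / 1000 * self.PRICE_PER_1K_TOKENS
        return True
```

In production the buckets and spend counters would live in Redis (already in the stack) rather than process memory, so they survive restarts and are shared across replicas.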
- **Frontend Integration**
  - Add "Ask Assistant" button
  - Implement streaming responses
  - Display LLM insights
- Never send entire dataset to LLM without consent
- Use aggregations and masked samples
- Store artifacts with short-lived presigned URLs
- HTTPS everywhere in production
- Rate limiting on upload endpoint
- File size limits and type validation
- JWT token validation
- CORS configuration
- S3 bucket policies (private by default)
- Presigned URLs with 10-minute expiration
- Encryption at rest and in transit
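The "aggregations and masked samples" rule above can be sketched as a summarizer that sends the LLM only column statistics plus a few masked rows. Field names and masking rules here are illustrative:

```python
# Sketch: build a privacy-preserving payload for the LLM instead of raw data.
import statistics

def summarize_for_llm(rows: list[dict], mask_fields: set[str]) -> dict:
    numeric: dict[str, list[float]] = {}
    for row in rows:
        for key, val in row.items():
            if isinstance(val, (int, float)):
                numeric.setdefault(key, []).append(float(val))
    aggregates = {
        k: {"mean": statistics.fmean(v), "min": min(v), "max": max(v)}
        for k, v in numeric.items()
    }
    # Send only a handful of rows, with sensitive fields masked out.
    sample = [
        {k: ("***" if k in mask_fields else v) for k, v in row.items()}
        for row in rows[:3]
    ]
    return {"row_count": len(rows), "aggregates": aggregates, "sample": sample}
```

The LLM service then receives `summarize_for_llm(rows, {"email", "name"})` instead of the dataset itself, satisfying the consent rule by default.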
- RDS PostgreSQL (db.t3.micro): ~$15
- ElastiCache Redis (cache.t3.micro): ~$15
- S3 Storage: ~$1-5 (depending on usage)
- ECS Fargate: ~$30-50 (2 tasks: backend + worker)
- Data Transfer: ~$5-10
Total Estimated: ~$65-95/month for small-scale production
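As a quick sanity check, the total is just the sum of the line items, taking the low and high end of each range:

```python
# Monthly estimate, item by item (USD): RDS + Redis + S3 + Fargate + transfer.
low = 15 + 15 + 1 + 30 + 5
high = 15 + 15 + 5 + 50 + 10
print(f"${low}-${high}/month")  # in line with the quoted ~$65-95/month
```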
- Use S3 lifecycle policies to archive old artifacts
- Implement request caching with Redis
- Use spot instances for worker tasks
- Set up CloudWatch alarms for cost monitoring
- Add more ECS tasks for backend (behind ALB)
- Add more Celery workers for heavy processing
- Use SQS instead of Redis for very large job queues
- Add Redis caching for frequently accessed data
- Use CDN (CloudFront) for static assets
- Implement database connection pooling
- Add database read replicas for read-heavy workloads
- ✅ Set up local development environment
- ⏳ Test upload and job processing locally
- ⏳ Create AWS account and provision resources
- ⏳ Implement authentication (Auth0)
- ⏳ Set up CI/CD pipeline
- ⏳ Deploy to staging
- ⏳ Load testing and optimization
- ⏳ Production deployment
- FastAPI Documentation: https://fastapi.tiangolo.com
- Celery Documentation: https://docs.celeryq.dev
- Auth0 Documentation: https://auth0.com/docs
- AWS Documentation: https://docs.aws.amazon.com