
AutoDataLab - Production Architecture Guide

Overview

This document outlines the production-grade architecture for AutoDataLab, a system for automated data cleaning, exploratory data analysis (EDA), and ML model training.

Architecture Components

Frontend

  • Technology: React + TypeScript (Vite)
  • Hosting: Vercel / Netlify / Cloud provider static hosting
  • Communication: HTTPS REST API (authenticated with JWT)

Backend

  • Technology: FastAPI (Python)
  • Authentication: Auth0 or AWS Cognito
  • Database: PostgreSQL (AWS RDS / DigitalOcean / Render)
  • Job Queue: Celery + Redis
  • Storage: S3 for artifacts (MinIO for local dev)
  • ML Libraries: pandas, scikit-learn, ydata-profiling, shap, joblib
  • Report Generation: Jinja2 templates, optional WeasyPrint for PDF

Worker

  • Technology: Celery workers in separate containers
  • Tasks: Data cleaning, EDA, model training, report generation

Infrastructure

  • Local Dev: Docker + Docker Compose
  • CI/CD: GitHub Actions
  • Production Deploy: AWS ECS (Fargate), Render, or Google Cloud Run
  • Monitoring: Sentry + Prometheus + Grafana
  • Secrets: AWS Secrets Manager / HashiCorp Vault

Local Development Setup

Step 1: Start Docker Services

# From project root
docker-compose build
docker-compose up

This starts:

  • PostgreSQL (port 5432)
  • Redis (port 6379)
  • MinIO (ports 9000, 9001)
  • FastAPI Backend (port 8000)
  • Celery Worker

Step 2: Configure Frontend

The frontend is already configured to connect to the backend at http://localhost:8000.

To use MSW (Mock Service Worker) for development without the backend:

  • MSW is already set up in the frontend
  • To switch between the real backend and MSW, update VITE_API_URL in project-frontend/.env

Step 3: Test the System

  1. Open frontend: http://localhost:5174
  2. Open backend docs: http://localhost:8000/docs
  3. Upload a CSV file through the frontend
  4. Watch the worker logs: docker-compose logs -f worker
  5. Check job status and view results
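Step 5 can also be scripted: poll the job-status endpoint with exponential backoff instead of hammering it. A minimal sketch, assuming a GET /jobs/{id} route returning a JSON "status" field (the path and status values here are assumptions, not the documented API):

```python
import json
import time
import urllib.request


def backoff_delays(base=1.0, factor=2.0, cap=30.0, retries=6):
    """Exponential backoff schedule: 1, 2, 4, ... seconds, capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(retries)]


def poll_job(job_id, api_url="http://localhost:8000"):
    """Poll until the job reports a terminal state or retries run out."""
    for delay in backoff_delays():
        # The /jobs/{id} path is an assumption; match it to the real API.
        with urllib.request.urlopen(f"{api_url}/jobs/{job_id}") as resp:
            status = json.load(resp).get("status")
        if status in ("succeeded", "failed"):
            return status
        time.sleep(delay)
    raise TimeoutError(f"job {job_id} did not finish")
```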

Production Deployment Roadmap

Week 1-2: Infrastructure Setup

  1. AWS Account Setup

    • Create S3 bucket for artifacts
    • Set up RDS PostgreSQL instance
    • Launch ElastiCache Redis cluster
    • Create IAM roles with appropriate permissions
  2. Container Registry

    • Set up ECR (Elastic Container Registry)
    • Configure GitHub Actions to build and push images
  3. Secrets Management

    • Store credentials in AWS Secrets Manager
    • Configure environment variables
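The "configure environment variables" step usually means the app reads all credentials from the environment, with Secrets Manager injecting them in production (e.g. via the ECS task definition) and docker-compose supplying them locally. A minimal sketch; the variable names and defaults are illustrative:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: str
    s3_bucket: str


def load_settings(env=os.environ) -> Settings:
    """Read configuration from environment variables with local-dev
    fallbacks. In production these values come from AWS Secrets
    Manager; the code never hard-codes credentials."""
    return Settings(
        database_url=env.get(
            "DATABASE_URL", "postgresql://localhost:5432/autodatalab"
        ),
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        s3_bucket=env.get("S3_BUCKET", "autodatalab-artifacts"),
    )
```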

Week 3: Backend Deployment

  1. ECS/Fargate Setup

    • Create ECS cluster
    • Define task definitions for backend and worker
    • Configure load balancer
    • Set up auto-scaling
  2. Database Migration

    • Set up Alembic migrations
    • Run initial migration on RDS
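Alembic handles migrations in practice (`alembic upgrade head`, with applied versions recorded in an `alembic_version` table). A minimal sketch of the underlying idea, useful for reasoning about what the step does; the version numbers and callables are illustrative:

```python
def run_migrations(applied: set, migrations: dict) -> list:
    """Apply pending migrations in version order: skip versions already
    recorded, run the rest oldest-first, record each as it succeeds.
    `migrations` maps a version string to a callable that performs it."""
    ran = []
    for version in sorted(migrations):
        if version in applied:
            continue
        migrations[version]()   # e.g. execute the DDL for this revision
        applied.add(version)    # Alembic persists this in alembic_version
        ran.append(version)
    return ran
```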

Week 4: Authentication

  1. Auth0 Integration

    • Create Auth0 account and application
    • Configure social login providers
    • Implement JWT validation in FastAPI
    • Add protected routes
  2. Frontend Auth

    • Install Auth0 React SDK
    • Add login/logout flows
    • Attach JWT tokens to API requests

Week 5: Monitoring & CI/CD

  1. Monitoring

    • Set up Sentry for error tracking
    • Configure CloudWatch logs
    • Add health check endpoints
  2. CI/CD Pipeline

    • GitHub Actions for automated testing
    • Automated deployment to staging
    • Manual approval for production
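The health check endpoints above typically aggregate per-dependency probes into one payload that the load balancer and CloudWatch can poll. A minimal sketch of that aggregation; the dependency names and probe signatures are assumptions:

```python
def health_status(checks: dict) -> dict:
    """Aggregate named dependency probes into one health payload.

    `checks` maps a dependency name (e.g. "postgres", "redis") to a
    zero-argument callable returning True when the dependency is
    reachable. A FastAPI /health route would just return this dict.
    """
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises counts as unhealthy, not a crash.
            results[name] = False
    return {
        "status": "ok" if all(results.values()) else "degraded",
        "checks": results,
    }
```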

Week 6: LLM Integration

  1. LLM Backend Service

    • Create microservice for LLM calls
    • Implement rate limiting
    • Add cost tracking per user
  2. Frontend Integration

    • Add "Ask Assistant" button
    • Implement streaming responses
    • Display LLM insights
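The rate limiting and per-user cost tracking in the LLM service can be sketched with a token bucket plus a spend accumulator. A stdlib-only sketch; the per-1k-token price and the one-bucket-per-user wiring are placeholders, not real pricing:

```python
import time
from collections import defaultdict


class TokenBucket:
    """Allow `rate` requests/second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class CostTracker:
    """Accumulate LLM spend per user (the price here is made up)."""

    def __init__(self, price_per_1k_tokens: float = 0.002):
        self.price = price_per_1k_tokens
        self.spend = defaultdict(float)

    def record(self, user_id: str, tokens: int) -> float:
        self.spend[user_id] += tokens / 1000 * self.price
        return self.spend[user_id]
```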

Security Considerations

Data Privacy

  • Never send the full dataset to an LLM without explicit user consent
  • Send aggregations and masked samples instead of raw rows
  • Serve artifacts via short-lived presigned URLs
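"Masked samples" can be as simple as replacing sensitive fields with a short, stable hash before any rows reach the LLM, so it sees the data's shape but no raw PII. A sketch; the column list is illustrative, not a complete PII taxonomy:

```python
import hashlib

SENSITIVE = {"email", "name", "phone", "ssn"}  # illustrative column names


def masked_sample(rows: list, k: int = 5) -> list:
    """Return at most `k` rows with sensitive fields replaced by a
    short deterministic hash, so the LLM sees structure, not values."""
    out = []
    for row in rows[:k]:
        masked = {}
        for col, val in row.items():
            if col in SENSITIVE:
                digest = hashlib.sha256(str(val).encode()).hexdigest()[:8]
                masked[col] = f"<{col}:{digest}>"
            else:
                masked[col] = val
        out.append(masked)
    return out
```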

API Security

  • HTTPS everywhere in production
  • Rate limiting on upload endpoint
  • File size limits and type validation
  • JWT token validation
  • CORS configuration
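The file size and type checks on the upload endpoint can be enforced before any bytes reach the worker. A sketch of the guard; the 50 MB cap and accepted extensions are placeholders to tune per deployment, not documented limits:

```python
MAX_UPLOAD_BYTES = 50 * 1024 * 1024        # assumed 50 MB cap
ALLOWED_EXTENSIONS = {".csv", ".parquet"}  # assumed accepted formats


def validate_upload(filename: str, size_bytes: int) -> None:
    """Reject files that are too large or of an unexpected type.

    In FastAPI this would run in the upload route before the file is
    stored or queued for the worker.
    """
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {ext or 'none'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(f"file exceeds {MAX_UPLOAD_BYTES} byte limit")
```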

Storage Security

  • S3 bucket policies (private by default)
  • Presigned URLs with 10-minute expiration
  • Encryption at rest and in transit
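Real presigned URLs come from boto3's generate_presigned_url (with ExpiresIn for the lifetime); the sketch below only illustrates the expiry-plus-signature mechanism behind them, using a 10-minute (600 s) default to match the policy above:

```python
import hashlib
import hmac
import time


def presign(path: str, secret: bytes, ttl: int = 600, now=None) -> str:
    """Append an expiry timestamp and HMAC signature to a path."""
    expires = int(now if now is not None else time.time()) + ttl
    msg = f"{path}?expires={expires}"
    sig = hmac.new(secret, msg.encode(), hashlib.sha256).hexdigest()
    return f"{msg}&sig={sig}"


def verify(url: str, secret: bytes, now=None) -> bool:
    """Accept only unexpired URLs with a valid signature."""
    msg, _, sig = url.rpartition("&sig=")
    expires = int(msg.rpartition("expires=")[2])
    good = hmac.compare_digest(
        sig, hmac.new(secret, msg.encode(), hashlib.sha256).hexdigest()
    )
    current = now if now is not None else time.time()
    return good and current < expires
```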

Cost Optimization

AWS Services (Estimated Monthly)

  • RDS PostgreSQL (db.t3.micro): ~$15
  • ElastiCache Redis (cache.t3.micro): ~$15
  • S3 Storage: ~$1-5 (depending on usage)
  • ECS Fargate: ~$30-50 (2 tasks: backend + worker)
  • Data Transfer: ~$5-10

Total Estimated: ~$65-95/month for small-scale production
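Summing the per-service ranges above confirms the total (these figures are rough estimates and vary by region and usage):

```python
# Monthly cost estimate (USD) as (low, high) per service,
# mirroring the list above.
ESTIMATES = {
    "RDS PostgreSQL (db.t3.micro)": (15, 15),
    "ElastiCache Redis (cache.t3.micro)": (15, 15),
    "S3 storage": (1, 5),
    "ECS Fargate (backend + worker)": (30, 50),
    "Data transfer": (5, 10),
}


def monthly_total(estimates=ESTIMATES):
    """Add the low ends and high ends separately to get the range."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low, high
```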

Optimization Tips

  • Use S3 lifecycle policies to archive old artifacts
  • Implement request caching with Redis
  • Use spot instances for worker tasks
  • Set up CloudWatch alarms for cost monitoring
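The Redis request cache amounts to "store the response under a key, let it expire after a TTL". A tiny in-process stand-in for reasoning about that pattern (the real implementation would use Redis SETEX, not a Python dict):

```python
import time


class TTLCache:
    """Minimal in-process TTL cache: entries expire `ttl` seconds
    after they are written."""

    def __init__(self, ttl: float = 60.0, clock=time.monotonic):
        self.ttl, self.clock, self._store = ttl, clock, {}

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        item = self._store.get(key)
        if item is None:
            return default
        value, expires = item
        if self.clock() >= expires:
            del self._store[key]  # lazily evict expired entries
            return default
        return value
```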

Scaling Considerations

Horizontal Scaling

  • Add more ECS tasks for backend (behind ALB)
  • Add more Celery workers for heavy processing
  • Use SQS instead of Redis for very large job queues

Performance Optimization

  • Add Redis caching for frequently accessed data
  • Use CDN (CloudFront) for static assets
  • Implement database connection pooling
  • Add database read replicas for read-heavy workloads
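Connection pooling means opening a fixed set of database connections once and checking them out per request instead of reconnecting each time; SQLAlchemy provides this built in for the real backend. A minimal sketch of the mechanism, with `factory` standing in for whatever opens a connection:

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: connections are created once via `factory`
    and checked out/in rather than opened per request."""

    def __init__(self, factory, size: int = 5):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(factory())

    @contextmanager
    def connection(self, timeout: float = 5.0):
        conn = self._pool.get(timeout=timeout)  # blocks if pool is drained
        try:
            yield conn
        finally:
            self._pool.put(conn)                # always return to the pool
```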

Next Steps

  1. ✅ Set up local development environment
  2. ⏳ Test upload and job processing locally
  3. ⏳ Create AWS account and provision resources
  4. ⏳ Implement authentication (Auth0)
  5. ⏳ Set up CI/CD pipeline
  6. ⏳ Deploy to staging
  7. ⏳ Load testing and optimization
  8. ⏳ Production deployment

Support & Resources