Build self-healing backup service for platform databases #1

@wonderwomancode

Overview

Design and implement a robust, self-healing backup service that ensures all platform databases are continuously backed up to multiple independent storage locations. Essentially, we take the several GitHub Actions scripts spread across our repos and replace them with a single service that we can also use in the Cloud API later.

Current State

We have basic backup workflows in place:

  • service-secrets: Infisical API export → age encryption → Storacha + GitHub Artifacts
  • service-cloud-api: PostgreSQL backup workflow (needs sidecar implementation)
  • service-auth: SQLite backup workflow (needs sidecar implementation)

Current Limitations:

  • PostgreSQL/SQLite are internal to Akash deployments (not accessible from GitHub Actions)
  • No automatic verification that backups are valid
  • No alerting on backup failures
  • Manual restore process
  • No backup rotation/retention management

Requirements

Core Features

  • Akash Backup Sidecars: Add backup containers to Akash SDLs that run independently
  • Multi-destination storage: Storacha (IPFS+Filecoin) + GitHub Artifacts (30-day retention)
  • Encryption at rest: age encryption with secure key management
  • Backup verification: Automatic restore tests to verify backups are valid
  • Self-healing: Automatic retry on failure with exponential backoff
  • Alerting: Notifications on backup failures (Slack/Discord/email)
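
As a rough illustration of the self-healing retry requirement above, here is a minimal sketch of retry with exponential backoff. The function names, the default of 5 attempts, the 300-second cap, and the jitter are all assumptions made for the example, not decisions from this issue.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 1.0, cap: float = 300.0) -> list[float]:
    """Delay before each retry: base * 2**n seconds, never above cap."""
    return [min(cap, base * 2 ** n) for n in range(retries)]


def run_with_retry(
    task: Callable[[], T],
    attempts: int = 5,          # hypothetical default
    base: float = 1.0,
    cap: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    jitter: bool = True,
) -> T:
    """Call task() until it succeeds or the attempts are exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts - 1, base, cap)
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller alert on the failure
            delay = delays[attempt]
            if jitter:
                # Randomize the wait so several sidecars do not retry in lockstep.
                delay = random.uniform(0, delay)
            sleep(delay)
    raise AssertionError("unreachable")
```

The final re-raise is where an alerting hook (Slack/Discord/email) would naturally attach, since it only fires once every retry has failed.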

Advanced Features

  • Point-in-time recovery: WAL archiving for PostgreSQL
  • Incremental backups: Only back up changes since the last backup
  • Backup catalog: Database/API to track all backups, CIDs, and restore points
  • Restore API: One-click restore from any backup point
  • Cross-region redundancy: Store backups in multiple geographic locations
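
To make the backup catalog idea more concrete, the sketch below models one catalog entry and a JSON document holding a list of them. The field names, the `BackupRecord` type, and the helper functions are hypothetical; the issue only says the catalog should track backups, CIDs, and restore points.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class BackupRecord:
    service: str      # e.g. "service-auth"
    engine: str       # "postgresql" or "sqlite"
    cid: str          # content identifier returned by the upload step
    sha256: str       # checksum of the encrypted payload
    size_bytes: int
    created_at: str   # ISO 8601 timestamp in UTC


def make_record(service: str, engine: str, cid: str,
                payload: bytes, created_at: str) -> BackupRecord:
    """Build a record, deriving checksum and size from the payload."""
    digest = hashlib.sha256(payload).hexdigest()
    return BackupRecord(service, engine, cid, digest, len(payload), created_at)


def append_record(catalog_json: str, record: BackupRecord) -> str:
    """Return a new catalog document with the record added, oldest first."""
    records = json.loads(catalog_json) if catalog_json else []
    records.append(asdict(record))
    records.sort(key=lambda r: r["created_at"])
    return json.dumps(records, indent=2)


def latest(catalog_json: str, service: str) -> Optional[dict]:
    """Most recent restore point for a service, or None if there is none."""
    matches = [r for r in json.loads(catalog_json) if r["service"] == service]
    return max(matches, key=lambda r: r["created_at"]) if matches else None
```

Sorting on `created_at` as a string only works if every timestamp uses the same ISO 8601 UTC format, which this sketch assumes.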

Technical Design

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Akash Deployment                          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────┐ │
│  │   App       │    │  Database   │    │ Backup Sidecar  │ │
│  │  (API)      │◄──►│ (PostgreSQL)│◄──►│ (pg_dump daily) │ │
│  └─────────────┘    └─────────────┘    └────────┬────────┘ │
└─────────────────────────────────────────────────┼──────────┘
                                                  │
                    ┌─────────────────────────────┼─────────────────────────────┐
                    │                             │                             │
                    ▼                             ▼                             ▼
              ┌──────────┐                 ┌──────────────┐              ┌─────────────┐
              │ Storacha │                 │    GitHub    │              │   Backup    │
              │  (IPFS)  │                 │  Artifacts   │              │   Catalog   │
              └──────────┘                 └──────────────┘              └─────────────┘

Backup Sidecar Container

  • Alpine Linux + pg_dump/sqlite3 + age + w3 CLI
  • Cron schedule: 3 AM UTC daily
  • Process: dump → compress → encrypt → upload to Storacha → POST metadata to catalog
  • Health endpoint: /health for monitoring
  • Metrics: backup size, duration, success/failure
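
The dump → compress → encrypt → upload sequence above could be sketched roughly as follows. This assumes `pg_dump`, `sqlite3`, `age`, and `w3` are on the container's PATH and that the `w3` CLI is already authenticated. The function names and file names are hypothetical, and the catalog POST, health endpoint, and metrics are left out.

```python
import gzip
import shutil
import subprocess
from pathlib import Path


def dump_command(engine: str, source: str, out: Path) -> list[str]:
    """Command line that writes a dump of `source` to `out`."""
    if engine == "postgresql":
        # `source` is a connection string.
        return ["pg_dump", f"--dbname={source}", f"--file={out}"]
    if engine == "sqlite":
        # `source` is the path to the live database file.
        return ["sqlite3", source, f".backup '{out}'"]
    raise ValueError(f"unsupported engine: {engine}")


def compress(src: Path) -> Path:
    """Gzip `src` next to itself and return the new path."""
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def run_backup(engine: str, source: str, workdir: Path,
               recipient: str, run=subprocess.run) -> Path:
    """Dump, compress, encrypt and upload; returns the encrypted file path."""
    dump = workdir / "backup.dump"
    run(dump_command(engine, source, dump), check=True)
    packed = compress(dump)
    encrypted = packed.with_name(packed.name + ".age")
    # Encrypt to the age recipient (the value of AGE_PUBLIC_KEY).
    run(["age", "-r", recipient, "-o", str(encrypted), str(packed)], check=True)
    # Upload the encrypted file to Storacha.
    run(["w3", "up", str(encrypted)], check=True)
    return encrypted
```

Passing `run` as a parameter is only a convenience for testing the sequence without the real binaries; with `check=True`, any failing step raises `subprocess.CalledProcessError`, which a retry wrapper can catch.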

GitHub Secrets Required (per service)

  • AGE_PUBLIC_KEY: age recipient (public key) used to encrypt backups
  • W3_PRINCIPAL: Storacha authentication
  • W3_PROOF: Storacha delegation proof
  • INFISICAL_CLIENT_*: For fetching other secrets

Implementation Steps

  1. Create backup sidecar Docker image
  2. Update Akash SDLs to include sidecar
  3. Implement backup catalog service (can be simple JSON in Storacha initially)
  4. Add backup verification job
  5. Add alerting integration
  6. Document restore procedures
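
For the verification job in step 4, one possible starting point for the SQLite case is sketched below: open a decrypted, decompressed backup read-only, run SQLite's built-in integrity check, and confirm that expected tables exist. The function name and the `required_tables` parameter are hypothetical. Verifying a PostgreSQL dump would need a different approach, such as restoring into a scratch database, which is not shown here.

```python
import sqlite3
from pathlib import Path
from typing import Iterable


def verify_sqlite_backup(path: Path, required_tables: Iterable[str] = ()) -> bool:
    """Return True if the file is a healthy SQLite database with the expected tables."""
    # Open read-only so the check can never modify the backup.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error:
        return False
    try:
        # integrity_check returns a single row containing "ok" for a healthy file.
        if conn.execute("PRAGMA integrity_check").fetchone()[0] != "ok":
            return False
        tables = {
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        }
        return set(required_tables) <= tables
    except sqlite3.DatabaseError:
        # Raised when the file is corrupt or not a database at all.
        return False
    finally:
        conn.close()
```

A `False` result is the signal the alerting integration in step 5 would act on.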

Labels

infrastructure, backup, priority-high