
Ingest Light

A lightweight, scalable data ingestion service built in Go with support for multiple data sources, privacy protection, and tiered processing configurations.

🚀 Features

  • Multi-Source Ingestion: Support for CSV, JSON, and PostgreSQL data sources
  • Privacy Protection: Built-in data sanitization and encryption
  • Tiered Processing: Configurable performance tiers (small, medium, large)
  • Dual Storage: PostgreSQL for metadata and ClickHouse for analytics
  • RESTful API: Simple HTTP endpoints for data ingestion
  • Docker Support: Easy deployment with Docker Compose
  • Graceful Shutdown: Proper signal handling and cleanup
  • Auto-scaling Workers: Configurable worker pools for optimal performance

πŸ—οΈ Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Data Sources  │───▶│  Privacy Guard  │───▶│   Transform     │
│  (CSV/JSON/DB)  │    │  (Sanitization) │    │   (Processing)  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                       │
                                                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   PostgreSQL    │◀───│   Storage       │◀───│   Ingestor      │
│   (Metadata)    │    │   Layer         │    │   (Workers)     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                               │
                               ▼
                       ┌─────────────────┐
                       │   ClickHouse    │
                       │   (Analytics)   │
                       └─────────────────┘

📋 Prerequisites

  • Go 1.23.4 or later
  • Docker and Docker Compose
  • PostgreSQL 15+
  • ClickHouse 23.3+

πŸ› οΈ Installation & Setup

Option 1: Docker Compose (Recommended)

  1. Clone the repository

    git clone https://github.com/lucent1/ingest-light.git
    cd ingest-light
  2. Start the services

    docker-compose up -d

    This will start:

    • PostgreSQL on port 5432
    • ClickHouse on ports 8123 (HTTP) and 9000 (native)
    • Ingest service on port 8080
  3. Verify the setup

    curl http://localhost:8080/status

Option 2: Local Development

  1. Install dependencies

    go mod download
  2. Set up databases

    • Start PostgreSQL and create a database named ingestion
    • Start ClickHouse
  3. Configure environment variables

    export POSTGRES_URL="postgres://username:password@localhost:5432/ingestion?sslmode=disable"
    export CLICKHOUSE_URL="localhost:9000"
    export PRIVACY_ENCRYPTION_KEY="your-secure-encryption-key"
    export PRIVACY_SALT="your-secure-salt"
  4. Run the service

    go run cmd/server/main.go

βš™οΈ Configuration

The service uses a YAML configuration file (config.yaml) to define processing tiers:

tiers:
  small:
    workers: 4
    buffer_size: 100
    batch_size: 50
    timeout_ms: 500
    retention_days: 1

  medium:
    workers: 8
    buffer_size: 500
    batch_size: 100
    timeout_ms: 300
    retention_days: 7

  large:
    workers: 16
    buffer_size: 1000
    batch_size: 200
    timeout_ms: 200
    retention_days: 14

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| CONFIG_PATH | Path to configuration file | config.yaml |
| POSTGRES_URL | PostgreSQL connection string | - |
| CLICKHOUSE_URL | ClickHouse connection string | - |
| PRIVACY_ENCRYPTION_KEY | Encryption key for privacy | - |
| PRIVACY_SALT | Salt for privacy functions | - |

📡 API Endpoints

POST /ingest

Ingest data into the system.

Request Body:

{
  "source": "csv_file",
  "tier": "medium",
  "payload": [
    {
      "id": 1,
      "name": "John Doe",
      "email": "john@example.com"
    }
  ]
}

Response:

{
  "status": "processed",
  "processed_at": "2024-01-01T00:00:00Z",
  "tier": "medium",
  "records": 1
}

GET /status

Check service health status.

Response:

{
  "status": "ok",
  "time": "2024-01-01T00:00:00Z"
}

GET /tiers

Get current tier configuration.

Response:

{
  "workers": 8,
  "buffer_size": 500,
  "batch_size": 100,
  "timeout_ms": 300,
  "retention_days": 7
}

🔒 Privacy & Security

The service includes a privacy guard that:

  • Sanitizes sensitive data before processing
  • Encrypts personal information
  • Provides configurable data retention policies
  • Supports GDPR compliance requirements

Important: Change the default encryption key and salt in production!

🧪 Testing

Run the test suite:

go test ./...

Run specific test files:

go test ./internal/privacy
go test ./internal/transform

📊 Monitoring

The service provides built-in monitoring:

  • Health check endpoint (/status)
  • Tier configuration endpoint (/tiers)
  • Structured logging for debugging
  • Graceful shutdown handling

🚀 Performance Tuning

Tier Selection

Choose the appropriate tier based on your workload:

  • Small: Low-volume, real-time processing
  • Medium: Balanced performance for most use cases
  • Large: High-volume, batch processing

Database Optimization

  • PostgreSQL: Configure connection pooling and indexing
  • ClickHouse: Optimize for analytical queries and compression

🔧 Development

Project Structure

ingest/
├── cmd/server/          # Main application entry point
├── internal/            # Internal packages
│   ├── adapter/         # Data source adapters
│   ├── api/             # HTTP handlers and types
│   ├── cleaner/         # Data retention and cleanup
│   ├── config/          # Configuration management
│   ├── db/              # Database connections, schemas, and storage
│   ├── ingest/          # Core ingestion logic
│   ├── privacy/         # Privacy and security
│   └── transform/       # Data transformation
├── pkg/                 # Public packages
├── examples/            # Usage examples
└── test/                # Test files

Adding New Data Sources

  1. Implement the SourceAdapter interface in internal/adapter/
  2. Add configuration for the new source
  3. Update the main application to use the new adapter

Adding New Storage Backends

  1. Implement the storage interface in internal/db/
  2. Add connection logic in internal/db/
  3. Update the storage initialization
