Skip to content

Hycky/oltp-simulator

Repository files navigation

πŸ₯ Alimentador-BD

OLTP Hospital Simulator β€” Continuous data streaming for CDC testing with Debezium or another CDC ingestion engine.

License: MIT Python 3.11+ PostgreSQL 14+


🎯 Overview

Alimentador-BD is a production-ready Python simulator that generates realistic hospital data in PostgreSQL with continuous INSERT/UPDATE operations. Perfect for testing CDC (Change Data Capture) pipelines with Debezium, validating data consistency, and developing ETL/ELT systems.

Key Features

✨ Continuous Data Streaming

  • 70% INSERT operations (new records)
  • 30% UPDATE operations (realistic modifications)
  • ~1 operation per 2 seconds (configurable)

πŸ₯ Realistic Hospital Schema

  • 7 OLTP tables (patients, doctors, appointments, exams, admissions, etc.)
  • ~13k initial seed records
  • Proper foreign keys and constraints
  • CDC-compatible triggers and indexes

🐍 Production-Ready Code

  • Type hints, docstrings, PEP 8 compliance
  • Error handling with exponential backoff
  • Batch operations with transaction support
  • Comprehensive logging with rotation

🐳 Multiple Deployment Options

  • Local development (Docker Compose)
  • AWS EC2 / RDS
  • Kubernetes
  • Standalone Python

πŸ“š Comprehensive Documentation

  • Quick start (5 minutes)
  • Complete user guide (Portuguese)
  • Technical architecture
  • Production deployment guide
  • Developer contribution guide

πŸš€ Quick Start

Prerequisites

  • Python 3.11+
  • PostgreSQL 14+
  • Docker & Docker Compose (optional)

1. Setup (5 minutes)

# Clone repository
git clone https://github.com:Hycky/oltp-simulator.git
cd alimentador-bd

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure (copy example and edit credentials)
cp config/.env.example config/.env
# Edit config/.env with your PostgreSQL connection

2. Initialize Database

make init    # Create schema and indexes
make seed    # Populate ~13k initial records

3. Start Streaming

make stream  # Continuous INSERT/UPDATE operations

4. Monitor

# In another terminal
make counts  # Show record counts
tail -f logs/app.log  # View live logs

With Docker Compose

# Start PostgreSQL + PgAdmin
docker-compose up -d postgres

# From host machine, initialize
make init
make seed

# Stream
make stream

πŸ“‹ Available Commands

Command Description
make install Create venv and install dependencies
make init Create schema, indexes, lookup data
make seed Populate ~13k initial records
make stream Start continuous streaming
make reset Drop + recreate + seed all
make counts Display table record counts
make fmt Format code with Black
make lint Check code with Ruff
make clean Remove cache and temp files

πŸ›οΈ Database Schema

7 OLTP Tables

pacientes (2,000)
β”œβ”€β”€ id, nome, nascimento, cpf, telefone, endereco
β”œβ”€β”€ created_at, updated_at (automatic)
└── PRIMARY KEY, UNIQUE(cpf), INDEX(cpf)

medicos (200)
β”œβ”€β”€ id, nome, crm, especialidade, telefone
└── PRIMARY KEY, UNIQUE(crm), INDEX(crm)

convenios (12)
β”œβ”€β”€ id, nome, cnpj, tipo, cobertura
└── PRIMARY KEY, UNIQUE(cnpj)

pacientes_convenios (2,500+)
β”œβ”€β”€ id, paciente_id β†’ pacientes
β”œβ”€β”€ convenio_id β†’ convenios
└── UNIQUE(paciente_id, convenio_id)

consultas (4,000+)
β”œβ”€β”€ id, paciente_id β†’ pacientes
β”œβ”€β”€ medico_id β†’ medicos
β”œβ”€β”€ data, motivo, status (agendada|realizada|cancelada|faltou)
└── INDEX(paciente_id, medico_id, data)

exames (3,500+)
β”œβ”€β”€ id, paciente_id β†’ pacientes
β”œβ”€β”€ tipo_exame, data, resultado
└── INDEX(paciente_id, data)

internacoes (1,200+)
β”œβ”€β”€ id, paciente_id β†’ pacientes
β”œβ”€β”€ data_entrada, data_saida, motivo, quarto
└── CHECK(data_saida >= data_entrada)

Key Features

  • βœ… BIGSERIAL primary keys on all tables
  • βœ… Unique constraints on natural keys (CPF, CRM, CNPJ)
  • βœ… Cascading foreign keys (ON UPDATE CASCADE, ON DELETE RESTRICT)
  • βœ… Automatic timestamps with triggers (created_at, updated_at)
  • βœ… 9 strategic indexes for performance
  • βœ… CDC-compatible schema for Debezium

βš™οΈ Configuration

Environment Variables (.env)

# PostgreSQL Connection
PG_HOST=localhost
PG_PORT=5432
PG_USER=postgres
PG_PASSWORD=postgres
PG_DATABASE=teste_pacientes

# Streaming Configuration
STREAM_INTERVAL_SECONDS=2      # Delay between operations (seconds)
BATCH_SIZE=50                  # Records per batch
MAX_JITTER_MS=400              # Random delay variation (ms)

# Seeding Configuration
SEED_PACIENTES=2000
SEED_MEDICOS=200
SEED_CONVENIOS=12
SEED_CONSULTAS=4000
SEED_EXAMES=3500
SEED_INTERNACOES=1200
SEED_PACIENTES_CONVENIOS=2500

# Logging
LOG_LEVEL=INFO                 # DEBUG, INFO, WARNING, ERROR

TOML Configuration (config/settings.toml)

[db]
search_path = "public"
connect_timeout = 10

[stream]
interval_seconds = 2
batch_size = 50
max_jitter_ms = 400
fail_fast_on_critical = true

[logging]
level = "INFO"
rotate_when = "midnight"
backup_count = 7

πŸ”„ Streaming Operations

The simulator executes 8 realistic operations:

INSERTs (70%)

  1. insert_paciente - Register new patient
  2. insert_consulta - Schedule new appointment
  3. insert_exame - Request new lab test
  4. insert_internacao - Admit patient to hospital

UPDATEs (30%)

  1. update_paciente - Modify contact info
  2. update_consulta - Change appointment status
  3. update_exame - Record lab results
  4. update_internacao - Discharge patient

Each operation:

  • βœ… Validates foreign keys before execution
  • βœ… Commits in batches for performance
  • βœ… Logs operation type and counts
  • βœ… Handles errors gracefully (continues on non-critical failures)
  • βœ… Reconnects automatically with exponential backoff

πŸ§ͺ Testing & Validation

Verify Data Consistency

-- Check for orphaned records (should return 0)
SELECT COUNT(*) FROM consultas 
WHERE paciente_id NOT IN (SELECT id FROM pacientes);

-- Verify unique CPFs
SELECT cpf, COUNT(*) FROM pacientes 
GROUP BY cpf HAVING COUNT(*) > 1;

-- Check timestamp coherence
SELECT COUNT(*) FROM consultas 
WHERE created_at > now();

Monitor Growth

# Terminal 1 - Stream for 5 minutes
timeout 300 make stream

# Terminal 2 - Check growth every 10 seconds
while true; do make counts; sleep 10; done

Performance Testing

# Stress test: high throughput
STREAM_INTERVAL_SECONDS=0 timeout 60 make stream

# Measure: ~200 ops/minute

πŸ”Œ Debezium / CDC Integration

Alimentador-BD generates CDC-compatible changes for Debezium capture.

Debezium Configuration

{
  "name": "postgres-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "10.42.88.67",
    "database.port": 5441,
    "database.user": "app",
    "database.password": "app123",
    "database.dbname": "teste_pacientes",
    "database.server.name": "alimentador-bd",
    "plugin.name": "pgoutput",
    "publication.name": "alimentador_pub",
    "table.include.list": "public.*",
    "publication.autocreate.mode": "filtered",
    "slot.name": "alimentador_slot"
  }
}

What Gets Captured

  • βœ… All INSERT operations β†’ {before: null, after: {patient data}}
  • βœ… All UPDATE operations β†’ {before: {old data}, after: {new data}}
  • βœ… updated_at field automatically populated by triggers
  • βœ… Natural keys (CPF, CRM, CNPJ) for deduplication

Expected Kafka Events

{
  "schema": {...},
  "payload": {
    "before": null,
    "after": {
      "id": 2045,
      "nome": "JoΓ£o Silva",
      "cpf": "123.456.789-00",
      "created_at": 1705255200000,
      "updated_at": 1705255200000
    },
    "source": {
      "version": "2.4.0.Final",
      "connector": "postgresql",
      "name": "alimentador-bd",
      "ts_ms": 1705255200123,
      "txId": 12345,
      "lsn": 12345678,
      "xmin": null
    },
    "op": "c",
    "ts_ms": 1705255200123,
    "transaction": null
  }
}

πŸ“ Project Structure

alimentador_bd/
β”œβ”€β”€ config/
β”‚   β”œβ”€β”€ .env.example          # Template for credentials
β”‚   └── settings.toml         # Configuration
β”œβ”€β”€ scripts/                  # Python modules
β”‚   β”œβ”€β”€ cli.py               # CLI interface (Typer)
β”‚   β”œβ”€β”€ stream.py            # Streaming engine
β”‚   β”œβ”€β”€ seed.py              # Initial data population
β”‚   β”œβ”€β”€ db_init.py           # Database connection
β”‚   β”œβ”€β”€ data_gen.py          # Data generation (Faker)
β”‚   β”œβ”€β”€ validators.py        # FK validation cache
β”‚   └── reset.py             # Reset orchestration
β”œβ”€β”€ sql/                      # SQL scripts
β”‚   β”œβ”€β”€ 01_schema.sql        # Table definitions
β”‚   β”œβ”€β”€ 02_indexes.sql       # Indexes
β”‚   β”œβ”€β”€ 03_seed-lookups.sql  # Initial data
β”‚   └── 99_drop_all.sql      # Cleanup
β”œβ”€β”€ logs/                     # Runtime logs
β”œβ”€β”€ Makefile                  # Build automation
β”œβ”€β”€ Dockerfile                # Container image
β”œβ”€β”€ docker-compose.yml        # Local stack
β”œβ”€β”€ pyproject.toml           # Python config
β”œβ”€β”€ requirements.txt         # Dependencies
β”œβ”€β”€ README.md                # This file
β”œβ”€β”€ GUIDE.md                 # User guide (Portuguese)
β”œβ”€β”€ ARCHITECTURE.md          # Technical design
β”œβ”€β”€ DEPLOYMENT.md            # Production setup
β”œβ”€β”€ CONTRIBUTING.md          # Contribution guide
β”œβ”€β”€ CHANGELOG.md             # Version history
└── LICENSE                  # MIT license

🐳 Docker Deployment

Run Locally

# Start PostgreSQL (Docker)
docker-compose up -d postgres

# Initialize from host
make init
make seed

# Stream
make stream

Build Image

docker build -t alimentador-bd:1.0.0 .

docker run --rm \
  -e PG_HOST=localhost \
  -e PG_USER=app \
  -e PG_PASSWORD=app123 \
  -e PG_DATABASE=teste_pacientes \
  -v ./logs:/app/logs \
  alimentador-bd:1.0.0 \
  python -m scripts.cli stream

☁️ Production Deployment

See DEPLOYMENT.md for detailed guides:

  • βœ… AWS EC2 + RDS setup
  • βœ… Kubernetes deployment
  • βœ… Monitoring and scaling
  • βœ… Backup and recovery
  • βœ… Security best practices

πŸ“š Documentation

Document Purpose
README.md Overview, quick start, schema (this file)
GUIDE.md Complete user manual in Portuguese πŸ‡§πŸ‡·
ARCHITECTURE.md Technical design and data flow
DEPLOYMENT.md Production setup (AWS, K8s, Docker)
CONTRIBUTING.md How to contribute, dev setup
CHANGELOG.md Version history and roadmap

πŸ› Troubleshooting

Connection Error: "connection refused"

# Check PostgreSQL is running
psql -U postgres -h localhost -c "SELECT 1"

# Verify credentials in config/.env
cat config/.env | grep PG_

IntegrityError: "duplicate key value"

This is expected and handled gracefully. The simulator skips duplicates and logs them:

grep "IntegrityError" logs/app.log

Stream not starting

# Verify database is initialized
make init
make seed
make counts

# Check logs
tail -20 logs/app.log

Slow inserts

# Check disk space and PostgreSQL performance
df -h
psql -U app -d teste_pacientes -c "SELECT * FROM pg_stat_user_tables"

# Reduce batch size if needed
BATCH_SIZE=25 make stream

🀝 Contributing

We welcome contributions! See CONTRIBUTING.md for:

  • Development setup
  • Code style guidelines
  • Testing procedures
  • Pull request workflow

Quick start for contributors:

git clone https://github.com/yourusername/alimentador-bd.git
cd alimentador-bd
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
make init && make seed
make stream  # Test it works

πŸ“Š Performance Metrics

Metric Value
Seed time <2 seconds
Initial records ~13,000
Stream rate 1 op / 2s
Batch size 50 records
Insert ops 70%
Update ops 30%
Throughput 200+ ops/min
Memory usage ~256 MB
CPU usage Low (<1 core)

πŸ“ License

MIT License - See LICENSE for details

Copyright Β© 2025 Henrique Ferreira


πŸ“ž Support


πŸŽ‰ Next Steps

  1. Read GUIDE.md (Portuguese user guide) or this README
  2. Setup with make install && make init && make seed
  3. Run with make stream
  4. Monitor with make counts and tail -f logs/app.log
  5. Deploy using DEPLOYMENT.md for production

Version: 1.0.0 | Status: Production Ready βœ… | License: MIT

About

Realistic OLTP data simulator for CDC testing with Debezium

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors