v0.2.0: Late-arriving data detection, Delta Lake support, and metadata purge #16

Merged
harishconti merged 2 commits into main from claude/next-roadmap-features-U6AVW
Apr 11, 2026

@Sruthi-ng

What does this PR do?

Introduces four features for v0.2.0:

  1. Late-Arriving Data Detector — New monitoring capability that tracks whether expected data batches arrive within a configured grace period. Distinct from freshness checks (which measure staleness of the newest record), late-arriving detection verifies that scheduled ETL/streaming jobs complete on time. Includes cron-based scheduling, configurable grace periods, and alert dispatch.

  2. Delta Lake Connector — Adds native support for Delta Lake tables via the deltalake Python library. Supports local filesystem, S3, GCS, and Azure Data Lake Storage backends. Includes schema inspection, row counting (with transaction log optimization), timestamp queries, and SQL execution via DuckDB integration.

  3. Scheduled Metadata Purge — Automatic cleanup of metadata records to prevent unbounded database growth. Configurable per-table retention windows (default 90 days, with overrides for schema snapshots, distribution snapshots, and audit logs). Includes dry-run capability and purge history tracking.

  4. Backfill Detection — Enhanced volume anomaly detection to distinguish historical data re-ingestion (backfills) from genuine spikes. When a volume spike is detected, the system samples row timestamps; if a large fraction carry event timestamps significantly older than "now", the alert is downgraded to INFO severity.

Type of change

  • New connector
  • New check template
  • Bug fix
  • Documentation
  • Other

Changes

New Files

  • backend/routers/late_arriving.py — Late-arriving data detection API and scheduler integration
  • connectors/deltalake.py — Delta Lake warehouse connector with multi-backend support
  • backend/routers/maintenance.py — Metadata purge API and scheduler integration

Modified Files

  • backend/models.py — Added BackfillEvent, LateArrivingRecord, and MaintenanceLog models
  • backend/routers/checks.py — Integrated backfill detection into volume anomaly checks
  • backend/scheduler.py — Added scheduler jobs for late-arriving checks and metadata purge
  • backend/main.py — Registered new routers; updated version display
  • connectors/base.py — Registered the Delta Lake connector in the factory (warehouse_type=deltalake)
  • config/kit.yml — Added configuration sections for late-arriving detection and maintenance purge
  • pyproject.toml — Bumped version to 0.2.0
  • ROADMAP.md — Marked completed features

Checklist

  • CHANGELOG.md updated (for features)
  • Config template updated with new sections
  • Version bumped to 0.2.0
  • Scheduler integration complete
  • API endpoints documented with docstrings

Test Plan

The changes include:

  • New API endpoints with auth guards (/late-arriving/, /maintenance/purge)
  • Scheduler integration that respects enabled flags in config
  • Database models with proper foreign key relationships
  • Late-arriving detector with a fallback to midnight UTC when croniter is not installed

Verification:

  • Existing tests should pass (no breaking changes to core interfaces)
  • Late-arriving checks respect cron expressions when croniter is installed
  • Backfill detection runs inside the existing volume checks and downgrades backfill-flagged spikes to an INFO alert
  • Metadata purge respects retention policy overrides from config
  • Delta Lake connector handles missing credentials gracefully with clear error messages

https://claude.ai/code/session_01MVDY9dWdMCUbYk5aDWu5YM

claude added 2 commits April 11, 2026 11:27
…tadata purge

## Delta Lake connector (connectors/deltalake.py)
- New WarehouseConnector for Delta tables via the `deltalake` Python library
- Supports local paths, S3, GCS, and Azure Data Lake Storage Gen2
- Cheap row-count reads from Delta transaction log statistics (O(1))
- SQL query support via DuckDB bridge (registers Delta table as a DuckDB view)
- get_table_history() and get_file_stats() for FinOps/lineage use cases
- Registered as warehouse_type=deltalake in connectors/base.py factory
- pip install observakit[deltalake] extra added to pyproject.toml
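
To illustrate why a row count can be answered from the transaction log without scanning data files, here is a minimal stdlib-only sketch. It is not the connector's code (the connector uses the `deltalake` library); the function name `row_count_from_log` is hypothetical. It replays the JSON commit files under `_delta_log`, tracks which data files are still live, and sums their `numRecords` statistics. It ignores checkpoints and deletion vectors, so treat it as a teaching aid only.

```python
import json
from pathlib import Path
from typing import Optional


def row_count_from_log(table_path: str) -> Optional[int]:
    """Sum numRecords over the data files that are still live in the Delta log.

    Returns None when any live file lacks statistics, in which case a caller
    would have to fall back to scanning the data files.
    """
    log_dir = Path(table_path) / "_delta_log"
    live = {}  # data file path -> numRecords (None if stats are absent)
    # Commit file names are zero-padded, so lexical order is commit order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                stats = action["add"].get("stats")  # JSON-encoded string
                live[action["add"]["path"]] = (
                    json.loads(stats).get("numRecords") if stats else None
                )
            elif "remove" in action:
                live.pop(action["remove"]["path"], None)
    if any(n is None for n in live.values()):
        return None
    return sum(live.values())
```

The cost depends on the number of log entries rather than on the number of rows, which is what makes this kind of read cheap for large tables.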

## Backfill detection (backend/routers/checks.py + backend/models.py)
- New BackfillEvent model persists detected backfill metadata
- _detect_backfill() samples up to 10,000 rows and analyses the timestamp
  distribution: if ≥40% of rows have event_ts older than the rolling window
  AND the row spread exceeds 24h, the spike is classified as a backfill
- Backfill-flagged spikes are suppressed from ANOMALY alerts; a low-severity
  INFO alert is sent instead so the team stays informed
- Volume checks now accept an optional timestamp_column per table to enable this
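
The classification rule above can be sketched as follows. This is an illustration, not the body of `_detect_backfill()`: the function name `looks_like_backfill`, its parameters, and the choice to take the first 10,000 timestamps as the sample are assumptions. Only the 10,000-row cap, the 40% threshold, and the 24h spread come from the commit message.

```python
from datetime import datetime, timedelta, timezone
from typing import Sequence

MAX_SAMPLE = 10_000                 # cap on sampled rows
OLD_FRACTION_THRESHOLD = 0.40       # share of rows older than the window
MIN_SPREAD = timedelta(hours=24)    # minimum spread of event timestamps


def looks_like_backfill(event_ts: Sequence[datetime], window_start: datetime) -> bool:
    """Return True when a volume spike looks like historical re-ingestion."""
    # Assumption: the sample is simply the first MAX_SAMPLE timestamps.
    sample = list(event_ts[:MAX_SAMPLE])
    if not sample:
        return False
    old = sum(1 for ts in sample if ts < window_start)
    spread = max(sample) - min(sample)
    return old / len(sample) >= OLD_FRACTION_THRESHOLD and spread > MIN_SPREAD
```

Requiring both conditions keeps a short burst of slightly stale rows from being treated as a backfill: the old rows must be numerous and spread over more than a day.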

## Late-arriving data detector (backend/routers/late_arriving.py + models.py)
- New LateArrivingRecord model and full router (GET /, GET /{table}, POST /check)
- Cron-aware: uses croniter to compute the most recent expected arrival window
  (falls back to midnight UTC if croniter is not installed)
- Three statuses: ok, late (data arrived after deadline), missing (nothing arrived)
- Scheduler job added (LATE_ARRIVING_CHECK_INTERVAL env var, default 30 min)
- Enabled via late_arriving: block in kit.yml
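
A minimal sketch of how the expected window and the three statuses could fit together. The function names, the `grace` parameter, and the decision to report "ok" while the grace period is still running are assumptions for illustration; the midnight UTC fallback and the status names come from the notes above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

try:
    from croniter import croniter  # optional dependency
except ImportError:
    croniter = None


def expected_arrival(cron_expr: str, now: datetime) -> datetime:
    """Most recent scheduled arrival time relative to `now`."""
    if croniter is not None:
        return croniter(cron_expr, now).get_prev(datetime)
    # Fallback when croniter is missing: midnight UTC of the current day.
    return now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )


def classify(cron_expr: str, grace: timedelta,
             latest_arrival: Optional[datetime], now: datetime) -> str:
    """Return 'ok', 'late', or 'missing' for the current window."""
    expected = expected_arrival(cron_expr, now)
    deadline = expected + grace
    if latest_arrival is None or latest_arrival < expected:
        # Assumption: nothing is reported as missing until the deadline passes.
        return "missing" if now > deadline else "ok"
    return "ok" if latest_arrival <= deadline else "late"
```

For a daily job scheduled with `0 0 * * *` and a one-hour grace period, a batch landing at 00:30 UTC would be classified as ok and one landing at 01:30 UTC as late.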

## Scheduled metadata purge (backend/routers/maintenance.py + models.py)
- New MaintenanceLog model audits every purge run
- Purges 14 metadata tables with configurable per-table retention days
- GET /maintenance/status returns last run info + full retention policy
- POST /maintenance/purge supports dry_run=true for safe inspection
- Scheduler job added (MAINTENANCE_PURGE_INTERVAL env var, default 1440 min/24h)
- Enabled via maintenance: block in kit.yml
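
The retention and dry-run behaviour can be sketched with SQLite from the standard library. This is an illustration under stated assumptions, not the code in maintenance.py: the `purge` signature, the `created_at` column, and the table names used in the usage note are hypothetical. Only the 90-day default, per-table overrides, and the dry-run mode come from the PR.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Dict, List

DEFAULT_RETENTION_DAYS = 90


def purge(conn: sqlite3.Connection, tables: List[str], overrides: Dict[str, int],
          now: datetime, dry_run: bool = True) -> Dict[str, int]:
    """Return {table: rows past retention}; delete them only if dry_run is False."""
    counts = {}
    for table in tables:
        days = overrides.get(table, DEFAULT_RETENTION_DAYS)
        # Assumption: created_at holds ISO-8601 UTC strings, so string
        # comparison matches chronological order.
        cutoff = (now - timedelta(days=days)).isoformat()
        # Table names must come from a fixed allow-list, never from user input.
        where = f"FROM {table} WHERE created_at < ?"
        counts[table] = conn.execute(
            f"SELECT COUNT(*) {where}", (cutoff,)
        ).fetchone()[0]
        if not dry_run:
            conn.execute(f"DELETE {where}", (cutoff,))
    if not dry_run:
        conn.commit()
    return counts
```

Calling `purge(conn, ["check_results", "audit_logs"], {"audit_logs": 365}, now)` would report how many rows each table would lose without touching them; passing `dry_run=False` performs the deletes.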

## Config & wiring
- kit.yml gains late_arriving and maintenance sections (both disabled by default)
- backend/main.py: new routers registered at /late-arriving and /maintenance
- backend/scheduler.py: two new optional scheduler jobs
- Version bumped to 0.2.0 across main.py and pyproject.toml
- ROADMAP.md: all four v0.2.0 items marked complete
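
As a rough sketch of the wiring described above, the snippet below decides which optional jobs to schedule and at what interval. The environment variable names and defaults are taken from this commit message; the `planned_jobs` helper, the `enabled` key, and the dict-shaped config are assumptions, and no particular scheduler library is implied.

```python
import os
from typing import Dict, Mapping

# section in kit.yml -> (interval env var, default interval in minutes)
JOBS = {
    "late_arriving": ("LATE_ARRIVING_CHECK_INTERVAL", 30),
    "maintenance": ("MAINTENANCE_PURGE_INTERVAL", 1440),
}


def planned_jobs(config: Mapping, environ: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Return {job name: interval in minutes} for the jobs that are enabled."""
    jobs = {}
    for section, (env_var, default_minutes) in JOBS.items():
        # Both sections are disabled unless the config turns them on.
        if not config.get(section, {}).get("enabled", False):
            continue
        jobs[section] = int(environ.get(env_var, default_minutes))
    return jobs
```

With an empty config no job is planned, which mirrors the disabled-by-default behaviour; enabling a section picks up the default interval unless the matching environment variable overrides it.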

https://claude.ai/code/session_01MVDY9dWdMCUbYk5aDWu5YM
- Remove unused `sqlalchemy.text` import from maintenance.py
- Remove unused `pyarrow.compute` import from deltalake.py (full-scan
  fallback uses len(batch) directly, not a compute function)

https://claude.ai/code/session_01MVDY9dWdMCUbYk5aDWu5YM
@harishconti harishconti merged commit 6b99043 into main Apr 11, 2026
1 check passed