v0.2.0: Late-arriving data detection, Delta Lake support, and metadata purge#16
Merged
harishconti merged 2 commits intomainfrom Apr 11, 2026
Merged
Conversation
…tadata purge
## Delta Lake connector (connectors/deltalake.py)
- New WarehouseConnector for Delta tables via the `deltalake` Python library
- Supports local paths, S3, GCS, and Azure Data Lake Storage Gen2
- Cheap row-count reads from Delta transaction log statistics (O(1))
- SQL query support via DuckDB bridge (registers Delta table as a DuckDB view)
- get_table_history() and get_file_stats() for FinOps/lineage use cases
- Registered as warehouse_type=deltalake in connectors/base.py factory
- pip install observakit[deltalake] extra added to pyproject.toml
## Backfill detection (backend/routers/checks.py + backend/models.py)
- New BackfillEvent model persists detected backfill metadata
- _detect_backfill() samples up to 10,000 rows and analyses the timestamp
distribution: if ≥40% of rows have event_ts older than the rolling window
AND the row spread exceeds 24h, the spike is classified as a backfill
- Backfill-flagged spikes are suppressed from ANOMALY alerts; a low-severity
INFO alert is sent instead so the team stays informed
- Volume checks now accept an optional timestamp_column per table to enable this
## Late-arriving data detector (backend/routers/late_arriving.py + models.py)
- New LateArrivingRecord model and full router (GET /, GET /{table}, POST /check)
- Cron-aware: uses croniter to compute the most recent expected arrival window
(falls back to midnight UTC if croniter is not installed)
- Three statuses: ok, late (data arrived after deadline), missing (nothing arrived)
- Scheduler job added (LATE_ARRIVING_CHECK_INTERVAL env var, default 30 min)
- Enabled via late_arriving: block in kit.yml
## Scheduled metadata purge (backend/routers/maintenance.py + models.py)
- New MaintenanceLog model audits every purge run
- Purges 14 metadata tables with configurable per-table retention days
- GET /maintenance/status returns last run info + full retention policy
- POST /maintenance/purge supports dry_run=true for safe inspection
- Scheduler job added (MAINTENANCE_PURGE_INTERVAL env var, default 1440 min/24h)
- Enabled via maintenance: block in kit.yml
## Config & wiring
- kit.yml gains late_arriving and maintenance sections (both disabled by default)
- backend/main.py: new routers registered at /late-arriving and /maintenance
- backend/scheduler.py: two new optional scheduler jobs
- Version bumped to 0.2.0 across main.py and pyproject.toml
- ROADMAP.md: all four v0.2.0 items marked complete
https://claude.ai/code/session_01MVDY9dWdMCUbYk5aDWu5YM
- Remove unused `sqlalchemy.text` import from maintenance.py - Remove unused `pyarrow.compute` import from deltalake.py (full-scan fallback uses len(batch) directly, not a compute function) https://claude.ai/code/session_01MVDY9dWdMCUbYk5aDWu5YM
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Introduces three major features for v0.2.0:
Late-Arriving Data Detector — New monitoring capability that tracks whether expected data batches arrive within a configured grace period. Distinct from freshness checks (which measure staleness of the newest record), late-arriving detection verifies that scheduled ETL/streaming jobs complete on time. Includes cron-based scheduling, configurable grace periods, and alert dispatch.
Delta Lake Connector — Adds native support for Delta Lake tables via the
deltalakePython library. Supports local filesystem, S3, GCS, and Azure Data Lake Storage backends. Includes schema inspection, row counting (with transaction log optimization), timestamp queries, and SQL execution via DuckDB integration.Scheduled Metadata Purge — Automatic cleanup of metadata records to prevent unbounded database growth. Configurable per-table retention windows (default 90 days, with overrides for schema snapshots, distribution snapshots, and audit logs). Includes dry-run capability and purge history tracking.
Backfill Detection — Enhanced volume anomaly detection to distinguish historical data re-ingestion (backfills) from genuine spikes. When a volume spike is detected, the system samples row timestamps; if a large fraction carry event timestamps significantly older than "now", the alert is downgraded to INFO severity.
Type of change
Changes
New Files
backend/routers/late_arriving.py— Late-arriving data detection API and scheduler integrationconnectors/deltalake.py— Delta Lake warehouse connector with multi-backend supportbackend/routers/maintenance.py— Metadata purge API and scheduler integrationModified Files
backend/models.py— AddedBackfillEventandLateArrivingRecordmodelsbackend/routers/checks.py— Integrated backfill detection into volume anomaly checksbackend/scheduler.py— Added scheduler jobs for late-arriving checks and metadata purgebackend/main.py— Registered new routers; updated version displayconnectors/base.py— Added Delta Lake connector factoryconfig/kit.yml— Added configuration sections for late-arriving detection and maintenance purgepyproject.toml— Bumped version to 0.2.0ROADMAP.md— Marked completed featuresChecklist
Test Plan
The changes include:
/late-arriving/,/maintenance/purge)enabledflags in configVerification:
https://claude.ai/code/session_01MVDY9dWdMCUbYk5aDWu5YM