feat: MySQL CDC GA — type fidelity, bootstrap, cursor chunker, non-integer PKs#99

Open
dariomazzitellireplik-coder wants to merge 1 commit into main from feat/mysql-cdc-ga

@dariomazzitellireplik-coder (Collaborator)

Summary

Graduates MySQL source from BETA → STABLE in v2.5.0. Closes the correctness and UX gaps identified at v2.3.0 (PR #97). After this lands, the README/docs drop the 🧪 Beta badge and the C11 type-roundtrip verify check passes against MySQL sources without --skip C11.

OpenSpec change: mysql-cdc-beta-to-ga.

Highlights

Type fidelity

  • Value::UInt64(u64) + DataType::UInt64 for BIGINT UNSIGNED. Previously dbmazz silently wrapped values ≥ 2^63 to negative i64. Sinks map to NUMERIC(20,0) (PG/SF) or LARGEINT (StarRocks).
  • Value::Timestamp(micros) replaces the temporary ISO-string shim from PR #97 (feat: add MySQL CDC source (BETA)) for TIMESTAMP/DATETIME columns. DATETIME(p) microseconds are preserved end-to-end (previously the _us parameter was dropped).
  • DECIMAL precision/scale propagated from information_schema.columns — was hardcoded Decimal{38,9} regardless of source.
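
To make the BIGINT UNSIGNED fix concrete, here is a minimal sketch of the wrap-around and of the unsigned variant that avoids it. The `Value` enum below is a hypothetical two-variant stand-in, and the function names are illustrative assumptions rather than dbmazz's actual code.

```rust
/// Hypothetical stand-in for the pipeline value type (two variants only).
#[derive(Debug, PartialEq)]
pub enum Value {
    Int64(i64),
    UInt64(u64),
}

/// Illustrates the old failure mode: an `as` cast reinterprets the bits,
/// so any value >= 2^63 comes out negative.
pub fn decode_unsigned_lossy(raw: u64) -> Value {
    Value::Int64(raw as i64)
}

/// Keeps the full unsigned range by carrying the u64 through unchanged.
pub fn decode_unsigned(raw: u64) -> Value {
    Value::UInt64(raw)
}

/// Renders a value as a decimal literal; u64::MAX has 20 digits,
/// which is why a NUMERIC(20,0) target column is wide enough.
pub fn to_numeric_literal(v: &Value) -> String {
    match v {
        Value::Int64(i) => i.to_string(),
        Value::UInt64(u) => u.to_string(),
    }
}
```

With this sketch, `decode_unsigned_lossy(u64::MAX)` yields `Value::Int64(-1)`, while `decode_unsigned(u64::MAX)` renders as `18446744073709551615`.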

First-run binlog bootstrap (H5)

On a fresh start, SHOW MASTER STATUS is captured before the snapshot worker spawns and persisted as PROVISIONAL. The post-snapshot CDC stream resumes from that point — no more replaying days of binlogs, no hard error if old binlogs were purged. The first commit promotes the row to ACTIVE.
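
The PROVISIONAL to ACTIVE lifecycle described above can be sketched as a small state transition. All type and function names here are hypothetical and chosen for illustration; the real checkpoint store and its schema may differ.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum CheckpointStatus {
    Provisional,
    Active,
}

/// Binlog coordinates as reported by SHOW MASTER STATUS (file + offset).
#[derive(Debug, Clone, PartialEq)]
pub struct BinlogPosition {
    pub file: String,
    pub offset: u64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Checkpoint {
    pub position: BinlogPosition,
    pub status: CheckpointStatus,
}

/// On startup: reuse an existing checkpoint if there is one; otherwise pin
/// the current binlog position as PROVISIONAL before any snapshot work runs.
pub fn bootstrap(existing: Option<Checkpoint>, current_master: BinlogPosition) -> Checkpoint {
    match existing {
        Some(cp) => cp,
        None => Checkpoint {
            position: current_master,
            status: CheckpointStatus::Provisional,
        },
    }
}

/// The first successful commit advances the position and promotes the row.
pub fn commit(cp: &mut Checkpoint, new_position: BinlogPosition) {
    cp.position = new_position;
    cp.status = CheckpointStatus::Active;
}
```

Capturing the position before the snapshot starts means any row changed during the snapshot is still covered by the binlog range that CDC replays afterwards.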

Snapshot performance (M3)

Cursor-based keyset paging replaces MIN(pk)/MAX(pk) + linear partitioning. Each chunk has bounded row count regardless of PK density — sparse distributions no longer produce empty / oversized chunks.

Non-integer PK support (M4)

VARCHAR / CHAR / UUID / BINARY / VARBINARY primary keys are now snapshot-able. PkKind::Str uses COLLATE utf8mb4_bin for deterministic byte-wise ordering. Composite PK and no-PK tables are skipped with WARN logs.

Cleanup

  • Drop Source::start_replication, checkpoint_position, cleanup — engine uses create_loop exclusively.
  • Drop SinkResult::last_position — populated by all three sinks but never consumed.

Migrations (auto, idempotent)

  • dbmazz_snapshot_state adds start_pk_text/end_pk_text/pk_kind; relaxes NOT NULL on legacy start_pk/end_pk. Backfills Int-kinded rows.
  • dbmazz_checkpoints adds nullable status VARCHAR(16) for PROVISIONAL bootstrap rows.

Verify matrix

Run against dbmazz-mysql:dev built from this branch:

| Combo | Result |
| --- | --- |
| PG → PG | 18/0/0 |
| PG → StarRocks | 17/0/1 ✓ (A4 skip expected) |
| MySQL → PG | 18/0/0 |
| MySQL → StarRocks | 17/0/1 |

C11 type-roundtrip now passes 7/7 against MySQL — confirms BIGINT UNSIGNED + DATETIME micros + DECIMAL precision + Value::Timestamp end-to-end.

Static checks

  • cargo fmt --all -- --check clean
  • cargo clippy --features mysql-source -- -D warnings clean
  • cargo clippy -- -D warnings clean
  • cargo test --features mysql-source --lib: 270 passed / 0 failed
  • cargo test --lib: 158 passed / 0 failed

Test plan

  • cargo fmt --all -- --check
  • cargo clippy --features mysql-source -- -D warnings
  • cargo clippy -- -D warnings
  • cargo test --features mysql-source
  • cargo test
  • ez-cdc verify — PG → PG: 18/0/0
  • ez-cdc verify — PG → SR: 17/0/1
  • ez-cdc verify — MySQL → PG: 18/0/0 (C11 7/7)
  • ez-cdc verify — MySQL → SR: 17/0/1 (C11 7/7)
  • ez-cdc verify MySQL → Snowflake (deferred — no SF creds in dev-stack)
  • H5 bootstrap integration test against a fresh MySQL with concurrent INSERTs (deferred — #[ignore]'d test in tasks.md)

Cross-repo

After merge: open a sister PR on ez-cdc-cli README to drop the historical "C11 expected to FAIL with MySQL" disclaimer (already noted in v0.5.3 CHANGELOG that this is resolved with dbmazz ≥ 2.4.0; v2.5.0 makes it fully accurate without the --skip C11 recommendation).

Breaking changes (in-tree only)

  • Value::UInt64 and DataType::UInt64 are new variants — sinks updated in lockstep.
  • Source trait method removals — PG/MySQL impls updated in lockstep.
  • SinkResult::last_position removal — sink impls updated in lockstep.
  • No out-of-tree implementors of these traits / structs.

…teger PKs

Graduates MySQL source from BETA to STABLE in v2.5.0. Closes the
correctness and UX gaps identified at v2.3.0 (PR #97).

## Type fidelity

- Add Value::UInt64(u64) and DataType::UInt64 for BIGINT UNSIGNED.
  Previously dbmazz silently wrapped values >= 2^63 to negative i64.
  Sinks map to NUMERIC(20,0) (PG/SF) or LARGEINT (StarRocks);
  value_to_json stringifies to avoid downstream precision loss.
- Replace the temporary ISO-string TIMESTAMP shim from PR #97 with
  Value::Timestamp(micros). DATETIME(p) microseconds are preserved.
- Schema introspection reads NUMERIC_PRECISION / NUMERIC_SCALE /
  COLUMN_TYPE; DECIMAL columns land as DataType::Decimal{p,s} instead
  of the hardcoded (38,9).
- MySQL converter routes MYSQL_TYPE_NEWDECIMAL / MYSQL_TYPE_DECIMAL
  bytes through Value::Decimal instead of Value::String.
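
As a rough illustration of why value_to_json stringifies UInt64, the sketch below shows what happens when a large integer passes through an IEEE 754 double, which is how many JSON consumers parse numbers. The helper names are hypothetical and are not taken from the dbmazz source.

```rust
/// Emits the value as a quoted JSON string so consumers that parse
/// numbers as doubles cannot round it.
pub fn uint64_to_json(v: u64) -> String {
    format!("\"{v}\"")
}

/// Simulates a consumer that reads the number into an f64 and back.
/// Doubles carry a 53-bit significand, so larger integers may be rounded.
pub fn through_f64(v: u64) -> u64 {
    (v as f64) as u64
}
```

For example, `through_f64(9_007_199_254_740_993)` (2^53 + 1) comes back as `9_007_199_254_740_992`, whereas the stringified form keeps every digit.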

## First-run bootstrap (H5)

On a fresh start with no checkpoint, SHOW MASTER STATUS is captured
BEFORE the snapshot worker spawns and persisted as a PROVISIONAL
checkpoint. The post-snapshot CDC stream resumes from that point —
no more replaying days of binlogs, no hard error if old binlogs were
purged. The first commit promotes the row to ACTIVE.

dbmazz_checkpoints gains a nullable status column (idempotent
migration via SHOW COLUMNS probe).

## Snapshot performance (M3)

Replaces MIN(pk)/MAX(pk) + linear partitioning with cursor-based
keyset paging: SELECT pk WHERE pk > ? ORDER BY pk LIMIT chunk_size+1.
Each chunk has bounded row count regardless of PK density. Sparse
distributions (gaps from DELETEs / auto-increment skips) no longer
produce empty / oversized chunks.
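
A minimal in-memory sketch of the keyset loop, assuming integer PKs and emulating the query above over a sorted slice instead of a live MySQL connection. `fetch_keys` and `plan_chunks` are hypothetical names; the extra row requested by `chunk_size + 1` is used only to detect whether another chunk follows.

```rust
/// Emulates `SELECT pk FROM t WHERE pk > ? ORDER BY pk LIMIT ?`
/// over an already sorted slice of primary keys.
fn fetch_keys(sorted_pks: &[i64], after: Option<i64>, limit: usize) -> Vec<i64> {
    sorted_pks
        .iter()
        .copied()
        .filter(|pk| after.map_or(true, |a| *pk > a))
        .take(limit)
        .collect()
}

/// Returns inclusive (first_pk, last_pk) bounds for each chunk.
/// Every chunk holds at most `chunk_size` rows, however sparse the keys are.
pub fn plan_chunks(sorted_pks: &[i64], chunk_size: usize) -> Vec<(i64, i64)> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    let mut chunks = Vec::new();
    let mut cursor: Option<i64> = None;
    loop {
        // Ask for one extra key: its presence means more rows remain.
        let keys = fetch_keys(sorted_pks, cursor, chunk_size + 1);
        if keys.is_empty() {
            break;
        }
        let has_more = keys.len() > chunk_size;
        let page = &keys[..keys.len().min(chunk_size)];
        let last = page[page.len() - 1];
        chunks.push((page[0], last));
        if !has_more {
            break;
        }
        // The next page starts strictly after the last key of this one.
        cursor = Some(last);
    }
    chunks
}
```

With keys `[1, 2, 3, 1000, 1_000_000, 1_000_001, 9_000_000]` and `chunk_size = 3`, this produces three chunks of 3, 3 and 1 rows. A linear MIN/MAX split of the same range would put most rows in the first partition and leave others empty.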

## Non-integer PK support (M4)

find_mysql_integer_pk replaced with find_mysql_pk, dispatching on
DataType: Int*/UInt64 -> PkKind::Int|UInt, String/Text/Uuid ->
PkKind::Str (with COLLATE utf8mb4_bin for deterministic ordering),
Bytes -> PkKind::Bytes. Composite PK and no-PK tables are skipped
with WARN logs.
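
The dispatch can be sketched roughly as follows. The enums are trimmed, hypothetical stand-ins for the real `DataType` and `PkKind`, and the split of signed integers to `Int` and `UInt64` to `UInt` is an assumption read from the description above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    Int32,
    Int64,
    UInt64,
    String,
    Text,
    Uuid,
    Bytes,
    Float64,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PkKind {
    Int,
    UInt,
    Str,
    Bytes,
}

/// Returns the snapshot PK kind, or None when the table should be skipped
/// (no PK, composite PK, or an unsupported single-column type).
pub fn classify_pk(pk_columns: &[DataType]) -> Option<PkKind> {
    match pk_columns {
        [single] => match *single {
            DataType::Int32 | DataType::Int64 => Some(PkKind::Int),
            DataType::UInt64 => Some(PkKind::UInt),
            DataType::String | DataType::Text | DataType::Uuid => Some(PkKind::Str),
            DataType::Bytes => Some(PkKind::Bytes),
            _ => None,
        },
        _ => None,
    }
}

/// String keys get a binary collation so ordering is byte-wise and does not
/// depend on the column's own collation (for example case-insensitive ones).
pub fn order_by_clause(column: &str, kind: PkKind) -> String {
    match kind {
        PkKind::Str => format!("ORDER BY `{column}` COLLATE utf8mb4_bin"),
        _ => format!("ORDER BY `{column}`"),
    }
}
```

A caller would log a WARN and skip the table whenever `classify_pk` returns `None`. For the paging cursor to stay consistent, the `WHERE pk > ?` comparison presumably needs the same collation as the `ORDER BY`.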

dbmazz_snapshot_state gains typed columns (start_pk_text, end_pk_text,
pk_kind) with backfill from legacy i64 columns. Idempotent.

## Cleanup

- Drop dead trait methods: Source::start_replication,
  checkpoint_position, cleanup (engine uses create_loop only).
- Drop SinkResult::last_position (populated but never consumed; LSN
  flows through PipelineEvent::lsn).

## Verify matrix (dbmazz-mysql:dev, v2.5.0)

- PG -> PG:        18/0/0
- PG -> StarRocks: 17/0/1 (A4 skip expected; SR has no metadata table)
- MySQL -> PG:     18/0/0
- MySQL -> SR:     17/0/1

Zero regressions, full Tier 1 green including C11 type roundtrip
against MySQL sources.