
feat(pg-pg): Automated schema dump mode#4283

Open
Amogh-Bharadwaj wants to merge 21 commits into main from pg-pg/schema-migration

Conversation

@Amogh-Bharadwaj
Contributor

@Amogh-Bharadwaj Amogh-Bharadwaj commented May 6, 2026

Why

  • For Postgres-to-Postgres migration use cases, users prefer an exact mirror of their source setup on the target instance.
  • pg_dump is a well-established way to migrate only the schema.

What

  • This PR adds an activity in SetupFlowWorkflow which runs pg_dump on the source database and pipes its output to the target database via psql.
  • It adds a new dynamic setting that gates this activity. The activity is further restricted to mirrors that use the PG type system.
  • When the setting is enabled, the destination validation and destination create-normalized-tables steps are skipped.
  • It sets --no-owner and --no-privileges, leaving it to the user to add the desired roles later. Roles are not migrated with pg_dumpall because it requires an exact major version match between pg_dumpall and the target, which we cannot guarantee.
  • It adds E2E tests.

@Amogh-Bharadwaj Amogh-Bharadwaj force-pushed the pg-pg/schema-migration branch from 4c76639 to c6688d5 Compare May 7, 2026 05:07
@codecov

codecov Bot commented May 7, 2026

❌ 2 Tests Failed:

Tests completed   Failed   Passed   Skipped
2243              2        2241     203
View the top 2 failed test(s) by shortest run time
github.com/PeerDB-io/peerdb/flow/e2e::TestApiPg
Stack Traces | 0.01s run time
=== RUN   TestApiPg
=== PAUSE TestApiPg
=== CONT  TestApiPg
--- FAIL: TestApiPg (0.01s)
2026/05/13 18:52:34 INFO Received AWS credentials from peer for connector: ci x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/05/13 18:52:34 INFO Received AWS credentials from peer for connector: clickhouse x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
github.com/PeerDB-io/peerdb/flow/e2e::TestApiPg/TestResyncCompleted
Stack Traces | 0.09s run time
=== RUN   TestApiPg/TestResyncCompleted
=== PAUSE TestApiPg/TestResyncCompleted
=== CONT  TestApiPg/TestResyncCompleted
2026/05/13 18:50:49 INFO Received AWS credentials from peer for connector: ci x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/05/13 18:50:49 INFO Received AWS credentials from peer for connector: clickhouse x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
    api_test.go:1302: 
        	Error Trace:	.../flow/e2e/api_test.go:1302
        	Error:      	Received unexpected error:
        	            	failed to get workflow_id for flow resync_completed_api_3hfhot8l: FATAL: terminating connection due to administrator command (SQLSTATE 57P01)
        	Test:       	TestApiPg/TestResyncCompleted
    api_test.go:49: begin tearing down postgres schema api_3hfhot8l
--- FAIL: TestApiPg/TestResyncCompleted (0.09s)


@github-actions
Contributor

github-actions Bot commented May 7, 2026

❌ Test Failure

Analysis: Consistent, deterministic failures across all CI matrix jobs caused by pg_dumpall: error: aborting because of server version mismatch in Test_PG_Schema_Dump_Role_Migration, with cascading ~31s timeouts in multiple other TestPeerFlowE2ETestSuitePG tests — a real environment/version mismatch bug, not flaky.
Confidence: 0.93

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@Amogh-Bharadwaj Amogh-Bharadwaj force-pushed the pg-pg/schema-migration branch from c6688d5 to d26ec76 Compare May 7, 2026 07:47
@github-actions
Contributor

github-actions Bot commented May 7, 2026

🔄 Flaky Test Detected

Analysis: Multiple e2e tests in TestPeerFlowE2ETestSuitePG failed with "UNEXPECTED STATUS TIMEOUT STATUS_SETUP" at ~31 seconds across all CI matrix jobs, indicating the test infrastructure was unable to complete setup in time rather than a code regression — consistent with CI resource contention when running 32 parallel tests.
Confidence: 0.92

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

github-actions Bot commented May 7, 2026

❌ Test Failure

Analysis: Multiple TestPeerFlowE2ETestSuitePG tests consistently fail with UNEXPECTED STATUS TIMEOUT STATUS_SETUP across all three CI matrix variants, suggesting the recent docker-compose stable image tag upgrade broke flow setup rather than a random flaky timeout.
Confidence: 0.78

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@github-actions
Contributor

github-actions Bot commented May 7, 2026

🔄 Flaky Test Detected

Analysis: The e2e test suite timed out after exactly 900s (the configured limit) on the MariaDB 8.0 matrix job, with no specific assertion failures — a classic symptom of CI infrastructure slowness or a hanging test rather than a code regression.
Confidence: 0.88

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

github-actions Bot commented May 7, 2026

❌ Test Failure

Analysis: All PG e2e tests consistently time out on STATUS_SETUP across every CI matrix variant, indicating the pg-pg/schema-migration PR likely introduced a regression that causes PostgreSQL flow setup to stall or exceed the 30-second per-test timeout.
Confidence: 0.72

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@github-actions
Contributor

github-actions Bot commented May 7, 2026

🔄 Flaky Test Detected

Analysis: All 7–8 failures across every matrix shard show "UNEXPECTED STATUS TIMEOUT STATUS_SETUP", meaning CDC workflows timed out during Temporal workflow setup — a classic infrastructure/timing flake, not a logic regression.
Confidence: 0.92

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

github-actions Bot commented May 7, 2026

❌ Test Failure

Analysis: The same set of e2e PostgreSQL tests (TestPeerFlowE2ETestSuitePG) fail at ~31 seconds across all three CI matrix jobs, pointing to a systematic regression — likely a service initialization timeout introduced by the recent docker-compose image tag upgrades — rather than a flaky failure.
Confidence: 0.82

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@github-actions
Contributor

github-actions Bot commented May 7, 2026

🔄 Flaky Test Detected

Analysis: Three separate e2e matrix jobs failed due to timing/timeout issues: one run hit the 15-minute test suite timeout causing a cascade panic, while the other two had individual WaitFor-based tests fail in different tests across different matrix configurations, with no consistent assertion error pointing to a real code regression.
Confidence: 0.9

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

github-actions Bot commented May 7, 2026

❌ Test Failure

Analysis: The same two PG schema dump tests fail with STATUS_SETUP timeouts across all CI matrix configurations, consistent with a real regression introduced by the recent Docker image tag upgrade (PR #4285) rather than random flakiness.
Confidence: 0.78

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@github-actions
Contributor

github-actions Bot commented May 7, 2026

🔄 Flaky Test Detected

Analysis: Both failing tests (Test_PG_Schema_Dump_No_Owner_No_Privileges and Test_PG_Schema_Dump_And_CDC) hit "UNEXPECTED STATUS TIMEOUT STATUS_SETUP" across two independent matrix configurations, indicating the Temporal workflow setup phase exceeded the wait deadline due to CI resource contention rather than a code regression.
Confidence: 0.9

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

github-actions Bot commented May 7, 2026

🔄 Flaky Test Detected

Analysis: Both failing tests (Test_PG_Schema_Dump_No_Owner_No_Privileges and Test_PG_Schema_Dump_And_CDC) hit UNEXPECTED STATUS TIMEOUT STATUS_SETUP — a Temporal workflow setup poll timeout — across all CI matrix variants, while the triggering commit only changed unrelated ClickHouse GCS staging code, making this a timing/environment flake rather than a real regression.
Confidence: 0.8

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

github-actions Bot commented May 7, 2026

🔄 Flaky Test Detected

Analysis: All three failing tests hit setup/propagation timeouts ("UNEXPECTED STATUS TIMEOUT STATUS_SETUP", repeated record-count polling mismatches, ClickHouse table-not-found) that are consistent with CI resource contention in a parallelized e2e suite, not a logic regression from the unrelated ClickHouse GCS staging commit.
Confidence: 0.85

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

github-actions Bot commented May 7, 2026

❌ Test Failure

Analysis: CI setup fails deterministically with "E: Unable to locate package postgresql-client-18" because the PostgreSQL apt repository is not configured on the runner, so no tests execute at all.
Confidence: 0.95

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@github-actions
Contributor

github-actions Bot commented May 7, 2026

🔄 Flaky Test Detected

Analysis: All failures are "UNEXPECTED STATUS TIMEOUT STATUS_SETUP" errors in Temporal-orchestrated e2e tests unrelated to the merged ClickHouse change, consistent with CI resource saturation causing setup phase timeouts.
Confidence: 0.9

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

github-actions Bot commented May 7, 2026

🔄 Flaky Test Detected

Analysis: Both failing tests (Test_PG_Schema_Dump_And_CDC and Test_PG_Schema_Dump_No_Owner_No_Privileges) hit UNEXPECTED STATUS TIMEOUT STATUS_SETUP — a timing failure where the Temporal workflow's setup phase didn't complete within the wait window — which is a classic flaky pattern in e2e tests under CI load, unrelated to the latest ClickHouse staging commit.
Confidence: 0.75

✅ Automatically retrying the workflow

View workflow run

@Amogh-Bharadwaj Amogh-Bharadwaj force-pushed the pg-pg/schema-migration branch from 6cc1c5e to 97a5fc4 Compare May 7, 2026 15:59
@github-actions
Contributor

github-actions Bot commented May 7, 2026

🔄 Flaky Test Detected

Analysis: The test TestApiMongo/TestCancelTableAddition_NoRemovalAssumed failed with PostgreSQL SQLSTATE 57P01 (admin_shutdown), meaning a DB connection was terminated by an external admin command mid-query, an infrastructure event unrelated to the test logic.
Confidence: 0.95

✅ Automatically retrying the workflow

View workflow run

@Amogh-Bharadwaj Amogh-Bharadwaj requested a review from serprex May 7, 2026 17:15
@github-actions
Contributor

github-actions Bot commented May 7, 2026

❌ Test Failure

Analysis: Real build failure: e2e/pg_schema_dump_test.go:351 references internal.GetSecondaryPostgresConfigFromEnv which does not exist in the codebase, causing a compilation error.
Confidence: 0.98

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@Amogh-Bharadwaj Amogh-Bharadwaj marked this pull request as ready for review May 7, 2026 17:19
@Amogh-Bharadwaj Amogh-Bharadwaj requested a review from a team as a code owner May 7, 2026 17:19
@claude

claude Bot commented May 7, 2026

Code Review

Bug found in flow/workflows/setup_flow.go line 316: CreateNormalizedTable skipped even when pg_dump silently does nothing

When the PEERDB_PG_AUTOMATED_SCHEMA_DUMP flag is enabled but the source or destination peer uses SSH tunnels or non-password auth, RunPgDumpSchema silently returns nil without actually running pg_dump.

However, enablePgSchemaDump is still true (set based on the flag alone), so CreateNormalizedTable is also skipped. The result is that no tables are created on the destination -- neither via pg_dump nor via the normal CreateNormalizedTable path.

Suggested fix: Have RunPgDumpSchema return a (bool, error) instead of just error, where the bool indicates whether pg_dump actually ran. Then use that bool (rather than enablePgSchemaDump) as the skipCreateTables argument.

Also checked for CLAUDE.md compliance -- no violations found.

@Amogh-Bharadwaj Amogh-Bharadwaj changed the title from "WIP: PG - PG schema migration" to "feat(pg-pg): Automated schema dump mode" May 8, 2026
@Amogh-Bharadwaj Amogh-Bharadwaj requested a review from a team as a code owner May 8, 2026 13:07
Comment thread flow/workflows/setup_flow.go
@Amogh-Bharadwaj Amogh-Bharadwaj force-pushed the pg-pg/schema-migration branch from 323877a to 5aeaf1e Compare May 8, 2026 14:34
@github-actions
Contributor

❌ Test Failure

Analysis: Real bug: TestRunPipeline_ContextCancel fails consistently in all 3 CI matrix jobs with "runPipeline did not return after context cancel", almost certainly caused by the errors.AsType change in commit 55ec870 breaking context-cancellation error detection in the postgres pipeline connector.
Confidence: 0.95

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@github-actions
Contributor

❌ Test Failure

Analysis: Test_PartitionBy fails deterministically across all 4 ClickHouse test suites with the same assertion mismatch (expected "num" but got "(num)" for the partition_key column), indicating a real regression in how partition keys are generated or reported, not a timing/flaky issue.
Confidence: 0.85

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@github-actions
Contributor

❌ Test Failure

Analysis: Test_PartitionBy deterministically fails across all 4 ClickHouse test suites because ClickHouse now returns "(num)" instead of "num" for partition_key in system.tables, indicating a real behavioral regression rather than a flaky failure.
Confidence: 0.88

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@Amogh-Bharadwaj Amogh-Bharadwaj force-pushed the pg-pg/schema-migration branch from 80e2bf7 to b90d3a1 Compare May 13, 2026 18:40
@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: All failures trace back to FATAL: terminating connection due to administrator command (SQLSTATE 57P01) — a PostgreSQL catalog connection was killed mid-run by an admin command/shutdown event, which is an infrastructure flake unrelated to any code change.
Confidence: 0.93

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: The e2e test suite hit the hard 900-second timeout (ran 900.644s) with no assertion failures, indicating a slow or hanging test rather than a code regression.
Confidence: 0.92

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: The e2e test package timed out at exactly the 900s hard limit with no assertion failures, indicating a flaky infrastructure/timing issue rather than a code regression.
Confidence: 0.9

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

❌ Test Failure

Analysis: All 4 NullEngine test variants fail deterministically at the resync/initial-snapshot stage because data inserted during snapshot doesn't flow through the ClickHouse NullEngine materialized view to the target table, indicating a real bug not a flaky failure.
Confidence: 0.85

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@Amogh-Bharadwaj Amogh-Bharadwaj force-pushed the pg-pg/schema-migration branch from b90d3a1 to 9471f04 Compare May 14, 2026 10:18
3 participants