feat(commonware): upgrade to 2026.5.0#1

Draft
erenyegit wants to merge 1 commit into Nunchi-trade:main from erenyegit:feat/upgrade-commonware-2026.5.0

What this does

Migrates the kora workspace from commonware 2026.4.0 to the 2026.5.0 release.

Why it's a draft

kora-e2e tests::consensus::test_empty_blocks (and any e2e test that uses TestHarness::run) overflows the test thread stack post-upgrade. The build is clean and the trusted-devnet smoke run finalizes, but this is a real regression I want to leave visible while we decide on the right fix. See the "Known issue" section below.

Major API migrations

  • Storage: merkle::journaled -> merkle::full. QMDB VariableConfig and Db gain an S: Strategy generic parameter (Sequential is used here).
  • Runtime: tokio::Context no longer implements Clone. with_label and clone replaced with Supervisor::child("label"). Metrics::register now returns Registered<M>.
  • Reporter / Blocker / Manager: now sync, return commonware_actor::Feedback instead of impl Future. Manual Clone impls on FinalizedReporter, SeedReporter, and CommonwareRootProvider use context.child("xxx_clone").
  • Application trait: genesis removed. verify moved into Application (VerifyingApplication trait removed). propose/verify take impl Ancestry<Self::Block> instead of AncestorStream<A, B> with BlockProvider.
  • Config additions: simplex::Config gained floor: Floor<S, D>; marshal::Config gained start: Start<P::Scheme, B::Digest, B>. ActorInitializer::init / init_with_partition now take a start parameter.
  • Net: Sender::send returns Vec<PublicKey>; .await dropped on sync API calls.
  • Ed25519: internals are vendored. PrivateKey constructed via ReadExt::read (was ed25519_consensus::SigningKey::from(seed)).
  • DKG: bls12381::dkg types moved under the feldman_desmedt submodule.

Cross-referenced will's fix/commonware-resolver-upgrade branch on Nunchi-trade/daeji as a migration reference. simplex config/engine, marshal peers/broadcast, and transport-sim are copied wholesale from there.

Validation

  • cargo build --workspace clean
  • cargo test --workspace --no-run compiles all test targets
  • kora-marshal integration tests pass
  • just trusted-devnet brings up 4 validators + 1 secondary peer all healthy and finalizes at around 3 blocks/sec on this (I/O-degraded) host

Known issue: e2e stack overflow

tests::consensus::test_empty_blocks and any other test that goes through TestHarness::run overflows the test thread's 8MB stack.

Likely cause: our manual Clone impls on FinalizedReporter, SeedReporter, and CommonwareRootProvider call context.child("xxx_clone") because released 2026.5.0 no longer has Context: Clone. Over the test's event-loop iterations the supervision tree appears to grow until the default stack is exhausted.
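
To make the suspected growth pattern concrete, here is a minimal, self-contained sketch. ToyContext, ChildPerClone, and SharedContext are hypothetical stand-ins invented for illustration; none of them is the commonware Context or Supervisor API. The model only counts tree nodes and tracks depth. It does not reproduce the overflow, and it assumes clones are made in a chain (a clone of a clone), which may or may not match what TestHarness::run does.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a runtime context. NOT the commonware type.
struct ToyContext {
    /// Distance from the root of the toy supervision tree.
    depth: usize,
    /// Count of every node created in this tree, shared by all nodes.
    nodes: Arc<AtomicUsize>,
}

impl ToyContext {
    fn root() -> Self {
        Self { depth: 0, nodes: Arc::new(AtomicUsize::new(1)) }
    }

    /// Mimics a `child("label")`-style call: each call adds one node.
    fn child(&self) -> Self {
        self.nodes.fetch_add(1, Ordering::Relaxed);
        Self { depth: self.depth + 1, nodes: Arc::clone(&self.nodes) }
    }

    fn node_count(&self) -> usize {
        self.nodes.load(Ordering::Relaxed)
    }
}

/// Pattern A: every clone derives a fresh child of the context it holds.
struct ChildPerClone {
    context: ToyContext,
}

impl Clone for ChildPerClone {
    fn clone(&self) -> Self {
        Self { context: self.context.child() }
    }
}

/// Pattern B: clones share one context behind an `Arc`.
#[derive(Clone)]
struct SharedContext {
    context: Arc<ToyContext>,
}

/// Clones a Pattern A value `n` times in a chain; returns (nodes, depth).
fn grow_child_per_clone(n: usize) -> (usize, usize) {
    let root = ToyContext::root();
    let mut reporter = ChildPerClone { context: root.child() };
    for _ in 0..n {
        reporter = reporter.clone();
    }
    (root.node_count(), reporter.context.depth)
}

/// Clones a Pattern B value `n` times in a chain; returns (nodes, depth).
fn grow_shared(n: usize) -> (usize, usize) {
    let root = ToyContext::root();
    let mut reporter = SharedContext { context: Arc::new(root.child()) };
    for _ in 0..n {
        reporter = reporter.clone();
    }
    (root.node_count(), reporter.context.depth)
}

fn main() {
    println!("child per clone: {:?}", grow_child_per_clone(1_000));
    println!("shared context:  {:?}", grow_shared(1_000));
}
```

In this toy model the child-per-clone pattern adds one node and one level of depth per clone, while the shared pattern stays at a fixed size. If clones instead fan out from a single original, the node count grows the same way but the depth does not, so which shape the real code produces is worth checking before picking a fix.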

daeji avoids this by patching commonware to main (where Context: Clone still exists). Three follow-ups under consideration:

  1. Refactor the Clone impls to use Arc<Context> or similar so cloning doesn't grow the supervision tree.
  2. Match daeji and apply [patch.crates-io] to point commonware at main.
  3. Bump RUST_MIN_STACK for the test binary.
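
For option 3, a per-test alternative to the RUST_MIN_STACK environment variable is to run the test body on a thread with an explicit stack size via std::thread::Builder::stack_size, which overrides RUST_MIN_STACK for that thread. The sketch below uses only the standard library; run_with_stack and depth_sum are hypothetical helpers, and depth_sum is just a stack-hungry placeholder for whatever the harness actually does.

```rust
use std::thread;

/// Runs `f` on a dedicated thread with `stack_bytes` of stack and
/// returns its result. Hypothetical helper, not part of kora.
fn run_with_stack<T, F>(stack_bytes: usize, f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(f)
        .expect("failed to spawn thread")
        .join()
        .expect("thread panicked")
}

/// Non-tail recursion standing in for a deep call path.
fn depth_sum(n: u64) -> u64 {
    if n == 0 {
        0
    } else {
        n + depth_sum(n - 1)
    }
}

fn main() {
    // 64 MiB is an arbitrary illustrative size, not a recommendation.
    let total = run_with_stack(64 * 1024 * 1024, || depth_sum(100_000));
    println!("sum = {total}");
}
```

If the supervision tree really does grow with each iteration, a larger stack would presumably only delay the overflow rather than remove it, so this is best read as a diagnostic aid or stopgap.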

Per request, I am not marking the failing test #[ignore], so the regression stays visible. For context, 18 other kora-e2e tests are already marked #[ignore] as pre-existing flaky-in-parallel tests.

Follow-ups (separate PRs)

  • Resolve the stack overflow (one of the three options above).
  • Stability + tx load pass with the new mempool (spamoor-style).

Migrate the workspace from commonware 2026.4.0 to the 2026.5.0 release.

Major API migrations addressed:
- Storage: merkle::journaled -> merkle::full, QMDB VariableConfig and Db
  gain S: Strategy generic (Sequential)
- Runtime: tokio::Context no longer implements Clone; with_label and
  clone replaced with Supervisor::child("label"); Metrics::register now
  returns Registered<M>
- Reporter, Blocker, and Manager are sync and return commonware_actor
  Feedback instead of impl Future; manual Clone impls on FinalizedReporter,
  SeedReporter, and CommonwareRootProvider use context.child("clone_label")
- Application trait: genesis method removed; verify moved into Application
  (VerifyingApplication trait removed); propose/verify now take impl
  Ancestry<Self::Block> instead of AncestorStream<A, B> with BlockProvider
- simplex::Config gained floor: Floor<S, D>; marshal::Config gained
  start: Start<P::Scheme, B::Digest, B>; marshal ActorInitializer init
  and init_with_partition now take a start parameter
- Sender::send returns Vec<PublicKey>; .await dropped on sync API calls
- Ed25519 internals vendored: PrivateKey constructed via ReadExt::read
  instead of ed25519_consensus::SigningKey::from(seed)
- bls12381::dkg types moved under feldman_desmedt submodule

Cross-referenced will's fix/commonware-resolver-upgrade branch on
Nunchi-trade/daeji as a migration reference; simplex config/engine,
marshal peers/broadcast, and transport-sim copied wholesale from there.

Validation:
- cargo build --workspace and cargo test --workspace --no-run pass
- kora-marshal integration tests pass
- just trusted-devnet boots 4 validators + 1 secondary peer healthy and
  finalizes at around 3 blocks/sec on an I/O-degraded host

Known issue:
kora-e2e tests::consensus::test_empty_blocks (and any e2e test using
TestHarness::run) overflows the test thread stack post-upgrade. Likely
cause: manual Clone impls in reporters and root-provider use
context.child("xxx_clone") since released 2026.5.0 does not have
Context: Clone, and the supervision tree appears to grow over the test's
event-loop iterations until the default 8MB test stack is exhausted.
Daeji avoids this by patching commonware to main (where Context: Clone
still exists). Three follow-ups under consideration: (a) refactor the
Clone impls to Arc<Context> or similar to avoid tree growth, (b) match
daeji and apply [patch.crates-io] to main, (c) bump RUST_MIN_STACK in
test runners. Not adding #[ignore] markers per guidance to let the
failure be visible. 18 other kora-e2e tests are already marked
#[ignore] as pre-existing flaky-in-parallel.
erenyegit marked this pull request as draft on May 29, 2026 16:08