feat(core): implement `core/scheduler` by emlautarom1 · Pull Request #466 · NethermindEth/pluto

emlautarom1 · 2026-06-01T16:45:16Z

Closes #176

Uses an Actor model to implement the core/scheduler module, solving some concurrency hacks that the original implementation had.

Metrics have been ported using the existing Vise patterns but are untested.


emlautarom1 · 2026-06-01T17:07:57Z

@claude review

claude · 2026-06-01T17:08:17Z

Claude finished @emlautarom1's task in 4m 19s

Review: `core/scheduler` Implementation

Nicely structured port — the actor model is a clean fit for Rust and the test suite covers the important branches. Below are the issues I'd want addressed before merge, ordered roughly by importance.

1. Functional regression: `GetDutyDefinition` 100 ms timeout swallows legitimate waits

`SchedulerHandle::get_duty_definition` (crates/core/src/scheduler.rs:255-270) bounds the response to 100 ms total. Inside the actor, `SchedulerActor::get_duty_definition` (crates/core/src/scheduler.rs:344-370) issues `self.client.fetch_slots_config().await?`, a beacon-node network call, before it can answer.

In Charon, `GetDutyDefinition` runs under an `RWMutex` and polls `isResolvingEpoch` until the resolution finishes (scheduler.go:185-191). Here the actor model serialises the resolution against the query, and the comment at line 264 ("This has to be very rare event…") has been carried over from a context where the call was free into one where every call pays a network round-trip. Under a slow beacon node the response will simply be `TimeoutError`.

Two things to fix together:

- Don't put `fetch_slots_config` behind the timeout. Either fetch `slots_per_epoch` once when the scheduler starts (it doesn't change at runtime in the relevant configs and is already loaded by `new_slot_ticker`) or cache it on the actor and avoid hitting the API on every `GetDutyDefinition`.
- Reconsider the 100 ms timeout. With network I/O removed from the hot path, `get_duty_definition` is essentially synchronous on the actor; the only real wait is "queued behind another message". Either drop the timeout (the actor responds quickly) or let the caller pass one. As written, callers can't distinguish "scheduler busy" from "duty truly absent".
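A minimal sketch of the caching option from the first item, using only the standard library. `SlotsConfigCache`, `get_or_fetch`, and the closure standing in for `fetch_slots_config` are hypothetical names for illustration, not the PR's API; the real call is async and returns the client's own error type.

```rust
/// Hypothetical cache for a value that does not change at runtime.
#[derive(Debug, Default)]
struct SlotsConfigCache {
    slots_per_epoch: Option<u64>,
}

impl SlotsConfigCache {
    /// Returns the cached value, calling `fetch` only when nothing is cached.
    /// A failed fetch is not cached, so the next call retries.
    fn get_or_fetch<E>(
        &mut self,
        fetch: impl FnOnce() -> Result<u64, E>,
    ) -> Result<u64, E> {
        if let Some(cached) = self.slots_per_epoch {
            return Ok(cached);
        }
        // Assumption: in the actor this would be the beacon-node call.
        let fetched = fetch()?;
        self.slots_per_epoch = Some(fetched);
        Ok(fetched)
    }
}
```

With something along these lines stored on the actor, only the first query (or a warm-up at startup) pays the network round-trip; later queries are answered from memory.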

Also note: SchedulerError::EpochNotResolved drops the duty field that Charon includes in the log (scheduler.go:194-196); add it for parity.

2. `new_slot_ticker` ignores cancellation during the inter-slot sleep

crates/core/src/scheduler.rs:627-655:

let wait = slot.time.signed_duration_since(chrono::Utc::now()).to_std().unwrap_or_default();
tokio::time::sleep(wait).await;

The Go reference uses `select` between `ctx.Done()` and the clock (scheduler.go:663-667), so cancellation interrupts the sleep. Here the cancel-token is only consulted after the sleep, when sending into the channel, so shutting down mid-slot blocks the ticker task for up to a full slot duration (~12 s). Wrap the sleep in `tokio::select! { _ = ct.cancelled() => break, _ = tokio::time::sleep(wait) => {} }`.
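For readers less familiar with the pattern, here is a standard-library analogue of an interruptible sleep: the wait ends early when a cancel signal arrives. It swaps in `std::sync::mpsc` for the PR's cancellation token and a blocking thread for the async task, so `sleep_or_cancel` and `Wake` are assumed names and this is not the code the PR would contain.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Why the wait ended.
#[derive(Debug, PartialEq, Eq)]
enum Wake {
    Elapsed,
    Cancelled,
}

/// Blocks for at most `wait`, returning early if a cancel signal arrives.
/// A dropped sender is treated as cancellation too.
fn sleep_or_cancel(cancel: &Receiver<()>, wait: Duration) -> Wake {
    match cancel.recv_timeout(wait) {
        Err(RecvTimeoutError::Timeout) => Wake::Elapsed,
        Ok(()) | Err(RecvTimeoutError::Disconnected) => Wake::Cancelled,
    }
}
```

A ticker loop built on this would `break` on `Wake::Cancelled` and emit the slot on `Wake::Elapsed`, which is the same shape as the two `select!` arms.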

3. Reorg handler runs unconditionally — Charon gates it on a feature flag

crates/core/src/scheduler.rs:328 / handle_chain_reorg runs whenever a value arrives on reorg_rx. Charon checks featureset.Enabled(featureset.SSEReorgDuties) (scheduler.go:142-154) and warns when the feature is disabled. The comment on with_chain_reorgs ("SSE feature check should be done by the caller") punts this to the caller, but:

- The behaviour is silent: nothing logs an ignored reorg.
- Anyone who calls `with_chain_reorgs` without performing the check changes scheduler behaviour without any warning.

Either re-add the featureset check inside the scheduler (matching Charon) or document the requirement loudly. At minimum, log when a reorg is observed but the resolved epoch is u64::MAX (currently the early-return at line 330 is silent).
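One way to make both ignore paths observable is to route them through a single decision function that logs before returning. The sketch below is an assumption-laden illustration: `ReorgDecision`, `decide_reorg`, and the `feature_enabled` flag are hypothetical, and `eprintln!` stands in for whatever logging the crate uses.

```rust
/// Hypothetical outcome of inspecting a reorg event.
#[derive(Debug, PartialEq, Eq)]
enum ReorgDecision {
    IgnoredFeatureDisabled,
    IgnoredNothingResolved,
    Handle,
}

/// Decides what to do with a reorg event and logs every ignored case.
/// `resolved_epoch == u64::MAX` is the "nothing resolved yet" sentinel
/// mentioned in the review.
fn decide_reorg(feature_enabled: bool, resolved_epoch: u64) -> ReorgDecision {
    if !feature_enabled {
        eprintln!("warn: chain reorg observed but the reorg feature is disabled; ignoring");
        return ReorgDecision::IgnoredFeatureDisabled;
    }
    if resolved_epoch == u64::MAX {
        eprintln!("warn: chain reorg observed before any epoch was resolved; ignoring");
        return ReorgDecision::IgnoredNothingResolved;
    }
    ReorgDecision::Handle
}
```

Returning a value rather than logging inline also makes the gating easy to unit-test.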

4. Subscribers terminate permanently on a single lag

subscribe_slot and subscribe_duty (crates/core/src/scheduler.rs:111-173) drop the subscription forever on RecvError::Lagged. This was a deliberate "NOTE: requires further analysis" call, but it means a single transient slow-down silently disables a downstream component for the rest of the process lifetime. Charon would not lose subscribers this way. Worth at minimum:

- Documenting this behaviour on the public method.
- Returning a handle (your TODO at line 118) so the caller can detect that its subscription died.

The 100-element broadcast channel buffer (lines 104-105) covers ~20 minutes of slots (100 slots at ~12 s each), so in practice lag means the subscriber is wedged. It could still happen under back-pressure, though, and disappearing without observability is a regression vs Charon.
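As a point of comparison, a tokio broadcast receiver that reports `Lagged(n)` can keep receiving afterwards, so one alternative policy is to log the gap and continue, stopping only on `Closed`. The sketch below models that control flow with the standard library alone: `RecvError` is a local stand-in mirroring the two variants of tokio's type, and `run_subscriber` and `SubscriberReport` are hypothetical names. It is not what the PR currently does, and whether skipping slots is acceptable is a separate decision.

```rust
/// Local stand-in mirroring the variants of tokio's broadcast `RecvError`.
#[derive(Debug)]
enum RecvError {
    Lagged(u64),
    Closed,
}

/// What the subscriber did, so the caller can observe skipped messages.
#[derive(Debug, Default, PartialEq, Eq)]
struct SubscriberReport {
    handled: Vec<u64>,
    skipped: u64,
}

/// Runs `callback` for every received slot; survives lag, stops on close.
fn run_subscriber(
    events: impl IntoIterator<Item = Result<u64, RecvError>>,
    mut callback: impl FnMut(u64),
) -> SubscriberReport {
    let mut report = SubscriberReport::default();
    for event in events {
        match event {
            Ok(slot) => {
                callback(slot);
                report.handled.push(slot);
            }
            Err(RecvError::Lagged(n)) => {
                // Make the gap visible instead of ending the subscription.
                eprintln!("warn: subscriber lagged, skipped {} messages", n);
                report.skipped += n;
            }
            Err(RecvError::Closed) => break,
        }
    }
    report
}
```

In the real task the iterator would be replaced by a `recv().await` loop, and the skipped count could feed a metric rather than a returned struct.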

5. `handle_chain_reorg` is `pub` but on a private struct

crates/core/src/scheduler.rs:328 is declared `pub async fn` inside `impl SchedulerActor`, but `SchedulerActor` itself is private. It should be `async fn`. Similar visibility mismatches elsewhere don't merit a separate point but are worth a sweep.

6. Redundant clones in `SchedulerBuilder::build`

crates/core/src/scheduler.rs:205 calls `new_slot_ticker(&client.clone(), ct.clone()).await?`. Calling `client.clone()` and then immediately taking a reference is wasteful, since `new_slot_ticker` takes `&EthBeaconNodeApiClient`. The next line re-clones into the actor, so the temporary clone is never used. Just pass `&client`.

7. `validator_status` reset path is `O(total_seen_validators)` per validator

crates/core/src/scheduler.rs:687-693 iterates every label combination ever recorded just to zero out the previous status for one pubkey. With a stable set this is fine, but for a long-running node with validator churn the linear scan stacks up (O(N²) per epoch). The comment correctly identifies that Vise can't delete labels — so the leak is unavoidable — but the iteration could keep an in-actor HashMap<PubKey, String> of last-known status to avoid the scan. Not blocking, but worth a TODO.
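A minimal sketch of the suggested last-known-status map, assuming pubkeys and statuses can be treated as strings. `LastStatus` and `record` are hypothetical names; the real code would use the crate's `PubKey` type and zero the corresponding gauge label with the value returned here.

```rust
use std::collections::HashMap;

/// Hypothetical tracker of the last status recorded per validator.
#[derive(Debug, Default)]
struct LastStatus {
    by_pubkey: HashMap<String, String>,
}

impl LastStatus {
    /// Stores `status` for `pubkey`. Returns the previous status only when
    /// it differs, i.e. the single label combination that needs zeroing.
    fn record(&mut self, pubkey: &str, status: &str) -> Option<String> {
        match self.by_pubkey.insert(pubkey.to_owned(), status.to_owned()) {
            Some(prev) if prev != status => Some(prev),
            _ => None,
        }
    }
}
```

Each update is then a single hash lookup instead of a scan over every label combination ever recorded.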

8. Style / nits

- `enum SchedulerMessage` at line 234 is missing the conventional blank line / `#[derive]` and is wedged against the `impl Default` block above. Add a blank line for readability.
- All error variants end in `Error` (`TimeoutError`, `EthBeaconNodeApiClientError`, …); clippy's `enum_variant_names` lint recommends the bare form (`Timeout`, `EthBeaconNodeApi`). This is a pre-existing pattern in the codebase, so consistency wins, but it could be cleaned up in a follow-up.
- `fast_backoff` is marked `// TODO: Duplicated from crates/p2p/src/bootnode.rs`; please file a follow-up issue or DRY this into a shared helper.
- The metrics module is named `metrics`, but the file declares `mod metrics;` and re-exports `SCHEDULER_METRICS` via `use crate::scheduler::metrics::SCHEDULER_METRICS;`. Fine, but the rest of the crate uses bare `metrics::` imports inline at call sites. Minor.

9. Tests

The test suite is solid, but three coverage gaps stand out:

- No test exercises the `delay_slot_offset` path with a non-past slot. All tests use `test_past_slot`, so the deadline has already elapsed and the wait is `Duration::ZERO`. Worth one slow-path test to make sure cancellation actually wins the `with_cancellation_token_owned` race.
- No test covers the `last_in_epoch` pre-resolution branch (crates/core/src/scheduler.rs:419-423). Worth one.
- `get_duty_definition` is tested in-actor but not via the `SchedulerHandle` 100 ms timeout path. Given point 1, a test that drives the timeout would surface the regression.

Summary

The actor model is the right call and the bulk of the logic mirrors Charon faithfully. The major concern is #1 (the 100 ms timeout combined with a network call inside the actor is a real behavioural regression). #2 and #3 are smaller but worth fixing for parity. Everything else is cleanup.

Read PR diff and changed files
Cross-reference with Charon Go implementation
Analyze functional correctness
Check Rust style and concurrency patterns
Review metrics implementation
Post review findings
• Branch: emlautarom1/core-scheduler

varex83 · 2026-06-02T13:07:38Z

 pluto-ssz.workspace = true
 ssz.workspace = true
 tree_hash.workspace = true
+anyhow.workspace = true


I think we can move it back to dev dependencies

varex83 · 2026-06-02T13:11:03Z

+    /// Subscribes a callback function for triggered slots.
+    pub fn subscribe_slot(
+        &mut self,
+        f: impl Fn(&types::Slot) -> Result<()> + Send + 'static,


Should we support async functions here as well?

Changed to support async callbacks. There is a tradeoff here to make:

If we await each callback, we can block the subscriber, meaning it could lag behind the producer (current behavior), but this ensures that callbacks are processed in order by a single running task.

If we run each callback on a separate task, we don't lag behind, but we have an unbounded number of running tasks and cannot ensure ordering (e.g. the callback for slot 1 might not complete before the callback for slot 2 starts).

Charon uses the second approach but I'm not sure it's the right one.

Note that if we just returned the Receiver channel to the caller then it would be a per-caller decision (more flexibility, we push the decision for later).

varex83 · 2026-06-02T13:12:13Z

+
+        // TODO: We might want to return a handle so clients can `.abort()` them to drop
+        // the subscription
+        tokio::spawn(async move {


I would pass a cancellation token here, or store the handles in the struct and then join them on drop

Not needed, actually: when the actor gets dropped, the `rx.recv()` call returns `Err(Closed)`, so the task exits automatically.

I'm thinking that maybe this approach of passing callbacks is not very good, and instead we should return the receiver obtained from `.subscribe()` and let the caller decide what to do (e.g. run it with `tokio::spawn`/`tokio::spawn_blocking`, handle cancellation, etc.)
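The "exits automatically when the sender side is dropped" behaviour, combined with handing the receiver to the caller, can be illustrated with a standard-library analogue. This uses `std::sync::mpsc` and a thread in place of tokio's broadcast channel and a spawned task, so it only shows the shape of the idea; `spawn_subscriber` is a hypothetical name.

```rust
use std::sync::mpsc;
use std::thread;

/// The caller owns the receiver and decides how to consume it. Here it
/// collects slots on a thread until the sending side goes away.
fn spawn_subscriber(rx: mpsc::Receiver<u64>) -> thread::JoinHandle<Vec<u64>> {
    thread::spawn(move || {
        let mut seen = Vec::new();
        // `recv` returns `Err` once every sender has been dropped,
        // so the loop ends without an explicit cancellation signal.
        while let Ok(slot) = rx.recv() {
            seen.push(slot);
        }
        seen
    })
}
```

The returned `JoinHandle` doubles as the handle discussed above: the caller can join it to learn that the subscription has ended.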

emlautarom1 added 30 commits May 25, 2026 13:55

Implement initial scheduler module

a620362

- Add `Scheduler` struct (WIP) - Add `new_slot_ticker`

Add wait_chain_start

8189f08

- Fix duplicated code

Add wait_beacon_sync

67c47fd

Move valcache from app to core

3d19f19

- `valcache` is required by scheduler - Fixes cyclical dependency issue

Add resolve_active_validators

e60be59

- Use `valcache` to get the list of active validators.

Implement run (WIP)

36ed2ba

Use broadcast channel for emitCoreSlot

218be9f

Add Inner

a1b62b4

- Manages internal state of the scheduler - Stored in a `Mutex` for thread safety

Add schedule_slot (WIP)

efec692

Implement schedule_slot (WIP)

55a2030

Ignore dead_code during development

ef1e932

Cleanup TODOs

6015466

Resolve Attester duties

0ccdd1e

Resolve Proposer duties

3f70aaf

Resolve Sync Committee duties

624cf49

Add get_duty_definition

b832067

- TODO's regarding concurrent access while resolving duties

Add handle_chain_reorg

0b9bffc

Cleanup parsing errors

5ecb4d1

- Try to reuse existing infrastructure

Port to Actor model (WIP)

28d0b85

Inline inner

601d8a9

Remove old Scheduler code

6117781

- Preserve only Actor API

Fix epoch trimmed logic

6f45227

Use saturating_sub to prevent underflow

012522f

Check for resolved then trimmed

8a6007a

Allow early shutdown when waiting for start

a2e8f68

Correctly wait until the slot offset

759628c

Add guard for lagging subscribers

f476267

Add metrics support

76510f3

Add original Charon timeout logic

e6158bc

Add Default to builder

a1020a5

emlautarom1 added 4 commits May 29, 2026 19:54

Fix clippy lints

0993ff9

Cleanup

9606d1a

- Add docs - Simplify error messages

Add test suite

8043658

- Fix edge case when trimming small epochs

Merge remote-tracking branch 'origin/main' into emlautarom1/core-scheduler

2e8fe9a

emlautarom1 requested review from iamquang95, mskrzypkows, therustmonk and varex83 June 1, 2026 16:45

emlautarom1 added 5 commits June 1, 2026 13:45

Delete planning artifact

f19dc97

Add missing docs

aa16bc6

Remove inner fields

04acd46

- Not needed in practice, and they are large structs that would be expensive to clone.

Rename definitions

36fc4bb

Formatting

7257c8e

varex83 reviewed Jun 2, 2026


emlautarom1 added 2 commits June 2, 2026 15:08

Revert anyhow change

b82ffb8

Support async callbacks

a5b8810

emlautarom1 wants to merge 41 commits into main from emlautarom1/core-scheduler
