Skip to content

Fix intermittent CI failures in daily test runs#415

Open
Naviabheeman wants to merge 2 commits into
chaintope:masterfrom
Naviabheeman:FixDailyTest
Open

Fix intermittent CI failures in daily test runs#415
Naviabheeman wants to merge 2 commits into
chaintope:masterfrom
Naviabheeman:FixDailyTest

Conversation

@Naviabheeman

Copy link
Copy Markdown
Contributor

AddressCI failures in 759, 760, 787 upto 790

  • fees.cpp: a longer confirmation target must never yield a higher fee estimate than a shorter one
  • test_node.py stop_node: catch ConnectionRefusedError and JSONRPCException (HTTP 503) so a mid-shutdown node does not fail the test
  • test_node.py assert_start_raises_init_error: kill() and wait() the process on timeout/exception so it releases the data-directory lock before the next start attempt
  • util_time_GetTime: replace BOOST_CHECK_EQUAL with a <=1-second window check to tolerate second boundaries between sequential clock reads
  • lint-all.sh: add lint-lock-ordering.py to the lint runner

@azuchi azuchi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of PR #415 — Fix intermittent CI failures in daily test runs

Verified scope: head 703c6d57a, base master (54a27e153, up-to-date). 2 commits, +432 / -6 across 5 files.

CI failure investigation: I downloaded the failure logs for the 5 cited runs to verify what each fix actually addresses:

Run Failing test Root cause
759 feature_fee_estimation.py --scheme SCHNORR Estimated fee (0.000565) larger than last fee (0.000478) — monotonicity violation in estimateSmartFee
760 wallet_multiwallet.py http.client.RemoteDisconnected during encryptwallet (RPC triggers internal shutdown, response never sent)
787 wallet_multiwallet.py --usecli Stderr mismatch — prior tapyrusd held the data-directory lock on the next start attempt
789 / 790 wallet_multiwallet.py JSONRPCException -342: non-JSON HTTP response '503 Service Unavailable' — 503 from mid-shutdown HTTP server

✅ Strongly endorsed

test_node.py shutdown robustness — directly addresses 4 of 5 failures

Each catch precisely matches one observed failure mode:

  • ConnectionRefusedError → run 760 (the eventual error after RemoteDisconnected during encryptwallet)
  • JSONRPCException → runs 789 / 790 (503 from mid-shutdown HTTP server)
  • process.kill() + process.wait() on init-error → run 787 (data-dir lock not released before next start)

Good, focused robustness work.

util_tests.cpp::util_time_GetTime — clean fix for a real race

The original BOOST_CHECK_EQUAL(ntime, GetTimeMillis()/1000) was racing the second boundary between three sequential clock reads. The new (ms_secs >= ntime && ms_secs <= ntime + 1) window is the correct shape (monotonic, ≤1 s skew).

lint-lock-ordering.py — high-value addition

Lint script verified empirically:

  • No false positives on current master (Lock ordering lint: OK (no violations found))
  • Successfully detects the v0.7.2 regression when net_processing.cpp is reverted to the pre-fix state:
    FORBIDDEN: cs_inventory → cs_main  (violates canonical hierarchy)
      Found at: src/net_processing.cpp:1394
    
  • Detected canonical orderings look sensible: cs_main → cs_wallet (95 sites), cs_main → mempool.cs (6 sites), cs_inventory → cs_filter (3 sites) etc.

This is exactly the safety net that would have caught the v0.7.2 lock-order bug before merge.


🚨 Concerns — must address before merge

B1. fees.cpp monotonicity enforcement diverges from upstream and is O(N²)

The added block in estimateSmartFee:

if (confTarget > 2) {
    CFeeRate prevRate = estimateSmartFee(confTarget - 1, nullptr, conservative);
    if (prevRate.GetFeePerK() > 0 && llround(median) > prevRate.GetFeePerK()) {
        median = static_cast<double>(prevRate.GetFeePerK());
    }
}

Performance: confTarget may be up to 1008 (one week — see fees.h:154). The recursion walks confTarget → 2, so a worst-case call to estimateSmartFee(1008) recurses ~1006 times, each doing the full estimator computation (3× estimateCombinedFee + 1× estimateConservativeFee). Asymptotic cost goes from O(M) to O(N · M). For estimatesmartfee RPC with high targets this is observable latency.

Upstream comparison: Bitcoin Core's src/policy/fees/block_policy_estimator.cpp::estimateSmartFee (bitcoin/bitcoin@master) does not have this recursive enforcement. Instead, upstream relies on max(target/2 est, target est, 2*target est) plus estimateCombinedFee(...,true,...) which checks shorter time horizons inline. Crucially, upstream explicitly documents this as a known limitation:

* Note: In certain rare edge cases, monotonically increasing estimates may
* not be guaranteed. Specifically, given two targets N and M, where M > N,
* if a sub-estimate for target N fails to return a valid fee rate, while
* target M has valid fee rate for that sub-estimate, target M may result
* in a higher fee rate estimate than target N.
*
* See: https://github.com/bitcoin/bitcoin/issues/11800#issuecomment-349697807

So run 759's monotonicity violation is the same rare edge case upstream consciously chose not to fix in code. Adding a divergent fix here increases the maintenance burden of future upstream merges and pays O(N²) cost for what upstream treats as expected behavior.

Alternative — match upstream test instead: Tapyrus's check_estimates (test/functional/feature_fee_estimation.py:148) initialises last_feerate = float(max(fees_seen)), while upstream's check_smart_estimates uses:

mempoolMinFee = node.getmempoolinfo()["mempoolminfee"]
minRelaytxFee = node.getmempoolinfo()["minrelaytxfee"]
feerate_ceiling = max(max(fees_seen), float(mempoolMinFee), float(minRelaytxFee))
last_feerate = feerate_ceiling

This aligns Tapyrus with upstream's tolerance for the documented edge case without paying O(N²) per RPC call. (It does not eliminate flakiness for iterations 2+, but neither does the recursive enforcement when an intermediate target legitimately returns a higher value — the recursion only clips the current estimate.)

Suggested change: drop the fees.cpp recursion, mirror upstream's check_smart_estimates ceiling logic in the Python test. Or, if code-level enforcement is genuinely desired, implement it via memoisation / iteration rather than recursion.

B2. JSONRPCException catch in stop_node is broader than the documented case

except (subprocess.CalledProcessError, JSONRPCException) as e:
    self.log.debug("Stop command failed, node may already be stopping: %s", e)

The PR description says this handles "HTTP 503", but JSONRPCException matches any RPC error — authentication failures, parameter errors, server errors, etc. — and silently downgrades all of them to a debug-level log. A real shutdown bug that surfaces via a non-503 error would be masked.

Suggested: narrow the catch by inspecting e.error.get('code') for -342 (the specific 503 case observed in runs 789/790), or check '503' in str(e). Re-raise otherwise.


🔍 Minor observations (not blocking)

O1. lint-lock-ordering.py function-body extractor is naive

iter_function_bodies treats every top-level {...} block as a single "function body." In C++ source, namespace foo { ... } and class Foo { ... }; are top-level brace blocks, so AssertLockHeld in one method gets propagated to LOCK calls in other methods of the same namespace/class.

Reproduction:

namespace foo {
    void fn1() { AssertLockHeld(cs_a); }   // fn1 only
    void fn2() { LOCK(cs_b); }              // no precondition
}

Script reports cs_a → cs_b (false ordering).

Current impact: zero, because the inferred orderings don't trigger any FORBIDDEN_PAIRS entries on current master. But this is fragile — a future code change could happen to produce a false-positive violation that blocks CI for no real reason.

Fix (future): use a proper parser (libclang, tree-sitter) or restrict AssertLockHeld scope to the innermost enclosing function (track function start via signature regex).

O2. lint-lock-ordering.py doesn't strip string literals

const char* msg = "}";

would corrupt the brace depth counter. Rare in practice, but a clean implementation should tokenise.

O3. lint-lock-ordering.py TRY_LOCK handling is inconsistent across passes

Pass 2 (AssertLockHeld + LOCK) records TRY_LOCK as an acquisition edge (line 225–230), but Pass 3 (nested-brace) excludes TRY_LOCK from depth_locks (line 257). When TRY_LOCK succeeds, the held lock at outer depth is still followed by TRY_LOCK acquisition, which is a real ordering. Pass 3 should record it too.

O4. CannotSendRequest log level downgraded silently

Change from log.exception (full stack trace + ERROR level) to log.debug (silent unless -loglevel=debug). The comment justifies it ("already stopping"), but this also silences genuine CannotSendRequest bugs (e.g. socket exhaustion, race conditions). Consider log.info as a middle ground.


Recommended actions

Required before merge:

  1. B1: drop fees.cpp recursion; mirror upstream's feerate_ceiling pattern in check_estimates instead (or document why Tapyrus deliberately diverges from upstream)
  2. B2: narrow JSONRPCException catch to only the 503 case (e.error.get('code') == -342)

Recommended:
3. O1: at least add a comment to iter_function_bodies flagging the namespace/class FP risk so a future maintainer knows the boundary
4. O3: align TRY_LOCK handling between Pass 2 and Pass 3

Defer: O2 (string literal handling), O4 (log level choice)


Verification: lint script tested against current master (clean) and against 00bc440b9~1-state of net_processing.cpp (correctly reports the cs_inventory → cs_main violation at line 1394). Upstream block_policy_estimator.cpp and feature_fee_estimation.py fetched from bitcoin/bitcoin@master and compared line-by-line. CI failure logs downloaded via gh run view --log-failed --job <id> for runs 759, 760, 787, 790.

Naviabheeman and others added 2 commits June 2, 2026 22:16
- fees.cpp: a longer confirmation target must never yield a higher fee
  estimate than a shorter one
- test_node.py stop_node: catch ConnectionRefusedError and JSONRPCException
  (HTTP 503) so a mid-shutdown node does not fail the test
- test_node.py assert_start_raises_init_error: kill() and wait() the
  process on timeout/exception so it releases the data-directory lock
  before the next start attempt
- util_time_GetTime: replace BOOST_CHECK_EQUAL with a <=1-second window
  check to tolerate second boundaries between sequential clock reads
- lint-all.sh: add lint-lock-ordering.py to the lint runner

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Naviabheeman Naviabheeman force-pushed the FixDailyTest branch 2 times, most recently from 50ade58 to de65433 Compare June 3, 2026 10:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants