Fix intermittent CI failures in daily test runs#415
Conversation
ca73e60 to
703c6d5
Compare
azuchi
left a comment
There was a problem hiding this comment.
Review of PR #415 — Fix intermittent CI failures in daily test runs
Verified scope: head 703c6d57a, base master (54a27e153, up-to-date). 2 commits, +432 / -6 across 5 files.
CI failure investigation: I downloaded the failure logs for the 5 cited runs to verify what each fix actually addresses:
| Run | Failing test | Root cause |
|---|---|---|
| 759 | feature_fee_estimation.py --scheme SCHNORR |
Estimated fee (0.000565) larger than last fee (0.000478) — monotonicity violation in estimateSmartFee |
| 760 | wallet_multiwallet.py |
http.client.RemoteDisconnected during encryptwallet (RPC triggers internal shutdown, response never sent) |
| 787 | wallet_multiwallet.py --usecli |
Stderr mismatch — prior tapyrusd held the data-directory lock on the next start attempt |
| 789 / 790 | wallet_multiwallet.py |
JSONRPCException -342: non-JSON HTTP response '503 Service Unavailable' — 503 from mid-shutdown HTTP server |
✅ Strongly endorsed
test_node.py shutdown robustness — directly addresses 4 of 5 failures
Each catch precisely matches one observed failure mode:
ConnectionRefusedError→ run 760 (the eventual error afterRemoteDisconnectedduringencryptwallet)JSONRPCException→ runs 789 / 790 (503 from mid-shutdown HTTP server)process.kill()+process.wait()on init-error → run 787 (data-dir lock not released before next start)
Good, focused robustness work.
util_tests.cpp::util_time_GetTime — clean fix for a real race
The original BOOST_CHECK_EQUAL(ntime, GetTimeMillis()/1000) was racing the second boundary between three sequential clock reads. The new (ms_secs >= ntime && ms_secs <= ntime + 1) window is the correct shape (monotonic, ≤1 s skew).
lint-lock-ordering.py — high-value addition
Lint script verified empirically:
- No false positives on current master (
Lock ordering lint: OK (no violations found)) - Successfully detects the v0.7.2 regression when
net_processing.cppis reverted to the pre-fix state:FORBIDDEN: cs_inventory → cs_main (violates canonical hierarchy) Found at: src/net_processing.cpp:1394 - Detected canonical orderings look sensible:
cs_main → cs_wallet (95 sites),cs_main → mempool.cs (6 sites),cs_inventory → cs_filter (3 sites)etc.
This is exactly the safety net that would have caught the v0.7.2 lock-order bug before merge.
🚨 Concerns — must address before merge
B1. fees.cpp monotonicity enforcement diverges from upstream and is O(N²)
The added block in estimateSmartFee:
if (confTarget > 2) {
CFeeRate prevRate = estimateSmartFee(confTarget - 1, nullptr, conservative);
if (prevRate.GetFeePerK() > 0 && llround(median) > prevRate.GetFeePerK()) {
median = static_cast<double>(prevRate.GetFeePerK());
}
}Performance: confTarget may be up to 1008 (one week — see fees.h:154). The recursion walks confTarget → 2, so a worst-case call to estimateSmartFee(1008) recurses ~1006 times, each doing the full estimator computation (3× estimateCombinedFee + 1× estimateConservativeFee). Asymptotic cost goes from O(M) to O(N · M). For estimatesmartfee RPC with high targets this is observable latency.
Upstream comparison: Bitcoin Core's src/policy/fees/block_policy_estimator.cpp::estimateSmartFee (bitcoin/bitcoin@master) does not have this recursive enforcement. Instead, upstream relies on max(target/2 est, target est, 2*target est) plus estimateCombinedFee(...,true,...) which checks shorter time horizons inline. Crucially, upstream explicitly documents this as a known limitation:
* Note: In certain rare edge cases, monotonically increasing estimates may
* not be guaranteed. Specifically, given two targets N and M, where M > N,
* if a sub-estimate for target N fails to return a valid fee rate, while
* target M has valid fee rate for that sub-estimate, target M may result
* in a higher fee rate estimate than target N.
*
* See: https://github.com/bitcoin/bitcoin/issues/11800#issuecomment-349697807So run 759's monotonicity violation is the same rare edge case upstream consciously chose not to fix in code. Adding a divergent fix here increases the maintenance burden of future upstream merges and pays O(N²) cost for what upstream treats as expected behavior.
Alternative — match upstream test instead: Tapyrus's check_estimates (test/functional/feature_fee_estimation.py:148) initialises last_feerate = float(max(fees_seen)), while upstream's check_smart_estimates uses:
mempoolMinFee = node.getmempoolinfo()["mempoolminfee"]
minRelaytxFee = node.getmempoolinfo()["minrelaytxfee"]
feerate_ceiling = max(max(fees_seen), float(mempoolMinFee), float(minRelaytxFee))
last_feerate = feerate_ceilingThis aligns Tapyrus with upstream's tolerance for the documented edge case without paying O(N²) per RPC call. (It does not eliminate flakiness for iterations 2+, but neither does the recursive enforcement when an intermediate target legitimately returns a higher value — the recursion only clips the current estimate.)
Suggested change: drop the fees.cpp recursion, mirror upstream's check_smart_estimates ceiling logic in the Python test. Or, if code-level enforcement is genuinely desired, implement it via memoisation / iteration rather than recursion.
B2. JSONRPCException catch in stop_node is broader than the documented case
except (subprocess.CalledProcessError, JSONRPCException) as e:
self.log.debug("Stop command failed, node may already be stopping: %s", e)The PR description says this handles "HTTP 503", but JSONRPCException matches any RPC error — authentication failures, parameter errors, server errors, etc. — and silently downgrades all of them to a debug-level log. A real shutdown bug that surfaces via a non-503 error would be masked.
Suggested: narrow the catch by inspecting e.error.get('code') for -342 (the specific 503 case observed in runs 789/790), or check '503' in str(e). Re-raise otherwise.
🔍 Minor observations (not blocking)
O1. lint-lock-ordering.py function-body extractor is naive
iter_function_bodies treats every top-level {...} block as a single "function body." In C++ source, namespace foo { ... } and class Foo { ... }; are top-level brace blocks, so AssertLockHeld in one method gets propagated to LOCK calls in other methods of the same namespace/class.
Reproduction:
namespace foo {
void fn1() { AssertLockHeld(cs_a); } // fn1 only
void fn2() { LOCK(cs_b); } // no precondition
}Script reports cs_a → cs_b (false ordering).
Current impact: zero, because the inferred orderings don't trigger any FORBIDDEN_PAIRS entries on current master. But this is fragile — a future code change could happen to produce a false-positive violation that blocks CI for no real reason.
Fix (future): use a proper parser (libclang, tree-sitter) or restrict AssertLockHeld scope to the innermost enclosing function (track function start via signature regex).
O2. lint-lock-ordering.py doesn't strip string literals
const char* msg = "}";would corrupt the brace depth counter. Rare in practice, but a clean implementation should tokenise.
O3. lint-lock-ordering.py TRY_LOCK handling is inconsistent across passes
Pass 2 (AssertLockHeld + LOCK) records TRY_LOCK as an acquisition edge (line 225–230), but Pass 3 (nested-brace) excludes TRY_LOCK from depth_locks (line 257). When TRY_LOCK succeeds, the held lock at outer depth is still followed by TRY_LOCK acquisition, which is a real ordering. Pass 3 should record it too.
O4. CannotSendRequest log level downgraded silently
Change from log.exception (full stack trace + ERROR level) to log.debug (silent unless -loglevel=debug). The comment justifies it ("already stopping"), but this also silences genuine CannotSendRequest bugs (e.g. socket exhaustion, race conditions). Consider log.info as a middle ground.
Recommended actions
Required before merge:
- B1: drop
fees.cpprecursion; mirror upstream'sfeerate_ceilingpattern incheck_estimatesinstead (or document why Tapyrus deliberately diverges from upstream) - B2: narrow
JSONRPCExceptioncatch to only the 503 case (e.error.get('code') == -342)
Recommended:
3. O1: at least add a comment to iter_function_bodies flagging the namespace/class FP risk so a future maintainer knows the boundary
4. O3: align TRY_LOCK handling between Pass 2 and Pass 3
Defer: O2 (string literal handling), O4 (log level choice)
Verification: lint script tested against current master (clean) and against 00bc440b9~1-state of net_processing.cpp (correctly reports the cs_inventory → cs_main violation at line 1394). Upstream block_policy_estimator.cpp and feature_fee_estimation.py fetched from bitcoin/bitcoin@master and compared line-by-line. CI failure logs downloaded via gh run view --log-failed --job <id> for runs 759, 760, 787, 790.
- fees.cpp: a longer confirmation target must never yield a higher fee estimate than a shorter one - test_node.py stop_node: catch ConnectionRefusedError and JSONRPCException (HTTP 503) so a mid-shutdown node does not fail the test - test_node.py assert_start_raises_init_error: kill() and wait() the process on timeout/exception so it releases the data-directory lock before the next start attempt - util_time_GetTime: replace BOOST_CHECK_EQUAL with a <=1-second window check to tolerate second boundaries between sequential clock reads - lint-all.sh: add lint-lock-ordering.py to the lint runner Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
50ade58 to
de65433
Compare
AddressCI failures in 759, 760, 787 upto 790