Skip to content

Multiple PR landed with support for chimeric; RAW and the .d native formats#46

Merged
ypriverol merged 120 commits into
masterfrom
dev
Jun 2, 2026
Merged

Multiple PR landed with support for chimeric; RAW and the .d native formats#46
ypriverol merged 120 commits into
masterfrom
dev

Conversation

@ypriverol
Copy link
Copy Markdown
Member

@ypriverol ypriverol commented Jun 1, 2026

Summary by CodeRabbit

  • New Features

    • Native support for Thermo .raw spectra files (bundled .NET runtime included in releases)
    • Native support for Bruker timsTOF .d directories via pure-Rust implementation
    • Chimeric precursor cascade mode for improved co-isolated peptide detection
    • MS1 spectrum capture and linking for mzML input files
  • Documentation

    • New benchmark comparison across multiple search engines on vendor-native formats
    • Expanded CLI documentation for file-format support and parameter defaults
    • Updated feature documentation for native input handling
  • Changes

    • Default precursor charge range updated to 2–5
    • Percolator output now includes additional precursor isotope and ranking feature columns

ypriverol added 30 commits May 28, 2026 15:02
Phased design to recover co-fragmented peptide IDs via the DDA+ approach
(full-isolation-window search FIRST, then MS1 targeted-XIC isotope
refinement, then greedy shared-fragment rescoring). Maps each DDA+ step to
concrete msgf-rust components:

 - Phase 1: parse isolation-window width + widen candidate enumeration to
   the full window + emit top-N distinct-peptide PSMs/scan. Behind
   --chimeric (default off => bit-identical). Reuses the existing
   bucket_index range scan + per-charge multi-queue machinery.
 - Phase 2: MS1 targeted-XIC + isotope KL-divergence as an ADDITIVE PIN
   feature (audit-safe), or an external handoff (lean-Rust option).
 - Phase 3: greedy shared-fragment rescoring (additive-first; bench-gated).
 - Cross-cutting: candidate explosion ties into the planned fragment-index
   speed enabler; measure Phase-1 wall first before building it.

Motivated by the PR #40 finding that scoring parity is exhausted, so ID
gains now require a new capability rather than parity fixes.
…arch + multi-PSM)

Bite-sized TDD tasks for Phase 1 (option A, existing bucket scan):
 1. Spectrum isolation-window offset fields
 2. mzML <isolationWindow> parsing (MS:1000828/829)
 3. --chimeric + --isolation-halfwidth CLI/SearchParams wiring
 4. widen candidate enumeration to the isolation window when chimeric
 5. retain + emit top-N distinct-peptide PSMs/scan + SpecId uniqueness
 6. workspace gate + VM bench decision point (off=bit-identical, on=PXD/TMT measure)

Every task keeps the chimeric=false path on existing code, guarded by the
bit-identical PIN golden test. Task 6 is the ship/defer decision (incl. whether
the fragment-index enabler becomes a prerequisite if wall is unacceptable).
Add `isolation_lower_offset: Option<f64>` and `isolation_upper_offset: Option<f64>` to `Spectrum`, initialized to `None` in every constructor/literal across the workspace. Bit-identical to baseline (no scoring path touched).
Add `chimeric: bool` and `chimeric_isolation_halfwidth_da: f64` fields to
`SearchParams` (defaulting to false/1.5) and the matching `--chimeric` /
`--isolation-halfwidth` CLI flags; wire them into SearchParams assembly.
When `--chimeric` is on and `--top-n` is at its default of 1, top-N is
automatically raised to 5 with an eprintln notice. No search behavior
changes; default off path is bit-identical.
…eric

Extract candidate_nominal_bounds(spec, z, params, shift_ppm). When
params.chimeric, derive the candidate nominal-mass window from the full
isolation window (selected m/z ± isolation offsets, or the configured
half-width fallback) converted to neutral nominal mass per charge, with
isotope-error widening only (the window already exceeds precursor tol).
When chimeric is off, the derivation is byte-identical to the original
inline code.

Test candidate_nominal_bounds_chimeric_spans_isolation_window asserts the
chimeric window strictly contains the standard one and that an off-precursor
co-isolated mass is reachable only under chimeric. Bit-identical PIN golden
gate green (chimeric defaults off); clippy clean.
… loop

Branch the per-candidate iso-offset match: under --chimeric use
matches_isolation_window (accept anywhere in the isolation window, hoisted
window bounds), else the existing matches_precursor tight check. Bit-identical
PIN golden gate green (chimeric off path unchanged). Also fixes a clippy
field-reassign-default in the isolation-window unit test.
A chimeric scan emits multiple distinct-peptide PSMs that can share a SpecE
rank (iter_ranked increments rank only on a distinct spec_e_value), which
would collide on the `spec_scan_rank` SpecId. Append the per-row emission
index under --chimeric only; the standard SpecId format is unchanged so the
bit-identical PIN golden still holds. Test asserts two same-SpecE
co-fragmented PSMs get distinct SpecIds.
…nement)

Five bite-sized tasks to add the MS1 isotope-envelope check that suppresses
the Phase-1 FDR inflation:
 1. averagine theoretical isotope envelope (new model::isotope)
 2. optional MS1 capture + MS2->MS1 linkage in the mzML reader
 3. observed precursor isotope envelope + KL-divergence + SNR features
 4. additive PrecursorIsotopeKL + PrecursorSNR PIN columns
 5. VM bench decision gate (does the KL feature deflate Astral's +94%? if not,
    Phase 3 shared-fragment rescoring is required)

v1 uses the single linked MS1 (apex); multi-scan XIC correlation deferred to
Phase 2b. Additive PIN columns + --chimeric-off bit-identical throughout.
Add `averagine_isotope_envelope(mass, n_isotopes) -> Vec<f64>` in a new
`crates/model/src/isotope.rs` module. Uses the averagine + Poisson model
(lambda ≈ mass * 4.76e-4 from 13C natural abundance) to compute normalized
precursor isotope peak intensities. Registered as `pub mod isotope` in lib.rs.
Three unit tests cover normalization, mass-dependent +1/+0 growth, and edge
cases (n=0, n=1). New module is unused by default; zero change to existing
pipeline output (bit-identical gate: ok).
…nt; fragment competition is the real missing discriminator
…ew (low overlap tentatively challenges fragment-theft premise)
… 38% fracmin>=0.5 vs BSA 13%)

Decisive Astral chimeric run (MSGF_CHIMERIC_OVERLAP=1, n=121,423 co-emitting
scans): mean fracmin 0.367, 38% >=0.5, bimodal with a near-total-overlap mode at
[0.9-1.0)=11.4%. BSA low-overlap pattern does NOT hold on real co-isolated data.
Fragment-theft premise validated for a substantial fraction -> Phase 3
(shared-fragment competition) is the relevant fix for those scans; ~28%
coincidental tail still needs per-scan FDR.
…ce filter spec

Approach A (filter + additive PIN columns, no score modification). Greedy
peak-claiming (rank-1 first, protected) drives a hard pre-Percolator filter on
UniqueMatchedIons (swept knob, decoy-symmetric) + 3 additive PIN columns.
Grounded in the 2026-05-29 Astral overlap result (theft confirmed, bimodal).
DoD: Astral canary returns to ~36.7k AND PXD beats Java, --chimeric off
bit-identical, wall within 3%. Research toward trustworthy chimeric; not a merge.
…l SpecEValue + Percolator on all features)

Drop the hand-tuned --chimeric-min-unique-ions filter. Discrimination is the
score itself at two layers: (1) in-engine residual SpecEValue re-score on
uniquely-claimed peaks deflates theft/coincidental rank>=2 peptides; (2)
Percolator on the full feature vector (re-scored RawScore/lnSpecEValue + additive
unique-evidence columns). Rule-2-safe: off bit-identical, rank-1 untouched, only
chimeric extra rows change.
… SpecEValue re-score

Two-layer discrimination, no parameter (per design 93178d7):
- Pure competition core (shared_fragment.rs): greedy peak-claiming most-confident
  first; per rank>=2 peptide compute unique-evidence (UniqueMatchedIons,
  UniqueExplainedFraction, SharedFracClaimed). 7 unit tests.
- match_engine hook (guarded on --chimeric): walk emitted PSMs best-first, claim
  peaks, and re-score each rank>=2 PSM's RawScore + GF SpecEValue on the residual
  (unclaimed) spectrum via rescore_residual_spec_e. A theft/coincidental peptide,
  stripped of stolen peaks, gets a worse SpecEValue and drops out of the FDR set
  on its own; no hard filter, no threshold. Decoy-symmetric.
- Additive PIN columns gated on --chimeric (off path byte-identical). Header
  gating test + existing schema-parity test green.

Smoke (BSA test.mgf): off has no Phase-3 cols, on emits 3; 55 rows shared>0,
37 fully-stolen rows deflated to negative residual RawScore / lnSpecE ~0.
Validation gate (Astral canary -> ~36.7k, PXD>Java) pending VM bench.
…ary (+111%, not deflated)

off vs on @1% FDR: PXD 14,808->18,306, Astral 36,715->77,444 (canary should be
~flat), TMT 9,605->9,362. Decoy fraction 0.83 on-runs (structural inflation
intact). Root cause: a per-PSM score change deflates spurious targets AND decoys
symmetrically -> q-value curve unchanged -> aggregate 1% count doesn't move. The
broken part is the PSM-level FDR model (Phase-2 requirement #2), not the
per-peptide score. Phase 3 refuted as a gate-clearer; impl kept as validated
negative result; --chimeric off byte-identical; nothing ships.
Approach A: separate target-decoy FDR for rank-1 vs rank>=2 strata (split PIN by
rank -> 2 Percolator runs -> sum @1%). Tests the structural fix after Phase 3
(per-PSM rescore) was refuted. Measure on both non-rescored Phase-1 PIN (model
alone) and Phase-3 rescored PIN (composition). Canary: Astral total ~36.7k; win:
PXD>14,974. No Rust production change (test-only NO_RESCORE env gate).
ypriverol added 12 commits June 1, 2026 12:34
… 0.7)

Wires thermorawfilereader as an optional dependency behind a non-default 'thermo'
feature. Default (pure-Rust) builds are unaffected — the Thermo dep tree (and its
.NET-bundle build deps) only compile when the feature is enabled; release CI turns
it on. The .NET RawFileReader assemblies are vendored inside dotnetrawfilereader-sys
(verified: lib/librawfilereader.*), so the build needs no .NET SDK; only opening a
.raw at runtime needs the .NET 8 runtime. thermo.rs reader + dispatch land next.
…spatch)

ThermoRawReader (crates/input/src/thermo.rs, feature 'thermo') wraps the official
RawFileReader via thermorawfilereader, yielding the same Spectrum model as the
mzML/MGF readers (native_id title, precursor m/z+charge+intensity, RT, centroided
peaks; MS2 by default, with_all_ms_levels for chimeric later). Iterates by index
(RawFileReader::get) so it owns the handle without a self-referential borrow.

CLI dispatch: .raw auto-detected by extension -> ThermoRawReader in the non-chimeric
streaming reader (feature-gated; a clear error when built without 'thermo'). is_mgf
now distinguishes MGF from .raw for the MGF-only ms-level warning and TSV title
synthesis; the MS-level filter message covers .raw too. --precursor-cal falls back to
off with a warning on .raw for now (calibration .raw support is a follow-up); --chimeric
on .raw degrades to a normal search (MS1 streaming is M2).

Default (pure-Rust) build unaffected and green; the thermo path builds on rustc>=1.88
with the .NET 8 runtime at runtime. Validated on real data: reading the 2.4GB Astral
.raw yields 127,225 spectra (3,686 MS1 + 123,539 MS2), an exact match to the mzML's
spectrumList count.
…lf-contained release

M2 — chimeric cascade on .raw:
- thermo.rs maps the precursor isolation window (RawFileReader lower/target/upper
  m/z -> model offsets) so co-isolation detection works on .raw.
- ThermoRawReader::read_with_ms1_chunked mirrors MzMLReader's: streams MS2 in
  bounded chunks each paired with an Ms1Link (per-chunk MS1 peaks + per-MS2 link
  to the preceding MS1, carried across chunk boundaries). Lets --chimeric consume
  .raw exactly like mzML.
- CLI: --chimeric now enabled for mzML OR .raw (MGF still degrades with a warning);
  the chimeric parser thread dispatches MzMLReader vs ThermoRawReader.

M3 — tests + docs:
- crates/input/tests/thermo_raw.rs: feature-gated integration tests (read MS2 +
  chunked MS1-link consistency), skipped unless MSGF_TEST_RAW points at a .raw
  (proprietary format, no fixture committed).
- README 'Reading Thermo .raw files' section (self-contained release vs
  build-from-source + .NET 8 install, auto-dispatch, --chimeric, container snippet);
  CLI table + --chimeric updated; crates/input/THERMO_LICENSE.txt (Thermo license).

Release — self-contained cross-platform packaging:
- main() auto-points DOTNET_ROOT at a bundled <exe_dir>/dotnet runtime when
  present (else honors an existing DOTNET_ROOT / system install).
- release.yml: linux-x64 / macOS x64+arm64 / windows-x64 built --features thermo
  (RUSTUP_TOOLCHAIN=stable overrides the repo 1.87 pin for thermo's >=1.88 deps)
  and bundle the matching .NET 8 runtime in dotnet/ -> users install nothing.
  linux-arm64 stays a plain no-thermo cross build.

Default (pure-Rust) build unaffected and green; M2 thermo build verified on the VM
(rustc 1.95). Chimeric .raw-vs-mzml parity validating on Astral.
…ywhere

The chimeric .raw reader (read_with_ms1_chunked) now explicitly skips MS3+ scans
with a visible count — only MS2 is searched and only MS1 links it, so a TMT
SPS-MS3 acquisition can never have its MS3 reporter-quant scans scored. Both the
chimeric and non-chimeric paths already default to MS2 (ms_level default = 2) and
honor --ms-level as an explicit override (mzML + .raw); the chimeric cascade is
always MS2 by construction. Clarified the --ms-level docs + README accordingly.

Verified on the VM (5000-spectrum cap): min-peaks filters .raw (45,996 vs 46,011),
runs report 'only MS2 spectra', and .raw matches mzML within noise (Δ24/46,011).
Charge --min/--max only gates charge-UNKNOWN spectra (determined charges are
searched as-is) — identical behavior for .raw and mzML (same engine path).
Charge-unknown spectra now try precursor charges 2..=5 by default (determined
charges from the spectrum are still searched as-is). Widens recall for higher
charge states common on modern instruments. Applies to all input formats.

Side effects (handled): the default PIN gains charge4/charge5 one-hot columns
(42 cols on the BSA fixture); the precursor_cal_off golden is regenerated from
the new default. The Java-parity schema test sets charge_range=2..=3 explicitly
and is unaffected. README charge defaults + column note updated. All tests green.
… auto-route, chimeric MS2-only)

Resolves the 4 issues from the code review (PR #44) + Codex adversarial review:

1. Precursor-less MS2 scans are now SKIPPED, not searched. convert() returns
   Option<Spectrum> and drops any MS2 whose precursor record is missing or has
   m/z <= 0 (both the iterator and the chunked chimeric reader honor it),
   matching the mzML reader instead of forwarding precursor_mz=0.0.

2. extract_peaks() now guarantees m/z-ascending order (sorts only when an
   inversion is present), upholding the Spectrum.peaks contract that
   coisolation's partition_point binary search relies on.

3. .raw param auto-routing: detect_activation_instrument() peeks the .raw's
   vendor activation (DissociationMethod) + analyzer (MassAnalyzer) and routes
   the bundled .param accordingly, so a CID/ETD/ion-trap .raw is no longer
   silently scored with HCD/QExactive. convert() also stamps each Spectrum's
   activation_method. Validated on the Astral .raw: auto-detects HCD + QExactive.

4. --chimeric is MS2-only on EVERY format: the mzML chimeric reader is hardcoded
   to MS2 (was with_ms_level_range(ms_level,...), which let --ms-level 3 widen to
   1..=3 and admit MS3); a non-2 --ms-level under --chimeric now warns and is
   ignored. Closes the TMT SPS-MS3 gap on the mzML path too.

Plus: mapping unit tests (map_dissociation/map_analyzer, no .NET needed) and
updated the two stale comments the review flagged. Default build + thermo build
(rustc 1.95) + all tests green.
Native Thermo .raw input (M1–M3) + self-contained cross-platform release
Add an optional, off-by-default `timstof` cargo feature that reads native
Bruker timsTOF .d (DDA-PASEF) data via the pure-Rust timsrust crate (the same
reader Sage uses) and produces the engine's model::Spectrum, so the search path
is format-agnostic. A .d is a directory (TDF SQLite + binary blob) read natively
with no vendor runtime and nothing to bundle.

TimsTofReader opens a .d by path, iterates the centroided MS2 spectra by index
(timsrust SpectrumReader::new + len/get), and maps each to a Spectrum: 1-based
scan number, precursor m/z/charge/intensity, RT in seconds, isolation
center+width to symmetric offsets, and ascending-sorted peaks. Precursor-less or
non-positive-m/z spectra are skipped, mirroring the mzML/Thermo readers. The dep
and all .d code sit behind cfg(feature = timstof); default builds read mzML/MGF
only and never pull in timsrust.
Detect a .d input by path extension (works on the .d directory) and add is_d
alongside is_mzml; is_mgf now means 'neither mzML nor .d'. In the non-chimeric
streaming reader, route .d to TimsTofReader (feature-gated), with a clear error
when the binary is built without --features timstof.

Chimeric and precursor-calibration on .d are out of scope: the DDA reader
exposes no MS1 stream, so --chimeric on .d degrades gracefully to a normal
search (like MGF), and --precursor-cal is skipped with a warning. The TSV writer
uses its scan-based (non-MGF) path for .d since the reader emits real scan
numbers.
Add the design doc (approach, why timsrust, scope/out-of-scope, PXD072598
benchmark dataset, the not-built-locally note), a feature-gated integration test
that opens a real .d only when MSGF_TEST_D points at one (no-op otherwise so CI
stays green), and a README 'Reading Bruker timsTOF .d files' section plus the
--spectrum flag update.
Integrate the timsTOF .d input alongside the now-merged Thermo .raw feature.
Resolved the shared dispatch/feature files: both 'thermo' and 'timstof' cargo
features coexist; the binary's format dispatch chains is_mzml / is_raw / is_d /
MGF, is_mgf excludes all native formats, --chimeric covers mzML+.raw (.d degrades
like MGF), --precursor-cal skips .raw and .d, and the README lists all four input
formats. Default + --features timstof builds green.
Native Bruker timsTOF .d input (timstof feature)
@qodo-code-review
Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

Warning

Review limit reached

@ypriverol, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 19 minutes and 25 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e625ae88-25e3-4850-988a-e3e75f563945

📥 Commits

Reviewing files that changed from the base of the PR and between 059ed20 and 399bb6d.

📒 Files selected for processing (3)
  • crates/input/src/timstof.rs
  • crates/search/src/match_engine.rs
  • docs/benchmarks/2026-06-01-4engine-native-format.md
📝 Walkthrough

Walkthrough

This PR implements native Thermo .raw and Bruker timsTOF .d spectrum format support alongside an opt-in chimeric two-pass co-isolation cascade. Key additions include: native reader wrappers using vendor libraries with MS2 filtering and chunked iteration matching mzML semantics; mzML parser extensions for isolation window offset parsing and opt-in MS1 peak capture with per-MS2 linkage; isotope envelope KL/SNR matching for precursor validation; co-isolated precursor detection and targeted secondary peptide GF search on residual spectra; refactored PreparedSearch state for reusability across search passes; updated PSM data structures and PIN/TSV output to emit chimeric features and precursor overrides; CLI routing for native formats with auto-parameter detection and bundled .NET 8 runtime configuration; release workflow CI updates; and comprehensive documentation and benchmark configurations.

Changes

Chimeric cascade and native input integration

Layer / File(s) Summary
Native Thermo and timsTOF readers
crates/input/Cargo.toml, crates/input/src/thermo.rs, crates/input/src/timstof.rs, crates/input/tests/*
New reader modules wrapping thermorawfilereader and timsrust libraries with MS2-only filtering, precursor validation, centroid peak extraction, and chunked MS2 iteration with Ms1Link state matching mzML semantics. Integration tests validate spectrum streaming and linkage correctness.
Input module exposure
crates/input/src/lib.rs, crates/input/src/mgf.rs
Expose Thermo and TimsTOF readers as feature-gated public modules; update MGF reader to initialize isolation offset fields.
mzML isolation offsets and MS1 capture
crates/input/src/mzml.rs
Extend XML parser to extract isolation-window lower/upper offset cvParams; implement opt-in MS1 peak capture mode intercepting and storing MS1 spectra; introduce public Ms1Link data structure recording per-MS2 indices to most-recent preceding captured MS1; add with_ms1_capture, read_with_ms1, and read_with_ms1_chunked APIs with bounded-memory chunking and tolerant resync past parse errors.
Spectrum model isolation offsets and isotope envelope
crates/model/src/spectrum.rs, crates/model/src/isotope.rs, crates/model/src/lib.rs
Add optional isolation_lower_offset and isolation_upper_offset fields to Spectrum; implement averagine_isotope_envelope computing theoretical precursor isotope distribution via iterative Poisson with normalization.
Chimeric precursor isotope matching
crates/search/src/chimeric_features.rs
Implement precursor_isotope_match computing KL divergence and SNR-like score from observed vs theoretical precursor isotope envelopes; include sentinel handling for degenerate cases and epsilon-flooring for numerical stability.
Co-isolated precursor detection and secondary search
crates/search/src/coisolation.rs
Implement detect_coisolated to find additional precursor envelopes in isolation window excluding selected precursor; implement search_secondary performing targeted GF search on residual spectrum with matched-peak removal, precursor calibration shift handling, and deterministic tie-breaking; comprehensive unit tests covering competition and regression cases.
Search library and parameters
crates/search/Cargo.toml, crates/search/src/lib.rs, crates/search/src/search_params.rs, crates/search/src/precursor_matching.rs, crates/search/src/candidate_gen.rs, crates/search/src/mass_calibrator.rs, crates/search/src/precursor_cal.rs, crates/search/src/search_index.rs
Add chimeric and chimeric_isolation_halfwidth_da fields to SearchParams with defaults false and 1.5; add matches_isolation_window helper for precursor window matching; update test fixtures and documentation comments.
PSM data structures for chimeric
crates/search/src/psm.rs
Extend PsmFeatures with chimeric precursor isotope KL/SNR fields and rank-1-gated delta_raw_score; add PsmMatch::precursor_mz_override for secondary peptides; introduce TopNQueue::force_push for non-competitive secondary insertion.
Match engine refactoring: state reuse
crates/search/src/match_engine.rs (part 1)
Introduce PreparedParts and refactor one-time setup into PreparedSearch::prepare; add into_parts/from_parts/with_ms1_link APIs for reusable state across search passes; add ms1_link field to PreparedSearch; expose compute_spec_e_values_for_spectrum and run_pass2_coisolation at crate level.
Match engine: spectrum loop and cascade
crates/search/src/match_engine.rs (part 2)
Update run_chunk_inner to track best/second-best rank_score for delta_raw_score, add chimeric overlap diagnostic, refactor isotope handling; implement run_pass2_coisolation orchestrating co-isolated detection and secondary GF searches with force_push insertion; skip precursor envelope features on pass-1 path.
Match engine: feature computation
crates/search/src/match_engine.rs (part 3)
Update compute_psm_features to use partition-wide ion list for ion-current ratio (not b/y-only), add edge_score feature, compute and set delta_raw_score, fix top-7 error-stat units to absolute ppm, initialize chimeric precursor fields; add matched_peak_keys helper and refactor merge_unique_candidate_idxs to O(k) via FxHashSet.
PIN output format and row emission
crates/output/src/pin.rs
Extend PIN header with four additive feature columns (EdgeScore, PrecursorIsotopeKL, PrecursorSNR, DeltaRawScore); emit row index in SpecId when params.chimeric enabled; use psm.precursor_mz_override for ExpMass computation; emit DeltaRawScore only on rank-1 rows. Updated tests verify column positioning and SpecId uniqueness under chimeric.
TSV output and schema parity
crates/output/src/tsv.rs, crates/output/tests/output_pin_schema_parity.rs
Update TSV precursor column to use precursor_mz_override when present; update schema parity test to expect four Rust-only additive columns and verify positioning between matchedIonRatio and Peptide.
msgf-rust binary: CLI and integration
crates/msgf-rust/src/bin/msgf-rust.rs, crates/msgf-rust/Cargo.toml
Add thermo and timstof feature flags; update CLI defaults (--charge-max 5); add --chimeric and --isolation-halfwidth options; implement format detection for .raw/.d alongside mzML/MGF; expand parameter auto-detection to include native readers; configure bundled .NET runtime setup for Thermo; disable --precursor-cal for native formats with warning; integrate two-pass chunk processing with run_pass2_coisolation and Ms1Link tracking; change TSV dispatch to use is_mgf.
Release workflow: thermo build matrix and .NET bundling
.github/workflows/release.yml
Add per-target thermo and dotnet_arch matrix fields; conditionally compile with --features thermo timstof on thermo targets; download/stage .NET 8 runtime via dotnet-install for target RID; bundle runtime under dotnet/ and include THERMO_LICENSE.txt only on thermo builds.
GF scoring updates and test fixtures
crates/scoring/src/gf/generating_function.rs, crates/scoring/src/gf/*, crates/scoring/src/scoring/*, crates/scoring/tests/*, crates/msgf-rust/src/bin/msgf-trace.rs
Add process-global dropped-nodes diagnostic counter; update msgf-trace --print-score-dist to use rank_score instead of score for GF threshold; refine documentation comments across GF and scoring modules for rank_score vs score roles and Java parity; update test fixtures to include isolation offset fields.
User documentation
README.md, DOCS.md, .claude/CLAUDE.md
Update README with four-engine benchmark statement and CLI table; expand DOCS input formats section documenting native Thermo .raw and timsTOF .d with feature/runtime notes; update calibration and chimeric cascade documentation; clarify CLI help for updated defaults and new options; document bundled parameter auto-detection and building instructions.
Benchmarks and configuration
docs/benchmarks/*, docs/parity-analysis/*, docs/superpowers/*
Add benchmarks README and new 4-engine native-format comparison report; include harmonized MSFragger (Astral/timsTOF) and Sage (Astral/timsTOF) configuration files; remove prior parity analysis and planning documents.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • bigbio/msgf-rust#38: Aligns GF/SpecE scoring thresholds to use Java-style rank_score instead of score in match_engine and msgf-trace, directly related to this PR's scoring updates.

Poem

🐰 A cascade of isotopes glimmer bright,
Native readers now dance in the .raw light,
Two passes, one prey, co-isolation's call—
Thermo and Bruker unite through it all!

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dev

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR lands three major opt-in capabilities together: a chimeric two-pass cascade search (--chimeric), native Thermo .raw input (feature thermo + bundled .NET 8), and native Bruker timsTOF .d input (feature timstof via the pure-Rust timsrust crate). It also extends the mzML reader to capture MS1 spectra for chimeric, adds several additive Percolator features (PrecursorIsotopeKL, PrecursorSNR, DeltaRawScore), refreshes the README/CLI tables, drops a large volume of historical dev-jargon comments, and adds many design / parity-analysis notes documenting the cascade investigation.

Changes:

  • New --chimeric MS1-gated two-pass cascade (search + coisolation + averagine isotope features), new additive PIN columns, and a force_push queue API for secondary PSMs.
  • Native .raw (feature-gated, .NET 8) and .d (feature-gated, pure-Rust timsrust) input backends with extension-based dispatch and updated release CI bundling a self-contained .NET runtime.
  • Spectrum model carries isolation-window offsets; mzML reader emits Ms1Link; PSM model adds precursor_mz_override so secondaries report their own ExpMass/dm.

Reviewed changes

Copilot reviewed 83 out of 85 changed files in this pull request and generated no comments.

Show a summary per file
File Description
README.md Documents .raw / .d / --chimeric, restructures CLI flags into required/optional tables.
docs/superpowers/specs/, docs/design/, docs/parity-analysis/notes/* New design specs, plans, and bench notes for chimeric, fragment-index investigation, .raw, .d, native rescoring pipeline.
.claude/CLAUDE.md Updates "next planned work" with cascade PR + abandoned frag-index note.
.github/workflows/release.yml Adds --features thermo builds and bundles .NET 8 runtime per-target.
crates/input/Cargo.toml, src/lib.rs Adds optional thermo / timstof features and re-exports.
crates/input/src/timstof.rs (+ test) New pure-Rust Bruker .d reader via timsrust.
crates/input/THERMO_LICENSE.txt, tests/thermo_raw.rs Thermo license notice and live .raw integration test (env-gated).
crates/input/src/mgf.rs Fills in new isolation-window fields as None.
crates/model/src/spectrum.rs Adds isolation_lower_offset / isolation_upper_offset.
crates/model/src/isotope.rs (+ lib.rs) New averagine isotope envelope.
crates/model/src/{tolerance, aa_set}.rs Comment cleanups.
crates/scoring/src/gf/{generating_function, score_dist, primitive_graph, group}.rs Adds DROPPED_NODES counter, get_probability OOB→0.0 guard, comment cleanup, test fixture updates.
crates/scoring/src/scoring/{rank_scorer, psm_score}.rs Comment cleanup; test fixture additions for new spectrum fields.
crates/scoring/tests/* Fixture updates for the new spectrum fields.
crates/search/Cargo.toml Promotes input from dev-deps to deps (needed by new modules).
crates/search/src/lib.rs Wires coisolation and chimeric_features modules and re-exports run_pass2_coisolation.
crates/search/src/chimeric_features.rs Observed-vs-averagine KL + SNR scoring.
crates/search/src/precursor_matching.rs New matches_isolation_window for chimeric candidate gating.
crates/search/src/psm.rs New PsmFeatures (KL/SNR/DeltaRawScore), precursor_mz_override, force_push, comment cleanup.
crates/search/src/{search_params, mass_calibrator, precursor_cal, search_index, candidate_gen}.rs New chimeric / chimeric_isolation_halfwidth_da params + fixture updates + comment cleanup.
crates/search/tests/* Fixture updates for the new fields.
crates/output/src/tsv.rs Uses precursor_mz_override for ExpMass/dm in chimeric secondaries.
crates/output/tests/output_pin_schema_parity.rs Updated to expect 4 additive Rust-only columns.
crates/msgf-rust/Cargo.toml Adds thermo / timstof features.
crates/msgf-rust/src/bin/msgf-trace.rs Uses rank_score (node+cleavage+edge) for the trace's score threshold to align with the production SpecEValue path.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (9)
crates/search/src/coisolation.rs (1)

164-177: ⚖️ Poor tradeoff

Consider grouping parameters into a context struct.

search_secondary has 12 parameters, which exceeds typical guidelines and hurts readability. The scoring/search context parameters (scorer, aa_set, enzyme, params, search_index, fragment_tolerance_da) could be grouped into a SecondarySearchContext struct to reduce the parameter count to ~6-7.

♻️ Example refactor
pub(crate) struct SecondarySearchContext<'a> {
    pub scorer: &'a RankScorer,
    pub aa_set: &'a AminoAcidSet,
    pub enzyme: Option<Enzyme>,
    pub params: &'a SearchParams,
    pub search_index: &'a SearchIndex,
    pub fragment_tolerance_da: f64,
}

pub(crate) fn search_secondary(
    spec: &Spectrum,
    primary: &Peptide,
    prior_claimed: &std::collections::HashSet<i64>,
    co: CoIsolated,
    candidates: &[Candidate],
    bucket_index: &BTreeMap<i32, Vec<usize>>,
    ctx: &SecondarySearchContext,
) -> Option<(PsmMatch, std::collections::HashSet<i64>)> {
    // Use ctx.scorer, ctx.aa_set, etc.
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/search/src/coisolation.rs` around lines 164 - 177, The function
search_secondary currently takes many related scoring/search params; refactor by
introducing a SecondarySearchContext struct containing scorer, aa_set, enzyme,
params, search_index, and fragment_tolerance_da, change search_secondary's
signature to accept ctx: &SecondarySearchContext instead of those individual
parameters, update the body of search_secondary to use ctx.scorer, ctx.aa_set,
ctx.enzyme, ctx.params, ctx.search_index, and ctx.fragment_tolerance_da, and
update all call sites to construct and pass a SecondarySearchContext instance
when invoking search_secondary.
crates/search/src/chimeric_features.rs (1)

117-133: 💤 Low value

Consider explicit NaN handling in median calculation.

The function uses partial_cmp(...).unwrap_or(Equal) which treats NaN as equal to any value during sorting, potentially placing NaNs unpredictably. While upstream parsers likely sanitize intensities, an explicit .filter(|&i| !i.is_nan()) after line 121 would make this more robust.

♻️ Optional defensive fix
 fn median_nonzero_intensity(ms1_peaks: &[(f64, f32)]) -> Option<f64> {
     let mut nz: Vec<f64> = ms1_peaks
         .iter()
         .map(|&(_, i)| i as f64)
         .filter(|&i| i > 0.0)
+        .filter(|&i| !i.is_nan())
         .collect();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/search/src/chimeric_features.rs` around lines 117 - 133, The
median_nonzero_intensity function currently sorts intensities using partial_cmp
and can mishandle NaNs; update the data preparation for nz in
median_nonzero_intensity to explicitly filter out NaN values (e.g., after
mapping intensities, add a .filter(|&i| i > 0.0 && !i.is_nan()) or equivalent)
so that NaNs are excluded before sorting and median computation, ensuring stable
ordering and correct median results.
crates/search/src/precursor_matching.rs (1)

65-110: 💤 Low value

Clarify the ppm error computation in the comment.

Lines 74-75 state that mass_error_* is "measured against the nearest in-window neutral mass (clamped)", but the implementation on line 102 computes mass_error_ppm = mass_error_da / peptide_mass * 1e6 using peptide_mass as the denominator, not nearest.

While the Da error is indeed measured against nearest (line 100: mass_error_da = peptide_mass - nearest), the ppm error is computed relative to the peptide's own mass. This is different from the original matches_precursor (line 50) which uses the spectrum neutral mass as the denominator.

This appears intentional for chimeric matching where the error metric should be peptide-centric, and the implementation is correct. However, the comment could be more precise.

📝 Suggested comment clarification
-/// `lo_mz` / `hi_mz` are the isolation-window m/z bounds (selected m/z minus the
-/// lower offset .. plus the upper offset). The reported `mass_error_*` is
-/// measured against the nearest in-window neutral mass (clamped), so a peptide
-/// matching anywhere inside the window reports a near-zero error rather than a
-/// large error against the selected precursor.
+/// `lo_mz` / `hi_mz` are the isolation-window m/z bounds (selected m/z minus the
+/// lower offset .. plus the upper offset). The reported `mass_error_da` is
+/// measured against the nearest in-window neutral mass (clamped); `mass_error_ppm`
+/// is computed relative to the peptide's mass. A peptide matching anywhere inside
+/// the window reports a near-zero error rather than a large error against the
+/// selected precursor.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/search/src/precursor_matching.rs` around lines 65 - 110, The doc
comment for matches_isolation_window is ambiguous about how mass_error_* are
computed; update it to state that mass_error_da is measured against the nearest
in-window neutral mass (nearest) while mass_error_ppm is computed relative to
the peptide mass (peptide_mass) (i.e., mass_error_ppm = mass_error_da /
peptide_mass * 1e6), so readers know this function uses a peptide-centric ppm
metric instead of the spectrum neutral mass as in matches_precursor.
crates/scoring/src/gf/score_dist.rs (1)

79-95: 💤 Low value

Defensive bounds check masks potential caller bugs.

Returning 0.0 when idx >= p.len() prevents panics but silently handles scores outside the allocated range. The comment notes "callers normally guard score >= max_score themselves," which suggests this case shouldn't occur in correct usage. Consider whether a debug_assert!(idx < p.len(), "score {score} out of range [{}, {})", min_score, max_score) would help catch caller bugs during development while keeping the defensive return in release builds.

The current defensive approach prioritizes robustness, which is reasonable for production code, but the tradeoff is that it may hide logic errors.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/scoring/src/gf/score_dist.rs` around lines 79 - 95, Add a debug-only
assertion to detect out-of-range callers in get_probability while preserving the
defensive 0.0 fallback: compute the index as currently done, then call
debug_assert!(idx < p.len(), "score {} out of range [{}, {})", score,
self.bound.min_score, self.bound.max_score) (referencing get_probability,
self.prob_distribution, and self.bound.min_score/max_score) before the existing
if idx >= p.len() check so development builds will surface caller bugs but
release behavior still returns 0.0.
docs/superpowers/plans/2026-05-30-chimeric-sage-style-fragment-index.md (1)

314-340: ⚡ Quick win

Verify the neutral-mass window derivation.

Lines 322-326 note that candidate_nominal_bounds returns nominal mass, and the code divides by INTEGER_MASS_SCALER to convert to neutral mass. The comment explicitly flags this as an integration subtlety requiring verification. The implementer must confirm the correct conversion or derive the window directly from precursor_mz and charge (as suggested in the NOTE at line 342). The Task 4 recall gate will catch a mismatch, but catching it earlier during local testing would be better.

Consider adding a small local test in Step 3 that verifies the window derivation matches window_cand_indices on a known example before running the full VM gate.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-05-30-chimeric-sage-style-fragment-index.md`
around lines 314 - 340, The neutral-mass window conversion used when replacing
the `cand_iter` arm in `run_chunk_inner` may be incorrect: confirm that
`candidate_nominal_bounds(spec, z, params, shift_ppm)` returns nominal (integer)
mass and that dividing by `model::mass::INTEGER_MASS_SCALER` yields the same
neutral-mass range used by `window_cand_indices`; if not, compute the window
directly from `precursor_mz` and charge (or adjust the conversion) so `lo`/`hi`
match `window_cand_indices` semantics. Update the `Some(si)` chimeric path (the
block where you call `si.query(...)` using
`scored_spec_for_charge(z).dump_active_peaks()` and
`scorer.param().data_type.instrument.is_high_resolution()` to pick tol) to use
the corrected neutral-mass bounds, then add a small unit/local test that
compares the derived `(lo,hi)` vs the bounds implied by `window_cand_indices`
for a known example to catch mismatches before the VM gate.
crates/scoring/src/gf/generating_function.rs (1)

22-33: 💤 Low value

Process-global counter accumulates across all GF computations.

The DROPPED_NODES counter is never reset and accumulates across the entire process lifetime. In long-running processes or test suites that invoke GF computation many times, the count will grow unboundedly. This is documented as intentional ("cumulative count since process start"), but consider whether per-computation counts (returned from compute or tracked per GeneratingFunction instance) would be more useful for diagnosing individual problematic graphs.

If the cumulative global count is the desired semantic (e.g., for process-level health metrics), this is fine as-is.

crates/search/src/match_engine.rs (2)

956-956: ⚡ Quick win

Inconsistent HashSet usage: prefer FxHashSet for performance.

Line 956 uses std::collections::HashSet while the rest of the file uses rustc_hash::FxHashSet (imported at line 31) for fast, non-cryptographic hashing. The claimed set is used in a per-scan loop that could run frequently under --chimeric.

For consistency and performance, change to FxHashSet.

♻️ Proposed fix
-        let mut claimed: std::collections::HashSet<i64> = std::collections::HashSet::new();
+        let mut claimed: FxHashSet<i64> = FxHashSet::default();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/search/src/match_engine.rs` at line 956, The local set named `claimed`
(created for the per-scan loop iterating `for co in cos`) is currently a
std::collections::HashSet; replace it with rustc_hash::FxHashSet to match the
rest of the file and gain faster non-cryptographic hashing: update the type used
where `claimed` is declared/initialized to `FxHashSet`, ensure the `FxHashSet`
symbol already imported (it is imported earlier in the file), and keep the same
element type and usage so the surrounding logic in the loop (`for co in cos`,
insertions, lookups, etc.) continues to compile and behave identically.

747-776: ⚖️ Poor tradeoff

Env-gated diagnostic has high complexity; consider optimization or clearer safety note.

The chimeric overlap diagnostic clones the entire queue, builds peptide sequences for the top-2 distinct peptides, and computes set intersections on matched peak keys. While double-gated (chim_overlap env var + params.chimeric), this is still expensive if accidentally enabled.

Consider either:

  • Adding a compile-time #[cfg(feature = "diagnostics")] gate so it's truly zero-cost in release builds, or
  • Adding a comment at the diagnostic's definition warning maintainers this is research-only and should not be enabled in production pipelines.

The diagnostic is valuable for research (as documented in the bench notes), but the current env-var-only gate means a mistyped environment variable could degrade production performance.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/search/src/match_engine.rs` around lines 747 - 776, The diagnostic
block guarded by chim_overlap and params.chimeric in match_engine.rs performs an
expensive clone of queue and heavy set operations (using queue.clone(), building
peptide sequences from candidates, and matched_peak_keys/scored_spec_for_charge)
and should be made zero-cost in non-research builds; either add a compile-time
gate (e.g., wrap the entire block with #[cfg(feature = "diagnostics")] so it is
compiled out unless the diagnostics feature is enabled) or, if you prefer to
keep it runtime-gated, add a prominent comment above the chim_overlap diagnostic
explaining it is research-only, expensive, and must not be enabled in production
(mentioning the symbols queue.clone(), matched_peak_keys,
scored_spec_for_charge, and params.chimeric so maintainers can find it).
docs/parity-analysis/notes/2026-05-28-chimeric-fragment-overlap-diagnostic.md (1)

50-55: 💤 Low value

Add language specifier to the fenced code block.

The shell command block lacks a language identifier. Add shell after the opening backticks for proper syntax highlighting.

📝 Proposed fix
-```
+```shell
 MSGF_CHIMERIC_OVERLAP=1 <astral chimeric run> 2> astral-overlap.log
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@docs/parity-analysis/notes/2026-05-28-chimeric-fragment-overlap-diagnostic.md`
around lines 50 - 55, The fenced code block containing the shell snippet
(starting with MSGF_CHIMERIC_OVERLAP=1 and referencing astral-overlap.log) lacks
a language identifier; update the opening triple backticks to include "shell"
(i.e., ```shell) so the block is properly syntax-highlighted, leaving the inner
lines unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/input/Cargo.toml`:
- Around line 15-29: The Cargo.toml pins an unavailable thermorawfilereader
version (thermorawfilereader = "0.7.0") which will break resolution; update the
dependency entry for thermorawfilereader to a published version (e.g., match
crates.io latest such as 0.6.0) or vendor the crate, and ensure the thermo
feature (thermo = ["dep:thermorawfilereader"]) still references that updated
dependency; also verify or document any MSRV/rustc requirement for the
thermo/timstof features if the chosen version imposes a minimum compiler
version.

In `@crates/input/src/timstof.rs`:
- Around line 160-168: Validate that raw.mz_values.len() ==
raw.intensities.len() and return an error (or propagate a Result) if they
differ, then build peaks by iterating indices and filter out any pairs where mz
is not finite (f64::is_finite) or intensity is not finite after casting
(f32::is_finite) before collecting into peaks; finally keep the existing check
using peaks.windows(2) and peaks.sort_by(...). Use the existing symbols peaks,
raw.mz_values, raw.intensities, and the sort_by/windows logic so you drop
non-finite points early and fail fast on malformed parallel arrays.

In `@crates/search/src/match_engine.rs`:
- Around line 702-736: The block guarded by the hardcoded constant
CASCADE_SKIP_MS1_FEATURE is dead code; replace the compile-time constant with a
runtime flag (e.g., add a boolean field
params.compute_precursor_isotope_features) and use that flag to conditionally
run the MS1 precursor isotope-envelope computation (the existing logic that
checks ms1_link, ms2_to_ms1, ms1_peaks, computes z/theo_mono_mz/tol_da and calls
precursor_isotope_match and assigns features.precursor_isotope_kl and
features.precursor_snr). Remove the CASCADE_SKIP_MS1_FEATURE constant, thread
the new params flag through callers as needed, and keep the rest of the existing
precursor_isotope_match flow unchanged so the feature can be toggled at runtime
for experiments.

In `@docs/parity-analysis/notes/2026-05-30-chimeric-cost-profile.md`:
- Around line 7-10: The Markdown heading "Self-time (top)" is immediately
followed by a table which triggers MD058; insert a single blank line between the
heading line ("Self-time (top)") and the table header row ("| function | self% |
bucket |") so the heading is separated from the table and the linter passes.

In
`@docs/superpowers/specs/2026-05-29-chimeric-fragment-index-prefilter-design.md`:
- Around line 103-109: The markdown table starting with the header row "| Risk |
Mitigation |" is missing surrounding blank lines and triggers MD058; update the
document (the table block shown in the diff) by inserting a blank line both
immediately before the table header and immediately after the final table row so
the table is separated from adjacent paragraphs/blocks, which will satisfy
markdownlint.

In
`@docs/superpowers/specs/2026-05-30-chimeric-sage-style-fragment-index-design.md`:
- Around line 105-106: Insert a single blank line between the "## Risks &
mitigations" heading and the table that follows so the Markdown table does not
immediately follow the heading (fix the markdownlint MD058 warning); locate the
"## Risks & mitigations" heading in the document and add one empty line before
the pipe-delimited table starting with "| Risk | Mitigation |".

In `@docs/superpowers/specs/2026-05-30-chimeric-two-pass-cascade-design.md`:
- Around line 93-94: Add a blank line between the "## Risks & mitigations"
heading and the table that immediately follows so the Markdown renders correctly
(fix MD058). Locate the heading text "## Risks & mitigations" and insert an
empty line before the table rows starting with "| Risk | Mitigation |".

---

Nitpick comments:
In `@crates/scoring/src/gf/score_dist.rs`:
- Around line 79-95: Add a debug-only assertion to detect out-of-range callers
in get_probability while preserving the defensive 0.0 fallback: compute the
index as currently done, then call debug_assert!(idx < p.len(), "score {} out of
range [{}, {})", score, self.bound.min_score, self.bound.max_score) (referencing
get_probability, self.prob_distribution, and self.bound.min_score/max_score)
before the existing if idx >= p.len() check so development builds will surface
caller bugs but release behavior still returns 0.0.

In `@crates/search/src/chimeric_features.rs`:
- Around line 117-133: The median_nonzero_intensity function currently sorts
intensities using partial_cmp and can mishandle NaNs; update the data
preparation for nz in median_nonzero_intensity to explicitly filter out NaN
values (e.g., after mapping intensities, add a .filter(|&i| i > 0.0 &&
!i.is_nan()) or equivalent) so that NaNs are excluded before sorting and median
computation, ensuring stable ordering and correct median results.

In `@crates/search/src/coisolation.rs`:
- Around line 164-177: The function search_secondary currently takes many
related scoring/search params; refactor by introducing a SecondarySearchContext
struct containing scorer, aa_set, enzyme, params, search_index, and
fragment_tolerance_da, change search_secondary's signature to accept ctx:
&SecondarySearchContext instead of those individual parameters, update the body
of search_secondary to use ctx.scorer, ctx.aa_set, ctx.enzyme, ctx.params,
ctx.search_index, and ctx.fragment_tolerance_da, and update all call sites to
construct and pass a SecondarySearchContext instance when invoking
search_secondary.

In `@crates/search/src/match_engine.rs`:
- Line 956: The local set named `claimed` (created for the per-scan loop
iterating `for co in cos`) is currently a std::collections::HashSet; replace it
with rustc_hash::FxHashSet to match the rest of the file and gain faster
non-cryptographic hashing: update the type used where `claimed` is
declared/initialized to `FxHashSet`, ensure the `FxHashSet` symbol already
imported (it is imported earlier in the file), and keep the same element type
and usage so the surrounding logic in the loop (`for co in cos`, insertions,
lookups, etc.) continues to compile and behave identically.
- Around line 747-776: The diagnostic block guarded by chim_overlap and
params.chimeric in match_engine.rs performs an expensive clone of queue and
heavy set operations (using queue.clone(), building peptide sequences from
candidates, and matched_peak_keys/scored_spec_for_charge) and should be made
zero-cost in non-research builds; either add a compile-time gate (e.g., wrap the
entire block with #[cfg(feature = "diagnostics")] so it is compiled out unless
the diagnostics feature is enabled) or, if you prefer to keep it runtime-gated,
add a prominent comment above the chim_overlap diagnostic explaining it is
research-only, expensive, and must not be enabled in production (mentioning the
symbols queue.clone(), matched_peak_keys, scored_spec_for_charge, and
params.chimeric so maintainers can find it).

In `@crates/search/src/precursor_matching.rs`:
- Around line 65-110: The doc comment for matches_isolation_window is ambiguous
about how mass_error_* are computed; update it to state that mass_error_da is
measured against the nearest in-window neutral mass (nearest) while
mass_error_ppm is computed relative to the peptide mass (peptide_mass) (i.e.,
mass_error_ppm = mass_error_da / peptide_mass * 1e6), so readers know this
function uses a peptide-centric ppm metric instead of the spectrum neutral mass
as in matches_precursor.

In
`@docs/parity-analysis/notes/2026-05-28-chimeric-fragment-overlap-diagnostic.md`:
- Around line 50-55: The fenced code block containing the shell snippet
(starting with MSGF_CHIMERIC_OVERLAP=1 and referencing astral-overlap.log) lacks
a language identifier; update the opening triple backticks to include "shell"
(i.e., ```shell) so the block is properly syntax-highlighted, leaving the inner
lines unchanged.

In `@docs/superpowers/plans/2026-05-30-chimeric-sage-style-fragment-index.md`:
- Around line 314-340: The neutral-mass window conversion used when replacing
the `cand_iter` arm in `run_chunk_inner` may be incorrect: confirm that
`candidate_nominal_bounds(spec, z, params, shift_ppm)` returns nominal (integer)
mass and that dividing by `model::mass::INTEGER_MASS_SCALER` yields the same
neutral-mass range used by `window_cand_indices`; if not, compute the window
directly from `precursor_mz` and charge (or adjust the conversion) so `lo`/`hi`
match `window_cand_indices` semantics. Update the `Some(si)` chimeric path (the
block where you call `si.query(...)` using
`scored_spec_for_charge(z).dump_active_peaks()` and
`scorer.param().data_type.instrument.is_high_resolution()` to pick tol) to use
the corrected neutral-mass bounds, then add a small unit/local test that
compares the derived `(lo,hi)` vs the bounds implied by `window_cand_indices`
for a known example to catch mismatches before the VM gate.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 17d7e6db-037a-43e5-91f2-0138e0ab628c

📥 Commits

Reviewing files that changed from the base of the PR and between 6dd86be and 37971fa.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • test-fixtures/parity/goldens/precursor_cal_off.tsv is excluded by !**/*.tsv
📒 Files selected for processing (83)
  • .claude/CLAUDE.md
  • .github/workflows/release.yml
  • README.md
  • crates/input/Cargo.toml
  • crates/input/THERMO_LICENSE.txt
  • crates/input/src/lib.rs
  • crates/input/src/mgf.rs
  • crates/input/src/mzml.rs
  • crates/input/src/thermo.rs
  • crates/input/src/timstof.rs
  • crates/input/tests/thermo_raw.rs
  • crates/input/tests/timstof_d_loads.rs
  • crates/model/src/aa_set.rs
  • crates/model/src/isotope.rs
  • crates/model/src/lib.rs
  • crates/model/src/spectrum.rs
  • crates/model/src/tolerance.rs
  • crates/msgf-rust/Cargo.toml
  • crates/msgf-rust/src/bin/msgf-rust.rs
  • crates/msgf-rust/src/bin/msgf-trace.rs
  • crates/output/src/pin.rs
  • crates/output/src/tsv.rs
  • crates/output/tests/output_pin_schema_parity.rs
  • crates/scoring/src/gf/generating_function.rs
  • crates/scoring/src/gf/group.rs
  • crates/scoring/src/gf/primitive_graph.rs
  • crates/scoring/src/gf/score_dist.rs
  • crates/scoring/src/scoring/psm_score.rs
  • crates/scoring/src/scoring/rank_scorer.rs
  • crates/scoring/src/scoring/scored_spectrum.rs
  • crates/scoring/tests/gf_graph_dp.rs
  • crates/scoring/tests/primitive_graph_arena_parity.rs
  • crates/search/Cargo.toml
  • crates/search/src/candidate_gen.rs
  • crates/search/src/chimeric_features.rs
  • crates/search/src/coisolation.rs
  • crates/search/src/lib.rs
  • crates/search/src/mass_calibrator.rs
  • crates/search/src/match_engine.rs
  • crates/search/src/precursor_cal.rs
  • crates/search/src/precursor_matching.rs
  • crates/search/src/psm.rs
  • crates/search/src/search_index.rs
  • crates/search/src/search_params.rs
  • crates/search/tests/mass_calibrator_integration.rs
  • crates/search/tests/match_engine_java_parity.rs
  • crates/search/tests/match_engine_smoke.rs
  • crates/search/tests/match_engine_specevalue.rs
  • crates/search/tests/precursor_matching.rs
  • docs/2026-05-28-psm-gain-state-and-roadmap.md
  • docs/design/2026-06-01-thermo-raw-input.md
  • docs/design/2026-06-01-timstof-d-input.md
  • docs/parity-analysis/notes/2026-05-28-SESSION-HANDOFF.md
  • docs/parity-analysis/notes/2026-05-28-chimeric-fragment-overlap-diagnostic.md
  • docs/parity-analysis/notes/2026-05-28-chimeric-phase1-bench.md
  • docs/parity-analysis/notes/2026-05-28-chimeric-phase2-bench.md
  • docs/parity-analysis/notes/2026-05-29-chimeric-full-review-and-rethink.md
  • docs/parity-analysis/notes/2026-05-29-chimeric-phase3-bench-canary-fails.md
  • docs/parity-analysis/notes/2026-05-29-entrapment-fdp-reversal.md
  • docs/parity-analysis/notes/2026-05-29-gate-chimeric-norescore-vs-java.md
  • docs/parity-analysis/notes/2026-05-29-rank-stratified-fdr-bench.md
  • docs/parity-analysis/notes/2026-05-30-cascade-astral-breakthrough.md
  • docs/parity-analysis/notes/2026-05-30-chimeric-cost-profile.md
  • docs/parity-analysis/notes/2026-05-30-frag-index-pxd-fails-lowres.md
  • docs/parity-analysis/notes/2026-05-30-sage-index-astral-and-chimeric-speed-conclusion.md
  • docs/parity-analysis/notes/2026-05-30-sage-index-pxd-gate.md
  • docs/parity-analysis/notes/2026-05-31-cascade-optimized-multidataset-summary.md
  • docs/parity-analysis/notes/2026-05-31-tmt-gap-diagnosis-not-gf-bug.md
  • docs/parity-analysis/notes/2026-06-01-p0-parity-audit-bench.md
  • docs/superpowers/plans/2026-05-28-chimeric-dda-plus-phase1-plan.md
  • docs/superpowers/plans/2026-05-28-chimeric-dda-plus-phase2-plan.md
  • docs/superpowers/plans/2026-05-29-chimeric-fragment-index-prefilter.md
  • docs/superpowers/plans/2026-05-30-chimeric-sage-style-fragment-index.md
  • docs/superpowers/plans/2026-05-30-chimeric-two-pass-cascade.md
  • docs/superpowers/specs/2026-05-28-chimeric-dda-plus-integration-design.md
  • docs/superpowers/specs/2026-05-29-chimeric-fragment-index-prefilter-design.md
  • docs/superpowers/specs/2026-05-29-chimeric-phase3-shared-fragment-design.md
  • docs/superpowers/specs/2026-05-29-chimeric-rank-stratified-fdr-design.md
  • docs/superpowers/specs/2026-05-29-ms2rescore-entrapment-fdp-proof.md
  • docs/superpowers/specs/2026-05-30-chimeric-sage-style-fragment-index-design.md
  • docs/superpowers/specs/2026-05-30-chimeric-two-pass-cascade-design.md
  • docs/superpowers/specs/2026-05-31-native-rescoring-pipeline-design.md
  • test-fixtures/parity/goldens/precursor_cal_off.pin

Comment thread crates/input/Cargo.toml
Comment on lines +15 to +29
thermorawfilereader = { version = "0.7.0", optional = true }
timsrust = { version = "0.4.2", optional = true }

[features]
# Native Thermo .raw reading via the official RawFileReader (hosted .NET 8
# runtime, FlatBuffers). Optional + off-by-default at the crate level; the
# msgf-rust binary turns it on by default. Builds without the .NET SDK
# (assemblies are vendored); only opening a `.raw` needs the .NET 8 runtime.
thermo = ["dep:thermorawfilereader"]
# Native Bruker timsTOF `.d` reading via the pure-Rust `timsrust` crate (the
# same reader Sage uses). Optional + off-by-default. NO vendor runtime and NO
# bundling — a `.d` is read natively (TDF SQLite + binary blob). Default builds
# stay pure and read mzML/MGF only; the optional dep tree only compiles under
# this feature.
timstof = ["dep:timsrust"]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | 💤 Low value

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the pinned versions exist and are not yanked, plus their rust-version (MSRV).
for crate in thermorawfilereader timsrust; do
  echo "== $crate =="
  curl -s "https://crates.io/api/v1/crates/$crate" | jq -r '.versions[] | "\(.num)\tyanked=\(.yanked)\trust_version=\(.rust_version)"' | head -20
done

Repository: bigbio/msgf-rust

Length of output: 235


🌐 Web query:

crates.io thermorawfilereader 0.7.0 rust_version yanked

💡 Result:

Crate: thermorawfilereader Version: 0.7.0 Yanked: Not confirmed as yanked (no authoritative source found in this research showing that 0.7.0 is yanked). [1] Rust version / MSRV ("rust_version"): Not found in the sources retrieved here. (Docs.rs page shown for 0.7.0 doesn’t expose MSRV/rust_version in the captured snippets.) [1] What I did find: the crate currently shows version history listing 0.6.0 as the latest on crates.io, and a Docs.rs source page indicates there is a 0.7.0 build available on Docs.rs. [2][1]

Citations:


🌐 Web query:

crates.io timsrust 0.4.2 rust_version yanked

💡 Result:

Crates.io page for timsrust 0.4.2 does not indicate that version 0.4.2 is yanked; it shows a published release dated 2024-12-19.[1] The most authoritative sources found for this specific version (crates.io and docs.rs) only provide release metadata and do not label 0.4.2 as yanked.[1][2] What I could not confirm from the sources retrieved: an explicit “yanked” flag or yanked reason for timsrust 0.4.2 on crates.io. If you want, I can do a second targeted search specifically for the crates.io “yanked” field/state for that exact version.

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check repo Cargo.toml context for any rust-version/MSRV mention
echo "== crates/input/Cargo.toml =="
rg -n "rust-version|package\.rust-version|MSRV" crates/input/Cargo.toml || true
sed -n '1,120p' crates/input/Cargo.toml | nl -ba | sed -n '1,120p' || true

# Query crates.io API for exact versions to get yanked + rust_version
for pair in "thermorawfilereader 0.7.0" "timsrust 0.4.2"; do
  crate=$(echo "$pair" | awk '{print $1}')
  ver=$(echo "$pair" | awk '{print $2}')
  echo "== crates.io api: $crate $ver =="
  url="https://crates.io/api/v1/crates/${crate}/${ver}"
  # Print the key fields if present
  curl -s "$url" | jq -r '
    if .version == null then "version: null (unexpected response)" else
    "yanked=\(.version.yanked // "null")\nrust_version=\(.version.rust_version // "null")\nnum=\(.version.num // "null")"
    end
  '
done

Repository: bigbio/msgf-rust

Length of output: 336


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== crates/input/Cargo.toml (first 80 lines with numbers) =="
# nl may not exist; use cat -n
cat -n crates/input/Cargo.toml | sed -n '1,80p'

for pair in "thermorawfilereader 0.7.0" "timsrust 0.4.2"; do
  crate=$(echo "$pair" | awk '{print $1}')
  ver=$(echo "$pair" | awk '{print $2}')
  url="https://crates.io/api/v1/crates/${crate}/${ver}"
  echo
  echo "== crates.io api status: $crate $ver =="
  # Show HTTP status and content-type
  curl -sS -L -o /tmp/crates_resp.json -w "HTTP=%{http_code} CT=%{content_type}\n" "$url" || true
  echo "== crates.io api preview (first 300 bytes) =="
  head -c 300 /tmp/crates_resp.json | cat
  echo
  echo "== crates.io api parsed fields (if JSON) =="
  # Try parsing only if it looks like JSON
  if head -c 1 /tmp/crates_resp.json | grep -q '{'; then
    jq -r '
      "num=\(.version.num // "null")\nyanked=\((.version.yanked|tostring) // "null")\nrust_version=\(.version.rust_version // "null")"
    ' /tmp/crates_resp.json || true
  else
    echo "Response not JSON; skipping jq."
  fi
done

Repository: bigbio/msgf-rust

Length of output: 2443


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "rust-version|MSRV|1\.88|rustc.*1\.88" Cargo.toml **/Cargo.toml . || true

Repository: bigbio/msgf-rust

Length of output: 50445


🌐 Web query:

docs.rs thermorawfilereader 0.7.0 Minimum Supported Rust Version

💡 Result:

As of June 1, 2026, there is no public version 0.7.0 of the thermorawfilereader crate available on crates.io or docs.rs [1][2]. The latest released version of the thermorawfilereader crate is 0.6.0, published on March 31, 2026 [1]. Because version 0.7.0 does not exist, there is no defined Minimum Supported Rust Version (MSRV) for it [1]. For the existing 0.6.0 version, the crate documentation and metadata do not explicitly specify a formal MSRV [1]. The crate relies on the.NET 8 runtime to function, which is a primary prerequisite for its operation [1][3].

Citations:


🌐 Web query:

docs.rs timsrust 0.4.2 Minimum Supported Rust Version

💡 Result:

The timsrust crate version 0.4.2 does not explicitly specify a "rust-version" (Minimum Supported Rust Version or MSRV) in its Cargo.toml file [1]. While the crate uses the 2021 Rust edition [1], it does not impose a stricter lower bound on the compiler version via the Cargo manifest. Users can generally verify the current configuration by checking the official repository's Cargo.toml file [1].

Citations:


🌐 Web query:

crates.io thermorawfilereader 0.7.0 yanked

💡 Result:

Result: thermorawfilereader 0.7.0 does not appear to be yanked on crates.io; in fact, the crates.io crate page shows the latest published version as 0.6.0, with no 0.7.0 listed in its version history. [1] Evidence details: - The crates.io entry for thermorawfilereader lists “Latest version: 0.6.0” and shows version history entries including 0.6.0 (2026-03-31), 0.5.5, 0.5.4, etc., but not 0.7.0. [1] - The docs.rs page for “thermorawfilereader 0.7.0” is accessible via a “latest/source” URL, but it doesn’t provide any crates.io “yanked” indicator for version 0.7.0, and (critically) crates.io’s version list for the crate doesn’t show 0.7.0 as a published version to be yanked. [2][1] If you intended a different version (e.g., 0.6.0 or one of the earlier versions), tell me which version string you care about and I’ll re-check its crates.io yanked status.

Citations:


🌐 Web query:

crates.io timsrust 0.4.2 yanked

💡 Result:

For timsrust 0.4.2, I could not find any evidence that the version is yanked. The crates.io page for timsrust 0.4.2 loads as a normal published version, listing it as the latest version and showing the usual metadata for that release (e.g., downloads, created/updated dates) rather than indicating a yank status [1]. The Docs.rs entry for the same version also exists (docs build for 0.4.2), which is consistent with the crate/version being available normally, not removed via yanking [2]. If you meant “verify via the crates.io yank endpoint/flag,” crates.io’s UI elements for yanking weren’t exposed in the search snippets I retrieved, so I can’t cite a definitive “not yanked” flag value from crates.io itself—only that the version appears published and accessible via crates.io and docs.rs [1][2].

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== root Cargo.toml rust-version =="
sed -n '1,40p' Cargo.toml | nl -ba

echo
echo "== README.md MSRV statements around thermo/timstof =="
rg -n "Build needs rustc|timsrust.*rustc|rustc ≥ 1\.88" README.md
sed -n '140,210p' README.md | nl -ba

echo
echo "== docs/design 2026-06-01-timstof-d-input.md relevant snippet =="
sed -n '110,170p' docs/design/2026-06-01-timstof-d-input.md | nl -ba

echo
echo "== occurrences of thermorawfilereader version pin =="
rg -n "thermorawfilereader\s*=\s*\{[^}]*version\s*=\s*\"[0-9]+\.[0-9]+\.[0-9]+\"" -S .

Repository: bigbio/msgf-rust

Length of output: 156


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== root Cargo.toml (first 60 lines) =="
cat -n Cargo.toml | sed -n '1,60p'

echo
echo "== README.md MSRV statements around thermo/timstof (match lines) =="
rg -n "Build needs rustc|timsrust.*rustc|rustc ≥ 1\.88|MSRV" README.md || true

echo
echo "== README.md thermo/timstof sections (approx 130-190) =="
sed -n '130,210p' README.md | cat -n

echo
echo "== docs/design/2026-06-01-timstof-d-input.md relevant snippet (110-170) =="
sed -n '110,170p' docs/design/2026-06-01-timstof-d-input.md | cat -n

echo
echo "== occurrences of thermorawfilereader version pins (only Cargo.toml files) =="
rg -n "thermorawfilereader\s*=\s*\{[^}]*version\s*=\s*\"[0-9]+\.[0-9]+\.[0-9]+\"" --glob "**/Cargo.toml" . || true

echo
echo "== occurrences of timsrust version pins (only Cargo.toml files) =="
rg -n "timsrust\s*=\s*\{[^}]*version\s*=\s*\"[0-9]+\.[0-9]+\.[0-9]+\"" --glob "**/Cargo.toml" . || true

Repository: bigbio/msgf-rust

Length of output: 12304


Fix thermorawfilereader pin: 0.7.0 likely doesn’t exist on crates.io, so the dependency may fail to resolve.

  • crates/input/Cargo.toml pins thermorawfilereader = { version = "0.7.0", optional = true } and timsrust = { version = "0.4.2", optional = true }.
  • crates.io’s thermorawfilereader page shows latest 0.6.0 and the retrieved version history does not list 0.7.0 → adjust the pin (or vendor) to an actually published version.
  • timsrust 0.4.2 appears published with no retrieved evidence of yanking; however, external MSRV metadata for the exact pinned versions couldn’t be pulled here (crates.io API is blocked with 403), so the repo’s documented rustc ≥ 1.88 requirement for the feature build remains repo-asserted, not evidenced from crate metadata.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/input/Cargo.toml` around lines 15 - 29, The Cargo.toml pins an
unavailable thermorawfilereader version (thermorawfilereader = "0.7.0") which
will break resolution; update the dependency entry for thermorawfilereader to a
published version (e.g., match crates.io latest such as 0.6.0) or vendor the
crate, and ensure the thermo feature (thermo = ["dep:thermorawfilereader"])
still references that updated dependency; also verify or document any MSRV/rustc
requirement for the thermo/timstof features if the chosen version imposes a
minimum compiler version.

Comment thread crates/input/src/timstof.rs Outdated
Comment thread crates/search/src/match_engine.rs Outdated
Comment thread docs/parity-analysis/notes/2026-05-30-chimeric-cost-profile.md Outdated
Comment thread docs/superpowers/specs/2026-05-29-chimeric-fragment-index-prefilter-design.md Outdated
Comment thread docs/superpowers/specs/2026-05-30-chimeric-sage-style-fragment-index-design.md Outdated
Comment thread docs/superpowers/specs/2026-05-30-chimeric-two-pass-cascade-design.md Outdated
@ypriverol
Copy link
Copy Markdown
Member Author

Code review (+ adversarial pass)

Both a 5-agent review and a Codex adversarial review ran against the dev→master diff. No correctness blocker in the default mzML/MGF engine, but the native-format surface has real gaps worth fixing before master:

  1. (high) Bruker .d searches use the Orbitrap HCD_QExactive model, not a TOF/timsTOF model. Auto-routing (auto_route_eligible) only covers mzML + .raw, and the .d reader leaves activation/instrument unset, so the documented no-flags .d invocation falls through to resolve_bundled_paramHCD_QExactive_Tryp.param. timsTOF data is then scored with the wrong (Orbitrap) model. This is corroborated by the PR Native Bruker timsTOF .d input (timstof feature) #45 benchmark (4,002 PSMs at a target:decoy ratio of ~1.07 — weak discrimination consistent with a param mismatch). Fix: add .d auto-routing to a TOF model, or require explicit --instrument/--fragmentation for .d.

None => {
let auto_route_eligible = cli.fragmentation == Fragmentation::Auto && (is_mzml || is_raw);
// Detect (activation, instrument) from the input. mzML peeks the
// file; Thermo `.raw` reads the vendor activation/analyzer metadata
// so a CID/ETD/ion-trap `.raw` is not silently scored with the
// HCD/QExactive model.
let detected = if !auto_route_eligible {
None
} else if is_mzml {
detect_dominant_activation(&cli.spectrum)
.map(|m| (m, detect_instrument_type_for_path(&cli.spectrum)))
} else {
// is_raw
#[cfg(feature = "thermo")]
{
input::thermo::detect_activation_instrument(&cli.spectrum, 64)
}
#[cfg(not(feature = "thermo"))]
{
None
}
};
if auto_route_eligible {
match detected {

  1. (medium) The release workflow never builds the timstof feature, so published binaries cannot open .d. The matrix only toggles thermo; every build is --features thermo or plain. Prebuilt artifacts hit the "rebuild with --features timstof" error on any .d. Fix: enable timstof on the release targets (it's pure Rust, no runtime bundling), or drop .d from release docs until shipped.

include:
- target: x86_64-unknown-linux-gnu
os: ubuntu-latest
thermo: true
dotnet_arch: x64
- target: aarch64-unknown-linux-gnu
os: ubuntu-latest
thermo: false
linker_pkg: gcc-aarch64-linux-gnu
cargo_linker: aarch64-linux-gnu-gcc
- target: x86_64-apple-darwin
os: macos-13
thermo: true
dotnet_arch: x64
- target: aarch64-apple-darwin
os: macos-latest
thermo: true
dotnet_arch: arm64
- target: x86_64-pc-windows-msvc
os: windows-latest
thermo: true
dotnet_arch: x64

  1. (medium) Non-chimeric .raw/.d MS-level handling is inconsistent for non-default levels. The default (--ms-level 2) is safe, but the Thermo iterator routes any requested level through the MS2-only convert(): --ms-level 1 silently yields nothing (MS1 has no precursor → skipped), and --ms-level 3 would emit MS3 scans into the search (the MS3-skip guard exists only in the chimeric read_with_ms1_chunked, not the non-chimeric iterator) — exactly the TMT SPS-MS3 reporter scans the code elsewhere is careful to exclude. Fix: reject unsupported levels at the CLI boundary, or keep the iterator MS2-only.

fn next(&mut self) -> Option<Self::Item> {
while self.next < self.len {
let i = self.next;
self.next += 1;
let raw = match self.handle.get(i) {
Some(raw) => raw,
None => continue,
};
if let Some(want) = self.ms_level_filter {
if raw.ms_level() != want {
continue;
}
}
match Self::convert(&raw) {
// Skip MS2 scans with no usable precursor (mirrors mzML).
Some(spec) => return Some(Ok(spec)),

  1. (low) The --ms-level warning never fires for .d. if is_mgf && cli.ms_level != 2, but is_mgf = !is_mzml && !is_raw && !is_d, so a .d input (whose message text claims coverage) gets no warning. Cosmetic (the result is also unused). Fix: (is_mgf || is_d).

let mzml_warn_ms_level_emitted = if is_mgf && cli.ms_level != 2 {
eprintln!(
"WARN: --ms-level={} requested for an MGF / Bruker .d input; these \
inputs do not select an MS level here (MGF has none; the .d reader \
yields DDA MS2 only). The flag has no effect on this input.",
cli.ms_level

Also noted (cleanup, non-blocking): PrecursorIsotopeKL/PrecursorSNR ship as always-0.0 PIN columns (CASCADE_SKIP_MS1_FEATURE=true); the mzML read_with_ms1_chunked doc says "first parse error stops streaming" but the code resyncs+continues. Much of the rest was already reviewed in #42/#44/#45; #44's four findings are confirmed fixed in dev.

ypriverol added 2 commits June 1, 2026 16:46
Resolves the 3 adversarial-review findings on PR #46:

1. (high) Bruker .d was scored with the Orbitrap HCD_QExactive model. The param
   auto-router now routes .d (timsTOF DDA-PASEF = beam-type CID on a TOF analyzer)
   to (CID, TOF) -> CID_TOF_Tryp.param, instead of falling through to the
   HCD_QExactive default. Override with --fragmentation/--instrument.

2. (medium) Release binaries could not open .d: the release workflow built only
   --features thermo. The 4 native targets now build --features "thermo timstof"
   so prebuilt binaries read both .raw (bundled .NET 8) and .d (pure Rust). The
   linux-arm64 cross target stays plain (timsrust's zstd C dep complicates cross).

3. (medium) Non-chimeric .raw/.d MS-level: --ms-level 3 could search MS3 (e.g.
   TMT SPS-MS3 reporter scans) and --ms-level 1 yielded an empty run. The native
   .raw reader is now hardcoded to MS2; a non-2 --ms-level on .raw/.d warns and is
   ignored (the chimeric path was already MS2-only). --ms-level still applies to mzML.

Also fixes the misleading MGF/.d ms-level warning (it claimed .d coverage but its
is_mgf guard excluded .d; .d is now covered by the native warning above).

Default + --features timstof build green; param-resolver tests pass.
@ypriverol
Copy link
Copy Markdown
Member Author

Review findings addressed (971a345)

All three adversarial-review findings are fixed (Copilot generated no comments):

  1. (high) .d scored with the Orbitrap model → fixed. The param auto-router now routes .d (timsTOF DDA-PASEF = beam-type CID on a TOF analyzer) to (CID, TOF)CID_TOF_Tryp.param instead of the HCD_QExactive default. Override with --fragmentation/--instrument. (This is the param mismatch behind PR Native Bruker timsTOF .d input (timstof feature) #45's weak timsTOF yield.)

  2. (medium) Release binaries couldn't open .d → fixed. The 4 native targets now build --features "thermo timstof", so prebuilt binaries read both .raw (bundled .NET 8) and .d (pure Rust). linux-arm64 stays plain (timsrust's zstd C dep complicates the cross build).

  3. (medium) Non-chimeric .raw/.d MS-level → fixed. The native .raw reader is now hardcoded to MS2 (so --ms-level 3 can't search MS3 / TMT SPS-MS3 reporter scans, and --ms-level 1 can't yield an empty run); a non-2 --ms-level on .raw/.d warns and is ignored. --ms-level still applies to mzML.

Also fixed the misleading MGF/.d ms-level warning. Default + --features timstof builds green; param-resolver tests pass. (The --features "thermo timstof" combo is what the release builds; CI's default-feature build doesn't exercise it — worth a tag smoke-test before release.)

ypriverol added 3 commits June 1, 2026 17:09
Moves the dev-internal design specs, the Java<->Rust parity-analysis investigation
notes, and the dated psm-gain roadmap out of the public repo (preserved locally in
the workspace). The repo's docs now carry only user-facing material: README.md, the
DOCS.md CLI/column reference manual, and docs/CLI_MIGRATION.md. Removed 42 files
(docs/parity-analysis/, docs/design/, docs/2026-05-28-psm-gain-state-and-roadmap.md);
neutralized the one DOCS.md link that pointed into parity-analysis. Follows the
earlier removal of the internal specs directory.
…current defaults

- §1: new Input-formats table (mzML/MGF/.raw/.d, build + runtime reqs);
  add --precursor-cal, --chimeric, --isolation-halfwidth; fix --charge-max
  default (2-5) and --ms-level MS2-only wording for native formats
- §3a: PIN header now 38 cols (charge2..charge5 default)
- §4: native Thermo .raw (vendor metadata) and Bruker .d (CID_TOF_Tryp) routing
- §5: feature-gated build commands for thermo/timstof
- §8c: input-format row reflects native .raw/.d
Adds docs/benchmarks/ with the 2026-06-01 comparison of Java MS-GF+, Sage,
MSFragger, and msgf-rust on vendor-native data, harmonized parameters, uniform
Percolator -Y FDR, results (PSMs + distinct peptides + speed + RAM), findings,
caveats, and the per-engine config files (Sage JSON, MSFragger params, mods).
Linked from README.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
DOCS.md (1)

345-345: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align Rust toolchain requirements; current docs are contradictory.

Line 345 says Rust 1.85+ (pinned 1.87.0), but Line 357 says Thermo builds require rustc >= 1.88. Please reconcile these into one authoritative minimum (global or feature-specific) and keep the build commands consistent with that requirement.

Also applies to: 357-359

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@DOCS.md` at line 345, DOCS.md contains conflicting Rust version requirements
(mentions "Rust 1.85+" and that rust-toolchain.toml pins 1.87.0, while the
Thermo section requires rustc >= 1.88); reconcile these by choosing a single
authoritative minimum Rust version (e.g., set global minimum to 1.88 if Thermo
needs it) and update all mentions to that version, update rust-toolchain.toml
pin to match the chosen minimum (or note feature-specific higher minimums for
Thermo), and ensure all build commands and the Thermo section reference the same
version string so the document is consistent.
🧹 Nitpick comments (1)
docs/benchmarks/configs/msfragger-timstof.params (1)

14-15: 💤 Low value

Consider tighter fragment tolerance for modern timsTOF instruments.

The fragment_mass_tolerance is set to 20 ppm, which is conservative for modern timsTOF instruments (many achieve <10 ppm fragment accuracy). While this may be intentional for broader benchmark comparability, tighter tolerances (e.g., 10-15 ppm) could improve discrimination on high-accuracy timsTOF data.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/benchmarks/configs/msfragger-timstof.params` around lines 14 - 15,
Update the fragment_mass_tolerance parameter (fragment_mass_tolerance) from the
current 20 ppm to a tighter value appropriate for modern timsTOF data (suggest
10–15 ppm); modify the value in the msfragger-timstof.params config so
benchmarks on high-accuracy timsTOF runs use the tighter tolerance while
optionally documenting the reason for the change in a nearby comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/benchmarks/configs/sage-astral.json`:
- Around line 18-19: The fragment ion tolerance ("fragment_tol") in the
sage-astral.json benchmark is set to ±20 ppm which is likely too permissive for
Thermo Orbitrap Astral searches; update the "fragment_tol" value to ~10 ppm
(e.g., 10) to match common practice or, if ±20 ppm is deliberate for this
benchmark, add a brief justification comment in the config/docs explaining why a
wider tolerance is used (refer to the "fragment_tol" key in
docs/benchmarks/configs/sage-astral.json).

---

Outside diff comments:
In `@DOCS.md`:
- Line 345: DOCS.md contains conflicting Rust version requirements (mentions
"Rust 1.85+" and that rust-toolchain.toml pins 1.87.0, while the Thermo section
requires rustc >= 1.88); reconcile these by choosing a single authoritative
minimum Rust version (e.g., set global minimum to 1.88 if Thermo needs it) and
update all mentions to that version, update rust-toolchain.toml pin to match the
chosen minimum (or note feature-specific higher minimums for Thermo), and ensure
all build commands and the Thermo section reference the same version string so
the document is consistent.

---

Nitpick comments:
In `@docs/benchmarks/configs/msfragger-timstof.params`:
- Around line 14-15: Update the fragment_mass_tolerance parameter
(fragment_mass_tolerance) from the current 20 ppm to a tighter value appropriate
for modern timsTOF data (suggest 10–15 ppm); modify the value in the
msfragger-timstof.params config so benchmarks on high-accuracy timsTOF runs use
the tighter tolerance while optionally documenting the reason for the change in
a nearby comment.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 13cf4fe1-5671-42c3-aba6-c1c742d20dea

📥 Commits

Reviewing files that changed from the base of the PR and between 37971fa and 059ed20.

⛔ Files ignored due to path filters (5)
  • docs/parity-analysis/notes/score-psm-trace-artifacts/java-trace-scan-16629.log.gz is excluded by !**/*.gz
  • docs/parity-analysis/notes/score-psm-trace-artifacts/java-trace-scan-23082.log.gz is excluded by !**/*.gz
  • docs/parity-analysis/notes/score-psm-trace-artifacts/java-trace-scan-23272.log.gz is excluded by !**/*.gz
  • docs/parity-analysis/notes/score-psm-trace-artifacts/java-trace-scan-34685.log.gz is excluded by !**/*.gz
  • docs/parity-analysis/notes/score-psm-trace-artifacts/java-trace-scan-41522.log.gz is excluded by !**/*.gz
📒 Files selected for processing (35)
  • .github/workflows/release.yml
  • DOCS.md
  • README.md
  • crates/msgf-rust/src/bin/msgf-rust.rs
  • docs/benchmarks/2026-06-01-4engine-native-format.md
  • docs/benchmarks/README.md
  • docs/benchmarks/configs/mods.txt
  • docs/benchmarks/configs/msfragger-astral.params
  • docs/benchmarks/configs/msfragger-timstof.params
  • docs/benchmarks/configs/sage-astral.json
  • docs/benchmarks/configs/sage-timstof.json
  • docs/parity-analysis/notes/2026-05-25-precursor-cal-ship-gates.md
  • docs/parity-analysis/notes/2026-05-25-spece-tail-exploration.md
  • docs/parity-analysis/notes/2026-05-26-score-psm-trace-findings.md
  • docs/parity-analysis/notes/2026-05-28-phase2-peak-rank-parity.md
  • docs/parity-analysis/notes/score-psm-trace-artifacts/aggregate-analysis.txt
  • docs/parity-analysis/notes/score-psm-trace-artifacts/analyze.py
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-16629.json
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-16629.txt
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-23082.json
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-23082.txt
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-23272.json
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-23272.txt
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-34685.json
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-34685.txt
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-41522.json
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-41522.txt
  • docs/parity-analysis/snapshots/cal-shifts-2026-05-25.json
  • docs/superpowers/plans/2026-05-26-i5-score-psm-trace-plan.md
  • docs/superpowers/plans/2026-05-26-quality-cleanup-plan.md
  • docs/superpowers/plans/2026-05-28-id-rate-pxd001819-tmt-plan.md
  • docs/superpowers/specs/2026-05-26-i5-score-psm-trace-design.md
  • docs/superpowers/specs/2026-05-26-pr-v1-design.md
  • docs/superpowers/specs/2026-05-26-quality-cleanup-design.md
  • docs/superpowers/specs/2026-05-28-id-rate-pxd001819-tmt-design.md
💤 Files with no reviewable changes (24)
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-23082.json
  • docs/superpowers/specs/2026-05-28-id-rate-pxd001819-tmt-design.md
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-16629.txt
  • docs/parity-analysis/notes/2026-05-26-score-psm-trace-findings.md
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-23272.json
  • docs/parity-analysis/notes/score-psm-trace-artifacts/aggregate-analysis.txt
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-16629.json
  • docs/superpowers/specs/2026-05-26-i5-score-psm-trace-design.md
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-41522.txt
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-34685.txt
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-41522.json
  • docs/parity-analysis/notes/2026-05-25-precursor-cal-ship-gates.md
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-23272.txt
  • docs/superpowers/specs/2026-05-26-pr-v1-design.md
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-23082.txt
  • docs/superpowers/plans/2026-05-26-quality-cleanup-plan.md
  • docs/superpowers/plans/2026-05-26-i5-score-psm-trace-plan.md
  • docs/parity-analysis/snapshots/cal-shifts-2026-05-25.json
  • docs/superpowers/plans/2026-05-28-id-rate-pxd001819-tmt-plan.md
  • docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-34685.json
  • docs/parity-analysis/notes/2026-05-28-phase2-peak-rank-parity.md
  • docs/parity-analysis/notes/2026-05-25-spece-tail-exploration.md
  • docs/parity-analysis/notes/score-psm-trace-artifacts/analyze.py
  • docs/superpowers/specs/2026-05-26-quality-cleanup-design.md
✅ Files skipped from review due to trivial changes (5)
  • docs/benchmarks/configs/mods.txt
  • docs/benchmarks/configs/msfragger-astral.params
  • docs/benchmarks/README.md
  • docs/benchmarks/2026-06-01-4engine-native-format.md
  • README.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/release.yml
  • crates/msgf-rust/src/bin/msgf-rust.rs

Comment on lines +18 to +19
"precursor_tol": { "ppm": [-10.0, 10.0] },
"fragment_tol": { "ppm": [-20.0, 20.0] },
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

What is the typical fragment ion tolerance in ppm for Thermo Fisher Astral Orbitrap MS2 data in proteomics searches?

💡 Result:

For proteomics database searches of data acquired on the Thermo Fisher Orbitrap Astral mass spectrometer, a fragment ion mass tolerance of 10 ppm is commonly used [1]. While instrument performance specifications indicate that the Astral analyzer achieves mass accuracy with an RMS drift of less than 5 ppm over 24 hours under external calibration [2][3], practical proteomics search parameters often utilize a slightly wider window—frequently 10 ppm—to ensure high sensitivity and robust identification across varied experimental conditions [1]. Selection of the specific tolerance may also depend on the software, search engine, and whether the data is being integrated or quantified, with some workflows or specific software modules recommending adjustments to account for factors such as space charging or peak apex alignment [1]. Researchers should always verify optimal parameters for their specific sample complexity and software pipeline, as practices can evolve with new instrument versions and algorithmic improvements [4].

Citations:


Revisit fragment_tol for Astral MS2 search window

In docs/benchmarks/configs/sage-astral.json (lines 18-19), fragment_tol is set to ±20 ppm; for Thermo Orbitrap Astral proteomics database searches, ~10 ppm is commonly used for fragment ion tolerance, so ±20 ppm may be overly permissive unless the benchmark has a documented reason—either tighten to ~10 ppm or add justification.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/benchmarks/configs/sage-astral.json` around lines 18 - 19, The fragment
ion tolerance ("fragment_tol") in the sage-astral.json benchmark is set to ±20
ppm which is likely too permissive for Thermo Orbitrap Astral searches; update
the "fragment_tol" value to ~10 ppm (e.g., 10) to match common practice or, if
±20 ppm is deliberate for this benchmark, add a brief justification comment in
the config/docs explaining why a wider tolerance is used (refer to the
"fragment_tol" key in docs/benchmarks/configs/sage-astral.json).

ypriverol added 2 commits June 2, 2026 07:50
CodeRabbit PR #46 review:
- input/timstof.rs: pair mz/intensity over the overlapping prefix only (no
  read past a malformed length mismatch) and drop non-finite / non-positive-m/z
  / negative-intensity peaks before they reach the scorer. Adds two regression
  tests. Mirrors the thermo.rs peak guards.
- search/match_engine.rs: remove the dead per-PSM MS1 isotope-feature block
  (CASCADE_SKIP_MS1_FEATURE was hardcoded true so it never ran) plus its now
  orphaned const/import/local. precursor_isotope_kl/snr stay at 0.0 defaults
  (PIN schema unchanged); Pass 2 co-isolation still uses MS1 via coisolation.rs.
  Behaviour is bit-identical (dead code path).
ProteoBench Astral .raw + HYE FASTA (proteobench.cubimed.rub.de), PRIDE
PXD072598 timsTOF .d (ftp.pride.ebi.ac.uk), and the UniProt reviewed
Human+Yeast proteomes used for the timsTOF search.
@ypriverol ypriverol merged commit ab9fac4 into master Jun 2, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants