Merge feature/predictor-finetune: native Chronologer RT adapter + per-task fine-tuning by theGreatHerrLebert · Pull Request #397 · theGreatHerrLebert/rustims

theGreatHerrLebert · 2026-05-19T15:39:09Z

Brings feature/predictor-finetune onto main so the native Chronologer RT adapter (imspy_predictors/rt/chronologer.py) and the per-task RT/CCS/intensity fine-tune work land alongside the calibrate_nce NCE calibration already on main.

Why: feature/predictor-finetune branched from 488a8305 (before PR #395), so no branch currently has both Chronologer and calibrate_nce. sagepy-rescore pins imspy-predictors at a branch == main, so it cannot reach Chronologer until this lands.

Conflict resolution: intensity/predictors.py auto-merged cleanly -- calibrate_nce is preserved. The only conflict was a one-hunk cosmetic difference in ccs/predictors.py (IM fine-tune verbose-logging cadence); took the feature-branch version (epoch % 5, im label) which matches the RT/intensity report cadence. Fine-tune history tracking is present on both sides and merged cleanly.

After merge: repin sagepy-rescore pyproject to main and reinstall.

…tims_read_pasef_msms_for_frame_v2

Adds Rust wrappers around two previously-unexposed Bruker SDK functions:

* tims_extract_centroided_spectrum_for_frame_v2(handle, frame_id, scan_lo, scan_hi, callback, user_data): returns Bruker's built-in centroided peak list (m/z, intensity) for one (frame, scan range) tile. This is the function DiaTracer uses to start with peaks rather than raw events.
* tims_read_pasef_msms_for_frame_v2: DDA-PASEF per-frame fragment reader; included for future DDA work, not used by the DIA dump.

Signatures recovered from gtluu/pyTDFSDK init_tdf_sdk.py + the MSMS_SPECTRUM_FUNCTOR / MSMS_SPECTRUM_FUNCTION callback typedefs from ctypes_data_structures.py. Callback marshalling is done through a process-level Mutex + Option<Vec<...>> trampoline since Rust closures don't compose with libloading symbols + C function pointers.

Adds rustdf/examples/dump_bruker_centroids.rs, a one-shot CLI that reads dia_ms_ms_windows + dia_ms_ms_info from analysis.tdf, then extracts a centroided spectrum per (MS2 frame, DIA quad scan-range) and writes a TSV with the peak arrays.

Smoke test on 10 MS2 frames of O240206:

* 26 extract calls
* 838 peaks per call (mean)
* 324 calls/s → full-file estimate (29k calls) lands ~90 seconds vs ~50 minutes for the current event-clustering pseudo-assembly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 1.0 of the calibrated-library pipeline: take q ≤ 0.01 PSMs from a spectrum-centric DIA-PASEF rescore, fine-tune the pretrained RT predictor (UnifiedPeptideModel via DeepChromatographyApex) to predict OBSERVED RT in minutes for THIS run, and measure RT MAE on a peptide-level holdout (20% of unique sequences, seeded).

Smoke test on real O240206 (26,855 anchor PSMs, 21,484 train / 5,371 holdout, RTX 5090, ~30s wall):

  baseline (linearly projected pretrained): 1.535 min MAE
  fine-tuned:                               0.425 min MAE
  delta:                                    -1.11 min (-72%)

Held-out peptides only: sequences the predictor never saw during fine-tuning. Shows the calibration hypothesis works: q ≤ 0.01 PSMs from a single run are usable training data for a per-run RT model that generalises within the same instrument run.

Implementation notes:
- Joins rescored_canonical.csv (observed `rt`) with rescored_canonical.tdc.csv (q_value + decoy) on (spec_idx, match_idx). Aggregates per peptide via mean observed RT.
- Bypasses the legacy DeepChromatographyApex.fine_tune_model(): that method calls bare `self.model(tokens)`, which on UnifiedPeptideModel returns a dict, breaking l1_loss. The script's custom_finetune_loop() calls model.predict_rt(tokens), which extracts the 'rt' tensor.
- Supervision target is observed `rt` in minutes (not retention_time_projected, which is sage's projection into ITS predictor space, a different scale than imspy_predictors' RT model).
- Pre-fine-tune baseline projects pretrained predictions linearly onto observed RT via least-squares so MAE measures prediction quality rather than the scale offset between the two model output spaces.

Output: rt_finetuned.pt (state_dict + metrics + args + pretrained state_dict for diff) + metrics.json.

Next: Phase 1.1 expands to multi-task (RT + CCS + intensity) on UnifiedPeptideModel, single state_dict per run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
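
The linearly projected baseline from the implementation notes can be illustrated with a small dependency-free sketch. `fit_linear_projection` and `mae` are hypothetical helper names (not functions from the script), the numbers are made up, and whether the real script fits the projection on the train or the holdout split is not stated here, so the sketch simply fits and evaluates on one toy set.

```python
from statistics import mean


def fit_linear_projection(pred, obs):
    """Ordinary least-squares fit of obs ~= slope * pred + intercept."""
    mp, mo = mean(pred), mean(obs)
    var = sum((p - mp) ** 2 for p in pred)
    if var == 0:
        raise ValueError("predictions are constant; slope is undefined")
    slope = sum((p - mp) * (o - mo) for p, o in zip(pred, obs)) / var
    return slope, mo - slope * mp


def mae(pred, obs):
    """Mean absolute error between two equally long sequences."""
    return mean(abs(p - o) for p, o in zip(pred, obs))


# Toy data: pretrained predictions on an arbitrary scale, observed RT in minutes.
pretrained = [0.10, 0.25, 0.40, 0.55, 0.70]
observed = [12.0, 27.5, 41.0, 57.0, 71.5]

slope, intercept = fit_linear_projection(pretrained, observed)
projected = [slope * p + intercept for p in pretrained]
baseline_mae = mae(projected, observed)  # error left after removing the scale offset
```

Projecting first means the baseline MAE reflects how well the pretrained model orders and spaces peptides, not the mismatch between its output scale and minutes.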

Joint fine-tune of RT + CCS heads on UnifiedPeptideModel using one DIA-PASEF run's q ≤ 0.01 PSMs. Same encoder is shared so a single training pass calibrates both heads against this run's anchor set.

Smoke test on real O240206 (26,855 anchor peptides, 21,484 train / 5,371 holdout, RTX 5090, ~30 s wall, 30 epochs):

              baseline   fine-tuned   delta
  RT (min)    1.535      0.468        -69%    ✓
  CCS (Å²)    31.22      30.14        -3.5%   marginal
  1/K0*       0.0655     0.0708       +8%     (worse)

  *predicted CCS converted via Mason-Schamp (ccs_to_one_over_k0).

Diagnosis: RT calibrates cleanly, matching the Phase 1.0 standalone result. The CCS head's physics-informed SquareRootProjectionLayer encodes a strong inductive bias for CCS in Å²; fine-tune on observed CCS makes ~3% headway in 30 epochs, but the predicted-CCS → 1/K0 round-trip gets slightly worse than just linearly projecting the pretrained CCS onto observed CCS.

Net read: linear projection of pretrained CCS is already a strong baseline for library use; meaningful CCS fine-tune would need either much longer training (~100s of epochs) or a 1/K0-native head that drops the SquareRootProjection physics layer.

Implementation notes:
- Side-loads observed 1/K0 from <stem>.pseudo.bin's env_apex_scan + TimsDataset.scan_to_inverse_mobility (one-frame LUT). The rescore CSV's ims/predicted_ims columns are zero because sage isn't passed inverse_ion_mobility from build_query in rescore_canonical.py. Right long-term fix is upstream (1-line plumbing in rescore_canonical.py); side-load avoids re-running rescore.
- Supervises CCS in Å² (head's native scale); converts predicted CCS → 1/K0 via imspy_core.chemistry.mobility.ccs_to_one_over_k0_par for library-relevant reporting.
- Saves both pretrained + fine-tuned state_dicts plus the linear projection params (rt_projection, ccs_projection) to the .pt. Library-time inference: load checkpoint, predict, optionally apply the projection (esp. for CCS where the fine-tune is marginal).

Phase 1.1 result: RT calibration is publication-grade. CCS + linear projection is sufficient for library generation; chasing the extra 3-5% from a real CCS fine-tune is not worth the architecture work now. Pivoting to Phase 1.2 (fragment intensity), which is where the biggest peptide-centric search lift comes from.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
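
For readers unfamiliar with the CCS ↔ 1/K0 round-trip mentioned above, here is a standalone sketch of the Mason-Schamp conversion. It is not the imspy_core implementation: the function names are hypothetical, and the constant, drift-gas mass (N2), temperature, and the use of m/z × charge as the ion mass are assumptions based on values commonly used for timsTOF data.

```python
import math

# Assumed constants (commonly used for timsTOF data; not taken from this PR).
SUMMARY_CONSTANT = 18509.8632163405
MASS_GAS_N2 = 28.013      # Da, nitrogen drift gas
TEMPERATURE_K = 31.85 + 273.15


def _reduced_mass(mz, charge, mass_gas=MASS_GAS_N2):
    """Reduced mass of the ion / drift-gas pair, approximating ion mass as mz * charge."""
    mass = mz * charge
    return mass * mass_gas / (mass + mass_gas)


def mason_schamp_ccs(one_over_k0, mz, charge):
    """Convert inverse reduced mobility (V*s/cm^2) to CCS (A^2)."""
    scale = math.sqrt(_reduced_mass(mz, charge) * TEMPERATURE_K)
    return SUMMARY_CONSTANT * charge * one_over_k0 / scale


def mason_schamp_one_over_k0(ccs, mz, charge):
    """Inverse of mason_schamp_ccs: CCS (A^2) back to 1/K0."""
    scale = math.sqrt(_reduced_mass(mz, charge) * TEMPERATURE_K)
    return ccs * scale / (SUMMARY_CONSTANT * charge)


# Round trip for a doubly charged precursor at m/z 500.
ccs = mason_schamp_ccs(0.9, mz=500.0, charge=2)
back = mason_schamp_one_over_k0(ccs, mz=500.0, charge=2)
```

Because the conversion is linear in 1/K0 for a fixed (m/z, charge), a relative error in CCS maps to the same relative error in 1/K0 for that precursor; the two MAE columns can still diverge across a dataset because the per-precursor scale factor depends on charge and mass.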

…lti-task fix

Three new scripts for the per-run calibration pipeline (Phase 1.1' / 1.2), plus a data-prep bug fix on the existing multi-task script.

NEW finetune_dia_pasef_ccs.py (Phase 1.1' standalone CCS diagnostic)
- Side-loads observed 1/K0 from .pseudo.bin's env_apex_scan + TimsDataset.scan_to_inverse_mobility (one-frame LUT)
- Converts 1/K0 → CCS Å² via Mason-Schamp (one_over_k0_to_ccs_par): supervises the CCS head in its native output scale, then converts predicted CCS → 1/K0 at inference for library use.
- Aggregates per-(peptide, charge): CCS depends on charge, so peptides observed at z=2 and z=3 carry distinct CCS values; mixing them via per-sequence aggregation gives meaningless supervision.
- Smoke test on real O240206 (26,676 anchor pairs, 21,341 train / 5,335 hold, RTX 5090, 45 epochs, ~52s wall):
    CCS Å²  baseline+proj 13.24  → fine-tuned 6.85    −48.3%
    1/K0    baseline+proj 0.0269 → fine-tuned 0.0139  −48.4%
    z=2: 8.65 → 4.41 Å² MAE (publication-grade per-instrument calibration)
    z=3: 24.34 → 12.74 Å² MAE
- SquareRootProjectionLayer slopes/intercepts moved meaningfully (z=2 slope 12.96→10.58, z=3 slope 15.61→16.17), which confirms the physics-prior is trainable when targets are correct.

NEW finetune_dia_pasef_intensity.py (Phase 1.2, fragment intensity)
- Reads the per-PSM rescored_canonical.fragments.parquet (sage's annotate_matches dump, persisted by patched rescore_canonical.py)
- Encodes each PSM's fragments → canonical Prosit 174-dim layout (29 positions × {y,b} × {+1,+2,+3}) via intensity_target_encoder
- Fine-tunes UnifiedPeptideModel intensity head w/ masked_spectral_distance
- PSM-level training units (NOT aggregated): fragmentation depends on charge + CE; sequence-level aggregation would conflate them
- Holdout by sequence (no leak), eval via spectral angle similarity
- Smoke test on real O240206 (26,676 PSMs, 20 epochs, ~26s wall):
    Spectral angle  baseline 0.3797 → fine-tuned 0.6199  +63%
    median          0.3736 → 0.6482
- End-to-end re-rescore with fine-tuned weights (set INTENSITY_WEIGHTS_PATH env var on rescore_canonical.py): 26,676 → 27,773 peptides @ 1% FDR (+1,097, +4.1%), decoy fraction 0.0099 unchanged.

NEW intensity_target_encoder.py
- Library + CLI sanity-check entry point. Encoder follows the verified mapping (verified by physics on real data, see history):
    sagepy observed_fragments_map() key = (ion_int, frag_charge, ordinal)
    ion_int 0 → b, ion_int 1 → y
  The user explicitly flagged this class of bug as the "ugly hotspot" where a silent swap survives undetected through the loss curve.
- Sanity checks (s1/s2/s3) all PASSED on the real parquet:
    s1: round-trip encode→decode matches input set on 5 PSMs
    s2: distribution of fill rate per (ord, ion, charge) follows biology (peak at ord 3-5, tail to ord 25, 0% at ord 29)
    s3: spot check on short/mid/long peptides: max emitted ordinal equals peptide L-1 (no fragments past peptide length)

MODIFIED finetune_dia_pasef_multi.py (bug fix + reframe as DIAGNOSTIC)
- Per-(peptide, charge) aggregation instead of per-sequence (CCS bug)
- Holdout split by unique sequence (not row index) to avoid same peptide leaking across train/test at different charges
- Docstring rewritten to flag the ARCHITECTURAL LIMITATION discovered here: off-the-shelf checkpoints (rt/, ccs/, intensity/) are each pre-trained standalone, with their own encoder weights paired with their own head. Loading from rt/best_model.pt with tasks=['rt','ccs'] gets a fresh-init CCS head (encoder + RT head only loaded). Linear projection then fits a random head's output to observed CCS for a ~30 Å² baseline, vs the 13 Å² baseline the standalone CCS script gets from ccs/best_model.pt where encoder + CCS head are correctly paired.
- Production calibration path is the 3 per-task scripts; multi-task remains for future research (joint-pretrained encoder).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
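
The spectral angle used for evaluation above can be sketched as follows. This uses the common Prosit-style definition, SA = 1 − 2·arccos(cos θ)/π, and treats negative target entries as masked slots; both are assumptions about what the script computes, and `masked_spectral_angle` is a hypothetical name.

```python
import math


def masked_spectral_angle(target, pred, eps=1e-12):
    """Spectral angle in [0, 1] over slots whose target is not masked (< 0)."""
    pairs = [(t, p) for t, p in zip(target, pred) if t >= 0.0]
    dot = sum(t * p for t, p in pairs)
    norm_t = math.sqrt(sum(t * t for t, _ in pairs))
    norm_p = math.sqrt(sum(p * p for _, p in pairs))
    if norm_t < eps or norm_p < eps:
        return 0.0  # no usable signal on one side
    cos = max(-1.0, min(1.0, dot / (norm_t * norm_p)))  # clamp rounding error
    return 1.0 - 2.0 * math.acos(cos) / math.pi


# The masked slot (target -1.0) is ignored, so the wild prediction there is harmless.
same = masked_spectral_angle([0.5, 1.0, -1.0, 0.0], [0.5, 1.0, 9.9, 0.0])
orthogonal = masked_spectral_angle([1.0, 0.0], [0.0, 1.0])
```

A value of 1 means the masked vectors point in the same direction and 0 means they are orthogonal, which is why the metric is insensitive to overall intensity scale.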

Phase A (joint_init_baseline.py): does merging the 3 pretrained heads into one UnifiedPeptideModel regress per-task baselines? Empirical test across 3 base-encoder choices on real O240206 holdout:

  base        RT MAE Δ%   CCS Å² Δ%   Intensity SA Δ%
  intensity   +133.1%     -1.9%       -0.1%
  rt          +1.9%       +2.0%       -0.2%   ← winner
  ccs         +114.5%     +0.0%       -0.0%

RT is the picky one: only the rt-paired encoder gives a non-degraded RT baseline. CCS (anchored by SquareRootProjectionLayer physics prior) and intensity (anchored by charge + collision-energy inputs) are nearly encoder-insensitive. Joint init from rt/best_model.pt + swapped CCS & intensity heads → all three baselines within 2% of standalone. Joint IS viable.

Phase B (finetune_dia_pasef_joint.py): 5-flavor fine-tune ablation on the joint init. None match per-task on all three metrics:

  flavor                                RT min   CCS Å²   1/K0     SA
  Per-task reference                    0.43     6.85     0.0139   0.620
  B1 heads-only encoder-frozen          0.971    11.20    0.0233   0.5448
  B2 joint static weights               0.427    7.54     0.0153   0.5360
  B3 split LR (heads/encoder 100×)      0.537    10.12    0.0207   0.5515
  B4 uncertainty weighting (Kendall)    0.496    6.83     0.0138   0.5784
  B5 sequential intensity → joint       0.582    9.10     0.0185   0.5610

B4 is the best joint flavor: it matches per-task CCS (6.83 vs 6.85) and 1/K0 (0.0138 vs 0.0139), but loses RT by 15% and SA by 7%. SA is the metric most directly tied to library-free search performance (intensity features drive sage's discriminator), so a 7% gap projects to ~30-50% loss of the +1,097 peptide gain we measured from intensity fine-tune.

Why intensity always loses in joint: the 174-dim convolutional decoder is the most encoder-hungry head; joint training distributes encoder capacity across three tasks, whereas per-task training gets the encoder fully devoted to fragments.

Decision: stay with three per-task scripts as production. The per-task deployment cost (3 checkpoints, 3 forward passes at library-gen time) is small vs the SA hit a joint model would incur.

B4 stays as the best diagnostic flavor for future revisits (e.g., when calibration data is scarce, when more tasks are added, or when a unified pre-trained encoder lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
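
The B4 flavor refers to Kendall-style uncertainty weighting, which is usually written with one learnable log-variance s_i per task: L = Σ_i ½·exp(−s_i)·L_i + ½·s_i. The exact form used in finetune_dia_pasef_joint.py is not shown here, so the sketch below is an assumption; it holds the task losses fixed (illustrative values, not measurements) and only optimises the log-variances to show where the weights settle.

```python
import math


def uncertainty_weighted_loss(task_losses, log_vars):
    """Total loss: sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i."""
    return sum(0.5 * math.exp(-s) * l + 0.5 * s
               for l, s in zip(task_losses, log_vars))


def log_var_gradients(task_losses, log_vars):
    """d(total)/d(s_i) = 0.5 - 0.5 * exp(-s_i) * L_i."""
    return [0.5 - 0.5 * math.exp(-s) * l for l, s in zip(task_losses, log_vars)]


# Illustrative per-task losses on very different scales (rt, ccs, intensity).
task_losses = [0.50, 7.0, 0.40]
log_vars = [0.0, 0.0, 0.0]

# Plain gradient descent on the log-variances only.
for _ in range(2000):
    grads = log_var_gradients(task_losses, log_vars)
    log_vars = [s - 0.1 * g for s, g in zip(log_vars, grads)]

# Effective weight applied to each task loss after convergence.
weights = [0.5 * math.exp(-s) for s in log_vars]
```

With the losses held fixed, each s_i converges to log(L_i), so every task ends up with the same weighted contribution regardless of its raw scale. In real training the task losses move as well, which is what lets the scheme rebalance heads over time.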

Adds ``Chronologer`` as a drop-in RT predictor alongside ``DeepChromatographyApex``. Roughly 3× tighter median residual on timsTOF anchor PSMs (~8 s vs ~25 s on real-o240206) when fine-tuned on q ≤ 0.01 spec-centric hits.

Implementation reproduces the upstream Searle Lab residual-CNN architecture (Apache-2.0; Pino LK et al., U. Wisconsin-Madison) inline, with no upstream package dependency. We avoid vendoring weights; ``Chronologer.from_checkpoint`` loads a base ``.pt`` that the user supplies. The full attribution + citation block is at the top of chronologer.py.

API mirrors DeepChromatographyApex where it can:

* ``Chronologer.from_base(checkpoint_path, scale_init=0.79, bias_init=0.69)``
* ``predictor.predict(seqs, batch_size=4096) → np.ndarray`` (minutes)
* ``predictor.fine_tune(df, epochs=50, lr=1e-4, patience=8, val_frac=0.2)``
* ``predictor.fit_kde_correction(n_bins=2000)``: optional KDE post-correction that prefers upstream ``chronologer.chronologer_utils.kde_alignment.KDE_align`` when the import is available, and falls back to a local KDE otherwise.
* ``predictor.fine_tune_psms(psm_collection, q_threshold=0.01)``: the API sagepy-rescore calls.
* ``predictor.save_checkpoint(path)``: saves model state_dict + KDE params so library generation can load via build_library's --chrono-checkpoint flag.

Tokenizer accepts both ``[UNIMOD:n]`` and ``(UniMod:n)`` mod brackets via ``unimod_to_chronologer`` (also exported), aligning with the DiaNN probe canonicalization rule from project_diann_probe_stage0_2026_05_09.
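
Accepting both bracket styles amounts to canonicalising the modification tags before tokenisation. The sketch below is not ``unimod_to_chronologer`` (whose token mapping is not shown in this message); ``canonicalize_unimod`` and ``strip_mods`` are hypothetical helpers that only illustrate the bracket handling.

```python
import re

# Matches either [UNIMOD:n] or (UniMod:n), case-insensitively; brackets must pair up.
_MOD_PATTERN = re.compile(r"\[unimod:(\d+)\]|\(unimod:(\d+)\)", re.IGNORECASE)


def canonicalize_unimod(sequence: str) -> str:
    """Rewrite every modification tag to the [UNIMOD:n] form."""
    return _MOD_PATTERN.sub(
        lambda m: f"[UNIMOD:{m.group(1) or m.group(2)}]", sequence
    )


def strip_mods(sequence: str) -> str:
    """Drop modification tags, leaving the bare amino-acid sequence."""
    return _MOD_PATTERN.sub("", sequence)


mixed = "AC(UniMod:4)DM[UNIMOD:35]K"
canonical = canonicalize_unimod(mixed)
bare = strip_mods(mixed)
```

Canonicalising up front keeps group keys such as the modified sequence identical regardless of which search engine emitted the PSM.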

Brings ``DeepPeptideIntensityPredictor`` to API-parity with the RT and IM heads. sagepy-rescore.predict_and_finetune was already calling fine_tune_psms on intensity; without this commit it falls through to a NotImplementedError on the rustims main tree.

Three pieces:

* ``observed_fragments_to_intensity_target(sequence, charge, fragments)`` builds a 174-vec Prosit target from a sagepy ``Fragments`` view of the observed ions. Layout is ordinal-major ([y1+1, y1+2, y1+3, b1+1, b1+2, b1+3, y2+1, ...]) to match the model's native output. Impossible slots (frag > seq_len-1, frag_charge > prec_charge) are marked -1.0 so masked_spectral_distance ignores them; valid but unmatched slots stay 0.0 so the model learns the presence/absence pattern, not just the matched magnitudes. Intensities are normalized by per-spectrum max so the loss is scale-invariant.
* ``_ion_to_text`` is a small shim around sagepy's IonType enum so it works whether the binding emits ``"b"``, ``"y"``, ``"IonType(B)"``, or the str repr.
* ``DeepPeptideIntensityPredictor.fine_tune_model(data, batch_size=64, epochs=50, learning_rate=1e-4, patience=5)`` matches the training loop from scripts/finetune_timstof.py: AdamW + GradScaler + grad-clip 1.0 + masked_spectral_distance. ``DeepPeptideIntensityPredictor.fine_tune_psms(psm_collection, q_threshold=0.01, ...)`` filters to rank-1 q≤threshold targets and dispatches to fine_tune_model. State is held on ``self._finetune_history`` so the sagepy-rescore HTML report can plot it.

Validated on real-o240206 fc3-imcons0_5 (412k spectra, 1.76M sage PSMs): full FT pushes mokapot peptide yield from 30,947 (baseline FT) → 33,027 (+14.4% vs rescore_canonical 28,857 baseline) when paired with xgboost mokapot model.
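
The masking and normalisation rules for the target vector can be sketched as below, using the ordinal-major layout exactly as this commit message describes it. ``build_target`` and the plain-tuple fragment representation are hypothetical simplifications; the real function takes a sagepy ``Fragments`` view.

```python
N_ORDINALS, N_SLOTS = 29, 174


def ordinal_major_slot(ion, ordinal, charge):
    """[y1+1, y1+2, y1+3, b1+1, b1+2, b1+3, y2+1, ...] as described above."""
    return (ordinal - 1) * 6 + (0 if ion == "y" else 3) + (charge - 1)


def build_target(seq_len, precursor_charge, fragments):
    """fragments: list of (ion, ordinal, frag_charge, intensity) tuples."""
    target = [0.0] * N_SLOTS  # valid-but-unmatched slots stay at 0.0
    for ion in ("y", "b"):
        for ordinal in range(1, N_ORDINALS + 1):
            for charge in (1, 2, 3):
                # Impossible: fragment longer than the peptide allows, or
                # fragment charge above the precursor charge.
                if ordinal > seq_len - 1 or charge > precursor_charge:
                    target[ordinal_major_slot(ion, ordinal, charge)] = -1.0
    max_intensity = max((f[3] for f in fragments), default=0.0)
    for ion, ordinal, charge, intensity in fragments:
        slot = ordinal_major_slot(ion, ordinal, charge)
        if max_intensity > 0 and target[slot] >= 0.0:
            target[slot] = intensity / max_intensity  # per-spectrum max norm
    return target


# 8-residue peptide, precursor charge 2, two matched fragments.
target = build_target(8, 2, [("y", 1, 1, 200.0), ("b", 2, 1, 100.0)])
```

For this example only 7 ordinals × 2 ion types × 2 charges = 28 slots are valid; the remaining 146 carry the -1.0 mask.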

DeepPeptideIonMobilityApex.fine_tune_model now records per-epoch {epochs, train_loss, val_loss} on self._finetune_history. sagepy-rescore's report reads this to plot per-head convergence curves alongside RT and intensity, so all three predictor heads now expose the same telemetry channel.

Other small tweaks:

* divide accumulated train_loss by num batches before recording (previously the per-epoch print/value was an un-normalized sum that grew with batch count)
* drop the print-every-10 cadence to every-5 to match the RT and intensity heads; keeps log-parser regexes aligned across all three heads in generate_report_post_hoc.py

Critical layout bug in observed_fragments_to_intensity_target, the new label builder used by fine_tune_psms. The function was writing observed fragment intensities at ordinal-major slots:

  slot = (ordinal-1)*6 + (charge-1 if y else 3+charge-1)

so e.g. b1+1 went to slot 3, y2+1 went to slot 6, y1+2 went to slot 1.

But the canonical Prosit/imspy 174-vec layout (from both imspy_simulation.utility.flatten_prosit_array and Rust's rustdf::sim::utility::reshape_prosit_array) is charge-major, y-before-b inside each charge block:

  [0:29]    y+1
  [29:58]   b+1
  [58:87]   y+2
  [87:116]  b+2
  [116:145] y+3
  [145:174] b+3

so the correct slot is

  slot = (charge-1)*58 + (0 if y else 29) + (ordinal-1)

Effect: FT trained the model on labels at the wrong slots. Loss still decreased because the scrambling was consistent across labels, so mokapot's per-PSM cosine features (which compare the FT'd-and-scrambled prediction against the same-scrambled observed) still went up. This is why v3 peptide yield (33,027) lifted +14.4% over rescore_canonical despite the bug. Mokapot features are scramble-invariant within a PSM.

But the LIBRARY uses the canonical (correct) decoder to map the model's 174-vec to per-fragment intensities. So the library's predicted per-fragment intensities are gibberish.

Smoking gun (3-way audit on v3 anchors, n=588):

  extracted ↔ sage_observed:  median spearman +0.71  ✓
  extracted ↔ predicted:      median spearman -0.12  ✗
  sage_obs  ↔ predicted:      median spearman -0.21  ✗

Pep-centric extraction is sound (matches sage). Predictor is mis-calibrated *for downstream library use* because of the scrambled FT.

Fix is the slot formula. Mask handling rewritten to mark impossible (ordinal > seq_len-1) and impossible (frag_charge > precursor_charge) slots in the charge-major layout.

After this commit:

* v3 intensity_finetuned.pt is invalid; must re-FT.
* Any library built with that ckpt has wrong per-fragment intensities; rebuild required.
* Spec-centric peptide yield numbers from v3 stack still stand (mokapot features are scramble-invariant), but any library-assisted pipeline run on the v3 library is untrusted until rebuild.
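
The scramble-invariance argument in this message can be checked with a small sketch: applying the same slot permutation to both vectors leaves their cosine unchanged, while decoding only one of them with the other layout does not. The two slot formulas are copied from the message; ``relayout``, ``cosine``, and the toy vectors are hypothetical, and the sketch takes no position on which layout the model natively emits.

```python
import math


def ordinal_major_slot(ion, ordinal, charge):
    return (ordinal - 1) * 6 + (0 if ion == "y" else 3) + (charge - 1)


def charge_major_slot(ion, ordinal, charge):
    return (charge - 1) * 58 + (0 if ion == "y" else 29) + (ordinal - 1)


KEYS = [(ion, o, z) for ion in "yb" for o in range(1, 30) for z in (1, 2, 3)]


def relayout(vec_charge_major):
    """Permute a charge-major 174-vector into the ordinal-major layout."""
    out = [0.0] * 174
    for ion, o, z in KEYS:
        out[ordinal_major_slot(ion, o, z)] = vec_charge_major[charge_major_slot(ion, o, z)]
    return out


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Deterministic toy vectors standing in for observed / predicted intensities.
observed = [((i * 37) % 11) / 10 for i in range(174)]
predicted = [((i * 53) % 7) / 6 for i in range(174)]

consistent = cosine(relayout(observed), relayout(predicted))   # both scrambled
mismatched = cosine(observed, relayout(predicted))             # only one scrambled
reference = cosine(observed, predicted)
```

This is why a per-PSM similarity feature cannot detect a consistent layout swap, whereas anything that decodes the vector slot by slot (such as a spectral library) can.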

…e-major)" This reverts commit 8843d1d.

All three predictor heads (Chronologer RT, intensity, IM) split train/val by random PSM index. Each peptide appears at many spectra (107k PSMs ↔ ~33k unique modseqs in our test set), so a PSM-level split puts the same modseq in both folds. The predictors are deterministic per input (RT: seq→time; intensity: (seq,charge)→174vec; IM: (seq,charge)→1/K0), so val loss collapses to the instrument noise floor (~10s RT, ~3% intensity, ~0.01 1/K0). That's NOT generalization, it's memorization being measured.

Fix: group-aware split.

* RT (chronologer.py): group key = sequence_modified
* Intensity (predictors.py): group key = (sequence_modified, charge)
* IM (ccs/predictors.py): group key = (sequence_modified, charge)

All PSMs of a given group go to the same fold via:

    uniq, inv = np.unique(group_keys, return_inverse=True)
    perm_groups = rng_np.permutation(n_groups)
    val_groups = set(perm_groups[:n_val_groups])
    mask_val = [g in val_groups for g in inv]

Spec-centric mokapot yields are unaffected (the rescore features use the model's prediction vs sage's matched fragments per-PSM, so memorized fit and generalized fit give identical features). But the LIBRARY built with these predictors covers the full FASTA digest (mostly unseen sequences), so library quality depends on true generalization, which we've been measuring incorrectly.

Today's chronologer FT showed val_L1 = 10s, which we now expect to be 2-4× looser on a held-out peptide set. The model is still good, just not as good as the curves suggested.

After commit: rebuild venv editable + restart full recovery pipeline with fixed split.
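
A dependency-free sketch of the same group-aware split, for readers who want to try it without NumPy. ``group_split`` is a hypothetical helper, not the code in the predictors; it assumes hashable, sortable group keys such as (sequence_modified, charge) tuples.

```python
import random


def group_split(group_keys, val_frac=0.2, seed=0):
    """Return (train_idx, val_idx) so that no group key appears in both folds."""
    groups = sorted(set(group_keys))        # sorted for a reproducible shuffle
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_val = max(1, round(len(groups) * val_frac))
    val_groups = set(groups[:n_val])
    train_idx = [i for i, g in enumerate(group_keys) if g not in val_groups]
    val_idx = [i for i, g in enumerate(group_keys) if g in val_groups]
    return train_idx, val_idx


# Toy PSM table: several PSMs per (modified sequence, charge) group.
keys = (
    [("PEPTIDEK", 2)] * 3
    + [("PEPTIDEK", 3)]
    + [("AC[UNIMOD:4]DEFK", 2)] * 2
    + [("LVNELTEFAK", 2)] * 4
    + [("YLYEIAR", 2)] * 2
)
train_idx, val_idx = group_split(keys, val_frac=0.2, seed=42)
```

Keying on (sequence, charge) still lets the same sequence appear in both folds at different charges; keying on the sequence alone, as the RT head does, is the stricter choice when that matters.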

Brings the native Chronologer RT adapter (imspy_predictors/rt/chronologer.py) and the per-task RT/CCS/intensity fine-tune work onto main, alongside the calibrate_nce NCE calibration already on main. intensity/predictors.py auto-merged (calibrate_nce preserved). The only conflict was in ccs/predictors.py -- the IM fine-tune verbose-logging cadence; took the feature-branch version (epoch % 5, 'im' label) which matches the RT/intensity report cadence. Fine-tune history tracking is present on both sides and merged cleanly.

theGreatHerrLebert and others added 12 commits April 27, 2026 16:53

Revert "intensity FT: fix target-vector layout (ordinal-major → charg…

a481fff


theGreatHerrLebert merged commit a6008d0 into main May 19, 2026
2 checks passed

theGreatHerrLebert mentioned this pull request May 20, 2026

imspy_predictors RT: default to Chronologer #398

Merged

3 tasks

theGreatHerrLebert merged 12 commits into main from merge/predictor-finetune
