
Merge feature/predictor-finetune: native Chronologer RT adapter + per-task fine-tuning #397

Merged
theGreatHerrLebert merged 12 commits into main from merge/predictor-finetune on May 19, 2026


@theGreatHerrLebert
Owner

Brings feature/predictor-finetune onto main so the native Chronologer RT adapter (imspy_predictors/rt/chronologer.py) and the per-task RT/CCS/intensity fine-tune work land alongside the calibrate_nce NCE calibration already on main.

Why: feature/predictor-finetune branched from 488a8305 (before PR #395), so no branch currently has both Chronologer and calibrate_nce. sagepy-rescore pins imspy-predictors to the main branch, so it cannot reach Chronologer until this lands.

Conflict resolution: intensity/predictors.py auto-merged cleanly -- calibrate_nce is preserved. The only conflict was a one-hunk cosmetic difference in ccs/predictors.py (IM fine-tune verbose-logging cadence); took the feature-branch version (epoch % 5, im label) which matches the RT/intensity report cadence. Fine-tune history tracking is present on both sides and merged cleanly.

After merge: repin sagepy-rescore pyproject to main and reinstall.

theGreatHerrLebert and others added 12 commits April 27, 2026 16:53
…tims_read_pasef_msms_for_frame_v2

Adds Rust wrappers around two previously-unexposed Bruker SDK functions:

* tims_extract_centroided_spectrum_for_frame_v2(handle, frame_id,
  scan_lo, scan_hi, callback, user_data) — returns Bruker's built-in
  centroided peak list (m/z, intensity) for one (frame, scan range)
  tile. This is the function DiaTracer uses to start with peaks rather
  than raw events.

* tims_read_pasef_msms_for_frame_v2 — DDA-PASEF per-frame fragment
  reader; included for future DDA work, not used by the DIA dump.

Signatures recovered from gtluu/pyTDFSDK init_tdf_sdk.py + the
MSMS_SPECTRUM_FUNCTOR / MSMS_SPECTRUM_FUNCTION callback typedefs from
ctypes_data_structures.py. Callback marshalling is done through a
process-level Mutex + Option<Vec<...>> trampoline since Rust closures
don't compose with libloading symbols + C function pointers.

Adds rustdf/examples/dump_bruker_centroids.rs — a one-shot CLI that
reads dia_ms_ms_windows + dia_ms_ms_info from analysis.tdf, then
extracts a centroided spectrum per (MS2 frame, DIA quad scan-range)
and writes a TSV with the peak arrays.

Smoke test on 10 MS2 frames of O240206:
  * 26 extract calls
  * 838 peaks per call (mean)
  * 324 calls/s
  → full-file estimate (29k calls) lands ~90 seconds
  vs ~50 minutes for the current event-clustering pseudo-assembly.
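
The full-file estimate follows from the measured rate. A quick check using only the figures quoted above (29k is the rounded call count, 50 minutes the quoted time for the existing path):

```python
# Back-of-envelope check of the full-file estimate quoted in the smoke test.
calls_total = 29_000        # rounded full-file call count
calls_per_second = 324      # measured extraction rate
current_seconds = 50 * 60   # quoted wall time of the event-clustering path

estimated_seconds = calls_total / calls_per_second
speedup = current_seconds / estimated_seconds

print(f"{estimated_seconds:.1f} s, about {speedup:.0f}x faster")
```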

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 1.0 of the calibrated-library pipeline: take q ≤ 0.01 PSMs from a
spectrum-centric DIA-PASEF rescore, fine-tune the pretrained RT predictor
(UnifiedPeptideModel via DeepChromatographyApex) to predict OBSERVED RT
in minutes for THIS run, measure RT MAE on a peptide-level holdout
(20% of unique sequences, seeded).

Smoke test on real O240206 (26,855 anchor PSMs, 21,484 train / 5,371
holdout, RTX 5090, ~30s wall):

    baseline (linearly projected pretrained):  1.535 min MAE
    fine-tuned:                                 0.425 min MAE
    delta:                                     -1.11 min (-72%)

Held-out peptides only — sequences the predictor never saw during
fine-tuning. Shows the calibration hypothesis works: q ≤ 0.01 PSMs
from a single run are usable training data for a per-run RT model
that generalises within the same instrument run.

Implementation notes:
- Joins rescored_canonical.csv (observed `rt`) with rescored_canonical.tdc.csv
  (q_value + decoy) on (spec_idx, match_idx). Aggregates per peptide
  via mean observed RT.
- Bypasses the legacy DeepChromatographyApex.fine_tune_model() — that
  method calls bare `self.model(tokens)` which on UnifiedPeptideModel
  returns a dict, breaking l1_loss. The script's custom_finetune_loop()
  calls model.predict_rt(tokens) which extracts the 'rt' tensor.
- Supervision target is observed `rt` in minutes (not
  retention_time_projected, which is sage's projection into ITS
  predictor space — different scale than imspy_predictors' RT model).
- Pre-fine-tune baseline projects pretrained predictions linearly onto
  observed RT via least-squares so MAE measures prediction quality
  rather than the scale offset between the two model output spaces.
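
The linearly projected baseline in the last note can be sketched as below. This is a minimal stdlib illustration, not the script's code: the function names and toy values are hypothetical, and whether the script fits the projection on the train split or on all anchors is not stated here.

```python
from statistics import fmean

def fit_linear_projection(pred, obs):
    """Closed-form least-squares fit of obs ~ slope * pred + intercept."""
    mean_pred, mean_obs = fmean(pred), fmean(obs)
    sxx = sum((p - mean_pred) ** 2 for p in pred)
    sxy = sum((p - mean_pred) * (o - mean_obs) for p, o in zip(pred, obs))
    slope = sxy / sxx
    return slope, mean_obs - slope * mean_pred

def mae(pred, obs):
    """Mean absolute error between two equally long sequences."""
    return fmean(abs(p - o) for p, o in zip(pred, obs))

# Toy values (hypothetical): pretrained outputs on an arbitrary scale
# versus observed retention times in minutes.
pretrained = [0.10, 0.25, 0.40, 0.55, 0.80]
observed_min = [5.2, 11.0, 17.3, 22.9, 33.1]

slope, intercept = fit_linear_projection(pretrained, observed_min)
projected = [slope * p + intercept for p in pretrained]

raw_mae = mae(pretrained, observed_min)        # dominated by the scale offset
baseline_mae = mae(projected, observed_min)    # reflects prediction quality
```

Comparing the fine-tuned MAE against `baseline_mae` rather than `raw_mae` keeps the reported gain from being inflated by a pure unit mismatch.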

Output: rt_finetuned.pt (state_dict + metrics + args + pretrained
state_dict for diff) + metrics.json.

Next: Phase 1.1 expands to multi-task (RT + CCS + intensity) on
UnifiedPeptideModel, single state_dict per run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Joint fine-tune of RT + CCS heads on UnifiedPeptideModel using one
DIA-PASEF run's q ≤ 0.01 PSMs. Same encoder is shared so a single
training pass calibrates both heads against this run's anchor set.

Smoke test on real O240206 (26,855 anchor peptides, 21,484 train /
5,371 holdout, RTX 5090, ~30 s wall, 30 epochs):

    RT       baseline 1.535 min  fine-tuned 0.468 min  -69%  ✓
    CCS Ų   baseline 31.22      fine-tuned 30.14      -3.5%  marginal
    1/K0*    baseline 0.0655     fine-tuned 0.0708     +8% (worse)
    *predicted CCS converted via Mason-Schamp (ccs_to_one_over_k0).

Diagnosis: RT calibrates cleanly, matching the Phase 1.0 standalone
result. The CCS head's physics-informed SquareRootProjectionLayer
encodes a strong inductive bias for CCS in Ų; fine-tune on observed
CCS makes ~3% headway in 30 epochs, but predicted-CCS → 1/K0 round-trip
gets slightly worse than just linearly projecting the pretrained CCS
onto observed-CCS. Net read: linear projection of pretrained CCS is
already a strong baseline for library use; meaningful CCS fine-tune
would need either much longer training (~100s of epochs) or a
1/K0-native head that drops the SquareRootProjection physics layer.

Implementation notes:
- Side-loads observed 1/K0 from <stem>.pseudo.bin's env_apex_scan +
  TimsDataset.scan_to_inverse_mobility (one-frame LUT). The rescore
  CSV's ims/predicted_ims columns are zero — sage isn't passed
  inverse_ion_mobility from build_query in rescore_canonical.py.
  Right long-term fix is upstream (1-line plumbing in
  rescore_canonical.py); side-load avoids re-running rescore.
- Supervises CCS in Ų (head's native scale); converts predicted
  CCS → 1/K0 via imspy_core.chemistry.mobility.ccs_to_one_over_k0_par
  for library-relevant reporting.
- Saves both pretrained + fine-tuned state_dicts plus the linear
  projection params (rt_projection, ccs_projection) to the .pt.
  Library-time inference: load checkpoint, predict, optionally
  apply the projection (esp. for CCS where the fine-tune is marginal).
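
For reference, the CCS ↔ 1/K0 conversion can be sketched standalone. The scripts call the imspy_core helpers named above; the version below is only an illustration of the Mason-Schamp relation, and the constants (nitrogen drift gas, 31.85 °C, ion mass approximated as m/z × z) are assumptions of this sketch.

```python
import math

# Assumed constants for this sketch: lumped Mason-Schamp constant,
# nitrogen drift gas, and a drift temperature of 31.85 degrees Celsius.
SUMMARY_CONSTANT = 18509.8632163405
MASS_N2 = 28.013
TEMP_KELVIN = 31.85 + 273.15

def _sqrt_reduced_mass_temp(mz, charge):
    ion_mass = mz * charge  # approximation: ignores the proton masses
    reduced_mass = ion_mass * MASS_N2 / (ion_mass + MASS_N2)
    return math.sqrt(reduced_mass * TEMP_KELVIN)

def ccs_to_one_over_k0(ccs, mz, charge):
    """CCS in square angstrom -> reduced inverse mobility (Vs/cm^2)."""
    return _sqrt_reduced_mass_temp(mz, charge) * ccs / (SUMMARY_CONSTANT * charge)

def one_over_k0_to_ccs(one_over_k0, mz, charge):
    """Reduced inverse mobility (Vs/cm^2) -> CCS in square angstrom."""
    return SUMMARY_CONSTANT * charge * one_over_k0 / _sqrt_reduced_mass_temp(mz, charge)

ccs = one_over_k0_to_ccs(0.9, mz=500.0, charge=2)
round_trip = ccs_to_one_over_k0(ccs, mz=500.0, charge=2)
```

Because the conversion depends on both m/z and charge, an error that looks small in Ų can translate differently into 1/K0 for different precursors, which is why both columns are reported.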

Phase 1.1 result: RT calibration is publication-grade. CCS + linear
projection is sufficient for library generation; chasing the extra
3-5% from a real CCS fine-tune is not worth the architecture work
now. Pivoting to Phase 1.2 (fragment intensity) — that's where the
biggest peptide-centric search lift comes from.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…lti-task fix

Three new scripts for the per-run calibration pipeline (Phase 1.1' / 1.2),
plus a data-prep bug fix on the existing multi-task script.

NEW finetune_dia_pasef_ccs.py (Phase 1.1' standalone CCS diagnostic)
  - Side-loads observed 1/K0 from .pseudo.bin's env_apex_scan +
    TimsDataset.scan_to_inverse_mobility (one-frame LUT)
  - Converts 1/K0 → CCS Ų via Mason-Schamp (one_over_k0_to_ccs_par)
    — supervises the CCS head in its native output scale, then converts
    predicted CCS → 1/K0 at inference for library use.
  - Aggregates per-(peptide, charge) — CCS depends on charge, peptides
    observed at z=2 and z=3 carry distinct CCS values; mixing them
    via per-sequence aggregation gives meaningless supervision.
  - Smoke test on real O240206 (26,676 anchor pairs, 21,341 train /
    5,335 hold, RTX 5090, 45 epochs, ~52s wall):
        CCS Ų  baseline+proj 13.24 → fine-tuned 6.85   −48.3%
        1/K0    baseline+proj 0.0269 → fine-tuned 0.0139 −48.4%
        z=2: 8.65 → 4.41 Ų MAE  (publication-grade per-instrument calibration)
        z=3: 24.34 → 12.74 Ų MAE
  - SquareRootProjectionLayer slopes/intercepts moved meaningfully
    (z=2 slope 12.96→10.58, z=3 slope 15.61→16.17) — confirms the
    physics-prior is trainable when targets are correct.
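
Why the aggregation key matters can be seen on a toy example. The rows and the helper below are hypothetical, not taken from the script; the point is only that a per-sequence mean lands between the charge states and matches neither.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical anchor rows: (modified sequence, precursor charge, observed CCS in A^2).
rows = [
    ("PEPTIDEK", 2, 350.0),
    ("PEPTIDEK", 2, 354.0),
    ("PEPTIDEK", 3, 480.0),
    ("ELVISK", 2, 310.0),
]

def aggregate_mean(rows, key):
    """Group rows by key(seq, charge) and average the observed CCS per group."""
    groups = defaultdict(list)
    for seq, charge, ccs in rows:
        groups[key(seq, charge)].append(ccs)
    return {k: fmean(values) for k, values in groups.items()}

per_sequence = aggregate_mean(rows, key=lambda seq, z: seq)
per_seq_charge = aggregate_mean(rows, key=lambda seq, z: (seq, z))

# per_sequence["PEPTIDEK"] mixes z=2 and z=3 into one value,
# while per_seq_charge keeps 352.0 (z=2) and 480.0 (z=3) apart.
```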

NEW finetune_dia_pasef_intensity.py (Phase 1.2 — fragment intensity)
  - Reads the per-PSM rescored_canonical.fragments.parquet (sage's
    annotate_matches dump, persisted by patched rescore_canonical.py)
  - Encodes each PSM's fragments → canonical Prosit 174-dim layout
    (29 positions × {y,b} × {+1,+2,+3}) via intensity_target_encoder
  - Fine-tunes UnifiedPeptideModel intensity head w/ masked_spectral_distance
  - PSM-level training units (NOT aggregated) — fragmentation depends
    on charge + CE; sequence-level aggregation would conflate them
  - Holdout by sequence (no leak), eval via spectral angle similarity
  - Smoke test on real O240206 (26,676 PSMs, 20 epochs, ~26s wall):
        Spectral angle  baseline 0.3797 → fine-tuned 0.6199  +63%
        median          0.3736 → 0.6482
  - End-to-end re-rescore with fine-tuned weights (set INTENSITY_WEIGHTS_PATH
    env var on rescore_canonical.py): 26,676 → 27,773 peptides @ 1% FDR
    (+1,097, +4.1%), decoy fraction 0.0099 unchanged.
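
The evaluation metric can be sketched as follows. This is the usual Prosit-style spectral angle with masking of negative targets, written in plain Python; it is assumed, not confirmed, that the script's metric and `masked_spectral_distance` follow the same definition.

```python
import math

def masked_spectral_angle(pred, target, eps=1e-12):
    """Spectral angle similarity in [0, 1]; slots with target < 0 are ignored."""
    pairs = [(p, t) for p, t in zip(pred, target) if t >= 0.0]
    dot = sum(p * t for p, t in pairs)
    norm_pred = math.sqrt(sum(p * p for p, _ in pairs))
    norm_target = math.sqrt(sum(t * t for _, t in pairs))
    cosine = dot / max(norm_pred * norm_target, eps)
    cosine = min(1.0, max(-1.0, cosine))  # guard against rounding outside [-1, 1]
    return 1.0 - 2.0 * math.acos(cosine) / math.pi

# Toy 4-slot example (hypothetical): the last slot is masked with -1.0.
predicted = [0.9, 0.1, 0.4, 0.7]
observed = [1.0, 0.0, 0.5, -1.0]
similarity = masked_spectral_angle(predicted, observed)
```

The corresponding training loss would be `1 - similarity`, so a rise from 0.38 to 0.62 in similarity is a drop of the same size in distance.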

NEW intensity_target_encoder.py
  - Library + CLI sanity-check entry point. Encoder follows the verified
    mapping (verified by physics on real data, see history):
        sagepy observed_fragments_map() key = (ion_int, frag_charge, ordinal)
        ion_int 0 → b, ion_int 1 → y
    The user explicitly flagged this class of bug as the "ugly hotspot"
    where a silent swap survives undetected through the loss curve.
  - Sanity checks (s1/s2/s3) all PASSED on the real parquet:
      s1: round-trip encode→decode matches input set on 5 PSMs
      s2: distribution of fill rate per (ord, ion, charge) follows
          biology (peak at ord 3-5, tail to ord 25, 0% at ord 29)
      s3: spot check on short/mid/long peptides — max emitted ordinal
          equals peptide L-1 (no fragments past peptide length)

MODIFIED finetune_dia_pasef_multi.py — bug fix + reframe as DIAGNOSTIC
  - Per-(peptide, charge) aggregation instead of per-sequence (CCS bug)
  - Holdout split by unique sequence (not row index) to avoid same
    peptide leaking across train/test at different charges
  - Docstring rewritten to flag the ARCHITECTURAL LIMITATION discovered
    here: off-the-shelf checkpoints (rt/, ccs/, intensity/) are each
    pre-trained standalone, with their own encoder weights paired
    with their own head. Loading from rt/best_model.pt with
    tasks=['rt','ccs'] gets a fresh-init CCS head (encoder + RT head
    only loaded). Linear projection then fits a random head's output
    to observed CCS for a ~30 Ų baseline — vs the 13 Ų baseline the
    standalone CCS script gets from ccs/best_model.pt where encoder +
    CCS head are correctly paired.
  - Production calibration path is the 3 per-task scripts; multi-task
    remains for future research (joint-pretrained encoder).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase A (joint_init_baseline.py): does merging the 3 pretrained heads
into one UnifiedPeptideModel regress per-task baselines? Empirical test
across 3 base-encoder choices on real O240206 holdout:

    base         RT MAE Δ%   CCS Ų Δ%   Intensity SA Δ%
    intensity    +133.1%     -1.9%       -0.1%
    rt           +1.9%       +2.0%       -0.2%      ← winner
    ccs          +114.5%     +0.0%       -0.0%

RT is the picky one — only the rt-paired encoder gives non-degraded RT
baseline. CCS (anchored by SquareRootProjectionLayer physics prior) and
intensity (anchored by charge + collision-energy inputs) are nearly
encoder-insensitive. Joint init from rt/best_model.pt + swapped CCS &
intensity heads → all three baselines within 2% of standalone. Joint IS
viable.

Phase B (finetune_dia_pasef_joint.py): 5-flavor fine-tune ablation on
the joint init. None match per-task on all three metrics:

    flavor                                RT min  CCS Ų  1/K0    SA
    Per-task reference                    0.43    6.85    0.0139  0.620
    B1 heads-only encoder-frozen          0.971   11.20   0.0233  0.5448
    B2 joint static weights               0.427   7.54    0.0153  0.5360
    B3 split LR (heads/encoder 100×)      0.537   10.12   0.0207  0.5515
    B4 uncertainty weighting (Kendall)    0.496   6.83    0.0138  0.5784
    B5 sequential intensity → joint       0.582   9.10    0.0185  0.5610

B4 is the best joint flavor — matches CCS exactly (6.83 vs 6.85) and
1/K0 (0.0138 vs 0.0139), but loses RT by 15% and SA by 7%. SA is the
metric most directly tied to library-free search performance (intensity
features drive sage's discriminator), so a 7% gap projects to ~30-50%
loss of the +1,097 peptide gain we measured from intensity fine-tune.
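
For readers unfamiliar with the B4 flavor, uncertainty weighting combines task losses through learnable log-variances. The sketch below uses one common simplified form, sum of exp(-s_i) * L_i + s_i; the exact form in finetune_dia_pasef_joint.py is not shown here, and the numbers are hypothetical.

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine losses as sum_i exp(-s_i) * L_i + s_i, with s_i = log(sigma_i^2)."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))

# Hypothetical per-task losses on very different scales.
losses = {"rt": 0.50, "ccs": 7.0, "intensity": 0.40}
# In a real training loop these would be learnable parameters updated by the optimizer.
log_vars = {"rt": 0.0, "ccs": 2.0, "intensity": 0.0}

total = uncertainty_weighted_loss(
    [losses[name] for name in losses],
    [log_vars[name] for name in losses],
)
```

A larger `s_i` down-weights a task whose loss is large or noisy, while the additive `s_i` term stops the optimizer from inflating every variance to zero out the loss.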

Why intensity always loses in joint: 174-dim convolutional decoder is
the most encoder-hungry head; joint distributes encoder capacity across
three tasks, per-task gets the encoder fully devoted to fragments.

Decision: stay with three per-task scripts as production. The deployment
cost of staying per-task (3 checkpoints, 3 forward passes at library-gen
time) is small vs the SA hit of a joint model. B4 stays as the best diagnostic flavor for
future revisits (e.g., when calibration data is scarce, when more
tasks are added, or when a unified pre-trained encoder lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds ``Chronologer`` as a drop-in RT predictor alongside
``DeepChromatographyApex``. Roughly 3× tighter median residual
on timsTOF anchor PSMs (~8 s vs ~25 s on real-o240206) when
fine-tuned on q ≤ 0.01 spec-centric hits.

Implementation reproduces the upstream Searle Lab residual-CNN
architecture (Apache-2.0; Pino LK et al., U. Wisconsin-Madison)
inline — no upstream package dependency. We avoid vendoring
weights; ``Chronologer.from_checkpoint`` loads a base ``.pt``
that the user supplies. The full attribution + citation
block is at the top of chronologer.py.

API mirrors DeepChromatographyApex where it can:
  * ``Chronologer.from_base(checkpoint_path, scale_init=0.79, bias_init=0.69)``
  * ``predictor.predict(seqs, batch_size=4096) → np.ndarray``  (minutes)
  * ``predictor.fine_tune(df, epochs=50, lr=1e-4, patience=8, val_frac=0.2)``
  * ``predictor.fit_kde_correction(n_bins=2000)`` — optional KDE post-correction
    that prefers upstream ``chronologer.chronologer_utils.kde_alignment.KDE_align``
    when the import is available, falls back to a local KDE otherwise.
  * ``predictor.fine_tune_psms(psm_collection, q_threshold=0.01)`` —
    the API sagepy-rescore calls.
  * ``predictor.save_checkpoint(path)`` — saves model state_dict + KDE
    params so library generation can load via build_library's
    --chrono-checkpoint flag.

Tokenizer accepts both ``[UNIMOD:n]`` and ``(UniMod:n)`` mod
brackets via ``unimod_to_chronologer`` (also exported), aligning
with the DiaNN probe canonicalization rule from
project_diann_probe_stage0_2026_05_09.
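
The bracket handling can be illustrated with a small regex. This is a hypothetical helper, not ``unimod_to_chronologer`` itself: the real function also has to map modifications onto Chronologer's token alphabet, which is not shown, and the choice of ``[UNIMOD:n]`` as the output form is an assumption of this sketch.

```python
import re

# Matches "[UNIMOD:35]" as well as "(UniMod:35)", case-insensitively.
_MOD_PATTERN = re.compile(r"[\[(]unimod:(\d+)[\])]", re.IGNORECASE)

def canonicalize_unimod(sequence: str) -> str:
    """Rewrite every UniMod annotation to the bracketed upper-case form."""
    return _MOD_PATTERN.sub(lambda m: f"[UNIMOD:{m.group(1)}]", sequence)

examples = [
    "PEPTM(UniMod:35)IDEK",
    "PEPTM[UNIMOD:35]IDEK",
    "C(UniMod:4)PEPTIDEK",
]
canonical = [canonicalize_unimod(s) for s in examples]
```

Normalising before tokenisation means both spellings of the same modified peptide end up with the same group key and the same prediction.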
Brings ``DeepPeptideIntensityPredictor`` to API-parity with the
RT and IM heads. sagepy-rescore.predict_and_finetune was
already calling fine_tune_psms on intensity; without this
commit it falls through to a NotImplementedError on the rustims
main tree.

Three pieces:

* ``observed_fragments_to_intensity_target(sequence, charge,
  fragments)`` builds a 174-vec Prosit target from a sagepy
  ``Fragments`` view of the observed ions. Layout is
  ordinal-major ([y1+1, y1+2, y1+3, b1+1, b1+2, b1+3, y2+1,
  ...]) to match the model's native output. Impossible slots
  (frag > seq_len-1, frag_charge > prec_charge) are marked
  -1.0 so masked_spectral_distance ignores them; valid but
  unmatched slots stay 0.0 so the model learns the
  presence/absence pattern, not just the matched magnitudes.
  Intensities normalized by per-spectrum max so the loss is
  scale-invariant.

* ``_ion_to_text`` is a small shim around sagepy's IonType
  enum so it works whether the binding emits ``"b"``, ``"y"``,
  ``"IonType(B)"``, or the str repr.

* ``DeepPeptideIntensityPredictor.fine_tune_model(data,
  batch_size=64, epochs=50, learning_rate=1e-4, patience=5)``
  matches the training loop from scripts/finetune_timstof.py:
  AdamW + GradScaler + grad-clip 1.0 + masked_spectral_distance.
  ``DeepPeptideIntensityPredictor.fine_tune_psms(psm_collection,
  q_threshold=0.01, ...)`` filters to rank-1 q≤threshold targets
  and dispatches to fine_tune_model.
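
The idea behind the ``_ion_to_text`` shim can be sketched as below. The function name, the accepted spellings beyond those listed above, and the error behaviour are assumptions of this sketch rather than the actual implementation.

```python
def ion_to_text(ion) -> str:
    """Normalise an ion-type value to 'b' or 'y' regardless of how it prints."""
    text = str(ion).strip()
    # Handle reprs such as "IonType(B)" by keeping what is inside the parentheses.
    if "(" in text and text.endswith(")"):
        text = text[text.index("(") + 1:-1]
    text = text.strip().lower()
    if text not in ("b", "y"):
        raise ValueError(f"unsupported ion type: {ion!r}")
    return text

class FakeIonType:
    """Stand-in for a binding object whose str() is 'IonType(Y)' (hypothetical)."""
    def __str__(self):
        return "IonType(Y)"

kinds = [ion_to_text("b"), ion_to_text("IonType(B)"), ion_to_text(FakeIonType())]
```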

State is held on ``self._finetune_history`` so the
sagepy-rescore HTML report can plot it.

Validated on real-o240206 fc3-imcons0_5 (412k spectra,
1.76M sage PSMs): full FT pushes mokapot peptide yield from
30,947 (baseline FT) → 33,027 (+14.4% vs rescore_canonical
28,857 baseline) when paired with xgboost mokapot model.

DeepPeptideIonMobilityApex.fine_tune_model now records
per-epoch {epochs, train_loss, val_loss} on
self._finetune_history. sagepy-rescore's report reads this
to plot per-head convergence curves alongside RT and
intensity, so all three predictor heads now expose the
same telemetry channel.
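
A minimal sketch of the history record, assuming 1-based epoch numbers and a plain dict (the real attribute layout beyond the three keys named above is not shown). It also averages the batch losses, which is the normalisation this commit introduces for the IM head.

```python
def record_history(batch_losses_per_epoch, val_losses):
    """Build a {epochs, train_loss, val_loss} history from per-batch losses."""
    history = {"epochs": [], "train_loss": [], "val_loss": []}
    pairs = zip(batch_losses_per_epoch, val_losses)
    for epoch, (batch_losses, val_loss) in enumerate(pairs, start=1):
        # Mean over batches, so the value does not grow with the batch count.
        train_loss = sum(batch_losses) / max(len(batch_losses), 1)
        history["epochs"].append(epoch)
        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
    return history

# Hypothetical losses: two batches in epoch 1, four batches in epoch 2.
history = record_history([[1.0, 3.0], [0.5, 0.5, 0.5, 0.5]], [1.5, 0.6])
```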

Other small tweaks:
* divide accumulated train_loss by num batches before
  recording (previously the per-epoch print/value was an
  un-normalized sum that grew with batch count)
* drop the print-every-10 cadence to every-5 to match the
  RT and intensity heads — keeps log-parser regexes aligned
  across all three heads in generate_report_post_hoc.py

Critical layout bug in observed_fragments_to_intensity_target,
the new label builder used by fine_tune_psms.

The function was writing observed fragment intensities at
ordinal-major slots:

    slot = (ordinal-1)*6 + (charge-1 if y else 3+charge-1)

so e.g. b1+1 went to slot 3, y2+1 went to slot 6, y1+2 went to
slot 1. But the canonical Prosit/imspy 174-vec layout (from
both imspy_simulation.utility.flatten_prosit_array and Rust's
rustdf::sim::utility::reshape_prosit_array) is charge-major,
y-before-b inside each charge block:

    [0:29]    y+1
    [29:58]   b+1
    [58:87]   y+2
    [87:116]  b+2
    [116:145] y+3
    [145:174] b+3

so the correct slot is

    slot = (charge-1)*58 + (0 if y else 29) + (ordinal-1)
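
Both formulas side by side, as a runnable check of the examples given in this message. The function names are illustrative only; the arithmetic is copied from the two formulas quoted here.

```python
def slot_ordinal_major(ion: str, charge: int, ordinal: int) -> int:
    """Layout the buggy label builder wrote to (6 slots per ordinal)."""
    return (ordinal - 1) * 6 + (charge - 1 if ion == "y" else 3 + charge - 1)

def slot_charge_major(ion: str, charge: int, ordinal: int) -> int:
    """Canonical layout: 58 slots per charge, y block before b block."""
    return (charge - 1) * 58 + (0 if ion == "y" else 29) + (ordinal - 1)

# (ion, charge, ordinal) for b1+1, y2+1 and y1+2.
fragments = [("b", 1, 1), ("y", 1, 2), ("y", 2, 1)]
for ion, charge, ordinal in fragments:
    print(f"{ion}{ordinal}+{charge}",
          slot_ordinal_major(ion, charge, ordinal),
          slot_charge_major(ion, charge, ordinal))
# prints:
# b1+1 3 29
# y2+1 6 1
# y1+2 1 58
```

Both mappings cover all 174 slots exactly once, so the bug is a permutation of the vector rather than a loss of information, which is consistent with the loss still decreasing.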

Effect: FT trained the model on labels at the wrong slots.
Loss still decreased because the scrambling was consistent
across labels, so mokapot's per-PSM cosine features (which
compare the FT'd-and-scrambled prediction against the same-
scrambled observed) still went up — this is why v3 peptide
yield (33,027) lifted +14.4% over rescore_canonical despite
the bug. Mokapot features are scramble-invariant within a
PSM.

But the LIBRARY uses the canonical (correct) decoder to map
the model's 174-vec to per-fragment intensities. So the
library's predicted per-fragment intensities are gibberish.

Smoking gun (3-way audit on v3 anchors, n=588):
  extracted   ↔ sage_observed:   median spearman +0.71  ✓
  extracted   ↔ predicted:       median spearman -0.12  ✗
  sage_obs    ↔ predicted:       median spearman -0.21  ✗

Pep-centric extraction is sound (matches sage). Predictor
is mis-calibrated *for downstream library use* because of
the scrambled FT.

Fix is the slot formula. Mask handling rewritten to mark both kinds
of impossible slot (ordinal > seq_len-1, or frag_charge >
precursor_charge) in the charge-major layout.
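
A sketch of a label builder with the corrected slots and masks. The function name and the dict-based input are assumptions for illustration; the real builder takes a sagepy ``Fragments`` view. The conventions (-1.0 for impossible slots, 0.0 for valid but unmatched slots, per-spectrum max normalisation) follow the description of the label builder in this PR.

```python
def build_intensity_target(seq_len, precursor_charge, observed):
    """observed: {(ion, frag_charge, ordinal): intensity}; returns 174 floats."""
    target = [0.0] * 174
    max_intensity = max(observed.values(), default=0.0)
    # Pass 1: mask slots that cannot exist for this precursor.
    for charge in (1, 2, 3):
        for ion_offset in (0, 29):  # y block, then b block
            for ordinal in range(1, 30):
                slot = (charge - 1) * 58 + ion_offset + (ordinal - 1)
                if ordinal > seq_len - 1 or charge > precursor_charge:
                    target[slot] = -1.0
    # Pass 2: write normalised observed intensities into valid slots only.
    for (ion, charge, ordinal), intensity in observed.items():
        if not (1 <= charge <= 3 and 1 <= ordinal <= 29):
            continue  # outside the 174-slot layout
        slot = (charge - 1) * 58 + (0 if ion == "y" else 29) + (ordinal - 1)
        if target[slot] >= 0.0 and max_intensity > 0.0:
            target[slot] = intensity / max_intensity
    return target

# Hypothetical PSM: 8-residue peptide, precursor charge 2, two matched ions.
target = build_intensity_target(8, 2, {("y", 1, 3): 200.0, ("b", 1, 2): 100.0})
```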

After this commit:
  * v3 intensity_finetuned.pt is invalid; must re-FT.
  * Any library built with that ckpt has wrong per-fragment
    intensities; rebuild required.
  * Spec-centric peptide yield numbers from v3 stack still
    stand (mokapot features are scramble-invariant), but any
    library-assisted pipeline run on the v3 library is
    untrusted until rebuild.

All three predictor heads (Chronologer RT, intensity, IM) previously split
train/val by random PSM index. Each peptide appears at many spectra
(107k PSMs ↔ ~33k unique modseqs in our test set), so a PSM-level
split puts the same modseq in both folds. The predictors are
deterministic per input (RT: seq→time; intensity: (seq,charge)→174vec;
IM: (seq,charge)→1/K0), so val loss collapses to the instrument noise
floor (~10s RT, ~3% intensity, ~0.01 1/K0) — that's NOT generalization,
it's memorization being measured.

Fix: group-aware split.
  * RT (chronologer.py): group key = sequence_modified
  * Intensity (predictors.py): group key = (sequence_modified, charge)
  * IM (ccs/predictors.py): group key = (sequence_modified, charge)

All PSMs of a given group go to the same fold via:
  uniq, inv = np.unique(group_keys, return_inverse=True)
  perm_groups = rng_np.permutation(n_groups)
  val_groups = set(perm_groups[:n_val_groups])
  mask_val = [g in val_groups for g in inv]
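
A self-contained variant of the same idea, using only the standard library instead of numpy. The function name, default fraction, and seed are assumptions of this sketch; the property it demonstrates is that no group key appears in both folds.

```python
import random

def group_split(group_keys, val_frac=0.2, seed=42):
    """Return a boolean validation mask in which each group is in one fold only."""
    unique_groups = sorted(set(group_keys))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    n_val_groups = max(1, int(round(val_frac * len(unique_groups))))
    val_groups = set(unique_groups[:n_val_groups])
    return [key in val_groups for key in group_keys]

# Hypothetical PSM rows keyed by (sequence_modified, charge).
keys = [
    ("PEPTIDEK", 2), ("PEPTIDEK", 2), ("PEPTIDEK", 3),
    ("ELVISK", 2), ("SAMPLER", 2), ("SAMPLER", 2),
]
mask_val = group_split(keys, val_frac=0.25)

val_keys = {k for k, is_val in zip(keys, mask_val) if is_val}
train_keys = {k for k, is_val in zip(keys, mask_val) if not is_val}
```

With a (sequence, charge) key the same sequence can still appear in both folds at different charges; keying on the sequence alone is the stricter choice when that matters.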

Spec-centric mokapot yields are unaffected (the rescore features use
the model's prediction vs sage's matched fragments per-PSM, so
memorized fit and generalized fit give identical features). But the
LIBRARY built with these predictors covers the full FASTA digest
(mostly unseen sequences) — so library quality depends on true
generalization, which we've been measuring incorrectly.

Today's chronologer FT showed val_L1 = 10s, which we now expect to
be 2-4× looser on a held-out peptide set. The model is still good,
just not as good as the curves suggested.

After commit: rebuild venv editable + restart full recovery pipeline
with fixed split.

Brings the native Chronologer RT adapter (imspy_predictors/rt/chronologer.py)
and the per-task RT/CCS/intensity fine-tune work onto main, alongside the
calibrate_nce NCE calibration already on main.

intensity/predictors.py auto-merged (calibrate_nce preserved). The only
conflict was in ccs/predictors.py -- the IM fine-tune verbose-logging cadence;
took the feature-branch version (epoch % 5, 'im' label) which matches the
RT/intensity report cadence. Fine-tune history tracking is present on both
sides and merged cleanly.

@theGreatHerrLebert merged commit a6008d0 into main on May 19, 2026
2 checks passed