Freed heavy fit objects and large dataProcess intermediates early by tonywu1999 · Pull Request #207 · Vitek-Lab/MSstats

tonywu1999 · 2026-05-14T15:23:51Z

User description

Stripped survreg() $y / $linear.predictors and lm() $model after fitting in .fitSurvival and .fitLinearModel; downstream predict / summary / vcov paths are unaffected.
Rewrote .updateUnequalVariances to mutate a single data.frame in place instead of rebuilding it ~5x per iteration via data.frame(...).
Added rm(raw), rm(peptides_dict), and gc(verbose = FALSE) inside dataProcess() so the ~250-400 MB held by function arguments is freed before MSstatsSummarizationOutput runs.
Added rm(survival_fit) and rm(fit) inside the per-protein linear / TMP summarizers so local fit objects drop as soon as survival / result is materialised.
Tests: inst/tinytest/test_memory_optimization.R Issue 1 (tests 1a, 1b) — 14 assertions, all green.

See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md

Motivation and Context

Please include relevant motivation and context of the problem along with a short summary of the solution.

Changes

Please provide a detailed bullet point list of your changes.

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

I have read the MSstats contributing guidelines
My changes generate no new warnings
Any dependent changes have been merged and published in downstream modules
I have run the devtools::document() command after my changes and committed the added files

PR Type

Bug fix, Enhancement, Tests

Description

Free large pipeline inputs earlier
Strip heavy fields from fit objects
Avoid repeated copies in variance updates
Add fit memory regression tests

Diagram Walkthrough

flowchart LR
  raw["raw and peptides_dict inputs"]
  proc["dataProcess() processing pipeline"]
  fits["Stripped survreg and lm objects"]
  var["In-place unequal variance updates"]
  tests["Memory optimization regression tests"]
  raw -- "freed earlier" --> proc
  fits -- "reduce object size" --> proc
  var -- "avoids repeated copies" --> proc
  proc -- "validated by" --> tests

File Walkthrough

Relevant files

Bug fix

dataProcess.R `Free pipeline intermediates and fit objects early` R/dataProcess.R Remove `raw` after initial preparation Remove `peptides_dict` after normalization Run `gc(verbose = FALSE)` before output assembly Drop local `survival_fit` and `fit` sooner	+11/-3

Enhancement

utils_imputation.R Strip unused `survreg` fields after fitting R/utils_imputation.R Clear `fit$y` after `survreg()` fitting Clear `fit$linear.predictors` before returning Keep downstream prediction behavior intact	+3/-1
utils_summarization.R Reduce `lm` footprint and variance-copy churn R/utils_summarization.R Clear `linear_model$model` before returning Convert `input` to `data.frame` once upfront Add and remove helper columns in place Avoid repeated full-frame copies per iteration	+17/-15

Tests

test_memory_optimization_fits.R `Add regression tests for memory-stripped fits` inst/tinytest/test_memory_optimization_fits.R Add tests for stripped `survreg` fields Verify stripped survival fits still predict Add tests for stripped `lm$model` Verify `summary()` and `vcov()` still work	+111/-0

* Stripped survreg() $y / $linear.predictors and lm() $model after fitting in .fitSurvival and .fitLinearModel; downstream predict / summary / vcov paths are unaffected. * Rewrote .updateUnequalVariances to mutate a single data.frame in place instead of rebuilding it ~5x per iteration via data.frame(...). * Added rm(raw), rm(peptides_dict), and gc(verbose = FALSE) inside dataProcess() so the ~250-400 MB held by function arguments is freed before MSstatsSummarizationOutput runs. * Added rm(survival_fit) and rm(fit) inside the per-protein linear / TMP summarizers so local fit objects drop as soon as survival / result is materialised. * Tests: inst/tinytest/test_memory_optimization.R Issue 1 (tests 1a, 1b) — 14 assertions, all green. See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai · 2026-05-14T15:24:02Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7709991f-e6df-42de-b82b-4cd9d4e91f91

📥 Commits

Reviewing files that changed from the base of the PR and between 5693fc9 and 9edb718.

📒 Files selected for processing (3)

R/dataProcess.R
R/utils_summarization.R
inst/tinytest/test_memory_optimization_fits.R

🚧 Files skipped from review as they are similar to previous changes (3)

inst/tinytest/test_memory_optimization_fits.R
R/dataProcess.R
R/utils_summarization.R

📝 Walkthrough

Walkthrough

This PR frees large intermediates and fitted-model internals during data processing and summarization, calls gc() before summarization, fixes a conditional in .fitSurvival(), refactors unequal-variance updates to avoid repeated data.frame copies, and adds tests that validate stripped fit objects retain required functionality while using less memory.

Changes

Memory Optimization in Data Processing and Fitting

Layer / File(s)	Summary
Pipeline intermediate cleanup and GC `R/dataProcess.R`	Removes large intermediates (`raw`, `peptides_dict`) immediately after use and calls `gc(verbose = FALSE)` before summarization. Explicitly removes per-iteration fitted objects (`survival_fit`, `fit`) in both linear and TMP summarization paths after their results are extracted.
Survival fit return cleanup and conditional fix `R/utils_imputation.R`	Adds missing brace to correctly terminate a nested branch in `.fitSurvival()` and nulls `fit$y` and `fit$linear.predictors` on the returned `survreg`-like object to reduce retained memory while leaving coefficients/scale for use.
Unequal-variance adjustment refactor `R/utils_summarization.R`	Refactors `.updateUnequalVariances()` to convert `input` to a `data.frame` once, add intermediate columns (`abs.resids`, `fitted`, `loess.fitted`, `weight`) via in-place assignments, remove them by setting to `NULL`, and re-fit weighted `lm()` each iteration — avoiding repeated full data.frame reconstruction.
Tests validating memory-optimized fits `inst/tinytest/test_memory_optimization_fits.R`	Adds tests asserting that `.fitSurvival()` strips `$y` and `$linear.predictors` but preserves coefficients/scale and predictability; and that `.fitLinearModel()` strips `$model` while preserving coefficients/qr/residuals and summary/vcov behavior; tests compare object.size() to unstripped fits.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Vitek-Lab/MSstats#193: Modifies the same summarization pipeline (R/dataProcess.R) and may overlap with model-cleanup locations.
Vitek-Lab/MSstats#190: Changes survival-model fitting in R/utils_imputation.R; closely related to the .fitSurvival() branch that was fixed and stripped here.

Suggested labels

Review effort 3/5

Poem

🐰 I hopped through the code with tidy delight,

Cleared bulky fits and made memory light,
Raw crumbs swept up, predictions still sing,
Tests nod and approve — a neat little spring.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description includes custom user notes and a diagram but lacks the required template sections: 'Motivation and Context', properly filled 'Changes' with bullet points, 'Testing' details, and unchecked Checklist items.	Complete the description using the template: fill 'Motivation and Context' with problem context, populate 'Changes' with detailed bullet points, describe tests in 'Testing' section, and address the Checklist items.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main changes: freeing heavy fit objects (survreg/lm) and early freeing of large dataProcess intermediates (raw, peptides_dict).
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch MSstats/work/20260514_free-heavy-objects

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-05-14T15:26:26Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review Unguarded cleanup In `MSstatsSummarizeSingleTMP`, `survival_fit` is created inside a `tryCatch` and the code already handles the failure case by setting `converged = FALSE`. The new unconditional `rm(survival_fit)` runs even when fitting failed, so this path now emits `object 'survival_fit' not found` warnings and can become a hard failure in environments that promote warnings to errors. rm(survival_fit)

github-actions · 2026-05-14T15:27:55Z

PR Code Suggestions ✨

No code suggestions found for the PR.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/utils_summarization.R`:
- Around line 208-223: The loess iterations reuse stale fitted values because
input[["fitted"]] is only set from the initial fit; after each weighted refit
(wls.fit) update the stored fitted values so subsequent loess calls use
consistent residual/fitted pairs — e.g. inside the loop after computing wls.fit
(the variable wls.fit) assign input[["fitted"]] <- fitted(wls.fit) or
input[["fitted"]] <- wls.fit$fitted.values before recalculating
input[["abs.resids"]] and proceeding to the next iteration.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7d6992f3-0ec8-4947-80c1-53b71f9e9caa

📥 Commits

Reviewing files that changed from the base of the PR and between 3fa0bd1 and 5693fc9.

📒 Files selected for processing (4)

R/dataProcess.R
R/utils_imputation.R
R/utils_summarization.R
inst/tinytest/test_memory_optimization_fits.R

coderabbitai · 2026-05-14T15:29:47Z

    for (i in seq_len(num_iter)) {
        if (i == 1) {
-            abs.resids = data.frame(abs.resids = abs(fit$residuals))
-            fitted = data.frame(fitted = fit$fitted.values)
-            input = data.frame(input, 
-                               "abs.resids" = abs.resids, 
-                               "fitted" = fitted)
+            input[["abs.resids"]] = abs(fit$residuals)
+            input[["fitted"]] = fit$fitted.values
        }
        fit.loess = loess(abs.resids ~ fitted, data = input)
-        loess.fitted = data.frame(loess.fitted = fitted(fit.loess))
-        input = data.frame(input, "loess.fitted" = loess.fitted)
-        ## loess fitted valuaes are predicted sd
-        input$weight = 1 / (input$loess.fitted ^ 2)
-        input = input[, !(colnames(input) %in% "abs.resids")]
+        input[["loess.fitted"]] = fitted(fit.loess)
+        ## loess fitted values are predicted sd
+        input[["weight"]] = 1 / (input[["loess.fitted"]] ^ 2)
+        input[["abs.resids"]] = NULL
        ## re-fit using weight
        wls.fit = lm(formula(fit), data = input, weights = weight)
-        abs.resids = data.frame(abs.resids = abs(wls.fit$residuals))
-        input = data.frame(input, "abs.resids" = abs.resids)
-        input = input[, -which(colnames(input) %in% c("loess.fitted", "weight"))]
+        input[["abs.resids"]] = abs(wls.fit$residuals)
+        input[["loess.fitted"]] = NULL
+        input[["weight"]] = NULL
    }


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Refresh fitted after each weighted refit.

At Line 220 you refresh abs.resids from wls.fit, but fitted is never refreshed after Line 211. For num_iter > 1, later loess iterations use stale fitted values and inconsistent residual/fitted pairs.

Suggested fix

for (i in seq_len(num_iter)) { if (i == 1) { input[["abs.resids"]] = abs(fit$residuals) input[["fitted"]] = fit$fitted.values } fit.loess = loess(abs.resids ~ fitted, data = input) input[["loess.fitted"]] = fitted(fit.loess) ## loess fitted values are predicted sd input[["weight"]] = 1 / (input[["loess.fitted"]] ^ 2) input[["abs.resids"]] = NULL ## re-fit using weight wls.fit = lm(formula(fit), data = input, weights = weight) input[["abs.resids"]] = abs(wls.fit$residuals) + input[["fitted"]] = wls.fit$fitted.values input[["loess.fitted"]] = NULL input[["weight"]] = NULL }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@R/utils_summarization.R` around lines 208 - 223, The loess iterations reuse stale fitted values because input[["fitted"]] is only set from the initial fit; after each weighted refit (wls.fit) update the stored fitted values so subsequent loess calls use consistent residual/fitted pairs — e.g. inside the loop after computing wls.fit (the variable wls.fit) assign input[["fitted"]] <- fitted(wls.fit) or input[["fitted"]] <- wls.fit$fitted.values before recalculating input[["abs.resids"]] and proceeding to the next iteration.

github-actions Bot added the Review effort 2/5 label May 14, 2026

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

Removing the unnecessary comments

9edb718

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Freed heavy fit objects and large dataProcess intermediates early#207

Freed heavy fit objects and large dataProcess intermediates early#207
tonywu1999 wants to merge 2 commits into
develfrom
MSstats/work/20260514_free-heavy-objects

tonywu1999 commented May 14, 2026 •

edited by github-actions Bot

Loading

Uh oh!

coderabbitai Bot commented May 14, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented May 14, 2026

Uh oh!

github-actions Bot commented May 14, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tonywu1999 commented May 14, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

Motivation and Context

Changes

Testing

Checklist Before Requesting a Review

PR Type

Description

Diagram Walkthrough

File Walkthrough

Uh oh!

coderabbitai Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented May 14, 2026

PR Reviewer Guide 🔍

Uh oh!

github-actions Bot commented May 14, 2026

PR Code Suggestions ✨

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tonywu1999 commented May 14, 2026 •

edited by github-actions Bot

Loading

coderabbitai Bot commented May 14, 2026 •

edited

Loading