Skip to content

Freed heavy fit objects and large dataProcess intermediates early#207

Open
tonywu1999 wants to merge 2 commits into
develfrom
MSstats/work/20260514_free-heavy-objects
Open

Freed heavy fit objects and large dataProcess intermediates early#207
tonywu1999 wants to merge 2 commits into
develfrom
MSstats/work/20260514_free-heavy-objects

Conversation

@tonywu1999
Copy link
Copy Markdown
Contributor

@tonywu1999 tonywu1999 commented May 14, 2026

User description

  • Stripped survreg() $y / $linear.predictors and lm() $model after fitting in .fitSurvival and .fitLinearModel; downstream predict / summary / vcov paths are unaffected.
  • Rewrote .updateUnequalVariances to mutate a single data.frame in place instead of rebuilding it ~5x per iteration via data.frame(...).
  • Added rm(raw), rm(peptides_dict), and gc(verbose = FALSE) inside dataProcess() so the ~250-400 MB held by function arguments is freed before MSstatsSummarizationOutput runs.
  • Added rm(survival_fit) and rm(fit) inside the per-protein linear / TMP summarizers so local fit objects drop as soon as survival / result is materialised.
  • Tests: inst/tinytest/test_memory_optimization.R Issue 1 (tests 1a, 1b) — 14 assertions, all green.

See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md

Motivation and Context

Please include relevant motivation and context of the problem along with a short summary of the solution.

Changes

Please provide a detailed bullet point list of your changes.

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • I have run the devtools::document() command after my changes and committed the added files

PR Type

Bug fix, Enhancement, Tests


Description

  • Free large pipeline inputs earlier

  • Strip heavy fields from fit objects

  • Avoid repeated copies in variance updates

  • Add fit memory regression tests


Diagram Walkthrough

flowchart LR
  raw["raw and peptides_dict inputs"]
  proc["dataProcess() processing pipeline"]
  fits["Stripped survreg and lm objects"]
  var["In-place unequal variance updates"]
  tests["Memory optimization regression tests"]
  raw -- "freed earlier" --> proc
  fits -- "reduce object size" --> proc
  var -- "avoids repeated copies" --> proc
  proc -- "validated by" --> tests
Loading

File Walkthrough

Relevant files
Bug fix
dataProcess.R
Free pipeline intermediates and fit objects early               

R/dataProcess.R

  • Remove raw after initial preparation
  • Remove peptides_dict after normalization
  • Run gc(verbose = FALSE) before output assembly
  • Drop local survival_fit and fit sooner
+11/-3   
Enhancement
utils_imputation.R
Strip unused `survreg` fields after fitting                           

R/utils_imputation.R

  • Clear fit$y after survreg() fitting
  • Clear fit$linear.predictors before returning
  • Keep downstream prediction behavior intact
+3/-1     
utils_summarization.R
Reduce `lm` footprint and variance-copy churn                       

R/utils_summarization.R

  • Clear linear_model$model before returning
  • Convert input to data.frame once upfront
  • Add and remove helper columns in place
  • Avoid repeated full-frame copies per iteration
+17/-15 
Tests
test_memory_optimization_fits.R
Add regression tests for memory-stripped fits                       

inst/tinytest/test_memory_optimization_fits.R

  • Add tests for stripped survreg fields
  • Verify stripped survival fits still predict
  • Add tests for stripped lm$model
  • Verify summary() and vcov() still work
+111/-0 

* Stripped survreg() $y / $linear.predictors and lm() $model after
  fitting in .fitSurvival and .fitLinearModel; downstream predict /
  summary / vcov paths are unaffected.
* Rewrote .updateUnequalVariances to mutate a single data.frame in
  place instead of rebuilding it ~5x per iteration via data.frame(...).
* Added rm(raw), rm(peptides_dict), and gc(verbose = FALSE) inside
  dataProcess() so the ~250-400 MB held by function arguments is freed
  before MSstatsSummarizationOutput runs.
* Added rm(survival_fit) and rm(fit) inside the per-protein linear /
  TMP summarizers so local fit objects drop as soon as survival /
  result is materialised.
* Tests: inst/tinytest/test_memory_optimization.R Issue 1 (tests 1a,
  1b) — 14 assertions, all green.

See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7709991f-e6df-42de-b82b-4cd9d4e91f91

📥 Commits

Reviewing files that changed from the base of the PR and between 5693fc9 and 9edb718.

📒 Files selected for processing (3)
  • R/dataProcess.R
  • R/utils_summarization.R
  • inst/tinytest/test_memory_optimization_fits.R
🚧 Files skipped from review as they are similar to previous changes (3)
  • inst/tinytest/test_memory_optimization_fits.R
  • R/dataProcess.R
  • R/utils_summarization.R

📝 Walkthrough

Walkthrough

This PR frees large intermediates and fitted-model internals during data processing and summarization, calls gc() before summarization, fixes a conditional in .fitSurvival(), refactors unequal-variance updates to avoid repeated data.frame copies, and adds tests that validate stripped fit objects retain required functionality while using less memory.

Changes

Memory Optimization in Data Processing and Fitting

Layer / File(s) Summary
Pipeline intermediate cleanup and GC
R/dataProcess.R
Removes large intermediates (raw, peptides_dict) immediately after use and calls gc(verbose = FALSE) before summarization. Explicitly removes per-iteration fitted objects (survival_fit, fit) in both linear and TMP summarization paths after their results are extracted.
Survival fit return cleanup and conditional fix
R/utils_imputation.R
Adds missing brace to correctly terminate a nested branch in .fitSurvival() and nulls fit$y and fit$linear.predictors on the returned survreg-like object to reduce retained memory while leaving coefficients/scale for use.
Unequal-variance adjustment refactor
R/utils_summarization.R
Refactors .updateUnequalVariances() to convert input to a data.frame once, add intermediate columns (abs.resids, fitted, loess.fitted, weight) via in-place assignments, remove them by setting to NULL, and re-fit weighted lm() each iteration — avoiding repeated full data.frame reconstruction.
Tests validating memory-optimized fits
inst/tinytest/test_memory_optimization_fits.R
Adds tests asserting that .fitSurvival() strips $y and $linear.predictors but preserves coefficients/scale and predictability; and that .fitLinearModel() strips $model while preserving coefficients/qr/residuals and summary/vcov behavior; tests compare object.size() to unstripped fits.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Vitek-Lab/MSstats#193: Modifies the same summarization pipeline (R/dataProcess.R) and may overlap with model-cleanup locations.
  • Vitek-Lab/MSstats#190: Changes survival-model fitting in R/utils_imputation.R; closely related to the .fitSurvival() branch that was fixed and stripped here.

Suggested labels

Review effort 3/5

Poem

🐰 I hopped through the code with tidy delight,

Cleared bulky fits and made memory light,
Raw crumbs swept up, predictions still sing,
Tests nod and approve — a neat little spring.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description includes custom user notes and a diagram but lacks the required template sections: 'Motivation and Context', properly filled 'Changes' with bullet points, 'Testing' details, and unchecked Checklist items. Complete the description using the template: fill 'Motivation and Context' with problem context, populate 'Changes' with detailed bullet points, describe tests in 'Testing' section, and address the Checklist items.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: freeing heavy fit objects (survreg/lm) and early freeing of large dataProcess intermediates (raw, peptides_dict).
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch MSstats/work/20260514_free-heavy-objects

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Unguarded cleanup

In MSstatsSummarizeSingleTMP, survival_fit is created inside a tryCatch and the code already handles the failure case by setting converged = FALSE. The new unconditional rm(survival_fit) runs even when fitting failed, so this path now emits object 'survival_fit' not found warnings and can become a hard failure in environments that promote warnings to errors.

rm(survival_fit)

@github-actions
Copy link
Copy Markdown

PR Code Suggestions ✨

No code suggestions found for the PR.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/utils_summarization.R`:
- Around line 208-223: The loess iterations reuse stale fitted values because
input[["fitted"]] is only set from the initial fit; after each weighted refit
(wls.fit) update the stored fitted values so subsequent loess calls use
consistent residual/fitted pairs — e.g. inside the loop after computing wls.fit
(the variable wls.fit) assign input[["fitted"]] <- fitted(wls.fit) or
input[["fitted"]] <- wls.fit$fitted.values before recalculating
input[["abs.resids"]] and proceeding to the next iteration.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7d6992f3-0ec8-4947-80c1-53b71f9e9caa

📥 Commits

Reviewing files that changed from the base of the PR and between 3fa0bd1 and 5693fc9.

📒 Files selected for processing (4)
  • R/dataProcess.R
  • R/utils_imputation.R
  • R/utils_summarization.R
  • inst/tinytest/test_memory_optimization_fits.R

Comment thread R/utils_summarization.R
Comment on lines 208 to 223
for (i in seq_len(num_iter)) {
if (i == 1) {
abs.resids = data.frame(abs.resids = abs(fit$residuals))
fitted = data.frame(fitted = fit$fitted.values)
input = data.frame(input,
"abs.resids" = abs.resids,
"fitted" = fitted)
input[["abs.resids"]] = abs(fit$residuals)
input[["fitted"]] = fit$fitted.values
}
fit.loess = loess(abs.resids ~ fitted, data = input)
loess.fitted = data.frame(loess.fitted = fitted(fit.loess))
input = data.frame(input, "loess.fitted" = loess.fitted)
## loess fitted valuaes are predicted sd
input$weight = 1 / (input$loess.fitted ^ 2)
input = input[, !(colnames(input) %in% "abs.resids")]
input[["loess.fitted"]] = fitted(fit.loess)
## loess fitted values are predicted sd
input[["weight"]] = 1 / (input[["loess.fitted"]] ^ 2)
input[["abs.resids"]] = NULL
## re-fit using weight
wls.fit = lm(formula(fit), data = input, weights = weight)
abs.resids = data.frame(abs.resids = abs(wls.fit$residuals))
input = data.frame(input, "abs.resids" = abs.resids)
input = input[, -which(colnames(input) %in% c("loess.fitted", "weight"))]
input[["abs.resids"]] = abs(wls.fit$residuals)
input[["loess.fitted"]] = NULL
input[["weight"]] = NULL
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Refresh fitted after each weighted refit.

At Line 220 you refresh abs.resids from wls.fit, but fitted is never refreshed after Line 211. For num_iter > 1, later loess iterations use stale fitted values and inconsistent residual/fitted pairs.

Suggested fix
     for (i in seq_len(num_iter)) {
         if (i == 1) {
             input[["abs.resids"]] = abs(fit$residuals)
             input[["fitted"]] = fit$fitted.values
         }
         fit.loess = loess(abs.resids ~ fitted, data = input)
         input[["loess.fitted"]] = fitted(fit.loess)
         ## loess fitted values are predicted sd
         input[["weight"]] = 1 / (input[["loess.fitted"]] ^ 2)
         input[["abs.resids"]] = NULL
         ## re-fit using weight
         wls.fit = lm(formula(fit), data = input, weights = weight)
         input[["abs.resids"]] = abs(wls.fit$residuals)
+        input[["fitted"]] = wls.fit$fitted.values
         input[["loess.fitted"]] = NULL
         input[["weight"]] = NULL
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@R/utils_summarization.R` around lines 208 - 223, The loess iterations reuse
stale fitted values because input[["fitted"]] is only set from the initial fit;
after each weighted refit (wls.fit) update the stored fitted values so
subsequent loess calls use consistent residual/fitted pairs — e.g. inside the
loop after computing wls.fit (the variable wls.fit) assign input[["fitted"]] <-
fitted(wls.fit) or input[["fitted"]] <- wls.fit$fitted.values before
recalculating input[["abs.resids"]] and proceeding to the next iteration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants