Freed heavy fit objects and large dataProcess intermediates early#207
Freed heavy fit objects and large dataProcess intermediates early#207tonywu1999 wants to merge 2 commits into
Conversation
* Stripped survreg() $y / $linear.predictors and lm() $model after fitting in .fitSurvival and .fitLinearModel; downstream predict / summary / vcov paths are unaffected. * Rewrote .updateUnequalVariances to mutate a single data.frame in place instead of rebuilding it ~5x per iteration via data.frame(...). * Added rm(raw), rm(peptides_dict), and gc(verbose = FALSE) inside dataProcess() so the ~250-400 MB held by function arguments is freed before MSstatsSummarizationOutput runs. * Added rm(survival_fit) and rm(fit) inside the per-protein linear / TMP summarizers so local fit objects drop as soon as survival / result is materialised. * Tests: inst/tinytest/test_memory_optimization.R Issue 1 (tests 1a, 1b) — 14 assertions, all green. See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md Co-Authored-By: Claude <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughThis PR frees large intermediates and fitted-model internals during data processing and summarization, calls gc() before summarization, fixes a conditional in ChangesMemory Optimization in Data Processing and Fitting
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨No code suggestions found for the PR. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@R/utils_summarization.R`:
- Around line 208-223: The loess iterations reuse stale fitted values because
input[["fitted"]] is only set from the initial fit; after each weighted refit
(wls.fit) update the stored fitted values so subsequent loess calls use
consistent residual/fitted pairs — e.g. inside the loop after computing wls.fit
(the variable wls.fit) assign input[["fitted"]] <- fitted(wls.fit) or
input[["fitted"]] <- wls.fit$fitted.values before recalculating
input[["abs.resids"]] and proceeding to the next iteration.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7d6992f3-0ec8-4947-80c1-53b71f9e9caa
📒 Files selected for processing (4)
R/dataProcess.RR/utils_imputation.RR/utils_summarization.Rinst/tinytest/test_memory_optimization_fits.R
| for (i in seq_len(num_iter)) { | ||
| if (i == 1) { | ||
| abs.resids = data.frame(abs.resids = abs(fit$residuals)) | ||
| fitted = data.frame(fitted = fit$fitted.values) | ||
| input = data.frame(input, | ||
| "abs.resids" = abs.resids, | ||
| "fitted" = fitted) | ||
| input[["abs.resids"]] = abs(fit$residuals) | ||
| input[["fitted"]] = fit$fitted.values | ||
| } | ||
| fit.loess = loess(abs.resids ~ fitted, data = input) | ||
| loess.fitted = data.frame(loess.fitted = fitted(fit.loess)) | ||
| input = data.frame(input, "loess.fitted" = loess.fitted) | ||
| ## loess fitted valuaes are predicted sd | ||
| input$weight = 1 / (input$loess.fitted ^ 2) | ||
| input = input[, !(colnames(input) %in% "abs.resids")] | ||
| input[["loess.fitted"]] = fitted(fit.loess) | ||
| ## loess fitted values are predicted sd | ||
| input[["weight"]] = 1 / (input[["loess.fitted"]] ^ 2) | ||
| input[["abs.resids"]] = NULL | ||
| ## re-fit using weight | ||
| wls.fit = lm(formula(fit), data = input, weights = weight) | ||
| abs.resids = data.frame(abs.resids = abs(wls.fit$residuals)) | ||
| input = data.frame(input, "abs.resids" = abs.resids) | ||
| input = input[, -which(colnames(input) %in% c("loess.fitted", "weight"))] | ||
| input[["abs.resids"]] = abs(wls.fit$residuals) | ||
| input[["loess.fitted"]] = NULL | ||
| input[["weight"]] = NULL | ||
| } |
There was a problem hiding this comment.
Refresh fitted after each weighted refit.
At Line 220 you refresh abs.resids from wls.fit, but fitted is never refreshed after Line 211. For num_iter > 1, later loess iterations use stale fitted values and inconsistent residual/fitted pairs.
Suggested fix
for (i in seq_len(num_iter)) {
if (i == 1) {
input[["abs.resids"]] = abs(fit$residuals)
input[["fitted"]] = fit$fitted.values
}
fit.loess = loess(abs.resids ~ fitted, data = input)
input[["loess.fitted"]] = fitted(fit.loess)
## loess fitted values are predicted sd
input[["weight"]] = 1 / (input[["loess.fitted"]] ^ 2)
input[["abs.resids"]] = NULL
## re-fit using weight
wls.fit = lm(formula(fit), data = input, weights = weight)
input[["abs.resids"]] = abs(wls.fit$residuals)
+ input[["fitted"]] = wls.fit$fitted.values
input[["loess.fitted"]] = NULL
input[["weight"]] = NULL
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@R/utils_summarization.R` around lines 208 - 223, The loess iterations reuse
stale fitted values because input[["fitted"]] is only set from the initial fit;
after each weighted refit (wls.fit) update the stored fitted values so
subsequent loess calls use consistent residual/fitted pairs — e.g. inside the
loop after computing wls.fit (the variable wls.fit) assign input[["fitted"]] <-
fitted(wls.fit) or input[["fitted"]] <- wls.fit$fitted.values before
recalculating input[["abs.resids"]] and proceeding to the next iteration.
User description
See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md
Motivation and Context
Please include relevant motivation and context of the problem along with a short summary of the solution.
Changes
Please provide a detailed bullet point list of your changes.
Testing
Please describe any unit tests you added or modified to verify your changes.
Checklist Before Requesting a Review
PR Type
Bug fix, Enhancement, Tests
Description
Free large pipeline inputs earlier
Strip heavy fields from fit objects
Avoid repeated copies in variance updates
Add fit memory regression tests
Diagram Walkthrough
File Walkthrough
dataProcess.R
Free pipeline intermediates and fit objects earlyR/dataProcess.R
rawafter initial preparationpeptides_dictafter normalizationgc(verbose = FALSE)before output assemblysurvival_fitandfitsoonerutils_imputation.R
Strip unused `survreg` fields after fittingR/utils_imputation.R
fit$yaftersurvreg()fittingfit$linear.predictorsbefore returningutils_summarization.R
Reduce `lm` footprint and variance-copy churnR/utils_summarization.R
linear_model$modelbefore returninginputtodata.frameonce upfronttest_memory_optimization_fits.R
Add regression tests for memory-stripped fitsinst/tinytest/test_memory_optimization_fits.R
survregfieldslm$modelsummary()andvcov()still work