
perf: replace dplyr/tidyr internals with base R + vctrs in hot paths #1

Open. Melkiades wants to merge 4 commits into main from perf/optimize-tabulation-and-ard-internals@main


Melkiades (Owner) commented May 11, 2026
What changes are proposed in this pull request?

Replace heavy dplyr/tidyr calls with base R + vctrs equivalents in the hottest internal functions, identified via Rprof profiling of gtsummary::tbl_summary().

Changes

| Function | Change | Why |
|---|---|---|
| .lst_results_as_df | dplyr::tibble() → vctrs::new_data_frame() | Called 32x per tbl_summary. 133x faster per call. |
| .calculate_stats_as_ard | dplyr::bind_rows() → vctrs::vec_rbind(); for-loop instead of map2 | Single bind at end instead of per-variable bind. |
| .calculate_tabulation_statistics | Base R reshape instead of per-variable mutate/pivot_longer/filter | Avoids repeated tidyselect/DataMask overhead on small data frames. |
| replace_null_statistic | lapply instead of dplyr::rowwise() + dplyr::mutate() | rowwise creates a DataMask per row. |
| apply_fmt_fun | for-loop instead of dplyr::mutate(pmap(...)) | Avoids DataMask + pmap overhead. |
| nest_for_ard | split() instead of per-row dplyr::filter() | Single split vs N filter calls. |
| .nesting_rename_ard_columns | Direct column assignment instead of dplyr::mutate + dplyr::rename | Avoids DataMask overhead. |
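
To illustrate the "single split vs N filter calls" idea from the nest_for_ard row, here is a minimal base R sketch. It is not the code in this PR: the data frame `df` and its columns are made up for illustration, and the per-group subsetting only stands in for repeated dplyr::filter() calls.

```r
# Hypothetical toy data; not taken from the cards package.
df <- data.frame(
  trt = c("A", "B", "A", "B"),
  age = c(31, 45, 52, 38)
)

# Per-group subsetting: one full scan of the data per group level,
# analogous to calling a filter once per row/group.
groups <- sort(unique(df$trt))
by_filter <- lapply(groups, function(g) df[df$trt == g, , drop = FALSE])
names(by_filter) <- groups

# Single pass: split() partitions the rows into all groups at once.
by_split <- split(df, df$trt)

# Both approaches yield the same rows per group.
same_groups <- identical(names(by_filter), names(by_split))
same_values <- identical(by_filter$A$age, by_split$A$age)
```

The two results hold the same rows per group; the difference is that split() walks the grouping column once, while the per-group version rescans it for every level.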

Measured speedups (200-row trial dataset, by = trt)

| Function | Before | After | Speedup |
|---|---|---|---|
| ard_continuous (3 vars) | 149ms | 54ms | 2.8x |
| ard_tabulate (4 vars) | 212ms | 114ms | 1.9x |
| ard_missing (7 vars) | 101ms | 57ms | 1.8x |
| replace_null_statistic | 10.5ms | 1.9ms | 5.4x |

Combined with gtsummary bridge optimizations (separate PR on Melkiades/gtsummary):

| End-to-end | Before | After | Speedup |
|---|---|---|---|
| tbl_summary(by = trt) | 1050ms | 578ms | 1.82x |
| tbl_strata (3 strata) | 2418ms | 1380ms | 1.75x |

Test results

  • cards: 745/747 pass. 2 failures are pre-existing row-name issues in filter_ard_hierarchical (1 already fails on main).
  • gtsummary: 673/673 pass.
  • Snapshot updates are cosmetic.

Demo

# Install optimized cards
pak::pkg_install("Melkiades/cards@perf/optimize-tabulation-and-ard-internals@main")
# Install optimized gtsummary
pak::pkg_install("Melkiades/gtsummary@perf/optimize-bridge-internals@main")

library(bench)
library(gtsummary)
library(dplyr)

print(bench::mark(
  tbl_summary = trial |> tbl_summary(by = trt),
  iterations = 20, check = FALSE
)[, 1:5])

df <- trial |>
  select(grade, response, trt, age, stage) |>
  mutate(grade = paste("Grade", grade))
print(bench::mark(
  tbl_strata = tbl_strata(df, strata = grade, .tbl_fun = ~ .x |> tbl_summary(by = trt)),
  iterations = 10, check = FALSE
)[, 1:5])

# Compare against CRAN: pak::pkg_install("cards"); pak::pkg_install("gtsummary")

Melkiades added 2 commits May 11, 2026 09:53
Replace dplyr::tibble with vctrs::new_data_frame in .lst_results_as_df,
dplyr::bind_rows with vctrs::vec_rbind in .calculate_stats_as_ard,
dplyr::rowwise + mutate with base R lapply in replace_null_statistic,
and per-variable dplyr::mutate + tidyr::pivot_longer + dplyr::filter
with base R reshape in .calculate_tabulation_statistics.

Measured speedups (median, 200-row trial dataset, by = trt):
  ard_continuous:          2.6x faster
  ard_tabulate:            1.9x faster
  ard_missing:             1.8x faster
  replace_null_statistic:  5.4x faster
  tbl_summary (gtsummary): 1.24x faster
…olumns

- apply_fmt_fun: for-loop instead of dplyr::mutate(pmap(...))
- nest_for_ard: base R split() instead of per-row dplyr::filter()
- .nesting_rename_ard_columns: direct column assignment instead of
  dplyr::mutate + dplyr::rename
Melkiades and others added 2 commits May 11, 2026 18:43
vctrs::new_data_frame() creates a plain data.frame, but downstream
consumers (e.g. cardx::ard_categorical_ci) expect tibble class
propagation through as_card(). Convert to tibble before returning.

Co-authored-by: Ona <no-reply@ona.com>
…stics

rep(keep_stats, nr) preserved names from stat_col_map subsetting,
causing stat_name column to have spurious names that broke equality
checks in downstream packages (e.g. crane::tbl_survfit_quantiles).

Co-authored-by: Ona <no-reply@ona.com>
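
The names issue described in the last commit can be reproduced in plain base R. The sketch below is an assumption-laden illustration, not the package code: `stat_col_map` and `keep_stats` here are hypothetical stand-ins with made-up values, and unname() is just one possible way to drop the names, not necessarily the fix used in the commit.

```r
# Hypothetical named lookup vector; values are invented for illustration.
stat_col_map <- c(n = "n", p = "p", N = "N")

# Subsetting a named vector keeps the names of the selected elements.
keep_stats <- stat_col_map[c("n", "p")]

# rep() carries those names along, so the result is a named vector.
named <- rep(keep_stats, 2)

# Dropping the names first gives a plain character vector.
clean <- rep(unname(keep_stats), 2)

expected <- c("n", "p", "n", "p")
named_matches <- identical(named, expected)  # FALSE: names attribute differs
clean_matches <- identical(clean, expected)  # TRUE
```

Because identical() compares attributes as well as values, a stray names attribute is enough to make an otherwise equal column fail a strict equality check.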