Perf optimizations + polars 0.54 upgrade (with benchmarks) by ptournaris-ai · Pull Request #43 · bluegroundltd/rustic-witcher

ptournaris-ai · 2026-06-23T18:38:53Z

Summary

Performance optimizations for the anonymization hot path, plus the polars 0.48 → 0.54 upgrade, with a reproducible benchmark harness.

Six commits, reviewable in order (a seventh, follow-up commit fixes the benchmark harness crate import):

perf: config cache + vectorized sanitize + drop per-column clone
perf: scan parquet lazily from S3 with predicate/slice pushdown
deps: upgrade polars 0.48.1 → 0.54.4
bench: benchmark harnesses + metrics
deps: update all dependencies to latest (cargo update)
ci: Floci S3 integration workflow + Rust 1.95 → 1.96 toolchain bump

Performance changes

Memoize anonymization config — load_config_for did a disk read + TOML parse on every Parquet file; now parsed once and shared as an Arc (process-wide cache).
Vectorize sanitize_null_bytes — replaced the row-by-row Vec<Option<String>> rebuild with a single columnar when/then/otherwise expression.
Drop per-column Series clone — the transform loop now moves the produced Series into the DataFrame instead of cloning it.
Lazy S3 scan with pushdown — replaced "download whole object into memory → ParquetReader::finish() → filter afterwards" with LazyFrame::scan_parquet(s3://…) that pushes the configured filter and keep_num_of_records limit into the read, so only matching row groups/columns are fetched and decoded. collect() runs in spawn_blocking. Credentials are resolved via the default aws-config chain (IRSA/IMDS/env/SSO) and handed to polars, matching get_object auth.
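
The config memoization in the first item can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code from this PR: `AnonymizationConfig`, `cached_config`, and the loader closure are hypothetical names, and the closure stands in for the disk read plus TOML parse.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for the parsed anonymization config.
#[derive(Debug, PartialEq)]
pub struct AnonymizationConfig {
    pub raw: String,
}

/// Process-wide cache keyed by config path (illustrative only).
static CONFIG_CACHE: OnceLock<Mutex<HashMap<PathBuf, Arc<AnonymizationConfig>>>> =
    OnceLock::new();

/// Returns the cached config for `path`; `load` runs only on the first request.
/// The lock is held while loading, so concurrent callers wait instead of
/// parsing the same file twice.
pub fn cached_config<F>(path: &Path, load: F) -> Arc<AnonymizationConfig>
where
    F: FnOnce(&Path) -> AnonymizationConfig,
{
    let cache = CONFIG_CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut guard = cache.lock().expect("config cache mutex poisoned");
    if let Some(hit) = guard.get(path) {
        // Cache hit: only a reference count is bumped, nothing is re-parsed.
        return Arc::clone(hit);
    }
    let parsed = Arc::new(load(path));
    guard.insert(path.to_path_buf(), Arc::clone(&parsed));
    parsed
}

fn main() {
    use std::sync::atomic::{AtomicUsize, Ordering};

    let loads = AtomicUsize::new(0);
    let path = Path::new("configs/example.toml"); // hypothetical path
    for _ in 0..64 {
        let _cfg = cached_config(path, |p| {
            loads.fetch_add(1, Ordering::Relaxed);
            AnonymizationConfig { raw: format!("parsed from {}", p.display()) }
        });
    }
    println!("loader ran {} time(s) for 64 lookups", loads.load(Ordering::Relaxed));
}
```

Handing out an `Arc` keeps each per-file lookup at the cost of a map probe and a reference-count increment, which is in line with the microsecond-scale figure reported for 64 files.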

Measured (Criterion medians, 2M-row synthetic DataFrames, 24-core x86-64 — see `docs/BENCHMARKS.md`)

| Hot path | OLD | NEW | Speedup |
| --- | --- | --- | --- |
| `sanitize_null_bytes` | 194 ms | 18.8 ms | ~10.3× |
| read + selective filter | 16.6 ms | 11.0 ms | ~1.5× |
| read + `keep_num_of_records` | 15.7 ms | 3.93 ms | ~4.0× |
| config load (per 64 files) | 7.42 ms | 1.55 µs | ~4,800× |
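
The Speedup column is the OLD median divided by the NEW median after converting both to the same unit. A small sketch that recomputes the ratios from the table (the `speedup` helper is a hypothetical name used only here):

```rust
/// Speedup of NEW over OLD; both arguments must use the same unit (seconds here).
fn speedup(old_secs: f64, new_secs: f64) -> f64 {
    old_secs / new_secs
}

fn main() {
    // (label, OLD median, NEW median) in seconds, copied from the table.
    let rows = [
        ("sanitize_null_bytes", 194e-3, 18.8e-3),
        ("read + selective filter", 16.6e-3, 11.0e-3),
        ("read + keep_num_of_records", 15.7e-3, 3.93e-3),
        ("config load (per 64 files)", 7.42e-3, 1.55e-6), // ms vs µs
    ];
    for (label, old, new) in rows {
        println!("{label}: {:.1}x", speedup(old, new));
    }
}
```

The last row mixes milliseconds and microseconds, which is why its ratio lands in the thousands rather than single digits.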

Each benchmark compares the exact OLD vs NEW function body; equivalence tests assert NEW is byte-identical to OLD before timings are trusted. The S3 scan path is additionally validated end-to-end against a local Floci/LocalStack S3 endpoint (benchmarks/s3-integration).
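
The "equivalence before timing" gate can be sketched roughly as follows. This is an illustration only: it assumes `sanitize_null_bytes` strips `\0` characters from nullable strings, it uses plain `Vec<Option<String>>` instead of Polars columns, and `sanitize_old`, `sanitize_new`, and `assert_equivalent` are hypothetical names rather than the harness's actual functions.

```rust
/// "OLD" shape: rebuild every value character by character.
fn sanitize_old(values: &[Option<String>]) -> Vec<Option<String>> {
    values
        .iter()
        .map(|v| {
            v.as_ref()
                .map(|s| s.chars().filter(|c| *c != '\0').collect::<String>())
        })
        .collect()
}

/// "NEW" shape: one bulk replace per value (stand-in for the columnar expression).
fn sanitize_new(values: &[Option<String>]) -> Vec<Option<String>> {
    values
        .iter()
        .map(|v| v.as_deref().map(|s| s.replace('\0', "")))
        .collect()
}

/// Gate: panics unless both implementations agree on `input`.
fn assert_equivalent(input: &[Option<String>]) {
    assert_eq!(sanitize_old(input), sanitize_new(input), "NEW diverges from OLD");
}

fn main() {
    // Cover nulls, empty strings, clean strings, and embedded null bytes.
    let input = vec![
        Some("clean".to_string()),
        Some("with\0null\0bytes".to_string()),
        Some(String::new()),
        None,
        Some("\0".to_string()),
    ];
    assert_equivalent(&input);
    println!("outputs match; timings can now be compared");
}
```

Running the gate first means a speedup that comes from silently changed output fails loudly instead of being recorded as a win.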

polars 0.48 → 0.54

Required API migration (mechanical): get_columns()→columns(), get_column_names_str()→get_column_names(), scan_parquet now takes PlRefPath, ChunkedArray .into_iter()→.iter(), with_column takes Column not Series, DataFrame::new(height, cols), AnyValue::Decimal(value, precision, scale), drop new_streaming/streaming features (add regex), MSRV → 1.88.

⚠️ Dependency note: the polars bump depends on dms-cdc-operator being on polars 0.54. Until nikoshet/rust-dms-cdc-operator#41 is merged & released, this branch pins it via [patch.crates-io] to a fork branch. Do not merge the polars bump commit until that release lands; the first two perf commits are safe on polars 0.48 today and can be split off if you prefer to ship them first.

The private submodules (rustic-bg-whole-table-transformator, rustic-local-data-importer-cli) were not reachable from this environment and still need the same mechanical polars-0.54 fixes applied.

Dependency refresh

The last commit runs cargo update, moving Cargo.lock to the latest
semver-compatible versions of every dependency (aws-sdk-s3 1.88→1.137, aws-config
1.6→1.8, clap 4.5→4.6, bon 3.6→3.9, indexmap 2.9→2.14, fake 4.0→4.4, plus
transitive deps). Every direct dependency's latest published version already
satisfies the existing requirements, so no manifest version strings changed.
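
Why `cargo update` alone could move these crates while polars 0.48 → 0.54 needed its own commit follows from Cargo's default (caret) requirement rules. The sketch below models those rules for full three-part versions only, ignoring pre-release tags; `caret_matches` is a hypothetical helper, and using the old lockfile versions as the manifest requirements is an assumption made purely for illustration.

```rust
/// (major, minor, patch)
type Version = (u64, u64, u64);

/// Whether `candidate` satisfies the caret requirement `^req`.
/// The leftmost non-zero component of `req` must not change.
fn caret_matches(req: Version, candidate: Version) -> bool {
    if candidate < req {
        return false; // tuples compare lexicographically, like versions
    }
    match req {
        (0, 0, _) => candidate == req, // ^0.0.z allows only that exact patch
        (0, minor, _) => candidate.0 == 0 && candidate.1 == minor,
        (major, _, _) => candidate.0 == major,
    }
}

fn main() {
    // Assumed requirements, taken from the old lockfile versions above.
    println!("aws-sdk-s3 1.88 -> 1.137: {}", caret_matches((1, 88, 0), (1, 137, 0)));
    println!("clap 4.5 -> 4.6: {}", caret_matches((4, 5, 0), (4, 6, 0)));
    // For 0.x crates a minor bump is breaking, so this needs a manifest edit.
    println!("polars 0.48.1 -> 0.54.4: {}", caret_matches((0, 48, 1), (0, 54, 4)));
}
```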

CI

Adds .github/workflows/s3-integration-workflow.yaml: on a GitHub-hosted runner it
stands up Floci via docker compose, waits for the S3 endpoint, and runs the
scan_parquet integration test (predicate + slice pushdown). Path-scoped to
benchmarks/s3-integration/** and the operator scan file, plus workflow_dispatch.
Also bumps the pinned Rust toolchain 1.95.0 → 1.96.0 (latest stable, 2026-05-25)
across rust-toolchain.toml and the existing workflows.

Verification

cargo check --workspace --all-targets is clean on 0.54 (with submodules trimmed locally); 14 transform-crate unit tests and the microbench equivalence tests pass; the S3 integration test passes against Floci. DB/S3-backed integration tests that need the docker-compose fixtures were not run.

Replace the eager 'download whole object into Vec<u8> then ParquetReader::finish() then filter' path with LazyFrame::scan_parquet over s3://, pushing the configured filter and keep_num_of_records limit into the read so only matching row groups and needed columns are fetched and decoded. collect() runs in spawn_blocking to keep the async runtime free and avoid a nested-runtime panic. Credentials are resolved via the default aws-config provider chain (IRSA/IMDS/env/SSO) and handed to Polars, matching get_object auth.
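
The difference between "filter afterwards" and pushdown can be illustrated with a toy model. This is not Polars code and makes no claim about how Polars implements pruning: `RowGroup`, `read_eager`, `read_pushdown`, and `sample_groups` are hypothetical, the predicate is fixed to `value >= threshold`, and each row group carries only a `max` statistic.

```rust
/// Toy row group: one integer column plus its max statistic.
struct RowGroup {
    max: i64,
    rows: Vec<i64>,
}

/// Eager shape: decode every row group, then filter, then truncate.
/// Returns (kept rows, number of row groups decoded).
fn read_eager(groups: &[RowGroup], threshold: i64, limit: usize) -> (Vec<i64>, usize) {
    let all: Vec<i64> = groups.iter().flat_map(|g| g.rows.iter().copied()).collect();
    let kept = all.into_iter().filter(|v| *v >= threshold).take(limit).collect();
    (kept, groups.len())
}

/// Pushdown shape: skip row groups whose statistics rule out a match and
/// stop once `limit` rows have been collected.
fn read_pushdown(groups: &[RowGroup], threshold: i64, limit: usize) -> (Vec<i64>, usize) {
    let mut kept = Vec::new();
    let mut decoded = 0;
    for g in groups {
        if kept.len() >= limit {
            break; // slice pushdown: enough rows, stop reading
        }
        if g.max < threshold {
            continue; // predicate pushdown: no row here can match
        }
        decoded += 1;
        let remaining = limit - kept.len();
        kept.extend(g.rows.iter().copied().filter(|v| *v >= threshold).take(remaining));
    }
    (kept, decoded)
}

/// Small fixture shared by `main` and any checks.
fn sample_groups() -> Vec<RowGroup> {
    vec![
        RowGroup { max: 3, rows: vec![1, 2, 3] },
        RowGroup { max: 12, rows: vec![10, 11, 12] },
        RowGroup { max: 20, rows: vec![4, 20, 5] },
        RowGroup { max: 31, rows: vec![30, 31] },
    ]
}

fn main() {
    let groups = sample_groups();
    let (eager_rows, eager_decoded) = read_eager(&groups, 10, 4);
    let (lazy_rows, lazy_decoded) = read_pushdown(&groups, 10, 4);
    assert_eq!(eager_rows, lazy_rows); // same answer, less work
    println!("eager decoded {eager_decoded} groups, pushdown decoded {lazy_decoded}");
}
```

In this fixture both paths return the same rows, but the pushdown path decodes two of the four row groups: one is skipped by its statistic and one is never reached because the limit is already met.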

Points dms-cdc-operator at the polars-0.54 fork branch via [patch.crates-io] until nikoshet/rust-dms-cdc-operator#41 is released (the DataframeOperator trait's DataFrame type pins our polars version).

polars 0.54 API migration:
- drop removed 'new_streaming'/'streaming' features, add 'regex' (contains_literal)
- LazyFrame::scan_parquet now takes PlRefPath
- DataFrame::get_columns() -> columns(); get_column_names_str() -> get_column_names()
- StringChunked/Int32 iteration: .into_iter() -> .iter()
- DataFrame::with_column takes Column, not Series (Column::from)
- DataFrame::new(cols) -> DataFrame::new(height, cols) (tests)

Verified: cargo check --workspace --all-targets clean; 14 transform-crate unit tests pass on 0.54. DB/S3 integration tests not run (need fixtures).

- benchmarks/microbench: criterion OLD-vs-NEW for the hot paths (sanitize, scan+filter, scan+slice, config cache) with equivalence tests gating timings
- benchmarks/s3-integration: end-to-end scan_parquet test against a local Floci/LocalStack S3 endpoint (docker-compose included)
- docs/BENCHMARKS.md: results table (~9.4x sanitize, ~4x record-reduction, ~4900x config) and how to reproduce
- workspace 'exclude' so the standalone harnesses stay out of the build graph

- hotpath.rs imported the pre-rename crate name (rw_bench); use rustic_witcher_microbench
- BENCHMARKS.md updated with medians measured on this run (24-core x86-64)

Refreshes Cargo.lock to the latest semver-compatible versions of every dependency (aws-sdk-s3 1.88->1.137, aws-config 1.6->1.8, clap 4.5->4.6, bon 3.6->3.9, indexmap 2.9->2.14, fake 4.0->4.4, and transitive deps). All direct deps' latest published versions already satisfy the existing version requirements, so no manifest version bumps were needed. Verified: cargo check --workspace --all-targets clean; 14 transform-crate unit tests pass on the updated lock.

- new s3-integration-workflow.yaml: stands up Floci via docker compose on a GitHub-hosted runner and runs the scan_parquet integration test (path-scoped to benchmarks/s3-integration + the operator scan file; workflow_dispatch too)
- bump pinned Rust 1.95.0 -> 1.96.0 (latest stable, 2026-05-25) across rust-toolchain.toml, tests-workflow.yaml, format-workflow.yaml, and the new workflow

ptournaris-ai added 4 commits June 23, 2026 18:56

perf: cache anonymization config, vectorise null-byte sanitization, drop per-column Series clone

020deef

ptournaris-ai requested a review from a team as a code owner June 23, 2026 18:38

ptournaris-ai mentioned this pull request Jun 23, 2026

Perf optimizations + polars 0.54 upgrade (with benchmarks) pavlospt/rustic-witcher#1

Closed

ptournaris-ai added 3 commits June 23, 2026 21:44

bench: fix harness crate import + record measured medians

bce70d7

ptournaris-ai wants to merge 7 commits into bluegroundltd:main from ptournaris-ai:perf/polars-0.54
