Perf optimizations + polars 0.54 upgrade (with benchmarks)#43

Open: ptournaris-ai wants to merge 7 commits into bluegroundltd:main from ptournaris-ai:perf/polars-0.54

ptournaris-ai commented Jun 23, 2026

Summary

Performance optimizations for the anonymization hot path, plus the polars 0.48 → 0.54 upgrade, with a reproducible benchmark harness.

Commits, reviewable in order:

  1. perf: config cache + vectorized sanitize + drop per-column clone
  2. perf: scan parquet lazily from S3 with predicate/slice pushdown
  3. deps: upgrade polars 0.48.1 → 0.54.4
  4. bench: benchmark harnesses + metrics
  5. deps: update all dependencies to latest (cargo update)
  6. ci: Floci S3 integration workflow + Rust 1.95 → 1.96 toolchain bump

Performance changes

  • Memoize anonymization config: load_config_for did a disk read + TOML parse on every Parquet file; the config is now parsed once and shared as an Arc (process-wide cache).
  • Vectorize sanitize_null_bytes — replaced the row-by-row Vec<Option<String>> rebuild with a single columnar when/then/otherwise expression.
  • Drop per-column Series clone — the transform loop now moves the produced Series into the DataFrame instead of cloning it.
  • Lazy S3 scan with pushdown — replaced "download whole object into memory → ParquetReader::finish() → filter afterwards" with LazyFrame::scan_parquet(s3://…) that pushes the configured filter and keep_num_of_records limit into the read, so only matching row groups/columns are fetched and decoded. collect() runs in spawn_blocking. Credentials are resolved via the default aws-config chain (IRSA/IMDS/env/SSO) and handed to polars, matching get_object auth.
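
As a rough illustration of the config memoization described in the first bullet, here is a minimal std-only sketch. The names `AnonymizationConfig`, `parse_config_from_disk`, and `load_config_cached` are hypothetical stand-ins and not the functions in this PR; the sketch assumes the cache is keyed by a table name and that parsing is the expensive step.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for the parsed anonymization config.
#[derive(Debug)]
struct AnonymizationConfig {
    raw: String,
}

/// Counts how often the expensive path runs, so the caching is observable.
static PARSE_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for the "read file + parse TOML" step (assumed, not the real code).
fn parse_config_from_disk(table: &str) -> AnonymizationConfig {
    PARSE_COUNT.fetch_add(1, Ordering::SeqCst);
    AnonymizationConfig {
        raw: format!("config for {table}"),
    }
}

/// Process-wide cache, created lazily on first use.
fn cache() -> &'static Mutex<HashMap<String, Arc<AnonymizationConfig>>> {
    static CACHE: OnceLock<Mutex<HashMap<String, Arc<AnonymizationConfig>>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Returns the shared config, parsing it only on the first request per key.
fn load_config_cached(table: &str) -> Arc<AnonymizationConfig> {
    let mut map = cache().lock().expect("config cache poisoned");
    map.entry(table.to_owned())
        .or_insert_with(|| Arc::new(parse_config_from_disk(table)))
        .clone() // cheap: bumps the Arc refcount, no re-parse
}

fn main() {
    // Simulate 64 Parquet files that all need the same table config.
    for _ in 0..64 {
        let cfg = load_config_cached("users");
        assert_eq!(cfg.raw, "config for users");
    }
    println!("parses: {}", PARSE_COUNT.load(Ordering::SeqCst));
}
```

This sketch holds the lock while parsing, which keeps it short but serializes first-time loads; whether that matters depends on how many distinct configs are loaded concurrently.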

Measured (Criterion medians, 2M-row synthetic DataFrames, 24-core x86-64 — see docs/BENCHMARKS.md)

| Hot path | OLD | NEW | Speedup |
|---|---|---|---|
| sanitize_null_bytes | 194 ms | 18.8 ms | ~10.3× |
| read + selective filter | 16.6 ms | 11.0 ms | ~1.5× |
| read + keep_num_of_records | 15.7 ms | 3.93 ms | ~4.0× |
| config load (per 64 files) | 7.42 ms | 1.55 µs | ~4,800× |

Each benchmark compares the exact OLD vs NEW function body; equivalence tests assert NEW is byte-identical to OLD before timings are trusted. The S3 scan path is additionally validated end-to-end against a local Floci/LocalStack S3 endpoint (benchmarks/s3-integration).
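
The "equivalence before timing" gate can be sketched as follows. This is a std-only illustration of the pattern, not the Polars code in this PR: `sanitize_old` and `sanitize_new` are hypothetical stand-ins, and the assumed semantics (strip `\0` from strings, leave nulls untouched) may differ from the real sanitize_null_bytes.

```rust
use std::time::Instant;

/// OLD shape: rebuild the column row by row (assumed semantics: strip '\0').
fn sanitize_old(column: &[Option<String>]) -> Vec<Option<String>> {
    let mut out = Vec::with_capacity(column.len());
    for value in column {
        match value {
            Some(s) => out.push(Some(s.replace('\0', ""))),
            None => out.push(None),
        }
    }
    out
}

/// NEW shape: a single pass that only rewrites values containing a NUL byte,
/// mirroring a when/then/otherwise over the column (still plain Rust here).
fn sanitize_new(column: &[Option<String>]) -> Vec<Option<String>> {
    column
        .iter()
        .map(|value| {
            value.as_ref().map(|s| {
                if s.contains('\0') {
                    s.replace('\0', "")
                } else {
                    s.clone()
                }
            })
        })
        .collect()
}

/// Gate: panic if NEW diverges from OLD, so no timing is reported for a wrong result.
fn assert_equivalent(column: &[Option<String>]) {
    assert_eq!(
        sanitize_old(column),
        sanitize_new(column),
        "NEW output diverges from OLD"
    );
}

fn main() {
    let column = vec![
        Some("a\0b".to_string()),
        None,
        Some("clean".to_string()),
        Some("\0".to_string()),
    ];

    // Equivalence is checked first; timing only runs if it holds.
    assert_equivalent(&column);

    let start = Instant::now();
    let cleaned = sanitize_new(&column);
    println!("{} rows sanitized in {:?}", cleaned.len(), start.elapsed());
}
```

In a real harness the timing would come from Criterion rather than `Instant`; the point of the sketch is only the ordering, with correctness asserted before any number is recorded.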

polars 0.48 → 0.54

Required API migration (mechanical): get_columns() → columns(), get_column_names_str() → get_column_names(), scan_parquet now takes PlRefPath, ChunkedArray .into_iter() → .iter(), with_column takes Column not Series, DataFrame::new(height, cols), AnyValue::Decimal(value, precision, scale), drop new_streaming/streaming features (add regex), MSRV → 1.88.

⚠️ Dependency note: the polars bump depends on dms-cdc-operator being on polars 0.54. Until nikoshet/rust-dms-cdc-operator#41 is merged & released, this branch pins it via [patch.crates-io] to a fork branch. Do not merge the polars bump commit until that release lands; the first two perf commits are safe on polars 0.48 today and can be split off if you prefer to ship them first.

The private submodules (rustic-bg-whole-table-transformator, rustic-local-data-importer-cli) were not reachable from this environment and still need the same mechanical polars-0.54 fixes applied.

Dependency refresh

The last commit runs cargo update, moving Cargo.lock to the latest
semver-compatible versions of every dependency (aws-sdk-s3 1.88→1.137, aws-config
1.6→1.8, clap 4.5→4.6, bon 3.6→3.9, indexmap 2.9→2.14, fake 4.0→4.4, plus
transitive deps). Every direct dependency's latest published version already
satisfies the existing requirements, so no manifest version strings changed.

CI

Adds .github/workflows/s3-integration-workflow.yaml: on a GitHub-hosted runner it
stands up Floci via docker compose, waits for the S3 endpoint, and runs the
scan_parquet integration test (predicate + slice pushdown). Path-scoped to
benchmarks/s3-integration/** and the operator scan file, plus workflow_dispatch.
Also bumps the pinned Rust toolchain 1.95.0 → 1.96.0 (latest stable, 2026-05-25)
across rust-toolchain.toml and the existing workflows.

Verification

cargo check --workspace --all-targets is clean on 0.54 (with submodules trimmed locally); 14 transform-crate unit tests and the microbench equivalence tests pass; the S3 integration test passes against Floci. DB/S3-backed integration tests that need the docker-compose fixtures were not run.

Replace the eager 'download whole object into Vec<u8> then ParquetReader::finish()
then filter' path with LazyFrame::scan_parquet over s3://, pushing the configured
filter and keep_num_of_records limit into the read so only matching row groups and
needed columns are fetched and decoded. collect() runs in spawn_blocking to keep the
async runtime free and avoid a nested-runtime panic. Credentials are resolved via the
default aws-config provider chain (IRSA/IMDS/env/SSO) and handed to Polars, matching
get_object auth.

Points dms-cdc-operator at the polars-0.54 fork branch via [patch.crates-io]
until nikoshet/rust-dms-cdc-operator#41 is released (the DataframeOperator
trait's DataFrame type pins our polars version).

polars 0.54 API migration:
- drop removed 'new_streaming'/'streaming' features, add 'regex' (contains_literal)
- LazyFrame::scan_parquet now takes PlRefPath
- DataFrame::get_columns() -> columns(); get_column_names_str() -> get_column_names()
- StringChunked/Int32 iteration: .into_iter() -> .iter()
- DataFrame::with_column takes Column, not Series (Column::from)
- DataFrame::new(cols) -> DataFrame::new(height, cols) (tests)

Verified: cargo check --workspace --all-targets clean; 14 transform-crate
unit tests pass on 0.54. DB/S3 integration tests not run (need fixtures).

- benchmarks/microbench: criterion OLD-vs-NEW for the hot paths (sanitize,
  scan+filter, scan+slice, config cache) with equivalence tests gating timings
- benchmarks/s3-integration: end-to-end scan_parquet test against a local
  Floci/LocalStack S3 endpoint (docker-compose included)
- docs/BENCHMARKS.md: results table (~9.4x sanitize, ~4x record-reduction,
  ~4900x config) and how to reproduce
- workspace 'exclude' so the standalone harnesses stay out of the build graph

- hotpath.rs imported the pre-rename crate name (rw_bench); use rustic_witcher_microbench
- BENCHMARKS.md updated with medians measured on this run (24-core x86-64)

Refreshes Cargo.lock to the latest semver-compatible versions of every
dependency (aws-sdk-s3 1.88->1.137, aws-config 1.6->1.8, clap 4.5->4.6,
bon 3.6->3.9, indexmap 2.9->2.14, fake 4.0->4.4, and transitive deps).
All direct deps' latest published versions already satisfy the existing
version requirements, so no manifest version bumps were needed.

Verified: cargo check --workspace --all-targets clean; 14 transform-crate
unit tests pass on the updated lock.

- new s3-integration-workflow.yaml: stands up Floci via docker compose on a
  GitHub-hosted runner and runs the scan_parquet integration test (path-scoped
  to benchmarks/s3-integration + the operator scan file; workflow_dispatch too)
- bump pinned Rust 1.95.0 -> 1.96.0 (latest stable, 2026-05-25) across
  rust-toolchain.toml, tests-workflow.yaml, format-workflow.yaml, and the new
  workflow