Perf optimizations + polars 0.54 upgrade (with benchmarks) #1

Closed
ptournaris-ai wants to merge 4 commits into pavlospt:main from ptournaris-ai:perf/polars-0.54

@ptournaris-ai

Summary

Performance optimizations for the anonymization hot path, plus the polars 0.48 → 0.54 upgrade, with a reproducible benchmark harness.

Four commits, reviewable in order:

  1. perf: config cache + vectorized sanitize + drop per-column clone
  2. perf: scan parquet lazily from S3 with predicate/slice pushdown
  3. deps: upgrade polars 0.48.1 → 0.54.4
  4. bench: benchmark harnesses + metrics

Performance changes

  • Memoize anonymization config: load_config_for did a disk read + TOML parse on every Parquet file; the config is now parsed once and shared as an Arc (process-wide cache).
  • Vectorize sanitize_null_bytes — replaced the row-by-row Vec<Option<String>> rebuild with a single columnar when/then/otherwise expression.
  • Drop per-column Series clone — the transform loop now moves the produced Series into the DataFrame instead of cloning it.
  • Lazy S3 scan with pushdown — replaced "download whole object into memory → ParquetReader::finish() → filter afterwards" with LazyFrame::scan_parquet(s3://…) that pushes the configured filter and keep_num_of_records limit into the read, so only matching row groups/columns are fetched and decoded. collect() runs in spawn_blocking. Credentials are resolved via the default aws-config chain (IRSA/IMDS/env/SSO) and handed to polars, matching get_object auth.
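The config memoization in the first bullet can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `AnonymizationConfig`, `CONFIG_CACHE`, and `cached_config` are hypothetical names, and the loader is passed in as a closure so the sketch needs neither the file system nor a TOML parser.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for the parsed anonymization config.
#[derive(Debug, PartialEq)]
pub struct AnonymizationConfig {
    pub raw: String,
}

/// Process-wide cache keyed by config path (assumed layout, for illustration).
static CONFIG_CACHE: OnceLock<Mutex<HashMap<PathBuf, Arc<AnonymizationConfig>>>> =
    OnceLock::new();

/// Returns the cached config for `path`, calling `load` only on the first request.
/// In a real service `load` would do the disk read and TOML parse.
pub fn cached_config<F>(path: &Path, load: F) -> Arc<AnonymizationConfig>
where
    F: FnOnce(&Path) -> AnonymizationConfig,
{
    let cache = CONFIG_CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut guard = cache.lock().expect("config cache mutex poisoned");

    // Cache hit: hand out another reference to the already parsed config.
    if let Some(hit) = guard.get(path) {
        return Arc::clone(hit);
    }

    // Cache miss: parse once while holding the lock, then share the result.
    let parsed = Arc::new(load(path));
    guard.insert(path.to_path_buf(), Arc::clone(&parsed));
    parsed
}
```

Holding the lock during the load keeps the sketch simple and guarantees one parse per path, at the cost of serializing concurrent first loads. That trade-off is an assumption of this sketch, not something the PR states.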

Measured (medians, 2M rows — see docs/BENCHMARKS.md)

| Hot path | OLD | NEW | Speedup |
|---|---|---|---|
| sanitize_null_bytes | 179 ms | 19 ms | ~9.4× |
| read + selective filter | 17.0 ms | 12.4 ms | ~1.4× |
| read + keep_num_of_records | 16.1 ms | 4.0 ms | ~4.0× |
| config load (per 64 files) | 7.46 ms | 1.53 µs | ~4,900× |

Each benchmark compares the exact OLD vs NEW function body; equivalence tests assert NEW is byte-identical to OLD before timings are trusted. The S3 scan path is additionally validated end-to-end against a local Floci/LocalStack S3 endpoint (benchmarks/s3-integration).
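The "equivalence before timing" gate can be sketched as follows. This is a std-only illustration under assumptions: `sanitize_old` and `sanitize_new` are hypothetical stand-ins operating on `Vec<Option<String>>`, not the PR's polars-based function bodies, and `gated_compare` is an invented helper rather than part of the criterion harness.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical OLD body: rebuilds every value row by row.
fn sanitize_old(rows: &[Option<String>]) -> Vec<Option<String>> {
    rows.iter()
        .map(|v| v.as_ref().map(|s| s.replace('\0', "")))
        .collect()
}

/// Hypothetical NEW body: copies the column, then edits only values containing NUL.
fn sanitize_new(rows: &[Option<String>]) -> Vec<Option<String>> {
    let mut out = rows.to_vec();
    for s in out.iter_mut().flatten() {
        if s.contains('\0') {
            s.retain(|c| c != '\0');
        }
    }
    out
}

/// Median wall-clock time of `f` over `runs` calls (`runs` must be non-zero).
fn median_time<F: FnMut()>(runs: usize, mut f: F) -> Duration {
    let mut samples: Vec<Duration> = (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect();
    samples.sort();
    samples[samples.len() / 2]
}

/// Returns (old, new) medians, or None when the outputs differ,
/// so timings are never reported for a non-equivalent rewrite.
pub fn gated_compare(rows: &[Option<String>], runs: usize) -> Option<(Duration, Duration)> {
    if runs == 0 || sanitize_old(rows) != sanitize_new(rows) {
        return None;
    }
    let old = median_time(runs, || {
        black_box(sanitize_old(rows));
    });
    let new = median_time(runs, || {
        black_box(sanitize_new(rows));
    });
    Some((old, new))
}
```

The point of the sketch is the ordering: the equality check runs first, and a mismatch suppresses the measurement entirely. The timings it produces say nothing about the figures in the table.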

polars 0.48 → 0.54

Required API migration (mechanical): get_columns() → columns(), get_column_names_str() → get_column_names(), scan_parquet now takes PlRefPath, ChunkedArray .into_iter() → .iter(), with_column takes Column not Series, DataFrame::new(height, cols), AnyValue::Decimal(value, precision, scale), drop new_streaming/streaming features (add regex), MSRV → 1.88.

⚠️ Dependency note: the polars bump depends on dms-cdc-operator being on polars 0.54. Until nikoshet/rust-dms-cdc-operator#41 is merged & released, this branch pins it via [patch.crates-io] to a fork branch. Do not merge the polars bump commit until that release lands; the first two perf commits are safe on polars 0.48 today and can be split off if you prefer to ship them first.

The private submodules (rustic-bg-whole-table-transformator, rustic-local-data-importer-cli) were not reachable from this environment and still need the same mechanical polars-0.54 fixes applied.

Verification

cargo check --workspace --all-targets is clean on 0.54 (with submodules trimmed locally); 14 transform-crate unit tests and the microbench equivalence tests pass; the S3 integration test passes against Floci. DB/S3-backed integration tests that need the docker-compose fixtures were not run.

Replace the eager 'download whole object into Vec<u8> then ParquetReader::finish()
then filter' path with LazyFrame::scan_parquet over s3://, pushing the configured
filter and keep_num_of_records limit into the read so only matching row groups and
needed columns are fetched and decoded. collect() runs in spawn_blocking to keep the
async runtime free and avoid a nested-runtime panic. Credentials are resolved via the
default aws-config provider chain (IRSA/IMDS/env/SSO) and handed to Polars, matching
get_object auth.
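Why pushing the filter and limit into the read helps can be shown with a row-level analogy built on plain iterators. This is only an analogy under stated assumptions: `Row`, `decode`, `eager`, and `lazy` are hypothetical, and real Parquet pushdown works on row groups and columns rather than single rows.

```rust
use std::cell::Cell;

/// Hypothetical row type for the illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub id: u32,
    pub active: bool,
}

/// Stands in for fetching and decoding one row; counts how often it is called.
fn decode(id: u32, decoded: &Cell<usize>) -> Row {
    decoded.set(decoded.get() + 1);
    Row { id, active: id % 2 == 0 }
}

/// Eager path: decode everything, then filter and truncate afterwards.
pub fn eager(total: u32, limit: usize, decoded: &Cell<usize>) -> Vec<Row> {
    let all: Vec<Row> = (0..total).map(|id| decode(id, decoded)).collect();
    all.into_iter().filter(|r| r.active).take(limit).collect()
}

/// Lazy path: filter and limit are part of the pipeline,
/// so decoding stops as soon as `limit` matching rows are found.
pub fn lazy(total: u32, limit: usize, decoded: &Cell<usize>) -> Vec<Row> {
    (0..total)
        .map(|id| decode(id, decoded))
        .filter(|r| r.active)
        .take(limit)
        .collect()
}
```

Both functions return the same rows. The difference is the number of `decode` calls, which in the lazy version depends on the limit rather than on the total input size.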
Points dms-cdc-operator at the polars-0.54 fork branch via [patch.crates-io]
until nikoshet/rust-dms-cdc-operator#41 is released (the DataframeOperator
trait's DataFrame type pins our polars version).

polars 0.54 API migration:
- drop removed 'new_streaming'/'streaming' features, add 'regex' (contains_literal)
- LazyFrame::scan_parquet now takes PlRefPath
- DataFrame::get_columns() -> columns(); get_column_names_str() -> get_column_names()
- StringChunked/Int32 iteration: .into_iter() -> .iter()
- DataFrame::with_column takes Column, not Series (Column::from)
- DataFrame::new(cols) -> DataFrame::new(height, cols) (tests)

Verified: cargo check --workspace --all-targets clean; 14 transform-crate
unit tests pass on 0.54. DB/S3 integration tests not run (need fixtures).
- benchmarks/microbench: criterion OLD-vs-NEW for the hot paths (sanitize,
  scan+filter, scan+slice, config cache) with equivalence tests gating timings
- benchmarks/s3-integration: end-to-end scan_parquet test against a local
  Floci/LocalStack S3 endpoint (docker-compose included)
- docs/BENCHMARKS.md: results table (~9.4x sanitize, ~4x record-reduction,
  ~4900x config) and how to reproduce
- workspace 'exclude' so the standalone harnesses stay out of the build graph
@ptournaris-ai

Superseded by upstream PR bluegroundltd#43.
