[codex] replace backfill chunking with smart planner#107
[codex] replace backfill chunking with smart planner#107
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fbd3a19868
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (sortKey.category === 'string') { | ||
| return `unhex('${Buffer.from(value, 'latin1').toString('hex')}')` | ||
| } |
There was a problem hiding this comment.
Encode string SQL bounds with UTF-8 bytes
The string branch in formatBound builds predicates from Buffer.from(value, 'latin1'), which changes the byte sequence for UTF-8 text values. For example, a bound like é is emitted as unhex('e9') even though ClickHouse String stores UTF-8 bytes (c3a9), so range filters can miss rows or create overlapping chunks whenever sort-key values contain non-ASCII characters. This directly affects both planning probes and execution SQL that rely on these bounds.
Useful? React with 👍 / 👎.
| if (sortKey.boundaryEncoding === 'hex-latin1') { | ||
| return Buffer.from(value, 'latin1').toString('hex') | ||
| } |
There was a problem hiding this comment.
Preserve UTF-8 when persisting string chunk bounds
encodeBoundary serializes string boundaries with Buffer.from(value, 'latin1'), which is lossy for non-Latin-1 data (e.g., emoji and many Unicode code points). Because plans are written with this encoding and later decoded for run/resume, persisted chunk ranges can mutate after round-trip and cause resumed backfills to read the wrong key intervals.
Useful? React with 👍 / 👎.
|
Superseded by #108, which is a strict superset containing the critical correctness fixes (uncompressed-bytes sizing, parallel-replicas count inflation, string boundary truncation) plus E2E coverage. |
Summary
@chkit/plugin-backfill/sdkinstead of the package rootWhy
The previous backfill planner only handled coarse partition splitting with a single sort-key range fallback. The new smart planner supports string-prefix, temporal, quantile, and equal-width strategies, including secondary-dimension handoff for hot keys, which matches the new prototype behavior and gives much better confidence in expected chunk layouts.
Impact
Backfill planning now produces smarter chunk boundaries for skewed data and multi-column sort keys.
Public API note: planner/runtime internals are no longer exported from
@chkit/plugin-backfill. They now live under@chkit/plugin-backfill/sdk.Root Cause
The older planner could only split by partition and a single coarse sort-key range, which broke down on hot keys and skewed distributions. During review, a few edge cases also surfaced in the first port: collapsed equal-width boundaries, brittle
sorting_keyparsing for expression-based sort keys, and temporal refinement extending past parent slice bounds.Validation
bun test /Users/marc/Workspace/chkit/packages/plugin-backfill/srcbun run --cwd /Users/marc/Workspace/chkit/packages/plugin-backfill typecheckbun verify