Skip to content

[codex] replace backfill chunking with smart planner#107

Closed
KeKs0r wants to merge 5 commits intomainfrom
new-smart-chunking
Closed

[codex] replace backfill chunking with smart planner#107
KeKs0r wants to merge 5 commits intomainfrom
new-smart-chunking

Conversation

@KeKs0r
Copy link
Copy Markdown
Contributor

@KeKs0r KeKs0r commented Apr 1, 2026

Summary

  • replace the old backfill partition/range chunking path with the new smart recursive chunk planner
  • expose planner/runtime internals from @chkit/plugin-backfill/sdk instead of the package root
  • add fixture-driven integration coverage for each chunking strategy and mixed multi-dimension scenarios
  • fix review findings around degenerate equal-width slices, sort-key expression parsing, temporal slice bounds, and package export validation

Why

The previous backfill planner only handled coarse partition splitting with a single sort-key range fallback. The new smart planner supports string-prefix, temporal, quantile, and equal-width strategies, including secondary-dimension handoff for hot keys, which matches the new prototype behavior and gives much better confidence in expected chunk layouts.

Impact

Backfill planning now produces smarter chunk boundaries for skewed data and multi-column sort keys.

Public API note: planner/runtime internals are no longer exported from @chkit/plugin-backfill. They now live under @chkit/plugin-backfill/sdk.

Root Cause

The older planner could only split by partition and a single coarse sort-key range, which broke down on hot keys and skewed distributions. During review, a few edge cases also surfaced in the first port: collapsed equal-width boundaries, brittle sorting_key parsing for expression-based sort keys, and temporal refinement extending past parent slice bounds.

Validation

  • bun test /Users/marc/Workspace/chkit/packages/plugin-backfill/src
  • bun run --cwd /Users/marc/Workspace/chkit/packages/plugin-backfill typecheck
  • bun verify

@KeKs0r KeKs0r marked this pull request as ready for review April 14, 2026 10:29
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fbd3a19868

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +20 to +22
if (sortKey.category === 'string') {
return `unhex('${Buffer.from(value, 'latin1').toString('hex')}')`
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Encode string SQL bounds with UTF-8 bytes

The string branch in formatBound builds predicates from Buffer.from(value, 'latin1'), which changes the byte sequence for UTF-8 text values. For example, a bound like é is emitted as unhex('e9') even though ClickHouse String stores UTF-8 bytes (c3a9), so range filters can miss rows or create overlapping chunks whenever sort-key values contain non-ASCII characters. This directly affects both planning probes and execution SQL that rely on these bounds.

Useful? React with 👍 / 👎.

Comment on lines +14 to +16
if (sortKey.boundaryEncoding === 'hex-latin1') {
return Buffer.from(value, 'latin1').toString('hex')
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve UTF-8 when persisting string chunk bounds

encodeBoundary serializes string boundaries with Buffer.from(value, 'latin1'), which is lossy for non-Latin-1 data (e.g., emoji and many Unicode code points). Because plans are written with this encoding and later decoded for run/resume, persisted chunk ranges can mutate after round-trip and cause resumed backfills to read the wrong key intervals.

Useful? React with 👍 / 👎.

@KeKs0r
Copy link
Copy Markdown
Contributor Author

KeKs0r commented Apr 14, 2026

Superseded by #108, which is a strict superset containing the critical correctness fixes (uncompressed-bytes sizing, parallel-replicas count inflation, string boundary truncation) plus E2E coverage.

@KeKs0r KeKs0r closed this Apr 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant