
BREAKING CHANGE: v0.0.6 -> --randomize/-R impl in chaintools split #5

Merged
alejandrogzi merged 1 commit into master from v0.0.6
Jun 10, 2026

Conversation

@alejandrogzi alejandrogzi commented Jun 10, 2026

Owner

Added

  • --randomize/-R (with optional --seed <u64>) for chaintools split. When
    enabled, chains are distributed across output files in random order instead of input
    order, so when the input is sorted by id/score the largest chains spread evenly across
    files instead of all landing in the first one.
  • Hand-rolled RNG, no new dependency: SplitMix64 PRNG + Lemire-bounded
    next_bounded + Fisher–Yates shuffle_indices, plus a time-based default_seed()
    (reusing the already-imported SystemTime). Seed is logged at info level so every
    run is reproducible.
  • OutputPlan.byte_range: Range → byte_ranges: Vec<Range>. The non-randomized path
    is untouched and just as fast: it still produces a single contiguous zero-copy range.
    The randomized path shuffles a chain-index permutation, partitions it with the existing
    --files/--chunks math, then merge_chain_ranges maps each file's chains to a
    minimal set of byte ranges (sorted + contiguous-coalesced to minimize writes).
  • Robustness fix: merge_chain_ranges makes chain 0 own byte 0, so the per-chain
    ranges form a gap-free partition of the input. This preserves any preamble before the
    first chain header (which the naive per-chain mapping would have dropped) and guarantees
    no bytes are lost or duplicated.
  • Writer: write_output_slice now concatenates multiple ranges into the same
    BufWriter/GzEncoder, preserving zero-copy slicing and gzip support.
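
For illustration, the range handling in the last three bullets can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code merged here: only the name merge_chain_ranges comes from the description above, while its signature, the `starts`/`len` parameters, and the `write_ranges` helper are hypothetical. It assumes header offsets are ascending and the chain indices for a file are distinct.

```rust
use std::io::{self, Write};
use std::ops::Range;

/// Map one output file's chain indices to a minimal set of byte ranges.
///
/// Assumed inputs (hypothetical, for illustration only):
/// - `starts[i]` is the byte offset of chain `i`'s header, in ascending order;
/// - `len` is the total length of the input in bytes;
/// - `chains` holds distinct chain indices assigned to this file.
fn merge_chain_ranges(starts: &[usize], len: usize, chains: &[usize]) -> Vec<Range<usize>> {
    // Sort so that neighbouring chains become adjacent and can be coalesced.
    let mut sorted = chains.to_vec();
    sorted.sort_unstable();

    let mut merged: Vec<Range<usize>> = Vec::new();
    for &i in &sorted {
        // Chain 0 owns byte 0, so any preamble before the first header is kept.
        let begin = if i == 0 { 0 } else { starts[i] };
        // A chain ends where the next one starts; the last chain ends at EOF.
        let end = starts.get(i + 1).copied().unwrap_or(len);
        match merged.last_mut() {
            // Contiguous with the previous range: extend it instead of adding one.
            Some(last) if last.end == begin => last.end = end,
            _ => merged.push(begin..end),
        }
    }
    merged
}

/// Concatenate several slices of `data` into one writer (hypothetical helper).
/// Each range is written straight from the input buffer, with no intermediate copy.
fn write_ranges<W: Write>(out: &mut W, data: &[u8], ranges: &[Range<usize>]) -> io::Result<()> {
    for r in ranges {
        out.write_all(&data[r.clone()])?;
    }
    Ok(())
}
```

With header offsets [5, 13, 21] in a 29-byte input, chains {2, 0} map to 0..13 and 21..29 and chain {1} maps to 13..21, so the two files together cover every input byte exactly once, including the 5-byte preamble.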

Changed

  • CLI flags in src/cli/split.rs — added -R/--randomize (bool) and --seed (optional
    u64, with clap requires = "randomize" so --seed alone is an error).

Documentation

  • Both flags documented in assets/tools/split.md.

Notes

  • 82 binary tests + 67 lib/integration tests pass, including 7 new ones (permutation
    validity, determinism, range merging, preamble preservation, chain preservation,
    reproducibility, and seed-without-randomize rejection). gzip-feature tests pass too.
  • End-to-end: confirmed reproducibility (--seed 42 byte-identical across runs),
    fresh-seed logging, all chains present exactly once, and visible redistribution of
    chains across files.
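
The reproducibility checks above depend on the shuffle being a pure function of the seed. A minimal sketch of that idea, combining SplitMix64, Lemire's bounded sampling, and a Fisher–Yates shuffle, could look like the following. The names next_bounded and shuffle_indices are taken from the description, but the struct layout, the `new`/`next_u64` methods, and all signatures are assumptions and may differ from the merged code.

```rust
/// SplitMix64: a small 64-bit PRNG whose whole state is a single u64.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    /// Advance the state by the golden-ratio increment and mix it.
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Unbiased value in `0..bound` via Lemire's multiply-and-reject method.
    fn next_bounded(&mut self, bound: u64) -> u64 {
        assert!(bound > 0, "bound must be non-zero");
        let mut m = (self.next_u64() as u128) * (bound as u128);
        if (m as u64) < bound {
            // (2^64 - bound) % bound: low words below this would introduce bias.
            let threshold = bound.wrapping_neg() % bound;
            while (m as u64) < threshold {
                m = (self.next_u64() as u128) * (bound as u128);
            }
        }
        // The high 64 bits of the 128-bit product are the result.
        (m >> 64) as u64
    }
}

/// Return a permutation of `0..n` that depends only on `seed` (Fisher–Yates).
fn shuffle_indices(n: usize, seed: u64) -> Vec<usize> {
    let mut rng = SplitMix64::new(seed);
    let mut idx: Vec<usize> = (0..n).collect();
    // Walk from the back, swapping each slot with a uniformly chosen earlier-or-equal slot.
    for i in (1..n).rev() {
        let j = rng.next_bounded(i as u64 + 1) as usize;
        idx.swap(i, j);
    }
    idx
}
```

Calling shuffle_indices twice with the same `n` and seed yields the same permutation, which is the property that makes a logged seed sufficient to replay a run.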

@alejandrogzi alejandrogzi merged commit 14e663b into master Jun 10, 2026
2 checks passed
@alejandrogzi alejandrogzi deleted the v0.0.6 branch June 10, 2026 09:46