BREAKING CHANGE: v0.0.6 -> --randomize/-R impl in chaintools split #5
Merged
Added
--randomize/-R (with optional --seed <u64>) for chaintools split. When enabled, chains are
distributed across output files in random order instead of input order, so when the input
is sorted by id/score the largest chains spread evenly across files instead of all landing
in the first one.
SplitMix64 PRNG + Lemire-bounded next_bounded + Fisher–Yates shuffle_indices, plus a
time-based default_seed() (reusing the already-imported SystemTime). The seed is logged at
info level so every run is reproducible.
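The three pieces named above fit together roughly as follows. This is a minimal sketch and not the code from this PR: the names SplitMix64, next_bounded, shuffle_indices and default_seed come from the description, but the signatures, the struct layout, and the way the seed is derived from SystemTime are assumptions made for illustration.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// SplitMix64: a small, seedable 64-bit generator (standard published constants).
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Lemire's multiply-and-reject method: an unbiased value in [0, bound).
    fn next_bounded(&mut self, bound: u64) -> u64 {
        assert!(bound > 0, "bound must be non-zero");
        let mut m = (self.next_u64() as u128) * (bound as u128);
        let mut low = m as u64;
        if low < bound {
            // 2^64 mod bound: draws whose low word falls below this are rejected.
            let threshold = bound.wrapping_neg() % bound;
            while low < threshold {
                m = (self.next_u64() as u128) * (bound as u128);
                low = m as u64;
            }
        }
        (m >> 64) as u64
    }
}

/// Fisher–Yates shuffle of the index permutation 0..n.
fn shuffle_indices(n: usize, rng: &mut SplitMix64) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..n).collect();
    for i in (1..n).rev() {
        let j = rng.next_bounded(i as u64 + 1) as usize;
        idx.swap(i, j);
    }
    idx
}

/// Assumed seed derivation: nanoseconds since the Unix epoch, truncated to u64.
fn default_seed() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos() as u64)
        .unwrap_or(0)
}

fn main() {
    let seed = default_seed();
    let mut rng = SplitMix64::new(seed);
    let order = shuffle_indices(10, &mut rng);
    // Printing the seed is what makes a time-seeded run repeatable later.
    println!("seed={seed} order={order:?}");
}
```

Because the generator is fully determined by its seed, rerunning with the printed seed reproduces the same permutation, which is the property the logged seed relies on.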
OutputPlan.byte_range: Range → byte_ranges: Vec<Range>. The non-randomized path is
untouched and just as fast: it still produces a single contiguous zero-copy range.
The randomized path shuffles a chain-index permutation, partitions it with the existing
--files/--chunks math, then merge_chain_ranges maps each file's chains to a minimal set
of byte ranges (sorted + contiguous-coalesced to minimize writes).
merge_chain_ranges makes chain 0 own byte 0, so the per-chain ranges form a gap-free
partition of the input. This preserves any preamble before the first chain header (which
the naive per-chain mapping would have dropped) and guarantees no bytes are lost or
duplicated.
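The range-merging idea can be illustrated with a short sketch. The function name matches the description, but the signature and inputs here are hypothetical: chain_starts is assumed to hold the ascending byte offset of each chain header, input_len the total input size, and chains the (distinct) chain indices assigned to one output file. The real implementation may differ.

```rust
use std::ops::Range;

/// Hypothetical signature: map one file's chain indices to a minimal set of
/// byte ranges over the input.
fn merge_chain_ranges(
    chain_starts: &[usize],
    input_len: usize,
    chains: &[usize],
) -> Vec<Range<usize>> {
    // Sort so that chains adjacent in the input become adjacent here.
    let mut sorted = chains.to_vec();
    sorted.sort_unstable();

    let mut merged: Vec<Range<usize>> = Vec::new();
    for &c in &sorted {
        // Chain 0 owns byte 0, so any preamble travels with it.
        let start = if c == 0 { 0 } else { chain_starts[c] };
        // A chain ends where the next one begins; the last chain ends at EOF.
        let end = chain_starts.get(c + 1).copied().unwrap_or(input_len);

        // Coalesce with the previous range when the two are contiguous.
        if let Some(last) = merged.last_mut() {
            if last.end == start {
                last.end = end;
                continue;
            }
        }
        merged.push(start..end);
    }
    merged
}

fn main() {
    // A 4-byte preamble, then chains at offsets 4, 10, 16, 22 in a 30-byte input.
    let starts = [4, 10, 16, 22];
    let file_a = merge_chain_ranges(&starts, 30, &[2, 0, 1]);
    let file_b = merge_chain_ranges(&starts, 30, &[3]);
    // Together the two files cover 0..30 with no gap and no overlap.
    println!("{file_a:?} {file_b:?}"); // prints "[0..22] [22..30]"
}
```

In this toy input, chains 0, 1 and 2 collapse into one range that starts at byte 0 rather than byte 4, which is how the preamble survives even though it belongs to no chain.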
write_output_slice now concatenates multiple ranges into the same BufWriter/GzEncoder, preserving zero-copy slicing and gzip support.
Changed
src/cli/split.rs: added -R/--randomize (bool) and --seed (optional u64, with clap requires = "randomize" so --seed alone is an error).
Documentation
assets/tools/split.md.
Notes
validity, determinism, range merging, preamble preservation, chain preservation,
reproducibility, and seed-without-randomize rejection). gzip-feature tests pass too.
--seed 42 byte-identical across runs), fresh-seed logging, all chains present exactly once,
and visible redistribution of chains across files.