Remap Performance Optimizations on Large Repositories#16
Conversation
…dd tests for new functionality
The commit-map file produced by git-filter-repo starts with an 'old new' header line. The new SHA length validation rejects it. Skip the header when it appears on the first line. perf(commitremap): hex lookup table and byte-level sliding window Replace branch-heavy isHexByte with a [256]bool lookup table for branchless hex byte validation. Replace JSON-walk replaceSHA with byte-level replaceSHABytes sliding window that finds and replaces SHAs everywhere in file content (URLs, markdown, composite strings). Remove old replaceSHA function. Add commitMapSHALen validation, benchmark suite, and updated tests. refactor(bench): migrate benchmarks to b.Loop() (Go 1.24+) - Replace for i := 0; i < b.N; i++ with b.Loop() in all 5 benchmarks - Fix dead-code elimination risk in BenchmarkIsHexByte - Move make+copy setup outside timed loop in BenchmarkReplaceSHABytes - Remove unnecessary b.ResetTimer() calls (b.Loop auto-excludes setup)
@amenocal we may want to consider not using pgzip, and just using serial gzip compression to avoid having to import an outside package. In my tests, I removed large attachments, so I would assume pgzip would scale for larger compression jobs (ex. my .tar.gz was only 4.4gb vs 40gb w/ attachments) |
|
Also, not sure if it's worth considering, but sometimes shas can be referenced by their first 7 chars (see PR description above 😅 ). We could search for this as well, but we run into a birthday paradox problem with the likely hood of false positives increasing as the number of commits grows. We would have to add additional, specific logic that takes into account when shortened hashes are used (such as in the example above in links with the ../commit/ prefix). I think the benefits of a fast generic algorithm on all full length shas outweights the benefits of having to build custom logic for shortened shas. If we merge this PR, I recommend closing #9. |
Add issue_comments, pull_request_reviews, pull_request_review_comments, pull_request_review_threads, and commit_comments to the default prefix list. These metadata files can contain commit SHA references that need remapping. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
amenocal
left a comment
There was a problem hiding this comment.
Good stuff 🚀 , minor changes that I recommend
| // | ||
| // numWorkers controls how many goroutines process files in parallel. | ||
| // If numWorkers <= 0, it defaults to runtime.NumCPU(). | ||
| func ProcessFiles(archiveDir string, prefixes []string, commitMap map[string]string, numWorkers int) (Stats, error) { |
There was a problem hiding this comment.
Would adding something like an ProcessOptions struct work better since this is technically an optional value?
type ProcessOptions struct {
NumWorkers int // 0 = NumCPU
}
func ProcessFiles(archiveDir string, prefixes []string, commitMap map[string]string, opts ProcessOptions)
gives us the flexibility for the future too
| // Collect all files to process | ||
| var allFiles []string | ||
| for _, prefix := range prefixes { | ||
| pattern := filepath.Join(archiveDir, prefix+"_*.json") |
There was a problem hiding this comment.
should this use the same pattern as shouldRemap <prefix>_<digits>.json ? the way shouldRemap is doing it is probably better and more aligned with our archive format.
There was a problem hiding this comment.
That makes sense! Where do think shouldRemap should be moved to?
There was a problem hiding this comment.
Is this file better suited for pkg/archive ?
Also I don't see this being called anywhere 😅 should we add an option for this CLI ? or were you just thinking of leaving this solely as a pkg ?
There was a problem hiding this comment.
That's a great point! Yes it's probably better suited for pkg/archive! This isn't called anywhere in this application, because I had intended it to be called by gh-history-rewrite-migration, however, you're right, there's no reason it shouldn't be called here too 😅 I'll push an update to this repo as well!
@pmartindev I would agree. I'd rather use a standard go library rather than an outside pacakge. unless |
Only advantage is parallelization for compressing the tar. gzip doesn't offer parallelization. Both packages offer fast compression, which creates larger archives. So the trade off would be on very large tar files where compression is slow serially. I think we would need to test on very large archives to see how gzip performs. |
@pmartindev I think there is value in it then. I also see this comment on their README
Which I think the likelihood of this is quite high 😅 for the archives that we will handling |
Remap Performance Optimizations
Summary
This PR introduces a series of cumulative performance optimizations to the SHA remap phase, achieving a 98% reduction in processing time, from 54 minutes down to 49 seconds. Tests were ran against a monorepo-scale archive (1.8M commits, 95K metadata files, 4.4GB compressed).
Also, intrinsically resolves #12 and includes changes for #11.
Test Environment
runtime.NumCPU()= 16Benchmark Results
54m 00s30m 19s24m 41s18m 42s0m 49sOptimizations
1. Parallel file processing
Process metadata files concurrently using goroutines instead of sequentially. The number of goroutines can be configured with the
--threadscli flag. If unspecified, the default is0. The application interprets0to use the goroutines equal to the number of available CPUs (runtime.NumCPU()).2. Byte sliding window
old newheader line in commit map parsing.3. Hex lookup table
Replaced per-byte hex validation with a 256-entry precomputed lookup table (
isHexByte), eliminating branch-heavy range checks in the hot loop.4. Streaming tar-to-tar remap
Eliminates the extract → modify → re-tar pipeline entirely.
pgzip) atBestSpeedwithNumCPUconcurrency.Testing
StreamRemap(round-trip fidelity, passthrough, safety checks, ordering)replaceSHABytes,ParseCommitMap,ProcessFiles,isHexByte)