Skip to content

Remap Performance Optimizations on Large Repositories#16

Draft
pmartindev wants to merge 6 commits into
mainfrom
pmartindev/perf-4-streaming
Draft

Remap Performance Optimizations on Large Repositories#16
pmartindev wants to merge 6 commits into
mainfrom
pmartindev/perf-4-streaming

Conversation

@pmartindev
Copy link
Copy Markdown

@pmartindev pmartindev commented May 22, 2026

Remap Performance Optimizations

Summary

This PR introduces a series of cumulative performance optimizations to the SHA remap phase, achieving a 98% reduction in processing time, from 54 minutes down to 49 seconds. Tests were ran against a monorepo-scale archive (1.8M commits, 95K metadata files, 4.4GB compressed).

Also, intrinsically resolves #12 and includes changes for #11.

Test Environment

Spec Value
CPU Apple M4 Max (16 cores)
RAM 64 GB
Storage SSD (APFS)
OS macOS
Go runtime.NumCPU() = 16

Benchmark Results

Optimization Remap Time Improvement
Baseline (single-threaded) 54m 00s
Threaded per CPU 30m 19s ▼ 44%
+ Byte sliding window 24m 41s ▼ 54%
+ Hex lookup table 18m 42s ▼ 65%
+ Streaming tar-to-tar 0m 49s ▼ 98%
image

Optimizations

1. Parallel file processing

f66d96e

Process metadata files concurrently using goroutines instead of sequentially. The number of goroutines can be configured with the --threads cli flag. If unspecified, the default is 0. The application interprets 0 to use the goroutines equal to the number of available CPUs (runtime.NumCPU()).

2. Byte sliding window

b21eff5, 30171da

  • Replaced regex-based SHA matching with a byte-level sliding window that scans for valid hex (SHA) sequences directly. Instead of loading and parsing json objects, the file is loaded as a byte array and shifts according to the predetermined length of a valid sha1 or sha256.
  • Fix: Skip git-filter-repo's old new header line in commit map parsing.

3. Hex lookup table

30171da

Replaced per-byte hex validation with a 256-entry precomputed lookup table (isHexByte), eliminating branch-heavy range checks in the hot loop.

4. Streaming tar-to-tar remap

724695d

Eliminates the extract → modify → re-tar pipeline entirely.

  • Reads the input file entry-by-entry. If the entry is one of the predefined sha bearing entries (pull requests, issue events, etc.) it remaps the json in memory, then writes rewritten output directly to a new tar file. If it is not, the output is directly copied to the new tar file from memory.
  • Only 2 file handles instead of 190,000+ filesystem operations.
  • Output compression uses parallel gzip (pgzip) at BestSpeed with NumCPU concurrency.

Testing

  • ✅ 7 unit tests for StreamRemap (round-trip fidelity, passthrough, safety checks, ordering)
  • ✅ Go benchmarks for all hot-path functions (replaceSHABytes, ParseCommitMap, ProcessFiles, isHexByte)
  • ✅ End-to-end validation against 12 migration archives (34/34 checks passing)

The commit-map file produced by git-filter-repo starts with an
'old new' header line. The new SHA length validation rejects it.
Skip the header when it appears on the first line.

perf(commitremap): hex lookup table and byte-level sliding window

Replace branch-heavy isHexByte with a [256]bool lookup table for
branchless hex byte validation. Replace JSON-walk replaceSHA with
byte-level replaceSHABytes sliding window that finds and replaces
SHAs everywhere in file content (URLs, markdown, composite strings).

Remove old replaceSHA function. Add commitMapSHALen validation,
benchmark suite, and updated tests.

refactor(bench): migrate benchmarks to b.Loop() (Go 1.24+)

- Replace for i := 0; i < b.N; i++ with b.Loop() in all 5 benchmarks
- Fix dead-code elimination risk in BenchmarkIsHexByte
- Move make+copy setup outside timed loop in BenchmarkReplaceSHABytes
- Remove unnecessary b.ResetTimer() calls (b.Loop auto-excludes setup)
@pmartindev
Copy link
Copy Markdown
Author

pmartindev commented May 22, 2026

  • Output compression uses parallel gzip (pgzip) at BestSpeed with NumCPU concurrency.

@amenocal we may want to consider not using pgzip, and just using serial gzip compression to avoid having to import an outside package. In my tests, I removed large attachments, so I would assume pgzip would scale for larger compression jobs (ex. my .tar.gz was only 4.4gb vs 40gb w/ attachments)

@pmartindev
Copy link
Copy Markdown
Author

pmartindev commented May 22, 2026

Also, not sure if it's worth considering, but sometimes shas can be referenced by their first 7 chars (see PR description above 😅 ). We could search for this as well, but we run into a birthday paradox problem with the likely hood of false positives increasing as the number of commits grows. We would have to add additional, specific logic that takes into account when shortened hashes are used (such as in the example above in links with the ../commit/ prefix).

I think the benefits of a fast generic algorithm on all full length shas outweights the benefits of having to build custom logic for shortened shas. If we merge this PR, I recommend closing #9.

Add issue_comments, pull_request_reviews, pull_request_review_comments,
pull_request_review_threads, and commit_comments to the default prefix
list. These metadata files can contain commit SHA references that need
remapping.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Collaborator

@amenocal amenocal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff 🚀 , minor changes that I recommend

//
// numWorkers controls how many goroutines process files in parallel.
// If numWorkers <= 0, it defaults to runtime.NumCPU().
func ProcessFiles(archiveDir string, prefixes []string, commitMap map[string]string, numWorkers int) (Stats, error) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would adding something like an ProcessOptions struct work better since this is technically an optional value?

type ProcessOptions struct {
     NumWorkers int // 0 = NumCPU
 }
 
 func ProcessFiles(archiveDir string, prefixes []string, commitMap map[string]string, opts ProcessOptions) 

gives us the flexibility for the future too

// Collect all files to process
var allFiles []string
for _, prefix := range prefixes {
pattern := filepath.Join(archiveDir, prefix+"_*.json")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this use the same pattern as shouldRemap <prefix>_<digits>.json ? the way shouldRemap is doing it is probably better and more aligned with our archive format.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense! Where do think shouldRemap should be moved to?

Comment thread pkg/commitremap/stream.go
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this file better suited for pkg/archive ?

Also I don't see this being called anywhere 😅 should we add an option for this CLI ? or were you just thinking of leaving this solely as a pkg ?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great point! Yes it's probably better suited for pkg/archive! This isn't called anywhere in this application, because I had intended it to be called by gh-history-rewrite-migration, however, you're right, there's no reason it shouldn't be called here too 😅 I'll push an update to this repo as well!

@amenocal
Copy link
Copy Markdown
Collaborator

  • Output compression uses parallel gzip (pgzip) at BestSpeed with NumCPU concurrency.

@amenocal we may want to consider not using pgzip, and just using serial gzip compression to avoid having to import an outside package. In my tests, I removed large attachments, so I would assume pgzip would scale for larger compression jobs (ex. my .tar.gz was only 4.4gb vs 40gb w/ attachments)

@pmartindev I would agree. I'd rather use a standard go library rather than an outside pacakge. unless pgzip is giving us somethign that gzip doesn't

@pmartindev
Copy link
Copy Markdown
Author

pmartindev commented May 22, 2026

unless pgzip is giving us somethign that gzip doesn't

Only advantage is parallelization for compressing the tar. gzip doesn't offer parallelization. Both packages offer fast compression, which creates larger archives. So the trade off would be on very large tar files where compression is slow serially. I think we would need to test on very large archives to see how gzip performs.

@amenocal
Copy link
Copy Markdown
Collaborator

unless pgzip is giving us somethign that gzip doesn't

Only advantage is parallelization for compressing the tar. gzip doesn't offer parallelization. Both packages offer fast compression, which creates larger archives. So the trade off would be on very large tar files where compression is slow serially. I think we would need to test on very large archives to see how gzip performs.

@pmartindev I think there is value in it then. I also see this comment on their README

You should only use this if you are (de)compressing big amounts of data, say more than 1MB at the time

Which I think the likelihood of this is quite high 😅 for the archives that we will handling

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Match SHAs embedded in URLs and markdown bodies

2 participants