Remap Performance Optimizations on Large Repositories by pmartindev · Pull Request #16 · mona-actions/gh-commit-remap

pmartindev · 2026-05-22T01:09:11Z

Remap Performance Optimizations

Summary

This PR introduces a series of cumulative performance optimizations to the SHA remap phase, achieving a 98% reduction in processing time, from 54 minutes down to 49 seconds. Tests were run against a monorepo-scale archive (1.8M commits, 95K metadata files, 4.4 GB compressed).

It also resolves #12 as a side effect of the new matching approach and includes the changes for #11.

Test Environment

| Spec | Value |
| --- | --- |
| CPU | Apple M4 Max (16 cores) |
| RAM | 64 GB |
| Storage | SSD (APFS) |
| OS | macOS |
| Go | `runtime.NumCPU()` = 16 |

Benchmark Results

| Optimization | Remap Time | Improvement vs. baseline |
| --- | --- | --- |
| Baseline (single-threaded) | `54m 00s` | n/a |
| Threaded per CPU | `30m 19s` | ▼ 44% |
| + Byte sliding window | `24m 41s` | ▼ 54% |
| + Hex lookup table | `18m 42s` | ▼ 65% |
| + Streaming tar-to-tar | `0m 49s` | ▼ 98% |

Optimizations

1. Parallel file processing

f66d96e

Process metadata files concurrently using goroutines instead of sequentially. The number of goroutines is set with the --threads CLI flag. The default is 0, which the application interprets as one goroutine per available CPU (runtime.NumCPU()).
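The worker-pool idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in f66d96e: `resolveWorkers`, `processAll`, the `processFile` callback, and the file names are all hypothetical, and only the "0 means runtime.NumCPU()" rule is taken from the description.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// resolveWorkers maps the flag value to a goroutine count.
// A value of 0 (or less) means "one worker per available CPU".
func resolveWorkers(threads int) int {
	if threads <= 0 {
		return runtime.NumCPU()
	}
	return threads
}

// processAll fans file paths out to a fixed pool of workers. It returns the
// number of files processed successfully and the first error seen, if any.
// processFile is a hypothetical per-file callback.
func processAll(paths []string, threads int, processFile func(string) error) (int, error) {
	jobs := make(chan string)
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		done     int
		firstErr error
	)
	workers := resolveWorkers(threads)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				err := processFile(p)
				mu.Lock()
				if err == nil {
					done++
				} else if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}()
	}
	for _, p := range paths {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
	return done, firstErr
}

func main() {
	// Illustrative file names only.
	paths := []string{"pull_requests_000001.json", "issues_000001.json", "issue_events_000001.json"}
	n, err := processAll(paths, 0, func(string) error { return nil })
	fmt.Println(n, err)
}
```

An unbuffered channel keeps the sketch simple; with tens of thousands of small files, the per-file work rather than channel traffic is likely to dominate.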

2. Byte sliding window

b21eff5, 30171da

Replaced regex-based SHA matching with a byte-level sliding window that scans for valid hex (SHA) sequences directly. Instead of loading and parsing JSON objects, each file is loaded as a byte slice and scanned with a window whose width is the fixed length of a valid SHA-1 (40 hex characters) or SHA-256 (64 hex characters) hash.
Fix: Skip git-filter-repo's "old new" header line when parsing the commit map.
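A rough sketch of the byte-level scan follows. It borrows the name `replaceSHABytes` from the PR, but the signature and body here are assumptions. In particular, this version only replaces a hex run whose length is exactly `shaLen` and only recognizes lowercase hex; the actual implementation may treat longer runs or uppercase differently.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// isHex reports whether c is a lowercase ASCII hex digit.
func isHex(c byte) bool {
	return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')
}

// replaceSHABytes makes one pass over src. Each maximal run of hex bytes whose
// length equals shaLen is looked up in commitMap and replaced when found.
// Everything else is copied through unchanged.
func replaceSHABytes(src []byte, commitMap map[string]string, shaLen int) []byte {
	var out bytes.Buffer
	out.Grow(len(src))
	i := 0
	for i < len(src) {
		if !isHex(src[i]) {
			out.WriteByte(src[i])
			i++
			continue
		}
		j := i
		for j < len(src) && isHex(src[j]) {
			j++
		}
		run := src[i:j]
		if len(run) == shaLen {
			if repl, ok := commitMap[string(run)]; ok {
				out.WriteString(repl)
				i = j
				continue
			}
		}
		out.Write(run)
		i = j
	}
	return out.Bytes()
}

func main() {
	oldSHA := strings.Repeat("a", 40)
	newSHA := strings.Repeat("b", 40)
	in := []byte(`{"url":"https://example.com/o/r/commit/` + oldSHA + `"}`)
	fmt.Println(string(replaceSHABytes(in, map[string]string{oldSHA: newSHA}, 40)))
}
```

Because the scan never parses JSON, it finds SHAs inside URLs and markdown bodies as well as in dedicated fields, which is what lets this change cover #12.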

3. Hex lookup table

30171da

Replaced per-byte hex validation with a 256-entry precomputed lookup table (isHexByte), eliminating branch-heavy range checks in the hot loop.
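For illustration, a 256-entry table of this kind could look like the sketch below. The name `isHexByte` comes from the PR; the table name `hexTable` and the choice to accept only lowercase digits are assumptions.

```go
package main

import "fmt"

// hexTable[b] is true when b is a lowercase ASCII hex digit. It is built once
// at package initialization, so the hot loop only performs an array index.
var hexTable = func() [256]bool {
	var t [256]bool
	for c := '0'; c <= '9'; c++ {
		t[c] = true
	}
	for c := 'a'; c <= 'f'; c++ {
		t[c] = true
	}
	return t
}()

// isHexByte replaces range comparisons with a single table lookup.
// Indexing a [256]bool with a byte can never go out of range.
func isHexByte(b byte) bool {
	return hexTable[b]
}

func main() {
	fmt.Println(isHexByte('7'), isHexByte('c'), isHexByte('g'), isHexByte('/'))
}
```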

4. Streaming tar-to-tar remap

724695d

Eliminates the extract → modify → re-tar pipeline entirely.

Reads the input archive entry by entry. If an entry is one of the predefined SHA-bearing types (pull requests, issue events, etc.), its JSON is remapped in memory and the rewritten content is written directly to a new tar file. Otherwise, the entry is copied to the new tar file unchanged.
Uses only 2 file handles instead of 190,000+ filesystem operations.
Output compression uses parallel gzip (pgzip) at BestSpeed with NumCPU concurrency.
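The streaming pipeline can be sketched with the standard library alone. This is an approximation of the approach, not the code in 724695d: `streamRemap`, `check`, and `demo` are hypothetical, the standard `compress/gzip` writer stands in for pgzip, and the entry names and the "old"/"new" rewrite are made up for the demo.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// streamRemap copies a .tar.gz from in to out one entry at a time. Entries
// accepted by shouldRemap are buffered and rewritten; all others stream through.
func streamRemap(in io.Reader, out io.Writer, shouldRemap func(string) bool, rewrite func([]byte) []byte) error {
	gzr, err := gzip.NewReader(in)
	if err != nil {
		return err
	}
	defer gzr.Close()
	gzw, err := gzip.NewWriterLevel(out, gzip.BestSpeed)
	if err != nil {
		return err
	}
	tr, tw := tar.NewReader(gzr), tar.NewWriter(gzw)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		var body io.Reader = tr
		if hdr.Typeflag == tar.TypeReg && shouldRemap(hdr.Name) {
			data, err := io.ReadAll(tr)
			if err != nil {
				return err
			}
			data = rewrite(data)
			hdr.Size = int64(len(data)) // rewritten content may change length
			body = bytes.NewReader(data)
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if _, err := io.Copy(tw, body); err != nil {
			return err
		}
	}
	if err := tw.Close(); err != nil {
		return err
	}
	return gzw.Close()
}

// check panics on error; acceptable for a demo, not for production code.
func check(err error) {
	if err != nil {
		panic(err)
	}
}

// demo builds a tiny archive in memory, remaps it, and returns the output
// entries as (name, body) pairs. Names and contents are illustrative.
func demo(entries [][2]string) [][2]string {
	var src, dst bytes.Buffer
	gzw := gzip.NewWriter(&src)
	tw := tar.NewWriter(gzw)
	for _, e := range entries {
		check(tw.WriteHeader(&tar.Header{Name: e[0], Mode: 0644, Size: int64(len(e[1])), Typeflag: tar.TypeReg}))
		_, err := tw.Write([]byte(e[1]))
		check(err)
	}
	check(tw.Close())
	check(gzw.Close())
	isPR := func(name string) bool { return strings.HasPrefix(name, "pull_requests_") }
	swap := func(b []byte) []byte { return bytes.ReplaceAll(b, []byte("old"), []byte("new")) }
	check(streamRemap(&src, &dst, isPR, swap))
	gzr, err := gzip.NewReader(&dst)
	check(err)
	var got [][2]string
	for tr := tar.NewReader(gzr); ; {
		hdr, err := tr.Next()
		if err == io.EOF {
			return got
		}
		check(err)
		b, err := io.ReadAll(tr)
		check(err)
		got = append(got, [2]string{hdr.Name, string(b)})
	}
}

func main() {
	fmt.Println(demo([][2]string{{"pull_requests_000001.json", `{"sha":"old"}`}, {"attachments/a.txt", "old"}}))
}
```

Only remapped entries are held in memory here; passthrough entries are copied with `io.Copy`, so a large attachment is streamed through without being fully buffered. Entry order is preserved because output is written in the order entries are read.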

Testing

✅ 7 unit tests for StreamRemap (round-trip fidelity, passthrough, safety checks, ordering)
✅ Go benchmarks for all hot-path functions (replaceSHABytes, ParseCommitMap, ProcessFiles, isHexByte)
✅ End-to-end validation against 12 migration archives (34/34 checks passing)

…dd tests for new functionality

The commit-map file produced by git-filter-repo starts with an 'old new' header line. The new SHA length validation rejects it. Skip the header when it appears on the first line.

perf(commitremap): hex lookup table and byte-level sliding window

Replace branch-heavy isHexByte with a [256]bool lookup table for branchless hex byte validation. Replace JSON-walk replaceSHA with byte-level replaceSHABytes sliding window that finds and replaces SHAs everywhere in file content (URLs, markdown, composite strings). Remove old replaceSHA function. Add commitMapSHALen validation, benchmark suite, and updated tests.

refactor(bench): migrate benchmarks to b.Loop() (Go 1.24+)

- Replace for i := 0; i < b.N; i++ with b.Loop() in all 5 benchmarks
- Fix dead-code elimination risk in BenchmarkIsHexByte
- Move make+copy setup outside timed loop in BenchmarkReplaceSHABytes
- Remove unnecessary b.ResetTimer() calls (b.Loop auto-excludes setup)
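The header-skip fix described in the first commit message above might look roughly like this. `parseCommitMap` and `parseString` are hypothetical stand-ins for the PR's ParseCommitMap, and the exact validation rules (40 or 64 characters, matching lengths on both sides) are assumptions.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// parseCommitMap reads whitespace-separated "old new" SHA pairs, one per line.
// A literal "old new" header is skipped only when it is the first line, so the
// SHA length validation below never sees it.
func parseCommitMap(r io.Reader) (map[string]string, error) {
	m := make(map[string]string)
	sc := bufio.NewScanner(r)
	for line := 1; sc.Scan(); line++ {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue // ignore blank lines
		}
		if line == 1 && len(fields) == 2 && fields[0] == "old" && fields[1] == "new" {
			continue // git-filter-repo header
		}
		if len(fields) != 2 {
			return nil, fmt.Errorf("line %d: expected 2 fields, got %d", line, len(fields))
		}
		if n := len(fields[0]); (n != 40 && n != 64) || len(fields[1]) != n {
			return nil, fmt.Errorf("line %d: unexpected SHA length", line)
		}
		m[fields[0]] = fields[1]
	}
	return m, sc.Err()
}

// parseString is a convenience wrapper for in-memory input.
func parseString(s string) (map[string]string, error) {
	return parseCommitMap(strings.NewReader(s))
}

func main() {
	oldSHA, newSHA := strings.Repeat("1", 40), strings.Repeat("2", 40)
	m, err := parseString("old                                      new\n" + oldSHA + " " + newSHA + "\n")
	fmt.Println(len(m), err)
}
```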

pmartindev · 2026-05-22T01:10:12Z

Output compression uses parallel gzip (pgzip) at BestSpeed with NumCPU concurrency.

@amenocal we may want to consider not using pgzip, and just using serial gzip compression to avoid having to import an outside package. In my tests, I removed large attachments, so I would assume pgzip would scale for larger compression jobs (e.g., my .tar.gz was only 4.4 GB vs. 40 GB with attachments)

pmartindev · 2026-05-22T01:43:12Z

Also, not sure if it's worth considering, but sometimes SHAs can be referenced by their first 7 chars (see PR description above 😅). We could search for these as well, but we run into a birthday paradox problem, with the likelihood of false positives increasing as the number of commits grows. We would have to add additional, specific logic that takes into account where shortened hashes are used (such as in the example above, in links with the ../commit/ prefix).

I think the benefit of a fast, generic algorithm for all full-length SHAs outweighs the benefit of building custom logic for shortened SHAs. If we merge this PR, I recommend closing #9.
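To put rough numbers on the short-SHA concern above, the sketch below estimates two quantities under the assumption that SHAs are uniformly distributed: the chance that an arbitrary hex string of a given length equals the prefix of at least one commit, and the expected number of commit pairs sharing that prefix. The function names are hypothetical; the 1.8M commit count is the one from the PR summary.

```go
package main

import (
	"fmt"
	"math"
)

// prefixSpace returns the number of distinct hex strings of length hexLen.
func prefixSpace(hexLen int) float64 {
	return math.Pow(16, float64(hexLen))
}

// falsePositiveRate approximates the probability that a random hex string of
// length hexLen matches the prefix of at least one of the given commits.
func falsePositiveRate(commits float64, hexLen int) float64 {
	return -math.Expm1(-commits / prefixSpace(hexLen))
}

// expectedCollidingPairs approximates how many pairs of commits share the
// same prefix of length hexLen (the birthday-paradox estimate).
func expectedCollidingPairs(commits float64, hexLen int) float64 {
	return commits * (commits - 1) / (2 * prefixSpace(hexLen))
}

func main() {
	const commits = 1.8e6 // commit count from the PR summary
	fmt.Printf("7-char prefix: false-positive rate %.4f%%, colliding pairs ~%.0f\n",
		100*falsePositiveRate(commits, 7), expectedCollidingPairs(commits, 7))
	fmt.Printf("40-char SHA-1: false-positive rate %.3g%%\n",
		100*falsePositiveRate(commits, 40))
}
```

Under this model, 7-character prefixes at this repository size are ambiguous among the commits themselves thousands of times over, while full-length matches are effectively collision-free.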

Add issue_comments, pull_request_reviews, pull_request_review_comments, pull_request_review_threads, and commit_comments to the default prefix list. These metadata files can contain commit SHA references that need remapping. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

amenocal

Good stuff 🚀, with a few minor changes that I recommend

amenocal · 2026-05-22T15:40:12Z

+//
+// numWorkers controls how many goroutines process files in parallel.
+// If numWorkers <= 0, it defaults to runtime.NumCPU().
+func ProcessFiles(archiveDir string, prefixes []string, commitMap map[string]string, numWorkers int) (Stats, error) {


Would adding something like a ProcessOptions struct work better, since this is technically an optional value?

type ProcessOptions struct {
	NumWorkers int // 0 = NumCPU
}

func ProcessFiles(archiveDir string, prefixes []string, commitMap map[string]string, opts ProcessOptions)

gives us the flexibility for the future too
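As a small illustration of the suggestion, an options struct whose zero value means "use defaults" keeps call sites readable and leaves room for new fields without changing the function signature. The `workers` method below is a hypothetical helper and not part of the proposal.

```go
package main

import (
	"fmt"
	"runtime"
)

// ProcessOptions follows the struct proposed in the review. Its zero value is
// usable and selects the defaults.
type ProcessOptions struct {
	NumWorkers int // 0 = runtime.NumCPU()
}

// workers resolves the effective worker count (hypothetical helper).
func (o ProcessOptions) workers() int {
	if o.NumWorkers <= 0 {
		return runtime.NumCPU()
	}
	return o.NumWorkers
}

func main() {
	fmt.Println(ProcessOptions{}.workers() >= 1, ProcessOptions{NumWorkers: 4}.workers())
}
```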

amenocal · 2026-05-22T15:42:29Z

+	// Collect all files to process
+	var allFiles []string
 	for _, prefix := range prefixes {
 		pattern := filepath.Join(archiveDir, prefix+"_*.json")


Should this use the same pattern as shouldRemap, <prefix>_<digits>.json? The way shouldRemap does it is probably better and more aligned with our archive format.

That makes sense! Where do you think shouldRemap should be moved to?
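For comparison with the `prefix+"_*.json"` glob in the diff, a strict `<prefix>_<digits>.json` check could be sketched as below. The real shouldRemap is not shown in this thread, so the signature and body here are assumptions.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// shouldRemap reports whether name has the form <prefix>_<digits>.json for one
// of the given prefixes. Unlike a "_*.json" glob, it rejects names where
// anything other than digits follows the prefix.
func shouldRemap(name string, prefixes []string) bool {
	base := path.Base(name) // tar entry names use forward slashes
	if !strings.HasSuffix(base, ".json") {
		return false
	}
	stem := strings.TrimSuffix(base, ".json")
	for _, p := range prefixes {
		if !strings.HasPrefix(stem, p+"_") {
			continue
		}
		digits := stem[len(p)+1:]
		// Trim leaves an empty string only when every character is a digit.
		if digits != "" && strings.Trim(digits, "0123456789") == "" {
			return true
		}
	}
	return false
}

func main() {
	prefixes := []string{"pull_requests", "issue_events"}
	for _, name := range []string{"pull_requests_000001.json", "pull_requests_notes.json", "issue_events_000042.json"} {
		fmt.Println(name, shouldRemap(name, prefixes))
	}
}
```

The strict form matters once prefixes overlap: a glob built from a short prefix would also pick up files that belong to a longer one, whereas the digit check only accepts the numbered files for that exact prefix.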

amenocal · 2026-05-22T15:45:37Z

Is this file better suited for pkg/archive?

Also, I don't see this being called anywhere 😅 Should we add a CLI option for it, or were you thinking of leaving it solely as a package?

That's a great point! Yes, it's probably better suited for pkg/archive! This isn't called anywhere in this application because I had intended it to be called by gh-history-rewrite-migration. However, you're right, there's no reason it shouldn't be called here too 😅 I'll push an update to this repo as well!

amenocal · 2026-05-22T15:52:20Z

Output compression uses parallel gzip (pgzip) at BestSpeed with NumCPU concurrency.

@amenocal we may want to consider not using pgzip, and just using serial gzip compression to avoid having to import an outside package. In my tests, I removed large attachments, so I would assume pgzip would scale for larger compression jobs (e.g., my .tar.gz was only 4.4 GB vs. 40 GB with attachments)

@pmartindev I would agree. I'd rather use a standard Go library than an outside package, unless pgzip is giving us something that gzip doesn't

pmartindev · 2026-05-22T16:06:35Z

unless pgzip is giving us something that gzip doesn't

Only advantage is parallelization for compressing the tar. gzip doesn't offer parallelization. Both packages offer fast compression, which creates larger archives. So the trade off would be on very large tar files where compression is slow serially. I think we would need to test on very large archives to see how gzip performs.

amenocal · 2026-05-22T18:22:03Z

unless pgzip is giving us something that gzip doesn't

Only advantage is parallelization for compressing the tar. gzip doesn't offer parallelization. Both packages offer fast compression, which creates larger archives. So the trade off would be on very large tar files where compression is slow serially. I think we would need to test on very large archives to see how gzip performs.

@pmartindev I think there is value in it then. I also see this comment on their README

You should only use this if you are (de)compressing big amounts of data, say more than 1MB at the time

I think the likelihood of that is quite high 😅 for the archives that we will be handling

pmartindev added 5 commits May 20, 2026 12:07

Parallelize file processing.

f66d96e

Improved SHA replacement logic with sliding byte window comparison; a…

b21eff5

…dd tests for new functionality

Added tar streaming and parllel gzip

724695d

Commit phrasing

52a0c34

pmartindev mentioned this pull request May 22, 2026

Match SHAs embedded in URLs and markdown bodies #12

Open

pmartindev mentioned this pull request May 22, 2026

Expand JSON prefix coverage in commitremap.ProcessFiles #11

Open

amenocal reviewed May 22, 2026

pmartindev wants to merge 6 commits into main from pmartindev/perf-4-streaming
