Performance Improvements by pmartindev · Pull Request #8 · mona-actions/gh-commit-remap

pmartindev · 2024-09-06T21:35:41Z

This large PR (sorry y'all) makes performance improvements by:

Changing the comparison from an O(n^2) algorithm to O(n) by storing the commit-map content as a map keyed by the old SHA, instead of comparing each old SHA against every JSON value
Processing each file in an individual goroutine
Only processing the files that can possibly contain sha objects
Only processing the fields that can possibly contain sha objects
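To illustrate the first point, here is a minimal sketch of the two lookup strategies. `CommitMapEntry` with `Old` and `New` fields appears in this PR's diff; the function names `lookupLinear` and `buildIndex` and the sample SHAs are assumptions made for illustration, not the PR's actual code.

```go
package main

import "fmt"

// CommitMapEntry is the slice-based representation: one old/new SHA pair.
type CommitMapEntry struct {
	Old string
	New string
}

// lookupLinear (hypothetical name) scans every entry, so remapping each of
// n values against m entries costs O(n * m) comparisons.
func lookupLinear(entries []CommitMapEntry, sha string) (string, bool) {
	for _, e := range entries {
		if e.Old == sha {
			return e.New, true
		}
	}
	return "", false
}

// buildIndex (hypothetical name) converts the slice into a map keyed by the
// old SHA, so each later lookup is a single hash access.
func buildIndex(entries []CommitMapEntry) map[string]string {
	index := make(map[string]string, len(entries))
	for _, e := range entries {
		index[e.Old] = e.New
	}
	return index
}

func main() {
	entries := []CommitMapEntry{
		{Old: "aaa111", New: "bbb222"},
		{Old: "ccc333", New: "ddd444"},
	}
	index := buildIndex(entries)

	newSha, ok := index["ccc333"] // one hash lookup
	fmt.Println(newSha, ok)

	linSha, linOK := lookupLinear(entries, "ccc333") // full scan in the worst case
	fmt.Println(linSha, linOK)
}
```

Building the index costs one pass over the commit-map, which is paid once and then shared by every file.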

For comparison, I tested on a large archive file (happy to give you a copy for testing 😄) containing a repo with 50k commits in the default branch, 138,100 metadata objects (1381 files * 100 objects per file), and a commit-map with 627k lines. Previously, each file took ~1 minute to process, which added up to > 24 hours to rewrite the archive. With the new improvements, I was able to process the 1381 files in 12 seconds of user CPU time (about 15.5 seconds wall clock):

Processed 1380/1381 files
./run.sh  12.00s user 3.44s system 99% cpu 15.486 total

Because of the threading, the memory footprint increases, but not significantly. For the default 10 goroutines on 1.3k files, the Go memory profiler indicates < 100 MB, which is ~2x the size of the commit-map file in this test. It would be possible to optimize this by buffering the commit-map reads, but I imagine performance might decrease from the additional syscalls needed to read the file.
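A bounded worker pool is one common way to get this behavior. The sketch below is an assumption about the general shape, not the PR's implementation: `processFiles`, its signature, and the file names are hypothetical. A fixed number of goroutines pull file names from a channel, so memory grows with the worker count rather than with the number of files.

```go
package main

import (
	"fmt"
	"sync"
)

// processFiles (hypothetical helper) runs process on every file using at most
// workerCount goroutines and returns how many files failed.
func processFiles(files []string, workerCount int, process func(string) error) int {
	if workerCount < 1 {
		workerCount = 1 // guard against a zero or negative setting
	}
	jobs := make(chan string)
	var wg sync.WaitGroup
	var mu sync.Mutex
	failed := 0

	for i := 0; i < workerCount; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker drains the channel until it is closed.
			for file := range jobs {
				if err := process(file); err != nil {
					mu.Lock()
					failed++
					mu.Unlock()
				}
			}
		}()
	}

	for _, f := range files {
		jobs <- f
	}
	close(jobs)
	wg.Wait()
	return failed
}

func main() {
	// Placeholder file names, not taken from a real archive.
	files := []string{"file-1.json", "file-2.json", "file-3.json"}
	failed := processFiles(files, 10, func(file string) error {
		fmt.Println("processing", file)
		return nil
	})
	fmt.Println("failed:", failed)
}
```

If the commit-map is a plain Go map that is only read after it is built, the workers can share it without locking, since concurrent reads of a map with no writers are safe.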

In addition, this adds QoL changes:

Adds a new optional flag to specify the number of threads to use for processing
Adds buffers for reads on the metadata files
Prints the count of files processed vs the total count
Writes a log file of all processed files instead of printing to stdout

amenocal

Mainly, the hardcoded workerCount needs to be addressed 😆

Co-authored-by: Alejandro Menocal <amenocal@github.com>

amenocal

🚀

kuhlman-labs · 2024-09-06T23:24:54Z

+	for scanner.Scan() {
+		line := scanner.Text()
+		// Skip adding the header to the map
+		if line == "old                                      new" {


Should we make this a const?

Good idea, done!
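For reference, a minimal sketch of what the const might look like. The header literal is copied from the diff above; the names `commitMapHeader` and `parseCommitMap` are assumptions, and the function takes the file content as a string only to keep the sketch self-contained.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// commitMapHeader (hypothetical name) is the header line of a commit-map
// file, as compared against in the diff above.
const commitMapHeader = "old                                      new"

// parseCommitMap (hypothetical name) builds the old-SHA -> new-SHA map and
// skips the header line. Real code would scan a file opened with os.Open.
func parseCommitMap(content string) (map[string]string, error) {
	commitMap := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := scanner.Text()
		// Skip adding the header to the map.
		if line == commitMapHeader {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue // this sketch silently skips malformed lines
		}
		commitMap[fields[0]] = fields[1]
	}
	return commitMap, scanner.Err()
}

func main() {
	content := commitMapHeader + "\naaa111 bbb222\nccc333 ddd444\n"
	commitMap, err := parseCommitMap(content)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(commitMap), commitMap["aaa111"])
}
```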

kuhlman-labs · 2024-09-06T23:36:10Z

-			Old: fields[0],
-			New: fields[1],
-		})
+		commitMap[fields[0]] = fields[1]


To improve readability, we should be explicit about what the fields map to, something like:
oldSha, newSha := fields[0], fields[1]
commitMap[oldSha] = newSha

kuhlman-labs · 2024-09-06T23:49:31Z

+					log.Fatalf("error updating metadata file: %v", err)
+				}
+				processedFiles <- file
+				processedFilesCount++


Should we use atomic counters here?
https://gobyexample.com/atomic-counters

Definitely 👍
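A plain `processedFilesCount++` from several goroutines is a data race. A minimal sketch of the atomic alternative, using `sync/atomic` from the standard library (the function name `countProcessed` and the simulated workload are assumptions for illustration):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countProcessed (hypothetical helper) simulates n files finishing on
// concurrent goroutines and returns the total recorded by an atomic counter.
func countProcessed(n int) int64 {
	var processedFilesCount int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// AddInt64 performs the increment as one indivisible operation.
			atomic.AddInt64(&processedFilesCount, 1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&processedFilesCount)
}

func main() {
	fmt.Println(countProcessed(1381))
}
```

On Go 1.19 and later, the `atomic.Int64` type offers the same behavior with `Add` and `Load` methods.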

kuhlman-labs

LGTM

amenocal · 2024-10-29T14:28:52Z

@pmartindev we should fix the build issues and get this merged :D

ssulei7 · 2024-10-29T17:52:25Z

@amenocal looks like our third test case throws an error when parsing an invalid line. Looking into how we can better address that panic.

ssulei7

@amenocal @pmartindev fixed the logic in commitremap to do the field len check before we parse.
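The idea is to validate the number of fields before indexing into them, so that an invalid line produces an error instead of an index-out-of-range panic. A minimal sketch, assuming whitespace-separated fields split with `strings.Fields`; the name `parseLine` and the error text are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// parseLine (hypothetical name) splits one commit-map line into its old and
// new SHA. The length check runs before fields[0] and fields[1] are touched.
func parseLine(line string) (oldSha, newSha string, err error) {
	fields := strings.Fields(line)
	if len(fields) != 2 {
		return "", "", fmt.Errorf("invalid commit-map line %q: expected 2 fields, got %d", line, len(fields))
	}
	return fields[0], fields[1], nil
}

func main() {
	for _, line := range []string{"aaa111 bbb222", "only-one-field", ""} {
		oldSha, newSha, err := parseLine(line)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(oldSha, "->", newSha)
	}
}
```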

pmartindev · 2024-11-06T20:11:03Z

I would like to test this end to end with an actual archive, then I'm fine with merging 😃

amenocal · 2025-06-05T15:02:04Z

@pmartindev can we get this merged ? :D

amenocal · 2026-04-29T19:53:16Z

Closing as this is stale; some of these improvements were implemented in #14:

┌─────────────────────┬──────────────────────────────┬─────────────────────────────┬────────────────────────┐
│                     │ Original                     │ PR #8                       │ PR #14                 │
├─────────────────────┼──────────────────────────────┼─────────────────────────────┼────────────────────────┤
│ Commit map type     │ []CommitMapEntry             │ map[string]string ✅        │ map[string]string ✅   │
├─────────────────────┼──────────────────────────────┼─────────────────────────────┼────────────────────────┤
│ Tree walks per file │ N (per commit)               │ 1                           │ 1                      │
├─────────────────────┼──────────────────────────────┼─────────────────────────────┼────────────────────────┤
│ SHA lookup          │ O(N) string-compare per node │ O(1) hash                   │ O(1) hash              │
├─────────────────────┼──────────────────────────────┼─────────────────────────────┼────────────────────────┤
│ Big-O               │ O(F · C · J)                 │ O(F · J)                    │ O(F · J)               │
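The "1 tree walk per file" and "O(1) hash" rows can be pictured with a short sketch: decode a metadata file once, visit every node once, and do one map lookup per string. This is an illustration under assumptions, not the code from either PR. The names `remapValue` and `remapJSON` and the `commit_id` field are hypothetical, and unlike PR #8 this version checks every string rather than only the fields that can hold a SHA.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// remapValue (hypothetical name) walks a decoded JSON value once and replaces
// any string that is a key in commitMap. Each string costs one hash lookup.
func remapValue(v any, commitMap map[string]string) any {
	switch t := v.(type) {
	case map[string]any:
		for k, child := range t {
			t[k] = remapValue(child, commitMap)
		}
		return t
	case []any:
		for i, child := range t {
			t[i] = remapValue(child, commitMap)
		}
		return t
	case string:
		if newSha, ok := commitMap[t]; ok {
			return newSha
		}
		return t
	default:
		return v // numbers, booleans, and null pass through unchanged
	}
}

// remapJSON (hypothetical name) decodes data, remaps it in a single walk, and
// encodes the result. Numbers are decoded as float64 in this sketch.
func remapJSON(data []byte, commitMap map[string]string) ([]byte, error) {
	var doc any
	if err := json.Unmarshal(data, &doc); err != nil {
		return nil, err
	}
	return json.Marshal(remapValue(doc, commitMap))
}

func main() {
	commitMap := map[string]string{"aaa111": "bbb222"}
	out, err := remapJSON([]byte(`[{"commit_id":"aaa111","body":"hi"}]`), commitMap)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With F files and J JSON nodes per file, this does F · J lookups in total, independent of the number of commits C.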

Performance improvements using maps and goroutines

7b72e8e

pmartindev requested review from amenocal, kuhlman-labs and ssulei7 September 6, 2024 21:35

amenocal reviewed Sep 6, 2024

Comment thread internal/commitremap/commitremap.go Outdated

amenocal reviewed Sep 6, 2024

Comment thread cmd/root.go

amenocal reviewed Sep 6, 2024

Comment thread internal/commitremap/commitremap.go

amenocal requested changes Sep 6, 2024

pmartindev and others added 2 commits September 6, 2024 17:05

Update internal/commitremap/commitremap.go

66be413

Co-authored-by: Alejandro Menocal <amenocal@github.com>

Comments,

9e68bcb

amenocal approved these changes Sep 6, 2024

kuhlman-labs reviewed Sep 6, 2024

Comment thread internal/commitremap/commitremap.go

kuhlman-labs reviewed Sep 6, 2024

pmartindev added 2 commits September 9, 2024 11:32

Code clarity

3dc938a

Code clarity

390afa5

kuhlman-labs approved these changes Sep 12, 2024

Moving field len check for invalid input

89704c0

ssulei7 approved these changes Oct 29, 2024

amenocal closed this Apr 29, 2026
