Skip to content

Performance Improvements#8

Closed
pmartindev wants to merge 6 commits into
mainfrom
pmartindev/performance-improvements
Closed

Performance Improvements#8
pmartindev wants to merge 6 commits into
mainfrom
pmartindev/performance-improvements

Conversation

@pmartindev
Copy link
Copy Markdown

@pmartindev pmartindev commented Sep 6, 2024

This large PR (sorry ya'll) makes performance improvements by:

  • Changing the comparison from an O(n^2) alg to O(n) by storing the commit-map content as a map with the old sha as the key vs comparing each old sha to the json value
  • Processing each file in an individual goroutine
  • Only processing the files that can possibly contain sha objects
  • Only processing the fields that can possibly contain sha objects

For comparison, I tested on an large archive file (happy to give you a copy for testing 😄) containing a repo with 50k commits in the default branch that contained 138,100 metadata objects (1381 files * 100 objects per file) and a commit-map with 627k lines. Each file took ~1 minute per file to process with a total resulting in > 24 hours to rewrite the archive. With the new improvements, I was able to process the 1381 files in 12 seconds

Processed 1380/1381 files
./run.sh  12.00s user 3.44s system 99% cpu 15.486 total

Because of the threading, the memory footprint increases, but not significantly, for the default 10 go routines on 1.3k files, the go memory profiler indicates < 100mb which is ~2x the size of the commit-map file in this test. It would be possible to optimize this using buffers on the commit-map reads, but I imagine performance might decrease having to make multiple syscalls to read the file.

CleanShot 2024-09-06 at 16 29 06@2x

In addition, this also adds QoL changes for:

  • Adding a new optional flag to specify the number of threads to use for processing
  • Adds buffers for reads on the metadata files
  • Prints the count of files processed vs the total count
  • Adds a log file of all of the processed files instead of printing to stdout

Comment thread internal/commitremap/commitremap.go Outdated
Comment thread cmd/root.go
Comment thread internal/commitremap/commitremap.go
Copy link
Copy Markdown
Collaborator

@amenocal amenocal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly the hardcoded workerCount need to be addressed 😆

pmartindev and others added 2 commits September 6, 2024 17:05
Co-authored-by: Alejandro Menocal <amenocal@github.com>
Copy link
Copy Markdown
Collaborator

@amenocal amenocal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Comment thread internal/commitremap/commitremap.go Outdated
for scanner.Scan() {
line := scanner.Text()
// Skip adding the header to the map
if line == "old new" {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this a const?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done!

Comment thread internal/commitremap/commitremap.go Outdated
Old: fields[0],
New: fields[1],
})
commitMap[fields[0]] = fields[1]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To improve readability we should be explicit about what the fields are mapping to, something like:
oldSha, newSha := fields[0], fields[1]
commitMap[oldSha] = newSha

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread internal/commitremap/commitremap.go
Comment thread internal/commitremap/commitremap.go Outdated
log.Fatalf("error updating metadata file: %v", err)
}
processedFiles <- file
processedFilesCount++
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use atomic counters here?
https://gobyexample.com/atomic-counters

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely 👍

Copy link
Copy Markdown
Contributor

@kuhlman-labs kuhlman-labs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@amenocal
Copy link
Copy Markdown
Collaborator

@pmartindev we should fix the build issues and get this merged :D

@ssulei7
Copy link
Copy Markdown
Contributor

ssulei7 commented Oct 29, 2024

@amenocal looks like for our third test case, we are getting an error thrown when parsing an invalid line in our third test case. Looking into how we can better address that panic.

Copy link
Copy Markdown
Contributor

@ssulei7 ssulei7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amenocal @pmartindev fixed logic in commitremap to do field len check before we parse.

@pmartindev
Copy link
Copy Markdown
Author

I would like to test this end to end with an actual archive, then I'm fine with merging 😃

@amenocal
Copy link
Copy Markdown
Collaborator

amenocal commented Jun 5, 2025

@pmartindev can we get this merged ? :D

@amenocal
Copy link
Copy Markdown
Collaborator

Closing as this is stale some of this improvement were implemented in #14 :

┌─────────────────────┬──────────────────────────────┬─────────────────────────────┬────────────────────────┐
│                     │ Original                     │ PR #8                       │ PR #14           │
├─────────────────────┼──────────────────────────────┼─────────────────────────────┼────────────────────────┤
│ Commit map type     │ []CommitMapEntry             │ map[string]string ✅        │ map[string]string ✅   │
├─────────────────────┼──────────────────────────────┼─────────────────────────────┼────────────────────────┤
│ Tree walks per file │ N (per commit)               │ 1                           │ 1                      │
├─────────────────────┼──────────────────────────────┼─────────────────────────────┼────────────────────────┤
│ SHA lookup          │ O(N) string-compare per node │ O(1) hash                   │ O(1) hash              │
├─────────────────────┼──────────────────────────────┼─────────────────────────────┼────────────────────────┤
│ Big-O               │ O(F · C · J)                 │ O(F · J)                    │ O(F · J)               │

@amenocal amenocal closed this Apr 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants