Performance Improvements#8
Conversation
amenocal
left a comment
There was a problem hiding this comment.
Mainly the hardcoded workerCount need to be addressed 😆
Co-authored-by: Alejandro Menocal <amenocal@github.com>
| for scanner.Scan() { | ||
| line := scanner.Text() | ||
| // Skip adding the header to the map | ||
| if line == "old new" { |
There was a problem hiding this comment.
Should we make this a const?
| Old: fields[0], | ||
| New: fields[1], | ||
| }) | ||
| commitMap[fields[0]] = fields[1] |
There was a problem hiding this comment.
To improve readability we should be explicit about what the fields are mapping to, something like:
oldSha, newSha := fields[0], fields[1]
commitMap[oldSha] = newSha
| log.Fatalf("error updating metadata file: %v", err) | ||
| } | ||
| processedFiles <- file | ||
| processedFilesCount++ |
There was a problem hiding this comment.
Should we use atomic counters here?
https://gobyexample.com/atomic-counters
|
@pmartindev we should fix the build issues and get this merged :D |
|
@amenocal looks like for our third test case, we are getting an error thrown when parsing an invalid line in our third test case. Looking into how we can better address that panic. |
ssulei7
left a comment
There was a problem hiding this comment.
@amenocal @pmartindev fixed logic in commitremap to do field len check before we parse.
|
I would like to test this end to end with an actual archive, then I'm fine with merging 😃 |
|
@pmartindev can we get this merged ? :D |
|
Closing as this is stale some of this improvement were implemented in #14 : |
This large PR (sorry ya'll) makes performance improvements by:
For comparison, I tested on an large archive file (happy to give you a copy for testing 😄) containing a repo with 50k commits in the default branch that contained 138,100 metadata objects (1381 files * 100 objects per file) and a commit-map with 627k lines. Each file took ~1 minute per file to process with a total resulting in > 24 hours to rewrite the archive. With the new improvements, I was able to process the 1381 files in 12 seconds
Because of the threading, the memory footprint increases, but not significantly, for the default 10 go routines on 1.3k files, the go memory profiler indicates < 100mb which is ~2x the size of the commit-map file in this test. It would be possible to optimize this using buffers on the commit-map reads, but I imagine performance might decrease having to make multiple syscalls to read the file.
In addition, this also adds QoL changes for: