Add anonymize script #4
Conversation
Summary of Changes

Hello @sirkostya009, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces a Go-based anonymization script that processes CSV files containing sensitive information. The primary goal is to transform identifiable data into consistently generated, readable placeholders, so that production-like datasets can be used in non-production environments without compromising privacy. The script handles several data types, including structured JSON and parameterized strings, and ensures that the anonymized output retains its original format and relationships.
Code Review
This pull request introduces a new script for anonymizing CSV data. The overall structure is good: a state struct and helper functions keep the code clear, and optimizations such as strings.Builder help performance. However, the current implementation has several critical issues that can cause panics when the CSV contains invalid or unexpected data. I've identified multiple places where the code can panic due to unchecked assumptions about the input, such as out-of-bounds slice access and nil map access, and there is also a logic bug in how anonymized filenames are generated. My review provides specific suggestions to make the script more robust and correct.
matches := requirementsMDLinkRegex.FindStringSubmatch(requirement)
name := matches[1]
Accessing matches[1] will cause a panic if requirementsMDLinkRegex.FindStringSubmatch does not find a match, as matches will be nil. You should add a check to ensure matches is not nil and has at least 2 elements before accessing the captured group.
matches := requirementsMDLinkRegex.FindStringSubmatch(requirement)
if len(matches) < 2 {
	continue // Or handle error for malformed requirement
}
name := matches[1]

read[HeaderSteps] = buf.String()[:buf.Len()-1] // trim trailing newline
}

var parameterValueKeys map[string]string
if err != nil {
	panic(err)
}
parameterValueKeys = make(map[string]string, len(parameterValues[0].Values))
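The summary review above flags out-of-bounds slice access and nil map access, and this hunk indexes parameterValues[0] without a length check. A minimal sketch of a guard, assuming an empty parameterValues slice should simply produce an empty map rather than a panic:

if err != nil {
	panic(err)
}
// Only use the first entry for sizing when it exists, so an empty export
// cannot cause an index-out-of-range panic; the map is always initialized.
capHint := 0
if len(parameterValues) > 0 {
	capHint = len(parameterValues[0].Values)
}
parameterValueKeys = make(map[string]string, capHint)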
if strings.HasPrefix(jsonVal, `{"value":"`) {
	value := jsonVal[len(`{"value":"`) : len(jsonVal)-2]
	buf.Reset()
	buf.WriteString(`{"value":"`)
	buf.WriteString(anon.readableCustomField(value))
	buf.WriteString(`"}`)
	read[HeaderTemplateSuffixParams+1+i] = buf.String()
}
This manual JSON parsing is brittle. It relies on strings.HasPrefix and string slicing, which will fail if the JSON format varies slightly (e.g., with extra whitespace, different key ordering, or additional fields). A much more robust approach is to unmarshal the JSON into a struct. This also allows you to correctly handle the logic described in the comments, like skipping values where isDefault is true.
var cf struct {
	Value *string `json:"value"`
}
if err := json.Unmarshal([]byte(jsonVal), &cf); err == nil && cf.Value != nil {
	value := *cf.Value
	buf.Reset()
	buf.WriteString(`{"value":"`)
	buf.WriteString(anon.readableCustomField(value))
	buf.WriteString(`"}`)
	read[HeaderTemplateSuffixParams+1+i] = buf.String()
}
func (a *state) readableFileID(original string) string {
	if existing, ok := a.files[original]; ok {
		return existing
	}
	a.fileCounter++
	readable := "file-" + padInt(a.fileCounter, 3)
	a.files[original] = readable
	return readable
}

func (a *state) readableFileName(ext string) string {
	adj := adjectives[a.fileCounter%len(adjectives)]
	noun := nouns[a.fileCounter%len(nouns)]
	return "document-" + adj + "-" + noun + ext
}
There is a logic bug in the coupling between readableFileID and readableFileName. readableFileID increments a.fileCounter only for new file IDs. readableFileName then uses this counter. If an already-seen file ID is processed, readableFileID returns a cached value without incrementing the counter. The subsequent call to readableFileName will then use a stale counter value, leading to incorrect filenames being generated.
To fix this, you should generate and cache both the new ID and new filename together. Consider refactoring into a single function that returns both, and stores them in a struct in your state map. For example:
// In state struct:
type fileMapping struct {
	newID   string
	newName string
}

files map[string]fileMapping

// New function:
func (a *state) getAnonFileInfo(originalID, ext string) (string, string) {
	if mapping, ok := a.files[originalID]; ok {
		return mapping.newID, mapping.newName
	}
	a.fileCounter++
	newID := "file-" + padInt(a.fileCounter, 3)
	adj := adjectives[a.fileCounter%len(adjectives)]
	noun := nouns[a.fileCounter%len(nouns)]
	newName := "document-" + adj + "-" + noun + ext
	a.files[originalID] = fileMapping{newID: newID, newName: newName}
	return newID, newName
}

// In main():
ext := filepath.Ext(files[i].FileName)
files[i].ID, files[i].FileName = anon.getAnonFileInfo(files[i].ID, ext)

if *filename == "" {
	panic("no -f flag provided")
}
Using panic for handling expected errors like invalid user input is not idiomatic for Go applications. It causes an abrupt stop and a stack trace, which isn't user-friendly. Consider handling this error gracefully by printing a message to os.Stderr and exiting with a non-zero status code, for example using log.Fatal or fmt.Fprintf(os.Stderr, ...) followed by os.Exit(1). This applies to other panic calls for file I/O and parsing errors in this function as well.
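As a sketch of what that could look like for the -f check (the flag description and error message here are placeholders, not taken from the script):

filename := flag.String("f", "", "path to the CSV file to anonymize")
flag.Parse()

if *filename == "" {
	// Report the problem on stderr and exit non-zero instead of panicking.
	fmt.Fprintln(os.Stderr, "anonymize: no -f flag provided")
	flag.Usage()
	os.Exit(1)
}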
for j, r := range word {
	a.textBuf.WriteRune(unicode.ToUpper(r))
	a.textBuf.WriteString(word[j+utf8.RuneLen(r):])
	break
}
This for loop to capitalize the first letter of a word is unconventional as it's designed to only run once. A more idiomatic and clearer approach would be to decode the first rune and then capitalize it, which avoids the loop structure.
r, size := utf8.DecodeRuneInString(word)
a.textBuf.WriteRune(unicode.ToUpper(r))
a.textBuf.WriteString(word[size:])
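For completeness, the same idea as a standalone helper with a guard for the empty string; the function name is hypothetical and it relies only on the unicode and unicode/utf8 packages:

// capitalizeFirst upper-cases the first rune of word and leaves the rest
// unchanged; empty input is returned as-is.
func capitalizeFirst(word string) string {
	if word == "" {
		return word
	}
	r, size := utf8.DecodeRuneInString(word)
	return string(unicode.ToUpper(r)) + word[size:]
}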
of https://github.com/Hypersequent/tms-issues/issues/2179