feat(categorization): extract findCommonPhrases util with bigram support and tests#803
Conversation
…ort and tests Moves the common-phrase detection logic from CategoryBuilder.vue into a standalone src/util/categorization.ts module and adds 10 unit tests. Changes vs original PR ActivityWatch#455: - Extracts function to util module for testability - Uses Map<string, WordEntry> (consistent with existing CategoryBuilder code) - Filters bigram components against ignored_words and length <= 2 - Removes debug console.log statements - Full TypeScript types Closes ActivityWatch#455
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #803 +/- ##
==========================================
+ Coverage 28.17% 29.96% +1.79%
==========================================
Files 31 32 +1
Lines 1782 1829 +47
Branches 322 331 +9
==========================================
+ Hits 502 548 +46
- Misses 1212 1260 +48
+ Partials 68 21 -47 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Greptile SummaryThis PR extracts the word-frequency logic from Confidence Score: 5/5
Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A([findCommonPhrases\nevents, ignored_words]) --> B
subgraph Step1["Step 1: Build word map"]
B[For each event] --> C[Split title by SPLIT_REGEX]
C --> D{word.length > 2\nAND not ignored?}
D -- Yes --> E[Accumulate word duration\n& push event]
D -- No --> C
end
E --> F
subgraph Step2["Step 2: Build bigram map"]
F[For each event] --> G[Split title by SPLIT_REGEX]
G --> H{w1 & w2 both valid?\nlength > 2, not ignored}
H -- Yes --> I[Accumulate bigram duration\n& push event]
H -- No --> G
end
I --> J
subgraph Step3["Step 3: Promote dominant bigrams"]
J[Snapshot original word durations] --> K[For each bigram]
K --> L{bigramDuration / w1OrigDur > 0.5\nAND bigramDuration / w2OrigDur > 0.5?}
L -- Yes --> M[Add bigram to words map\nReduce w1 & w2 durations]
L -- No --> K
M --> K
end
M --> N([Return words Map])
K -- done --> N
Reviews (2): Last reviewed commit: "style(categorization): fix prettier line..." | Re-trigger Greptile |
| // Step 3: Promote bigrams that dominate both constituent words (>50% threshold) | ||
| for (const [bigram, bigramEntry] of bigrams) { | ||
| const spaceIdx = bigram.indexOf(' '); | ||
| const word1 = bigram.slice(0, spaceIdx); | ||
| const word2 = bigram.slice(spaceIdx + 1); | ||
| const w1Entry = words.get(word1); | ||
| const w2Entry = words.get(word2); | ||
| if (!w1Entry || !w2Entry) continue; | ||
|
|
||
| const bigram_duration = bigramEntry.duration; | ||
| if (bigram_duration / w1Entry.duration > 0.5 && bigram_duration / w2Entry.duration > 0.5) { | ||
| // Promote bigram, reduce constituent word durations | ||
| words.set(bigram, { word: bigram, duration: bigram_duration, events: bigramEntry.events }); | ||
| w1Entry.duration -= bigram_duration; | ||
| w2Entry.duration -= bigram_duration; | ||
| } | ||
| } |
There was a problem hiding this comment.
Mutating word durations mid-loop causes incorrect promotions and negative values
The promotion loop reads w1Entry.duration and w2Entry.duration to decide whether a bigram qualifies, but then immediately mutates those same values. For a trigram that always co-occurs (e.g. "Visual Studio Code"), both "Visual Studio" and "Studio Code" will each have 100 % of "Studio"'s duration, so they both qualify. Whichever processes second sees Studio.duration = 0, making the ratio Infinity, so it also promotes — and Studio.duration ends up negative. Both bigrams are then shown to the user with the full duration, double-counting the same events.
Snapshot the original durations before the loop so each promotion decision is independent:
// Step 3: Promote bigrams that dominate both constituent words (>50% threshold)
const originalDurations = new Map<string, number>();
for (const [w, entry] of words) {
originalDurations.set(w, entry.duration);
}
for (const [bigram, bigramEntry] of bigrams) {
const spaceIdx = bigram.indexOf(' ');
const word1 = bigram.slice(0, spaceIdx);
const word2 = bigram.slice(spaceIdx + 1);
const w1Entry = words.get(word1);
const w2Entry = words.get(word2);
if (!w1Entry || !w2Entry) continue;
const bigram_duration = bigramEntry.duration;
const orig1 = originalDurations.get(word1) ?? 0;
const orig2 = originalDurations.get(word2) ?? 0;
if (bigram_duration / orig1 > 0.5 && bigram_duration / orig2 > 0.5) {
words.set(bigram, { word: bigram, duration: bigram_duration, events: bigramEntry.events });
w1Entry.duration -= bigram_duration;
w2Entry.duration -= bigram_duration;
}
}| for (const event of events) { | ||
| const parts = event.data.title.split(SPLIT_REGEX); | ||
| for (let i = 0; i < parts.length - 1; i++) { | ||
| const w1 = parts[i]; | ||
| const w2 = parts[i + 1]; | ||
| if (w1.length <= 2 || ignored_words.includes(w1)) continue; | ||
| if (w2.length <= 2 || ignored_words.includes(w2)) continue; | ||
| const bigram = `${w1} ${w2}`; | ||
| const entry = bigrams.get(bigram); | ||
| if (entry) { | ||
| entry.duration += event.duration; | ||
| entry.events.push(event); | ||
| } else { | ||
| bigrams.set(bigram, { bigram, duration: event.duration, events: [event] }); | ||
| } | ||
| } |
There was a problem hiding this comment.
Non-adjacent words silently bigrammed across separator gaps
After splitting "foo-bar" the array is ["foo", "bar"]. The loop pairs parts[i] with parts[i+1], so "foo" and "bar" are treated as an adjacent bigram even though a hyphen separated them. Titles like "JIRA-1234 Fix bug" could unexpectedly promote "Fix bug" as a bigram even when "JIRA" and "1234" are filtered. This is a design decision worth calling out explicitly in the JSDoc — if the intent is "adjacent in the cleaned word stream" then the current approach is correct; if the intent is "strictly space-adjacent in the original title" then the regex should only split on whitespace for bigram indexing.
No code change required if this is the intended semantics; just a documentation note.
| test('accumulated duration across multiple events', () => { | ||
| const events = [ | ||
| makeEvent('Python Programming', 30), | ||
| makeEvent('Python Programming', 30), | ||
| makeEvent('Python Programming', 30), | ||
| makeEvent('Python Programming', 30), | ||
| ]; | ||
| const result = findCommonPhrases(events, []); | ||
| // All events have same title → bigram fully dominates | ||
| expect(result.get('Python Programming')?.duration).toBe(120); | ||
| expect(result.get('Python')?.duration).toBe(0); | ||
| expect(result.get('Programming')?.duration).toBe(0); | ||
| }); |
There was a problem hiding this comment.
Missing test for trigram / middle-word double-promotion
The existing tests only verify single-bigram promotion. The mutation bug described above is exercisable with a three-word phrase where every event has the same title (e.g. "Visual Studio Code"), where both "Visual Studio" and "Studio Code" should be candidates. A test like the one below would catch the negative-duration regression:
test('does not produce negative duration when middle word appears in two bigrams', () => {
// "Studio" appears in both "Visual Studio" and "Studio Code"
const events = [makeEvent('Visual Studio Code', 100), makeEvent('Visual Studio Code', 100)];
const result = findCommonPhrases(events, []);
for (const [, entry] of result) {
expect(entry.duration).toBeGreaterThanOrEqual(0);
}
});…loop Without this, promoting a bigram (e.g. 'Alpha Beta') reduces constituent word durations in-place. A later bigram that shares the middle word (e.g. 'Beta Gamma') then sees Beta.duration=0, so the check becomes 10/0 = Infinity > 0.5 and the weak bigram is incorrectly promoted. Fix: build an originalDurations snapshot before the Step 3 loop and use it for all threshold comparisons; mutations to entry.duration still happen (for accurate display) but no longer corrupt subsequent checks. Also adds a regression test that fails on the unfixed code.
|
Fixed the P1 bug flagged by Greptile in 99b073f. Root cause: In Step 3 the promotion loop mutated Fix: snapshot all word durations into Regression test added: |
|
Fixed lint failure in 4a853eb: replaced the two |
|
Fixed the remaining lint warning in 224a5a1: broke the over-length guard clause onto two lines to satisfy prettier's line-length rule (the previous commit fixed the logic but the resulting line was 100 chars, over prettier's 80-char limit). |
|
@greptileai review |
Takes over #455 as requested by @ErikBjare.
Changes
src/util/categorization.tsfor testabilityMap<string, WordEntry>(consistent with existingCategoryBuilder.vuecode)ignored_wordsand length ≤ 2 (the original only filtered single words)console.logstatements from the original PRTests
10 unit tests covering:
ignored_wordsfiltering (single words and bigram components)-,,,:,(,))