Skip to content

feat(categorization): extract findCommonPhrases util with bigram support and tests#803

Merged
ErikBjare merged 4 commits intoActivityWatch:masterfrom
TimeToBuildBob:feat/categorization-bigrams-with-tests
Apr 14, 2026
Merged

feat(categorization): extract findCommonPhrases util with bigram support and tests#803
ErikBjare merged 4 commits intoActivityWatch:masterfrom
TimeToBuildBob:feat/categorization-bigrams-with-tests

Conversation

@TimeToBuildBob
Copy link
Copy Markdown
Contributor

Takes over #455 as requested by @ErikBjare.

Changes

  • Extracts the bigram-based phrase detection into src/util/categorization.ts for testability
  • Uses Map<string, WordEntry> (consistent with existing CategoryBuilder.vue code)
  • Also filters bigram components against ignored_words and length ≤ 2 (the original only filtered single words)
  • Removes debug console.log statements from the original PR
  • Full TypeScript types

Tests

10 unit tests covering:

  • Empty input
  • Duration accumulation
  • Bigram promotion when it dominates both words (>50% threshold)
  • No promotion when a word appears in many different contexts
  • ignored_words filtering (single words and bigram components)
  • Short-word filtering (length ≤ 2)
  • Various separator characters (-, ,, :, (, ))
  • Multi-event accumulation
PASS node test/unit/categorization.test.node.ts
  ✓ returns empty map for empty events
  ✓ counts single words by duration
  ✓ promotes bigram when it dominates both constituent words
  ✓ does not promote bigram when one word appears independently too often
  ✓ filters out words with length <= 2
  ✓ filters out ignored words
  ✓ ignored words are not used as bigram components
  ✓ handles titles split by various separator characters
  ✓ accumulated duration across multiple events
  ✓ returns Map with word entries containing events list
Tests: 10 passed, 10 total

…ort and tests

Moves the common-phrase detection logic from CategoryBuilder.vue into a
standalone src/util/categorization.ts module and adds 10 unit tests.

Changes vs original PR ActivityWatch#455:
- Extracts function to util module for testability
- Uses Map<string, WordEntry> (consistent with existing CategoryBuilder code)
- Filters bigram components against ignored_words and length <= 2
- Removes debug console.log statements
- Full TypeScript types

Closes ActivityWatch#455
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 97.87234% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 29.96%. Comparing base (0bafab0) to head (224a5a1).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
src/util/categorization.ts 97.87% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #803      +/-   ##
==========================================
+ Coverage   28.17%   29.96%   +1.79%     
==========================================
  Files          31       32       +1     
  Lines        1782     1829      +47     
  Branches      322      331       +9     
==========================================
+ Hits          502      548      +46     
- Misses       1212     1260      +48     
+ Partials       68       21      -47     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 14, 2026

Greptile Summary

This PR extracts the word-frequency logic from CategoryBuilder.vue into a typed findCommonPhrases utility in src/util/categorization.ts, adds bigram detection and promotion (>50% co-occurrence threshold), and covers the implementation with 10 unit tests. The key algorithmic addition over the original PR (#455) is the pre-loop snapshot of original word durations, which prevents the Infinity/divide-by-zero corruption for asymmetric trigrams that was flagged in earlier review threads.

Confidence Score: 5/5

  • Safe to merge; the snapshot fix addresses the main previously-flagged regression, and all remaining findings are minor style suggestions.
  • All new findings are P2 (one toBeCloseTo vs toBe precision nit in tests). Prior thread concerns have been converged on: the mutation/Infinity bug is fixed via the originalDurations snapshot, a trigram regression test was added, and the separator-gap design note was acknowledged as not requiring a code change.
  • No files require special attention.

Important Files Changed

Filename Overview
src/util/categorization.ts New utility implementing a 3-step phrase-detection algorithm (word accumulation → bigram accumulation → bigram promotion). The snapshot of original durations before the promotion loop correctly prevents the divide-by-zero/Infinity corruption for asymmetric trigrams flagged in earlier threads. Pure trigrams where all three words always co-occur can still cause the middle word's duration to go negative (both overlapping bigrams satisfy the >50% threshold independently against the snapshotted values), but this is mitigated in display by the >60s filter.
src/views/settings/CategoryBuilder.vue Clean removal of the inline single-word counting loop, replaced with a one-liner call to findCommonPhrases. The SPLIT_REGEX is identical to the old inline pattern. The return type (Map<string, WordEntry>) is compatible with the existing words_by_duration computed property that already iterated .values().
test/unit/categorization.test.node.ts 10 well-structured unit tests covering the main scenarios. The asymmetric-trigram snapshot test correctly verifies the fix for the previously flagged mutation bug. Minor: toBeCloseTo is used in one test where toBe would be more precise for exact integer inputs.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([findCommonPhrases\nevents, ignored_words]) --> B

    subgraph Step1["Step 1: Build word map"]
        B[For each event] --> C[Split title by SPLIT_REGEX]
        C --> D{word.length > 2\nAND not ignored?}
        D -- Yes --> E[Accumulate word duration\n& push event]
        D -- No --> C
    end

    E --> F

    subgraph Step2["Step 2: Build bigram map"]
        F[For each event] --> G[Split title by SPLIT_REGEX]
        G --> H{w1 & w2 both valid?\nlength > 2, not ignored}
        H -- Yes --> I[Accumulate bigram duration\n& push event]
        H -- No --> G
    end

    I --> J

    subgraph Step3["Step 3: Promote dominant bigrams"]
        J[Snapshot original word durations] --> K[For each bigram]
        K --> L{bigramDuration / w1OrigDur > 0.5\nAND bigramDuration / w2OrigDur > 0.5?}
        L -- Yes --> M[Add bigram to words map\nReduce w1 & w2 durations]
        L -- No --> K
        M --> K
    end

    M --> N([Return words Map])
    K -- done --> N
Loading

Reviews (2): Last reviewed commit: "style(categorization): fix prettier line..." | Re-trigger Greptile

Comment on lines +66 to +82
// Step 3: Promote bigrams that dominate both constituent words (>50% threshold)
for (const [bigram, bigramEntry] of bigrams) {
const spaceIdx = bigram.indexOf(' ');
const word1 = bigram.slice(0, spaceIdx);
const word2 = bigram.slice(spaceIdx + 1);
const w1Entry = words.get(word1);
const w2Entry = words.get(word2);
if (!w1Entry || !w2Entry) continue;

const bigram_duration = bigramEntry.duration;
if (bigram_duration / w1Entry.duration > 0.5 && bigram_duration / w2Entry.duration > 0.5) {
// Promote bigram, reduce constituent word durations
words.set(bigram, { word: bigram, duration: bigram_duration, events: bigramEntry.events });
w1Entry.duration -= bigram_duration;
w2Entry.duration -= bigram_duration;
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Mutating word durations mid-loop causes incorrect promotions and negative values

The promotion loop reads w1Entry.duration and w2Entry.duration to decide whether a bigram qualifies, but then immediately mutates those same values. For a trigram that always co-occurs (e.g. "Visual Studio Code"), both "Visual Studio" and "Studio Code" will each have 100 % of "Studio"'s duration, so they both qualify. Whichever processes second sees Studio.duration = 0, making the ratio Infinity, so it also promotes — and Studio.duration ends up negative. Both bigrams are then shown to the user with the full duration, double-counting the same events.

Snapshot the original durations before the loop so each promotion decision is independent:

// Step 3: Promote bigrams that dominate both constituent words (>50% threshold)
const originalDurations = new Map<string, number>();
for (const [w, entry] of words) {
  originalDurations.set(w, entry.duration);
}

for (const [bigram, bigramEntry] of bigrams) {
  const spaceIdx = bigram.indexOf(' ');
  const word1 = bigram.slice(0, spaceIdx);
  const word2 = bigram.slice(spaceIdx + 1);
  const w1Entry = words.get(word1);
  const w2Entry = words.get(word2);
  if (!w1Entry || !w2Entry) continue;

  const bigram_duration = bigramEntry.duration;
  const orig1 = originalDurations.get(word1) ?? 0;
  const orig2 = originalDurations.get(word2) ?? 0;
  if (bigram_duration / orig1 > 0.5 && bigram_duration / orig2 > 0.5) {
    words.set(bigram, { word: bigram, duration: bigram_duration, events: bigramEntry.events });
    w1Entry.duration -= bigram_duration;
    w2Entry.duration -= bigram_duration;
  }
}

Comment on lines +48 to +63
for (const event of events) {
const parts = event.data.title.split(SPLIT_REGEX);
for (let i = 0; i < parts.length - 1; i++) {
const w1 = parts[i];
const w2 = parts[i + 1];
if (w1.length <= 2 || ignored_words.includes(w1)) continue;
if (w2.length <= 2 || ignored_words.includes(w2)) continue;
const bigram = `${w1} ${w2}`;
const entry = bigrams.get(bigram);
if (entry) {
entry.duration += event.duration;
entry.events.push(event);
} else {
bigrams.set(bigram, { bigram, duration: event.duration, events: [event] });
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Non-adjacent words silently bigrammed across separator gaps

After splitting "foo-bar" the array is ["foo", "bar"]. The loop pairs parts[i] with parts[i+1], so "foo" and "bar" are treated as an adjacent bigram even though a hyphen separated them. Titles like "JIRA-1234 Fix bug" could unexpectedly promote "Fix bug" as a bigram even when "JIRA" and "1234" are filtered. This is a design decision worth calling out explicitly in the JSDoc — if the intent is "adjacent in the cleaned word stream" then the current approach is correct; if the intent is "strictly space-adjacent in the original title" then the regex should only split on whitespace for bigram indexing.

No code change required if this is the intended semantics; just a documentation note.

Comment on lines +85 to +97
test('accumulated duration across multiple events', () => {
const events = [
makeEvent('Python Programming', 30),
makeEvent('Python Programming', 30),
makeEvent('Python Programming', 30),
makeEvent('Python Programming', 30),
];
const result = findCommonPhrases(events, []);
// All events have same title → bigram fully dominates
expect(result.get('Python Programming')?.duration).toBe(120);
expect(result.get('Python')?.duration).toBe(0);
expect(result.get('Programming')?.duration).toBe(0);
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Missing test for trigram / middle-word double-promotion

The existing tests only verify single-bigram promotion. The mutation bug described above is exercisable with a three-word phrase where every event has the same title (e.g. "Visual Studio Code"), where both "Visual Studio" and "Studio Code" should be candidates. A test like the one below would catch the negative-duration regression:

test('does not produce negative duration when middle word appears in two bigrams', () => {
  // "Studio" appears in both "Visual Studio" and "Studio Code"
  const events = [makeEvent('Visual Studio Code', 100), makeEvent('Visual Studio Code', 100)];
  const result = findCommonPhrases(events, []);
  for (const [, entry] of result) {
    expect(entry.duration).toBeGreaterThanOrEqual(0);
  }
});

…loop

Without this, promoting a bigram (e.g. 'Alpha Beta') reduces constituent
word durations in-place.  A later bigram that shares the middle word
(e.g. 'Beta Gamma') then sees Beta.duration=0, so the check becomes
10/0 = Infinity > 0.5 and the weak bigram is incorrectly promoted.

Fix: build an originalDurations snapshot before the Step 3 loop and use
it for all threshold comparisons; mutations to entry.duration still happen
(for accurate display) but no longer corrupt subsequent checks.

Also adds a regression test that fails on the unfixed code.
@TimeToBuildBob
Copy link
Copy Markdown
Contributor Author

Fixed the P1 bug flagged by Greptile in 99b073f.

Root cause: In Step 3 the promotion loop mutated w1Entry.duration / w2Entry.duration in-place. When two bigrams share a middle word (trigram case), the first promotion reduces that word's duration to 0. The next check then sees 10 / 0 = Infinity > 0.5, so a weak bigram that should not qualify gets incorrectly promoted.

Fix: snapshot all word durations into originalDurations before the loop and use those snapshots for threshold comparisons. Mutations to entry.duration still happen (so individual word entries reflect the reduced "leftover" time), but they no longer corrupt subsequent threshold checks.

Regression test added: 'trigram: does not double-promote when middle word duration is consumed' — sets up makeEvent('Alpha Beta', 100) + makeEvent('Alpha Beta Gamma', 10) where "Beta Gamma" (10/110 ≈ 9% of Beta) must not be promoted. The test fails on the unfixed code and passes after the fix.

@TimeToBuildBob
Copy link
Copy Markdown
Contributor Author

Fixed lint failure in 4a853eb: replaced the two ! non-null assertions on originalDurations.get() with explicit undefined checks in the loop guard. Logically they were always defined (populated from the same words map), but ESLint correctly flagged the unsafe assertions.

@TimeToBuildBob
Copy link
Copy Markdown
Contributor Author

Fixed the remaining lint warning in 224a5a1: broke the over-length guard clause onto two lines to satisfy prettier's line-length rule (the previous commit fixed the logic but the resulting line was 100 chars, over prettier's 80-char limit).

@ErikBjare
Copy link
Copy Markdown
Member

@greptileai review

@ErikBjare ErikBjare merged commit 5529031 into ActivityWatch:master Apr 14, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants