feat(categorization): extract findCommonPhrases util with bigram support and tests by TimeToBuildBob · Pull Request #803 · ActivityWatch/aw-webui

TimeToBuildBob · 2026-04-14T17:26:59Z

Takes over #455 as requested by @ErikBjare.

Changes

Extracts the bigram-based phrase detection into src/util/categorization.ts for testability
Uses Map<string, WordEntry> (consistent with existing CategoryBuilder.vue code)
Also filters bigram components against ignored_words and length ≤ 2 (the original only filtered single words)
Removes debug console.log statements from the original PR
Full TypeScript types

Tests

10 unit tests covering:

Empty input
Duration accumulation
Bigram promotion when it dominates both words (>50% threshold)
No promotion when a word appears in many different contexts
ignored_words filtering (single words and bigram components)
Short-word filtering (length ≤ 2)
Various separator characters (-, ,, :, (, ))
Multi-event accumulation

PASS node test/unit/categorization.test.node.ts
  ✓ returns empty map for empty events
  ✓ counts single words by duration
  ✓ promotes bigram when it dominates both constituent words
  ✓ does not promote bigram when one word appears independently too often
  ✓ filters out words with length <= 2
  ✓ filters out ignored words
  ✓ ignored words are not used as bigram components
  ✓ handles titles split by various separator characters
  ✓ accumulated duration across multiple events
  ✓ returns Map with word entries containing events list
Tests: 10 passed, 10 total

…ort and tests Moves the common-phrase detection logic from CategoryBuilder.vue into a standalone src/util/categorization.ts module and adds 10 unit tests. Changes vs original PR ActivityWatch#455: - Extracts function to util module for testability - Uses Map<string, WordEntry> (consistent with existing CategoryBuilder code) - Filters bigram components against ignored_words and length <= 2 - Removes debug console.log statements - Full TypeScript types Closes ActivityWatch#455

codecov · 2026-04-14T17:29:21Z

Codecov Report

❌ Patch coverage is 97.87234% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 29.96%. Comparing base (0bafab0) to head (224a5a1).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
src/util/categorization.ts	97.87%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #803      +/-   ##
==========================================
+ Coverage   28.17%   29.96%   +1.79%     
==========================================
  Files          31       32       +1     
  Lines        1782     1829      +47     
  Branches      322      331       +9     
==========================================
+ Hits          502      548      +46     
- Misses       1212     1260      +48     
+ Partials       68       21      -47

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

greptile-apps · 2026-04-14T17:29:25Z

Greptile Summary

This PR extracts the word-frequency logic from CategoryBuilder.vue into a typed findCommonPhrases utility in src/util/categorization.ts, adds bigram detection and promotion (>50% co-occurrence threshold), and covers the implementation with 10 unit tests. The key algorithmic addition over the original PR (#455) is the pre-loop snapshot of original word durations, which prevents the Infinity/divide-by-zero corruption for asymmetric trigrams that was flagged in earlier review threads.

Confidence Score: 5/5

Safe to merge; the snapshot fix addresses the main previously-flagged regression, and all remaining findings are minor style suggestions.
All new findings are P2 (one toBeCloseTo vs toBe precision nit in tests). Prior thread concerns have been converged on: the mutation/Infinity bug is fixed via the originalDurations snapshot, a trigram regression test was added, and the separator-gap design note was acknowledged as not requiring a code change.
No files require special attention.

Important Files Changed

Filename	Overview
src/util/categorization.ts	New utility implementing a 3-step phrase-detection algorithm (word accumulation → bigram accumulation → bigram promotion). The snapshot of original durations before the promotion loop correctly prevents the divide-by-zero/Infinity corruption for asymmetric trigrams flagged in earlier threads. Pure trigrams where all three words always co-occur can still cause the middle word's duration to go negative (both overlapping bigrams satisfy the >50% threshold independently against the snapshotted values), but this is mitigated in display by the >60s filter.
src/views/settings/CategoryBuilder.vue	Clean removal of the inline single-word counting loop, replaced with a one-liner call to `findCommonPhrases`. The SPLIT_REGEX is identical to the old inline pattern. The return type (`Map<string, WordEntry>`) is compatible with the existing `words_by_duration` computed property that already iterated `.values()`.
test/unit/categorization.test.node.ts	10 well-structured unit tests covering the main scenarios. The asymmetric-trigram snapshot test correctly verifies the fix for the previously flagged mutation bug. Minor: `toBeCloseTo` is used in one test where `toBe` would be more precise for exact integer inputs.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([findCommonPhrases\nevents, ignored_words]) --> B

    subgraph Step1["Step 1: Build word map"]
        B[For each event] --> C[Split title by SPLIT_REGEX]
        C --> D{word.length > 2\nAND not ignored?}
        D -- Yes --> E[Accumulate word duration\n& push event]
        D -- No --> C
    end

    E --> F

    subgraph Step2["Step 2: Build bigram map"]
        F[For each event] --> G[Split title by SPLIT_REGEX]
        G --> H{w1 & w2 both valid?\nlength > 2, not ignored}
        H -- Yes --> I[Accumulate bigram duration\n& push event]
        H -- No --> G
    end

    I --> J

    subgraph Step3["Step 3: Promote dominant bigrams"]
        J[Snapshot original word durations] --> K[For each bigram]
        K --> L{bigramDuration / w1OrigDur > 0.5\nAND bigramDuration / w2OrigDur > 0.5?}
        L -- Yes --> M[Add bigram to words map\nReduce w1 & w2 durations]
        L -- No --> K
        M --> K
    end

    M --> N([Return words Map])
    K -- done --> N

_{Reviews (2): Last reviewed commit: "style(categorization): fix prettier line..." | Re-trigger Greptile}

greptile-apps · 2026-04-14T17:29:29Z

+  // Step 3: Promote bigrams that dominate both constituent words (>50% threshold)
+  for (const [bigram, bigramEntry] of bigrams) {
+    const spaceIdx = bigram.indexOf(' ');
+    const word1 = bigram.slice(0, spaceIdx);
+    const word2 = bigram.slice(spaceIdx + 1);
+    const w1Entry = words.get(word1);
+    const w2Entry = words.get(word2);
+    if (!w1Entry || !w2Entry) continue;
+
+    const bigram_duration = bigramEntry.duration;
+    if (bigram_duration / w1Entry.duration > 0.5 && bigram_duration / w2Entry.duration > 0.5) {
+      // Promote bigram, reduce constituent word durations
+      words.set(bigram, { word: bigram, duration: bigram_duration, events: bigramEntry.events });
+      w1Entry.duration -= bigram_duration;
+      w2Entry.duration -= bigram_duration;
+    }
+  }


Mutating word durations mid-loop causes incorrect promotions and negative values

The promotion loop reads w1Entry.duration and w2Entry.duration to decide whether a bigram qualifies, but then immediately mutates those same values. For a trigram that always co-occurs (e.g. "Visual Studio Code"), both "Visual Studio" and "Studio Code" will each have 100 % of "Studio"'s duration, so they both qualify. Whichever processes second sees Studio.duration = 0, making the ratio Infinity, so it also promotes — and Studio.duration ends up negative. Both bigrams are then shown to the user with the full duration, double-counting the same events.

Snapshot the original durations before the loop so each promotion decision is independent:

// Step 3: Promote bigrams that dominate both constituent words (>50% threshold) const originalDurations = new Map<string, number>(); for (const [w, entry] of words) { originalDurations.set(w, entry.duration); } for (const [bigram, bigramEntry] of bigrams) { const spaceIdx = bigram.indexOf(' '); const word1 = bigram.slice(0, spaceIdx); const word2 = bigram.slice(spaceIdx + 1); const w1Entry = words.get(word1); const w2Entry = words.get(word2); if (!w1Entry || !w2Entry) continue; const bigram_duration = bigramEntry.duration; const orig1 = originalDurations.get(word1) ?? 0; const orig2 = originalDurations.get(word2) ?? 0; if (bigram_duration / orig1 > 0.5 && bigram_duration / orig2 > 0.5) { words.set(bigram, { word: bigram, duration: bigram_duration, events: bigramEntry.events }); w1Entry.duration -= bigram_duration; w2Entry.duration -= bigram_duration; } }

greptile-apps · 2026-04-14T17:29:29Z

+  for (const event of events) {
+    const parts = event.data.title.split(SPLIT_REGEX);
+    for (let i = 0; i < parts.length - 1; i++) {
+      const w1 = parts[i];
+      const w2 = parts[i + 1];
+      if (w1.length <= 2 || ignored_words.includes(w1)) continue;
+      if (w2.length <= 2 || ignored_words.includes(w2)) continue;
+      const bigram = `${w1} ${w2}`;
+      const entry = bigrams.get(bigram);
+      if (entry) {
+        entry.duration += event.duration;
+        entry.events.push(event);
+      } else {
+        bigrams.set(bigram, { bigram, duration: event.duration, events: [event] });
+      }
+    }


Non-adjacent words silently bigrammed across separator gaps

After splitting "foo-bar" the array is ["foo", "bar"]. The loop pairs parts[i] with parts[i+1], so "foo" and "bar" are treated as an adjacent bigram even though a hyphen separated them. Titles like "JIRA-1234 Fix bug" could unexpectedly promote "Fix bug" as a bigram even when "JIRA" and "1234" are filtered. This is a design decision worth calling out explicitly in the JSDoc — if the intent is "adjacent in the cleaned word stream" then the current approach is correct; if the intent is "strictly space-adjacent in the original title" then the regex should only split on whitespace for bigram indexing.

No code change required if this is the intended semantics; just a documentation note.

greptile-apps · 2026-04-14T17:29:31Z

+  test('accumulated duration across multiple events', () => {
+    const events = [
+      makeEvent('Python Programming', 30),
+      makeEvent('Python Programming', 30),
+      makeEvent('Python Programming', 30),
+      makeEvent('Python Programming', 30),
+    ];
+    const result = findCommonPhrases(events, []);
+    // All events have same title → bigram fully dominates
+    expect(result.get('Python Programming')?.duration).toBe(120);
+    expect(result.get('Python')?.duration).toBe(0);
+    expect(result.get('Programming')?.duration).toBe(0);
+  });


Missing test for trigram / middle-word double-promotion

The existing tests only verify single-bigram promotion. The mutation bug described above is exercisable with a three-word phrase where every event has the same title (e.g. "Visual Studio Code"), where both "Visual Studio" and "Studio Code" should be candidates. A test like the one below would catch the negative-duration regression:

test('does not produce negative duration when middle word appears in two bigrams', () => { // "Studio" appears in both "Visual Studio" and "Studio Code" const events = [makeEvent('Visual Studio Code', 100), makeEvent('Visual Studio Code', 100)]; const result = findCommonPhrases(events, []); for (const [, entry] of result) { expect(entry.duration).toBeGreaterThanOrEqual(0); } });

…loop Without this, promoting a bigram (e.g. 'Alpha Beta') reduces constituent word durations in-place. A later bigram that shares the middle word (e.g. 'Beta Gamma') then sees Beta.duration=0, so the check becomes 10/0 = Infinity > 0.5 and the weak bigram is incorrectly promoted. Fix: build an originalDurations snapshot before the Step 3 loop and use it for all threshold comparisons; mutations to entry.duration still happen (for accurate display) but no longer corrupt subsequent checks. Also adds a regression test that fails on the unfixed code.

TimeToBuildBob · 2026-04-14T17:39:14Z

Fixed the P1 bug flagged by Greptile in 99b073f.

Root cause: In Step 3 the promotion loop mutated w1Entry.duration / w2Entry.duration in-place. When two bigrams share a middle word (trigram case), the first promotion reduces that word's duration to 0. The next check then sees 10 / 0 = Infinity > 0.5, so a weak bigram that should not qualify gets incorrectly promoted.

Fix: snapshot all word durations into originalDurations before the loop and use those snapshots for threshold comparisons. Mutations to entry.duration still happen (so individual word entries reflect the reduced "leftover" time), but they no longer corrupt subsequent threshold checks.

Regression test added: 'trigram: does not double-promote when middle word duration is consumed' — sets up makeEvent('Alpha Beta', 100) + makeEvent('Alpha Beta Gamma', 10) where "Beta Gamma" (10/110 ≈ 9% of Beta) must not be promoted. The test fails on the unfixed code and passes after the fix.

…ned guards

TimeToBuildBob · 2026-04-14T17:42:48Z

Fixed lint failure in 4a853eb: replaced the two ! non-null assertions on originalDurations.get() with explicit undefined checks in the loop guard. Logically they were always defined (populated from the same words map), but ESLint correctly flagged the unsafe assertions.

TimeToBuildBob · 2026-04-14T17:54:57Z

Fixed the remaining lint warning in 224a5a1: broke the over-length guard clause onto two lines to satisfy prettier's line-length rule (the previous commit fixed the logic but the resulting line was 100 chars, over prettier's 80-char limit).

ErikBjare · 2026-04-14T18:55:20Z

@greptileai review

TimeToBuildBob mentioned this pull request Apr 14, 2026

feat: improved categorization helper to include common bigrams #455

Closed

1 task

greptile-apps bot reviewed Apr 14, 2026

View reviewed changes

fix(categorization): replace non-null assertions with explicit undefi…

4a853eb

…ned guards

style(categorization): fix prettier line-length warning in guard clause

224a5a1

ErikBjare merged commit 5529031 into ActivityWatch:master Apr 14, 2026
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(categorization): extract findCommonPhrases util with bigram support and tests#803

feat(categorization): extract findCommonPhrases util with bigram support and tests#803
ErikBjare merged 4 commits intoActivityWatch:masterfrom
TimeToBuildBob:feat/categorization-bigrams-with-tests

TimeToBuildBob commented Apr 14, 2026

Uh oh!

codecov bot commented Apr 14, 2026 •

edited

Loading

Uh oh!

greptile-apps bot commented Apr 14, 2026 •

edited

Loading

Uh oh!

greptile-apps bot Apr 14, 2026

Uh oh!

greptile-apps bot Apr 14, 2026

Uh oh!

greptile-apps bot Apr 14, 2026

Uh oh!

TimeToBuildBob commented Apr 14, 2026

Uh oh!

TimeToBuildBob commented Apr 14, 2026

Uh oh!

TimeToBuildBob commented Apr 14, 2026

Uh oh!

ErikBjare commented Apr 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

TimeToBuildBob commented Apr 14, 2026

Changes

Tests

Uh oh!

codecov bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

greptile-apps bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

TimeToBuildBob commented Apr 14, 2026

Uh oh!

TimeToBuildBob commented Apr 14, 2026

Uh oh!

TimeToBuildBob commented Apr 14, 2026

Uh oh!

ErikBjare commented Apr 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Apr 14, 2026 •

edited

Loading

greptile-apps bot commented Apr 14, 2026 •

edited

Loading