Skip to content

fix(classification): guard against empty content in dedup check#600

Open
norrietaylor wants to merge 1 commit into
mainfrom
fix/classification-empty-content-dedup
Open

fix(classification): guard against empty content in dedup check#600
norrietaylor wants to merge 1 commit into
mainfrom
fix/classification-empty-content-dedup

Conversation

@norrietaylor

@norrietaylor norrietaylor commented Jun 6, 2026

Copy link
Copy Markdown
Owner

Root cause

DeduplicationChecker.check() in src/distillery/classification/dedup.py extracted the first line of the most similar entry's content via:

first_line = highest.entry.content.splitlines()[0][:80]

When the most similar entry returned by find_similar has empty content, "".splitlines() returns [], so indexing [0] raises IndexError: list index out of range. Entry performs no validation forbidding empty content, so such an entry can legitimately exist in the store and be returned as a near-duplicate match, crashing the entire dedup check (and any caller, e.g. the distillery_store MCP tool).

Fix

Guard the index access:

content_lines = highest.entry.content.splitlines()
first_line = content_lines[0][:80] if content_lines else ""

When content is empty, first_line becomes "" and the reasoning string degrades gracefully instead of raising. Surgical, single-spot change; action selection and all other behavior are unchanged.

Acceptance

  • New test TestEmptyContentMatch::test_empty_content_match_returns_result in tests/test_dedup.py reproduces the crash (fails with IndexError before the fix) and passes after.
  • Full dedup/classification test area green: tests/test_dedup.py, tests/test_classification_engine.py, tests/test_classification/, tests/test_mcp_dedup.py, tests/test_store_dedup_response.py (119 passed).
  • ruff check and mypy --strict clean on touched files.

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Fixed a crash that occurred during deduplication when processing empty content. The deduplication process now handles empty content gracefully without raising errors.

DeduplicationChecker.check() crashed with IndexError when the most
similar entry returned by find_similar had empty content, because
"".splitlines() returns [] and indexing [0] is out of range.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Too much diff to scan? Review this PR in Change Stack to start with the highest-impact changes.

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 3c99a336-51aa-4991-a85d-9e91ebff3116

📥 Commits

Reviewing files that changed from the base of the PR and between 5a437a2 and a173302.

📒 Files selected for processing (2)
  • src/distillery/classification/dedup.py
  • tests/test_dedup.py

📝 Walkthrough

Walkthrough

This PR fixes an IndexError in DeduplicationChecker.check() that occurs when the similarity store returns an entry with empty content. The fix safely derives the first line by checking if lines exist before indexing, and a new test validates the corrected behavior.

Changes

Empty Content Safety in Deduplication

Layer / File(s) Summary
Empty content safety in deduplication
src/distillery/classification/dedup.py, tests/test_dedup.py
check() now computes content_lines from highest.entry.content.splitlines() and sets first_line only if content_lines is non-empty, preventing IndexError on empty content. New test TestEmptyContentMatch verifies empty content returns a SKIP result with the expected score rather than crashing.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

A rabbit hops through empty lines with care,
No index faults in dedup's repair,
When content vanishes without a trace,
The fix ensures we still complete the race. 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main fix: guarding against empty content in the dedup check, which directly matches the primary change made to prevent IndexError when content is empty.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant