refactor: decompose _detect_text_based_table (closes #98) by longieirl · Pull Request #101 · longieirl/bankstatementprocessor

longieirl · 2026-04-03T08:23:51Z

Pull Request

Summary

Decomposes _detect_text_based_table (previously grade E, CC=32) into four focused private methods, adds characterisation tests as a safety net, and removes the temporary Xenon exclusion from CI. Closes #98.

Changes

Add 5 characterisation tests to TestDetectTextBasedTable (safety net before refactoring)
Add 3 unit tests for _group_words_by_row in new TestGroupWordsByRow class
Promote HEADER_KEYWORDS, TRANSACTION_INDICATORS, FOOTER_KEYWORDS to ClassVar constants on TableDetector
Move `from collections import defaultdict` from an inline function import to module level
Extract _group_words_by_row(words) — Y-bucketing into 5px groups
Extract _find_header_row(y_groups) — column header pattern scan
Extract _find_footer_boundary(y_groups, header_y) — footer scan + data row accumulation
Extract _calculate_bottom_y(table_y_positions, footer_start_y) — bottom boundary calculation
Remove # noqa: C901, PLR0912, PLR0915 from _detect_text_based_table definition
Remove table_detector.py from Xenon --exclude list in ci.yml

Type

Testing

Tests pass (coverage ≥ 91%) — 1430 passed, 92.15% coverage
Manually tested
make docker-integration passed locally (required when touching packages/parser-core/)

Checklist

Code follows project style
Self-reviewed
Documentation updated (if needed)
No new warnings — ruff check exits 0, xenon exits 0, radon max grade B (CC=9)

Downstream impact

This PR changes a public interface in bankstatements_core (exported class, function, or exception)

Complexity metrics (after)

| Method | Grade | CC |
|---|---|---|
| `_find_header_row` | B | 9 |
| `_find_footer_boundary` | B | 9 |
| `_detect_text_based_table` | B | 9 |
| `_calculate_bottom_y` | B | 6 |
| `_group_words_by_row` | A | 2 |

Previously _detect_text_based_table was grade E (CC=32).

- Add `import pytest` to test file
- Add 5 behavioural safety net tests to TestDetectTextBasedTable:
  - test_text_based_avg_row_spacing_used_for_bottom
  - test_text_based_no_valid_row_spacings_uses_fallback
  - test_text_based_large_gap_acts_as_footer
  - test_text_based_transaction_indicator_skips_row
  - test_text_based_too_small_bbox_returns_none
- Fix test inputs: add row2 to ensure bbox height exceeds min_table_height=50 threshold

- Add `from collections import defaultdict` at module level (replaces inline import)
- Extend `from typing import Any` to include ClassVar
- Add HEADER_KEYWORDS, TRANSACTION_INDICATORS, FOOTER_KEYWORDS as ClassVar constants
- Remove local constant assignments and inline import from _detect_text_based_table
- Update all references inside function to use self.HEADER_KEYWORDS etc.
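
A minimal sketch of what promoting the constants to `ClassVar` might look like. The keyword values, the use of `frozenset`, and the `_count_header_keywords` helper are assumptions made for illustration; this PR description does not show the real sets or how they are consumed.

```python
from typing import ClassVar


class TableDetector:
    # Placeholder values only: the real keyword sets are not listed in this PR.
    HEADER_KEYWORDS: ClassVar[frozenset[str]] = frozenset(
        {"date", "details", "debit", "credit", "balance"}
    )
    TRANSACTION_INDICATORS: ClassVar[frozenset[str]] = frozenset({"pos", "atm"})
    FOOTER_KEYWORDS: ClassVar[frozenset[str]] = frozenset({"continued", "page"})

    def _count_header_keywords(self, row_text: str) -> int:
        """Hypothetical helper: reads the constant via self, not a local assignment."""
        lowered = row_text.lower()
        return sum(1 for kw in self.HEADER_KEYWORDS if kw in lowered)
```

Declaring the sets once at class level means every extracted helper can share them without re-creating them on each call.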

…w tests
- Extract Y-bucketing loop into _group_words_by_row private method
- Replace inline grouping block in _detect_text_based_table with self._group_words_by_row(words)
- Add TestGroupWordsByRow class with 3 unit tests: test_groups_words_into_5px_buckets, test_words_in_different_buckets, test_empty_words_returns_empty
- Fix test bucket boundary input (101/102 instead of 101/103)
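
The Y-bucketing step could be sketched roughly as below. This is written as a free function rather than a method, and it assumes each word is a dict with a numeric `top` key and that positions are snapped to the nearest multiple of 5; neither detail is confirmed by this PR.

```python
from collections import defaultdict
from typing import Any


def group_words_by_row(
    words: list[dict[str, Any]], bucket_px: int = 5
) -> dict[int, list[dict[str, Any]]]:
    """Group word dicts into rows keyed by a bucketed Y position."""
    y_groups: dict[int, list[dict[str, Any]]] = defaultdict(list)
    for word in words:
        # Assumption: snap to the nearest multiple of bucket_px.
        y_key = round(word["top"] / bucket_px) * bucket_px
        y_groups[y_key].append(word)
    return dict(y_groups)


words = [
    {"text": "Date", "top": 101.0},
    {"text": "Balance", "top": 102.0},
    {"text": "Tesco", "top": 103.0},
]
rows = group_words_by_row(words)
```

Under this rounding rule, 101 and 102 share the 100 bucket while 103 lands in 105. That would be consistent with the test-input fix noted in the commit, though the real bucketing rule may differ.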

- Add _find_header_row(y_groups) private method before _detect_text_based_table
- Replace inline header-scanning loop with self._find_header_row(y_groups)
- Method returns first Y-position matching column header pattern (3+ keywords, 4-10 words, no transaction indicators)

…d_table
- Add _find_footer_boundary(y_groups, header_y) private method
- Returns (footer_start_y, data_y_positions) tuple (data rows only, not header)
- Replace inline footer-scanning loop with self._find_footer_boundary(y_groups, header_y)
- Caller prepends header_y to build table_y_positions
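
One way the footer scan and data-row accumulation might fit together is sketched below. The footer keywords, the `MAX_ROW_GAP` threshold, and the choice to treat the first row after a large gap as the footer start are all assumptions; only the return shape `(footer_start_y, data_y_positions)` comes from the commit.

```python
from typing import Any, Optional

FOOTER_KEYWORDS = ("continued", "page")  # hypothetical values
MAX_ROW_GAP = 50  # hypothetical gap threshold, in the same units as Y


def find_footer_boundary(
    y_groups: dict[int, list[dict[str, Any]]], header_y: int
) -> tuple[Optional[int], list[int]]:
    """Scan rows below the header; stop at a footer keyword or a large gap."""
    footer_start_y: Optional[int] = None
    data_y_positions: list[int] = []
    previous_y = header_y
    for y in sorted(y_groups):
        if y <= header_y:
            continue  # skip the header and anything above it
        row_text = " ".join(w["text"].lower() for w in y_groups[y])
        is_footer = any(kw in row_text for kw in FOOTER_KEYWORDS)
        if is_footer or y - previous_y > MAX_ROW_GAP:
            footer_start_y = y
            break
        data_y_positions.append(y)
        previous_y = y
    return footer_start_y, data_y_positions


y_groups = {
    100: [{"text": "Date"}, {"text": "Details"}],
    120: [{"text": "01/02"}, {"text": "Tesco"}],
    140: [{"text": "02/02"}, {"text": "Rent"}],
    300: [{"text": "Statement"}, {"text": "end"}],
}
footer_start_y, data_y_positions = find_footer_boundary(y_groups, header_y=100)
```

Returning data rows only keeps the helper independent of how the caller assembles `table_y_positions`.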

…table
- Add _calculate_bottom_y(table_y_positions, footer_start_y) private method
- Replace inline footer/spacing calculation block with self._calculate_bottom_y()
- _detect_text_based_table now thin orchestrating shell (CC drops from E/32 to B/9)
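
The test names in this PR suggest the bottom boundary uses the average row spacing, with a fallback when no valid spacings exist. A minimal sketch under those assumptions follows; the fallback value, the precedence given to a detected footer, and the definition of a "valid" spacing as any positive gap are guesses, and the real method (CC=6) likely handles more cases.

```python
from typing import Optional

FALLBACK_ROW_HEIGHT = 20.0  # hypothetical default when spacing cannot be measured


def calculate_bottom_y(
    table_y_positions: list[int], footer_start_y: Optional[int]
) -> float:
    """Estimate the bottom edge of the table.

    Assumes table_y_positions is non-empty (it always contains header_y).
    """
    if footer_start_y is not None:
        # Assumption: a detected footer bounds the table directly.
        return float(footer_start_y)
    ys = sorted(table_y_positions)
    spacings = [b - a for a, b in zip(ys, ys[1:]) if b - a > 0]
    row_height = sum(spacings) / len(spacings) if spacings else FALLBACK_ROW_HEIGHT
    # Extend one estimated row height below the last known row.
    return ys[-1] + row_height
```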

…le_detector.py
- Remove # noqa: C901, PLR0912, PLR0915 from _detect_text_based_table def line
- Fix RUF005: use [header_y, *data_y_positions] instead of list concatenation
- Fix RUF001: restore actual × character in log string (was \u00d7 escape)
- Remove table_detector.py from xenon --exclude in ci.yml
- Remove temporary exclusion comment block referencing issue #98
- All sub-functions now grade B (CC <= 10); xenon passes without exclusion
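
For readers unfamiliar with RUF005, the change amounts to the following. The variable values are made up for illustration; only the `[header_y, *data_y_positions]` form comes from the commit.

```python
header_y = 100
data_y_positions = [120, 140, 160]

# Flagged by Ruff's RUF005: list literal concatenated with another list.
concatenated = [header_y] + data_y_positions

# Form used in the fix: iterable unpacking inside a single list literal.
table_y_positions = [header_y, *data_y_positions]
```

Both forms produce the same list; the unpacking form avoids building a temporary one-element list.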

web-flow added 7 commits April 3, 2026 09:17

github-actions bot added the bug and ci labels Apr 3, 2026

longieirl wants to merge 7 commits into main from fix/98-decompose-detect-text-based-table
