refactor: decompose _detect_text_based_table (closes #98)#101
Open
refactor: decompose _detect_text_based_table (closes #98)#101
Conversation
- Add import pytest to test file - Add 5 behavioural safety net tests to TestDetectTextBasedTable: test_text_based_avg_row_spacing_used_for_bottom test_text_based_no_valid_row_spacings_uses_fallback test_text_based_large_gap_acts_as_footer test_text_based_transaction_indicator_skips_row test_text_based_too_small_bbox_returns_none - Fix test inputs: add row2 to ensure bbox height exceeds min_table_height=50 threshold
- Add from collections import defaultdict at module level (replaces inline import) - Extend from typing import Any to include ClassVar - Add HEADER_KEYWORDS, TRANSACTION_INDICATORS, FOOTER_KEYWORDS as ClassVar constants - Remove local constant assignments and inline import from _detect_text_based_table - Update all references inside function to use self.HEADER_KEYWORDS etc.
…w tests - Extract Y-bucketing loop into _group_words_by_row private method - Replace inline grouping block in _detect_text_based_table with self._group_words_by_row(words) - Add TestGroupWordsByRow class with 3 unit tests: test_groups_words_into_5px_buckets, test_words_in_different_buckets, test_empty_words_returns_empty - Fix test bucket boundary input (101/102 instead of 101/103)
- Add _find_header_row(y_groups) private method before _detect_text_based_table - Replace inline header-scanning loop with self._find_header_row(y_groups) - Method returns first Y-position matching column header pattern (3+ keywords, 4-10 words, no transaction indicators)
…d_table - Add _find_footer_boundary(y_groups, header_y) private method - Returns (footer_start_y, data_y_positions) tuple (data rows only, not header) - Replace inline footer-scanning loop with self._find_footer_boundary(y_groups, header_y) - Caller prepends header_y to build table_y_positions
…table - Add _calculate_bottom_y(table_y_positions, footer_start_y) private method - Replace inline footer/spacing calculation block with self._calculate_bottom_y() - _detect_text_based_table now thin orchestrating shell (CC drops from E/32 to B/9)
…le_detector.py - Remove # noqa: C901, PLR0912, PLR0915 from _detect_text_based_table def line - Fix RUF005: use [header_y, *data_y_positions] instead of list concatenation - Fix RUF001: restore actual × character in log string (was \u00d7 escape) - Remove table_detector.py from xenon --exclude in ci.yml - Remove temporary exclusion comment block referencing issue #98 - All sub-functions now grade B (CC <= 10); xenon passes without exclusion
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Pull Request
Summary
Decomposes
_detect_text_based_table(previously grade E, CC=32) into four focused private methods, adds characterisation tests as a safety net, and removes the temporary Xenon exclusion from CI. Closes #98.Changes
TestDetectTextBasedTable(safety net before refactoring)_group_words_by_rowin newTestGroupWordsByRowclassHEADER_KEYWORDS,TRANSACTION_INDICATORS,FOOTER_KEYWORDStoClassVarconstants onTableDetectorfrom collections import defaultdictfrom inline function import to module level_group_words_by_row(words)— Y-bucketing into 5px groups_find_header_row(y_groups)— column header pattern scan_find_footer_boundary(y_groups, header_y)— footer scan + data row accumulation_calculate_bottom_y(table_y_positions, footer_start_y)— bottom boundary calculation# noqa: C901, PLR0912, PLR0915from_detect_text_based_tabledefinitiontable_detector.pyfrom Xenon--excludelist inci.ymlType
Testing
make docker-integrationpassed locally (required when touchingpackages/parser-core/)Checklist
ruff checkexits 0,xenonexits 0, radon max grade B (CC=9)Downstream impact
bankstatements_core(exported class, function, or exception)Complexity metrics (after)
_find_header_row_find_footer_boundary_detect_text_based_table_calculate_bottom_y_group_words_by_rowPreviously
_detect_text_based_tablewas grade E (CC=32).