Add test suite: 92% coverage on the core library by shaunpatterson · Pull Request #6 · ekimetrics/adaptive-chunking

shaunpatterson · 2026-06-16T03:21:10Z

Brings the importable core library from ~19% to 92% test coverage, and adds a coverage config that scopes measurement to the library itself, excluding the research scripts and the heavy parser adapters.

Note: this branch is stacked on the four fix PRs (#2–#5) — it merges them so the new tests assert the fixed behavior. Once those merge, this rebases to just the test additions + config.

What's added

Pure-logic and mockable tests (no heavy ML deps, no network, no model downloads):

- splitters.py (99%): both merge modes, backward order, overlap re-splitting, regex separators, empty-string binary split, validation errors, group_chunks/group_pages/combine_blocks/regex_splitter.
- postprocessing.py (99%): page/title info, gap check/repair, oversized-split & small-chunk merges, and the *_from_df drivers with tiny parquet fixtures.
- metrics.py (77%): block integrity, missing-ref errors, cohesion/coherence/dissimilarity with a deterministic fake embedding model, sentence_transformers mocked via sys.modules, real scikit-learn for the lexical metric, and the pure coref helpers.
- jina_embedder.py (80%): happy-path + normalization with httpx mocked.
- pipeline.py (94%): chunk_files end-to-end with a fake parser.
- compute_metrics / split_documents / extract_mentions (99–100%): directory drivers with fake splitters/models and parquet/JSON fixtures.
- chunking_utils.py (90%).
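
For readers unfamiliar with the sys.modules technique mentioned for metrics.py, here is a minimal sketch of how a heavy dependency can be replaced by a deterministic fake. It is not the code from this PR: `FakeSentenceTransformer`, `make_fake_module` and `embed_with_fake` are hypothetical names, and the vector formula is arbitrary.

```python
import sys
import types
from unittest import mock


class FakeSentenceTransformer:
    """Hypothetical stand-in: maps each text to a small deterministic vector."""

    def __init__(self, model_name: str = "fake") -> None:
        self.model_name = model_name

    def encode(self, texts):
        # The vector depends only on the text, so results are reproducible.
        return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


def make_fake_module() -> types.ModuleType:
    # Build an in-memory module exposing the one name the code under test imports.
    module = types.ModuleType("sentence_transformers")
    module.SentenceTransformer = FakeSentenceTransformer
    return module


def embed_with_fake(texts):
    # patch.dict restores sys.modules on exit, so the fake does not leak
    # into other tests.
    with mock.patch.dict(sys.modules, {"sentence_transformers": make_fake_module()}):
        from sentence_transformers import SentenceTransformer  # resolves to the fake

        return SentenceTransformer("any-name").encode(texts)
```

Because the import statement consults sys.modules first, no model download or real package is needed while the patch is active.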

Coverage config (`[tool.coverage]` in pyproject)

Scopes measurement to the library and omits:

- paper/*: one-off research replication scripts, not part of the API.
- parsing.py: thin adapters over Azure DI / Docling / PyMuPDF, which are heavy optional dependencies needing model downloads (still exercised by test_parsing.py, just not measured).

The remaining uncovered gap is the spaCy/maverick coreference internals in metrics.py (extract_entity_pronoun_pairs, CoreferenceSolver.__init__/find_mentions), which require real models to run.

Result

pytest: 211 passed, 1 skipped; 92% on the scoped target.

🤖 Generated with Claude Code

str.find() returns -1 for an absent chunk and never raises, so the try/except ValueError was dead code: a missing chunk produced start_idx=-1, which silently became a valid-looking offset and corrupted every reference boundary downstream — yielding wrong reference-completeness metrics with no error. Check the -1 sentinel explicitly and raise. Adds regression tests covering boundary-splitting, no-split, the missing-chunk raise, and the empty-pairs case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
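
A minimal sketch of the sentinel check described above. `locate_chunk` is a hypothetical helper, not the function changed in this PR; it only illustrates that `str.find()` signals absence with -1 while `str.index()` is the variant that raises.

```python
def locate_chunk(document: str, chunk: str, search_from: int = 0) -> tuple[int, int]:
    """Return (start, end) offsets of chunk in document, or raise if it is absent."""
    start_idx = document.find(chunk, search_from)
    # str.find() never raises; -1 means "not found". Without this check, -1
    # would be used as an ordinary offset by later slicing and arithmetic.
    if start_idx == -1:
        raise ValueError(f"chunk not found in document: {chunk[:40]!r}")
    return start_idx, start_idx + len(chunk)


if __name__ == "__main__":
    print(locate_chunk("hello world", "world"))
    try:
        locate_chunk("hello world", "missing")
    except ValueError as exc:
        print("raised:", exc)
```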

CoreferenceSolver._tokenize_by_word called spacy.load("en_core_web_sm") on every invocation, and find_mentions() invokes it once per context window. Loading a spaCy model costs ~0.5-2s, so on a long document this dominated coreference runtime. Move the load behind a module-level functools.lru_cache so the model is built exactly once per process and shared across all solver instances. Adds tests (fake spaCy, no real dependency) asserting the model loads exactly once across repeated calls and that tokenization is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
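
The caching pattern can be sketched as follows. To stay runnable without spaCy, `_expensive_load` is a hypothetical stand-in for `spacy.load`, and `Solver` and `get_tokenizer_model` are illustrative names rather than the ones in metrics.py.

```python
import functools

LOAD_CALLS = []  # records every real load so the caching can be observed


def _expensive_load(name: str):
    # Stand-in for a slow model load (assumption: the real code calls spacy.load).
    LOAD_CALLS.append(name)
    return lambda text: text.split()


@functools.lru_cache(maxsize=None)
def get_tokenizer_model(name: str = "en_core_web_sm"):
    # Module-level cache: the first call loads, later calls reuse the result.
    return _expensive_load(name)


class Solver:
    def tokenize_by_word(self, text: str) -> list[str]:
        nlp = get_tokenizer_model()  # shared across all Solver instances
        return nlp(text)


if __name__ == "__main__":
    Solver().tokenize_by_word("first window")
    Solver().tokenize_by_word("second window")
    print(LOAD_CALLS)
```

One caveat with this design: lru_cache keys on the call arguments as written, so callers should request the model the same way each time to hit the same cache entry.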

split_documents.py and extract_mentions.py both read confidence/lang_code outside the 'if lang_probs:' block that assigned them. When detect_langs returned an empty list this raised NameError on the first document, and otherwise leaked the previous document's language/confidence into the next iteration's skip decision. Extract the (duplicated) check into chunking_utils.is_high_confidence_non_english, which keeps the document when language can't be determined. Adds tests covering high/low confidence, English, empty result, and the exception path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
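
A sketch of what such a helper could look like, under stated assumptions: the signature, the injected `detect_langs` callable, the `LangProb` type and the 0.9 threshold are all hypothetical and not taken from chunking_utils. The point is that every variable is assigned inside the function on each call, so nothing can leak between documents.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class LangProb:
    """Hypothetical result type: a language code and its probability."""
    lang: str
    prob: float


def is_high_confidence_non_english(
    text: str,
    detect_langs: Callable[[str], Sequence[LangProb]],
    threshold: float = 0.9,  # assumed cutoff, not taken from the PR
) -> bool:
    try:
        lang_probs = detect_langs(text)
    except Exception:
        return False  # detection failed: keep the document
    if not lang_probs:
        return False  # language undetermined: keep the document
    top = max(lang_probs, key=lambda lp: lp.prob)
    return top.lang != "en" and top.prob >= threshold
```

A caller would skip a document only when this returns True, which matches the stated behavior of keeping documents whose language cannot be determined.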

find_chunks_start_and_end and repair_gaps_between_chunks returned None on empty input despite a list[...] return hint. Callers zip/iterate the result, so an empty document could raise TypeError instead of degrading gracefully. Return [] and tighten repair_gaps_between_chunks's hint to list[str]. Adds empty-input regression tests for both. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
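
To illustrate the empty-input contract, here is a simplified sketch. `chunk_spans` is a hypothetical function that assumes ordered, non-overlapping chunks; it is not the implementation of find_chunks_start_and_end.

```python
def chunk_spans(document: str, chunks: list[str]) -> list[tuple[int, int]]:
    """Return (start, end) offsets for each chunk; an empty input yields []."""
    spans: list[tuple[int, int]] = []
    cursor = 0
    for chunk in chunks:
        start = document.find(chunk, cursor)
        if start == -1:
            raise ValueError(f"chunk not found: {chunk[:40]!r}")
        end = start + len(chunk)
        spans.append((start, end))
        cursor = end
    # Always a list, never None, so callers can zip or iterate unconditionally.
    return spans


if __name__ == "__main__":
    for span, chunk in zip(chunk_spans("abcdef", []), []):
        print(span, chunk)  # never reached, and no TypeError is raised
    print(chunk_spans("abcdef", ["abc", "def"]))
```

Had the function returned None for the empty case, the `zip` call above would raise TypeError because None is not iterable.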

Adds pure-logic and mockable tests across splitters, postprocessing (incl. the *_from_df drivers), metrics (fake embedding model, mocked sentence-transformers, real sklearn), jina_embedder (mocked httpx), pipeline (fake parser), and the split/metrics/mentions directory drivers (fake splitters/models, parquet fixtures). Adds a coverage config scoping the target to the importable library and excluding paper/ (research replication scripts) and parsing.py (thin adapters over heavy optional SDKs: Azure DI / Docling / PyMuPDF). On that scope coverage is 92% (was ~19%); the remaining gap is the spaCy/maverick coref internals in metrics.py, which need real models to exercise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

split_documents, compute_metrics, postprocessing (*_from_df) and extract_mentions all read/write .parquet via pandas, but pandas>=2.2 does not pull a parquet engine automatically, so these functions (and their tests) fail at runtime without one. Add pyarrow as an explicit dependency. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
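
A small sketch of how a missing parquet engine could be detected up front instead of at the first read or write. The helper names are hypothetical and not part of this PR; the candidate list reflects the two engines pandas looks for (pyarrow, then fastparquet).

```python
import importlib.util

PARQUET_ENGINES = ("pyarrow", "fastparquet")


def available_parquet_engines(candidates=PARQUET_ENGINES) -> list[str]:
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


def require_parquet_engine(candidates=PARQUET_ENGINES) -> str:
    found = available_parquet_engines(candidates)
    if not found:
        raise ImportError(
            "no parquet engine installed; install one of: " + ", ".join(candidates)
        )
    return found[0]


if __name__ == "__main__":
    print(available_parquet_engines())
```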

shaunpatterson and others added 10 commits June 15, 2026 22:58

Merge branch 'fix/coref-load-spacy-once' into test/coverage-90

d1e563a

# Conflicts: # tests/test_metrics.py

Merge branch 'fix/language-filter-scope' into test/coverage-90

8d0238f

Merge branch 'fix/empty-list-contracts' into test/coverage-90

e767476

Resolve test_metrics merge conflict (combine fix branches' tests)

3a4c431

shaunpatterson mentioned this pull request Jun 16, 2026

Structural dedup: shared chunks-IO, splitter & parser helpers (~190 fewer lines) #7

Open

shaunpatterson wants to merge 10 commits into ekimetrics:main from shaunpatterson:test/coverage-90
