Add test suite: 92% coverage on the core library #6

Open
shaunpatterson wants to merge 10 commits into ekimetrics:main from shaunpatterson:test/coverage-90

@shaunpatterson

Brings the importable core library from ~19% to 92% test coverage, and adds a coverage config that scopes the target sensibly.

Note: this branch is stacked on the four fix PRs (#2–#5); it merges them so the new tests assert the fixed behavior. Once those merge, this rebases to just the test additions + config.

What's added

Pure-logic and mockable tests (no heavy ML deps, no network, no model downloads):

  • splitters.py (99%) — both merge modes, backward order, overlap re-splitting, regex separators, empty-string binary split, validation errors, group_chunks/group_pages/combine_blocks/regex_splitter.
  • postprocessing.py (99%) — page/title info, gap check/repair, oversized-split & small-chunk merges, and the *_from_df drivers with tiny parquet fixtures.
  • metrics.py (77%) — block integrity, missing-ref errors, cohesion/coherence/dissimilarity with a deterministic fake embedding model, sentence_transformers mocked via sys.modules, real scikit-learn for the lexical metric, and the pure coref helpers.
  • jina_embedder.py (80%) — happy-path + normalization with httpx mocked.
  • pipeline.py (94%) — chunk_files end-to-end with a fake parser.
  • compute_metrics / split_documents / extract_mentions (99–100%) — directory drivers with fake splitters/models and parquet/JSON fixtures.
  • chunking_utils.py (90%).
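
For readers unfamiliar with the sys.modules technique mentioned for metrics.py, here is a minimal sketch of how a heavy dependency can be swapped for a deterministic fake. `FakeSentenceTransformer` and `embed_with_fake` are hypothetical names for illustration, not the fixtures used in this PR, and the two-number "embedding" is an arbitrary choice made only so the output is reproducible.

```python
import sys
import types
from unittest import mock


class FakeSentenceTransformer:
    """Deterministic stand-in: embeds text from character statistics, no model download."""

    def __init__(self, model_name: str = "fake-model"):
        self.model_name = model_name

    def encode(self, sentences):
        # Two-dimensional embedding: [length, vowel count]. Same input, same output.
        return [
            [float(len(s)), float(sum(c in "aeiou" for c in s.lower()))]
            for s in sentences
        ]


def embed_with_fake(sentences):
    # Build a stub module exposing only the attribute the code under test imports.
    fake_module = types.ModuleType("sentence_transformers")
    fake_module.SentenceTransformer = FakeSentenceTransformer

    # patch.dict restores sys.modules on exit, so the stub cannot leak into other tests.
    with mock.patch.dict(sys.modules, {"sentence_transformers": fake_module}):
        from sentence_transformers import SentenceTransformer  # resolves to the stub

        model = SentenceTransformer("any-name")
        return model.encode(sentences)
```

Because the stub is installed before the import statement runs, the real package never has to be present in the test environment.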

Coverage config ([tool.coverage] in pyproject)

Scopes measurement to the library and omits:

  • paper/* — one-off research replication scripts, not part of the API.
  • parsing.py — thin adapters over Azure DI / Docling / PyMuPDF, which are heavy optional dependencies needing model downloads (still exercised by test_parsing.py, just not measured).
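
As a rough illustration of the scoping described above, a coverage.py section in pyproject.toml could look like the fragment below. This is a sketch, not the config from this PR: `your_package` is a placeholder for the real import package, and the exact omit patterns are assumptions.

```toml
# Sketch only: "your_package" is a placeholder for the real import package.
[tool.coverage.run]
source = ["your_package"]
omit = [
    "*/paper/*",      # one-off research replication scripts
    "*/parsing.py",   # thin adapters over heavy optional SDKs
]

[tool.coverage.report]
show_missing = true
```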

The remaining uncovered gap is the spaCy/maverick coreference internals in metrics.py (extract_entity_pronoun_pairs, CoreferenceSolver.__init__/find_mentions), which require real models to run.

Result

pytest: 211 passed, 1 skipped; 92% on the scoped target.

🤖 Generated with Claude Code

shaunpatterson and others added 10 commits June 15, 2026 22:58
str.find() returns -1 for an absent chunk and never raises, so the
try/except ValueError was dead code: a missing chunk produced
start_idx=-1, which silently became a valid-looking offset and
corrupted every reference boundary downstream — yielding wrong
reference-completeness metrics with no error.

Check the -1 sentinel explicitly and raise. Adds regression tests
covering boundary-splitting, no-split, the missing-chunk raise, and
the empty-pairs case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
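
The sentinel check described in this commit can be sketched as follows. `locate_chunk` is a hypothetical helper written for illustration; the function names and error message in the actual fix may differ.

```python
def locate_chunk(text: str, chunk: str) -> tuple[int, int]:
    """Return (start, end) offsets of chunk in text; raise if it is absent."""
    start_idx = text.find(chunk)
    # str.find signals "not found" with -1 rather than an exception,
    # so the sentinel has to be checked explicitly.
    if start_idx == -1:
        raise ValueError(f"chunk not found in text: {chunk[:40]!r}")
    return start_idx, start_idx + len(chunk)
```

Without the explicit check, -1 would flow on as if it were a real offset, which is the silent corruption the commit message describes.
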
CoreferenceSolver._tokenize_by_word called spacy.load("en_core_web_sm")
on every invocation, and find_mentions() invokes it once per context
window. Loading a spaCy model costs ~0.5-2s, so on a long document this
dominated coreference runtime.

Move the load behind a module-level functools.lru_cache so the model is
built exactly once per process and shared across all solver instances.

Adds tests (fake spaCy, no real dependency) asserting the model loads
exactly once across repeated calls and that tokenization is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
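
A minimal sketch of the caching pattern, assuming a stand-in loader so it runs without spaCy. `_expensive_load`, `get_tokenizer`, and `LOAD_CALLS` are hypothetical names; in the real code the cached function would wrap `spacy.load`.

```python
import functools

LOAD_CALLS = []  # records each real load so the caching is observable


def _expensive_load(name: str):
    # Stand-in for spacy.load(name); assumed here so the sketch runs without spaCy.
    LOAD_CALLS.append(name)
    return lambda text: text.split()


@functools.lru_cache(maxsize=None)
def get_tokenizer(name: str = "en_core_web_sm"):
    """Load the model once per process per model name and reuse it afterwards."""
    return _expensive_load(name)


def tokenize_by_word(text: str) -> list[str]:
    # Every call goes through the cache, so only the first one pays the load cost.
    return get_tokenizer()(text)
```

Keeping the cache at module level rather than on the instance means every solver object shares the same loaded model.
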
split_documents.py and extract_mentions.py both read confidence/lang_code
outside the 'if lang_probs:' block that assigned them. When detect_langs
returned an empty list this raised NameError on the first document, and
otherwise leaked the previous document's language/confidence into the next
iteration's skip decision.

Extract the (duplicated) check into chunking_utils.is_high_confidence_non_english,
which keeps the document when language can't be determined. Adds tests
covering high/low confidence, English, empty result, and the exception path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
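
The behavior described here could be sketched roughly as below. This is not the helper from chunking_utils: the `_sketch` suffix, the injected `detect_langs` callable, the 0.9 threshold, and the `LangProb` tuple are all assumptions made so the example runs without a language-detection dependency.

```python
from collections import namedtuple

# Mirrors the (.lang, .prob) shape of detection results; defined locally for the sketch.
LangProb = namedtuple("LangProb", ["lang", "prob"])


def is_high_confidence_non_english_sketch(text, detect_langs, threshold=0.9):
    """Return True only when the top detected language is non-English with prob >= threshold."""
    try:
        lang_probs = detect_langs(text)
    except Exception:
        # Detection failed: language is unknown, so keep the document.
        return False
    if not lang_probs:
        # Empty result: nothing is read from, or leaked into, another iteration.
        return False
    top = max(lang_probs, key=lambda lp: lp.prob)
    return top.lang != "en" and top.prob >= threshold
```

Since every value is local to one call, no language or confidence can carry over from the previous document.
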
find_chunks_start_and_end and repair_gaps_between_chunks returned None on
empty input despite a list[...] return hint. Callers zip/iterate the
result, so an empty document could raise TypeError instead of degrading
gracefully. Return [] and tighten repair_gaps_between_chunks's hint to
list[str]. Adds empty-input regression tests for both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
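
To illustrate why returning [] matters to callers, here is a small sketch. `find_spans` is a hypothetical stand-in, not the body of find_chunks_start_and_end, and its sequential search is an assumption.

```python
def find_spans(text: str, chunks: list[str]) -> list[tuple[int, int]]:
    """Hypothetical helper: locate consecutive chunks, returning [] for empty input."""
    if not chunks:
        return []  # honors the list[...] hint; callers can zip/iterate safely
    spans, cursor = [], 0
    for chunk in chunks:
        start = text.find(chunk, cursor)
        if start == -1:
            raise ValueError(f"chunk not found: {chunk[:40]!r}")
        cursor = start + len(chunk)
        spans.append((start, cursor))
    return spans
```

With a None return, a caller doing `zip(chunks, find_spans(...))` would raise TypeError on an empty document; with [] the loop simply runs zero times.
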
Adds pure-logic and mockable tests across splitters, postprocessing
(incl. the *_from_df drivers), metrics (fake embedding model, mocked
sentence-transformers, real sklearn), jina_embedder (mocked httpx),
pipeline (fake parser), and the split/metrics/mentions directory drivers
(fake splitters/models, parquet fixtures).

Adds a coverage config scoping the target to the importable library and
excluding paper/ (research replication scripts) and parsing.py (thin
adapters over heavy optional SDKs: Azure DI / Docling / PyMuPDF). On that
scope coverage is 92% (was ~19%); the remaining gap is the spaCy/maverick
coref internals in metrics.py, which need real models to exercise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

split_documents, compute_metrics, postprocessing (*_from_df) and
extract_mentions all read/write .parquet via pandas, but pandas>=2.2 does
not pull a parquet engine automatically, so these functions (and their
tests) fail at runtime without one. Add pyarrow as an explicit dependency.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
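
One way to check an environment for this problem up front is sketched below. `available_parquet_engines` is a hypothetical helper, not part of this PR; it relies only on the fact that pandas accepts "pyarrow" or "fastparquet" as its parquet engine.

```python
import importlib.util


def available_parquet_engines() -> list[str]:
    """List the pandas parquet engines importable in this environment."""
    # pandas accepts engine="pyarrow" or engine="fastparquet" for parquet I/O.
    candidates = ("pyarrow", "fastparquet")
    return [name for name in candidates if importlib.util.find_spec(name) is not None]
```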