Add test suite: 92% coverage on the core library#6
Open
shaunpatterson wants to merge 10 commits into
Open
Conversation
str.find() returns -1 for an absent chunk and never raises, so the try/except ValueError was dead code: a missing chunk produced start_idx=-1, which silently became a valid-looking offset and corrupted every reference boundary downstream — yielding wrong reference-completeness metrics with no error. Check the -1 sentinel explicitly and raise. Adds regression tests covering boundary-splitting, no-split, the missing-chunk raise, and the empty-pairs case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CoreferenceSolver._tokenize_by_word called spacy.load("en_core_web_sm")
on every invocation, and find_mentions() invokes it once per context
window. Loading a spaCy model costs ~0.5-2s, so on a long document this
dominated coreference runtime.
Move the load behind a module-level functools.lru_cache so the model is
built exactly once per process and shared across all solver instances.
Adds tests (fake spaCy, no real dependency) asserting the model loads
exactly once across repeated calls and that tokenization is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
split_documents.py and extract_mentions.py both read confidence/lang_code outside the 'if lang_probs:' block that assigned them. When detect_langs returned an empty list this raised NameError on the first document, and otherwise leaked the previous document's language/confidence into the next iteration's skip decision. Extract the (duplicated) check into chunking_utils.is_high_confidence_non_english, which keeps the document when language can't be determined. Adds tests covering high/low confidence, English, empty result, and the exception path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
find_chunks_start_and_end and repair_gaps_between_chunks returned None on empty input despite a list[...] return hint. Callers zip/iterate the result, so an empty document could raise TypeError instead of degrading gracefully. Return [] and tighten repair_gaps_between_chunks's hint to list[str]. Adds empty-input regression tests for both. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts: # tests/test_metrics.py
Adds pure-logic and mockable tests across splitters, postprocessing (incl. the *_from_df drivers), metrics (fake embedding model, mocked sentence-transformers, real sklearn), jina_embedder (mocked httpx), pipeline (fake parser), and the split/metrics/mentions directory drivers (fake splitters/models, parquet fixtures). Adds a coverage config scoping the target to the importable library and excluding paper/ (research replication scripts) and parsing.py (thin adapters over heavy optional SDKs: Azure DI / Docling / PyMuPDF). On that scope coverage is 92% (was ~19%); the remaining gap is the spaCy/maverick coref internals in metrics.py, which need real models to exercise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
split_documents, compute_metrics, postprocessing (*_from_df) and extract_mentions all read/write .parquet via pandas, but pandas>=2.2 does not pull a parquet engine automatically, so these functions (and their tests) fail at runtime without one. Add pyarrow as an explicit dependency. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Brings the importable core library from ~19% to 92% test coverage, and adds a coverage config that scopes the target sensibly.
What's added
Pure-logic and mockable tests (no heavy ML deps, no network, no model downloads):
group_chunks/group_pages/combine_blocks/regex_splitter.*_from_dfdrivers with tiny parquet fixtures.sentence_transformersmocked viasys.modules, realscikit-learnfor the lexical metric, and the pure coref helpers.httpxmocked.chunk_filesend-to-end with a fake parser.Coverage config (
[tool.coverage]in pyproject)Scopes measurement to the library and omits:
paper/*— one-off research replication scripts, not part of the API.parsing.py— thin adapters over Azure DI / Docling / PyMuPDF, which are heavy optional dependencies needing model downloads (still exercised bytest_parsing.py, just not measured).The remaining uncovered gap is the spaCy/maverick coreference internals in
metrics.py(extract_entity_pronoun_pairs,CoreferenceSolver.__init__/find_mentions), which require real models to run.Result
pytest: 211 passed, 1 skipped; 92% on the scoped target.🤖 Generated with Claude Code