test(backend): add unit tests for PDF chunker table parsing #493
Open
nancysangani wants to merge 8 commits into
e3d1ca1 to 8fb276f
…ancysangani/PDF-Assistant-RAG into test/pdf-chunker-table-parsing
@param20h please review this PR. All checks have passed and the conflicts are resolved. Thanks!
🔗 Related Issue
Closes #451
📝 What does this PR do?
Extends backend/tests/test_chunker.py with 11 new unit tests that verify PDF table parsing across all three extraction paths and edge cases. All tests use in-memory fakes and monkeypatching; no real PDF files are committed and no new dependencies are introduced.
_table_to_markdown edge cases:
\|
---
pdfplumber path:
table_index increments correctly across multiple tables on one page
[0.0, 1.0]
PyMuPDF fallback path:
chunk_type=text chunks
Image + table interleaving:
chunk_index continuity:
chunk_index is 0-based and monotonically increasing across the full document with no resets or gaps
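The chunk_index continuity property listed above can be checked with a small helper. The following is a sketch under stated assumptions rather than the assertion used in this PR: `chunk_index_errors` is a hypothetical name, chunks are assumed to be dicts carrying a `chunk_index` key, and "no gaps" is read as a step of exactly 1, which makes each index equal to its position in document order.

```python
from typing import List, Mapping, Sequence


def chunk_index_errors(chunks: Sequence[Mapping]) -> List[str]:
    """Return one message per chunk whose chunk_index differs from its position.

    An empty list means the indices are exactly 0, 1, ..., n-1 in order,
    i.e. 0-based, strictly increasing, with no resets and no gaps.
    """
    errors = []
    for position, chunk in enumerate(chunks):
        actual = chunk["chunk_index"]
        if actual != position:
            errors.append(
                f"position {position}: expected chunk_index {position}, got {actual}"
            )
    return errors


# Mixed chunk types sharing one document-wide counter.
good = [
    {"chunk_type": "text", "chunk_index": 0},
    {"chunk_type": "table", "chunk_index": 1},
    {"chunk_type": "image", "chunk_index": 2},
]

# Counter restarted, for example at a page boundary.
reset = [
    {"chunk_type": "text", "chunk_index": 0},
    {"chunk_type": "table", "chunk_index": 1},
    {"chunk_type": "text", "chunk_index": 0},
]

# Index 1 skipped.
gap = [
    {"chunk_type": "text", "chunk_index": 0},
    {"chunk_type": "table", "chunk_index": 2},
]
```

Comparing each index to its position catches resets and gaps with a single rule, and the returned messages point at the first offending chunk when a test fails.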
🗂️ Type of Change
🧪 How was this tested?
pytest backend/tests/test_chunker.py -v: all 11 new tests pass alongside the 7 existing tests (18 total, 0 failures)
✅ Self-Review Checklist
dev, not main
main branch or any HuggingFace deployment config