
test(backend): add unit tests for PDF chunker table parsing #493

Open: nancysangani wants to merge 8 commits into param20h:dev from nancysangani:test/pdf-chunker-table-parsing
@nancysangani

Contributor

🔗 Related Issue

Closes #451


📝 What does this PR do?

Extends backend/tests/test_chunker.py with 11 new unit tests that verify
PDF table parsing across all three extraction paths and edge cases. All tests
use in-memory fakes and monkeypatching — no real PDF files are committed and
no new dependencies are introduced.

_table_to_markdown edge cases:

  • Empty row list returns empty string
  • All-blank cells return empty string
  • Single header row renders correctly with separator
  • Ragged rows (unequal column counts) are padded to uniform width
  • Whitespace in cells (tabs, newlines) is normalised to single spaces
  • Pipe characters in cells are escaped as \|
  • Separator row always uses ---
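
The edge cases above can be illustrated with a minimal sketch. The function below, `table_to_markdown_sketch`, is a hypothetical stand-in written for this example; it is not the project's `_table_to_markdown`, whose real signature and behaviour may differ. It assumes rows arrive as lists of optional strings and that the first row is the header.

```python
import re
from typing import List, Optional, Sequence


def table_to_markdown_sketch(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Hypothetical stand-in for _table_to_markdown (assumed behaviour)."""
    cleaned: List[List[str]] = []
    for row in rows:
        cells = []
        for cell in row:
            # Collapse tabs, newlines and runs of spaces into single spaces.
            text = re.sub(r"\s+", " ", cell or "").strip()
            # Escape pipes so they cannot be read as column separators.
            cells.append(text.replace("|", "\\|"))
        cleaned.append(cells)

    # An empty row list or a table with only blank cells yields no output.
    if not any(cell for row in cleaned for cell in row):
        return ""

    # Pad ragged rows to the width of the widest row.
    width = max(len(row) for row in cleaned)
    padded = [row + [""] * (width - len(row)) for row in cleaned]

    def render(row: List[str]) -> str:
        return "| " + " | ".join(row) + " |"

    # The first row is treated as the header, followed by a --- separator.
    lines = [render(padded[0]), render(["---"] * width)]
    lines.extend(render(row) for row in padded[1:])
    return "\n".join(lines)
```

For instance, `table_to_markdown_sketch([["a", "b"]])` gives a header line followed by a `| --- | --- |` separator, while `table_to_markdown_sketch([])` gives an empty string.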

pdfplumber path:

  • Multi-page PDF produces separate table chunks with correct page numbers
  • Table whose cells are all blank emits no chunk
  • table_index increments correctly across multiple tables on one page
  • Normalised bbox values are each within [0.0, 1.0]
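
As a rough illustration of the bbox check, the sketch below divides an `(x0, top, x1, bottom)` box by the page dimensions and clamps each value. The helper name `normalise_bbox_sketch` and the clamping step are assumptions made for this example; the chunker's actual normalisation may be implemented differently.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def normalise_bbox_sketch(bbox: BBox, page_width: float, page_height: float) -> BBox:
    """Hypothetical helper: scale (x0, top, x1, bottom) into [0.0, 1.0]."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")

    def clamp(value: float) -> float:
        # Boxes that spill slightly past the page edge are pulled back in.
        return min(1.0, max(0.0, value))

    x0, top, x1, bottom = bbox
    return (
        clamp(x0 / page_width),
        clamp(top / page_height),
        clamp(x1 / page_width),
        clamp(bottom / page_height),
    )


# A test in the style described above would then assert the range directly.
box = normalise_bbox_sketch((61.2, 79.2, 550.8, 712.8), 612.0, 792.0)
assert all(0.0 <= value <= 1.0 for value in box)
```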

PyMuPDF fallback path:

  • When unstructured and pdfplumber are both absent, PyMuPDF still produces
    chunk_type=text chunks
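
One way to simulate absent libraries without uninstalling them is to place `None` in `sys.modules`, which makes a later import of that name raise `ImportError`. The sketch below assumes the chunker picks its backend by attempting imports in order; that selection logic, the helper `first_available_backend`, and the use of a fake `fitz` module (PyMuPDF's import name) are illustrative assumptions rather than the project's code.

```python
import importlib
import sys
import types
from typing import Sequence
from unittest import mock


def first_available_backend(candidates: Sequence[str]) -> str:
    """Hypothetical selector: return the first module name that imports."""
    for name in candidates:
        try:
            importlib.import_module(name)
        except ImportError:
            continue
        return name
    raise RuntimeError("no PDF backend could be imported")


def demo_fallback() -> str:
    # A None entry in sys.modules makes importing that name fail,
    # while the fake module lets "fitz" import without PyMuPDF installed.
    patched = {
        "unstructured": None,
        "pdfplumber": None,
        "fitz": types.ModuleType("fitz"),
    }
    with mock.patch.dict(sys.modules, patched):
        return first_available_backend(["unstructured", "pdfplumber", "fitz"])
```

Inside a pytest test, the same effect is available through `monkeypatch.setitem(sys.modules, "pdfplumber", None)`, which is undone automatically when the test ends.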

Image + table interleaving:

  • Image chunks are appended after table/text chunks on the same page
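
The ordering rule can be checked with a small predicate. This sketch assumes each chunk is a dict with `page` and `chunk_type` keys and that image chunks carry `chunk_type == "image"`; those field names and values are guesses made for the example, not confirmed by the chunker's schema.

```python
from typing import Any, Dict, Iterable, Set


def images_follow_content(chunks: Iterable[Dict[str, Any]]) -> bool:
    """Hypothetical check: no text/table chunk follows an image on its page."""
    pages_with_image: Set[int] = set()
    for chunk in chunks:
        page = chunk["page"]
        if chunk["chunk_type"] == "image":
            pages_with_image.add(page)
        elif page in pages_with_image:
            # A non-image chunk appeared after an image on the same page.
            return False
    return True
```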

chunk_index continuity:

  • chunk_index is 0-based and monotonically increasing across the full
    document with no resets or gaps
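
The continuity property reduces to comparing the observed indices with `range(len(chunks))`. The sketch below assumes chunks are dicts exposing a `chunk_index` key; the helper name and the dict representation are assumptions for illustration.

```python
from typing import Any, Dict, Sequence


def chunk_indices_are_continuous(chunks: Sequence[Dict[str, Any]]) -> bool:
    """Hypothetical check: indices are exactly 0, 1, 2, ... in order."""
    indices = [chunk["chunk_index"] for chunk in chunks]
    # Equality with range() rules out resets, gaps and duplicates at once.
    return indices == list(range(len(indices)))
```

A per-page reset such as `[0, 1, 0, 1]` or a gap such as `[0, 2]` both fail this comparison, which is why a single equality is enough.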

🗂️ Type of Change

  • 🧪 Tests

🧪 How was this tested?

  • pytest backend/tests/test_chunker.py -v — all 11 new tests pass
    alongside the 7 existing tests (18 total, 0 failures)
  • No real PDF files used — all fixtures are in-memory fakes
  • No new imports outside stdlib + existing project deps

✅ Self-Review Checklist

  • My branch is based on dev, not main
  • I have not added any secrets / API keys
  • I have not modified main branch or any HuggingFace deployment config
  • My code follows the existing style (no unnecessary formatting changes)
  • I have updated relevant docs / comments if needed

nancysangani requested a review from param20h as a code owner on June 6, 2026 07:03
@nancysangani

Contributor Author

Hi @param20h, I have opened this PR to fix issue #451. Please review it when you get a chance. Thanks!

nancysangani force-pushed the test/pdf-chunker-table-parsing branch from e3d1ca1 to 8fb276f on June 11, 2026 06:46
@nancysangani

Contributor Author

@param20h please review this PR. All checks have passed and I have resolved the conflicts. Thanks!

Development

Successfully merging this pull request may close these issues.

test(backend): Add unit tests for PDF Chunker tables parsing

1 participant