test(backend): add unit tests for PDF chunker table parsing by nancysangani · Pull Request #493 · param20h/PDF-Assistant-RAG

nancysangani · 2026-06-06T07:03:22Z

🔗 Related Issue

Closes #451

📝 What does this PR do?

Extends backend/tests/test_chunker.py with 11 new unit tests that verify
PDF table parsing across all three extraction paths and edge cases. All tests
use in-memory fakes and monkeypatching — no real PDF files are committed and
no new dependencies are introduced.

_table_to_markdown edge cases:

Empty row list returns empty string
All-blank cells return empty string
Single header row renders correctly with separator
Ragged rows (unequal column counts) are padded to uniform width
Whitespace in cells (tabs, newlines) is normalised to single spaces
Pipe characters in cells are escaped as \|
Separator row always uses ---
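
For reviewers who want the expected behaviour in one place, here is a minimal sketch of a converter that satisfies the edge cases listed above. It is not the project's _table_to_markdown: the name table_to_markdown, the exact cell padding, and the choice to treat None as a blank cell are assumptions made for illustration.

```python
import re


def table_to_markdown(rows):
    """Hypothetical stand-in for the project's _table_to_markdown."""
    cleaned = []
    for row in rows or []:
        cells = []
        for cell in row or []:
            text = "" if cell is None else str(cell)  # assumption: None is blank
            text = re.sub(r"\s+", " ", text).strip()  # tabs/newlines -> one space
            text = text.replace("|", "\\|")           # escape pipes inside cells
            cells.append(text)
        cleaned.append(cells)

    # Empty input or all-blank cells: nothing worth rendering.
    if not any(cell for row in cleaned for cell in row):
        return ""

    # Pad ragged rows so every row has the same number of columns.
    width = max(len(row) for row in cleaned)
    padded = [row + [""] * (width - len(row)) for row in cleaned]

    def render(row):
        return "| " + " | ".join(row) + " |"

    lines = [render(padded[0]), render(["---"] * width)]
    lines += [render(row) for row in padded[1:]]
    return "\n".join(lines)
```

Each bullet above maps to one assertion against a function like this, which keeps the tests independent of any PDF library.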

pdfplumber path:

Multi-page PDF produces separate table chunks with correct page numbers
Table whose cells are all blank emits no chunk
table_index increments correctly across multiple tables on one page
Normalised bbox values are each within [0.0, 1.0]
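
The pdfplumber checks above can be sketched with in-memory fakes roughly as follows. The fakes copy only the pdfplumber attribute names used here (pages, page_number, width, height, find_tables, bbox, extract). chunk_tables, normalise_bbox, and the chunk dictionary keys are hypothetical and are not the project's chunker. Whether a skipped blank table consumes a table_index is also an assumption; in this sketch it does not.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTable:
    rows: list
    bbox: tuple  # (x0, top, x1, bottom) in page units

    def extract(self):
        return self.rows


@dataclass
class FakePage:
    page_number: int
    tables: list = field(default_factory=list)
    width: float = 612.0
    height: float = 792.0

    def find_tables(self):
        return self.tables


@dataclass
class FakePDF:
    pages: list


def normalise_bbox(bbox, width, height):
    """Scale a bbox to page-relative coordinates and clamp to [0.0, 1.0]."""
    x0, top, x1, bottom = bbox
    scaled = (x0 / width, top / height, x1 / width, bottom / height)
    return tuple(min(1.0, max(0.0, v)) for v in scaled)


def chunk_tables(pdf):
    """Hypothetical chunker: one chunk per non-blank table."""
    chunks = []
    for page in pdf.pages:
        table_index = 0  # restarts on every page
        for table in page.find_tables():
            rows = table.extract()
            if not any((cell or "").strip() for row in rows for cell in row):
                continue  # all-blank table emits no chunk
            chunks.append({
                "chunk_type": "table",
                "chunk_index": len(chunks),
                "page": page.page_number,
                "table_index": table_index,
                "bbox": normalise_bbox(table.bbox, page.width, page.height),
                "rows": rows,
            })
            table_index += 1
    return chunks


pdf = FakePDF(pages=[
    FakePage(1, [FakeTable([["A", "B"], ["1", "2"]], (0, 0, 306, 100)),
                 FakeTable([["C"]], (0, 200, 700, 300))]),
    FakePage(2, [FakeTable([["", None]], (0, 0, 10, 10)),
                 FakeTable([["D"]], (50, 50, 100, 100))]),
])
chunks = chunk_tables(pdf)
```

Because the fakes are plain dataclasses, no PDF file and no pdfplumber import are needed to exercise page numbers, table_index, and bbox normalisation.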

PyMuPDF fallback path:

When unstructured and pdfplumber are both absent, PyMuPDF still produces
chunk_type=text chunks
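
One way to simulate the missing libraries is sketched below. It uses the standard library's unittest.mock.patch.dict on sys.modules rather than pytest's monkeypatch fixture, so that it runs on its own; a None entry in sys.modules makes the corresponding import raise ImportError. extract_chunks, the fallback order, and the "source" key are hypothetical stand-ins for the project's code, and the fake fitz module mimics only fitz.open and page.get_text.

```python
import sys
import types
from unittest import mock


def extract_chunks(path):
    """Hypothetical extractor: unstructured, then pdfplumber, then PyMuPDF."""
    try:
        import unstructured  # noqa: F401
        return [{"chunk_type": "text", "source": "unstructured"}]
    except ImportError:
        pass
    try:
        import pdfplumber  # noqa: F401
        return [{"chunk_type": "table", "source": "pdfplumber"}]
    except ImportError:
        pass
    import fitz  # PyMuPDF's import name
    return [
        {"chunk_type": "text", "page": number, "text": page.get_text(),
         "source": "pymupdf"}
        for number, page in enumerate(fitz.open(path), start=1)
    ]


class FakeFitzPage:
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


# In-memory replacement for the fitz module; no PDF file is opened.
fake_fitz = types.ModuleType("fitz")
fake_fitz.open = lambda path: [FakeFitzPage("hello"), FakeFitzPage("world")]


def run_fallback():
    hidden = {"unstructured": None, "pdfplumber": None, "fitz": fake_fitz}
    with mock.patch.dict(sys.modules, hidden):
        return extract_chunks("in-memory.pdf")


chunks = run_fallback()
```

patch.dict restores sys.modules when the block exits, so the hidden modules do not leak into other tests.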

Image + table interleaving:

Image chunks are appended after table/text chunks on the same page
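
The ordering rule above could be checked with a small helper like the one below. This is only a sketch: images_follow_content, the "page" key, and the "image" value of chunk_type are assumptions, and the helper expects chunks to be grouped by page already.

```python
from itertools import groupby


def images_follow_content(chunks):
    """Return True if, on every page, no table/text chunk follows an image chunk."""
    for _, page_chunks in groupby(chunks, key=lambda c: c["page"]):
        seen_image = False
        for chunk in page_chunks:
            if chunk["chunk_type"] == "image":
                seen_image = True
            elif seen_image:
                return False  # content chunk appeared after an image
    return True


ordered = [
    {"page": 1, "chunk_type": "text"},
    {"page": 1, "chunk_type": "table"},
    {"page": 1, "chunk_type": "image"},
    {"page": 2, "chunk_type": "text"},
]
misordered = [
    {"page": 1, "chunk_type": "image"},
    {"page": 1, "chunk_type": "text"},
]
```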

chunk_index continuity:

chunk_index is 0-based and monotonically increasing across the full
document with no resets or gaps
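
A continuity check of this kind can be written as a position-by-position comparison. The helper name chunk_index_problems is hypothetical; the sketch assumes each chunk is a dictionary with a chunk_index key.

```python
def chunk_index_problems(chunks):
    """List every position whose chunk_index differs from its 0-based position."""
    problems = []
    for expected, chunk in enumerate(chunks):
        actual = chunk["chunk_index"]
        if actual != expected:
            problems.append(f"position {expected}: chunk_index={actual}")
    return problems


contiguous = [{"chunk_index": i} for i in (0, 1, 2)]
with_gap = [{"chunk_index": i} for i in (0, 1, 3)]
with_reset = [{"chunk_index": i} for i in (0, 1, 0, 1)]
```

Returning a list of problems rather than a single boolean makes a failing test report which position broke the sequence, whether through a gap or a per-page reset.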

🗂️ Type of Change

🧪 Tests

🧪 How was this tested?

pytest backend/tests/test_chunker.py -v — all 11 new tests pass
alongside the 7 existing tests (18 total, 0 failures)
No real PDF files used — all fixtures are in-memory fakes
No new imports outside stdlib + existing project deps

✅ Self-Review Checklist

My branch is based on dev, not main
I have not added any secrets / API keys
I have not modified main branch or any HuggingFace deployment config
My code follows the existing style (no unnecessary formatting changes)
I have updated relevant docs / comments if needed

nancysangani · 2026-06-06T07:07:12Z

Hi @param20h, I have opened this PR to fix issue #451. Please review it when you get a chance. Thanks!

nancysangani · 2026-06-11T07:34:28Z

@param20h please review this PR. All checks have passed and I have resolved the merge conflicts. Thanks!

test(backend): add unit tests for PDF chunker table parsing

8fb276f

nancysangani requested a review from param20h as a code owner June 6, 2026 07:03

nancysangani force-pushed the test/pdf-chunker-table-parsing branch from e3d1ca1 to 8fb276f Compare June 11, 2026 06:46

nancysangani added 7 commits June 11, 2026 12:30

test(backend): resolve chunker test merge conflicts

7cbba2c

Merge branch 'dev' into test/pdf-chunker-table-parsing

fb2285e

test(backend): resolve chunker test merge conflicts

1489ecf

Merge branch 'test/pdf-chunker-table-parsing' of https://github.com/n…

bdff040

…ancysangani/PDF-Assistant-RAG into test/pdf-chunker-table-parsing

test(backend): resolve chunker test merge conflicts

7dcf617

test(backend): resolve chunker test merge conflicts

10b358f

test(backend): resolve chunker test merge conflicts

f4443d7

nancysangani wants to merge 8 commits into param20h:dev from nancysangani:test/pdf-chunker-table-parsing
