
Phase 4: Text Extraction #5

@tlkahn

Goal

Extract text from PDF pages: full text, positioned characters, styled segments, and search.

Tasks

  • Add text FFI to ffi.rs: FPDFText_LoadPage, FPDFText_ClosePage, FPDFText_CountChars, FPDFText_GetUnicode, FPDFText_GetCharBox, FPDFText_GetText, FPDFText_FindStart, FPDFText_FindNext, FPDFText_GetSchResultIndex, FPDFText_GetSchCount, FPDFText_FindClose
  • Add safe wrappers
  • Implement lazy FPDF_TEXTPAGE creation (cached in PageData)
  • Implement Document::page_text() — full text string
  • Implement Document::page_chars() -> Vec<CharInfo> with positions
  • Implement Document::page_text_segments() — runs with shared style
  • Implement Document::search_page() — find text with options (case-sensitive, whole word)
  • Tests: extract text from test PDFs, verify search results, verify character positions
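As a rough illustration of the page_text_segments() task, here is one way consecutive characters could be grouped into runs that share a style. This is a minimal sketch, not the planned implementation: the fields of CharInfo, the TextSegment type, and the group_segments helper are all hypothetical, and "style" is assumed to mean font name plus font size.

```rust
/// Hypothetical per-character record; the issue names `CharInfo`
/// but does not specify its fields, so these are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct CharInfo {
    pub ch: char,
    pub font_size: f32,
    pub font_name: String,
}

/// Hypothetical run of consecutive characters with the same style.
#[derive(Debug, Clone, PartialEq)]
pub struct TextSegment {
    pub text: String,
    pub font_size: f32,
    pub font_name: String,
}

/// Groups adjacent characters into segments; a new segment starts
/// whenever the font name or font size differs from the previous character.
pub fn group_segments(chars: &[CharInfo]) -> Vec<TextSegment> {
    let mut segments: Vec<TextSegment> = Vec::new();
    for c in chars {
        // Exact f32 comparison is intentional here: sizes reported for one
        // run are assumed to be bit-identical.
        let same_style = segments.last().map_or(false, |s| {
            s.font_size == c.font_size && s.font_name == c.font_name
        });
        if same_style {
            // Same style as the current run: extend it.
            if let Some(seg) = segments.last_mut() {
                seg.text.push(c.ch);
            }
        } else {
            // First character or a style change: start a new run.
            segments.push(TextSegment {
                text: c.ch.to_string(),
                font_size: c.font_size,
                font_name: c.font_name.clone(),
            });
        }
    }
    segments
}
```

Keeping the grouping as a pure function over a slice of CharInfo means it can be unit-tested without loading PDFium; only the code that fills in CharInfo needs the FFI layer.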

Exit Criterion

Text extraction matches pdfium-render output on the test PDFs, and Document::search_page() returns the expected matches for each option (case-sensitive, whole word).
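To make the search half of the criterion concrete, the sketch below shows how the two search options might be turned into the arguments FPDFText_FindStart takes: a flags bitmask and a NUL-terminated UTF-16 query. The SearchOptions type and both helper names are hypothetical. The two flag values are the ones defined in PDFium's fpdf_text.h. The C parameter is an unsigned long, so a real wrapper would still need to cast the u32 to the platform's c_ulong.

```rust
/// Hypothetical options type for `Document::search_page()`.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct SearchOptions {
    pub case_sensitive: bool,
    pub whole_word: bool,
}

// Flag values as defined in PDFium's fpdf_text.h.
pub const FPDF_MATCHCASE: u32 = 0x0000_0001;
pub const FPDF_MATCHWHOLEWORD: u32 = 0x0000_0002;

impl SearchOptions {
    /// Builds the bitmask passed as the `flags` argument of FPDFText_FindStart.
    pub fn to_flags(self) -> u32 {
        let mut flags = 0;
        if self.case_sensitive {
            flags |= FPDF_MATCHCASE;
        }
        if self.whole_word {
            flags |= FPDF_MATCHWHOLEWORD;
        }
        flags
    }
}

/// Encodes a query as NUL-terminated UTF-16 code units, the form
/// expected for the FPDF_WIDESTRING `findwhat` argument.
pub fn encode_query(query: &str) -> Vec<u16> {
    query.encode_utf16().chain(std::iter::once(0)).collect()
}
```

A test for the exit criterion could then run the same query with each combination of options and compare the resulting match indices and lengths against pdfium-render.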

Dependencies
