Convert PDFs into chunks and embeddings ready for retrieval-augmented generation.
PDFStract is the first step in an AI RAG pipeline. It is available as a CLI, a web UI, and an API, and it gets your PDFs ready for AI in four steps:
- Extract structured text, tables, and metadata from PDFs using a range of libraries (PyMuPDF4LLM, MarkItDown, Marker, Docling, PaddleOCR, DeepSeek-OCR, Tesseract, MinerU, Unstructured, and more)
- Chunk the extracted text using a range of methods (Token, Sentence, Recursive, Table, Semantic, Code, Late, Neural, Slumber, and more)
- Embed the chunks using a provider such as Sentence Transformers or OpenAI
- Use the embedded chunks in your RAG pipeline
- 10+ Conversion Libraries: PyMuPDF4LLM, MarkItDown, Marker, Docling, PaddleOCR, DeepSeek-OCR, Tesseract, MinerU, Unstructured, and more
- 10+ Chunking Methods: Token, Sentence, Recursive, Semantic, Code, Neural, Fast (SIMD), and more via Chonkie
- Modern React UI: Beautiful, responsive design with Tailwind CSS
- Command-Line Interface: Full CLI with batch processing, chunking, multi-library comparison, and automation
- Multiple Output Formats: Markdown, JSON, and Plain Text
- Performance Benchmarking: Real-time timer shows conversion speed for each library
- Live Preview: View converted content with syntax highlighting
- Library Status Dashboard: See which libraries are available/unavailable with error messages
- On-Demand Model Downloads: Download ML models only when needed
- Easy Download: Download results in your preferred format
- Docker Support: One-command deployment
- REST API: Programmatic access to conversion and chunking features
- Batch Processing: Parallel conversion of 100+ PDFs with detailed reporting
- Dark Mode Ready: Works seamlessly in light and dark themes
| Library | Version | Type | Status | Notes |
|---|---|---|---|---|
| pymupdf4llm | >=0.0.26 | Text Extraction | Fast | Best for simple PDFs |
| markitdown | >=0.1.2 | Markdown | Balanced | Microsoft's conversion tool |
| marker | >=1.8.1 | Advanced ML | High Quality | Excellent results, slower |
| docling | >=2.41.0 | Document Intelligence | Advanced | IBM's document platform |
| paddleocr | >=3.3.2 | OCR | Accurate | Great for scanned PDFs |
| unstructured | >=0.15.0 | Document Parsing | Smart | Intelligent element extraction |
| deepseekocr | Latest | GPU OCR | Fast (GPU only) | Requires CUDA GPU |
| pytesseract | >=0.3.10 | OCR | Classic | Tesseract-based (requires system binary) |
PDFStract includes 10+ chunking methods powered by Chonkie:
| Chunker | Description | Best For |
|---|---|---|
| token | Fixed token-based chunking | Simple, predictable chunks |
| sentence | Sentence boundary splitting | Natural text segments |
| recursive | Hierarchical delimiter-based | Structured documents |
| table | Table-aware chunking | Documents with tables |
| semantic | Embedding-based similarity | Topic-coherent chunks |
| code | AST-aware code splitting | Source code files |
| fast | SIMD-accelerated (100+ GB/s) | High-throughput pipelines |
| late | Late interaction chunking | ColBERT-style retrieval |
| neural | ML boundary detection | Complex documents |
| slumber | LLM-powered agentic chunking | Highest quality (requires API key) |
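To make the first two rows of the table concrete, here is a minimal pure-Python sketch of fixed-size token chunking and sentence chunking. It is an illustration only, not Chonkie's implementation: whitespace-separated words stand in for real tokens, and the function names and parameters (`token_chunks`, `sentence_chunks`, `overlap`, `max_words`) are assumptions made for this example.

```python
import re


def token_chunks(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Fixed-size chunking: every chunk holds at most `chunk_size` tokens.

    Tokens are whitespace-separated words, a stand-in for a real tokenizer.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


def sentence_chunks(text: str, max_words: int = 12) -> list[str]:
    """Sentence chunking: pack whole sentences until the word budget is hit."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))  # flush before exceeding the budget
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The trade-off is the one the table describes: token chunks have a predictable size but may cut a sentence in half, while sentence chunks keep natural boundaries at the cost of uneven sizes.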
- Python: 3.11+
- UV: Fast Python package manager
- Node.js: 20+ (for frontend development)
- Docker (optional): For containerized deployment
- Clone the repository:

    git clone https://github.com/aksarav/pdfstract.git
    cd pdfstract

- Install Python dependencies:

    uv sync

- Install frontend dependencies:

    cd frontend
    npm install
    cd ..

Terminal 1: Start the FastAPI Backend

    uv run uvicorn main:app --host 0.0.0.0 --port 8000 --reload

Terminal 2: Start the React Frontend (Development)

    cd frontend
    npm run dev

Access the Application:
- Frontend: http://localhost:5173 (with hot-reload)
- Backend API: http://localhost:8000
Note: The frontend development server proxies API calls to the backend on port 8000 (configured in frontend/vite.config.js).
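When scripting against the development setup, it can help to wait until the backend is reachable before sending requests. The sketch below polls the `/health` endpoint on port 8000. It assumes only that a healthy backend answers with HTTP 200; the response body is not inspected, and the helper names (`is_healthy`, `wait_for_backend`) are made up for this example.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url: str = "http://localhost:8000/health") -> bool:
    """Return True when the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_for_backend(probe=is_healthy, attempts: int = 30, delay: float = 1.0) -> bool:
    """Call `probe` until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the server time to start
    return False
```

Calling `wait_for_backend()` with the defaults retries for roughly half a minute and returns `False` if the backend never comes up.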
To build the React app for production:

    cd frontend
    npm run build

This creates an optimized build in frontend/dist/, which the Docker build process copies to /static.

To deploy with Docker:

    docker-compose up --build

The application will be available at http://localhost:8000.
PDFStract includes a powerful CLI for batch processing and automation.
    # List available libraries
    pdfstract libs

    # List available chunkers
    pdfstract chunkers

    # Convert a single PDF
    pdfstract convert document.pdf --library pymupdf4llm --output result.md

    # Convert and chunk in one command
    pdfstract convert-chunk document.pdf --library pymupdf4llm --chunker semantic --output chunks.json

    # Chunk an existing text file
    pdfstract chunk document.md --chunker token --chunk-size 512 --output chunks.json

    # Compare multiple libraries on one PDF
    pdfstract compare sample.pdf -l pymupdf4llm -l marker -l docling --output ./comparison

    # Batch convert 100+ PDFs in parallel
    pdfstract batch ./documents --library pymupdf4llm --output ./converted --parallel 4

    # Download models for a specific library
    pdfstract download marker

| Command | Description |
|---|---|
| pdfstract libs | List all available extraction libraries and their status |
| pdfstract chunkers | List all available chunkers and their parameters |
| pdfstract convert | Convert a single PDF file |
| pdfstract chunk | Chunk a text/markdown file |
| pdfstract convert-chunk | Convert PDF and chunk in one step |
| pdfstract compare | Compare multiple libraries on one PDF |
| pdfstract batch | Batch convert multiple PDFs in parallel |
| pdfstract batch-compare | Compare libraries across multiple PDFs |
| pdfstract download | Download models for a specific library |
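For automation, the CLI can be driven from a Python script. The sketch below assembles the `pdfstract batch` invocation from the examples, using only the flags shown there (`--library`, `--output`, `--parallel`), and runs it with `subprocess`. The helper names are hypothetical, and treating a non-zero exit code as failure is an assumption based on the usual CLI convention, not on documented PDFStract behavior.

```python
import os
import shlex
import subprocess


def build_batch_command(input_dir, output_dir, library="pymupdf4llm", parallel=4):
    """Assemble the `pdfstract batch` command line as a list of arguments."""
    if not 1 <= parallel <= 16:
        # The feature list mentions 1-16 workers for batch processing.
        raise ValueError("parallel must be between 1 and 16")
    return [
        "pdfstract", "batch", os.fspath(input_dir),
        "--library", library,
        "--output", os.fspath(output_dir),
        "--parallel", str(parallel),
    ]


def run_batch(input_dir, output_dir, **options):
    """Run the batch conversion and return the process exit code."""
    command = build_batch_command(input_dir, output_dir, **options)
    print("Running:", shlex.join(command))
    # check=False: let the caller decide how to react to a failure.
    return subprocess.run(command, check=False).returncode
```

For example, `run_batch("./documents", "./converted", parallel=8)` runs the same conversion as the batch example with eight workers.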
Full Features:
- Single file conversion with any library
- Text chunking with 10+ chunking methods
- Convert + Chunk in a single command
- Multi-library comparison
- Parallel batch processing (1-16 workers)
- On-demand model downloads
- JSON reporting with detailed statistics
- Progress indicators and rich formatting
Batch Processing:
- Convert 1000+ PDFs with parallel workers
- Detailed JSON reports (success rate, per-file status)
- Automatic error handling and logging
- Perfect for production jobs and legacy migrations
Full CLI Documentation: see the complete guide with real-world examples.
| Endpoint | Method | Description | Parameters |
|---|---|---|---|
| / | GET | Web interface | - |
| /health | GET | Health check | - |
| /libraries | GET | List available libraries with status | - |
| /libraries/{name}/status | GET | Get detailed status for a library | - |
| /libraries/{name}/download | POST | Download models for a library | - |
| /convert | POST | Convert PDF | file, library, output_format |
| /chunkers | GET | List available chunkers | - |
| /chunk | POST | Chunk raw text | text, chunker, params |
| /convert-and-chunk | POST | Convert PDF and chunk | file, library, chunker, output_format, chunker_params |
| /compare | POST | Compare multiple libraries | file, libraries[], output_format |
| /compare/{task_id} | GET | Get comparison task status | - |
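The endpoints can also be called from Python. The sketch below posts to `/chunk` using only the standard library, sending `text`, `chunker`, and `params` as multipart form fields in the same way `curl -F` does. The helper names are hypothetical, and because the response schema is not documented here, the body is returned as a raw string rather than parsed.

```python
import json
import urllib.request
import uuid


def encode_multipart(fields: dict[str, str]) -> tuple[bytes, str]:
    """Encode plain text fields as multipart/form-data, like `curl -F`."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [
            f"--{boundary}",
            f'Content-Disposition: form-data; name="{name}"',
            "",  # blank line separates part headers from the value
            value,
        ]
    lines += [f"--{boundary}--", ""]
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def chunk_text(text, chunker="token", params=None, base_url="http://localhost:8000"):
    """POST to /chunk and return the response body as a string."""
    fields = {"text": text, "chunker": chunker}
    if params is not None:
        fields["params"] = json.dumps(params)  # sent as a JSON string field
    body, content_type = encode_multipart(fields)
    request = urllib.request.Request(
        f"{base_url}/chunk",
        data=body,
        method="POST",
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

With the backend running, `chunk_text("Your long document text here...", params={"chunk_size": 256})` sends a request for token chunking with a chunk size of 256.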
List available libraries:

    curl http://localhost:8000/libraries

List available chunkers:

    curl http://localhost:8000/chunkers

Convert a PDF:

    curl -X POST \
      -F "file=@sample.pdf" \
      -F "library=pymupdf4llm" \
      -F "output_format=markdown" \
      http://localhost:8000/convert

Convert and Chunk in one request:

    curl -X POST \
      -F "file=@sample.pdf" \
      -F "library=pymupdf4llm" \
      -F "chunker=semantic" \
      -F "output_format=markdown" \
      -F "chunker_params={\"chunk_size\": 512}" \
      http://localhost:8000/convert-and-chunk

Chunk raw text:

    curl -X POST \
      -F "text=Your long document text here..." \
      -F "chunker=token" \
      -F "params={\"chunk_size\": 256}" \
      http://localhost:8000/chunk

Download models for a library:

    curl -X POST http://localhost:8000/libraries/marker/download

Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
If you encounter issues or have questions, please create an issue.
- FastAPI: Modern Python web framework
- React: UI library
- Tailwind CSS: Utility-first CSS framework
- Lucide Icons: Beautiful icon library
- Chonkie: Text chunking library for RAG pipelines
- All the amazing PDF extraction libraries (PyMuPDF, Marker, Docling, etc.)
Made with ❤️ for AI RAG pipelines

