A small FastAPI service that converts uploaded documents (PDF, PPTX, DOCX, and other office formats) into VLM content — a list of text/image segments shaped to be dropped straight into a vision-language model prompt.
The output is a JSON array of segments, each one of:

    {"type": "text", "text": "..."}
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}

It bundles three services in `compose.yaml`:

- `doc-processor`: the FastAPI app in `api/` (exposed on port `8080`)
- `docling`: docling-serve for converting office documents to Markdown with embedded images (port `5001`)
- `tika`: Apache Tika for fallback text extraction on anything else (port `9998`)

Bring everything up with Docker Compose:

    docker compose up -d --build --force-recreate

Once the stack is healthy, send a file to the API:

    curl -X POST http://localhost:8080/file2vlm-content \
      -F "file=@api/test_data/short.pdf"

The response is a JSON list of `text` / `image_url` segments.
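
On the client side, the returned list can be walked segment by segment. The following is a minimal sketch, assuming the two segment shapes shown earlier; `split_segments` and the `sample` payload are hypothetical names made up for illustration, not part of the service.

```python
import base64
import json


def split_segments(segments):
    """Separate a VLM-content list into text strings and decoded image bytes."""
    texts, images = [], []
    for seg in segments:
        if seg["type"] == "text":
            texts.append(seg["text"])
        elif seg["type"] == "image_url":
            url = seg["image_url"]["url"]
            # Data URLs look like "data:image/png;base64,<payload>".
            header, _, payload = url.partition(",")
            if not header.startswith("data:") or not header.endswith(";base64"):
                raise ValueError(f"unsupported image URL: {header!r}")
            images.append(base64.b64decode(payload))
        else:
            raise ValueError(f"unknown segment type: {seg['type']!r}")
    return texts, images


# Hypothetical stand-in for the parsed body of a /file2vlm-content response.
sample = json.loads(
    '[{"type": "text", "text": "Page 1"},'
    ' {"type": "image_url", "image_url": {"url": "data:image/png;base64,aGVsbG8="}}]'
)
texts, images = split_segments(sample)
```

In practice the list would come from the HTTP response body rather than a literal string; the decoding step is only needed when the raw image bytes are wanted, since the segments can be passed to a model as they are.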
Behaviour is tuned through environment variables on the `doc-processor`
service (defaults shown):

| Variable | Default | Description |
|---|---|---|
| `DOCLING_URL` | `http://docling:5001` | Docling-serve endpoint |
| `TIKA_URL` | `http://tika:9998` | Tika endpoint |
| `PDF_DPI` | `150` | DPI for direct PDF page rendering |
| `PP_DPI` | `120` | DPI for PPTX slide rendering |
| `MAX_DIRECT_INPUT_PDF_PAGES` | `16` | PDFs at or below this go straight to images |
| `MAX_DIRECT_INPUT_PP_SLIDES` | `32` | PPTXs at or below this go straight to images |
| `CACHE_DIR` | `/var/cache/docs` | Where cached results are written |

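
One way such variables could be read is sketched below. The variable names and defaults come from the table; the `Settings` dataclass and `load_settings` helper are assumptions for illustration and may not match how `api/` actually loads its configuration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    docling_url: str
    tika_url: str
    pdf_dpi: int
    pp_dpi: int
    max_direct_input_pdf_pages: int
    max_direct_input_pp_slides: int
    cache_dir: str


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default), using the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        docling_url=env.get("DOCLING_URL", "http://docling:5001"),
        tika_url=env.get("TIKA_URL", "http://tika:9998"),
        pdf_dpi=int(env.get("PDF_DPI", "150")),
        pp_dpi=int(env.get("PP_DPI", "120")),
        max_direct_input_pdf_pages=int(env.get("MAX_DIRECT_INPUT_PDF_PAGES", "16")),
        max_direct_input_pp_slides=int(env.get("MAX_DIRECT_INPUT_PP_SLIDES", "32")),
        cache_dir=env.get("CACHE_DIR", "/var/cache/docs"),
    )


defaults = load_settings({})                    # nothing set: documented defaults
overridden = load_settings({"PDF_DPI": "200"})  # one value overridden
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise in tests.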
`DocProcessor.process` (`api/pipeline.py`) dispatches on file extension and
size, preferring the path that preserves the most visual information and
falling back to text extraction when nothing better is available.

1. Small PDFs (`.pdf`, ≤ `MAX_DIRECT_INPUT_PDF_PAGES`): each page is rendered to a PNG with PyMuPDF at `PDF_DPI` and emitted as a single `image_url` segment. The VLM sees the page exactly as a human would.
2. Small PPTXs (`.pptx`, ≤ `MAX_DIRECT_INPUT_PP_SLIDES`): converted to PDF via headless LibreOffice (`soffice`) using a temp user profile, then rendered the same way as small PDFs at `PP_DPI`.
3. Larger PDFs / PPTXs and all DOCX files: sent to docling-serve, which returns Markdown with embedded base64 images. The Markdown is then split on image tags into an interleaved sequence of `text` and `image_url` segments.
4. Anything else: handed to Apache Tika, which returns plain text wrapped in a single `text` segment.
5. Failure fallback: any exception in steps 1–3 is logged and falls through to Tika so the request still returns something usable.

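
The routing rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `api/pipeline.py`: `choose_route`, `process_with_fallback`, the route names, and the `handlers` mapping are all hypothetical.

```python
import logging
from pathlib import Path

logger = logging.getLogger("doc-processor-sketch")


def choose_route(filename, unit_count, max_pdf_pages=16, max_pp_slides=32):
    """Pick a path from the extension and the page/slide count (hypothetical helper)."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        return "render_pdf" if unit_count <= max_pdf_pages else "docling"
    if ext == ".pptx":
        return "render_pptx" if unit_count <= max_pp_slides else "docling"
    if ext == ".docx":
        return "docling"
    return "tika"


def process_with_fallback(route, handlers):
    """Run the chosen handler; on any error in a non-Tika path, log it and use Tika."""
    try:
        return handlers[route]()
    except Exception:
        if route == "tika":
            raise  # nothing left to fall back to
        logger.exception("route %r failed, falling back to tika", route)
        return handlers["tika"]()


def failing_docling():
    # Simulates docling-serve being unavailable.
    raise RuntimeError("docling unavailable")


handlers = {
    "docling": failing_docling,
    "tika": lambda: [{"type": "text", "text": "plain text"}],
}
fallback_result = process_with_fallback(choose_route("report.docx", 1), handlers)
```

Keeping the route decision separate from execution makes the thresholds easy to test without rendering anything.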
The `text`/`image_url` shape is identical across paths, so callers don't have
to care which branch produced the result.
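
For the docling path, splitting Markdown on image tags into that shape might look like the sketch below. It assumes images arrive as inline `![alt](data:image/...;base64,...)` tags; the regular expression and the `markdown_to_segments` name are illustrative and may differ from what the pipeline really uses.

```python
import re

# Markdown image whose target is an inline base64 data URL.
IMAGE_TAG = re.compile(r"!\[[^\]]*\]\((data:image/[^;]+;base64,[^)\s]+)\)")


def markdown_to_segments(markdown):
    """Split Markdown into interleaved text and image_url segments."""
    segments = []
    pos = 0
    for match in IMAGE_TAG.finditer(markdown):
        # Text between the previous image (or the start) and this image.
        text = markdown[pos:match.start()].strip()
        if text:
            segments.append({"type": "text", "text": text})
        segments.append({"type": "image_url", "image_url": {"url": match.group(1)}})
        pos = match.end()
    # Whatever follows the last image.
    tail = markdown[pos:].strip()
    if tail:
        segments.append({"type": "text", "text": tail})
    return segments


example = markdown_to_segments(
    "Intro\n\n![fig](data:image/png;base64,AAAA)\n\nOutro"
)
```

Empty text runs are dropped, so two adjacent images produce two consecutive `image_url` segments with nothing in between.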
Caching lives in `api/cache.py` and is wired into the request handler in
`api/main.py`.

- Key: SHA-256 hex digest of the raw uploaded bytes. Identical files always produce the same key, regardless of filename.
- Storage: one JSON file per document at `<CACHE_DIR>/<sha256>.json` containing the VLM-content list returned by the pipeline.
- Writes are atomic: results are written to `<sha256>.json.tmp` first and then `os.replace`d into place, so a concurrent reader never sees a partially written file.
- Lookup flow in `/file2vlm-content`:
  - Hash the uploaded bytes.
  - If `<sha256>.json` exists, load and return it; the pipeline isn't run.
  - Otherwise process the file, persist the result, then return it.

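
The key, storage, and lookup rules above fit in a few functions. The sketch below follows the described behaviour using only the standard library; the function names are hypothetical and the real `api/cache.py` may be organised differently.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def cache_key(data: bytes) -> str:
    """SHA-256 hex digest of the raw bytes; the filename plays no part."""
    return hashlib.sha256(data).hexdigest()


def load_cached(cache_dir, key):
    """Return the cached VLM-content list, or None on a miss."""
    try:
        with open(Path(cache_dir) / f"{key}.json", encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return None


def store_cached(cache_dir, key, content):
    """Write to <key>.json.tmp, then atomically rename it into place."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    tmp_path = cache_dir / f"{key}.json.tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(content, fh)
    os.replace(tmp_path, cache_dir / f"{key}.json")


def get_or_process(cache_dir, data, process):
    """Lookup flow: hash, return a hit, otherwise process and persist."""
    key = cache_key(data)
    cached = load_cached(cache_dir, key)
    if cached is not None:
        return cached
    result = process(data)
    store_cached(cache_dir, key, result)
    return result


# Demo against a throwaway directory instead of the real CACHE_DIR.
calls = []


def fake_pipeline(data):
    calls.append(data)
    return [{"type": "text", "text": data.decode()}]


with tempfile.TemporaryDirectory() as tmp:
    first = get_or_process(tmp, b"hello", fake_pipeline)
    second = get_or_process(tmp, b"hello", fake_pipeline)  # served from the cache
    cached_files = sorted(os.listdir(tmp))
```

The temporary file lives in the same directory as the final file, which keeps the `os.replace` rename on one filesystem.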
The hashing and pipeline calls are dispatched via
`fastapi.concurrency.run_in_threadpool` so the event loop stays responsive
during CPU-bound rendering or blocking HTTP calls to docling/tika.
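
The effect of moving blocking work off the event loop can be shown with the standard library alone. The sketch below uses `asyncio.to_thread` as a stand-in for `run_in_threadpool` (a deliberate substitution so the example needs no FastAPI install); `slow_hash`, `handle_upload`, and the heartbeat task are hypothetical.

```python
import asyncio
import hashlib
import time


def slow_hash(data: bytes) -> str:
    """Blocking work: the sleep stands in for rendering or a slow HTTP call."""
    time.sleep(0.05)
    return hashlib.sha256(data).hexdigest()


async def handle_upload(data: bytes) -> str:
    # The blocking call runs in a worker thread, so the loop keeps scheduling.
    return await asyncio.to_thread(slow_hash, data)


async def main():
    ticks = 0

    async def heartbeat():
        # Counts how often the event loop gets to run while the hash is computed.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    beat = asyncio.create_task(heartbeat())
    digest = await handle_upload(b"hello")
    beat.cancel()
    return digest, ticks


digest, ticks = asyncio.run(main())
```

Had `slow_hash` been called directly inside the coroutine, the heartbeat would not advance until the call returned.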
To clear the cache, delete the contents of `CACHE_DIR` (e.g. by removing the
volume or `rm`-ing the JSON files); there is no eviction policy or TTL.
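
Since nothing evicts entries automatically, a small script can do the cleanup. This is a sketch only: `clear_cache` is a hypothetical helper, and the demo runs against a temporary directory rather than the real `CACHE_DIR`.

```python
import tempfile
from pathlib import Path


def clear_cache(cache_dir):
    """Delete cached results and leftover temp files; return how many were removed."""
    removed = 0
    for pattern in ("*.json", "*.json.tmp"):
        for path in Path(cache_dir).glob(pattern):
            path.unlink()
            removed += 1
    return removed


# Demo: two fake cache entries plus an unrelated file that must survive.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("aaa.json", "bbb.json", "notes.txt"):
        (Path(tmp) / name).write_text("[]", encoding="utf-8")
    removed_count = clear_cache(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```

Only files matching the cache naming scheme are touched, so anything else stored alongside them is left alone.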