BigSweetPotatoStudio/ComfyUI-DeepseekOCR
ComfyUI-DeepseekOCR

Local Modified Version Notes

This local version has been modified for the current ComfyUI environment:

  • Video / batch image support: DeepSeek OCR: Run now processes ComfyUI IMAGE batches frame by frame. This works with Video Helper Suite video frames, not only single images.
  • New markdown and mask outputs: The Run node now outputs text, markdown, visualization, and mask.
  • Raw text output: The text output keeps the original DeepSeek-OCR result unchanged, including location markers such as <|ref|>...<|/ref|><|det|>...<|/det|>.
  • Clean markdown output: The markdown output removes location markers like <|ref|>...<|/ref|><|det|>...<|/det|> and keeps readable OCR / Markdown content only.
  • Visualization and mask use raw detection tags: The original OCR result is still used internally for drawing boxes and generating masks.
  • Blank-image hallucination filter: Very short or repeated hallucinated OCR outputs (such as 中国 or /) on images without text are filtered to empty text.
  • Recommended transformers version: Keep transformers==4.46.3 and tokenizers==0.20.3 for DeepSeek-OCR compatibility.
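The markdown cleanup described above can be sketched with a regex (a minimal sketch; the pattern and the `strip_location_markers` name are illustrative, not the plugin's actual code, and one plausible reading keeps the `<|ref|>` text while dropping the `<|det|>` box):

```python
import re

# Location markers look like: <|ref|>label<|/ref|><|det|>[[x1,y1,x2,y2]]<|/det|>
# Keep the recognized <|ref|> text; drop the <|det|> coordinate block.
_REF_DET = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|>(?:<\|det\|>.*?<\|/det\|>)?",
    re.DOTALL,
)

def strip_location_markers(text: str) -> str:
    """Remove <|ref|>/<|det|> markers, keeping readable OCR content only."""
    return _REF_DET.sub(r"\1", text)
```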

Example cleaned output:

2026年上学期期中练习题 (2026 first-semester midterm practice questions)

For multi-frame input, both text and markdown outputs are grouped as:

## Frame 1

...

---

## Frame 2

...
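The frame grouping shown above could be produced like this (a sketch; `group_frames` is an illustrative name, not the plugin's actual function):

```python
def group_frames(frame_texts: list[str]) -> str:
    """Join per-frame OCR results into '## Frame N' sections separated by '---'."""
    sections = [f"## Frame {i}\n\n{text}" for i, text in enumerate(frame_texts, start=1)]
    return "\n\n---\n\n".join(sections)
```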


A custom node that wraps DeepSeek-OCR as a ComfyUI plugin, providing powerful OCR recognition and document parsing capabilities.

Install Dependencies

cd ComfyUI/custom_nodes/
git clone https://github.com/Geo1230/ComfyUI-DeepseekOCR.git

Portable / venv: run

path/to/ComfyUI/python_embeded/python.exe -s -m pip install -r requirements.txt

With system Python, run

pip install -r requirements.txt

Start ComfyUI

Recommended: transformers 4.46.3. If you encounter compatibility issues with transformers 4.55+, downgrade:

pip install transformers==4.46.3 tokenizers==0.20.3

Download Model

Create directories and navigate:

# 1. Navigate to ComfyUI's models directory
cd ComfyUI\models

# 2. Create deepseek-ocr directory (if it doesn't exist)
mkdir deepseek-ocr
cd deepseek-ocr

# 3. Create model directory
mkdir deepseek-ai_DeepSeek-OCR
cd deepseek-ai_DeepSeek-OCR

Download model to current directory:

huggingface-cli download deepseek-ai/DeepSeek-OCR --local-dir . --repo-type model

Note: The model will be downloaded to the ComfyUI\models\deepseek-ocr\deepseek-ai_DeepSeek-OCR\ directory

Or use automatic download (not recommended; less stable):

The model downloads automatically on the first run of the Load node. Download progress is shown in the console.

To disable automatic download, set environment variable:

# Windows PowerShell
$env:DPSK_AUTODOWNLOAD = "0"
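On Linux or macOS, the equivalent (assuming DPSK_AUTODOWNLOAD is read the same way on all platforms) is:

```shell
# Linux / macOS (bash / zsh)
export DPSK_AUTODOWNLOAD=0
```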

Usage

Node 1: DeepSeek OCR: Load Model

Loads and caches the model, outputs a model handle for use by the Run node.

Parameters:

  • dtype: Data precision
    • bf16 (Recommended, default) - Balance of precision and performance
    • fp16 - Use when VRAM is insufficient
    • fp32 - Best compatibility but high VRAM usage
  • device: Runtime device (default: cuda)
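The dtype options typically correspond to torch dtypes as below (a sketch under the assumption of a standard string-to-dtype mapping; `DTYPE_ATTR` and `resolve_dtype` are illustrative names, not the plugin's actual code):

```python
# Illustrative mapping from the Load node's dtype option to torch dtype names;
# the attribute is resolved lazily so the mapping itself needs no GPU stack.
DTYPE_ATTR = {"bf16": "bfloat16", "fp16": "float16", "fp32": "float32"}

def resolve_dtype(option: str):
    import torch  # deferred import; requires the ComfyUI environment
    return getattr(torch, DTYPE_ATTR[option])
```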

Node 2: DeepSeek OCR: Run

Performs OCR inference and outputs recognized text.

Parameters:

  • model: Model handle (from Load node)
  • image: Input image (ComfyUI IMAGE type)
  • task: Task mode
    • Free OCR: General OCR recognition
    • Convert to Markdown: Document to Markdown conversion
    • Parse Figure: Parse charts and figures
    • Locate by Reference: Locate specified objects (requires reference_text)
  • resolution: Resolution preset
    • Gundam (Recommended for long documents): 1024/640/crop/compress
    • Tiny: 512x512
    • Small: 640x640
    • Base: 1024x1024
    • Large: 1280x1280
  • output_type: Output type (determines what is returned)
    • all (default): Output both text and visualization image
    • text: Text only; the image output passes through the original input image
    • image: Visualization image only (suitable for Locate task)
  • reference_text: (Optional) Description of the object to locate; used only when task = Locate by Reference
  • box_color: (Optional) Detection box color, default red
    • Preset colors: red, green, blue, yellow, cyan, magenta, white, black
    • Custom RGB: e.g., "255,0,0" (red), "0,255,0" (green)
  • box_width: (Optional) Detection box width, default 2 px, range 1-10
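The fixed-size resolution presets above can be summarized as a simple mapping (illustrative only; "Gundam" is excluded because it uses dynamic tiling with a 1024 base, 640 tiles, and crop/compress rather than one fixed size):

```python
# Illustrative mapping of the fixed resolution presets to (width, height)
RESOLUTION_PRESETS = {
    "Tiny": (512, 512),
    "Small": (640, 640),
    "Base": (1024, 1024),
    "Large": (1280, 1280),
}
```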

Outputs:

  • text: Recognized text content (STRING)
    • Contains original markers (e.g., <|ref|>...<|/ref|><|det|>[[coordinates]]<|/det|>)
  • visualization: Visualization image (IMAGE)
    • Locate by Reference task: Image with custom-styled bounding boxes
    • Other tasks: Returns original input image
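The raw markers in the text output can be parsed back into labeled coordinates, e.g. (a sketch; the coordinate format is assumed from the marker examples above and should be verified against real output, and `extract_boxes` is an illustrative name):

```python
import ast
import re

# Matches <|ref|>label<|/ref|><|det|>[[x1,y1,x2,y2], ...]<|/det|> pairs
_PAIR = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(\[\[.*?\]\])<\|/det\|>",
    re.DOTALL,
)

def extract_boxes(text: str) -> list[tuple[str, list[list[int]]]]:
    """Return (label, boxes) pairs parsed from raw DeepSeek-OCR output."""
    return [(label, ast.literal_eval(boxes)) for label, boxes in _PAIR.findall(text)]
```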

Screenshots

Usage Guide

💡 Output Type Selection

  • all (default): Output both text and visualization image
  • text: Text only (OCR/Markdown conversion)
  • image: Visualization image only (Locate task)

🎯 Locate by Reference Task

Parameter Configuration:

  • task: Select Locate by Reference
  • reference_text: Enter the object to locate
    • Chinese examples: "价格", "标题", "二维码"
    • English examples: "the teacher", "price", "table", "logo"

🎨 Custom Bounding Box Style

Supported Preset Colors (16 types):

| Color Name | RGB | Preview | Color Name | RGB | Preview |
|------------|-----|---------|------------|-----|---------|
| red | 255,0,0 | 🔴 Red (default) | orange | 255,165,0 | 🟠 Orange |
| green | 0,255,0 | 🟢 Green | purple | 128,0,128 | 🟣 Purple |
| blue | 0,0,255 | 🔵 Blue | pink | 255,192,203 | 🩷 Pink |
| yellow | 255,255,0 | 🟡 Yellow | lime | 0,255,0 | 🟢 Lime |
| cyan | 0,255,255 | 🔵 Cyan | navy | 0,0,128 | 🔵 Navy |
| magenta | 255,0,255 | 🟣 Magenta | teal | 0,128,128 | 🔵 Teal |
| white | 255,255,255 | ⚪ White | gold | 255,215,0 | 🟡 Gold |
| black | 0,0,0 | ⚫ Black | silver | 192,192,192 | ⚪ Silver |

Custom RGB Format:

  • Input format: "R,G,B" (e.g., "255,128,0" for dark orange)
  • Range: 0-255

Box Width:

  • box_width: 1-10 pixels (default 2px)

Example Configuration:

box_color = "red"          → Red 2px border (default)
box_color = "orange"       → Orange border
box_color = "255,105,180"  → Hot pink border
box_width = 5              → 5px thick border
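A color parser matching the behavior described above might look like this (a minimal sketch; `PRESET_COLORS` and `parse_box_color` are illustrative names, not the plugin's actual code, and the fallback-to-red behavior is an assumption):

```python
# Preset names and RGB values taken from the table above
PRESET_COLORS = {
    "red": (255, 0, 0), "orange": (255, 165, 0),
    "green": (0, 255, 0), "purple": (128, 0, 128),
    "blue": (0, 0, 255), "pink": (255, 192, 203),
    "yellow": (255, 255, 0), "lime": (0, 255, 0),
    "cyan": (0, 255, 255), "navy": (0, 0, 128),
    "magenta": (255, 0, 255), "teal": (0, 128, 128),
    "white": (255, 255, 255), "gold": (255, 215, 0),
    "black": (0, 0, 0), "silver": (192, 192, 192),
}

def parse_box_color(value: str) -> tuple[int, int, int]:
    """Accept a preset name or an 'R,G,B' string; fall back to the default red."""
    v = value.strip().lower()
    if v in PRESET_COLORS:
        return PRESET_COLORS[v]
    try:
        r, g, b = (int(p) for p in v.split(","))
        if all(0 <= c <= 255 for c in (r, g, b)):
            return (r, g, b)
    except ValueError:
        pass  # not a valid "R,G,B" string
    return (255, 0, 0)  # default: red
```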

📌 Basic Workflow

LoadImage
   ↓
DeepSeek OCR: Load Model  
   ↓
DeepSeek OCR: Run
   ├─→ text → Display Text / Save Text
   └─→ visualization → Preview Image / Save Image

📚 Typical Use Cases

1. Document to Markdown

task = "Convert to Markdown"
resolution = "Gundam"
→ Output formatted Markdown text

2. Figure Parsing

task = "Parse Figure"
resolution = "Base"
→ Extract structured data from tables and charts

3. Object Localization

task = "Locate by Reference"
reference_text = "哆啦A梦"
box_color = "red"
box_width = 2
→ Text contains coordinates, image shows red box annotations

Directory Structure

ComfyUI/
├─ models/
│  └─ deepseek-ocr/                    # ← Fixed weights directory
│     ├─ deepseek-ai_DeepSeek-OCR/     # Model weights
│     └─ hf_cache/                     # HuggingFace cache
├─ output/
│  └─ DeepseekOCR/                     # Output directory (visualization results)
│     └─ 2025-11-05_20-31-00/          # Timestamp directory
├─ log/
│  └─ deepseek_ocr.log                 # Plugin logs
└─ custom_nodes/
   └─ ComfyUI-DeepseekOCR/
      ├─ __init__.py
      ├─ config.py
      ├─ model_manager.py
      ├─ nodes.py
      ├─ resolver.py
      ├─ io_utils.py
      ├─ tool/
      │  └─ download_weights.py
      ├─ requirements.txt
      └─ README.md

Logging

Plugin logs are located at: ComfyUI/log/deepseek_ocr.log

Key log contents:

  • Model weight download progress
  • Model loading status (device/dtype/attn_impl)
  • Cache hit information
  • Fallback strategy trigger records
  • Error details and suggestions

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

  • DeepSeek AI - For providing the powerful DeepSeek-OCR model
  • ComfyUI - Excellent node-based UI framework
  • All contributors and users
