open-compass/TextEdit

TextEdit: A High-Quality, Multi-Scenario Text Editing Benchmark for Generation Models

Paper: coming soon · Data: released in `data/` of this repository

Danni Yang, Sitao Chen, Changyao Tian

If you find our work helpful, please give us a ⭐ or cite our paper. See the InternVL-U technical report appendix for more details.

🎉 News

  • [2026/03/06] TextEdit benchmark released.
  • [2026/03/06] Evaluation code released.
  • [2026/03/06] Leaderboard updated with latest models.

📖 Introduction

Text editing is a fundamental yet challenging capability for modern image generation and editing models. An increasing number of powerful multimodal generation models, such as Qwen-Image and Nano-Banana-Pro, are emerging with strong text rendering and editing capabilities. Unlike general image editing, text editing requires:
  • Precise spatial alignment
  • Font and style consistency
  • Background preservation
  • Layout-constrained reasoning

We introduce TextEdit, a high-quality, multi-scenario benchmark designed to evaluate fine-grained text editing capabilities in image generation models.

TextEdit covers a diverse set of real-world and virtual scenarios, spanning 18 subcategories with a total of 2,148 high-quality source images and manually annotated edited ground-truth images.

To comprehensively assess model performance, we combine classic OCR and image-fidelity metrics with modern multimodal-LLM-based evaluation across target accuracy, text preservation, scene integrity, local realism, and visual coherence. Together, the two tracks form a dual-track protocol for comprehensive assessment.

Our goal is to provide a standardized, realistic, and scalable benchmark for text editing research.


πŸ† LeadBoard

📊 Full Benchmark Results
*Classic metrics. In each row, the first OA–AES block is the Real split and the second is the Virtual split.*

| Models | # Params | OA | OP | OR | F1 | NED | CLIP | AES | OA | OP | OR | F1 | NED | CLIP | AES |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Generation Models** | | | | | | | | | | | | | | | |
| Qwen-Image-Edit | 20B | 0.75 | 0.68 | 0.66 | 0.67 | 0.71 | 0.75 | 5.72 | 0.78 | 0.75 | 0.73 | 0.74 | 0.75 | 0.81 | 5.21 |
| GPT-Image-1.5 | - | 0.74 | 0.69 | 0.67 | 0.68 | 0.68 | 0.75 | 5.78 | 0.73 | 0.72 | 0.71 | 0.71 | 0.70 | 0.80 | 5.28 |
| Nano Banana Pro | - | 0.77 | 0.72 | 0.70 | 0.71 | 0.72 | 0.75 | 5.79 | 0.80 | 0.78 | 0.77 | 0.78 | 0.78 | 0.81 | 5.28 |
| **Unified Models** | | | | | | | | | | | | | | | |
| Lumina-DiMOO | 8B | 0.22 | 0.23 | 0.19 | 0.20 | 0.19 | 0.69 | 5.53 | 0.22 | 0.25 | 0.21 | 0.22 | 0.20 | 0.72 | 4.76 |
| Ovis-U1 | 2.4B+1.2B | 0.40 | 0.37 | 0.34 | 0.35 | 0.35 | 0.72 | 5.32 | 0.37 | 0.40 | 0.38 | 0.39 | 0.33 | 0.75 | 4.66 |
| BAGEL | 7B+7B | 0.60 | 0.59 | 0.53 | 0.55 | 0.55 | 0.74 | 5.71 | 0.57 | 0.60 | 0.56 | 0.57 | 0.54 | 0.78 | 5.19 |
| InternVL-U (Ours) | 2B+1.7B | 0.77 | 0.73 | 0.70 | 0.71 | 0.72 | 0.75 | 5.70 | 0.79 | 0.77 | 0.75 | 0.75 | 0.77 | 0.80 | 5.12 |

*VLM-based metrics. In each row, the first TA–Avg block is the Real split and the second is the Virtual split.*

| Models | # Params | TA | TP | SI | LR | VC | Avg | TA | TP | SI | LR | VC | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Generation Models** | | | | | | | | | | | | | |
| Qwen-Image-Edit | 20B | 0.92 | 0.82 | 0.75 | 0.57 | 0.80 | 0.77 | 0.57 | 0.79 | 0.92 | 0.80 | 0.77 | 0.77 |
| GPT-Image-1.5 | - | 0.96 | 0.94 | 0.86 | 0.80 | 0.93 | 0.90 | 0.82 | 0.93 | 0.96 | 0.91 | 0.87 | 0.90 |
| Nano Banana Pro | - | 0.96 | 0.95 | 0.85 | 0.88 | 0.93 | 0.91 | 0.87 | 0.92 | 0.96 | 0.94 | 0.89 | 0.92 |
| **Unified Models** | | | | | | | | | | | | | |
| Lumina-DiMOO | 8B | 0.17 | 0.06 | 0.04 | 0.02 | 0.05 | 0.09 | 0.02 | 0.06 | 0.16 | 0.05 | 0.03 | 0.08 |
| Ovis-U1 | 2.4B+1.2B | 0.31 | 0.12 | 0.12 | 0.07 | 0.18 | 0.18 | 0.06 | 0.16 | 0.31 | 0.14 | 0.13 | 0.19 |
| BAGEL | 7B+7B | 0.68 | 0.60 | 0.38 | 0.35 | 0.56 | 0.53 | 0.38 | 0.51 | 0.68 | 0.62 | 0.42 | 0.54 |
| InternVL-U (Ours) | 2B+1.7B | 0.94 | 0.90 | 0.71 | 0.80 | 0.80 | 0.88 | 0.87 | 0.86 | 0.91 | 0.82 | 0.62 | 0.83 |
📊 Mini-set Benchmark Results (500 samples)
*Classic metrics. In each row, the first OA–AES block is the Real split and the second is the Virtual split.*

| Models | # Params | OA | OP | OR | F1 | NED | CLIP | AES | OA | OP | OR | F1 | NED | CLIP | AES |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Generation Models** | | | | | | | | | | | | | | | |
| Qwen-Image-Edit | 20B | 0.76 | 0.69 | 0.67 | 0.67 | 0.70 | 0.75 | 5.81 | 0.74 | 0.71 | 0.70 | 0.70 | 0.70 | 0.80 | 5.27 |
| GPT-Image-1.5 | - | 0.72 | 0.68 | 0.66 | 0.67 | 0.67 | 0.75 | 5.85 | 0.68 | 0.69 | 0.68 | 0.68 | 0.65 | 0.80 | 5.32 |
| Nano Banana Pro | - | 0.76 | 0.71 | 0.69 | 0.70 | 0.70 | 0.75 | 5.86 | 0.77 | 0.76 | 0.75 | 0.75 | 0.76 | 0.81 | 5.32 |
| **Unified Models** | | | | | | | | | | | | | | | |
| Lumina-DiMOO | 8B | 0.20 | 0.22 | 0.18 | 0.19 | 0.19 | 0.70 | 5.58 | 0.22 | 0.25 | 0.21 | 0.22 | 0.19 | 0.73 | 4.87 |
| Ovis-U1 | 2.4B+1.2B | 0.37 | 0.34 | 0.32 | 0.32 | 0.33 | 0.72 | 5.39 | 0.39 | 0.41 | 0.38 | 0.39 | 0.33 | 0.74 | 4.75 |
| BAGEL | 7B+7B | 0.61 | 0.59 | 0.52 | 0.54 | 0.54 | 0.74 | 5.79 | 0.53 | 0.58 | 0.53 | 0.55 | 0.51 | 0.78 | 5.25 |
| InternVL-U (Ours) | 2B+1.7B | 0.77 | 0.74 | 0.70 | 0.71 | 0.71 | 0.76 | 5.79 | 0.74 | 0.72 | 0.69 | 0.70 | 0.72 | 0.79 | 5.14 |

*VLM-based metrics. In each row, the first TA–Avg block is the Real split and the second is the Virtual split.*

| Models | # Params | TA | TP | SI | LR | VC | Avg | TA | TP | SI | LR | VC | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Generation Models** | | | | | | | | | | | | | |
| Qwen-Image-Edit | 20B | 0.93 | 0.85 | 0.77 | 0.55 | 0.78 | 0.80 | 0.60 | 0.82 | 0.91 | 0.81 | 0.74 | 0.76 |
| GPT-Image-1.5 | - | 0.97 | 0.94 | 0.86 | 0.79 | 0.92 | 0.91 | 0.85 | 0.93 | 0.95 | 0.92 | 0.83 | 0.88 |
| Nano Banana Pro | - | 0.96 | 0.95 | 0.85 | 0.86 | 0.92 | 0.91 | 0.87 | 0.92 | 0.96 | 0.93 | 0.87 | 0.92 |
| **Unified Models** | | | | | | | | | | | | | |
| Lumina-DiMOO | 8B | 0.16 | 0.04 | 0.04 | 0.02 | 0.06 | 0.08 | 0.02 | 0.05 | 0.19 | 0.07 | 0.03 | 0.10 |
| Ovis-U1 | 2.4B+1.2B | 0.29 | 0.11 | 0.11 | 0.08 | 0.20 | 0.17 | 0.04 | 0.16 | 0.35 | 0.18 | 0.15 | 0.22 |
| BAGEL | 7B+7B | 0.68 | 0.61 | 0.38 | 0.34 | 0.59 | 0.53 | 0.36 | 0.52 | 0.69 | 0.64 | 0.40 | 0.54 |
| InternVL-U (Ours) | 2B+1.7B | 0.94 | 0.91 | 0.72 | 0.73 | 0.75 | 0.89 | 0.88 | 0.87 | 0.90 | 0.78 | 0.57 | 0.79 |

πŸ› οΈ Quick Start

📂 1. Data Preparation

You can download images from this page. The TextEdit benchmark data is organized under data/ by split and category:

  • Virtual (categories 1.x.x): Synthetic/virtual scene images
  • Real (categories 2.x): Real-world scene images

Evaluation prompts are provided under eval_prompts/ in two subsets:

| Subset | Directory | Description |
|---|---|---|
| Fullset | `eval_prompts/fullset/` | Complete benchmark with all samples |
| Miniset (500) | `eval_prompts/miniset/` | 500-sample subset uniformly sampled from the fullset |

Each `.jsonl` file contains the per-sample fields `id`, `prompt`, `original_image`, `gt_image`, `source_text`, `target_text`, and `gt_caption`.
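A subset file can be loaded with a few lines of Python; the helper below is illustrative, not part of the released evaluation code:

```python
import json

def load_jsonl(path):
    """Load a TextEdit prompt file: one JSON record per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# e.g. samples = load_jsonl("eval_prompts/miniset/<some_file>.jsonl")
# each record then exposes record["prompt"], record["target_text"], etc.
```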

🤖 2. Model Output Preparation

Run image-editing inference with your model, then organize the outputs in the folder structure shown below to facilitate evaluation.

```text
output/
├── internvl-u/                      # Your Model Name
│   ├── 1.1.1/                       # Category Name
│   │   ├── 1007088003726.0.jpg      # Model Output Images
│   │   ├── 1013932004096.0.jpg
│   │   └── ...
│   ├── 1.1.2/
│   ├── 1.1.3/
│   ├── ...
│   └── 2.7/
```

πŸ“ 3. Model Evaluation

3.1 Classic Metrics Evaluation

Classic metrics evaluate text editing quality using OCR-based text accuracy, image-text alignment, and aesthetic quality. All metrics are reported separately for Virtual and Real splits.

Evaluated Metrics

| Abbreviation | Metric | Description |
|---|---|---|
| OA | OCR Accuracy | Whether the target text is correctly rendered in the editing region |
| OP | OCR Precision | Precision of text content (target + background) in the generated image |
| OR | OCR Recall | Recall of text content (target + background) in the generated image |
| F1 | OCR F1 | Harmonic mean of OCR Precision and Recall |
| NED | Normalized Edit Distance | ROI-aware normalized edit distance between target and generated text |
| CLIP | CLIPScore | CLIP-based image-text alignment score |
| AES | Aesthetic Score | Predicted aesthetic quality score of the generated image |
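As a rough sketch of the text-similarity quantity behind NED (reported as a similarity, so higher is better; the released scripts additionally restrict the comparison to the edited ROI, which this sketch omits):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic program, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def ned_similarity(target: str, recognized: str) -> float:
    """1 - edit_distance / max_len; 1.0 means an exact match."""
    if not target and not recognized:
        return 1.0
    return 1.0 - levenshtein(target, recognized) / max(len(target), len(recognized))
```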

Usage

Evaluation scripts are provided separately for fullset and miniset:

  • eval_scripts/classic_metrics_eval_full.sh β€” evaluate on the full benchmark
  • eval_scripts/classic_metrics_eval_mini.sh β€” evaluate on the 500-sample miniset

Step 1. Edit the configuration variables at the top of the script (e.g., eval_scripts/classic_metrics_eval_full.sh) to match your project directory:

```shell
MODELS="model-a,model-b,model-c"                    # Comma-separated list of model names to be evaluated

path="your_project_path_here"
CACHE_DIR="$path/TextEdit/checkpoint"               # Directory for all model checkpoints (OCR, CLIP, etc.)

BENCHMARK_DIR="$path/TextEdit/eval_prompts/fullset"
GT_ROOT_DIR="$path/TextEdit/data"                   # Root path for original & GT images
MODEL_OUTPUT_ROOT="$path/TextEdit/output"           # Root path for model inference outputs
OUTPUT_DIR="$path/TextEdit/result/classic_fullset"  # Root path for classic-metric evaluation results
```

Note: All required model checkpoints (PaddleOCR, CLIP, aesthetic model, etc.) should be placed under the CACHE_DIR directory.

Step 2. Run the evaluation shell script for your split:

```shell
# Fullset evaluation
bash eval_scripts/classic_metrics_eval_full.sh

# Miniset evaluation
bash eval_scripts/classic_metrics_eval_mini.sh
```

Results are saved as `{model_name}.json` under the output directory, containing per-sample scores and aggregated metrics for both Virtual and Real splits.


3.2 VLM-based Metrics Evaluation

Our VLM-based evaluation uses Gemini-3-Pro-Preview as an expert judge to score text editing quality across five fine-grained dimensions. The evaluation is a two-step pipeline.

Evaluated Metrics

| Abbreviation | Metric | Description |
|---|---|---|
| TA | Text Accuracy | Spelling correctness and completeness of the target text (1–5) |
| TP | Text Preservation | Preservation of non-target background text (1–5) |
| SI | Scene Integrity | Geometric stability of non-edited background areas (1–5) |
| LR | Local Realism | Inpainting quality, edge cleanness, and seamlessness (1–5) |
| VC | Visual Coherence | Style matching (font, lighting, shadow, texture harmony) (1–5) |
| Avg | Weighted Average | Weighted average of all five dimensions (default weights: 0.4 / 0.3 / 0.1 / 0.1 / 0.1) |

All raw scores (1–5) are normalized to 0–1 for reporting. A cutoff mechanism is available: if TA (Q1) < 4, the remaining dimensions are set to 0, reflecting that a failed text edit invalidates other quality dimensions.
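In code, the per-sample aggregation described above looks roughly like the sketch below. The linear `(s - 1) / 4` mapping from 1–5 onto 0–1 is an assumption; consult `eval_pipeline/vlm_metrics_eval_step2.py` for the exact normalization:

```python
def vlm_sample_score(raw_scores, weights=(0.4, 0.3, 0.1, 0.1, 0.1),
                     enable_cutoff=True):
    """Aggregate one sample's raw 1-5 scores, ordered (TA, TP, SI, LR, VC).

    Applies the cutoff (TA < 4 zeroes the other dimensions), normalizes
    each score to 0-1, and returns the weighted average.
    """
    ta = raw_scores[0]
    if enable_cutoff and ta < 4:
        raw_scores = [ta, 0, 0, 0, 0]
    # Map 1-5 onto 0-1; dimensions zeroed by the cutoff stay at 0.
    norm = [(s - 1) / 4 if s > 0 else 0.0 for s in raw_scores]
    return sum(w * s for w, s in zip(weights, norm))
```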

Step 1: Gemini API Evaluation

Send (Original Image, GT Image, Edited Image) triplets to the Gemini API for scoring.

Configure and run eval_scripts/vlm_metrics_eval_step1.sh:

```shell
API_KEY="your_gemini_api_key_here"
BASE_URL="your_gemini_api_base_url_here"

python eval_pipeline/vlm_metrics_eval_step1.py \
  --input_data_dir <your_path>/TextEdit/eval_prompts/fullset \
  --model_output_root <your_path>/TextEdit/output \
  --gt_data_root <your_path>/TextEdit/data \
  --output_base_dir <your_path>/TextEdit/result/vlm_gemini_full_answers \
  --model_name "gemini-3-pro-preview" \
  --models "model-a,model-b,model-c" \
  --api_key "$API_KEY" \
  --base_url "$BASE_URL" \
  --num_workers 64
```

Per-model .jsonl answer files are saved under the output_base_dir.

Step 2: Score Aggregation & Report

Aggregate the per-sample Gemini responses into a final report.

Configure and run eval_scripts/vlm_metrics_eval_step2.sh:

```shell
# Fullset report
python eval_pipeline/vlm_metrics_eval_step2.py \
  --answer_dir <your_path>/TextEdit/result/vlm_gemini_full_answers \
  --output_file <your_path>/TextEdit/result/gemini_report_fullset.json \
  --weights 0.4 0.3 0.1 0.1 0.1 \
  --enable_cutoff

# Miniset report
python eval_pipeline/vlm_metrics_eval_step2.py \
  --answer_dir <your_path>/TextEdit/result/vlm_gemini_mini_answers \
  --output_file <your_path>/TextEdit/result/gemini_report_miniset.json \
  --weights 0.4 0.3 0.1 0.1 0.1 \
  --enable_cutoff
```

Key parameters:

  • --weights: Weights for Q1–Q5 (default: 0.4 0.3 0.1 0.1 0.1).
  • --enable_cutoff: Enable cutoff mechanism β€” if Q1 < 4, set Q2–Q5 to 0.

The output includes a JSON report, a CSV table, and a Markdown-formatted leaderboard printed to the console.


🎨 Visualization Output Example

Citation

If you find TextEdit useful, please cite our InternVL-U technical report.
