lica-world · mohitgargai · May 5, 2026 · Apr 20, 2026
@@ -46,7 +46,7 @@ jobs:
           reg = BenchmarkRegistry()
           reg.discover()
           ids = reg.list_ids()
-          assert len(ids) == 39, f'Expected 39 benchmarks, got {len(ids)}'
+          assert len(ids) == 40, f'Expected 40 benchmarks, got {len(ids)}'
           "
       - name: pytest
         run: pytest tests/ -v

@@ -26,6 +26,7 @@ auth/
 .venv/
 venv/
 env/
+models/
 
 .pytest_cache/
 .ruff_cache/

@@ -1,6 +1,6 @@
 # GDB: GraphicDesignBench
 
-**GDB** evaluates vision-language models on professional graphic design tasks — layout reasoning, typography, SVG editing, template matching, animation. The paper defines 49 evaluation tasks; this repo ships 39 benchmark pipelines covering 45 of them, organized into 7 code-level domains and built on the [Lica dataset](https://github.com/lica-world/lica-dataset) (1,148 real design layouts).
+**GDB** evaluates vision-language models on professional graphic design tasks — layout reasoning, typography, SVG editing, template matching, animation. The paper defines 49 evaluation tasks; this repo ships 40 benchmark pipelines covering 46 of them, organized into 7 code-level domains and built on the [Lica dataset](https://github.com/lica-world/lica-dataset) (1,148 real design layouts).
 
 **Paper:** [arXiv:2604.04192](https://arxiv.org/abs/2604.04192) &nbsp;|&nbsp; **Dataset:** [HuggingFace](https://huggingface.co/datasets/lica-world/GDB) &nbsp;|&nbsp; **Blog:** [lica.world](https://lica.world/blog/gdb-real-world-benchmark-for-graphic-design)
 
@@ -17,8 +17,8 @@ benchmark pipelines and the paper-level evaluation tasks they score.
 | svg | 8 | 8 | SVG reasoning and editing (perceptual and semantic Q/A, bug fixing, optimization, style editing) and generation (text-to-SVG, image-to-SVG, combined input) |
 | template | 5 | 5 | Template matching, retrieval, clustering, and generation (style completion, color transfer) |
 | temporal | 6 | 8 | Keyframe ordering; motion type classification; video/component duration and start-time estimation; generation (animation parameters, motion trajectory, short-form video) |
-| typography | 8 | 12 | Font family, color, size/weight/alignment/letter spacing/line height, style ranges, curvature, rotation, and generation (styled text element, styled text rendering to layout) |
-| **Totals** | **39** | **45** | |
+| typography | 9 | 13 | Font family, color, size/weight/alignment/letter spacing/line height, style ranges, curvature, rotation, and generation (styled text element, styled text rendering to layout, text removal/background inpainting as `image-6`) |
+| **Totals** | **40** | **46** | |
 
 Benchmarks and paper tasks are not 1:1. Two benchmarks score multiple paper tasks from a
 single model call: `typography-3` extracts font size, weight, alignment, letter spacing,
@@ -60,7 +60,7 @@ pip install -e ".[dev]"              # ruff linter
 
 ```bash
 gdb verify      # zero-config smoke test against a bundled fixture (~30s, no API keys)
-gdb list        # enumerate all 39 benchmarks
+gdb list        # enumerate all 40 benchmarks
 gdb suites      # named suites: v0-all, v0-smoke, v0-understanding, v0-generation
 ```
 
@@ -115,7 +115,7 @@ gdb eval --benchmarks svg-1 \
     --provider hf --device auto \
     --dataset-root data/gdb-dataset
 
-# Diffusion / image generation (defaults to FLUX.2 klein 4B)
+# Diffusion / image generation (defaults to FLUX.2 klein 9B)
 gdb eval --benchmarks layout-1 \
     --provider diffusion \
     --dataset-root data/gdb-dataset
@@ -132,7 +132,7 @@ python -m pip install --no-deps --ignore-requires-python \
 gdb eval --benchmarks layout-1 layout-3 layout-8 typography-7 typography-8 \
     --provider custom \
     --custom-entry gdb.models.local_models:Flux2Model \
-    --custom-init-kwargs '{"model_name":"flux.2-klein-4b"}' \
+    --custom-init-kwargs '{"model_name":"flux.2-klein-9b"}' \
     --custom-modality image_generation \
     --dataset-root data/gdb-dataset
 
@@ -160,7 +160,7 @@ helm-summarize --suite gdb-eval
 helm-server --suite gdb-eval
 ```
 
-All 39 benchmarks are available. See [integrations/helm/](integrations/helm/) for details.
+All 40 benchmarks are available. See [integrations/helm/](integrations/helm/) for details.
 
 ### API keys
 
@@ -219,7 +219,7 @@ GDB/
 │   │   ├── svg.py          #   svg-1 … svg-8
 │   │   ├── template.py     #   template-1 … template-5
 │   │   ├── temporal.py     #   temporal-1 … temporal-6
-│   │   └── typography.py   #   typography-1 … typography-8
+│   │   └── typography.py   #   typography-1 … typography-8 + image-6 implementation
 │   ├── models/             # Provider wrappers (OpenAI, Anthropic, Gemini, HF, vLLM)
 │   ├── metrics/            # Reusable metric functions (IoU, FID, SSIM, LPIPS, edit distance)
 │   ├── evaluation/

@@ -1,6 +1,6 @@
 # lica-gdb-helm
 
-HELM integration for [GDB (GraphicDesignBench)](https://github.com/lica-world/GDB) — run all 39 GDB benchmarks through Stanford CRFM's [HELM](https://github.com/stanford-crfm/helm) framework.
+HELM integration for [GDB (GraphicDesignBench)](https://github.com/lica-world/GDB) — run all 40 GDB benchmarks through Stanford CRFM's [HELM](https://github.com/stanford-crfm/helm) framework.
 
 ## Install
 
@@ -35,7 +35,7 @@ helm-server --suite gdb-eval
 
 ## Available benchmarks
 
-All 39 GDB benchmarks are available. Pass any benchmark ID:
+All 40 GDB benchmarks are available. Pass any benchmark ID:
 
 | Domain | Benchmark IDs |
 |--------|--------------|
@@ -44,7 +44,7 @@ All 39 GDB benchmarks are available. Pass any benchmark ID:
 | SVG | `svg-1` through `svg-8` |
 | Template | `template-1` through `template-5` |
 | Temporal | `temporal-1` through `temporal-6` |
-| Typography | `typography-1` through `typography-8` |
+| Typography | `typography-1` through `typography-8`, `image-6` |
 | Lottie | `lottie-1`, `lottie-2` |
 
 ## Options

@@ -73,6 +73,7 @@ class BenchmarkInfo:
     "typography-6": BenchmarkInfo(method="generation_multimodal", max_tokens=256, has_images=True),
 
     # -- typography: generation --
+    "image-6": BenchmarkInfo(method="generation", max_tokens=0, has_images=True, image_gen=True),
     "typography-7": BenchmarkInfo(method="generation", max_tokens=0, has_images=True, image_gen=True),
     "typography-8": BenchmarkInfo(method="generation", max_tokens=0, image_gen=True),
 

@@ -1,6 +1,6 @@
 """HELM Scenario that wraps any GDB benchmark.
 
-One parameterized class handles all 39 benchmarks by delegating data loading
+One parameterized class handles all 40 benchmarks by delegating data loading
 and prompt construction to the ``gdb`` package.
 """
 

@@ -46,7 +46,7 @@ python scripts/run_benchmarks.py --benchmarks svg-6 \
     --provider vllm --model-id Qwen/Qwen3-VL-4B-Instruct --top-k 20 --top-p 0.8 \
     --dataset-root data/gdb-dataset
 
-# Diffusion / image generation (defaults to FLUX.2 klein 4B)
+# Diffusion / image generation (defaults to FLUX.2 klein 9B)
 python scripts/run_benchmarks.py --benchmarks layout-1 \
     --provider diffusion \
     --dataset-root data/gdb-dataset
@@ -69,7 +69,7 @@ python -m pip install --no-deps --ignore-requires-python \
 python scripts/run_benchmarks.py --benchmarks layout-1 layout-3 layout-8 typography-7 typography-8 \
     --provider custom \
     --custom-entry gdb.models.local_models:Flux2Model \
-    --custom-init-kwargs '{"model_name":"flux.2-klein-4b"}' \
+    --custom-init-kwargs '{"model_name":"flux.2-klein-9b"}' \
     --custom-modality image_generation \
     --dataset-root data/gdb-dataset
 
@@ -100,7 +100,7 @@ from Hugging Face and can use either environment tokens (`HF_TOKEN`,
 `HF_HUB_TOKEN`) or an existing cached login/token file.
 
 The default local text/VLM model ID is now `Qwen/Qwen3-VL-4B-Instruct` for both
-`hf` and `vllm`, and the default `diffusion` model ID is `flux.2-klein-4b`.
+`hf` and `vllm`, and the default `diffusion` model ID is `flux.2-klein-9b`.
 
 ### Batch submit/collect (~50% cheaper)
 

@@ -75,7 +75,7 @@
     "anthropic": "claude-sonnet-4-20250514",
     "hf": "Qwen/Qwen3-VL-4B-Instruct",
     "vllm": "Qwen/Qwen3-VL-4B-Instruct",
-    "diffusion": "flux.2-klein-4b",
+    "diffusion": "flux.2-klein-9b",
     "custom": "custom-entrypoint",
 }
 

@@ -50,6 +50,31 @@ def _find_image(sample: Dict[str, Any]) -> Optional[str]:
     return None
 
 
+def _find_path(value: Any) -> Optional[str]:
+    if isinstance(value, str) and value and Path(value).exists() and _is_image_file(value):
+        return value
+    return None
+
+
+def _image_assets_for_sample(sample: Dict[str, Any]) -> Dict[str, Optional[str]]:
+    ground_truth = sample.get("ground_truth")
+    gt_image = None
+    if isinstance(ground_truth, dict):
+        gt_image = _find_path(
+            ground_truth.get("image")
+            or ground_truth.get("ground_truth_image")
+            or ground_truth.get("target_image")
+        )
+    elif isinstance(ground_truth, str):
+        gt_image = _find_path(ground_truth)
+
+    return {
+        "input_image_asset": _find_path(sample.get("input_image")),
+        "mask_asset": _find_path(sample.get("mask") or sample.get("text_mask")),
+        "ground_truth_image_asset": gt_image,
+    }
+
+
 def _is_video(path: str) -> bool:
     return path.lower().endswith(".mp4")
 
@@ -98,6 +123,7 @@ def load_via_registry(
                  for k, v in sample.items() if k not in metadata_skip}
 
         media_path_rel = _normalize_paths(img_path, dataset_root_str) if img_path else ""
+        image_assets = _image_assets_for_sample(sample)
 
         rows_out.append({
             "sample_id": str(sample.get("sample_id", "")),
@@ -108,6 +134,9 @@ def load_via_registry(
             "prompt": sample.get("prompt", ""),
             "ground_truth": _serialize(sample.get("ground_truth", "")),
             "image": img_path if has_image else None,
+            "input_image_asset": image_assets["input_image_asset"],
+            "mask_asset": image_assets["mask_asset"],
+            "ground_truth_image_asset": image_assets["ground_truth_image_asset"],
             "media_path": media_path_rel,
             "media_type": "video" if is_vid else ("image" if has_image else "none"),
             "metadata": json.dumps(extra, ensure_ascii=False, default=str) if extra else "{}",
@@ -143,6 +172,9 @@ def build_dataset(all_rows: List[Dict[str, Any]]):
     import datasets
 
     has_images = any(r["image"] is not None for r in all_rows)
+    has_input_images = any(r.get("input_image_asset") is not None for r in all_rows)
+    has_masks = any(r.get("mask_asset") is not None for r in all_rows)
+    has_gt_images = any(r.get("ground_truth_image_asset") is not None for r in all_rows)
 
     features = datasets.Features({
         "sample_id": datasets.Value("string"),
@@ -153,6 +185,9 @@ def build_dataset(all_rows: List[Dict[str, Any]]):
         "prompt": datasets.Value("large_string"),
         "ground_truth": datasets.Value("large_string"),
         "image": datasets.Image() if has_images else datasets.Value("string"),
+        "input_image_asset": datasets.Image() if has_input_images else datasets.Value("string"),
+        "mask_asset": datasets.Image() if has_masks else datasets.Value("string"),
+        "ground_truth_image_asset": datasets.Image() if has_gt_images else datasets.Value("string"),
         "media_path": datasets.Value("string"),
         "media_type": datasets.Value("string"),
         "metadata": datasets.Value("large_string"),
@@ -164,6 +199,16 @@ def build_dataset(all_rows: List[Dict[str, Any]]):
                 r["image"] = None
         else:
             r["image"] = ""
+        for key, has_asset in (
+            ("input_image_asset", has_input_images),
+            ("mask_asset", has_masks),
+            ("ground_truth_image_asset", has_gt_images),
+        ):
+            if has_asset:
+                if not r.get(key):
+                    r[key] = None
+            else:
+                r[key] = ""
 
     return datasets.Dataset.from_list(all_rows, features=features)
 
@@ -233,7 +278,7 @@ def generate_dataset_card(config_names: Optional[List[str]] = None) -> str:
 
 # GDB: GraphicDesignBench
 
-39 benchmarks for evaluating vision-language models on graphic design tasks — layout, typography, SVG, template matching, animation. Built on 1,148 real design layouts from the [Lica dataset](https://lica.world).
+40 benchmarks for evaluating vision-language models on graphic design tasks — layout, typography, SVG, template matching, animation. Built on 1,148 real design layouts from the [Lica dataset](https://lica.world).
 
 **Paper:** [arXiv:2604.04192](https://arxiv.org/abs/2604.04192) &nbsp;|&nbsp; **Code:** [github.com/lica-world/GDB](https://github.com/lica-world/GDB) &nbsp;|&nbsp; **Blog:** [lica.world](https://lica.world/blog/gdb-real-world-benchmark-for-graphic-design)
 
@@ -256,6 +301,9 @@ def generate_dataset_card(config_names: Optional[List[str]] = None) -> str:
 | `prompt` | string | Evaluation prompt |
 | `ground_truth` | string | Expected answer (JSON for complex types) |
 | `image` | Image | Input image (when applicable) |
+| `input_image_asset` | Image | Auxiliary source image for generation/editing tasks |
+| `mask_asset` | Image | Auxiliary edit mask for generation/editing tasks |
+| `ground_truth_image_asset` | Image | Auxiliary target/reference image for generation metrics |
 | `metadata` | string | Task-specific fields as JSON |
 
 ## Evaluation

@@ -66,7 +66,7 @@
     "anthropic": "claude-sonnet-4-20250514",
     "hf": "Qwen/Qwen3-VL-4B-Instruct",
     "vllm": "Qwen/Qwen3-VL-4B-Instruct",
-    "diffusion": "flux.2-klein-4b",
+    "diffusion": "flux.2-klein-9b",
     "custom": "custom-entrypoint",
 }
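
Review note on the `build_dataset` hunk: each of the three new asset columns is typed as `datasets.Image()` only when at least one row carries a value for it; otherwise the column is typed as a string and every row gets `""`. The sketch below restates that per-column rule in plain Python, without the `datasets` dependency, so it can be checked in isolation. The names `ASSET_KEYS` and `normalize_asset_columns` are hypothetical and do not appear in the PR; the sample rows are made up for illustration.

```python
from typing import Any, Dict, List

# Column names taken from the diff; the tuple name itself is hypothetical.
ASSET_KEYS = ("input_image_asset", "mask_asset", "ground_truth_image_asset")


def normalize_asset_columns(rows: List[Dict[str, Any]]) -> Dict[str, bool]:
    """Apply the placeholder rule from the build_dataset hunk, in place.

    Returns a map of column name -> whether any row has a non-None value,
    which is the condition the diff uses to pick Image() over Value("string").
    """
    present = {
        key: any(r.get(key) is not None for r in rows) for key in ASSET_KEYS
    }
    for r in rows:
        for key, has_asset in present.items():
            if has_asset:
                # Image-typed column: missing or empty values become None.
                if not r.get(key):
                    r[key] = None
            else:
                # String-typed column: every row gets an empty string.
                r[key] = ""
    return present


# Made-up rows: only the first sample has an input image, none has a mask.
rows = [
    {"sample_id": "a", "input_image_asset": "in.png",
     "mask_asset": None, "ground_truth_image_asset": None},
    {"sample_id": "b", "input_image_asset": None,
     "mask_asset": None, "ground_truth_image_asset": None},
]
present = normalize_asset_columns(rows)
```

Under this rule a column's type depends on the whole batch of rows, so the same benchmark could in principle produce an `Image` column in one export and a string column in another if the batch contents differ.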