Merged
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -46,7 +46,7 @@ jobs:
reg = BenchmarkRegistry()
reg.discover()
ids = reg.list_ids()
- assert len(ids) == 39, f'Expected 39 benchmarks, got {len(ids)}'
+ assert len(ids) == 40, f'Expected 40 benchmarks, got {len(ids)}'
"
- name: pytest
run: pytest tests/ -v
1 change: 1 addition & 0 deletions .gitignore
@@ -26,6 +26,7 @@ auth/
.venv/
venv/
env/
+ models/

.pytest_cache/
.ruff_cache/
16 changes: 8 additions & 8 deletions README.md
@@ -1,6 +1,6 @@
# GDB: GraphicDesignBench

- **GDB** evaluates vision-language models on professional graphic design tasks — layout reasoning, typography, SVG editing, template matching, animation. The paper defines 49 evaluation tasks; this repo ships 39 benchmark pipelines covering 45 of them, organized into 7 code-level domains and built on the [Lica dataset](https://github.com/lica-world/lica-dataset) (1,148 real design layouts).
+ **GDB** evaluates vision-language models on professional graphic design tasks — layout reasoning, typography, SVG editing, template matching, animation. The paper defines 49 evaluation tasks; this repo ships 40 benchmark pipelines covering 46 of them, organized into 7 code-level domains and built on the [Lica dataset](https://github.com/lica-world/lica-dataset) (1,148 real design layouts).

**Paper:** [arXiv:2604.04192](https://arxiv.org/abs/2604.04192)  |  **Dataset:** [HuggingFace](https://huggingface.co/datasets/lica-world/GDB)  |  **Blog:** [lica.world](https://lica.world/blog/gdb-real-world-benchmark-for-graphic-design)

@@ -17,8 +17,8 @@ benchmark pipelines and the paper-level evaluation tasks they score.
| svg | 8 | 8 | SVG reasoning and editing (perceptual and semantic Q/A, bug fixing, optimization, style editing) and generation (text-to-SVG, image-to-SVG, combined input) |
| template | 5 | 5 | Template matching, retrieval, clustering, and generation (style completion, color transfer) |
| temporal | 6 | 8 | Keyframe ordering; motion type classification; video/component duration and start-time estimation; generation (animation parameters, motion trajectory, short-form video) |
- | typography | 8 | 12 | Font family, color, size/weight/alignment/letter spacing/line height, style ranges, curvature, rotation, and generation (styled text element, styled text rendering to layout) |
- | **Totals** | **39** | **45** | |
+ | typography | 9 | 13 | Font family, color, size/weight/alignment/letter spacing/line height, style ranges, curvature, rotation, and generation (styled text element, styled text rendering to layout, text removal/background inpainting as `image-6`) |
+ | **Totals** | **40** | **46** | |

Benchmarks and paper tasks are not 1:1. Two benchmarks score multiple paper tasks from a
single model call: `typography-3` extracts font size, weight, alignment, letter spacing,
@@ -60,7 +60,7 @@ pip install -e ".[dev]" # ruff linter

```bash
gdb verify # zero-config smoke test against a bundled fixture (~30s, no API keys)
- gdb list # enumerate all 39 benchmarks
+ gdb list # enumerate all 40 benchmarks
gdb suites # named suites: v0-all, v0-smoke, v0-understanding, v0-generation
```

@@ -115,7 +115,7 @@ gdb eval --benchmarks svg-1 \
--provider hf --device auto \
--dataset-root data/gdb-dataset

- # Diffusion / image generation (defaults to FLUX.2 klein 4B)
+ # Diffusion / image generation (defaults to FLUX.2 klein 9B)
gdb eval --benchmarks layout-1 \
--provider diffusion \
--dataset-root data/gdb-dataset
@@ -132,7 +132,7 @@ python -m pip install --no-deps --ignore-requires-python \
gdb eval --benchmarks layout-1 layout-3 layout-8 typography-7 typography-8 \
--provider custom \
--custom-entry gdb.models.local_models:Flux2Model \
- --custom-init-kwargs '{"model_name":"flux.2-klein-4b"}' \
+ --custom-init-kwargs '{"model_name":"flux.2-klein-9b"}' \
--custom-modality image_generation \
--dataset-root data/gdb-dataset

@@ -160,7 +160,7 @@ helm-summarize --suite gdb-eval
helm-server --suite gdb-eval
```

- All 39 benchmarks are available. See [integrations/helm/](integrations/helm/) for details.
+ All 40 benchmarks are available. See [integrations/helm/](integrations/helm/) for details.

### API keys

@@ -219,7 +219,7 @@ GDB/
│ │ ├── svg.py # svg-1 … svg-8
│ │ ├── template.py # template-1 … template-5
│ │ ├── temporal.py # temporal-1 … temporal-6
- │ │ └── typography.py # typography-1 … typography-8
+ │ │ └── typography.py # typography-1 … typography-8 + image-6 implementation
│ ├── models/ # Provider wrappers (OpenAI, Anthropic, Gemini, HF, vLLM)
│ ├── metrics/ # Reusable metric functions (IoU, FID, SSIM, LPIPS, edit distance)
│ ├── evaluation/
6 changes: 3 additions & 3 deletions integrations/helm/README.md
@@ -1,6 +1,6 @@
# lica-gdb-helm

- HELM integration for [GDB (GraphicDesignBench)](https://github.com/lica-world/GDB) — run all 39 GDB benchmarks through Stanford CRFM's [HELM](https://github.com/stanford-crfm/helm) framework.
+ HELM integration for [GDB (GraphicDesignBench)](https://github.com/lica-world/GDB) — run all 40 GDB benchmarks through Stanford CRFM's [HELM](https://github.com/stanford-crfm/helm) framework.

## Install

@@ -35,7 +35,7 @@ helm-server --suite gdb-eval

## Available benchmarks

- All 39 GDB benchmarks are available. Pass any benchmark ID:
+ All 40 GDB benchmarks are available. Pass any benchmark ID:

| Domain | Benchmark IDs |
|--------|--------------|
@@ -44,7 +44,7 @@ All 39 GDB benchmarks are available. Pass any benchmark ID:
| SVG | `svg-1` through `svg-8` |
| Template | `template-1` through `template-5` |
| Temporal | `temporal-1` through `temporal-6` |
- | Typography | `typography-1` through `typography-8` |
+ | Typography | `typography-1` through `typography-8`, `image-6` |
| Lottie | `lottie-1`, `lottie-2` |

## Options
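
Review note: `image-6` is now listed under Typography even though its ID does not start with `typography-`, so any tooling that infers a domain from the ID prefix would misfile it. The sketch below shows prefix-based grouping with an explicit override. `DOMAIN_OVERRIDES`, `domain_of`, and `count_by_domain` are hypothetical names used only for illustration; they are not part of the `gdb` package.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical override table: IDs whose prefix does not match their domain.
DOMAIN_OVERRIDES: Dict[str, str] = {"image-6": "typography"}


def domain_of(benchmark_id: str) -> str:
    """Return the domain for a benchmark ID, honoring explicit overrides."""
    if benchmark_id in DOMAIN_OVERRIDES:
        return DOMAIN_OVERRIDES[benchmark_id]
    # Default rule: everything before the trailing "-<number>" is the domain.
    return benchmark_id.rsplit("-", 1)[0]


def count_by_domain(ids: Iterable[str]) -> Counter:
    """Count benchmarks per domain."""
    return Counter(domain_of(i) for i in ids)


if __name__ == "__main__":
    ids = [f"typography-{n}" for n in range(1, 9)] + ["image-6"]
    print(count_by_domain(ids))
```

With the override in place, the nine typography entries from the README table are counted together; without it, `image-6` would form its own `image` group.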
1 change: 1 addition & 0 deletions integrations/helm/src/gdb_helm/_benchmark_info.py
@@ -73,6 +73,7 @@ class BenchmarkInfo:
"typography-6": BenchmarkInfo(method="generation_multimodal", max_tokens=256, has_images=True),

# -- typography: generation --
+ "image-6": BenchmarkInfo(method="generation", max_tokens=0, has_images=True, image_gen=True),
"typography-7": BenchmarkInfo(method="generation", max_tokens=0, has_images=True, image_gen=True),
"typography-8": BenchmarkInfo(method="generation", max_tokens=0, image_gen=True),

2 changes: 1 addition & 1 deletion integrations/helm/src/gdb_helm/scenarios.py
@@ -1,6 +1,6 @@
"""HELM Scenario that wraps any GDB benchmark.

- One parameterized class handles all 39 benchmarks by delegating data loading
+ One parameterized class handles all 40 benchmarks by delegating data loading
and prompt construction to the ``gdb`` package.
"""

6 changes: 3 additions & 3 deletions scripts/README.md
@@ -46,7 +46,7 @@ python scripts/run_benchmarks.py --benchmarks svg-6 \
--provider vllm --model-id Qwen/Qwen3-VL-4B-Instruct --top-k 20 --top-p 0.8 \
--dataset-root data/gdb-dataset

- # Diffusion / image generation (defaults to FLUX.2 klein 4B)
+ # Diffusion / image generation (defaults to FLUX.2 klein 9B)
python scripts/run_benchmarks.py --benchmarks layout-1 \
--provider diffusion \
--dataset-root data/gdb-dataset
@@ -69,7 +69,7 @@ python -m pip install --no-deps --ignore-requires-python \
python scripts/run_benchmarks.py --benchmarks layout-1 layout-3 layout-8 typography-7 typography-8 \
--provider custom \
--custom-entry gdb.models.local_models:Flux2Model \
- --custom-init-kwargs '{"model_name":"flux.2-klein-4b"}' \
+ --custom-init-kwargs '{"model_name":"flux.2-klein-9b"}' \
--custom-modality image_generation \
--dataset-root data/gdb-dataset

@@ -100,7 +100,7 @@ from Hugging Face and can use either environment tokens (`HF_TOKEN`,
`HF_HUB_TOKEN`) or an existing cached login/token file.

The default local text/VLM model ID is now `Qwen/Qwen3-VL-4B-Instruct` for both
- `hf` and `vllm`, and the default `diffusion` model ID is `flux.2-klein-4b`.
+ `hf` and `vllm`, and the default `diffusion` model ID is `flux.2-klein-9b`.

### Batch submit/collect (~50% cheaper)

2 changes: 1 addition & 1 deletion scripts/run_benchmarks.py
@@ -75,7 +75,7 @@
"anthropic": "claude-sonnet-4-20250514",
"hf": "Qwen/Qwen3-VL-4B-Instruct",
"vllm": "Qwen/Qwen3-VL-4B-Instruct",
- "diffusion": "flux.2-klein-4b",
+ "diffusion": "flux.2-klein-9b",
"custom": "custom-entrypoint",
}

50 changes: 49 additions & 1 deletion scripts/upload_to_hf.py
@@ -50,6 +50,31 @@ def _find_image(sample: Dict[str, Any]) -> Optional[str]:
return None


+ def _find_path(value: Any) -> Optional[str]:
+     if isinstance(value, str) and value and Path(value).exists() and _is_image_file(value):
+         return value
+     return None
+
+
+ def _image_assets_for_sample(sample: Dict[str, Any]) -> Dict[str, Optional[str]]:
+     ground_truth = sample.get("ground_truth")
+     gt_image = None
+     if isinstance(ground_truth, dict):
+         gt_image = _find_path(
+             ground_truth.get("image")
+             or ground_truth.get("ground_truth_image")
+             or ground_truth.get("target_image")
+         )
+     elif isinstance(ground_truth, str):
+         gt_image = _find_path(ground_truth)
+
+     return {
+         "input_image_asset": _find_path(sample.get("input_image")),
+         "mask_asset": _find_path(sample.get("mask") or sample.get("text_mask")),
+         "ground_truth_image_asset": gt_image,
+     }
+
+
def _is_video(path: str) -> bool:
return path.lower().endswith(".mp4")
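
Review note: the new helpers only return a path when it is a string, exists on disk, and passes `_is_image_file`. The sketch below reproduces that logic so the behavior can be checked in isolation. `_is_image_file` is not shown in this diff, so the version here is a hypothetical stand-in based on file extension, and `demo` is an illustrative helper rather than code from the PR.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional, Tuple

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}  # assumption, not from the PR


def _is_image_file(path: str) -> bool:
    # Hypothetical stand-in: the real helper is outside the shown hunk.
    return Path(path).suffix.lower() in IMAGE_EXTS


def _find_path(value: Any) -> Optional[str]:
    if isinstance(value, str) and value and Path(value).exists() and _is_image_file(value):
        return value
    return None


def _image_assets_for_sample(sample: Dict[str, Any]) -> Dict[str, Optional[str]]:
    ground_truth = sample.get("ground_truth")
    gt_image = None
    if isinstance(ground_truth, dict):
        # The `or` chain picks the first truthy value before the existence check.
        gt_image = _find_path(
            ground_truth.get("image")
            or ground_truth.get("ground_truth_image")
            or ground_truth.get("target_image")
        )
    elif isinstance(ground_truth, str):
        gt_image = _find_path(ground_truth)
    return {
        "input_image_asset": _find_path(sample.get("input_image")),
        "mask_asset": _find_path(sample.get("mask") or sample.get("text_mask")),
        "ground_truth_image_asset": gt_image,
    }


def demo() -> Tuple[Dict[str, Optional[str]], Optional[str]]:
    """Resolve assets for two toy samples and return file names only."""
    with tempfile.TemporaryDirectory() as tmp:
        real = Path(tmp) / "input.png"
        real.write_bytes(b"")  # empty file is enough for the existence check
        missing = str(Path(tmp) / "missing.png")

        sample = {
            "input_image": str(real),
            "text_mask": missing,  # path does not exist, so it resolves to None
            "ground_truth": {"target_image": str(real)},
        }
        assets = _image_assets_for_sample(sample)
        names = {k: (Path(v).name if v else None) for k, v in assets.items()}

        # A stale "image" entry shadows the valid "target_image" entry.
        shadowed = _image_assets_for_sample(
            {"ground_truth": {"image": missing, "target_image": str(real)}}
        )["ground_truth_image_asset"]
        return names, shadowed


if __name__ == "__main__":
    print(demo())
```

One consequence worth noting: because the `or` chain selects a key before `_find_path` checks the file, a non-empty but missing `image` path prevents the fallback keys from being tried.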

@@ -98,6 +123,7 @@ def load_via_registry(
for k, v in sample.items() if k not in metadata_skip}

media_path_rel = _normalize_paths(img_path, dataset_root_str) if img_path else ""
+ image_assets = _image_assets_for_sample(sample)

rows_out.append({
"sample_id": str(sample.get("sample_id", "")),
@@ -108,6 +134,9 @@ def load_via_registry(
"prompt": sample.get("prompt", ""),
"ground_truth": _serialize(sample.get("ground_truth", "")),
"image": img_path if has_image else None,
+ "input_image_asset": image_assets["input_image_asset"],
+ "mask_asset": image_assets["mask_asset"],
+ "ground_truth_image_asset": image_assets["ground_truth_image_asset"],
"media_path": media_path_rel,
"media_type": "video" if is_vid else ("image" if has_image else "none"),
"metadata": json.dumps(extra, ensure_ascii=False, default=str) if extra else "{}",
@@ -143,6 +172,9 @@ def build_dataset(all_rows: List[Dict[str, Any]]):
import datasets

has_images = any(r["image"] is not None for r in all_rows)
+ has_input_images = any(r.get("input_image_asset") is not None for r in all_rows)
+ has_masks = any(r.get("mask_asset") is not None for r in all_rows)
+ has_gt_images = any(r.get("ground_truth_image_asset") is not None for r in all_rows)

features = datasets.Features({
"sample_id": datasets.Value("string"),
@@ -153,6 +185,9 @@ def build_dataset(all_rows: List[Dict[str, Any]]):
"prompt": datasets.Value("large_string"),
"ground_truth": datasets.Value("large_string"),
"image": datasets.Image() if has_images else datasets.Value("string"),
+ "input_image_asset": datasets.Image() if has_input_images else datasets.Value("string"),
+ "mask_asset": datasets.Image() if has_masks else datasets.Value("string"),
+ "ground_truth_image_asset": datasets.Image() if has_gt_images else datasets.Value("string"),
"media_path": datasets.Value("string"),
"media_type": datasets.Value("string"),
"metadata": datasets.Value("large_string"),
@@ -164,6 +199,16 @@ def build_dataset(all_rows: List[Dict[str, Any]]):
r["image"] = None
else:
r["image"] = ""
+ for key, has_asset in (
+     ("input_image_asset", has_input_images),
+     ("mask_asset", has_masks),
+     ("ground_truth_image_asset", has_gt_images),
+ ):
+     if has_asset:
+         if not r.get(key):
+             r[key] = None
+     else:
+         r[key] = ""

return datasets.Dataset.from_list(all_rows, features=features)
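
Review note: each asset column gets its type from whether any row carries a value. If at least one row does, the column is declared as `datasets.Image()` and empty entries become `None`; otherwise it is declared as a string column and every entry becomes `""`. The sketch below mirrors that rule in plain Python without importing `datasets`. `normalize_asset_columns` and the `"image"`/`"string"` labels are hypothetical names used only for illustration.

```python
from typing import Any, Dict, List

ASSET_KEYS = ("input_image_asset", "mask_asset", "ground_truth_image_asset")


def normalize_asset_columns(rows: List[Dict[str, Any]]) -> Dict[str, str]:
    """Normalize asset values in place and report the column kind chosen per key."""
    kinds: Dict[str, str] = {}
    for key in ASSET_KEYS:
        # Same test as the diff: a column counts as populated if any row is not None.
        has_asset = any(r.get(key) is not None for r in rows)
        kinds[key] = "image" if has_asset else "string"
        for r in rows:
            if has_asset:
                if not r.get(key):
                    r[key] = None  # image-typed column: missing entries are None
            else:
                r[key] = ""  # string-typed column: every entry is the empty string
    return kinds


if __name__ == "__main__":
    rows = [
        {"input_image_asset": "a.png", "mask_asset": None},
        {"input_image_asset": None, "mask_asset": None},
    ]
    print(normalize_asset_columns(rows))
    print(rows)
```

The practical effect is that a config with no masks at all still uploads with a consistent schema, at the cost of the column type differing between configs.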

@@ -233,7 +278,7 @@ def generate_dataset_card(config_names: Optional[List[str]] = None) -> str:

# GDB: GraphicDesignBench

- 39 benchmarks for evaluating vision-language models on graphic design tasks — layout, typography, SVG, template matching, animation. Built on 1,148 real design layouts from the [Lica dataset](https://lica.world).
+ 40 benchmarks for evaluating vision-language models on graphic design tasks — layout, typography, SVG, template matching, animation. Built on 1,148 real design layouts from the [Lica dataset](https://lica.world).

**Paper:** [arXiv:2604.04192](https://arxiv.org/abs/2604.04192)  |  **Code:** [github.com/lica-world/GDB](https://github.com/lica-world/GDB)  |  **Blog:** [lica.world](https://lica.world/blog/gdb-real-world-benchmark-for-graphic-design)

@@ -256,6 +301,9 @@ def generate_dataset_card(config_names: Optional[List[str]] = None) -> str:
| `prompt` | string | Evaluation prompt |
| `ground_truth` | string | Expected answer (JSON for complex types) |
| `image` | Image | Input image (when applicable) |
+ | `input_image_asset` | Image | Auxiliary source image for generation/editing tasks |
+ | `mask_asset` | Image | Auxiliary edit mask for generation/editing tasks |
+ | `ground_truth_image_asset` | Image | Auxiliary target/reference image for generation metrics |
| `metadata` | string | Task-specific fields as JSON |

## Evaluation
2 changes: 1 addition & 1 deletion src/gdb/cli.py
@@ -66,7 +66,7 @@
"anthropic": "claude-sonnet-4-20250514",
"hf": "Qwen/Qwen3-VL-4B-Instruct",
"vllm": "Qwen/Qwen3-VL-4B-Instruct",
- "diffusion": "flux.2-klein-4b",
+ "diffusion": "flux.2-klein-9b",
"custom": "custom-entrypoint",
}
