Add Gemma 4 vLLM runtimes #599

Open
ankrovv wants to merge 5 commits into ome-projects:main from ankrovv:feat/gemma4

@ankrovv commented on May 5, 2026

Summary

Adds Gemma 4 vLLM ClusterServingRuntime manifests:

  • vllm-gemma-4-tp1: E2B, E4B, and 26B-A4B
  • vllm-gemma-4-tp2: 31B

Both runtimes match Gemma4ForConditionalGeneration and use per-accelerator tensor parallel overrides for A100-80G, H100, H200, and B200.

Config

  • Engine image: vLLM nightly v0.19.1.dev6+g6d4a8e6d2 with transformers 5.5.0.dev0
  • Router image: docker.io/lightseekorg/smg:1.4.1
  • Gemma 4 parser flags: --reasoning-parser=gemma4, --tool-call-parser=gemma4, --enable-auto-tool-choice
  • Long-context flags: --max-model-len=-1, --no-scheduler-reserve-full-isl
  • Multimodal caps:
    • tp1: image=10,audio=1,video=1
    • tp2: image=10,audio=0,video=1
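As a rough illustration of how the flags listed above might come together on a vLLM command line, the sketch below assembles an argument list for each runtime. The helper `build_engine_args` is hypothetical, and the use of `--tensor-parallel-size` and `--limit-mm-per-prompt` (with `key=value` formatting) to carry the TP degree and the multimodal caps is an assumption; the manifests in this PR may spell them differently.

```python
from typing import Dict, List

# Flags quoted from the PR description.
PARSER_FLAGS = [
    "--reasoning-parser=gemma4",
    "--tool-call-parser=gemma4",
    "--enable-auto-tool-choice",
]
LONG_CONTEXT_FLAGS = ["--max-model-len=-1", "--no-scheduler-reserve-full-isl"]

# Multimodal caps per runtime, as listed in the PR description.
MM_CAPS: Dict[str, Dict[str, int]] = {
    "tp1": {"image": 10, "audio": 1, "video": 1},
    "tp2": {"image": 10, "audio": 0, "video": 1},
}


def build_engine_args(runtime: str, tensor_parallel_size: int) -> List[str]:
    """Assemble a hypothetical vLLM argument list for one runtime."""
    caps = ",".join(f"{kind}={limit}" for kind, limit in MM_CAPS[runtime].items())
    return [
        *PARSER_FLAGS,
        *LONG_CONTEXT_FLAGS,
        # Assumed flag names and value format; not confirmed by the PR text.
        f"--tensor-parallel-size={tensor_parallel_size}",
        f"--limit-mm-per-prompt={caps}",
    ]
```

Calling `build_engine_args("tp2", 2)` yields the shared parser and long-context flags plus the tp2-specific caps, which makes the only difference between the two runtimes (audio disabled on tp2) easy to spot.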

Validation

  • Quality and feature validation passed for all four IT variants.
  • Context length verified up to 250K tokens on 26B-A4B and 31B.
  • Runtime smoke validation passed for the declared accelerator classes.
  • modelSizeRange keeps tp1 and tp2 auto-selection non-overlapping.
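The non-overlap claim can be checked mechanically. The sketch below parses size strings such as `4.6B` and tests whether two ranges intersect, treating both bounds as inclusive (an assumption about how `modelSizeRange` is interpreted). The tp1 bounds come from the `modelSizeRange` shown in the review thread; the tp2 bounds are hypothetical placeholders, since the PR text does not state them.

```python
from typing import Tuple

_SUFFIX = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_size(text: str) -> float:
    """Convert a size string such as '4.6B' to a parameter count."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIX:
        return float(text[:-1]) * _SUFFIX[text[-1]]
    return float(text)


def ranges_overlap(a: Tuple[str, str], b: Tuple[str, str]) -> bool:
    """Return True if the inclusive ranges [a_min, a_max] and [b_min, b_max] intersect."""
    a_min, a_max = map(parse_size, a)
    b_min, b_max = map(parse_size, b)
    return a_min <= b_max and b_min <= a_max


TP1_RANGE = ("4.6B", "27.7B")  # from the diff context in this PR
TP2_RANGE = ("28B", "35B")     # hypothetical; not stated in the PR
```

With these values `ranges_overlap(TP1_RANGE, TP2_RANGE)` is false, so a model of a given size would match at most one of the two runtimes.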

Test plan

kubectl apply --dry-run=server -f config/runtimes/vllm/gemma-4-tp1-rt.yaml
kubectl apply --dry-run=server -f config/runtimes/vllm/gemma-4-tp2-rt.yaml
kubectl kustomize config/runtimes

@github-actions (bot) added the runtime (Runtime configuration changes) and config (Configuration changes) labels on May 5, 2026
@ankrovv changed the title from "Add Gemma 4 vLLM runtimes (TP1 and TP2)" to "Add Gemma 4 vLLM runtimes" on May 5, 2026
Collaborator

@YouNeedCryDear YouNeedCryDear left a comment

Could you also add the ClusterBaseModel and a sample InferenceService? You can refer to what I did for DeepSeek V4 in #598. @ankrovv

@github-actions (bot) added the models (Model configuration changes) label on May 5, 2026
Comment thread config/runtimes/vllm/gemma-4-tp1-rt.yaml
Comment thread config/runtimes/vllm/gemma-4-tp1-rt.yaml Outdated
Comment thread config/runtimes/kustomization.yaml Outdated
version: "1.0.0"
modelSizeRange:
  min: 4.6B
  max: 27.7B
Collaborator

This change doesn't follow the old pattern of creating one runtime per model. Why are we switching to one runtime for multiple models?

Author

All Gemma 4 models share the same underlying architecture, so creating one runtime per model would be redundant. Consolidating by TP size also leaves room to support any new Gemma 4 models that Google releases with the same architecture. cc @YouNeedCryDear for more context

Collaborator

Correct. We are moving away from one runtime per model because it doesn't scale. If multiple models in the same family share the same architecture, the same format, and essentially the same engine config, we combine them into a single runtime. Parallelism and engine-arg overrides are controlled at the Accelerator Class level. Please let me know if you have any concerns about this @XinyueZhang369
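The consolidation described in this thread can be pictured as a shared base config plus per-accelerator overrides. The sketch below is a simplified model, not OME's actual selection logic: the `Runtime` dataclass, the `resolve_engine_args` helper, the accelerator class name, and the override values are all hypothetical, chosen only to show how one runtime could serve several accelerator classes.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Runtime:
    name: str
    architecture: str
    # Engine args shared by every model the runtime serves.
    base_args: Dict[str, str]
    # Per-accelerator-class overrides, merged on top of base_args.
    overrides: Dict[str, Dict[str, str]] = field(default_factory=dict)


def resolve_engine_args(runtime: Runtime, accelerator_class: str) -> Dict[str, str]:
    """Merge base args with the overrides for one accelerator class."""
    merged = dict(runtime.base_args)  # copy so the runtime itself is untouched
    merged.update(runtime.overrides.get(accelerator_class, {}))
    return merged


# Hypothetical example: the class name and override value are illustrative only.
tp2 = Runtime(
    name="vllm-gemma-4-tp2",
    architecture="Gemma4ForConditionalGeneration",
    base_args={"--tensor-parallel-size": "2", "--max-model-len": "-1"},
    overrides={"example-accelerator-a": {"--tensor-parallel-size": "4"}},
)
```

An accelerator class with no entry in `overrides` simply receives the base args, so adding a new model of the same architecture requires no new runtime, only (at most) a new override entry.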

Labels

config Configuration changes models Model configuration changes runtime Runtime configuration changes

3 participants