Add Gemma 4 vLLM runtimes #599
version: "1.0.0"
modelSizeRange:
  min: 4.6B
  max: 27.7B
This change doesn't follow the old pattern of creating one runtime per model. Why are we switching to one runtime for multiple models?
All the Gemma 4 models use the same underlying model architecture, so creating one runtime per model would be redundant. Consolidating by TP size also leaves room for future expansion if Google releases new Gemma 4 models using the same architecture. cc @YouNeedCryDear for more context.
Correct. We are moving away from one runtime per model because it does not scale. If multiple models in the same family share the same architecture, the same format, and essentially the same engine config, we combine them into a single runtime. Parallelism and engine-arg overrides will be controlled at the Accelerator Class level. Please let me know if you have any concerns, @XinyueZhang369.
Summary
Adds Gemma 4 vLLM `ClusterServingRuntime` manifests:
- `vllm-gemma-4-tp1`: E2B, E4B, and 26B-A4B
- `vllm-gemma-4-tp2`: 31B

Both runtimes match `Gemma4ForConditionalGeneration` and use per-accelerator tensor parallel overrides for A100-80G, H100, H200, and B200.

Config
- `v0.19.1.dev6+g6d4a8e6d2` with transformers `5.5.0.dev0`
- `docker.io/lightseekorg/smg:1.4.1`
- `--reasoning-parser=gemma4`, `--tool-call-parser=gemma4`, `--enable-auto-tool-choice`
- `--max-model-len=-1`, `--no-scheduler-reserve-full-isl`
- `image=10,audio=1,video=1`
- `image=10,audio=0,video=1`

Validation
- `modelSizeRange` keeps tp1 and tp2 auto-selection non-overlapping.

Test plan
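The non-overlap property noted under Validation could be checked with something like the sketch below. It is not part of this PR or its test plan. It assumes sizes are written like `4.6B` and that a runtime is eligible when the model size falls inside its inclusive `[min, max]` range. The `4.6B` to `27.7B` range comes from the diff excerpt in this thread, and attributing it to tp1 is an assumption. The tp2 range and all helper names are hypothetical.

```python
from itertools import combinations

# Multipliers for the size suffixes this sketch understands (assumed notation).
UNITS = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_size(text: str) -> float:
    """Convert a size string such as '4.6B' into a parameter count."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def ranges_overlap(a: tuple, b: tuple) -> bool:
    """Return True if two inclusive (min, max) size ranges share any point."""
    a_min, a_max = (parse_size(s) for s in a)
    b_min, b_max = (parse_size(s) for s in b)
    return a_min <= b_max and b_min <= a_max


def find_overlaps(ranges: dict) -> list:
    """List every pair of runtime names whose size ranges overlap."""
    return [
        (x, y)
        for x, y in combinations(sorted(ranges), 2)
        if ranges_overlap(ranges[x], ranges[y])
    ]


ranges = {
    # From the diff excerpt; assigning it to tp1 is an assumption.
    "vllm-gemma-4-tp1": ("4.6B", "27.7B"),
    # Hypothetical values, not taken from the PR.
    "vllm-gemma-4-tp2": ("30B", "32B"),
}

print(find_overlaps(ranges))
```

An empty list means no model size can match both runtimes. Ranges that merely touch at a boundary (for example `1B` to `5B` and `5B` to `9B`) are reported as overlapping, since both bounds are treated as inclusive here.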