NO-ISSUE: feat(preset): add NVIDIA vLLM v0.20.x/v0.22.0 presets #134
Open
bongwoobak wants to merge 5 commits into
Add InferenceServiceTemplates that were missing from the repo. Covers the v0.22.0 set (7 presets) plus the preceding v0.20.1/v0.20.2 generation (6 presets). Models: DeepSeek-V4-Flash/Pro, Gemma-4-31B-it, GLM-5.1-FP8, GPT-OSS-120B, Kimi-K2.6, Qwen3.6-27B across H200-SXM/B300/L40S/H100-NVL.

Normalized to repo conventions so names match the spec and presets stay deployment-agnostic:
- Expert parallelism is declared via spec.parallelism.expert (the vllm runtime base assembles --enable-expert-parallel) instead of hardcoding the flag in ISVC_EXTRA_ARGS. The DeepSeek-V4 single-engine variants are therefore renamed tp8 -> tp8-moe-ep8 to reflect the actual topology.
- MTP-using presets carry the -mtp suffix and mif.moreh.io/model.mtp label (DeepSeek-V4, GLM-5.1, Qwen3.6-27B).
- Removed nodeSelector; GPU targeting is expressed via the mif.moreh.io/accelerator.* labels and pinned by the InferenceService.
- Dropped ISVC_USE_KV_EVENTS (Heimdall inferencePool coupling) and the DP-only --no-enable-prefix-caching so the presets render standalone.

Verified with `helm template` (13 templates render) and `helm lint`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
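The naming and labeling rules in the commit message above can be checked mechanically. The following is a minimal sketch, not part of the PR. It assumes preset metadata is already available as plain Python dicts, that the MTP label value is the string `"true"`, and that a preset name embeds the `<vendor>-<model>` pair from its accelerator labels (inferred from the preset names in this PR). The function name `check_preset` is hypothetical.

```python
MTP_LABEL = "mif.moreh.io/model.mtp"
VENDOR_LABEL = "mif.moreh.io/accelerator.vendor"
MODEL_LABEL = "mif.moreh.io/accelerator.model"


def check_preset(name: str, labels: dict[str, str]) -> list[str]:
    """Return a list of convention violations for one preset (empty if none)."""
    problems = []

    # Rule from the PR: MTP presets carry "-mtp" in the name and the mtp label.
    # Assumption: the label value is the string "true".
    name_has_mtp = "-mtp-" in name
    label_has_mtp = labels.get(MTP_LABEL) == "true"
    if name_has_mtp != label_has_mtp:
        problems.append("mtp suffix and mtp label disagree")

    # Assumption: the name embeds the same "<vendor>-<model>" pair as the labels.
    vendor = labels.get(VENDOR_LABEL)
    model = labels.get(MODEL_LABEL)
    if not vendor or not model:
        problems.append("missing accelerator labels")
    elif f"-{vendor}-{model}-" not in name:
        problems.append("accelerator labels do not match name")

    return problems


# Usage: name and labels taken from the GPT-OSS preset shown in this PR.
gpt_oss = check_preset(
    "vllm-v0.22.0-openai-gpt-oss-120b-nvidia-h100-nvl-tp4",
    {VENDOR_LABEL: "nvidia", MODEL_LABEL: "h100-nvl"},
)
print(gpt_oss)  # an empty list means no violations were found
```

Returning a list of messages rather than raising on the first problem makes it easy to report every mismatch for every preset in a single pass.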
Contributor
Pull request overview
This PR ports a set of Odin InferenceServiceTemplate Helm preset templates for NVIDIA vLLM (v0.22.0 plus rollback-era v0.20.1/v0.20.2) into deploy/helm/moai-inference-preset/templates/presets/vllm/, with naming/label normalization for model org/name, MTP tagging, and parallelism topology.
Changes:
- Add 7 new vLLM v0.22.0 presets (DeepSeek-V4 Flash/Pro, Gemma 4 31B IT, Qwen3.6 27B MTP, GLM-5.1 FP8 MTP, GPT-OSS 120B, Kimi K2.6).
- Add 6 rollback/reference presets for v0.20.2 (DeepSeek-V4 Flash/Pro TP/DP variants, Qwen3.6 27B MTP) and v0.20.1 (Kimi K2.6).
- Normalize labels and `spec.parallelism.expert` usage for MoE/EP instead of relying on extra CLI args.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/zai-org-glm-5.1-fp8-mtp-nvidia-b300-tp8-moe-tp8.helm.yaml | Add GLM-5.1 FP8 MTP preset for B300 TP8. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/qwen-qwen3.6-27b-mtp-nvidia-l40s-tp4.helm.yaml | Add Qwen3.6 27B MTP preset for L40S TP4. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/openai-gpt-oss-120b-nvidia-h100-nvl-tp4.helm.yaml | Add GPT-OSS 120B preset for H100 NVL TP4 (Eagle3 draft config). |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/moonshotai-kimi-k2.6-nvidia-b300-tp8.helm.yaml | Add Kimi K2.6 preset for B300 TP8. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/google-gemma-4-31b-it-nvidia-l40s-tp4.helm.yaml | Add Gemma 4 31B IT preset for L40S TP4. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/deepseek-ai-deepseek-v4-pro-mtp-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml | Add DeepSeek-V4 Pro MTP preset for H200 SXM TP8 + EP. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/deepseek-ai-deepseek-v4-flash-mtp-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml | Add DeepSeek-V4 Flash MTP preset for H200 SXM TP8 + EP. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.2/qwen-qwen3.6-27b-mtp-nvidia-l40s-tp4.helm.yaml | Add rollback-era Qwen3.6 27B MTP preset for v0.20.2. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.2/deepseek-ai-deepseek-v4-pro-mtp-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml | Add rollback-era DeepSeek-V4 Pro TP8 + EP preset for v0.20.2. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.2/deepseek-ai-deepseek-v4-pro-mtp-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml | Add rollback-era DeepSeek-V4 Pro DP8 + EP preset for v0.20.2. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.2/deepseek-ai-deepseek-v4-flash-mtp-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml | Add rollback-era DeepSeek-V4 Flash TP8 + EP preset for v0.20.2. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.2/deepseek-ai-deepseek-v4-flash-mtp-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml | Add rollback-era DeepSeek-V4 Flash DP8 + EP preset for v0.20.2. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/moonshotai-kimi-k2.6-nvidia-b300-tp8.helm.yaml | Add rollback-era Kimi K2.6 preset for v0.20.1. |
…oss-120b preset

Per deploy/helm/AGENTS.md this flag is a user-level tuning knob, not preset-defined, and no other gpt-oss preset in the repo hardcodes it. Deployments opt in via the InferenceService.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sets
Existing vLLM presets consistently carry the
moai.moreh.io/accelerator.{vendor,model} nodeSelector (per
deploy/helm/AGENTS.md it is preset-defined), so add it back to all 13
new presets to keep scheduling deterministic and aligned with the rest
of the chart. Verified with `helm template` and `helm lint`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines +24 to +26

    - name: main
      image: vllm/vllm-openai:v0.22.0
      env:
Comment on lines +43 to +46

    nodeSelector:
      moai.moreh.io/accelerator.vendor: nvidia
      moai.moreh.io/accelerator.model: l40s
    tolerations:
…0 image

The v0.19.0 GLM-5.1 B300 preset needed the dedicated glm51-cu130 image; v0.22.0 mainline bundles B300 (SM103) + transformers support, so the generic image is sufficient. Add an in-file comment to explain the divergence from the older preset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
nulledge reviewed Jun 16, 2026
    image: vllm/vllm-openai:v0.20.1
    image: vllm/vllm-openai:v0.22.0
    env:
      - name: ISVC_EXTRA_ARGS
Contributor
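The `--no-enable-prefix-caching` option comes from the vLLM recipe.
Source: https://recipes.vllm.ai/openai/gpt-oss-120b
This is inconsistent with the aiand deployment.
- aiand: keeps `--no-enable-prefix-caching`
- PR: not present
We have not been able to test gpt-oss on NVIDIA hardware without `--no-enable-prefix-caching`, so there is not enough evidence to make a decision.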
…esets (#135)

Add two InferenceServiceTemplate presets:
- vllm-v0.23.0-zai-org-glm-5.2-fp8-mtp-nvidia-b300-tp8-moe-tp8: GLM-5.2 needs vLLM v0.23.0 day-0 support (v0.22.0 is insufficient). Follows the recipe's 8xB200/B300 full-1M config (fp8_e4m3 KV cache, MTP num_speculative_tokens 5, max-num-seqs 32, VLLM_DEEP_GEMM_WARMUP=skip). Drops --trust-remote-code since the repo ships no remote .py (unlike the GLM-5.1 preset).
- vllm-v0.22.0-moonshotai-kimi-k2.7-code-nvidia-b300-tp8: verified on vLLM >= 0.19.1, so the v0.22.0 image suffices. INT4 (compressed-tensors, auto-detected). Carries over K2.6's multimodal tuning and TRTLLM_RAGGED MLA prefill backend since K2.7-Code reuses the K2.5 vision stack. No --speculative-config: the checkpoint has no native MTP (num_nextn_predict_layers=0) and no K2.7 eagle3 draft is published yet.

Both rendered with `helm template -s`.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    apiVersion: odin.moreh.io/v1alpha1
    kind: InferenceServiceTemplate
    metadata:
      name: vllm-v0.22.0-openai-gpt-oss-120b-nvidia-h100-nvl-tp4

    mif.moreh.io/role: e2e
    mif.moreh.io/accelerator.vendor: nvidia
    mif.moreh.io/accelerator.model: h100-nvl
    mif.moreh.io/parallelism: "tp4"

    apiVersion: odin.moreh.io/v1alpha1
    kind: InferenceServiceTemplate
    metadata:
      name: vllm-v0.23.0-zai-org-glm-5.2-fp8-mtp-nvidia-b300-tp8-moe-tp8
134aef4 to d8083b6
What
Add `InferenceServiceTemplate` Helm presets that were missing from this repo, under `templates/presets/vllm/`.
- v0.22.0:
  - `deepseek-ai-deepseek-v4-flash-mtp-…-h200-sxm-tp8-moe-ep8`
  - `deepseek-ai-deepseek-v4-pro-mtp-…-h200-sxm-tp8-moe-ep8`
  - `google-gemma-4-31b-it-…-l40s-tp4`
  - `zai-org-glm-5.1-fp8-mtp-…-b300-tp8-moe-tp8`
  - `openai-gpt-oss-120b-…-h100-nvl-tp4`
  - `moonshotai-kimi-k2.6-…-b300-tp8`
  - `qwen-qwen3.6-27b-mtp-…-l40s-tp4`
- v0.20.2:
  - `…{dp8-moe-ep8,tp8-moe-ep8}`
  - `qwen-qwen3.6-27b-mtp-…-l40s-tp4`
- v0.20.1:
  - `moonshotai-kimi-k2.6-…-b300-tp8`

Normalization (generality + spec-matching names)
These templates deviated from repo conventions. Normalized before committing:
- `spec.parallelism.expert`, not hardcoded `--enable-expert-parallel` in `ISVC_EXTRA_ARGS`: the `vllm` runtime base assembles the flag. The DeepSeek-V4 single-engine variants are therefore named `tp8-moe-ep8` so the name reflects the real topology. (Behavior-preserving: same final vLLM command.)
- `-mtp` suffix + `mif.moreh.io/model.mtp: "true"` on MTP presets (DeepSeek-V4, GLM-5.1, Qwen3.6-27B). Eagle3 presets (GPT-OSS, Kimi) intentionally have no `-mtp`.
- `nodeSelector` retained (`moai.moreh.io/accelerator.{vendor,model}`) to match existing presets and keep scheduling deterministic, per `deploy/helm/AGENTS.md`.
- Dropped `ISVC_USE_KV_EVENTS` (Heimdall inferencePool coupling) and `--no-enable-prefix-caching` (a user-level knob per AGENTS.md; no other gpt-oss preset hardcodes it), so presets render standalone. Users opt in at the `InferenceService`.

Because names were normalized, these repo presets do not match the previously deployed IST names (e.g. `…-deepseek-v4-pro-nvidia-h200-sxm-tp8` ↔ repo `…-deepseek-v4-pro-mtp-nvidia-h200-sxm-tp8-moe-ep8`). Adopting these as the source of truth requires a follow-up to update the `InferenceService` `templateRef`s (and re-apply), which is out of scope for this PR.

GLM-5.1 B300 uses the generic `vllm/vllm-openai:v0.22.0` image (not the v0.19.0-era `glm51-cu130`); v0.22.0 mainline bundles B300/transformers support.

Verification
- `helm template` → all 13 templates render
- `helm lint` → 0 failures
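The render check above can be turned into a count rather than a visual inspection. This is a minimal sketch under stated assumptions: it takes the text output of `helm template` as a string, relies on Helm separating rendered documents with `---` lines, and matches only a top-level `kind:` key. The function name `count_kind` and the sample input are hypothetical and not part of the PR.

```python
import re


def count_kind(rendered: str, kind: str = "InferenceServiceTemplate") -> int:
    """Count YAML documents in rendered Helm output whose top-level kind matches."""
    count = 0
    # Rendered documents are separated by a line containing only "---".
    for doc in re.split(r"^---[ \t]*$", rendered, flags=re.MULTILINE):
        # A top-level key starts at column 0, so nested "kind:" keys are ignored.
        pattern = rf"^kind:[ \t]*{re.escape(kind)}[ \t]*$"
        if re.search(pattern, doc, flags=re.MULTILINE):
            count += 1
    return count


# Hypothetical sample shaped like `helm template` output.
sample = """---
# Source: example/templates/a.helm.yaml
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceServiceTemplate
metadata:
  name: a
---
# Source: example/templates/b.yaml
apiVersion: v1
kind: ConfigMap
"""
template_count = count_kind(sample)
print(template_count)
```

In practice the input could come from `subprocess.run(["helm", "template", chart_dir], capture_output=True, text=True, check=True).stdout`, where `chart_dir` points at the chart (here presumably `deploy/helm/moai-inference-preset`), and the result can be compared with the number of preset files expected to render.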