NO-ISSUE: feat(preset): add NVIDIA vLLM v0.20.x/v0.22.0 presets by bongwoobak · Pull Request #134 · moreh-dev/mif

bongwoobak · 2026-06-15T13:05:16Z

What

Add InferenceServiceTemplate Helm presets that were missing from this repo, under templates/presets/vllm/.

- v0.22.0 (7): current-generation presets
- v0.20.1 / v0.20.2 (6): the preceding generation (rollback references)

| Version | Preset | Parallelism |
| --- | --- | --- |
| v0.22.0 | `deepseek-ai-deepseek-v4-flash-mtp-…-h200-sxm-tp8-moe-ep8` | tp8 + EP |
| v0.22.0 | `deepseek-ai-deepseek-v4-pro-mtp-…-h200-sxm-tp8-moe-ep8` | tp8 + EP |
| v0.22.0 | `google-gemma-4-31b-it-…-l40s-tp4` | tp4 |
| v0.22.0 | `zai-org-glm-5.1-fp8-mtp-…-b300-tp8-moe-tp8` | tp8 |
| v0.22.0 | `openai-gpt-oss-120b-…-h100-nvl-tp4` | tp4 |
| v0.22.0 | `moonshotai-kimi-k2.6-…-b300-tp8` | tp8 |
| v0.22.0 | `qwen-qwen3.6-27b-mtp-…-l40s-tp4` | tp4 |
| v0.20.2 | deepseek-v4-flash/pro × {`dp8-moe-ep8`, `tp8-moe-ep8`} | DP/TP + EP |
| v0.20.2 | `qwen-qwen3.6-27b-mtp-…-l40s-tp4` | tp4 |
| v0.20.1 | `moonshotai-kimi-k2.6-…-b300-tp8` | tp8 |

Normalization (generality + spec-matching names)

These templates deviated from repo conventions. Normalized before committing:

- Expert parallelism via spec.parallelism.expert, not hardcoded --enable-expert-parallel in ISVC_EXTRA_ARGS; the vllm runtime base assembles the flag. The DeepSeek-V4 single-engine variants are therefore named tp8-moe-ep8 so the name reflects the real topology. (Behavior-preserving: same final vLLM command.)
- -mtp suffix + mif.moreh.io/model.mtp: "true" on MTP presets (DeepSeek-V4, GLM-5.1, Qwen3.6-27B). Eagle3 presets (GPT-OSS, Kimi) intentionally have no -mtp.
- nodeSelector retained (moai.moreh.io/accelerator.{vendor,model}) to match existing presets and keep scheduling deterministic, per deploy/helm/AGENTS.md.
- Dropped ISVC_USE_KV_EVENTS (Heimdall inferencePool coupling) and --no-enable-prefix-caching (a user-level knob per AGENTS.md; no other gpt-oss preset hardcodes it), so presets render standalone. Users opt in at the InferenceService.
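
The naming convention described above can be sketched in Python. This is a minimal illustration, not code from the repo: `PresetId` and its field names are hypothetical, and the component order (runtime, version, org, model, optional `mtp`, vendor, accelerator, parallelism) is inferred from the preset names quoted on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PresetId:
    """Components of a preset name (hypothetical helper, field names are illustrative)."""

    runtime: str       # e.g. "vllm"
    version: str       # e.g. "v0.22.0"
    org: str           # e.g. "openai"
    model: str         # e.g. "gpt-oss-120b"
    vendor: str        # e.g. "nvidia"
    accelerator: str   # e.g. "h100-nvl"
    parallelism: str   # e.g. "tp4" or "tp8-moe-ep8"
    mtp: bool = False  # True only for MTP presets; Eagle3 presets stay False

    def name(self) -> str:
        """Join the components with '-', placing 'mtp' right after the model."""
        parts = [self.runtime, self.version, self.org, self.model]
        if self.mtp:
            parts.append("mtp")
        parts += [self.vendor, self.accelerator, self.parallelism]
        return "-".join(parts)


# Both names below appear verbatim in review diff fragments on this page.
gpt_oss = PresetId("vllm", "v0.22.0", "openai", "gpt-oss-120b",
                   "nvidia", "h100-nvl", "tp4")
glm = PresetId("vllm", "v0.23.0", "zai-org", "glm-5.2-fp8",
               "nvidia", "b300", "tp8-moe-tp8", mtp=True)

print(gpt_oss.name())  # vllm-v0.22.0-openai-gpt-oss-120b-nvidia-h100-nvl-tp4
print(glm.name())      # vllm-v0.23.0-zai-org-glm-5.2-fp8-mtp-nvidia-b300-tp8-moe-tp8
```

Keeping `mtp` as a separate boolean rather than part of the model string mirrors the rule that the suffix and the mif.moreh.io/model.mtp label travel together.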

⚠️ Note for reviewers

Because names were normalized, these repo presets do not match the previously-deployed IST names (e.g. …-deepseek-v4-pro-nvidia-h200-sxm-tp8 ↔ repo …-deepseek-v4-pro-mtp-nvidia-h200-sxm-tp8-moe-ep8). Adopting these as the source of truth requires a follow-up to update the InferenceService templateRefs (and re-apply) — out of scope for this PR.

GLM-5.1 B300 uses the generic vllm/vllm-openai:v0.22.0 image (not the v0.19.0-era glm51-cu130); v0.22.0 mainline bundles B300/transformers support.

Verification

helm template → all 13 templates render
helm lint → 0 failures

Add InferenceServiceTemplates that were missing from the repo. Covers the v0.22.0 set (7 presets) plus the preceding v0.20.1/v0.20.2 generation (6 presets). Models: DeepSeek-V4-Flash/Pro, Gemma-4-31B-it, GLM-5.1-FP8, GPT-OSS-120B, Kimi-K2.6, Qwen3.6-27B across H200-SXM/B300/L40S/H100-NVL.

Normalized to repo conventions so names match the spec and presets stay deployment-agnostic:

- Expert parallelism is declared via spec.parallelism.expert (the vllm runtime base assembles --enable-expert-parallel) instead of hardcoding the flag in ISVC_EXTRA_ARGS. The DeepSeek-V4 single-engine variants are therefore renamed tp8 -> tp8-moe-ep8 to reflect the actual topology.
- MTP-using presets carry the -mtp suffix and mif.moreh.io/model.mtp label (DeepSeek-V4, GLM-5.1, Qwen3.6-27B).
- Removed nodeSelector; GPU targeting is expressed via the mif.moreh.io/accelerator.* labels and pinned by the InferenceService.
- Dropped ISVC_USE_KV_EVENTS (Heimdall inferencePool coupling) and the DP-only --no-enable-prefix-caching so the presets render standalone.

Verified with `helm template` (13 templates render) and `helm lint`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR ports a set of Odin InferenceServiceTemplate Helm preset templates for NVIDIA vLLM (v0.22.0 plus rollback-era v0.20.1/v0.20.2) into deploy/helm/moai-inference-preset/templates/presets/vllm/, with naming/label normalization for model org/name, MTP tagging, and parallelism topology.

Changes:

Add 7 new vLLM v0.22.0 presets (DeepSeek-V4 Flash/Pro, Gemma 4 31B IT, Qwen3.6 27B MTP, GLM-5.1 FP8 MTP, GPT-OSS 120B, Kimi K2.6).
Add 6 rollback/reference presets for v0.20.2 (DeepSeek-V4 Flash/Pro TP/DP variants, Qwen3.6 27B MTP) and v0.20.1 (Kimi K2.6).
Normalize labels and spec.parallelism.expert usage for MoE/EP vs relying on extra CLI args.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/zai-org-glm-5.1-fp8-mtp-nvidia-b300-tp8-moe-tp8.helm.yaml | Add GLM-5.1 FP8 MTP preset for B300 TP8. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/qwen-qwen3.6-27b-mtp-nvidia-l40s-tp4.helm.yaml | Add Qwen3.6 27B MTP preset for L40S TP4. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/openai-gpt-oss-120b-nvidia-h100-nvl-tp4.helm.yaml | Add GPT-OSS 120B preset for H100 NVL TP4 (Eagle3 draft config). |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/moonshotai-kimi-k2.6-nvidia-b300-tp8.helm.yaml | Add Kimi K2.6 preset for B300 TP8. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/google-gemma-4-31b-it-nvidia-l40s-tp4.helm.yaml | Add Gemma 4 31B IT preset for L40S TP4. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/deepseek-ai-deepseek-v4-pro-mtp-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml | Add DeepSeek-V4 Pro MTP preset for H200 SXM TP8 + EP. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/deepseek-ai-deepseek-v4-flash-mtp-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml | Add DeepSeek-V4 Flash MTP preset for H200 SXM TP8 + EP. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.2/qwen-qwen3.6-27b-mtp-nvidia-l40s-tp4.helm.yaml | Add rollback-era Qwen3.6 27B MTP preset for v0.20.2. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.2/deepseek-ai-deepseek-v4-pro-mtp-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml | Add rollback-era DeepSeek-V4 Pro TP8 + EP preset for v0.20.2. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.2/deepseek-ai-deepseek-v4-pro-mtp-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml | Add rollback-era DeepSeek-V4 Pro DP8 + EP preset for v0.20.2. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.2/deepseek-ai-deepseek-v4-flash-mtp-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml | Add rollback-era DeepSeek-V4 Flash TP8 + EP preset for v0.20.2. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.2/deepseek-ai-deepseek-v4-flash-mtp-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml | Add rollback-era DeepSeek-V4 Flash DP8 + EP preset for v0.20.2. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/moonshotai-kimi-k2.6-nvidia-b300-tp8.helm.yaml | Add rollback-era Kimi K2.6 preset for v0.20.1. |

…oss-120b preset

Per deploy/helm/AGENTS.md this flag is a user-level tuning knob, not preset-defined, and no other gpt-oss preset in the repo hardcodes it. Deployments opt in via the InferenceService.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…sets

Existing vLLM presets consistently carry the moai.moreh.io/accelerator.{vendor,model} nodeSelector (per deploy/helm/AGENTS.md it is preset-defined), so add it back to all 13 new presets to keep scheduling deterministic and aligned with the rest of the chart.

Verified with `helm template` and `helm lint`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

+        - name: main
+          image: vllm/vllm-openai:v0.22.0
+          env:


+      nodeSelector:
+        moai.moreh.io/accelerator.vendor: nvidia
+        moai.moreh.io/accelerator.model: l40s
+      tolerations:


…0 image

The v0.19.0 GLM-5.1 B300 preset needed the dedicated glm51-cu130 image; v0.22.0 mainline bundles B300 (SM103) + transformers support, so the generic image is sufficient. Add an in-file comment to explain the divergence from the older preset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

nulledge · 2026-06-16T03:02:58Z

-          image: vllm/vllm-openai:v0.20.1
+          image: vllm/vllm-openai:v0.22.0
          env:
            - name: ISVC_EXTRA_ARGS


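The --no-enable-prefix-caching option comes from the vLLM recipe.
Source: https://recipes.vllm.ai/openai/gpt-oss-120b

This is inconsistent with the aiand deployment.

aiand: keeps --no-enable-prefix-caching

PR: absent

We have not been able to test gpt-oss on NVIDIA hardware without --no-enable-prefix-caching, so there is not enough evidence to make a decision.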

…esets (#135)

Add two InferenceServiceTemplate presets:

- vllm-v0.23.0-zai-org-glm-5.2-fp8-mtp-nvidia-b300-tp8-moe-tp8: GLM-5.2 needs vLLM v0.23.0 day-0 support (v0.22.0 is insufficient). Follows the recipe's 8xB200/B300 full-1M config (fp8_e4m3 KV cache, MTP num_speculative_tokens 5, max-num-seqs 32, VLLM_DEEP_GEMM_WARMUP=skip). Drops --trust-remote-code since the repo ships no remote .py (unlike the GLM-5.1 preset).
- vllm-v0.22.0-moonshotai-kimi-k2.7-code-nvidia-b300-tp8: verified on vLLM >= 0.19.1, so the v0.22.0 image suffices. INT4 (compressed-tensors, auto-detected). Carries over K2.6's multimodal tuning and TRTLLM_RAGGED MLA prefill backend since K2.7-Code reuses the K2.5 vision stack. No --speculative-config: the checkpoint has no native MTP (num_nextn_predict_layers=0) and no K2.7 eagle3 draft is published yet.

Both rendered with `helm template -s`.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

+apiVersion: odin.moreh.io/v1alpha1
+kind: InferenceServiceTemplate
+metadata:
+  name: vllm-v0.22.0-openai-gpt-oss-120b-nvidia-h100-nvl-tp4


+    mif.moreh.io/role: e2e
+    mif.moreh.io/accelerator.vendor: nvidia
+    mif.moreh.io/accelerator.model: h100-nvl
+    mif.moreh.io/parallelism: "tp4"
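
The labels in this fragment can be cross-checked against the preset name mechanically. The following is a minimal sketch, assuming the labels belong to the gpt-oss-120b preset named in the preceding fragment and that a name always ends in `<vendor>-<accelerator model>-<parallelism>`, with `-mtp` placed directly before the vendor. `check_labels` is a hypothetical helper, not part of the repo.

```python
from __future__ import annotations


def check_labels(name: str, labels: dict[str, str]) -> list[str]:
    """Return human-readable mismatches between a preset name and its labels."""
    problems = []
    vendor = labels.get("mif.moreh.io/accelerator.vendor", "")
    model = labels.get("mif.moreh.io/accelerator.model", "")
    parallelism = labels.get("mif.moreh.io/parallelism", "")

    # Assumption: the name tail is vendor, accelerator model, parallelism.
    expected_tail = f"-{vendor}-{model}-{parallelism}"
    if not name.endswith(expected_tail):
        problems.append(f"name does not end with {expected_tail!r}")

    # Assumption: MTP presets carry "-mtp" immediately before the vendor.
    has_mtp_suffix = f"-mtp-{vendor}-" in name
    has_mtp_label = labels.get("mif.moreh.io/model.mtp") == "true"
    if has_mtp_suffix != has_mtp_label:
        problems.append("-mtp suffix and mif.moreh.io/model.mtp label disagree")
    return problems


labels = {
    "mif.moreh.io/role": "e2e",
    "mif.moreh.io/accelerator.vendor": "nvidia",
    "mif.moreh.io/accelerator.model": "h100-nvl",
    "mif.moreh.io/parallelism": "tp4",
}
name = "vllm-v0.22.0-openai-gpt-oss-120b-nvidia-h100-nvl-tp4"
print(check_labels(name, labels))  # []
```

An empty list means the name and labels agree. An Eagle3 preset such as this one should have neither the `-mtp` suffix nor the mtp label.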


+apiVersion: odin.moreh.io/v1alpha1
+kind: InferenceServiceTemplate
+metadata:
+  name: vllm-v0.23.0-zai-org-glm-5.2-fp8-mtp-nvidia-b300-tp8-moe-tp8


bongwoobak requested a review from a team as a code owner June 15, 2026 13:05

Copilot AI review requested due to automatic review settings June 15, 2026 13:05

gitgod-bot assigned bongwoobak Jun 15, 2026

Copilot started reviewing on behalf of bongwoobak June 15, 2026 13:06

Copilot AI reviewed Jun 15, 2026

bongwoobak and others added 2 commits June 15, 2026 22:16

Copilot AI review requested due to automatic review settings June 15, 2026 13:23

Copilot started reviewing on behalf of bongwoobak June 15, 2026 13:23

Copilot AI reviewed Jun 15, 2026

nulledge reviewed Jun 16, 2026

bongwoobak mentioned this pull request Jun 18, 2026

NO-ISSUE: feat(preset): add GLM-5.2 and Kimi-K2.7-Code NVIDIA B300 presets #135

Merged

Copilot AI review requested due to automatic review settings June 18, 2026 14:53

Copilot started reviewing on behalf of hhk7734 June 18, 2026 14:53

Copilot AI reviewed Jun 18, 2026

bongwoobak changed the title ~~NO-ISSUE: feat(preset): add NVIDIA vLLM v0.20.x/v0.22.0 presets from aiand-rke2~~ NO-ISSUE: feat(preset): add NVIDIA vLLM v0.20.x/v0.22.0 presets Jun 18, 2026

bongwoobak force-pushed the feat/aiand-nvidia-presets-v020-v022 branch from 134aef4 to d8083b6 Compare June 18, 2026 15:23

bongwoobak wants to merge 5 commits into main from feat/aiand-nvidia-presets-v020-v022

3 participants
