NO-ISSUE: feat(preset): add NVIDIA vLLM v0.20.x/v0.22.0 presets #134

Open
bongwoobak wants to merge 5 commits into main from feat/aiand-nvidia-presets-v020-v022

bongwoobak (Member) commented Jun 15, 2026

What

Add InferenceServiceTemplate Helm presets that were missing from this repo, under templates/presets/vllm/.

  • v0.22.0 (7) — current-generation presets
  • v0.20.1 / v0.20.2 (6) — the preceding generation (rollback references)

| Version | Preset | Parallelism |
|---|---|---|
| v0.22.0 | deepseek-ai-deepseek-v4-flash-mtp-…-h200-sxm-tp8-moe-ep8 | tp8 + EP |
| v0.22.0 | deepseek-ai-deepseek-v4-pro-mtp-…-h200-sxm-tp8-moe-ep8 | tp8 + EP |
| v0.22.0 | google-gemma-4-31b-it-…-l40s-tp4 | tp4 |
| v0.22.0 | zai-org-glm-5.1-fp8-mtp-…-b300-tp8-moe-tp8 | tp8 |
| v0.22.0 | openai-gpt-oss-120b-…-h100-nvl-tp4 | tp4 |
| v0.22.0 | moonshotai-kimi-k2.6-…-b300-tp8 | tp8 |
| v0.22.0 | qwen-qwen3.6-27b-mtp-…-l40s-tp4 | tp4 |
| v0.20.2 | deepseek-v4-flash/pro × {dp8-moe-ep8, tp8-moe-ep8} | DP/TP + EP |
| v0.20.2 | qwen-qwen3.6-27b-mtp-…-l40s-tp4 | tp4 |
| v0.20.1 | moonshotai-kimi-k2.6-…-b300-tp8 | tp8 |

Normalization (generality + spec-matching names)

These templates deviated from repo conventions. Normalized before committing:

  1. Expert parallelism via spec.parallelism.expert, not hardcoded --enable-expert-parallel in ISVC_EXTRA_ARGS — the vllm runtime base assembles the flag. The DeepSeek-V4 single-engine variants are therefore named tp8-moe-ep8 so the name reflects the real topology. (Behavior-preserving: same final vLLM command.)
  2. -mtp suffix + mif.moreh.io/model.mtp: "true" on MTP presets (DeepSeek-V4, GLM-5.1, Qwen3.6-27B). Eagle3 presets (GPT-OSS, Kimi) intentionally have no -mtp.
  3. nodeSelector retained (moai.moreh.io/accelerator.{vendor,model}) to match existing presets and keep scheduling deterministic, per deploy/helm/AGENTS.md.
  4. Dropped ISVC_USE_KV_EVENTS (Heimdall inferencePool coupling) and --no-enable-prefix-caching (a user-level knob per AGENTS.md; no other gpt-oss preset hardcodes it), so presets render standalone. Users opt in at the InferenceService.
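
To make the four points above concrete, here is a minimal sketch of what a normalized preset might look like. It is illustrative only and not copied from the PR. The `spec.parallelism.tensor` key, the boolean value of `spec.parallelism.expert`, the `spec.template.spec` nesting, the `h200-sxm` label values, and the `mif.moreh.io/parallelism` value are all assumptions inferred from the naming pattern of the other presets.

```yaml
# Hypothetical sketch of a normalized preset; field nesting is assumed.
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceServiceTemplate
metadata:
  # Name follows the vllm-<version>-<preset> pattern seen on other presets.
  name: vllm-v0.22.0-deepseek-ai-deepseek-v4-pro-mtp-nvidia-h200-sxm-tp8-moe-ep8
  labels:
    mif.moreh.io/model.mtp: "true"              # point 2: MTP presets only
    mif.moreh.io/accelerator.vendor: nvidia
    mif.moreh.io/accelerator.model: h200-sxm    # assumed value
    mif.moreh.io/parallelism: "tp8-moe-ep8"     # assumed value
spec:
  parallelism:
    tensor: 8      # assumed key name
    expert: true   # point 1: the runtime base adds --enable-expert-parallel
  template:
    spec:
      nodeSelector:                             # point 3: retained
        moai.moreh.io/accelerator.vendor: nvidia
        moai.moreh.io/accelerator.model: h200-sxm
      containers:
        - name: main
          image: vllm/vllm-openai:v0.22.0
          # point 4: no ISVC_USE_KV_EVENTS here, and ISVC_EXTRA_ARGS carries
          # neither --enable-expert-parallel nor --no-enable-prefix-caching.
```

The point of the sketch is that topology lives in `spec.parallelism` and in the preset name, while deployment-specific knobs stay out of the preset and are set on the InferenceService.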

⚠️ Note for reviewers

Because names were normalized, these repo presets do not match the previously-deployed IST names (e.g. …-deepseek-v4-pro-nvidia-h200-sxm-tp8 ↔ repo …-deepseek-v4-pro-mtp-nvidia-h200-sxm-tp8-moe-ep8). Adopting these as the source of truth requires a follow-up to update the InferenceService templateRefs (and re-apply) — out of scope for this PR.

GLM-5.1 B300 uses the generic vllm/vllm-openai:v0.22.0 image (not the v0.19.0-era glm51-cu130); v0.22.0 mainline bundles B300/transformers support.

Verification

  • helm template → all 13 templates render
  • helm lint → 0 failures

Add InferenceServiceTemplates that were missing from the repo. Covers the
v0.22.0 set (7 presets) plus the preceding v0.20.1/v0.20.2 generation
(6 presets).

Models: DeepSeek-V4-Flash/Pro, Gemma-4-31B-it, GLM-5.1-FP8, GPT-OSS-120B,
Kimi-K2.6, Qwen3.6-27B across H200-SXM/B300/L40S/H100-NVL.

Normalized to repo conventions so names match the spec and presets stay
deployment-agnostic:

- Expert parallelism is declared via spec.parallelism.expert (the vllm
  runtime base assembles --enable-expert-parallel) instead of hardcoding
  the flag in ISVC_EXTRA_ARGS. The DeepSeek-V4 single-engine variants are
  therefore renamed tp8 -> tp8-moe-ep8 to reflect the actual topology.
- MTP-using presets carry the -mtp suffix and mif.moreh.io/model.mtp label
  (DeepSeek-V4, GLM-5.1, Qwen3.6-27B).
- Removed nodeSelector; GPU targeting is expressed via the
  mif.moreh.io/accelerator.* labels and pinned by the InferenceService.
- Dropped ISVC_USE_KV_EVENTS (Heimdall inferencePool coupling) and the
  DP-only --no-enable-prefix-caching so the presets render standalone.

Verified with `helm template` (13 templates render) and `helm lint`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bongwoobak bongwoobak requested a review from a team as a code owner June 15, 2026 13:05
Copilot AI review requested due to automatic review settings June 15, 2026 13:05

Copilot AI (Contributor) left a comment

Pull request overview

This PR ports a set of Odin InferenceServiceTemplate Helm preset templates for NVIDIA vLLM (v0.22.0 plus rollback-era v0.20.1/v0.20.2) into deploy/helm/moai-inference-preset/templates/presets/vllm/, with naming/label normalization for model org/name, MTP tagging, and parallelism topology.

Changes:

  • Add 7 new vLLM v0.22.0 presets (DeepSeek-V4 Flash/Pro, Gemma 4 31B IT, Qwen3.6 27B MTP, GLM-5.1 FP8 MTP, GPT-OSS 120B, Kimi K2.6).
  • Add 6 rollback/reference presets for v0.20.2 (DeepSeek-V4 Flash/Pro TP/DP variants, Qwen3.6 27B MTP) and v0.20.1 (Kimi K2.6).
  • Normalize labels and spec.parallelism.expert usage for MoE/EP vs relying on extra CLI args.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 14 comments.

| File | Description |
|---|---|
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/zai-org-glm-5.1-fp8-mtp-nvidia-b300-tp8-moe-tp8.helm.yaml | Add GLM-5.1 FP8 MTP preset for B300 TP8. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/qwen-qwen3.6-27b-mtp-nvidia-l40s-tp4.helm.yaml | Add Qwen3.6 27B MTP preset for L40S TP4. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/openai-gpt-oss-120b-nvidia-h100-nvl-tp4.helm.yaml | Add GPT-OSS 120B preset for H100 NVL TP4 (Eagle3 draft config). |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/moonshotai-kimi-k2.6-nvidia-b300-tp8.helm.yaml | Add Kimi K2.6 preset for B300 TP8. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/google-gemma-4-31b-it-nvidia-l40s-tp4.helm.yaml | Add Gemma 4 31B IT preset for L40S TP4. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/deepseek-ai-deepseek-v4-pro-mtp-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml | Add DeepSeek-V4 Pro MTP preset for H200 SXM TP8 + EP. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.22.0/deepseek-ai-deepseek-v4-flash-mtp-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml | Add DeepSeek-V4 Flash MTP preset for H200 SXM TP8 + EP. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.2/qwen-qwen3.6-27b-mtp-nvidia-l40s-tp4.helm.yaml | Add rollback-era Qwen3.6 27B MTP preset for v0.20.2. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.2/deepseek-ai-deepseek-v4-pro-mtp-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml | Add rollback-era DeepSeek-V4 Pro TP8 + EP preset for v0.20.2. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.2/deepseek-ai-deepseek-v4-pro-mtp-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml | Add rollback-era DeepSeek-V4 Pro DP8 + EP preset for v0.20.2. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.2/deepseek-ai-deepseek-v4-flash-mtp-nvidia-h200-sxm-tp8-moe-ep8.helm.yaml | Add rollback-era DeepSeek-V4 Flash TP8 + EP preset for v0.20.2. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.2/deepseek-ai-deepseek-v4-flash-mtp-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml | Add rollback-era DeepSeek-V4 Flash DP8 + EP preset for v0.20.2. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/moonshotai-kimi-k2.6-nvidia-b300-tp8.helm.yaml | Add rollback-era Kimi K2.6 preset for v0.20.1. |

bongwoobak and others added 2 commits June 15, 2026 22:16
…oss-120b preset

Per deploy/helm/AGENTS.md this flag is a user-level tuning knob, not
preset-defined, and no other gpt-oss preset in the repo hardcodes it.
Deployments opt in via the InferenceService.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sets

Existing vLLM presets consistently carry the
moai.moreh.io/accelerator.{vendor,model} nodeSelector (per
deploy/helm/AGENTS.md it is preset-defined), so add it back to all 13
new presets to keep scheduling deterministic and aligned with the rest
of the chart. Verified with `helm template` and `helm lint`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 15, 2026 13:23

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment on lines +24 to +26

    - name: main
      image: vllm/vllm-openai:v0.22.0
      env:

Comment on lines +43 to +46

    nodeSelector:
      moai.moreh.io/accelerator.vendor: nvidia
      moai.moreh.io/accelerator.model: l40s
    tolerations:

…0 image

The v0.19.0 GLM-5.1 B300 preset needed the dedicated glm51-cu130 image;
v0.22.0 mainline bundles B300 (SM103) + transformers support, so the
generic image is sufficient. Add an in-file comment to explain the
divergence from the older preset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
image: vllm/vllm-openai:v0.20.1
image: vllm/vllm-openai:v0.22.0
env:
- name: ISVC_EXTRA_ARGS

Review comment (Contributor):

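The --no-enable-prefix-caching option comes from the vLLM recipe.
Source: https://recipes.vllm.ai/openai/gpt-oss-120b

This is inconsistent with the aiand deployment.

  • aiand: keeps --no-enable-prefix-caching
  • this PR: omits it

We have not been able to test gpt-oss on NVIDIA hardware without --no-enable-prefix-caching, so there is not enough evidence to decide either way.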

…esets (#135)

Add two InferenceServiceTemplate presets:

- vllm-v0.23.0-zai-org-glm-5.2-fp8-mtp-nvidia-b300-tp8-moe-tp8: GLM-5.2 needs
  vLLM v0.23.0 day-0 support (v0.22.0 is insufficient). Follows the recipe's
  8xB200/B300 full-1M config (fp8_e4m3 KV cache, MTP num_speculative_tokens 5,
  max-num-seqs 32, VLLM_DEEP_GEMM_WARMUP=skip). Drops --trust-remote-code since
  the repo ships no remote .py (unlike the GLM-5.1 preset).

- vllm-v0.22.0-moonshotai-kimi-k2.7-code-nvidia-b300-tp8: verified on
  vLLM >= 0.19.1, so the v0.22.0 image suffices. INT4 (compressed-tensors,
  auto-detected). Carries over K2.6's multimodal tuning and TRTLLM_RAGGED MLA
  prefill backend since K2.7-Code reuses the K2.5 vision stack. No
  --speculative-config: the checkpoint has no native MTP
  (num_nextn_predict_layers=0) and no K2.7 eagle3 draft is published yet.

Both rendered with `helm template -s`.
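
As a rough illustration of the GLM-5.2 settings listed in this commit message, the container section of such a preset might look like the fragment below. This is a sketch under assumptions, not the committed file. The image tag is inferred from the stated v0.23.0 requirement, the `"method": "mtp"` value inside `--speculative-config` is assumed, and the placement of flags in ISVC_EXTRA_ARGS follows the other presets in this PR. Model-specific flags not mentioned in the message are omitted.

```yaml
# Hypothetical container fragment for the GLM-5.2 B300 preset.
containers:
  - name: main
    image: vllm/vllm-openai:v0.23.0   # assumed tag; the message only says v0.23.0 is required
    env:
      - name: VLLM_DEEP_GEMM_WARMUP
        value: "skip"
      - name: ISVC_EXTRA_ARGS
        # No --trust-remote-code: the repo ships no remote .py files.
        value: >-
          --kv-cache-dtype fp8_e4m3
          --max-num-seqs 32
          --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}'
```

The Kimi-K2.7-Code preset would differ mainly by having no `--speculative-config` entry at all, as the commit message explains.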

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 18, 2026 14:53

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

    apiVersion: odin.moreh.io/v1alpha1
    kind: InferenceServiceTemplate
    metadata:
      name: vllm-v0.22.0-openai-gpt-oss-120b-nvidia-h100-nvl-tp4

    mif.moreh.io/role: e2e
    mif.moreh.io/accelerator.vendor: nvidia
    mif.moreh.io/accelerator.model: h100-nvl
    mif.moreh.io/parallelism: "tp4"

    apiVersion: odin.moreh.io/v1alpha1
    kind: InferenceServiceTemplate
    metadata:
      name: vllm-v0.23.0-zai-org-glm-5.2-fp8-mtp-nvidia-b300-tp8-moe-tp8

bongwoobak changed the title from "NO-ISSUE: feat(preset): add NVIDIA vLLM v0.20.x/v0.22.0 presets from aiand-rke2" to "NO-ISSUE: feat(preset): add NVIDIA vLLM v0.20.x/v0.22.0 presets" on Jun 18, 2026
bongwoobak force-pushed the feat/aiand-nvidia-presets-v020-v022 branch from 134aef4 to d8083b6 on Jun 18, 2026 15:23