[1/N] add fp8 fp32 scale support for custom RL model by yiakwy-xpu-ml-framework-team · Pull Request #368 · antirez/ds4

yiakwy-xpu-ml-framework-team · 2026-06-09T08:20:17Z

Background

We added an FP8 RL+SFT version of DeepSeek V4 with week-0 support, and in our internal evaluation it surpassed the DeepSeek V4 baseline in all major dimensions.

Hence we want to add 2-bit support for DeepSeek V4 with our Expert Pruning technology:

Note: on H100/H800 we usually don't use E8M0 for the scale, since it introduces runtime overhead. An FP32 scale works best.
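To make the scale-format remark concrete, here is a minimal C sketch (not code from this PR) of the two scale encodings. It assumes the OCP Microscaling convention for E8M0: one byte holding a biased exponent, value 2^(e-127), with 0xFF reserved for NaN. The function names are hypothetical. An E8M0 scale has to be decoded before it can multiply a weight and can only represent powers of two, whereas an FP32 scale is used as-is.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Decode an E8M0 scale byte: 2^(e - 127); 0xFF encodes NaN
 * (assumed OCP Microscaling convention). */
static float e8m0_to_float(uint8_t e) {
    if (e == 0xFF) return NAN;
    return ldexpf(1.0f, (int)e - 127);
}

/* Hypothetical helper: round a positive FP32 scale UP to the nearest
 * power of two that E8M0 can represent, so values never overflow. */
static uint8_t float_to_e8m0_ceil(float s) {
    assert(s > 0.0f && isfinite(s));
    int exp;
    float m = frexpf(s, &exp);            /* s = m * 2^exp, m in [0.5, 1) */
    int e = (m == 0.5f) ? exp - 1 : exp;  /* exact power of two: keep it */
    e += 127;                             /* apply the E8M0 bias */
    if (e < 0) e = 0;
    if (e > 254) e = 254;                 /* 255 is reserved for NaN */
    return (uint8_t)e;
}
```

With an FP32 scale the stored value is multiplied directly. With E8M0 the scale first goes through `e8m0_to_float`, and a scale such as 0.003 has to be rounded to a power of two (here 2^-8), which costs some precision.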

yiakwy-xpu-ml-framework-team · 2026-06-09T08:20:35Z

@antirez could you have a look at it?

antirez · 2026-06-09T11:05:53Z

Hi, the PR itself has a few quality issues, but more importantly it is not clear why it would be useful for the project as a whole, given that we convert from the DS4 Hugging Face formats.

yiakwy-xpu-ml-framework-team · 2026-06-10T03:12:20Z

Quantization is successful.

@antirez Thank you for the quick response, let me explain.

Our SFT/RL model of DeepSeek V4 has an embedding layer stored as bf16 or int32, while the DeepSeek model has an embedding of type int64.
Since we run on the Hopper platform, our expert weights are stored as E4M3 FP8 and the weight scales as FP32 for best performance (which can be verified in SGLang):

Customer DSV4 SGLang FP8 serving on the Hopper platform with identity injection, private/public knowledge injection and an enhanced security shield module
The Hugging Face model is not an SGLang-compatible version, while ours is; and Hugging Face does not cover converting SFT/RL models from BF16 to FP8 variants.
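For reference, the storage layout described above (E4M3 FP8 weights with FP32 scales) can be sketched in C as follows. This is an illustrative sketch, not the converter code from this PR: it assumes the E4M3 variant used on Hopper (bias 7, no infinities, only S.1111.111 is NaN, maximum 448) and one FP32 scale per contiguous block of weights. The function names and the block size are hypothetical.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one E4M3 byte: 1 sign bit, 4 exponent bits (bias 7),
 * 3 mantissa bits. Assumed variant: no infinities, 0x7F/0xFF are NaN. */
static float e4m3_to_float(uint8_t b) {
    int sign = b >> 7;
    int exp  = (b >> 3) & 0xF;
    int man  = b & 0x7;
    float v;
    if (exp == 0xF && man == 0x7) return NAN;
    if (exp == 0)
        v = ldexpf((float)man, -9);              /* subnormal: man/8 * 2^-6 */
    else
        v = ldexpf(1.0f + man / 8.0f, exp - 7);  /* normal value */
    return sign ? -v : v;
}

/* Dequantize n FP8 weights using one FP32 scale per `block` weights.
 * `scales` must hold at least ceil(n / block) entries. */
static void dequant_fp8_fp32_scale(const uint8_t *q, const float *scales,
                                   size_t n, size_t block, float *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = e4m3_to_float(q[i]) * scales[i / block];
}
```

The FP32 scale feeds straight into the multiply, with no extra decode step on the scale itself. For example, byte 0x7E decodes to 448, the largest finite E4M3 value.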

The model is tuned specifically to handle Cantonese, Mandarin Chinese and English efficiently.

yiakwy-xpu-ml-framework-team · 2026-06-11T09:22:11Z

Hi @antirez, I can confirm that the modification generates a correct model checkpoint for ds4. I would appreciate your attention.

GB10 (2-bit dsv4-sft-rl, 15 tokens/sec):

Serving with raw model (no system prompt):

add fp8 fp32 scale support for custom RL model

4decca9

This was referenced Jun 10, 2026

[2/N] add cuda imatrix support for custom RL model #377

Open

Distributed CUDA worker host-registers the whole GGUF, not just its layer slice → Q4 OOM on 128GB DGX Spark #293

Open

our sft/rl model does not contain fp8 scale for this weight

fe08b54

yiakwy-xpu-ml-framework-team mentioned this pull request Jun 12, 2026

[3/N] add prefetch support for CUDA backend : running ds4 for any GPU with cache (2.75 x faster!) #402

Open

yiakwy-xpu-ml-framework-team wants to merge 2 commits into antirez:main from yiakwy-xpu-ml-framework-team:add_fp8_fp32_scale_support