Skip to content

feat: add checkpoint management workflow#411

Open
Alicemath1029 wants to merge 4 commits into
raids-lab:mainfrom
Alicemath1029:rb-20260518-00
Open

feat: add checkpoint management workflow#411
Alicemath1029 wants to merge 4 commits into
raids-lab:mainfrom
Alicemath1029:rb-20260518-00

Conversation

@Alicemath1029

Copy link
Copy Markdown

Summary

Add platform-level checkpoint management for training jobs.

This includes checkpoint config on job creation, CRATER_* runtime env injection, writable persistent mount validation, checkpoint indexing, latest checkpoint detection, restore-from-checkpoint
workflow, cleanup/quota/audit support, frontend interactions, and an always-on checkpoint-scanner service.

Changes

  • Add checkpoint config normalization, validation, annotations, and env injection
  • Add Job checkpoint metadata and job_checkpoints index table
  • Add checkpoint list, scan, restore, delete, and cleanup APIs
  • Add always-on checkpoint-scanner service and Helm deployment
  • Add frontend checkpoint form and job detail panel
  • Add docs for implementation design and PR review
  • Add validation scripts for multi-framework restore scenarios

Testing

  • go test ./internal/service/vcjob/checkpoint ./cmd/checkpoint-scanner
  • go test ./internal/handler/vcjob
  • Manually validated scanner-service based scan and restore flows across:
    • PyTorch
    • Hugging Face Trainer
    • Lightning
    • DeepSpeed
    • VERL

Restore proof logs showed:
loaded_step=4 continued_to=7 weights_changed=True

@Alicemath1029

Copy link
Copy Markdown
Author

功能概述

本 PR 为 Crater 训练作业补齐平台级 checkpoint 能力。

用户可以在创建训练作业时启用 checkpoint,平台会注入标准 CRATER_* 环境变量,并要求 checkpoint 目录位于可写持久化挂载下。作业运行后,用户可以在详情页扫描 checkpoint 列表、识别 latest checkpoint、从历史 checkpoint 一键恢复新作业,并进行删除、清理和配额查看。

主要变化

后端:

  • 作业创建公共结构增加 checkpoint 配置。
  • 新增 checkpoint 配置归一化、校验、环境变量注入、annotation 写入。
  • Job 增加 checkpoint JSON 信息。
  • 新增 job_checkpoints 表保存 checkpoint 索引。
  • 新增 checkpoint 列表、扫描、恢复、删除、清理 API。
  • 新增操作审计类型:扫描、恢复、删除、清理。
  • 新增 cmd/checkpoint-scanner 常驻扫描服务。

前端:

  • 新建作业页增加 Checkpoint 配置卡片。
  • 作业详情页增加 Checkpoint 面板。
  • 支持扫描、latest 展示、一键恢复、删除、清理和配额展示。

部署:

  • backend 镜像新增 /checkpoint-scanner 二进制。
  • Helm 新增 checkpoint-scanner Deployment 和 Service。
  • Helm 自动为 backend 配置 scanner service endpoint。
  • CI 增加 scanner 构建检查。

支持的框架

当前支持以下框架的 checkpoint 候选识别:

框架 典型 checkpoint
PyTorch checkpoint-0007.ptcheckpoint-0007/
Hugging Face Trainer checkpoint-0007/
Lightning epoch=0-step_0007.ckpt
DeepSpeed global_step0007/
VERL global_step_0007/
Custom 目录、.pt.pth.ckpt

平台负责识别候选 checkpoint 和恢复路径注入。真实训练状态保存和加载仍由训练框架或用户脚本完成。

API 入口

方法 路径 说明
GET /api/v1/vcjobs/:name/checkpoints?autoScan=true 查看 checkpoint 列表
POST /api/v1/vcjobs/:name/checkpoints/scan 手动扫描
POST /api/v1/vcjobs/:name/checkpoints/:checkpointID/restore 从 checkpoint 恢复
DELETE /api/v1/vcjobs/:name/checkpoints/:checkpointID 删除 checkpoint
POST /api/v1/vcjobs/:name/checkpoints/cleanup 批量清理

数据库迁移

需要执行数据库迁移:

  • jobs.checkpoint
  • job_checkpoints

本地验证命令:

cd backend
GOCACHE=$PWD/.cache/go-build make migrate

验证方式

后端单元测试:

cd backend
env GOCACHE=$PWD/.cache/go-build go test ./internal/service/vcjob/checkpoint ./cmd/checkpoint-scanner

Handler 测试:

cd backend
env CRATER_DEBUG_CONFIG_PATH=$PWD/etc/debug-config.yaml GOCACHE=$PWD/.cache/go-build go test ./internal/handler/vcjob

集群手工验证:

  1. 部署或启动 checkpoint-scanner
  2. 创建启用 checkpoint 的训练作业。
  3. 确认作业容器环境变量包含 CRATER_CHECKPOINT_DIRCRATER_RESUME_FROM
  4. 训练脚本写出多个 checkpoint。
  5. 在详情页点击扫描。
  6. 确认 latest 指向最大 step。
  7. 选择历史 checkpoint 点击恢复。
  8. 确认新作业 CRATER_RESUME_FROM 指向所选 checkpoint。
  9. 查看训练日志,确认从历史 step 继续训练。
  10. 执行 dryRun 清理,再执行真实清理。

已验证框架:

  • PyTorch
  • Hugging Face Trainer
  • Lightning
  • DeepSpeed
  • VERL

验证结论:以上框架均可以被 scanner 识别 latest,并可通过恢复作业把 CRATER_RESUME_FROM 注入到新作业中。真实恢复脚本证明 loaded_step=4 continued_to=7 weights_changed=True

风险和边界

  • 如果用户脚本不读取 CRATER_RESUME_FROM,平台只能创建恢复作业,不能保证训练状态恢复。
  • 平台不保存 checkpoint 文件内容,只保存索引和元数据。
  • scanner 服务不可用时,扫描和列表自动刷新会失败,但不会影响作业创建和训练运行。
  • 当前清理为用户触发,尚未实现定时清理。

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

该 PR 为训练作业引入平台级 Checkpoint 管理能力,覆盖作业创建时的 checkpoint 配置归一化/校验与注入、checkpoint 扫描索引与“latest”判定、从 checkpoint 恢复作业、清理/配额/审计操作,以及前端创建表单与作业详情页的交互,并新增常驻的 checkpoint-scanner 服务及 Helm 部署配置。

Changes:

  • 前端:新增 checkpoint 配置 schema、创建作业表单卡片与作业详情 Checkpoint 面板;扩展 vcjob API 类型与请求封装。
  • 后端:新增 checkpoint 配置处理(Normalize/Validate/Annotation/Env 注入)、checkpoint 扫描与索引表、扫描/恢复/删除/清理等接口与操作日志类型;新增 checkpoint-scanner 二进制与调用链路。
  • 部署与交付:Helm 增加 checkpoint-scanner 的 values 与模板;后端镜像与 CI 构建产物包含 checkpoint-scanner。

综合评估 / 阻断项

  • Helm Chart 变更后未同步 bump charts/crater/Chart.yaml version,且 README 未补充新 values 参数说明(阻断)。
  • Helm _helpers.tpl 目前仅在 .Values.checkpointScanner.enabled=true 时注入 backendConfig.checkpointScanner.endpoint,会导致“禁用内置 scanner 但配置外部 endpoint”场景失效(阻断)。
  • 前端 Checkpoint 面板的“清理”操作在未配置 maxToKeep(=0)时仍会默认传 keepLast=3,存在误删风险(阻断)。
  • 前端 checkpoint 相关新增文案存在多处硬编码(含校验 message/toast/tab label 等),与项目 i18n 约束不一致(阻断,需统一接入 i18n)。

Reviewed changes

Copilot reviewed 52 out of 53 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
frontend/src/utils/form.ts 新增 checkpointSchema 与默认值,用于作业创建表单校验与初始数据。
frontend/src/services/api/vcjob.ts 增加 checkpoint 相关类型定义与 checkpoints/scan/restore/delete/cleanup API 封装。
frontend/src/routes/portal/jobs/new/tensorflow-ps-job.tsx TF 作业创建表单接入 CheckpointFormCard,并在提交 payload 中携带 checkpoint 配置。
frontend/src/routes/portal/jobs/new/single-job.tsx 单机作业创建表单接入 CheckpointFormCard,并在提交 payload 中携带 checkpoint 配置。
frontend/src/routes/portal/jobs/new/pytorch-ddp-job.tsx PyTorch DDP 作业创建表单接入 CheckpointFormCard,并在提交 payload 中携带 checkpoint 配置。
frontend/src/components/job/detail/index.tsx 作业详情页新增 Checkpoint Tab(根据后端返回的 checkpoint.enabled 控制显示)。
frontend/src/components/job/detail/checkpoint-panel.tsx 新增作业 checkpoint 列表/扫描/恢复/删除/清理等前端面板实现。
frontend/src/components/form/checkpoint-form-field.tsx 新增 CheckpointFormCard 表单组件,支持框架/目录/恢复策略/保留策略等配置。
charts/crater/values.yaml 新增 checkpointScanner 与 backendConfig.checkpointScanner values 配置项。
charts/crater/templates/checkpoint-scanner/service.yaml 新增 checkpoint-scanner ClusterIP Service 模板。
charts/crater/templates/checkpoint-scanner/deployment.yaml 新增 checkpoint-scanner Deployment 模板(使用 backend 镜像、挂载 storage PVC 只读)。
charts/crater/templates/_helpers.tpl backendConfig 注入逻辑扩展 checkpointScanner endpoint 默认值填充。
backend/pkg/reconciler/vcjob-reconciler.go Job 记录落库与模型构建时新增 checkpoint 字段持久化(从注解解析)。
backend/pkg/constants/const_op.go 新增 checkpoint 扫描/恢复/删除/清理对应的操作日志类型常量。
backend/pkg/config/config.go Config 增加 CheckpointScanner 配置(endpoint/timeoutSeconds)。
backend/Makefile 增加 build-checkpoint-scanner 目标,用于构建 scanner 二进制。
backend/internal/storage/file.go 新增相对路径的 List/Stat/Remove API,供 checkpoint 扫描与删除使用。
backend/internal/service/vcjob/runtime.go JobRecord 增加 checkpoint 持久化;恢复作业时清理 PodTemplate 元信息字段。
backend/internal/service/vcjob/checkpoint/validator.go 新增 checkpoint 配置校验(框架、恢复模式、目录/挂载可写性、resumeFrom 范围等)。
backend/internal/service/vcjob/checkpoint/types.go 定义 checkpoint Config/PrepareInput 与默认值、与 model.CheckpointInfo 转换。
backend/internal/service/vcjob/checkpoint/scanner.go 基于 storage API 的 checkpoint 发现、latest 判定、索引持久化与 job.Checkpoint 回写。
backend/internal/service/vcjob/checkpoint/scanner_service.go scanner-service 侧本地文件系统扫描实现与 /healthz handler。
backend/internal/service/vcjob/checkpoint/scanner_client.go 后端调用 scanner-service 的 HTTP client(endpoint/timeout 归一化、请求与解析)。
backend/internal/service/vcjob/checkpoint/processor.go 串联 Normalizer + Validator 的 Prepare 处理入口。
backend/internal/service/vcjob/checkpoint/policy.go 框架策略注册表与默认目录生成策略(含 HF Trainer 特例说明)。
backend/internal/service/vcjob/checkpoint/normalizer.go checkpoint 配置归一化(默认框架/实验名、目录默认值、路径清理等)。
backend/internal/service/vcjob/checkpoint/env.go 向容器 Env 注入 CRATER_* checkpoint 运行时变量,并过滤用户自定义 CRATER_*。
backend/internal/service/vcjob/checkpoint/cluster_scanner.go Kubernetes 扫描入口(当前实现委托给 scanner-service)。
backend/internal/service/vcjob/checkpoint/checkpoint_test.go 新增 checkpoint Prepare/Env/annotation/扫描相关单测覆盖。
backend/internal/service/vcjob/checkpoint/annotations.go checkpoint 配置与运行态写入/解析 Volcano Job annotations。
backend/internal/handler/vcjob/webide.go WebIDE 创建流程接入 checkpoint 注解写入。
backend/internal/handler/vcjob/vcjob.go 路由注册新增 checkpoints 相关 API;JobDetail 响应增加 checkpoint 字段。
backend/internal/handler/vcjob/util.go 交互式 PodSpec env 注入接入 checkpoint;generateInteractivePodSpec 签名增加 jobName。
backend/internal/handler/vcjob/tensorflow.go Tensorflow 创建流程接入 checkpoint env 注入与注解写入。
backend/internal/handler/vcjob/pytorch.go Pytorch 创建流程接入 checkpoint env 注入与注解写入。
backend/internal/handler/vcjob/jupyter.go Jupyter 创建流程接入 checkpoint 注解写入。
backend/internal/handler/vcjob/custom.go Custom/Training 创建流程接入 checkpoint env 注入与注解写入;GenerateCustomPodSpec 签名增加 jobName。
backend/internal/handler/vcjob/checkpoint.go handler 层封装 prepareCheckpointConfig / env 注入 / annotation 写入桥接函数。
backend/internal/handler/vcjob/checkpoint_runtime.go 新增 checkpoints 列表/扫描/恢复/删除/清理 API 及配额/审计细节。
backend/internal/handler/aijob/new.go AIJob CreateCustom 调用签名适配(传入 jobName)。
backend/hack/checkpoint_restore_realistic_train.py 新增 realistic restore 验证脚本(生成连续性 proof)。
backend/hack/checkpoint_restore_real_framework_train.py 新增真实框架(pytorch/lightning/deepspeed/verl)restore 验证脚本。
backend/hack/checkpoint_restore_multiframework_train.py 新增多框架布局一致性验证脚本。
backend/etc/example-config.yaml example 配置新增 checkpointScanner endpoint/timeoutSeconds。
backend/Dockerfile 镜像中新增 checkpoint-scanner 二进制拷贝与可执行权限。
backend/dao/query/jobs.gen.go jobs 查询字段映射新增 checkpoint 列。
backend/dao/model/job.go Job 模型新增 CheckpointInfo 与 Job.Checkpoint JSON 字段。
backend/dao/model/job_checkpoint.go 新增 JobCheckpoint 表模型(含状态/索引/元数据)。
backend/cmd/gorm-gen/models/migrate.go 增加 Job.Checkpoint 列迁移与 JobCheckpoint 表迁移,并纳入 InitSchema。
backend/cmd/crater/main.go 启动时读取 CRATER_STORAGE_ROOT/ROOTDIR 并设置 storage root。
backend/cmd/checkpoint-scanner/main.go 新增常驻 checkpoint-scanner HTTP 服务入口(/scan, /healthz)。
.github/workflows/backend-pr.yml PR CI 构建产物新增 checkpoint-scanner 二进制。
.github/workflows/backend-build.yml 构建 workflow 为多平台产物新增 checkpoint-scanner 二进制。
Files not reviewed (1)
  • backend/dao/query/jobs.gen.go: Language not supported

Comment thread charts/crater/templates/_helpers.tpl Outdated
Comment on lines +91 to +100
const cleanupMutation = useMutation({
mutationFn: () =>
apiJobCheckpointCleanup(jobName, {
keepLast: data?.quota.maxToKeep || data?.checkpoint?.maxToKeep || 3,
}),
onSuccess: async (res) => {
toast.success(`清理完成,释放 ${formatBytes(res.data.reclaimedBytes)}`)
await refreshQueries()
},
})
Comment on lines +232 to +241
.superRefine((value, ctx) => {
if (!value.enabled) {
return
}
if (value.resumeMode === 'manual' && !value.resumeFrom) {
ctx.addIssue({
code: z.ZodIssueCode.custom,
message: '手动恢复时必须填写 checkpoint 路径',
path: ['resumeFrom'],
})
Comment thread charts/crater/values.yaml
Comment on lines +83 to +92
# -- Checkpoint scanner service configuration
checkpointScanner:
# -- Whether to deploy the always-on checkpoint scanner service
enabled: true
# -- Number of checkpoint scanner replicas
replicas: 1
# -- HTTP port exposed by the scanner service
port: 7330
# -- Max concurrent scan requests handled by one scanner pod
concurrency: 4
Comment on lines +52 to +57
addr := normalizePort(port)
klog.Infof("checkpoint-scanner starting on %s root=%s concurrency=%d version=%s commit=%s buildType=%s buildTime=%s",
addr, root, concurrency, AppVersion, CommitSHA, BuildType, BuildTime)
if err := http.ListenAndServe(addr, mux); err != nil {
klog.Fatalf("checkpoint-scanner failed: %v", err)
}
Alicemath1029 and others added 3 commits June 3, 2026 21:59
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

@AkashiSensei AkashiSensei left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

感谢您的贡献!

流程

我大概看了您的代码,对流程大概有这样的理解,也有些疑惑。

首先,用户可以对一个作业开启 checkpoint 功能,之后平台会向容器中注入环境变量,用户需要自行在训练代码中读取这些环境变量,并将 checkpoint 保存到指定位置。

  • 这里看起来是支持了各种的作业类型。

然后用户可以让一个单独的 scanner 扫描指定的自己的作业,此时 scanner 获取作业对应的 checkpoint 路径,扫描并将结果写入数据库。

  • 这里看起来处理了重复扫描;对于通过平台 checkpoint 删除/清理接口删除的情况,会同步更新数据库状态。如果 checkpoint 文件是通过挂载目录、存储 API 等外部方式删除的,则需要等下一次扫描后才会被标记为 missing。
  • 但是看起来不太能很好的处理多个作业指定相同 checkpoint 路径的情况。

用户可以选择有记录的 checkpoint,选择从对应的情况下恢复。此时平台会重新创建对应的作业,然后设置对应的环境变量启动作业,由用户作业代码处理从 checkpoint 恢复的逻辑。

  • 目前平台会在创建作业时留下一个配置表单,这个表单将会在克隆作业时被使用,这个表单同样有一个版本号,看起来这个 PR 中没有处理这个版本号,可能会带来一些更新前后的兼容性问题。
  • 目前这个“重新创建”作业似乎没有使用已有的作业克隆机制,而是直接基于数据库中保存的原始作业 attributes 重新提交;同时看起来也没有自己的一套版本管理或迁移机制。如果之后作业 attributes 的结构、Volcano Job spec 或相关 label/annotation 约定发生变化,可能会引起跨版本兼容性问题。

问题

总的来说,代码实现考虑的相对比较周全,但是可能需要在讨论下这个功能的必要性和支持方式。

在我个人看来,这个功能的支持应该是更加偏向镜像/用户代码侧的,因为平台在整个过程中实际上就多了一个指出 checkpoint 在哪并辅助管理的功能,在此之前用户也可以通过把共享存储中的 checkpoint 挂载到不同的 Pod 中来共享检查点,核心其实是要求用户的代码能够支持把 checkpoint 保存到环境变量指定的位置,并且从指定的位置恢复。

从这个角度来说,我们再单独部署一个 scanner,对现有代码做比较大的改动似乎不太值当。

对于用户来说,这有比较高的学习成本,却可能不太会获得很大的收益。如果是我,我可能会选择自己向共享存储保存 checkpoint 以及读取,而不是去识别和对接相应的环境变量。

因此我们可能还需要进一步的探讨这个功能。

再次感谢您的贡献!


Thank you for your contribution!

Workflow

I took a rough look at the code, and this is my current understanding of the workflow, along with a few questions.

First, users can enable the checkpoint feature for a job. After that, the platform injects environment variables into the container, and users need to read these environment variables in their training code and save checkpoints to the specified location.

  • This appears to support multiple job types.

Then users can ask a standalone scanner to scan their specified job. The scanner obtains the checkpoint path for that job, scans it, and writes the results into the database.

  • This appears to handle repeated scans. For checkpoints deleted through the platform's checkpoint delete/cleanup APIs, the database state is updated accordingly. If checkpoint files are deleted externally, for example through the mounted directory or storage APIs, they will only be marked as missing after the next scan.
  • However, it does not seem to handle the case where multiple jobs specify the same checkpoint path very well.

Users can select a recorded checkpoint and restore from it. In this process, the platform recreates the corresponding job, sets the related environment variables, and starts the job. The user's job code is then responsible for handling the actual restore logic from the checkpoint.

  • Currently, the platform saves a configuration form when creating a job. This form is later used when cloning a job, and it also has a version number. It looks like this PR does not handle that version number, which may introduce compatibility issues before and after updates.
  • The current "recreate" job flow does not seem to use the existing job clone mechanism. Instead, it directly resubmits the original job attributes stored in the database. At the same time, it does not appear to have its own version management or migration mechanism. If the structure of job attributes, the Volcano Job spec, or related label/annotation conventions change in the future, this may lead to cross-version compatibility issues.

Questions

Overall, the implementation appears to be relatively thoughtful, but we may still need to further discuss the necessity of this feature and the way it should be supported.

In my personal view, support for this feature should lean more toward the image/user-code side. Throughout the whole process, the platform mainly provides a pointer to where the checkpoint is located and offers some auxiliary management. Before this feature, users could already share checkpoints by mounting checkpoint files from shared storage into different Pods. The core requirement is still that user code must support saving checkpoints to the path specified by environment variables and restoring from that path.

From this perspective, deploying a separate scanner and making relatively large changes to the existing code may not be worth it.

For users, this introduces a relatively high learning cost, while the actual benefit may not be very large. If it were me, I might choose to save checkpoints to shared storage and read them myself, rather than identifying and integrating with these environment variables.

Therefore, we may still need to further discuss this feature.

Thanks again for your contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants