feat: add checkpoint management workflow#411
Conversation
cbcbfa6 to
df08f42
Compare
功能概述本 PR 为 Crater 训练作业补齐平台级 checkpoint 能力。 用户可以在创建训练作业时启用 checkpoint,平台会注入标准 主要变化后端:
前端:
部署:
支持的框架当前支持以下框架的 checkpoint 候选识别:
平台负责识别候选 checkpoint 和恢复路径注入。真实训练状态保存和加载仍由训练框架或用户脚本完成。 API 入口
数据库迁移需要执行数据库迁移:
本地验证命令: cd backend
GOCACHE=$PWD/.cache/go-build make migrate验证方式后端单元测试: cd backend
env GOCACHE=$PWD/.cache/go-build go test ./internal/service/vcjob/checkpoint ./cmd/checkpoint-scannerHandler 测试: cd backend
env CRATER_DEBUG_CONFIG_PATH=$PWD/etc/debug-config.yaml GOCACHE=$PWD/.cache/go-build go test ./internal/handler/vcjob集群手工验证:
已验证框架:
验证结论:以上框架均可以被 scanner 识别 latest,并可通过恢复作业把 风险和边界
|
df08f42 to
93b8bd2
Compare
There was a problem hiding this comment.
Pull request overview
该 PR 为训练作业引入平台级 Checkpoint 管理能力,覆盖作业创建时的 checkpoint 配置归一化/校验与注入、checkpoint 扫描索引与“latest”判定、从 checkpoint 恢复作业、清理/配额/审计操作,以及前端创建表单与作业详情页的交互,并新增常驻的 checkpoint-scanner 服务及 Helm 部署配置。
Changes:
- 前端:新增 checkpoint 配置 schema、创建作业表单卡片与作业详情 Checkpoint 面板;扩展 vcjob API 类型与请求封装。
- 后端:新增 checkpoint 配置处理(Normalize/Validate/Annotation/Env 注入)、checkpoint 扫描与索引表、扫描/恢复/删除/清理等接口与操作日志类型;新增 checkpoint-scanner 二进制与调用链路。
- 部署与交付:Helm 增加 checkpoint-scanner 的 values 与模板;后端镜像与 CI 构建产物包含 checkpoint-scanner。
综合评估 / 阻断项
- Helm Chart 变更后未同步 bump
charts/crater/Chart.yamlversion,且 README 未补充新 values 参数说明(阻断)。 - Helm
_helpers.tpl目前仅在.Values.checkpointScanner.enabled=true时注入backendConfig.checkpointScanner.endpoint,会导致“禁用内置 scanner 但配置外部 endpoint”场景失效(阻断)。 - 前端 Checkpoint 面板的“清理”操作在未配置
maxToKeep(=0)时仍会默认传keepLast=3,存在误删风险(阻断)。 - 前端 checkpoint 相关新增文案存在多处硬编码(含校验 message/toast/tab label 等),与项目 i18n 约束不一致(阻断,需统一接入 i18n)。
Reviewed changes
Copilot reviewed 52 out of 53 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| frontend/src/utils/form.ts | 新增 checkpointSchema 与默认值,用于作业创建表单校验与初始数据。 |
| frontend/src/services/api/vcjob.ts | 增加 checkpoint 相关类型定义与 checkpoints/scan/restore/delete/cleanup API 封装。 |
| frontend/src/routes/portal/jobs/new/tensorflow-ps-job.tsx | TF 作业创建表单接入 CheckpointFormCard,并在提交 payload 中携带 checkpoint 配置。 |
| frontend/src/routes/portal/jobs/new/single-job.tsx | 单机作业创建表单接入 CheckpointFormCard,并在提交 payload 中携带 checkpoint 配置。 |
| frontend/src/routes/portal/jobs/new/pytorch-ddp-job.tsx | PyTorch DDP 作业创建表单接入 CheckpointFormCard,并在提交 payload 中携带 checkpoint 配置。 |
| frontend/src/components/job/detail/index.tsx | 作业详情页新增 Checkpoint Tab(根据后端返回的 checkpoint.enabled 控制显示)。 |
| frontend/src/components/job/detail/checkpoint-panel.tsx | 新增作业 checkpoint 列表/扫描/恢复/删除/清理等前端面板实现。 |
| frontend/src/components/form/checkpoint-form-field.tsx | 新增 CheckpointFormCard 表单组件,支持框架/目录/恢复策略/保留策略等配置。 |
| charts/crater/values.yaml | 新增 checkpointScanner 与 backendConfig.checkpointScanner values 配置项。 |
| charts/crater/templates/checkpoint-scanner/service.yaml | 新增 checkpoint-scanner ClusterIP Service 模板。 |
| charts/crater/templates/checkpoint-scanner/deployment.yaml | 新增 checkpoint-scanner Deployment 模板(使用 backend 镜像、挂载 storage PVC 只读)。 |
| charts/crater/templates/_helpers.tpl | backendConfig 注入逻辑扩展 checkpointScanner endpoint 默认值填充。 |
| backend/pkg/reconciler/vcjob-reconciler.go | Job 记录落库与模型构建时新增 checkpoint 字段持久化(从注解解析)。 |
| backend/pkg/constants/const_op.go | 新增 checkpoint 扫描/恢复/删除/清理对应的操作日志类型常量。 |
| backend/pkg/config/config.go | Config 增加 CheckpointScanner 配置(endpoint/timeoutSeconds)。 |
| backend/Makefile | 增加 build-checkpoint-scanner 目标,用于构建 scanner 二进制。 |
| backend/internal/storage/file.go | 新增相对路径的 List/Stat/Remove API,供 checkpoint 扫描与删除使用。 |
| backend/internal/service/vcjob/runtime.go | JobRecord 增加 checkpoint 持久化;恢复作业时清理 PodTemplate 元信息字段。 |
| backend/internal/service/vcjob/checkpoint/validator.go | 新增 checkpoint 配置校验(框架、恢复模式、目录/挂载可写性、resumeFrom 范围等)。 |
| backend/internal/service/vcjob/checkpoint/types.go | 定义 checkpoint Config/PrepareInput 与默认值、与 model.CheckpointInfo 转换。 |
| backend/internal/service/vcjob/checkpoint/scanner.go | 基于 storage API 的 checkpoint 发现、latest 判定、索引持久化与 job.Checkpoint 回写。 |
| backend/internal/service/vcjob/checkpoint/scanner_service.go | scanner-service 侧本地文件系统扫描实现与 /healthz handler。 |
| backend/internal/service/vcjob/checkpoint/scanner_client.go | 后端调用 scanner-service 的 HTTP client(endpoint/timeout 归一化、请求与解析)。 |
| backend/internal/service/vcjob/checkpoint/processor.go | 串联 Normalizer + Validator 的 Prepare 处理入口。 |
| backend/internal/service/vcjob/checkpoint/policy.go | 框架策略注册表与默认目录生成策略(含 HF Trainer 特例说明)。 |
| backend/internal/service/vcjob/checkpoint/normalizer.go | checkpoint 配置归一化(默认框架/实验名、目录默认值、路径清理等)。 |
| backend/internal/service/vcjob/checkpoint/env.go | 向容器 Env 注入 CRATER_* checkpoint 运行时变量,并过滤用户自定义 CRATER_*。 |
| backend/internal/service/vcjob/checkpoint/cluster_scanner.go | Kubernetes 扫描入口(当前实现委托给 scanner-service)。 |
| backend/internal/service/vcjob/checkpoint/checkpoint_test.go | 新增 checkpoint Prepare/Env/annotation/扫描相关单测覆盖。 |
| backend/internal/service/vcjob/checkpoint/annotations.go | checkpoint 配置与运行态写入/解析 Volcano Job annotations。 |
| backend/internal/handler/vcjob/webide.go | WebIDE 创建流程接入 checkpoint 注解写入。 |
| backend/internal/handler/vcjob/vcjob.go | 路由注册新增 checkpoints 相关 API;JobDetail 响应增加 checkpoint 字段。 |
| backend/internal/handler/vcjob/util.go | 交互式 PodSpec env 注入接入 checkpoint;generateInteractivePodSpec 签名增加 jobName。 |
| backend/internal/handler/vcjob/tensorflow.go | Tensorflow 创建流程接入 checkpoint env 注入与注解写入。 |
| backend/internal/handler/vcjob/pytorch.go | Pytorch 创建流程接入 checkpoint env 注入与注解写入。 |
| backend/internal/handler/vcjob/jupyter.go | Jupyter 创建流程接入 checkpoint 注解写入。 |
| backend/internal/handler/vcjob/custom.go | Custom/Training 创建流程接入 checkpoint env 注入与注解写入;GenerateCustomPodSpec 签名增加 jobName。 |
| backend/internal/handler/vcjob/checkpoint.go | handler 层封装 prepareCheckpointConfig / env 注入 / annotation 写入桥接函数。 |
| backend/internal/handler/vcjob/checkpoint_runtime.go | 新增 checkpoints 列表/扫描/恢复/删除/清理 API 及配额/审计细节。 |
| backend/internal/handler/aijob/new.go | AIJob CreateCustom 调用签名适配(传入 jobName)。 |
| backend/hack/checkpoint_restore_realistic_train.py | 新增 realistic restore 验证脚本(生成连续性 proof)。 |
| backend/hack/checkpoint_restore_real_framework_train.py | 新增真实框架(pytorch/lightning/deepspeed/verl)restore 验证脚本。 |
| backend/hack/checkpoint_restore_multiframework_train.py | 新增多框架布局一致性验证脚本。 |
| backend/etc/example-config.yaml | example 配置新增 checkpointScanner endpoint/timeoutSeconds。 |
| backend/Dockerfile | 镜像中新增 checkpoint-scanner 二进制拷贝与可执行权限。 |
| backend/dao/query/jobs.gen.go | jobs 查询字段映射新增 checkpoint 列。 |
| backend/dao/model/job.go | Job 模型新增 CheckpointInfo 与 Job.Checkpoint JSON 字段。 |
| backend/dao/model/job_checkpoint.go | 新增 JobCheckpoint 表模型(含状态/索引/元数据)。 |
| backend/cmd/gorm-gen/models/migrate.go | 增加 Job.Checkpoint 列迁移与 JobCheckpoint 表迁移,并纳入 InitSchema。 |
| backend/cmd/crater/main.go | 启动时读取 CRATER_STORAGE_ROOT/ROOTDIR 并设置 storage root。 |
| backend/cmd/checkpoint-scanner/main.go | 新增常驻 checkpoint-scanner HTTP 服务入口(/scan, /healthz)。 |
| .github/workflows/backend-pr.yml | PR CI 构建产物新增 checkpoint-scanner 二进制。 |
| .github/workflows/backend-build.yml | 构建 workflow 为多平台产物新增 checkpoint-scanner 二进制。 |
Files not reviewed (1)
- backend/dao/query/jobs.gen.go: Language not supported
| const cleanupMutation = useMutation({ | ||
| mutationFn: () => | ||
| apiJobCheckpointCleanup(jobName, { | ||
| keepLast: data?.quota.maxToKeep || data?.checkpoint?.maxToKeep || 3, | ||
| }), | ||
| onSuccess: async (res) => { | ||
| toast.success(`清理完成,释放 ${formatBytes(res.data.reclaimedBytes)}`) | ||
| await refreshQueries() | ||
| }, | ||
| }) |
| .superRefine((value, ctx) => { | ||
| if (!value.enabled) { | ||
| return | ||
| } | ||
| if (value.resumeMode === 'manual' && !value.resumeFrom) { | ||
| ctx.addIssue({ | ||
| code: z.ZodIssueCode.custom, | ||
| message: '手动恢复时必须填写 checkpoint 路径', | ||
| path: ['resumeFrom'], | ||
| }) |
| # -- Checkpoint scanner service configuration | ||
| checkpointScanner: | ||
| # -- Whether to deploy the always-on checkpoint scanner service | ||
| enabled: true | ||
| # -- Number of checkpoint scanner replicas | ||
| replicas: 1 | ||
| # -- HTTP port exposed by the scanner service | ||
| port: 7330 | ||
| # -- Max concurrent scan requests handled by one scanner pod | ||
| concurrency: 4 |
| addr := normalizePort(port) | ||
| klog.Infof("checkpoint-scanner starting on %s root=%s concurrency=%d version=%s commit=%s buildType=%s buildTime=%s", | ||
| addr, root, concurrency, AppVersion, CommitSHA, BuildType, BuildTime) | ||
| if err := http.ListenAndServe(addr, mux); err != nil { | ||
| klog.Fatalf("checkpoint-scanner failed: %v", err) | ||
| } |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
AkashiSensei
left a comment
There was a problem hiding this comment.
感谢您的贡献!
流程
我大概看了您的代码,对流程大概有这样的理解,也有些疑惑。
首先,用户可以对一个作业开启 checkpoint 功能,之后平台会向容器中注入环境变量,用户需要自行在训练代码中读取这些环境变量,并将 checkpoint 保存到指定位置。
- 这里看起来是支持了各种的作业类型。
然后用户可以让一个单独的 scanner 扫描指定的自己的作业,此时 scanner 获取作业对应的 checkpoint 路径,扫描并将结果写入数据库。
- 这里看起来处理了重复扫描;对于通过平台 checkpoint 删除/清理接口删除的情况,会同步更新数据库状态。如果 checkpoint 文件是通过挂载目录、存储 API 等外部方式删除的,则需要等下一次扫描后才会被标记为 missing。
- 但是看起来不太能很好的处理多个作业指定相同 checkpoint 路径的情况。
用户可以选择有记录的 checkpoint,选择从对应的情况下恢复。此时平台会重新创建对应的作业,然后设置对应的环境变量启动作业,由用户作业代码处理从 checkpoint 恢复的逻辑。
- 目前平台会在创建作业时留下一个配置表单,这个表单将会在克隆作业时被使用,这个表单同样有一个版本号,看起来这个 PR 中没有处理这个版本号,可能会带来一些更新前后的兼容性问题。
- 目前这个“重新创建”作业似乎没有使用已有的作业克隆机制,而是直接基于数据库中保存的原始作业 attributes 重新提交;同时看起来也没有自己的一套版本管理或迁移机制。如果之后作业 attributes 的结构、Volcano Job spec 或相关 label/annotation 约定发生变化,可能会引起跨版本兼容性问题。
问题
总的来说,代码实现考虑的相对比较周全,但是可能需要在讨论下这个功能的必要性和支持方式。
在我个人看来,这个功能的支持应该是更加偏向镜像/用户代码侧的,因为平台在整个过程中实际上就多了一个指出 checkpoint 在哪并辅助管理的功能,在此之前用户也可以通过把共享存储中的 checkpoint 挂载到不同的 Pod 中来共享检查点,核心其实是要求用户的代码能够支持把 checkpoint 保存到环境变量指定的位置,并且从指定的位置恢复。
从这个角度来说,我们再单独部署一个 scanner,对现有代码做比较大的改动似乎不太值当。
对于用户来说,这有比较高的学习成本,却可能不太会获得很大的收益。如果是我,我可能会选择自己向共享存储保存 checkpoint 以及读取,而不是去识别和对接相应的环境变量。
因此我们可能还需要进一步的探讨这个功能。
再次感谢您的贡献!
Thank you for your contribution!
Workflow
I took a rough look at the code, and this is my current understanding of the workflow, along with a few questions.
First, users can enable the checkpoint feature for a job. After that, the platform injects environment variables into the container, and users need to read these environment variables in their training code and save checkpoints to the specified location.
- This appears to support multiple job types.
Then users can ask a standalone scanner to scan their specified job. The scanner obtains the checkpoint path for that job, scans it, and writes the results into the database.
- This appears to handle repeated scans. For checkpoints deleted through the platform's checkpoint delete/cleanup APIs, the database state is updated accordingly. If checkpoint files are deleted externally, for example through the mounted directory or storage APIs, they will only be marked as missing after the next scan.
- However, it does not seem to handle the case where multiple jobs specify the same checkpoint path very well.
Users can select a recorded checkpoint and restore from it. In this process, the platform recreates the corresponding job, sets the related environment variables, and starts the job. The user's job code is then responsible for handling the actual restore logic from the checkpoint.
- Currently, the platform saves a configuration form when creating a job. This form is later used when cloning a job, and it also has a version number. It looks like this PR does not handle that version number, which may introduce compatibility issues before and after updates.
- The current "recreate" job flow does not seem to use the existing job clone mechanism. Instead, it directly resubmits the original job attributes stored in the database. At the same time, it does not appear to have its own version management or migration mechanism. If the structure of job attributes, the Volcano Job spec, or related label/annotation conventions change in the future, this may lead to cross-version compatibility issues.
Questions
Overall, the implementation appears to be relatively thoughtful, but we may still need to further discuss the necessity of this feature and the way it should be supported.
In my personal view, support for this feature should lean more toward the image/user-code side. Throughout the whole process, the platform mainly provides a pointer to where the checkpoint is located and offers some auxiliary management. Before this feature, users could already share checkpoints by mounting checkpoint files from shared storage into different Pods. The core requirement is still that user code must support saving checkpoints to the path specified by environment variables and restoring from that path.
From this perspective, deploying a separate scanner and making relatively large changes to the existing code may not be worth it.
For users, this introduces a relatively high learning cost, while the actual benefit may not be very large. If it were me, I might choose to save checkpoints to shared storage and read them myself, rather than identifying and integrating with these environment variables.
Therefore, we may still need to further discuss this feature.
Thanks again for your contribution!
Summary
Add platform-level checkpoint management for training jobs.
This includes checkpoint config on job creation, CRATER_* runtime env injection, writable persistent mount validation, checkpoint indexing, latest checkpoint detection, restore-from-checkpoint
workflow, cleanup/quota/audit support, frontend interactions, and an always-on checkpoint-scanner service.
Changes
Testing
go test ./internal/service/vcjob/checkpoint ./cmd/checkpoint-scannergo test ./internal/handler/vcjobRestore proof logs showed:
loaded_step=4 continued_to=7 weights_changed=True