fix: prevent CNClaim Finalize stuck and scale-in race during migration #592
xzxiong wants to merge 4 commits into
Fixes 4 bugs that cause CNClaim Finalize to get stuck and scale-in to select migrating claims:

1. `Finalize()` got stuck when all owned Pods were claimed by another CNClaim; it now releases the claimed-by label and completes finalization.
2. CNClaimSet scale-in selected claims mid-migration (`spec.SourcePod != nil`); migrating claims are now excluded from scale-in candidates.
3. `Sync()` did not clear `spec.PodName` on Pod NotFound, so the claim stayed in Lost forever with a stale podName.
4. `watchPodChange` triggered reconcile only via the Pod label; it now also triggers for CNClaims referencing the pod via `spec.podName`.

Closes #591

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
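To make item 1 concrete, here is a minimal sketch of the Finalize decision as the description states it: skip reclaim for Pods another claim references, release only this claim's stale label, and report completion once every owned Pod has been handled. The `Pod` type, the label key, the `referencedBy` index, and the `reclaim` callback are all hypothetical stand-ins, not the controller's real types or signatures.

```go
package main

import "fmt"

// claimedByLabel is a hypothetical label key; the real key is not shown in this PR page.
const claimedByLabel = "example.io/claimed-by"

// Pod is a hypothetical, trimmed-down stand-in for a CN Pod.
type Pod struct {
	Name   string
	Labels map[string]string
}

// finalize handles every Pod owned by claimName and reports whether
// finalization is complete. referencedBy maps a Pod name to the claim
// that currently references it (an assumed index, for illustration only).
func finalize(claimName string, owned []*Pod, referencedBy map[string]string, reclaim func(*Pod) error) (bool, error) {
	for _, pod := range owned {
		if other, ok := referencedBy[pod.Name]; ok && other != claimName {
			// Another claim references this Pod: skip reclaim and
			// drop only our own leftover label, if it is still there.
			if pod.Labels[claimedByLabel] == claimName {
				delete(pod.Labels, claimedByLabel)
			}
			continue
		}
		if err := reclaim(pod); err != nil {
			return false, err
		}
	}
	// Every owned Pod was either reclaimed or released, so we are done.
	return true, nil
}

func main() {
	pod := &Pod{Name: "cn-0", Labels: map[string]string{claimedByLabel: "claim-old"}}
	done, err := finalize("claim-old", []*Pod{pod},
		map[string]string{"cn-0": "claim-new"},
		func(*Pod) error { return nil })
	fmt.Println(done, err, len(pod.Labels))
}
```

The key point is the final `return true, nil`: before the fix, the skip path is described as returning `(false, nil)`, which never lets the finalizer complete.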
Promote github.com/google/go-cmp from indirect to direct dependency in api/go.mod — `go mod tidy` with the CI toolchain (Go 1.23.1) requires this change for a clean working tree. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review: fix: prevent CNClaim Finalize stuck and scale-in race during migration
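0. Summary (TL;DR)

Fixes 4 interrelated bugs in the CNClaim lifecycle: Finalize getting stuck, the scale-in vs. migration race, stale references after Pod NotFound, and lost watch events. Each fix is independent and minimal.

Issue count: 🔴 0 | 🟡 2 | 🟢 2

PR description quality: excellent. All four dimensions are covered (background / cause / solution / result), including a reconstruction of the event chain, code-level localization, and fix suggestions.

Merge recommendation: ✅ recommend merging

1. PR Description Review

2. Solution Review

2.0 Change visualization

Finalize state machine change:

Scale-in data flow change:

2.1 Soundness of the approach

A. Function call scenarios

B. Usage scenario coverage

C. Architectural soundness

2.2 Test plan review

T1. Change → test mapping

T8. Regression verification

T9. Fault reproduction efficiency

3. Change Overview

This PR fixes 4 interrelated bugs in the CNClaim controller and the CNClaimSet controller. In the dev freetier-01 environment these bugs caused CNClaim Finalize to get stuck (60+ retries per second), claims to be deleted by mistake mid-migration, stale Pod references, and lost watch events.

The core changes are concentrated in two controller packages:

All changes affect only CNClaim lifecycle management; no API type changes or CRD changes are involved.

4. Code Review (file by file)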
| Function | Time complexity | Notes |
|---|---|---|
| Finalize loop | O(owned × claims) | owned is typically 1, claims typically < 100 |
| scaleIn filter | O(owned) | single pass |
| watchPodChangeFn | O(claims_in_ns) | triggered on every Pod event |
| containsRequest | O(N) | N = length of requests, typically 1-2 |
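For reference, the O(N) figure for `containsRequest` corresponds to a plain linear scan. The sketch below is an assumption about its shape, not the PR's code: `Request` is a hypothetical local stand-in for controller-runtime's `reconcile.Request`, and `appendUnique` is an illustrative helper that does not appear in the PR.

```go
package main

import "fmt"

// Request is a hypothetical stand-in for reconcile.Request,
// reduced to the two fields that identify an object.
type Request struct {
	Namespace string
	Name      string
}

// containsRequest reports whether target is already in requests.
// It scans the slice once, so it is O(N) in len(requests).
func containsRequest(requests []Request, target Request) bool {
	for _, r := range requests {
		if r == target {
			return true
		}
	}
	return false
}

// appendUnique (illustrative only) adds target when it is not present yet.
func appendUnique(requests []Request, target Request) []Request {
	if containsRequest(requests, target) {
		return requests
	}
	return append(requests, target)
}

func main() {
	var reqs []Request
	reqs = appendUnique(reqs, Request{Namespace: "ns", Name: "claim-a"})
	reqs = appendUnique(reqs, Request{Namespace: "ns", Name: "claim-a"}) // duplicate, ignored
	reqs = appendUnique(reqs, Request{Namespace: "ns", Name: "claim-b"})
	fmt.Println(len(reqs)) // prints 2
}
```

With N typically 1-2, a linear scan is cheaper and simpler than building a set.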
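5.4 Cost
- ✅ No extra API calls (the Patch is a required operation)
- ✅ No risk of resource leaks
5.5 Security
- ✅ No user input handling
- ✅ No sensitive information leakage
5.9 Prohibited operations
- ✅ No hard-coded timeouts (uses the existing constants `waitCacheTimeout` and `retryPatchInterval`)
5.10 Test implementation quality
- ✅ Assertions are effective and verify the core filtering logic
- ✅ No flaky risk (pure logic test, no timing/ordering dependencies)
- 🟡 `Test_scaleIn_skipsMigratingClaims` duplicates the filtering logic instead of calling `scaleIn()`: if the `scaleIn` implementation changes later and the test is not updated, the test will still pass. This is an acceptable tradeoff, because a full test requires a reconciler context.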
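One way to narrow the gap noted in the 🟡 item would be to keep the filter in a small pure helper that both `scaleIn()` and the test call. The sketch below only illustrates that idea under stated assumptions: `Claim` is a hypothetical trimmed-down type (the real `spec.SourcePod` is likely a richer pointer type), and `scaleInCandidates` is not a function from the PR.

```go
package main

import "fmt"

// Claim is a hypothetical, trimmed-down stand-in for CNClaim.
// SourcePod is non-nil while a migration is in progress.
type Claim struct {
	Name      string
	SourcePod *string
}

// isMigrating mirrors the condition named in the PR: spec.SourcePod != nil.
func isMigrating(c Claim) bool { return c.SourcePod != nil }

// scaleInCandidates returns the claims that may be deleted during
// scale-in, skipping any claim that is mid-migration.
// It is a single pass over owned.
func scaleInCandidates(owned []Claim) []Claim {
	candidates := make([]Claim, 0, len(owned))
	for _, c := range owned {
		if isMigrating(c) {
			continue
		}
		candidates = append(candidates, c)
	}
	return candidates
}

func main() {
	src := "cn-pod-0"
	owned := []Claim{{Name: "idle"}, {Name: "migrating", SourcePod: &src}}
	for _, c := range scaleInCandidates(owned) {
		fmt.Println(c.Name) // prints "idle"
	}
}
```

A test against such a helper needs no reconciler context, and it fails if the production filter changes, which the duplicated-logic test cannot guarantee.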
- `Test_Finalize_releasesLabelWhenPodClaimedByOther`: verifies the Bug 1 fix. When a pod is owned by another claim, Finalize releases the claimed-by label instead of getting stuck.
- `Test_Sync_clearsSpecOnPodNotFound`: verifies the Bug 3 fix. Pod NotFound clears spec.PodName and spec.NodeName, enabling the proper Lost→cleanup flow.
- `Test_watchPodChangeFn_enqueuesClaimBySpecPodName`: verifies the Bug 4 fix. CNClaims referencing a pod via spec.podName get reconciled even when the pod's claimed-by label is absent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
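The Bug 3 behaviour that `Test_Sync_clearsSpecOnPodNotFound` checks can be sketched roughly as follows. This is a simplified model, not the controller's code: the types, the `errNotFound` sentinel, and the `getPod` callback are hypothetical, and a real controller would more likely detect the condition with `apierrors.IsNotFound` from `k8s.io/apimachinery`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel standing in for a Kubernetes NotFound error.
var errNotFound = errors.New("pod not found")

// Hypothetical, trimmed-down claim types.
type ClaimSpec struct{ PodName, NodeName string }
type ClaimStatus struct{ Phase string }
type Claim struct {
	Spec   ClaimSpec
	Status ClaimStatus
}

// syncClaim looks up the Pod the claim references. On NotFound it marks
// the claim Lost and also clears the stale Pod and node references.
func syncClaim(c *Claim, getPod func(name string) error) error {
	if c.Spec.PodName == "" {
		return nil // nothing bound, nothing to check
	}
	err := getPod(c.Spec.PodName)
	if errors.Is(err, errNotFound) {
		c.Status.Phase = "Lost"
		c.Spec.PodName = ""
		c.Spec.NodeName = ""
		return nil
	}
	return err
}

func main() {
	c := &Claim{Spec: ClaimSpec{PodName: "cn-0", NodeName: "node-1"}}
	err := syncClaim(c, func(string) error { return errNotFound })
	fmt.Printf("err=%v phase=%s podName=%q\n", err, c.Status.Phase, c.Spec.PodName)
}
```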
Summary
Fixes 4 bugs covering CNClaim Finalize getting stuck and CNClaimSet scale-in selecting mid-migration claims, resolving the claim mix-ups and infinite retries observed in the dev freetier-01 environment.
The PR fixes all 4 bugs from issue #591 (related to unit-agent#289):
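Related

Changes

Bug 1: Finalize() gets stuck (primary cause)

Problem: When all owned Pods are referenced by another CNClaim (in the migration scenario, `ensureOwnership` overwrites the label), `Finalize()` correctly skips the reclaim but returns `(false, nil)`, so it can never complete.

Fix: On skip, proactively clean up this claim's own leftover `claimed-by` label so that the owning claim manages the Pod normally; once all owned CNs have been processed, return `(true, nil)` to complete finalization.

Bug 2: Scale-in selects mid-migration claims

Problem: `scaleIn()` does not exclude claims with `spec.SourcePod != nil`, so a claim can be deleted halfway through a migration, producing a conflict between Finalize and migration.

Fix: `scaleIn()` filters out claims that are migrating, so they are not scale-in candidates.

Bug 3: Sync() does not clear spec on Pod NotFound

Problem: When the Pod does not exist, only `status.phase=Lost` is set and `spec.podName` is not cleared, so the claim stays stuck in Lost forever with a stale Pod reference.

Fix: On Pod NotFound, also clear `spec.PodName` and `spec.NodeName`.

Bug 4: watchPodChange association is incomplete

Problem: Pod deletion events are associated with a CNClaim only via the `claimed-by` label; if the label has already been cleared during migration/reclaim, the CNClaim is not reconciled.

Fix: `watchPodChange` additionally uses the manager client to look up CNClaims whose `spec.podName` matches, ensuring that all related claims are reconciled when a Pod is deleted.

Tests

- `Test_scaleIn_skipsMigratingClaims`: verifies that mid-migration claims are not selected as scale-in candidates
- `Test_containsRequest`: verifies the reconcile request dedup helper
- `Test_sortClaimsToDelete` and `Test_buildPodClaimIndex` continue to pass

Checklist

- `go build ./pkg/controllers/cnclaim/ ./pkg/controllers/cnclaimset/`
- `go test ./pkg/controllers/cnclaimset/`
- The `-lmo` library cannot be run locally; the CI environment runs it normally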
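The Bug 4 fix amounts to building the reconcile requests for a Pod event from two sources: the `claimed-by` label and any claim whose `spec.podName` names the Pod. The sketch below models that mapping with plain types. Everything in it is an assumption made for illustration: the label key, the `Pod`, `Claim`, and `Request` types, and the map-based dedup (the PR itself mentions a `containsRequest` helper instead).

```go
package main

import "fmt"

// claimedByLabel is a hypothetical label key.
const claimedByLabel = "example.io/claimed-by"

// Request, Pod, and Claim are hypothetical stand-ins for the real API types.
type Request struct{ Namespace, Name string }

type Pod struct {
	Namespace string
	Name      string
	Labels    map[string]string
}

type Claim struct {
	Namespace string
	Name      string
	PodName   string // stands in for spec.podName
}

// requestsForPod returns one request per claim related to pod, found
// either through the claimed-by label or through spec.podName.
func requestsForPod(pod Pod, claimsInNS []Claim) []Request {
	var reqs []Request
	seen := map[Request]bool{}
	add := func(r Request) {
		if !seen[r] {
			seen[r] = true
			reqs = append(reqs, r)
		}
	}
	// Source 1: the label, which may already have been cleared.
	if owner := pod.Labels[claimedByLabel]; owner != "" {
		add(Request{Namespace: pod.Namespace, Name: owner})
	}
	// Source 2: claims that still reference the Pod by name.
	for _, c := range claimsInNS {
		if c.Namespace == pod.Namespace && c.PodName == pod.Name {
			add(Request{Namespace: c.Namespace, Name: c.Name})
		}
	}
	return reqs
}

func main() {
	pod := Pod{Namespace: "ns", Name: "cn-0"} // label already cleared
	claims := []Claim{
		{Namespace: "ns", Name: "claim-a", PodName: "cn-0"},
		{Namespace: "ns", Name: "claim-b", PodName: "cn-1"},
	}
	fmt.Println(requestsForPod(pod, claims))
}
```

In this model a Pod whose label was already removed still yields a request for `claim-a`, which is the case the original label-only lookup missed.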