Fix test_vllm_npu_worker_class_resolves: tolerate version mismatch#1

Open
UsernameFull wants to merge 102 commits into
mainfrom
npu_ci

@UsernameFull

Owner

Fixes `test_vllm_npu_worker_class_resolves` so that it tolerates a version mismatch between the installed `vllm_ascend` and the import path the test expects.
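
For reviewers, one way a class-resolution check can tolerate version differences is to try several candidate import paths and accept the first that resolves. The sketch below is illustrative only: `resolve_first` and the candidate paths are hypothetical and are not taken from this PR's diff, and the real `vllm_ascend` module paths are not shown here.

```python
import importlib


def resolve_first(candidates):
    """Return the first object whose dotted path imports, or None.

    `candidates` is a list of "package.module.Attr" strings (hypothetical
    paths; the real worker-class paths are not shown in this PR page).
    """
    for dotted in candidates:
        module_name, _, attr = dotted.rpartition(".")
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # The module moved or is absent in this version; try the next path.
            continue
        obj = getattr(module, attr, None)
        if obj is not None:
            return obj
    return None
```

A test built on this could skip (rather than fail) when `resolve_first` returns `None`, which keeps CI green across versions while still catching a class that resolves to the wrong object.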

UsernameFull and others added 30 commits January 28, 2026 20:51
Co-Authored-By: chengengru.cgr <chengengru.cgr@taobao.com>
Co-Authored-By: fengjingxuan.fjx <fengjingxuan.fjx@alibaba-inc.com>
Co-Authored-By: ft498870 <ft498870@taobao.com>
Co-Authored-By: heyancheng.hyc <heyancheng.hyc@taobao.com>
Co-Authored-By: hongzhen.yj <hongzhen.yj@alibaba-inc.com>
Co-Authored-By: huangju.hj <huangju.hj@alibaba-inc.com>
Co-Authored-By: jiamang.wang <jiamang.wang@alibaba-inc.com>
Co-Authored-By: scott.lxy <scott.lxy@taobao.com>
Co-Authored-By: shenjingyu.sjy <shenjingyu.sjy@alibaba-inc.com>
Co-Authored-By: shenliao.sla <shenliao.sla@taobao.com>
Co-Authored-By: tianhe.lzd <tianhe.lzd@alibaba-inc.com>
Co-Authored-By: weixun.wwx <weixun.wwx@alibaba-inc.com>
Co-Authored-By: wzy496492 <wzy496492@alibaba-inc.com>
Co-Authored-By: xiongshaopan.xsp <xiongshaopan.xsp@alibaba-inc.com>
Co-Authored-By: xuehuanran.xhr <xuehuanran.xhr@alibaba-inc.com>
Co-Authored-By: zhaohaizhou.zhz <zhaohaizhou.zhz@alibaba-inc.com>
Co-Authored-By: bzd02333762 <bzd02333762@alibaba-inc.com>
Co-authored-by: beiyue.lj <beiyue.lj@alibaba-inc.com>
Co-Authored-By: lt511297 <lt511297@alibaba-inc.com>
Removed the call to upload checkpoint to MOS after saving.
Fix typo: use the correct `group_size` instead of `gropu_size`.
Previously, is_last_step was passed via **kwargs and transparently
forwarded to DeepSpeedEngine.save_checkpoint(), which does not accept
this argument, causing a TypeError at checkpoint time.

Fix by explicitly declaring is_last_step=None in the signature (consistent
with megatron_strategy and fsdp2_strategy), and applying the same
async_upload guard logic as the other strategies.

Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
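
The failure mode described in the commit message above can be reproduced in isolation. This is a minimal sketch, not the project's code: `FakeEngine`, `BuggyStrategy`, and `FixedStrategy` are hypothetical stand-ins, and the `async_upload` guard mentioned in the commit is deliberately left out because its logic is not shown here.

```python
class FakeEngine:
    """Stand-in for an engine whose save_checkpoint rejects unknown kwargs."""

    def __init__(self):
        self.saved = []

    def save_checkpoint(self, save_dir, tag=None):
        self.saved.append((save_dir, tag))


class BuggyStrategy:
    def __init__(self, engine):
        self.engine = engine

    def save_checkpoint(self, save_dir, **kwargs):
        # is_last_step rides along inside kwargs and reaches the engine,
        # which raises TypeError for the unexpected keyword.
        self.engine.save_checkpoint(save_dir, **kwargs)


class FixedStrategy:
    def __init__(self, engine):
        self.engine = engine

    def save_checkpoint(self, save_dir, tag=None, is_last_step=None, **kwargs):
        # Declaring is_last_step explicitly removes it from kwargs,
        # so it is never forwarded to the engine.
        self.engine.save_checkpoint(save_dir, tag=tag, **kwargs)
```

Calling `BuggyStrategy(engine).save_checkpoint(path, is_last_step=True)` raises `TypeError`, while the same call on `FixedStrategy` succeeds because the argument is consumed by the wrapper's signature.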
- Fix socket resource leak in get_node_ip() by properly closing socket
- Replace list comprehension with proper loop in destroy_placement_group() for better error handling

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
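
Both changes in the commit above follow common patterns, sketched here under assumptions. The bodies of `get_node_ip()` and `destroy_placement_group()` are not shown on this page, so the UDP-probe approach, the `127.0.0.1` fallback, and the helper name `destroy_all` with its `remove_fn` parameter are all hypothetical.

```python
import socket


def get_node_ip(probe_addr=("8.8.8.8", 80)):
    """Return the local IPv4 address used to reach probe_addr.

    The `with` block closes the socket even if connect() raises,
    which is the usual fix for a leaked socket.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            # connect() on a UDP socket sends no packets; it only selects a route.
            sock.connect(probe_addr)
            return sock.getsockname()[0]
    except OSError:
        # Assumed fallback for hosts with no route to probe_addr.
        return "127.0.0.1"


def destroy_all(groups, remove_fn):
    """Call remove_fn on every group, collecting failures instead of aborting.

    An explicit loop makes per-item error handling possible, which a
    side-effect-only list comprehension does not.
    """
    failures = []
    for group in groups:
        try:
            remove_fn(group)
        except Exception as exc:
            failures.append((group, exc))
    return failures
```

Returning the failures lets the caller decide whether to log, retry, or raise once every group has been attempted.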
@UsernameFull UsernameFull force-pushed the npu_ci branch 7 times, most recently from 33fd25c to d724d6d Compare May 25, 2026 07:28
@UsernameFull UsernameFull force-pushed the npu_ci branch 4 times, most recently from 51bb5dc to 8c9d56c Compare June 5, 2026 03:31