This repository was archived by the owner on Jun 11, 2026. It is now read-only.
Job Manager
- Support setting different scheduling policies per VC.
  - RF: Runnable first. Large jobs waiting for resources do not block later small jobs.
  - FIFO: First-in first-out based on job queue time. A large job waiting for resources can block later small jobs.
- Support setting max job running time (wall time) per VC. VC admins can adjust the setting for jobs.
- Support limiting number of interactive GPUs per VC.
- Support user global public keys, enabling users to access jobs in any cluster using their own private key.
- Requeue preempted jobs at the head of the job queue.
- Add an INIT process to jobs that handles signal broadcast and zombie process reaping, propagating SIGTERM to the user process.
- Delete very old jobs in small batches to avoid locking DB.
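
The difference between the RF and FIFO policies listed above can be illustrated with a minimal sketch. This is not the Job Manager's actual code: the `Job` fields, the `schedule` function, and the use of a single GPU count as the only resource are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str        # hypothetical identifier
    queue_time: float  # when the job entered the queue
    gpus: int          # GPUs the job requests


def schedule(queue, free_gpus, policy):
    """Return ids of jobs to start, given free GPUs and a policy ("FIFO" or "RF")."""
    if policy not in ("FIFO", "RF"):
        raise ValueError(f"unknown policy: {policy}")
    started = []
    # Both policies consider jobs in order of queue time.
    for job in sorted(queue, key=lambda j: j.queue_time):
        if job.gpus <= free_gpus:
            free_gpus -= job.gpus
            started.append(job.job_id)
        elif policy == "FIFO":
            # FIFO: a job that does not fit blocks everything behind it.
            break
        # RF: skip the job that does not fit and keep scanning.
    return started
```

With 4 free GPUs and a queue holding an 8-GPU job followed by a 1-GPU job, `schedule(..., "FIFO")` starts nothing, while `schedule(..., "RF")` starts the 1-GPU job.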
RESTful API
- Allow specifying max retry count for each job.
- Support changing parameters per VC:
  - Max job time
  - Max number of interactive GPUs
  - Scheduling policy
- Allow adding user IPs to the allowlist.
- VC quota management proportional to GPU/CPU.
Dashboard
- VC notification
- Show worker node count for pure CPU cluster.
- Add timeout column for jobs in View and Manage Jobs.
- Show insight message(s) on job details page for running jobs.
- Show repair message(s) on job details page for running jobs.
- Add Visual Studio Code (alpha) as an endpoint on job details page.
- Allow downloading full job logs.
- Allow specifying max retry count on job submission page.
- Show repair status for worker nodes.
- Show snapshot time on STORAGE tab.
- Support exporting STORAGE tab as CSV.
- Add SETTINGS tab for VC admins to manage VC parameters.
- Add a hidden page for cluster admins to manage VC quota.
- Add My SSH Keys page for users to upload global public keys.
- Add My Allowed IP page for users to add their own IP to the allowlist.
Monitoring and RepairManager
- Fix incorrect mapping for DCGM GPU metrics.
- Auto-manage the repair cycle of nodes according to a predefined set of rules.
- Add a Node Repair State dashboard for repair monitoring.
Storage Manager
- Delete an expired directory file-by-file to avoid locking NFS.
- Take ctime into consideration when expiring files.
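
The two Storage Manager changes above can be sketched roughly as follows. This is a sketch under stated assumptions, not the Storage Manager's implementation: `is_expired` and `expire_directory` are hypothetical names, and treating a file as expired only when both its mtime and ctime are older than the limit is one plausible reading of "take ctime into consideration".

```python
import os
import time


def is_expired(st, now, max_age_seconds):
    # Assumption: use the more recent of mtime and ctime, so a file whose
    # metadata changed recently (e.g. it was moved or chmod-ed) is kept.
    last_touched = max(st.st_mtime, st.st_ctime)
    return now - last_touched > max_age_seconds


def expire_directory(root, max_age_seconds, now=None):
    """Delete expired files under root one at a time; return the removed paths."""
    now = time.time() if now is None else now
    removed = []
    # Walk bottom-up so emptied subdirectories can be pruned afterwards.
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except FileNotFoundError:
                continue  # removed by someone else in the meantime
            if is_expired(st, now, max_age_seconds):
                os.remove(path)  # one small operation per file
                removed.append(path)
        # Prune subdirectories that are empty after the pass (never root itself).
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
    return removed
```

Deleting one file per call keeps each filesystem operation short, which is the point of the file-by-file approach; a real service would likely also throttle between deletions.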
Lustre
- Support default storage quota per person (with configurable hard/soft limit and grace period).
- Support multi-MDT in auto-deployment pipeline.
- Support grouping OSTs into pools and mapping pools to VCs to achieve performance isolation.
(Azure) AllowList Manager
- Periodically compare the current allowed user IPs in DB and in Azure NSG rule, and make changes accordingly.
- Expire user IPs after a specified number of days.
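
The periodic comparison and expiry described above amount to a reconciliation step, which might look like the sketch below. It only computes a plan and makes no Azure calls; the function name `plan_allowlist_sync`, the shape of `db_entries` (IP mapped to the time it was added), and the returned keys are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def plan_allowlist_sync(db_entries, nsg_ips, now, max_age_days):
    """Compute changes needed to bring the NSG rule in line with the DB.

    db_entries: dict mapping IP string -> datetime the IP was added (hypothetical schema)
    nsg_ips:    iterable of IP strings currently present in the NSG rule
    """
    cutoff = now - timedelta(days=max_age_days)
    # IPs added before the cutoff are considered expired.
    active = {ip for ip, added_at in db_entries.items() if added_at >= cutoff}
    expired = set(db_entries) - active
    current = set(nsg_ips)
    return {
        "expire_from_db": sorted(expired),
        "add_to_nsg": sorted(active - current),
        "remove_from_nsg": sorted(current - active),
    }
```

Treating the DB as the source of truth means any IP found in the NSG rule but not active in the DB, whether expired or added out of band, ends up in `remove_from_nsg`.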