From b9bc9c1c5b3ea57a021fcce3920cff1ce5f9430a Mon Sep 17 00:00:00 2001
From: Frank Keller
Date: Wed, 13 May 2026 14:56:03 +0100
Subject: [PATCH] added guide on priority jobs

---
 guides/guide_on_icf_priority_jobs.md | 136 +++++++++++++++++++++++++++
 1 file changed, 136 insertions(+)
 create mode 100644 guides/guide_on_icf_priority_jobs.md

diff --git a/guides/guide_on_icf_priority_jobs.md b/guides/guide_on_icf_priority_jobs.md
new file mode 100644
index 0000000..9dbb499
--- /dev/null
+++ b/guides/guide_on_icf_priority_jobs.md
@@ -0,0 +1,136 @@
+# How to schedule priority jobs on ICF
+
+## Overview of concepts
+
+This is a short guide to scheduling priority jobs on the Informatics
+Compute Facility (ICF). ICF has a free tier and a paid-for tier; the
+jobs in the paid-for tier have higher priority than the jobs in the
+free tier. Higher-priority jobs can preempt free-tier jobs; see below.
+
+As a user, you will be part of an **account**, a group of users that
+are put together for tracking and controlling access to resources. For
+example, all students in the same CDT are typically part of the same
+account.
+
+Furthermore, you will be associated with one or more **quality of service**
+(QOS) levels. A QOS is a policy profile that controls priority, resource
+limits, and preemption rules. The way ICF is set up, a QOS typically
+corresponds to a type of GPU, for example H200 or L40s.
+
+Charging works by the hour; if a job of yours runs for n hours using m
+GPUs, then the account you are part of will be charged for n*m GPU
+hours. The hours are charged separately by GPU type.
+
+Currently, quotas on GPU hours apply only per account, not per user;
+per-user quotas will be introduced in the future. However, all GPU
+usage is already monitored on a per-user basis.
+
+The final concept you need is a **partition**, which is a group of nodes
+on the ICF. A user can have access to multiple partitions, and a node
+can be part of more than one partition.
+
+
+## How preemption works
+
+When you submit a priority job, it may need resources (typically GPUs)
+that are currently in use by free-tier jobs. In this case, your job
+will preempt one or more of those free-tier jobs: they will be
+terminated and placed back at the end of the queue, freeing up the
+resources for your job.
+
+Conversely, if you submit a free-tier job, be aware that it may be
+preempted at any time by a priority job. Any work done by a preempted
+job that has not been saved to disk will be lost! It is therefore good
+practice to checkpoint your work regularly so that a job can be restarted
+without losing too much progress.
+
+
+## Find out which resources are available to you
+
+Start by obtaining a list of all partitions and nodes on ICF:
+
+`sinfo`
+
+As a research student, you will normally use the `ICF-Research` partition.
+
+Now find out which accounts you are part of and which partitions you
+have access to:
+
+`sacctmgr show user $USER withassoc format=user,account%40,partition`
+
+As a research student, you should be part of the `research` account,
+which corresponds to the free tier. If you are also a member of a CDT,
+a research group, or a project that has its own ICF account, then you
+should see one or more additional accounts; these are denoted by
+32-digit hex strings.
+
+To find out which QOSs you are associated with, use:
+
+`sacctmgr show user $USER withassoc format=user,account%40,qos%85,defaultqos`
+
+You will find that an account can be associated with more than one QOS
+(typically corresponding to different GPU types), and that each
+account has a default QOS, which is used if no explicit QOS is
+specified when running a job. The ID of a QOS is also a 32-digit hex
+string. If you are part of a CDT, for example, you will see two QOSs,
+each associated with a GPU type.
+
+To find out more about each QOS, use:
+
+`sacctmgr show qos format=name%40,priority,maxtrespu%25,maxwall`
+
+This will show you the priority, maximum number of GPUs allowed per user,
+and maximum runtime for each QOS.
+
+
+## Check how many resources you have used
+
+To find out how many GPU hours you have used (from a certain start
+date), use:
+
+`sreport cluster AccountUtilizationByUser start=2026-05-12 -t hours -T gres/gpu`
+
+This gives you a report per account. If you want to get a report for a
+specific QOS, use the following:
+
+`sacct --qos=a0465xxx --starttime=2026-05-01 \`\
+` --format=user,account,qos,elapsed,alloctres%85`
+
+where `a0465xxx` is the QOS you want information about.
+
+Note that this command lists individual job records rather than a
+single total. To calculate the total GPU hours used, multiply each
+job's elapsed time (in hours) by the number of GPUs allocated to it,
+and sum the results across all jobs.
+
+
+## Run prioritised jobs
+
+Once you know which resources (accounts and QOSs) are available to
+you, you can run jobs that use these resources. For example, to run
+`job.sh` on account `ee4386xxx` with QOS `a0465xxx` while requesting
+two GPUs, use the following command:
+
+`sbatch \`\
+` --partition=ICF-Research \`\
+` --account=ee4386xxx \`\
+` --qos=a0465xxx \`\
+` --gres=gpu:2 \`\
+` job.sh`
+
+Note that the partition should always be `ICF-Research`, as this is
+where the priority resources reside.
+
+
+## More information
+
+You can find more information on how to use the ICF on the [NLP-RR
+OpenCourse web site](https://opencourse.inf.ed.ac.uk/nlp-rr/tutorials).
+This site includes tutorials, videos, and a quick start guide.
+
+There is also a [GitHub repo with ICF cluster
+scripts](https://github.com/cdt-data-science/cluster-scripts).
+
+You can use the [ICF Dashboard](https://icfwebview.inf.ed.ac.uk/) to
+monitor the load of the cluster in real time. Note that you have to be
+on the Informatics network for the dashboard to work.