🖥️ RAD-HPC Job Submission Guide
High-Performance Computing (HPC) systems rely on Slurm to manage job submissions. To run jobs successfully, you must understand how accounts, QoS, and partitions interact. This guide walks you through the essentials, from checking your access and building submission scripts to troubleshooting errors, understanding fairshare priority, and following best practices. Use it as a quick reference to get your jobs running smoothly.
- Core Concepts: Accounts, QoS, and Partitions
- Checking Your Access: Using `sact_me`
- Accounts → QoS → Partitions: Visual overview
- Submission Script Templates: Examples for each cluster
- Common Errors & Fixes: Quick troubleshooting
- Priority & Fairshare: How jobs are ranked
- Preemption Policy: Who can preempt whom
- Educational Accounts: Limits and usage
- Best Practices: Recommendations for efficient jobs
- Useful Commands: Diagnostics and monitoring
- Admin Appendix: Priority weights & configs
🔑 1. Core Concepts
When you run jobs on RAD-HPC, you must specify three things:
- Account (`--account`): The project/group you're charging usage to.
- QoS (Quality of Service) (`--qos`): Sets limits like CPU/GPU caps, memory, and walltime. Also defines preemption.
- Partition (`--partition` or `-p`): Represents the hardware pool (GPU-DGX, GPU-HGX, and CPU-only nodes with small, medium, and large memory).
📌 Jobs only start if your account, QoS, and partition form a valid combination. In some cases your account has a default QoS that aligns with the target partition, so only the account and partition need to be specified. The same may be true of the account name, in which case only the partition needs to be specified. If you would like defaults set for your user, please open a request with us on the service desk.
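For a quick sanity check that a combination is accepted, you can submit a tiny throwaway job. This is only a sketch using the jakar values shown elsewhere in this guide; substitute the account, QoS, and partition that `sact_me` reports for you.

```bash
# Submit a trivial test job with an explicit account/QoS/partition combination.
# The values below are illustrative; run sact_me to find yours.
sbatch --account=jakar_general \
       --qos=jakar_small_general \
       --partition=small \
       --time=00:05:00 \
       --wrap="hostname"
```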
🔍 2. Checking Your Access (with `sact_me`)
We provide a tool called `sact_me` to show you all of your accounts, their QoS, limits, and which partitions accept them.
Usage
```bash
# Default: show only for the cluster you are logged in to
sact_me

# All clusters where you have associations
sact_me -c all

# Only one cluster
sact_me -c jakar

# Filter to one account
sact_me -a punakha_general

# CSV output (paste into Excel)
sact_me -csv -c all
```
Example Output
```
USER: rocorral    LOCAL CLUSTER: jakar
---------------------------------------------------------------------------------------------------------------------
Account: jakar_general    Cluster: jakar
QoS                   Prio  QoSMaxWall  AssocMaxWall  QoSMaxTRES  AssocMaxTRES  QoSGrpTRES  AssocGrpTRES  Preempt Mode  AllowedPartitions
jakar_medium_general  1     24:00:00    .             .           .             .           cpu=40        REQUEUE       medium
jakar_small_general   1     12:00:00    .             .           .             .           cpu=40        REQUEUE       small
```
What the columns mean
- QoSMaxWall / AssocMaxWall: max runtime allowed at QoS level vs. account level
- QoSMaxTRES / AssocMaxTRES: resource caps (CPUs, GPUs, memory) from QoS vs. account
- QoSGrpTRES / AssocGrpTRES: group-level quotas (total cores, GPUs)
- Preempt / Mode: whether this QoS can preempt others (our site uses REQUEUE)
- AllowedPartitions: partitions that accept this QoS
Important
📌 Slurm enforces the stricter of the QoS and association (account) limits. For example, if QoSMaxWall is 24:00:00 but AssocMaxWall is 12:00:00, your job cannot run longer than 12 hours.
🔗 3. Accounts → QoS → Partitions
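A rough sketch of how the pieces map onto each other, based only on the example combinations used elsewhere in this guide (not exhaustive):

```
punakha_general ── punakha_dgx_general ───> DGX    (GPU)
                └─ punakha_hgx_general ───> HGX    (GPU)
jakar_general   ── jakar_medium_general ──> medium (CPU)
                └─ jakar_small_general ───> small  (CPU)
```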
📝 4. Submission Script Templates
A) Punakha – DGX
```bash
#!/bin/bash
#SBATCH --job-name=dgx-job
#SBATCH --account=punakha_general
#SBATCH --qos=punakha_dgx_general
#SBATCH --partition=DGX
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --output=slurm-%j.out

srun ./gpu_app
```
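To submit and monitor the job once the template is saved (standard Slurm commands; the filename is just an example):

```bash
# Submit the batch script (example filename)
sbatch dgx-job.sbatch

# Watch your queued and running jobs
squeue -u $USER
```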
B) Punakha – HGX
```bash
#SBATCH --account=punakha_general
#SBATCH --qos=punakha_hgx_general
#SBATCH --partition=HGX
```
C) Jakar – Medium
```bash
#SBATCH --account=jakar_general
#SBATCH --qos=jakar_medium_general
#SBATCH --partition=medium
```
D) Jakar – Small
```bash
#SBATCH --account=jakar_general
#SBATCH --qos=jakar_small_general
#SBATCH --partition=small
```
⚠️ 5. Common Errors & Fixes

| Error | Cause | Fix |
|---|---|---|
| `Invalid account or account/partition combination specified` | Partition does not allow any of your QoS | Run `sact_me` and pick a valid partition |
| `Invalid qos for job request` | QoS not linked to your account | Use a QoS listed for your account |
| `PENDING (QOSNotAllowed)` | Partition does not allow that QoS | Switch to a permitted QoS |
| `PENDING (AssocGrpQosLimit)` | Job exceeds CPU/GPU/memory/walltime quota | Reduce request or pick another QoS |
| `PENDING (Priority)` | Waiting behind higher-priority jobs | Check with `sprio -j <jobid>` |
| `PENDING (ReqNodeNotAvail)` | Nodes drained or unavailable | See `sinfo -R` |
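When a job is stuck in PENDING, it can also help to see the reason code for every one of your jobs at once; a small sketch using standard `squeue` format fields:

```bash
# Show job ID, name, state, and the pending reason for your jobs
squeue -u $USER -o "%.10i %.20j %.10T %.30r"
```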
⚖️ 6. Priority & Fairshare
We use multifactor priority with these major factors:
- Fairshare (account-level): Accounts that have used fewer resources recently get higher priority.
- Age: Jobs gain priority the longer they wait.
- QoS/Partition priority: Certain QoS levels (e.g., partner GPU) carry higher weight.

Check your job's priority breakdown:
```bash
sprio -j <jobid>
```
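Beyond a single job, you can compare priorities across everything you have queued; a quick sketch with standard Slurm options:

```bash
# Priority breakdown for every job you own
sprio -u $USER

# Your pending jobs sorted by priority (highest first)
squeue -u $USER -t PENDING --sort=-p
```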
⚔️ 7. Preemption Policy
- PreemptMode = REQUEUE: Jobs that are preempted are stopped and sent back to the queue. If your application does not checkpoint, the job restarts from the beginning.
- Who can preempt whom:
  - Partner QoS (e.g., `jakar_medium_chemh`, `paro_medium_physf`) → can preempt General QoS on medium and large nodes.
  - Education QoS (e.g., `jakar_small_education`) → can preempt General QoS on small nodes.
  - General QoS is lowest priority:
    - Preemptible by Education or Partner as above.
    - Does not preempt others.
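If your code can checkpoint, you can soften preemption by asking Slurm to requeue the job and to send a warning signal before it is stopped. This is a generic sketch, not a site-specific recipe: the signal name, the lead time, and the checkpoint command are assumptions you should adapt to your application.

```bash
#!/bin/bash
#SBATCH --job-name=requeue-safe
#SBATCH --requeue                  # allow Slurm to requeue this job after preemption
#SBATCH --signal=B:USR1@120        # send SIGUSR1 to the batch shell ~120s before the job is stopped
#SBATCH --time=04:00:00

# Hypothetical checkpoint hook: write state so a requeued run can resume.
checkpoint_and_exit() {
    echo "Caught SIGUSR1, checkpointing before preemption..."
    # ./my_app --write-checkpoint checkpoint.dat   # adapt to your application
    exit 0
}
trap checkpoint_and_exit USR1

# Run the application in the background so the trap can fire, then wait on it.
srun ./my_app &
wait
```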
🎓 8. Educational Accounts
- Max walltime = 1 hour.
- Jobs longer than `--time=01:00:00` will be rejected.
- Recommended: `#SBATCH --time=00:59:00`
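Putting the pieces together, a minimal education job sketch for the jakar small partition; the account name is a placeholder and the resource requests are illustrative, so use the values `sact_me` reports for you:

```bash
#!/bin/bash
#SBATCH --job-name=class-exercise
#SBATCH --account=<your_education_account>   # placeholder: use the account sact_me shows for you
#SBATCH --qos=jakar_small_education
#SBATCH --partition=small
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=00:59:00                      # stay under the 1-hour education limit
#SBATCH --output=slurm-%j.out

srun ./exercise
```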
✅ 9. Best Practices
- Always specify `--account`, `--qos`, and `--partition`.
- Run `sact_me` if you're unsure what's valid.
- Request realistic runtimes; avoid padding with excessive hours.
- For GPU jobs: match `--gres=gpu:N` and `--cpus-per-task` to your app's needs (see the sketch after this list).
- For long runs: use checkpointing if available; jobs can be preempted.
- Use `srun` (not `mpirun`) unless required by your application stack.
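To illustrate the GPU-matching and `srun` points above, a hedged fragment for a two-GPU run on the DGX partition; the 8-cores-per-GPU ratio is an assumption, not a site rule:

```bash
#SBATCH --partition=DGX
#SBATCH --gres=gpu:2            # two GPUs
#SBATCH --cpus-per-task=16      # illustrative: ~8 CPU cores per GPU, size to your application

# Launch through srun so Slurm tracks and cleans up the step
srun ./gpu_app
```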
🔧 10. Useful Commands
```bash
# See your associations (accounts + QoS)
sact_me

# Inspect why a job is pending
scontrol show job -dd <jobid>
sprio -j <jobid>

# Check partition status
sinfo -p DGX
sinfo -p medium

# Why are nodes unavailable?
sinfo -R
```
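For jobs that have already finished, Slurm's accounting tools are also useful. The format fields below are just a sensible default, and `seff` is only available if it is installed on your cluster:

```bash
# Resource usage and exit state of a completed job
sacct -j <jobid> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode

# CPU/memory efficiency summary (if the seff utility is installed)
seff <jobid>
```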
📎 Admin Appendix (for transparency)
- Priority weights (from `slurm.conf`); see the formula sketch at the end of this appendix:

```
PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightQOS=5000
PriorityWeightPartition=1000
PriorityWeightJobSize=0
PriorityDecayHalfLife=7-14days
PriorityMaxAge=7-14days
```
- Preemption:
  - Mode = REQUEUE
  - Partner QoS (DGX/HGX) preempt General QoS on medium/large nodes
  - Education QoS preempts General QoS on small nodes
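For context on how these weights combine, Slurm's multifactor plugin computes a weighted sum of per-job factors, each normalized to the range 0 to 1 (simplified here; nice and TRES terms are omitted):

```
priority = PriorityWeightFairshare * fairshare_factor
         + PriorityWeightQOS       * qos_factor
         + PriorityWeightAge       * age_factor
         + PriorityWeightPartition * partition_factor
         + PriorityWeightJobSize   * job_size_factor
```

With the weights above, fairshare dominates (100000), QoS comes next (5000), age and partition contribute smaller amounts (1000 each), and job size is ignored (0).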