## 1. Core Concepts
When you run jobs on RAD-HPC, you must specify three things:
- **Account** (`--account`): the project or group you are charging usage to.
- **QoS (Quality of Service)** (`--qos`): defines limits such as CPU, GPU, memory, walltime, and preemption behavior.
- **Partition** (`--partition` or `-p`): the hardware pool where the job will run.
> **Attention:** Jobs only start if your account, QoS, and partition form a valid combination. In some cases your account already has a default QoS that matches the target partition, and only the account and partition need to be specified; the account may also be implied, in which case only the partition is needed. If you would like your defaults configured, please open a request with the service desk.
## 2. Checking Your Access (with `sact_me`)
We provide a tool called `sact_me` that shows all of your accounts, their QoS values, their limits, and which partitions accept them.
### Usage

```bash
# Default: show only the cluster you are logged in to
sact_me

# All clusters where you have associations
sact_me -c all

# Only one cluster
sact_me -c <cluster_name>

# Filter to one account
sact_me -a <account_name>

# CSV output (paste into Excel)
sact_me -csv -c all
```
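If you want to post-process the CSV output, standard text tools work. A minimal sketch, assuming the CSV mirrors the table layout below with `AllowedPartitions` as the last column (the sample rows are the hypothetical examples from the output in this section):

```sh
#!/bin/sh
# Sketch: keep only association rows that allow a given partition.
# Assumes AllowedPartitions is the last CSV column; with real data you
# would pipe:  sact_me -csv -c all | awk -F',' -v p=partition_a 'NR==1 || $NF ~ p'
printf '%s\n' \
  'Account,QoS,QoSMaxWall,AllowedPartitions' \
  'project_account,example_qos_a,24:00:00,partition_a' \
  'project_account,example_qos_b,12:00:00,partition_b' \
| awk -F',' -v p='partition_a' 'NR==1 || $NF ~ p'
```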
### Example Output
```text
USER: username LOCAL CLUSTER: cluster_name
-------------------------------------------------------------------------------------------------------------------
Account: project_account Cluster: cluster_name
QoS Prio QoSMaxWall AssocMaxWall QoSMaxTRES AssocMaxTRES QoSGrpTRES AssocGrpTRES Preempt Mode AllowedPartitions
example_qos_a 1 24:00:00 . . . . cpu=40 REQUEUE partition_a
example_qos_b 1 12:00:00 . . . . cpu=40 REQUEUE partition_b
```

### What the columns mean

- **QoSMaxWall / AssocMaxWall**: maximum runtime allowed at the QoS level vs. the account (association) level
- **QoSMaxTRES / AssocMaxTRES**: resource caps (CPUs, GPUs, memory) from the QoS vs. the association
- **QoSGrpTRES / AssocGrpTRES**: group-level quotas (total cores, GPUs)
- **Preempt / Mode**: whether this QoS can preempt others, and what happens when preemption occurs
- **AllowedPartitions**: partitions that accept this QoS
> **Attention:** Slurm enforces the stricter of the QoS and association limits.
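In other words, the effective cap is the smaller of the two values. A minimal sketch for walltime limits in `HH:MM:SS` form:

```sh
#!/bin/sh
# Sketch: Slurm applies the stricter (smaller) of the QoS-level and
# association-level walltime limits.
to_secs() { echo "$1" | awk -F':' '{ print $1 * 3600 + $2 * 60 + $3 }'; }
effective_wall() {
  qos_s=$(to_secs "$1"); assoc_s=$(to_secs "$2")
  if [ "$qos_s" -le "$assoc_s" ]; then echo "$1"; else echo "$2"; fi
}
effective_wall 24:00:00 12:00:00   # -> 12:00:00 (the stricter limit wins)
```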
## 3. Accounts → QoS → Partitions

Each account is linked to one or more QoS values, and each QoS is accepted by specific partitions. Run `sact_me` to see the valid combinations for your access.
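Whether a job starts depends on the account/QoS/partition triple being a valid association. As a sketch, using the hypothetical example associations from the `sact_me` output in section 2 (real mappings come from `sact_me`, not from a hardcoded list):

```sh
#!/bin/sh
# Sketch: validity check over example associations. The data below is
# hypothetical, mirroring the sact_me sample output in section 2.
is_valid() {  # usage: is_valid ACCOUNT QOS PARTITION
  case "$1/$2/$3" in
    project_account/example_qos_a/partition_a) return 0 ;;
    project_account/example_qos_b/partition_b) return 0 ;;
    *) return 1 ;;
  esac
}
is_valid project_account example_qos_a partition_a && echo valid
is_valid project_account example_qos_a partition_b || echo invalid
```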
## 4. Submission Script Pattern
The examples below are generic templates. The correct values for your account, QoS, and partition depend on what `sact_me` shows for your access.
### Basic Template

```bash
#!/bin/bash
#SBATCH --job-name=my-job
#SBATCH --account=<account_name>
#SBATCH --qos=<qos_name>
#SBATCH --partition=<partition_name>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=slurm-%j.out

srun ./my_application
```
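On a login node, a script following this template is submitted with `sbatch`. One way to keep the placeholders in one place is to generate the script from shell variables; a sketch (the account, QoS, and partition values below are placeholders, use what `sact_me` reports for you):

```sh
#!/bin/sh
# Sketch: fill the template from variables, then submit the result.
# The three values below are placeholders, not real associations.
ACCOUNT=project_account
QOS=example_qos_a
PARTITION=partition_a
cat > my-job.sh <<EOF
#!/bin/bash
#SBATCH --job-name=my-job
#SBATCH --account=${ACCOUNT}
#SBATCH --qos=${QOS}
#SBATCH --partition=${PARTITION}
#SBATCH --time=01:00:00
#SBATCH --output=slurm-%j.out
srun ./my_application
EOF
# sbatch my-job.sh    # submit (on the cluster)
```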
### CPU Job Example

```bash
#!/bin/bash
#SBATCH --job-name=cpu-job
#SBATCH --account=<account_name>
#SBATCH --qos=<cpu_qos_name>
#SBATCH --partition=<cpu_partition_name>
#SBATCH --cpus-per-task=<cpu_count>
#SBATCH --mem=<memory_amount>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=slurm-%j.out

srun ./my_application
```
### GPU Job Example

```bash
#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --account=<account_name>
#SBATCH --qos=<gpu_qos_name>
#SBATCH --partition=<gpu_partition_name>
#SBATCH --gres=gpu:<gpu_count>
#SBATCH --cpus-per-task=<cpu_count>
#SBATCH --mem=<memory_amount>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=slurm-%j.out

srun ./my_application
```
> **Important**
>
> - Replace placeholders like `<account_name>`, `<qos_name>`, and `<partition_name>` with values you are actually allowed to use.
> - Not every account can use every QoS.
> - Not every QoS is valid on every partition.
> - Always verify with `sact_me` if you are unsure.

Treat these as patterns, not fixed combinations. Your valid options depend on your assigned associations.
## 5. Common Errors & Fixes

| Error | Cause | Fix |
|---|---|---|
| `Invalid account or account/partition combination specified` | The partition does not allow any of your QoS values | Run `sact_me` and pick a valid partition |
| `Invalid qos for job request` | The QoS is not linked to your account | Use a QoS listed for your account |
| `PENDING (QOSNotAllowed)` | The partition does not allow that QoS | Switch to a permitted QoS |
| `PENDING (AssocGrpQosLimit)` | The job exceeds a CPU/GPU/memory/walltime quota | Reduce the request or pick another QoS |
| `PENDING (Priority)` | Waiting behind higher-priority jobs | Check with `sprio -j <jobid>` |
| `PENDING (ReqNodeNotAvail)` | Nodes are drained or unavailable | See `sinfo -R` |
## 6. Priority & Fairshare

We use multifactor priority with these major factors:

- **Fairshare (account-level)**: accounts that have used fewer resources recently get higher priority.
- **Age**: jobs gain priority the longer they wait.
- **QoS / Partition priority**: certain QoS levels may carry higher weight than others.

Check your job's priority breakdown:

```bash
sprio -j <jobid>
```
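These factors combine as a weighted sum. A sketch using the weights from the admin appendix below, with each factor normalized to the 0..1 range (the factor values in the example call are made-up illustrations, not real job data):

```sh
#!/bin/sh
# Sketch: multifactor priority = sum(weight_i * factor_i), factors in 0..1.
# Weights match the admin appendix: Fairshare=100000, Age=1000,
# QOS=5000, Partition=1000.
priority() {  # usage: priority FAIRSHARE AGE QOS PARTITION
  awk -v f="$1" -v a="$2" -v q="$3" -v p="$4" \
    'BEGIN { printf "%d\n", 100000*f + 1000*a + 5000*q + 1000*p }'
}
priority 0.5 1.0 0.2 0.0   # -> 52000
```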
## 7. Preemption Policy

- **PreemptMode = REQUEUE**: preempted jobs are stopped and sent back to the queue. If your application does not checkpoint, the job restarts from the beginning.
- **General preemption behavior**:
  - Higher-priority QoS levels may preempt lower-priority QoS levels.
  - Lower-priority QoS levels do not generally preempt higher-priority ones.
  - Which QoS values can preempt others depends on site policy and the configuration tied to your cluster and account access.
> **Tip:** If your workload is long-running or difficult to restart, checkpointing is strongly recommended whenever supported.
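A common way to survive `REQUEUE` preemption is to have the batch script resume from a checkpoint file when one exists. A minimal sketch; the checkpoint filename and the `--restart` flag are hypothetical, assuming your application supports something like them:

```sh
#!/bin/sh
# Sketch: resume from a checkpoint left by a previous (preempted) run.
# CKPT and the application's restart flag are hypothetical placeholders.
CKPT=checkpoint.dat
if [ -f "$CKPT" ]; then
  echo "resuming from $CKPT"
  # srun ./my_application --restart "$CKPT"
else
  echo "fresh start"
  # srun ./my_application
fi
```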
## 8. Best Practices

- Always specify `--account`, `--qos`, and `--partition` unless your defaults are already configured.
- Run `sact_me` if you are unsure what is valid.
- Request realistic runtimes; avoid padding with excessive hours.
- For GPU jobs, match `--gres=gpu:N` and `--cpus-per-task` to your application's needs.
- For long runs, use checkpointing if available, since jobs may be preempted.
- Use `srun` unless your application stack specifically requires otherwise.
## 9. Useful Commands

```bash
# See your associations (accounts + QoS)
sact_me

# Inspect why a job is pending
scontrol show job -dd <jobid>
sprio -j <jobid>

# Check partition status
sinfo -p <partition_name>

# Why are nodes unavailable?
sinfo -R
```
## Admin Appendix (for transparency)

### Priority weights (from `slurm.conf`)

```text
PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightQOS=5000
PriorityWeightPartition=1000
PriorityWeightJobSize=0
PriorityDecayHalfLife=7-14days
PriorityMaxAge=7-14days
```
### Preemption

- Mode = `REQUEUE`
- Higher-priority QoS levels may preempt lower-priority QoS levels
- Exact preemption relationships depend on cluster policy and configured QoS relationships