πŸ”‘ 1. Core Concepts

When you run jobs on RAD-HPC, you must specify three things:

  • Account (--account)
    The project or group you are charging usage to.
  • QoS (Quality of Service) (--qos)
    Defines limits such as CPU, GPU, memory, walltime, and preemption behavior.
  • Partition (--partition or -p)
    Represents the hardware pool where the job will run.

Attention

πŸ‘‰ Jobs only start if your account, QoS, and partition form a valid combination. Your account may already have a default QoS that matches the target partition, in which case only the account and partition need to be specified; the account may also be implied, leaving only the partition to specify. If you would like your defaults configured, please open a request with the service desk.
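
The same three selections can also be passed directly on the sbatch command line. A quick illustration with placeholder values (job.sh stands for your batch script; substitute names you are actually allowed to use, as shown by the tool in section 2):

```bash
# Fully specified submission
sbatch --account=<account_name> --qos=<qos_name> --partition=<partition_name> job.sh

# If a default account and QoS are already configured for you, this may be enough
sbatch --partition=<partition_name> job.sh
```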


πŸ” 2. Checking Your Access (with sact_me)

We provide a tool called sact_me that lists all of your accounts, the QoS values and limits attached to each, and the partitions that accept each QoS.

Usage

```bash
# Default: show only the cluster you are logged in to
sact_me

# All clusters where you have associations
sact_me -c all

# Only one cluster
sact_me -c <cluster_name>

# Filter to one account
sact_me -a <account_name>

# CSV output (paste into Excel)
sact_me -csv -c all
```

### Example Output

```text
USER: username   LOCAL CLUSTER: cluster_name
-------------------------------------------------------------------------------------------------------------------
Account: project_account    Cluster: cluster_name
QoS                          Prio QoSMaxWall AssocMaxWall QoSMaxTRES    AssocMaxTRES QoSGrpTRES       AssocGrpTRES     Preempt Mode       AllowedPartitions
example_qos_a                1    24:00:00   .            .             .            .                cpu=40           REQUEUE           partition_a
example_qos_b                1    12:00:00   .            .             .            .                cpu=40           REQUEUE           partition_b
```

What the columns mean

  • QoSMaxWall / AssocMaxWall: max runtime allowed at QoS level vs account level
  • QoSMaxTRES / AssocMaxTRES: resource caps (CPUs, GPUs, memory) from QoS vs account
  • QoSGrpTRES / AssocGrpTRES: group-level quotas (total cores, GPUs, etc. in use across all jobs under the QoS or the account)
  • Preempt / Mode: whether this QoS can preempt others and what happens when preemption occurs
  • AllowedPartitions: partitions that accept this QoS

Attention

πŸ‘‰ When both a QoS-level limit and an association (account) limit apply to the same resource, Slurm enforces the stricter of the two. For example, with QoSMaxWall=24:00:00 and AssocMaxWall=12:00:00, the effective limit is 12 hours.


πŸ“Š 3. Accounts β†’ QoS β†’ Partitions

![Accounts to QoS to Partitions](rad_hpc_accounts_qos_partitions.png)


πŸ“„ 4. Submission Script Pattern

The examples below are generic templates. The correct values for your account, QoS, and partition depend on what sact_me shows for your access.

Basic Template

#!/bin/bash
#SBATCH --job-name=my-job
#SBATCH --account=<account_name>
#SBATCH --qos=<qos_name>
#SBATCH --partition=<partition_name>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=slurm-%j.out

srun ./my_application

CPU Job Example

#!/bin/bash
#SBATCH --job-name=cpu-job
#SBATCH --account=<account_name>
#SBATCH --qos=<cpu_qos_name>
#SBATCH --partition=<cpu_partition_name>
#SBATCH --cpus-per-task=<cpu_count>
#SBATCH --mem=<memory_amount>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=slurm-%j.out

srun ./my_application

GPU Job Example

#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --account=<account_name>
#SBATCH --qos=<gpu_qos_name>
#SBATCH --partition=<gpu_partition_name>
#SBATCH --gres=gpu:<gpu_count>
#SBATCH --cpus-per-task=<cpu_count>
#SBATCH --mem=<memory_amount>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=slurm-%j.out

srun ./my_application

Important

  • Replace placeholders like <account_name>, <qos_name>, and <partition_name> with values you are actually allowed to use.
  • Not every account can use every QoS.
  • Not every QoS is valid on every partition.
  • Always verify with sact_me if you are unsure.

πŸ‘‰ Treat these as patterns, not fixed combinations. Your valid options depend on your assigned associations.
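
As an illustration of how a sact_me row translates into submission options, the directives below use the example_qos_a line from the sample output in section 2; project_account, example_qos_a, and partition_a are placeholder names from that example, not real values.

```bash
#SBATCH --account=project_account
#SBATCH --qos=example_qos_a
#SBATCH --partition=partition_a
#SBATCH --time=24:00:00     # at or below QoSMaxWall for example_qos_a
```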


⚠️ 5. Common Errors & Fixes

  • Error: Invalid account or account/partition combination specified
    Cause: the partition does not allow any of your QoS values.
    Fix: run sact_me and pick a valid partition.

  • Error: Invalid qos for job request
    Cause: the QoS is not linked to your account.
    Fix: use a QoS listed for your account.

  • Error: PENDING (QOSNotAllowed)
    Cause: the partition does not allow that QoS.
    Fix: switch to a permitted QoS.

  • Error: PENDING (AssocGrpQosLimit)
    Cause: the job exceeds a CPU/GPU/memory/walltime quota.
    Fix: reduce the request or pick another QoS.

  • Error: PENDING (Priority)
    Cause: the job is waiting behind higher-priority jobs.
    Fix: check with sprio -j <jobid>.

  • Error: PENDING (ReqNodeNotAvail)
    Cause: nodes are drained or unavailable.
    Fix: see sinfo -R.
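
A quick way to see the scheduler's current reason for a pending job is squeue's reason column; the format string below is only one possible layout (on older Slurm versions, replace --me with -u $USER).

```bash
# Job ID, partition, QoS, account, state, and the scheduler's reason for pending jobs
squeue --me --format="%.10i %.12P %.10q %.15a %.10T %.25r"
```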

βš–οΈ 6. Priority & Fairshare

We use multifactor priority with these major factors:

  • Fairshare (account-level): accounts that have used fewer resources recently get higher priority.
  • Age: jobs gain priority the longer they wait.
  • QoS / Partition priority: certain QoS levels or partitions may carry higher weight than others.

Check your job’s priority breakdown:

sprio -j <jobid>
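
Fairshare standing can be inspected with the standard Slurm tool sshare; the invocations below are generic examples, and the exact columns shown may differ per site.

```bash
# Your own fairshare usage and resulting FairShare factor
sshare

# A specific account, including all of its users
sshare -A <account_name> -a
```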

βš”οΈ 7. Preemption Policy

  • PreemptMode = REQUEUE: preempted jobs are stopped and sent back to the queue. If your application does not checkpoint, the job restarts from the beginning.

  • General preemption behavior:
      • Higher-priority QoS levels may preempt lower-priority QoS levels.
      • Lower-priority QoS levels do not generally preempt higher-priority ones.
      • Which QoS values can preempt others depends on site policy and the configuration tied to your cluster and account access.

Tip

πŸ‘‰ If your workload is long-running or difficult to restart, checkpointing is strongly recommended whenever supported.
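
A minimal sketch of a requeue-friendly batch script, assuming your application can resume from a checkpoint file it writes periodically; the --resume flag and checkpoint.dat filename are hypothetical and stand in for whatever restart mechanism your application provides.

```bash
#!/bin/bash
#SBATCH --job-name=restartable-job
#SBATCH --account=<account_name>
#SBATCH --qos=<qos_name>
#SBATCH --partition=<partition_name>
#SBATCH --time=<HH:MM:SS>
#SBATCH --requeue              # allow Slurm to put the job back in the queue after preemption
#SBATCH --open-mode=append     # append to the same output file across restarts
#SBATCH --output=slurm-%j.out

# Hypothetical restart logic: resume from the latest checkpoint if one exists
if [ -f checkpoint.dat ]; then
    srun ./my_application --resume checkpoint.dat
else
    srun ./my_application
fi
```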


βœ… 8. Best Practices

  • Always specify --account, --qos, and --partition unless your defaults are already configured.
  • Run sact_me if you are unsure what is valid.
  • Request realistic runtimes β€” avoid padding with excessive hours.
  • For GPU jobs, match --gres=gpu:N and --cpus-per-task to your application’s needs.
  • For long runs, use checkpointing if available, since jobs may be preempted.
  • Use srun unless your application stack specifically requires otherwise.

πŸ”§ 9. Useful Commands

# See your associations (accounts + QoS)
sact_me

# Inspect why a job is pending
scontrol show job -dd <jobid>
sprio -j <jobid>

# Check partition status
sinfo -p <partition_name>

# Why are nodes unavailable?
sinfo -R
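
If you want to see the underlying association data that sact_me summarizes, sacctmgr can list it directly, although the output is less readable; the format fields below are standard sacctmgr association fields.

```bash
# Raw account / partition / QoS associations from the Slurm accounting database
sacctmgr show assoc user=$USER format=Cluster,Account,Partition,QOS,MaxWall,GrpTRES
```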

πŸ“Œ Admin Appendix (for transparency)

Priority weights (from slurm.conf)

PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightQOS=5000
PriorityWeightPartition=1000
PriorityWeightJobSize=0
PriorityDecayHalfLife=7-14days    # shown as a range; the exact value depends on the cluster
PriorityMaxAge=7-14days           # shown as a range; the exact value depends on the cluster
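
For context, the multifactor plugin combines these weights linearly: each factor (fairshare, age, QoS, partition) is normalized to a value between 0 and 1, multiplied by its weight, and the results are summed. The factor values below are made up purely to illustrate the arithmetic:

```text
Job_priority = (PriorityWeightFairshare * fairshare_factor)
             + (PriorityWeightAge       * age_factor)
             + (PriorityWeightQOS       * qos_factor)
             + (PriorityWeightPartition * partition_factor)

  100000 * 0.50 = 50000   (fairshare)
+   1000 * 0.25 =   250   (age)
+   5000 * 1.00 =  5000   (QoS)
+   1000 * 1.00 =  1000   (partition)
-------------------------
  Job_priority  = 56250
```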

Preemption

  • Mode = REQUEUE
  • Higher-priority QoS levels may preempt lower-priority QoS levels
  • Exact preemption relationships depend on cluster policy and configured QoS relationships