πŸ”‘ 1. Core Concepts

When you run jobs on RAD-HPC, you must specify three things:

  • Account (--account)
    The project or group you are charging usage to.
  • QoS (Quality of Service) (--qos)
    Defines limits such as CPU, GPU, memory, walltime, and preemption behavior.
  • Partition (--partition or -p)
    Represents the hardware pool where the job will run.

Attention

πŸ‘‰ Jobs only start if your account, QoS, and partition form a valid combination. Your account may already have a default QoS that matches the target partition, in which case only the account and partition need to be specified; the account may also be implied, leaving only the partition to specify. If you would like your defaults configured, please open a request with the service desk.
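
The same three selections can also be passed directly on the sbatch command line. A quick illustration with placeholder values (job.sh stands for your batch script; substitute names you are actually allowed to use, as shown by the tool in section 2):

```bash
# Fully specified submission
sbatch --account=<account_name> --qos=<qos_name> --partition=<partition_name> job.sh

# If a default account and QoS are already configured for you, this may be enough
sbatch --partition=<partition_name> job.sh
```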


πŸ” 2. Checking Your Access (with sact_me)

We provide a tool called sact_me that lists all of your accounts, the QoS values and limits attached to each, and the partitions that accept each QoS.

Usage

```bash
# Default: show only the cluster you are logged in to
sact_me

# All clusters where you have associations
sact_me -c all

# Only one cluster
sact_me -c <cluster_name>

# Filter to one account
sact_me -a <account_name>

# CSV output (paste into Excel)
sact_me -csv -c all
```

### Example Output

```text
USER: username   LOCAL CLUSTER: cluster_name
-------------------------------------------------------------------------------------------------------------------
Account: project_account    Cluster: cluster_name
QoS                          Prio QoSMaxWall AssocMaxWall QoSMaxTRES    AssocMaxTRES QoSGrpTRES       AssocGrpTRES     Preempt Mode       AllowedPartitions
example_qos_a                1    24:00:00   .            .             .            .                cpu=40           REQUEUE           partition_a
example_qos_b                1    12:00:00   .            .             .            .                cpu=40           REQUEUE           partition_b
```

What the columns mean

  • QoSMaxWall / AssocMaxWall: max runtime allowed at QoS level vs account level
  • QoSMaxTRES / AssocMaxTRES: resource caps (CPUs, GPUs, memory) from QoS vs account
  • QoSGrpTRES / AssocGrpTRES: group-level quotas (total cores, GPUs, etc. in use across all jobs under the QoS or the account)
  • Preempt / Mode: whether this QoS can preempt others and what happens when preemption occurs
  • AllowedPartitions: partitions that accept this QoS

Attention

πŸ‘‰ When both a QoS-level limit and an association (account) limit apply to the same resource, Slurm enforces the stricter of the two. For example, with QoSMaxWall=24:00:00 and AssocMaxWall=12:00:00, the effective limit is 12 hours.


πŸ“Š 3. Accounts β†’ QoS β†’ Partitions

![Accounts to QoS to Partitions](rad_hpc_accounts_qos_partitions.png)


πŸ“„ 4. Submission Script Pattern

The examples below are generic templates. The correct values for your account, QoS, and partition depend on what sact_me shows for your access.

Basic Template

#!/bin/bash
#SBATCH --job-name=my-job
#SBATCH --account=<account_name>
#SBATCH --qos=<qos_name>
#SBATCH --partition=<partition_name>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=slurm-%j.out

srun ./my_application

CPU Job Example

#!/bin/bash
#SBATCH --job-name=cpu-job
#SBATCH --account=<account_name>
#SBATCH --qos=<cpu_qos_name>
#SBATCH --partition=<cpu_partition_name>
#SBATCH --cpus-per-task=<cpu_count>
#SBATCH --mem=<memory_amount>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=slurm-%j.out

srun ./my_application

GPU Job Example

#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --account=<account_name>
#SBATCH --qos=<gpu_qos_name>
#SBATCH --partition=<gpu_partition_name>
#SBATCH --gres=gpu:<gpu_count>
#SBATCH --cpus-per-task=<cpu_count>
#SBATCH --mem=<memory_amount>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=slurm-%j.out

srun ./my_application

Important

  • Replace placeholders like <account_name>, <qos_name>, and <partition_name> with values you are actually allowed to use.
  • Not every account can use every QoS.
  • Not every QoS is valid on every partition.
  • Always verify with sact_me if you are unsure.

πŸ‘‰ Treat these as patterns, not fixed combinations. Your valid options depend on your assigned associations.
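
As an illustration of how a sact_me row translates into submission options, the directives below use the example_qos_a line from the sample output in section 2; project_account, example_qos_a, and partition_a are placeholder names from that example, not real values.

```bash
#SBATCH --account=project_account
#SBATCH --qos=example_qos_a
#SBATCH --partition=partition_a
#SBATCH --time=24:00:00     # at or below QoSMaxWall for example_qos_a
```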


⚠️ 5. Common Errors & Fixes

  • Error: Invalid account or account/partition combination specified
    Cause: the partition does not allow any of your QoS values.
    Fix: run sact_me and pick a valid partition.

  • Error: Invalid qos for job request
    Cause: the QoS is not linked to your account.
    Fix: use a QoS listed for your account.

  • Error: PENDING (QOSNotAllowed)
    Cause: the partition does not allow that QoS.
    Fix: switch to a permitted QoS.

  • Error: PENDING (AssocGrpQosLimit)
    Cause: the job exceeds a CPU/GPU/memory/walltime quota.
    Fix: reduce the request or pick another QoS.

  • Error: PENDING (Priority)
    Cause: the job is waiting behind higher-priority jobs.
    Fix: check with sprio -j <jobid>.

  • Error: PENDING (ReqNodeNotAvail)
    Cause: nodes are drained or unavailable.
    Fix: see sinfo -R.
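
A quick way to see the scheduler's current reason for a pending job is squeue's reason column; the format string below is only one possible layout (on older Slurm versions, replace --me with -u $USER).

```bash
# Job ID, partition, QoS, account, state, and the scheduler's reason for pending jobs
squeue --me --format="%.10i %.12P %.10q %.15a %.10T %.25r"
```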

βš–οΈ 6. Priority & Fairshare

We use multifactor priority with these major factors:

  • Fairshare (account-level): accounts that have used fewer resources recently get higher priority.
  • Age: jobs gain priority the longer they wait.
  • QoS / Partition priority: certain QoS levels or partitions may carry higher weight than others.

Check your job’s priority breakdown:

sprio -j <jobid>
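
Fairshare standing can be inspected with the standard Slurm tool sshare; the invocations below are generic examples, and the exact columns shown may differ per site.

```bash
# Your own fairshare usage and resulting FairShare factor
sshare

# A specific account, including all of its users
sshare -A <account_name> -a
```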

βš”οΈ 7. Preemption Policy

  • PreemptMode = REQUEUE: preempted jobs are stopped and sent back to the queue. If your application does not checkpoint, the job restarts from the beginning.

  • General preemption behavior:
      • Higher-priority QoS levels may preempt lower-priority QoS levels.
      • Lower-priority QoS levels do not generally preempt higher-priority ones.
      • Which QoS values can preempt others depends on site policy and the configuration tied to your cluster and account access.

Tip

πŸ‘‰ If your workload is long-running or difficult to restart, checkpointing is strongly recommended whenever supported.
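
A minimal sketch of a requeue-friendly batch script, assuming your application can resume from a checkpoint file it writes periodically; the --resume flag and checkpoint.dat filename are hypothetical and stand in for whatever restart mechanism your application provides.

```bash
#!/bin/bash
#SBATCH --job-name=restartable-job
#SBATCH --account=<account_name>
#SBATCH --qos=<qos_name>
#SBATCH --partition=<partition_name>
#SBATCH --time=<HH:MM:SS>
#SBATCH --requeue              # allow Slurm to put the job back in the queue after preemption
#SBATCH --open-mode=append     # append to the same output file across restarts
#SBATCH --output=slurm-%j.out

# Hypothetical restart logic: resume from the latest checkpoint if one exists
if [ -f checkpoint.dat ]; then
    srun ./my_application --resume checkpoint.dat
else
    srun ./my_application
fi
```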


βœ… 8. Best Practices

  • Always specify --account, --qos, and --partition unless your defaults are already configured.
  • Run sact_me if you are unsure what is valid.
  • Request realistic runtimes β€” avoid padding with excessive hours.
  • For GPU jobs, match --gres=gpu:N and --cpus-per-task to your application’s needs.
  • For long runs, use checkpointing if available, since jobs may be preempted.
  • Use srun unless your application stack specifically requires otherwise.

πŸ”§ 9. Useful Commands

# See your associations (accounts + QoS)
sact_me

# Inspect why a job is pending
scontrol show job -dd <jobid>
sprio -j <jobid>

# Check partition status
sinfo -p <partition_name>

# Why are nodes unavailable?
sinfo -R
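
If you want to see the underlying association data that sact_me summarizes, sacctmgr can list it directly, although the output is less readable; the format fields below are standard sacctmgr association fields.

```bash
# Raw account / partition / QoS associations from the Slurm accounting database
sacctmgr show assoc user=$USER format=Cluster,Account,Partition,QOS,MaxWall,GrpTRES
```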

πŸ“Œ Admin Appendix (for transparency)

Priority weights (from slurm.conf)

PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightQOS=5000
PriorityWeightPartition=1000
PriorityWeightJobSize=0
PriorityDecayHalfLife=7-14days    # shown as a range; the exact value depends on the cluster
PriorityMaxAge=7-14days           # shown as a range; the exact value depends on the cluster
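
For context, the multifactor plugin combines these weights linearly: each factor (fairshare, age, QoS, partition) is normalized to a value between 0 and 1, multiplied by its weight, and the results are summed. The factor values below are made up purely to illustrate the arithmetic:

```text
Job_priority = (PriorityWeightFairshare * fairshare_factor)
             + (PriorityWeightAge       * age_factor)
             + (PriorityWeightQOS       * qos_factor)
             + (PriorityWeightPartition * partition_factor)

  100000 * 0.50 = 50000   (fairshare)
+   1000 * 0.25 =   250   (age)
+   5000 * 1.00 =  5000   (QoS)
+   1000 * 1.00 =  1000   (partition)
-------------------------
  Job_priority  = 56250
```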

Preemption

  • Mode = REQUEUE
  • Higher-priority QoS levels may preempt lower-priority QoS levels
  • Exact preemption relationships depend on cluster policy and configured QoS relationships