# 💤 RAD-HPC Standby QoS Guide

Best-Effort Jobs · Opportunistic Scheduling · Lower Priority · Preemptible Workloads

## 🔑 1. Core Concept

The standby QoS is a best-effort scheduling tier intended for jobs that can run opportunistically when cluster resources would otherwise remain idle.

Jobs submitted under standby are treated as lower-priority workloads. They are useful for work that is flexible, non-urgent, and tolerant of delay or interruption.

👉 Standby is not intended for time-sensitive or production-critical jobs. It exists to improve cluster utilization while preserving priority for normal and higher-priority work.

## 🎯 2. What Standby QoS Is For

The standby QoS is designed to:

* increase utilization of otherwise idle resources
* provide a scheduling option for low-priority workloads
* separate opportunistic jobs from standard production jobs
* allow flexible workloads to run without competing equally with normal access tiers

Typical use cases include:

* development and testing
* exploratory runs
* low-priority parameter sweeps
* checkpoint-capable jobs
* long-running work with no strict deadline

👉 If your workload must start quickly, run uninterrupted, or meet a deadline, standby is probably not the right QoS.

## ⚙️ 3. Expected Behavior

Jobs submitted with the standby QoS may behave differently from standard jobs.

Typical characteristics include:

* lower scheduling priority
* longer queue wait times
* access to otherwise idle resources
* possible preemption by higher-priority workloads
* stricter limits depending on site policy

This means that standby jobs should be treated as opportunistic rather than guaranteed.

👉 Users should assume that standby jobs may wait longer and may need to restart depending on scheduler policy and workload pressure.

## 📄 4. Submission Example

Example Slurm submission using standby:

```bash
#!/bin/bash
#SBATCH --job-name=standby-job
#SBATCH --account=<account_name>
#SBATCH --qos=standby
#SBATCH --partition=<partition_name>
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=slurm-%j.out

srun ./my_application
```

Replace:

* `<account_name>` with your permitted account
* `<partition_name>` with a partition that allows the standby QoS

If your account or partition defaults are already configured, fewer options may be required.

---

## 🔍 5. Checking Whether You Can Use Standby

Before submitting, confirm that your **account**, **QoS**, and **partition** are a valid combination.

We provide `sact_me` to show your available accounts, QoS values, limits, and permitted partitions.

### Usage

```bash
# Default: show only for the cluster you are logged in to
sact_me

# All clusters where you have associations
sact_me -c all

# Filter to one account
sact_me -a <account_name>

# CSV output
sact_me -csv -c all
```

Check whether:

* your account includes the standby QoS
* the target partition accepts standby
* any walltime or resource limits apply

👉 Jobs only start if your account, QoS, and partition are a valid combination.
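
The partition side of this check can also be approximated with standard Slurm tools. The following is a minimal sketch, assuming that `scontrol show partition` on your site reports an `AllowQos=` field (sites that use `DenyQos` instead would need a different check). The helper name `partition_allows_qos` is hypothetical and is not part of `sact_me`.

```bash
# Hypothetical helper: reads `scontrol show partition <name>` output on stdin
# and succeeds if the given QoS is listed in AllowQos (or AllowQos=ALL).
# Return codes: 0 = allowed, 1 = not listed, 2 = no AllowQos field found.
partition_allows_qos() {
    local qos="$1" allowed
    # Extract the value of the first AllowQos= field, e.g. "ALL" or "normal,standby".
    allowed=$(grep -o 'AllowQos=[^ ]*' | head -n 1 | cut -d= -f2)
    [ -n "$allowed" ] || return 2
    [ "$allowed" = "ALL" ] && return 0
    # Wrap both sides in commas so that only whole QoS names match.
    case ",$allowed," in
        *",$qos,"*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Usage on a login node (the partition name is a placeholder):
#   scontrol show partition <partition_name> | partition_allows_qos standby \
#       && echo "standby is allowed on this partition"
```

This only covers the partition. Whether your account is linked to the standby QoS still has to be confirmed with `sact_me`.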

## ⚠️ 6. Important Limitations

Standby QoS is intentionally limited in order to protect regular cluster usage.

Depending on configuration, standby may have:

* lower priority than normal QoS tiers
* lower walltime limits
* tighter CPU, memory, or GPU caps
* preemptible status
* access restricted to certain accounts or partitions

Because of this, standby should not be used for:

* urgent jobs
* production workflows with fixed deadlines
* jobs that cannot tolerate interruption
* workloads that require guaranteed start times

## ⚔️ 7. Preemption Behavior

If standby is configured as preemptible, higher-priority jobs may displace standby jobs when resources are needed.

This usually means:

* standby jobs run when resources are free
* higher-priority jobs take precedence
* standby jobs may be stopped and requeued, depending on the configured preemption mode

If your application does not checkpoint, interruption may require it to restart from the beginning.

👉 Checkpointing is strongly recommended for long-running standby jobs whenever the application supports it.
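
A restartable standby job can be sketched as follows. This is an illustration under stated assumptions, not site policy: it assumes the application periodically writes its own checkpoint file and can resume from it. The file name `checkpoint.dat` and the `--resume` / `--fresh` application flags are hypothetical placeholders. `--requeue`, `--open-mode=append`, `SLURM_JOB_ID`, and `SLURM_RESTART_COUNT` are standard Slurm features, but whether a preempted job is actually requeued depends on how preemption is configured on the cluster.

```bash
#!/bin/bash
#SBATCH --job-name=standby-ckpt
#SBATCH --qos=standby
#SBATCH --time=04:00:00
#SBATCH --requeue              # allow Slurm to put the job back in the queue
#SBATCH --open-mode=append     # append to the existing output file after a restart

# Hypothetical checkpoint file that the application updates periodically.
CKPT_FILE="${CKPT_FILE:-checkpoint.dat}"

# Build the application arguments: resume when a non-empty checkpoint exists.
# --resume and --fresh stand in for your application's own options.
app_args() {
    if [ -s "$CKPT_FILE" ]; then
        printf '%s\n' "--resume $CKPT_FILE"
    else
        printf '%s\n' "--fresh"
    fi
}

# Launch only when running inside a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    # SLURM_RESTART_COUNT is set by Slurm once the job has been requeued.
    echo "Restart count: ${SLURM_RESTART_COUNT:-0}"
    # Unquoted on purpose so the flag and the file name become separate arguments.
    srun ./my_application $(app_args)
fi
```

Because the resume decision is made from the checkpoint file alone, the same script works for the first start and for every restart after a preemption.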

## ⚖️ 8. Priority and Scheduling

Standby sits below the normal scheduling tiers. Treat it as a lower-priority lane for flexible work.

In practice, this means:

* normal and priority workloads are served first
* standby jobs may remain pending even when they are otherwise valid
* idle capacity may be used by standby jobs when available
* the scheduler may reclaim those resources later for higher-priority work

Use the following to inspect priority behavior:

```bash
sprio -j <jobid>
scontrol show job -dd <jobid>
```
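
To watch all of your standby jobs at once instead of inspecting them one by one, a small wrapper around `squeue` can help. This is a sketch: the function name `standby_queue` is hypothetical, while the `--user`, `--qos`, and `--format` options and the `%i`, `%T`, `%Q`, `%r`, and `%S` format fields are standard `squeue` features.

```bash
# Hypothetical wrapper: list a user's standby jobs with job ID, state,
# priority, pending reason, and actual or expected start time.
standby_queue() {
    local user="${1:-$USER}"
    squeue --user="$user" --qos=standby \
           --format='%.12i %.10T %.10Q %.20r %.20S'
}

# Usage:
#   standby_queue            # your own standby jobs
#   standby_queue <username> # another user's standby jobs
```

For pending jobs the start time column is only an estimate, and it may show `N/A` when the scheduler has not computed one.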

## ✅ 9. Best Practices

* Use standby only for non-urgent work
* Keep jobs restartable when possible
* Use checkpointing if your software supports it
* Request only the resources you actually need
* Avoid padding walltime unnecessarily
* Confirm valid account/QoS/partition combinations with `sact_me`

For flexible workloads, standby can be a very effective way to consume otherwise unused capacity.
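
One way to check for padded walltime is to compare what a finished job used against what it requested. The sketch below assumes durations in the `[D-]HH:MM:SS` form that `sacct` prints for its `Elapsed` and `Timelimit` fields. Both function names are hypothetical, and values such as `UNLIMITED` are rejected rather than interpreted.

```bash
# Convert a Slurm duration of the form [D-]HH:MM:SS to seconds.
# Exits non-zero for anything else (for example "UNLIMITED").
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4; next }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3; next }
        { exit 1 }'
}

# Percentage of the requested walltime that a job actually used
# (integer division, so the result is rounded down).
walltime_used_percent() {
    local elapsed limit
    elapsed=$(slurm_time_to_seconds "$1") || return 1
    limit=$(slurm_time_to_seconds "$2") || return 1
    [ "$limit" -gt 0 ] || return 1
    echo $(( 100 * elapsed / limit ))
}

# Usage with accounting data for a finished job:
#   sacct -j <jobid> -X --noheader --format=Elapsed,Timelimit
#   walltime_used_percent 01:00:00 04:00:00
```

A job that consistently uses a small fraction of its time limit can usually request less, which may also make it easier for the scheduler to fit into idle gaps.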

## ❌ 10. Common Errors & Fixes

| Error | Cause | Fix |
| --- | --- | --- |
| `Invalid account or account/partition combination specified` | Partition does not allow standby for your account | Run `sact_me` and verify the valid combination |
| `Invalid qos for job request` | Your account is not linked to standby | Use a permitted QoS or request access |
| `PENDING (QOSNotAllowed)` | Target partition does not accept standby | Submit to a partition that allows it |
| `PENDING (Priority)` | Higher-priority jobs are ahead of your standby job | Wait, or use a standard QoS if appropriate |
| `PENDING (AssocGrpQosLimit)` | Your request exceeds standby limits | Reduce CPUs, memory, GPUs, or walltime |
| Job requeued unexpectedly | Standby job was preempted | Resubmit or use checkpointing if supported |
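
The lookup in the table above can be scripted for use in wrapper tools. This is only a sketch: `standby_hint` is a hypothetical helper whose messages restate the table, and it matches on substrings of the error text or pending reason.

```bash
# Hypothetical helper: map an sbatch error message or a pending reason
# to the fix suggested in the table above. Returns 1 if nothing matches.
standby_hint() {
    case "$1" in
        *"Invalid account"*) echo "Run sact_me and verify the account/QoS/partition combination" ;;
        *"Invalid qos"*)     echo "Use a permitted QoS or request access to standby" ;;
        *QOSNotAllowed*)     echo "Submit to a partition that allows standby" ;;
        *AssocGrpQosLimit*)  echo "Reduce CPUs, memory, GPUs, or walltime" ;;
        *Priority*)          echo "Wait, or use a standard QoS if appropriate" ;;
        *)                   echo "No standby-specific hint available"; return 1 ;;
    esac
}

# Usage with the pending reason of a queued job:
#   standby_hint "$(squeue -j <jobid> --noheader --format=%r)"
```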

## 🔧 11. Useful Commands

```bash
# See your associations and QoS values
sact_me

# Inspect why a job is pending
scontrol show job -dd <jobid>
sprio -j <jobid>

# Check partition status
sinfo -p <partition_name>

# Why are nodes unavailable?
sinfo -R
```

## 📌 12. Bottom Line

The standby QoS is a controlled way to let low-priority jobs use idle resources without interfering with standard scheduling policy.

It is best suited for:

* flexible workloads
* interruption-tolerant jobs
* non-urgent experimentation
* best-effort compute consumption

👉 Think of standby as an opportunistic service tier, not a guaranteed scheduling path.