# RAD-HPC Standby QoS Guide

Best-Effort Jobs · Opportunistic Scheduling · Lower Priority · Preemptible Workloads

## 1. Core Concept
The standby QoS is a best-effort scheduling tier intended for jobs that can run opportunistically when cluster resources would otherwise remain idle.
Jobs submitted under standby are treated as lower-priority workloads. They are useful for work that is flexible, non-urgent, and tolerant of delay or interruption.
Standby is not intended for time-sensitive or production-critical jobs. It exists to improve cluster utilization while preserving priority for normal and higher-priority work.

## 2. What Standby QoS Is For
The standby QoS is designed to:
- increase utilization of otherwise idle resources
- provide a scheduling option for low-priority workloads
- separate opportunistic jobs from standard production jobs
- allow flexible workloads to run without competing equally with normal access tiers
Typical use cases include:
- development and testing
- exploratory runs
- low-priority parameter sweeps
- checkpoint-capable jobs
- long-running work with no strict deadline

If your workload must start quickly, run uninterrupted, or meet a deadline, standby is probably not the right QoS.

## 3. Expected Behavior
Jobs submitted with the standby QoS may behave differently from standard jobs.
Typical characteristics include:
- lower scheduling priority
- longer queue wait times
- access to otherwise idle resources
- possible preemption by higher-priority workloads
- stricter limits depending on site policy
This means that standby jobs should be treated as opportunistic rather than guaranteed.
Assume that standby jobs may wait longer and may need to restart, depending on scheduler policy and workload pressure.

## 4. Submission Example
Example Slurm submission using standby:
```bash
#!/bin/bash
#SBATCH --job-name=standby-job
#SBATCH --account=<account_name>
#SBATCH --qos=standby
#SBATCH --partition=<partition_name>
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=slurm-%j.out

srun ./my_application
```
Replace:
* `<account_name>` with your permitted account
* `<partition_name>` with a partition that allows the standby QoS
If your account or partition defaults are already configured, fewer options may be required.
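
Options given on the `sbatch` command line take precedence over `#SBATCH` directives in the script, so an existing job script can be sent to standby without editing it. The following is a minimal sketch, assuming a hypothetical helper named `submit_standby` and a hypothetical `DRY_RUN` variable; neither is part of Slurm or of this site's tooling.

```bash
# Hypothetical helper: submit a job script with the standby QoS forced on
# the command line. Extra sbatch options can follow the script name.
submit_standby() {
    local script="$1"
    shift
    local cmd=(sbatch --qos=standby "$@" "$script")
    if [[ "${DRY_RUN:-0}" == 1 ]]; then
        # Print the command instead of running it.
        printf '%s\n' "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}
```

For example, `DRY_RUN=1 submit_standby job.sh --time=01:00:00` prints the command that would be run. Adding `--test-only` asks `sbatch` to validate the request and estimate a start time without submitting a job.
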
---
## 5. Checking Whether You Can Use Standby
Before submitting, confirm that your **account**, **QoS**, and **partition** are a valid combination.
We provide `sact_me` to show your available accounts, QoS values, limits, and permitted partitions.
### Usage
```bash
# Default: show only for the cluster you are logged in to
sact_me

# All clusters where you have associations
sact_me -c all

# Filter to one account
sact_me -a <account_name>

# CSV output
sact_me -csv -c all
```

Check whether:
- your account includes the standby QoS
- the target partition accepts standby
- any walltime or resource limits apply

Jobs only start if your account, QoS, and partition are a valid combination.
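
As a cross-check alongside `sact_me`, the same association data can usually be read from the standard Slurm accounting tool `sacctmgr`. The sketch below assumes pipe-delimited input in the order `account|partition|qos_list`; the helper name `accounts_with_qos` is hypothetical.

```bash
# Hypothetical helper: read "account|partition|qos_list" lines on stdin and
# print each account whose comma-separated QoS list contains the given QoS.
accounts_with_qos() {
    local qos="$1"
    awk -F'|' -v q="$qos" '{
        n = split($3, list, ",")
        for (i = 1; i <= n; i++) {
            if (list[i] == q) { print $1; break }
        }
    }'
}

# On a cluster (not run here), the input would typically come from:
#   sacctmgr -nP show assoc user="$USER" format=account,partition,qos \
#       | accounts_with_qos standby
```

An account with several partition associations may be printed more than once; pipe the result through `sort -u` if that matters.
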
## 6. Important Limitations
Standby QoS is intentionally limited in order to protect regular cluster usage.
Depending on configuration, standby may have:
- lower priority than normal QoS tiers
- lower walltime limits
- tighter CPU, memory, or GPU caps
- preemptible status
- access restricted to certain accounts or partitions
Because of this, standby should not be used for:
- urgent jobs
- production workflows with fixed deadlines
- jobs that cannot tolerate interruption
- workloads that require guaranteed start times
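
To see which of these limits actually apply, the QoS definition can usually be queried with `sacctmgr`. This is a sketch only: the helper name `describe_qos` is hypothetical, and it assumes one pipe-delimited record with the fields `name|priority|preemptmode|maxwall` in that order.

```bash
# Hypothetical helper: turn one pipe-delimited QoS record read from stdin
# into labeled lines. Empty fields are reported as "unset".
describe_qos() {
    local name priority preempt_mode max_wall
    IFS='|' read -r name priority preempt_mode max_wall
    printf 'QoS: %s\nPriority: %s\nPreemptMode: %s\nMaxWall: %s\n' \
        "$name" "${priority:-unset}" "${preempt_mode:-unset}" "${max_wall:-unset}"
}

# On a cluster (not run here), the record would typically come from:
#   sacctmgr -nP show qos standby format=name,priority,preemptmode,maxwall \
#       | describe_qos
```
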
## 7. Preemption Behavior
If standby is configured as preemptible, higher-priority jobs may displace standby jobs when resources are needed.
This usually means:
- standby jobs run when resources are free
- higher-priority jobs take precedence
- standby jobs may be stopped and requeued depending on preemption mode
If your application does not checkpoint, interruption may require it to restart from the beginning.

Checkpointing is strongly recommended for long-running standby jobs whenever the application supports it.
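
One way to make a job restartable is to record progress in a small file and resume from it when the job starts again. The script below is a minimal sketch, not a site-endorsed template: the checkpoint file name, the step count, and the placeholder work loop are all assumptions, and whether a signal arrives before preemption depends on how standby is configured here.

```bash
#!/bin/bash
#SBATCH --job-name=standby-ckpt
#SBATCH --qos=standby
#SBATCH --requeue              # allow Slurm to requeue this job
#SBATCH --open-mode=append     # keep earlier output when the job restarts
#SBATCH --signal=B:TERM@60     # send SIGTERM to the batch shell about 60 s before the time limit

# Hypothetical checkpoint file and step count; adjust for your application.
CKPT_FILE="${CKPT_FILE:-checkpoint.txt}"
TOTAL_STEPS="${TOTAL_STEPS:-5}"

# Resume from the last completed step if a checkpoint exists.
start_step=0
if [[ -f "$CKPT_FILE" ]]; then
    start_step=$(<"$CKPT_FILE")
fi

# Remember that a stop was requested; the loop checks this between steps.
stop_requested=0
trap 'stop_requested=1' TERM

for (( step = start_step; step < TOTAL_STEPS; step++ )); do
    if (( stop_requested )); then
        echo "Stop requested; exiting before step $step"
        exit 0
    fi
    # Placeholder for one unit of real work,
    # e.g. srun ./my_application --step "$step"
    echo "running step $step"
    # Record the next step to run so a restart skips completed work.
    echo $(( step + 1 )) > "$CKPT_FILE"
done
echo "all steps complete"
```

Bash runs the trap only after the current foreground command returns, so steps should be short relative to the warning time, or the work should be started in the background and followed by `wait`.
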
## 8. Priority and Scheduling
Standby exists below normal scheduling tiers and should be understood as a lower-priority lane for work that is flexible.
In practice, this means:
- normal and priority workloads are served first
- standby jobs may remain pending even when they are otherwise valid
- idle capacity may be used by standby jobs when available
- the scheduler may reclaim those resources later for higher-priority work
Use the following to inspect priority behavior:

```bash
sprio -j <jobid>
scontrol show job -dd <jobid>
```

## 9. Best Practices
- Use standby only for non-urgent work
- Keep jobs restartable when possible
- Use checkpointing if your software supports it
- Request only the resources you actually need
- Avoid padding walltime unnecessarily
- Confirm valid account/QoS/partition combinations with `sact_me`

For flexible workloads, standby can be a very effective way to consume otherwise unused capacity.
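
To judge whether a walltime request is padded, compare the elapsed time of a finished job with the limit it asked for. The helpers below are a sketch under stated assumptions: the names `to_seconds` and `walltime_used_pct` are hypothetical, and both inputs must be in Slurm's `[DD-]HH:MM:SS` form (values such as `UNLIMITED` are not handled).

```bash
# Convert a Slurm duration ([DD-]HH:MM:SS) to seconds.
to_seconds() {
    local t="$1" days=0 h m s
    if [[ "$t" == *-* ]]; then
        days="${t%%-*}"
        t="${t#*-}"
    fi
    IFS=: read -r h m s <<< "$t"
    # The 10# prefix forces base 10 so values like "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Print the whole-number percentage of the requested walltime that was used.
walltime_used_pct() {
    local elapsed="$1" limit="$2"
    echo $(( 100 * $(to_seconds "$elapsed") / $(to_seconds "$limit") ))
}

# On a cluster (not run here), the two values can be read with:
#   sacct -nPX -j <jobid> --format=elapsed,timelimit
```

A job that consistently uses a small fraction of its limit is a candidate for a shorter `--time` request.
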
## 10. Common Errors & Fixes

| Error | Cause | Fix |
|---|---|---|
| `Invalid account or account/partition combination specified` | Partition does not allow standby for your account | Run `sact_me` and verify the valid combination |
| `Invalid qos for job request` | Your account is not linked to standby | Use a permitted QoS or request access |
| `PENDING (QOSNotAllowed)` | Target partition does not accept standby | Submit to a partition that allows it |
| `PENDING (Priority)` | Higher-priority jobs are ahead of your standby job | Wait, or use a standard QoS if appropriate |
| `PENDING (AssocGrpQosLimit)` | Your request exceeds standby limits | Reduce CPUs, memory, GPUs, or walltime |
| Job requeued unexpectedly | Standby job was preempted | Resubmit or use checkpointing if supported |
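
The pending reasons listed above can be turned into a quick lookup. This is an illustrative sketch: the function name `explain_reason` is hypothetical, and the hints simply restate the table rather than querying the scheduler.

```bash
# Hypothetical helper: map a pending reason to the fix suggested in the table.
explain_reason() {
    case "$1" in
        Priority)
            echo "Higher-priority jobs are ahead; wait or use a standard QoS" ;;
        QOSNotAllowed)
            echo "Partition does not accept standby; choose another partition" ;;
        AssocGrpQosLimit)
            echo "Request exceeds standby limits; reduce CPUs, memory, GPUs, or walltime" ;;
        *)
            echo "No hint for '$1'; inspect with scontrol show job -dd <jobid>" ;;
    esac
}

# On a cluster (not run here), the reason for one job can be read with:
#   explain_reason "$(squeue -h -j <jobid> -o %r)"
```
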

## 11. Useful Commands

```bash
# See your associations and QoS values
sact_me

# Inspect why a job is pending
scontrol show job -dd <jobid>
sprio -j <jobid>

# Check partition status
sinfo -p <partition_name>

# Why are nodes unavailable?
sinfo -R
```

## 12. Bottom Line
The standby QoS is a controlled way to let low-priority jobs use idle resources without interfering with standard scheduling policy.
It is best suited for:
- flexible workloads
- interruption-tolerant jobs
- non-urgent experimentation
- best-effort compute consumption

Think of standby as an opportunistic service tier, not a guaranteed scheduling path.