💤 RAD-HPC Standby QoS Guide

Best-Effort Jobs · Opportunistic Scheduling · Lower Priority · Preemptible Workloads


🔑 1. Core Concept

The standby QoS is a best-effort scheduling tier intended for jobs that can run opportunistically when cluster resources would otherwise remain idle.

Jobs submitted under standby are treated as lower-priority workloads. They are useful for work that is flexible, non-urgent, and tolerant of delay or interruption.

👉 Standby is not intended for time-sensitive or production-critical jobs. It exists to improve cluster utilization while preserving priority for normal and higher-priority work.


🎯 2. What Standby QoS Is For

The standby QoS is designed to:

  • increase utilization of otherwise idle resources
  • provide a scheduling option for low-priority workloads
  • separate opportunistic jobs from standard production jobs
  • allow flexible workloads to run without competing equally with normal access tiers

Typical use cases include:

  • development and testing
  • exploratory runs
  • low-priority parameter sweeps
  • checkpoint-capable jobs
  • long-running work with no strict deadline

👉 If your workload must start quickly, run uninterrupted, or meet a deadline, standby is probably not the right QoS.


⚙️ 3. Expected Behavior

Jobs submitted with the standby QoS may behave differently from standard jobs.

Typical characteristics include:

  • lower scheduling priority
  • longer queue wait times
  • access to otherwise idle resources
  • possible preemption by higher-priority workloads
  • stricter limits depending on site policy

This means that standby jobs should be treated as opportunistic rather than guaranteed.

👉 Users should assume that standby jobs may wait longer and may need to restart depending on scheduler policy and workload pressure.


📄 4. Submission Example

Example Slurm submission using standby:

```bash
#!/bin/bash
#SBATCH --job-name=standby-job
#SBATCH --account=<account_name>
#SBATCH --qos=standby
#SBATCH --partition=<partition_name>
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=slurm-%j.out

srun ./my_application
```

Replace:

* `<account_name>` with your permitted account
* `<partition_name>` with a partition that allows the standby QoS

If your account or partition defaults are already configured, fewer options may be required.
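
Before queueing a standby job, it can help to check that the scheduler accepts the request. The sketch below wraps `sbatch --test-only`, which validates a batch script and reports an estimated start time without submitting anything. The helper name `standby_dry_run` and the script name `job.sh` are illustrative and not part of any site tooling.

```bash
#!/bin/bash
# Sketch: validate a batch script against the standby QoS without queueing it.
# `standby_dry_run` is a hypothetical helper; `sbatch --test-only` is standard Slurm.
standby_dry_run() {
    # Command-line options take precedence over #SBATCH lines in the script,
    # so this forces --qos=standby for the check only.
    sbatch --test-only --qos=standby "$@"
}

# Example (on a login node):
#   standby_dry_run --account=<account_name> --partition=<partition_name> job.sh
```

If the combination is invalid, the dry run should fail with the same submission error a real `sbatch` call would produce.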

---

## 🔍 5. Checking Whether You Can Use Standby

Before submitting, confirm that your **account**, **QoS**, and **partition** are a valid combination.

We provide `sact_me` to show your available accounts, QoS values, limits, and permitted partitions.

### Usage

```bash
# Default: show only for the cluster you are logged in to
sact_me

# All clusters where you have associations
sact_me -c all

# Filter to one account
sact_me -a <account_name>

# CSV output
sact_me -csv -c all
```

Check whether:

  • your account includes the standby QoS
  • the target partition accepts standby
  • any walltime or resource limits apply

👉 Jobs only start if your account, QoS, and partition are a valid combination.
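
If you want to script this check, the same association data can usually be read with standard Slurm tools. The sketch below filters pipe-delimited `Account|Partition|QOS` rows, as printed by `sacctmgr -nP`, for a given QoS. The helper name `accounts_with_qos` is hypothetical, and it assumes your site lets users query their own associations.

```bash
#!/bin/bash
# Sketch: list account/partition pairs whose QoS list contains a given QoS.
# Input (stdin): rows of "Account|Partition|QOS", where QOS is comma-separated.
accounts_with_qos() {
    awk -F'|' -v want="$1" '{
        n = split($3, qos, ",")
        for (i = 1; i <= n; i++) {
            if (qos[i] == want) {
                # An empty partition field means the association is not partition-specific.
                print $1 "|" ($2 == "" ? "(any)" : $2)
                break
            }
        }
    }'
}

# Example:
#   sacctmgr -nP show assoc user="$USER" format=Account,Partition,QOS | accounts_with_qos standby
#   scontrol show partition <partition_name> | grep -i AllowQos
```

The second example line checks the partition side: a partition that restricts QoS values lists them in its `AllowQos` field.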


⚠️ 6. Important Limitations

Standby QoS is intentionally limited in order to protect regular cluster usage.

Depending on configuration, standby may have:

  • lower priority than normal QoS tiers
  • lower walltime limits
  • tighter CPU, memory, or GPU caps
  • preemptible status
  • access restricted to certain accounts or partitions

Because of this, standby should not be used for:

  • urgent jobs
  • production workflows with fixed deadlines
  • jobs that cannot tolerate interruption
  • workloads that require guaranteed start times
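
The actual limits are set by site policy, so it is worth reading them rather than guessing. As a sketch, the helper below pulls one named column out of header-plus-row output such as `sacctmgr -P show qos` prints. `qos_field` is a hypothetical name, and the field list in the example is only an assumption about which columns are of interest.

```bash
#!/bin/bash
# Sketch: extract one column by header name from pipe-delimited output
# (first line = header, second line = data row).
qos_field() {
    awk -F'|' -v col="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
        c       { print $c; exit }
    '
}

# Example:
#   sacctmgr -P show qos where name=standby \
#       format=Name,Priority,MaxWall,Preempt,PreemptMode,GraceTime | qos_field MaxWall
```

An empty result means the column was not in the output, or that the site sets no value for it.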

⚔️ 7. Preemption Behavior

If standby is configured as preemptible, higher-priority jobs may displace standby jobs when resources are needed.

This usually means:

  • standby jobs run when resources are free
  • higher-priority jobs take precedence
  • standby jobs may be stopped and requeued depending on preemption mode

If your application does not checkpoint, interruption may require it to restart from the beginning.

👉 Checkpointing is strongly recommended for long-running standby jobs whenever the application supports it.
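
A common pattern for interruption-tolerant batch jobs is to trap the termination signal, let the application write a checkpoint, and resume from it on the next start. The sketch below rests on several assumptions that this guide does not state: that preempted standby jobs receive SIGTERM with a grace period (this depends on the QoS `GraceTime` and the cluster's preemption mode), that `./my_application` checkpoints on SIGTERM, and that it accepts a hypothetical `--restart <file>` flag. `CKPT`, `restart_args`, `on_term`, and `run_job` are illustrative names.

```bash
#!/bin/bash
#SBATCH --job-name=standby-ckpt
#SBATCH --qos=standby
#SBATCH --time=04:00:00
#SBATCH --requeue               # let Slurm return the job to the queue after preemption
#SBATCH --open-mode=append      # keep appending to the same log across restarts
#SBATCH --output=slurm-%j.out

# ASSUMPTION: the application writes its own checkpoint to $CKPT on SIGTERM
# and accepts `--restart <file>`. Neither is guaranteed by Slurm or by this guide.
CKPT="${CKPT:-state.ckpt}"

# Print restart arguments only when a non-empty checkpoint file exists.
restart_args() {
    if [ -s "$1" ]; then
        printf '%s %s\n' --restart "$1"
    fi
}

# Pass the scheduler's termination notice on to the application.
on_term() {
    echo "SIGTERM received; asking the application to checkpoint"
    kill -TERM "$APP_PID" 2>/dev/null
}

run_job() {
    # SLURM_RESTART_COUNT is set by Slurm once a job has been requeued.
    echo "restart count: ${SLURM_RESTART_COUNT:-0}"
    trap on_term TERM
    # Run in the background so the shell can act on the trap while the step runs.
    # Word splitting of the restart_args output is intended (path without spaces).
    srun ./my_application $(restart_args "$CKPT") &
    APP_PID=$!
    # `wait` returns early when a trapped signal arrives, so loop until the step exits.
    while kill -0 "$APP_PID" 2>/dev/null; do
        wait "$APP_PID"
    done
}

# Run only inside a Slurm job, so the file can also be sourced for testing.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_job
fi
```

Running the application in the background and waiting on it matters here: a shell that is blocked on a foreground command does not run its trap handler until that command returns.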


⚖️ 8. Priority and Scheduling

Standby sits below the normal scheduling tiers. Treat it as a lower-priority lane for flexible work.

In practice, this means:

  • normal and priority workloads are served first
  • standby jobs may remain pending even when they are otherwise valid
  • idle capacity may be used by standby jobs when available
  • the scheduler may reclaim those resources later for higher-priority work

Use the following to inspect priority behavior:

```bash
sprio -j <jobid>
scontrol show job -dd <jobid>
```
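
`scontrol show job` prints many `Key=Value` pairs. For a pending standby job, the most useful ones are usually `JobState`, `Reason`, and `Priority`. A minimal sketch for extracting a single key follows; `job_field` is a hypothetical helper, and it assumes the value contains no spaces.

```bash
#!/bin/bash
# Sketch: print the value of one Key=Value pair from `scontrol show job` output on stdin.
job_field() {
    # Put every Key=Value token on its own line, then match the key exactly.
    tr -s ' \t' '\n' | awk -v key="$1" '
        index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }
    '
}

# Example:
#   scontrol show job <jobid> | job_field Reason
#   squeue -j <jobid> --start      # scheduler's current start-time estimate, if any
```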

✅ 9. Best Practices

  • Use standby only for non-urgent work
  • Keep jobs restartable when possible
  • Use checkpointing if your software supports it
  • Request only the resources you actually need
  • Avoid padding walltime unnecessarily
  • Confirm valid account/QoS/partition combinations with sact_me

For flexible workloads, standby can be a very effective way to consume otherwise unused capacity.
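
One way to check for padded walltime is to compare a finished job's elapsed time with the limit it requested. The sketch below assumes `sacct` accounting data is available to you. `to_seconds` and `walltime_used_pct` are hypothetical helper names, and only the `[D-]HH:MM:SS` time format is handled.

```bash
#!/bin/bash
# Sketch: compare a finished job's elapsed time with its time limit.

# Convert [D-]HH:MM:SS to seconds; prints nothing for other values (e.g. UNLIMITED).
to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; t = $0
        if (index(t, "-") > 0) { split(t, p, "-"); days = p[1]; t = p[2] }
        if (split(t, f, ":") != 3) exit
        print days * 86400 + f[1] * 3600 + f[2] * 60 + f[3]
    }'
}

# Input (stdin): one "Elapsed|Timelimit" row; output: integer percentage of the limit used.
walltime_used_pct() {
    IFS='|' read -r elapsed limit
    e=$(to_seconds "$elapsed")
    l=$(to_seconds "$limit")
    if [ -n "$e" ] && [ -n "$l" ] && [ "$l" -gt 0 ]; then
        echo $(( 100 * e / l ))
    fi
}

# Example:
#   sacct -j <jobid> -X -n -P --format=Elapsed,Timelimit | walltime_used_pct
```

A job that routinely uses only a small fraction of its limit is a candidate for a shorter `--time` request, which can also make it easier for the scheduler to fit into idle gaps.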


❌ 10. Common Errors & Fixes

| Error | Cause | Fix |
|---|---|---|
| Invalid account or account/partition combination specified | Partition does not allow standby for your account | Run sact_me and verify the valid combination |
| Invalid qos for job request | Your account is not linked to standby | Use a permitted QoS or request access |
| PENDING (QOSNotAllowed) | Target partition does not accept standby | Submit to a partition that allows it |
| PENDING (Priority) | Higher-priority jobs are ahead of your standby job | Wait, or use a standard QoS if appropriate |
| PENDING (AssocGrpQosLimit) | Your request exceeds standby limits | Reduce CPUs, memory, GPUs, or walltime |
| Job requeued unexpectedly | Standby job was preempted | Resubmit or use checkpointing if supported |
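
For scripted triage, the pending reasons in the table above can be mapped to their suggested fixes. This is only a sketch: `explain_reason` is a hypothetical helper, and it covers just the three pending reasons listed here, not the full set Slurm can report.

```bash
#!/bin/bash
# Sketch: map a pending reason to the fix suggested in the table above.
explain_reason() {
    case "$1" in
        QOSNotAllowed)    echo "Submit to a partition that allows standby" ;;
        Priority)         echo "Wait, or use a standard QoS if appropriate" ;;
        AssocGrpQosLimit) echo "Reduce CPUs, memory, GPUs, or walltime" ;;
        *)                echo "No hint for '$1'; inspect with scontrol show job -dd <jobid>" ;;
    esac
}

# Example:
#   explain_reason "$(squeue -h -j <jobid> -o %r)"
```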

🔧 11. Useful Commands

```bash
# See your associations and QoS values
sact_me

# Inspect why a job is pending
scontrol show job -dd <jobid>
sprio -j <jobid>

# Check partition status
sinfo -p <partition_name>

# Why are nodes unavailable?
sinfo -R
```

📌 12. Bottom Line

The standby QoS is a controlled way to let low-priority jobs use idle resources without interfering with standard scheduling policy.

It is best suited for:

  • flexible workloads
  • interruption-tolerant jobs
  • non-urgent experimentation
  • best-effort compute consumption

👉 Think of standby as an opportunistic service tier, not a guaranteed scheduling path.