Frequently Asked Questions

This page addresses common questions about using UTEP's High-Performance Computing (HPC) resources, including the Jakar, Paro, and Punakha clusters.


πŸ” Access & Accounts

Who can request access?

All UTEP researchers, faculty members, and their sponsored collaborators (students, postdocs, external partners) are eligible.
A UTEP Principal Investigator (PI) must approve the request.

How do I request access to the HPC clusters?

Go to the HPC Account Requests page and follow the instructions.
You’ll need your UTEP email and PI approval.

How long does access last?

Accounts are granted for one year and can be renewed annually.
You’ll receive reminders before your account expires.


📦 Software & Modules

What software is available?

The clusters use OpenHPC for software management.
Run:

module spider

to see the full list.
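
If you already know which package you need, module spider can also be queried by name. This assumes the Lmod module system that OpenHPC ships with; the package name below is only an example:

# List every installed version of a package
module spider gcc

# Show how to load one specific version, including any prerequisite modules
module spider gcc/11.2.0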

How do I load a module?

Use:

module load <module_name>

For example:

module load gcc/11.2.0

To load and use OpenMPI4:
source /opt/intel/oneapi/setvars.sh -ofi_internal=1 --force
module load gnu12 openmpi4
export I_MPI_OFI_PROVIDER=tcp

In this example, the environment variables are set for compiling executables with the GNU version 12 compilers, and the OpenMPI4 wrappers and libraries are loaded. With this environment in place, compile your executable.
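
As a minimal sketch of that last step (the source file name mpi_hello.c and the process count are placeholders), compiling with the OpenMPI wrapper and launching inside a Slurm allocation could look like:

# Compile an MPI program with the OpenMPI compiler wrapper
mpicc -O2 -o mpi_hello mpi_hello.c

# Launch inside a Slurm job or allocation (srun is preferred over mpirun here)
srun -n 4 ./mpi_hello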

Can I request additional software?

Yes. Submit a software request through the HPC Software Request. Please include the software name, version, and justification for use.


⚡ Performance & Storage

How much storage do I get?

Each account includes three storage areas with quotas:

| Storage Type | Quota | Backup | Notes |
|---|---|---|---|
| Home | 10 GB | ✅ Backed up | Best for configs, small scripts, and critical files |
| Work | 100 GB | ❌ Not backed up | Use for active research data |
| Scratch | Unlimited* | ❌ Not backed up | Files unused for 7 days are automatically purged |

Your directories are located at:


/home/<username>
/work/<username>
/scratch/<username>

*Unlimited scratch is subject to automatic cleanup; the 7-day purge policy applies.


How do I check my quota?

Run:

qta_h   # Check Home quota
qta_w   # Check Work quota

Quota Output Terms:

  • Blocks = Used space
  • Quota = Warning threshold
  • Limit = Hard cap (cannot exceed)
  • File Limits = Restriction on total number of files

Info

These limits apply per user account. Usage by other researchers will not count against your quota.
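
The qta_h and qta_w shortcuts appear to wrap the GPFS quota command listed in the cheat sheet at the bottom of this page; if the aliases are not available in your shell, the direct commands are:

mmlsquota --block-size AUTO home   # home quota
mmlsquota --block-size AUTO work   # work quota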


What if I need more storage?

You may request additional space via the Support page. Approval depends on project needs and available resources.


🧹 Best Practices for Storage

  • Remove or archive unnecessary files.
  • Use /scratch for temporary data.
  • Use /work for active, non-critical data.
  • Keep /home clean for configs and critical scripts (backed up).
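
A minimal cleanup sketch that follows these practices (all paths, project names, and retention windows below are placeholders, not policy):

# Archive an old results directory from /work before removing it
mkdir -p /work/$USER/archives
tar czf /work/$USER/archives/project1_2024.tar.gz /work/$USER/project1_2024
rm -rf /work/$USER/project1_2024

# List scratch files not accessed in the last 7 days (candidates for the automatic purge)
find /scratch/$USER -type f -atime +7 -print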

🛠️ Helpful Shortcuts (Aliases)

| Alias | Purpose |
|---|---|
| cdw | Quickly change to your work directory |
| cds | Quickly change to your scratch directory |
| qta_w | Check your work quota |
| qta_h | Check your home quota |


👥 For Researchers Under Partner PIs

If you are working under a Partner PI, please ask your PI to coordinate directly with HPC staff for project space allocations.


Account Setup

  • You should have received an email from the HPC Assistant with a temporary password for your account.
  • Change your password using the passwd command.
  • An external reset mechanism is being developed; thank you for your patience.


🖥️ RAD-HPC Job Submission Guide

Accounts · QoS · Partitions · Priority · Preemption


🔑 1. Core Concepts

When you run jobs on RAD-HPC, you must specify three things:

  • Account (--account)
    The project/group you’re charging usage to.
  • QoS (Quality of Service) (--qos)
    Sets limits like CPU/GPU caps, memory, and walltime. Also defines preemption.
  • Partition (--partition or -p)
    Represents the hardware pool (GPU-DGX, GPU-HGX, and CPU-only nodes with small, medium, and large memory configurations).

👉 Jobs only start if your account, QoS, and partition are a valid combination. In some cases your account has a default QoS that aligns with the target partition, so only the account and partition need to be specified. The same may be true of the account itself, in which case only the partition needs to be specified. If you would like defaults set for your account, please open a request with us on the service desk.
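
For example, all three selections can be given on the sbatch command line; the account, QoS, and partition names below are the general Jakar ones used later on this page, and job.slurm is a placeholder script name:

sbatch --account=jakar_general --qos=jakar_medium_general --partition=medium job.slurm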


πŸ” 2. Checking Your Access (with sact_me)

We provide a tool called sact_me to show you all of your accounts, their QoS, limits, and which partitions accept them.

Usage

# Default: show only for the cluster you are logged in to
sact_me

# All clusters where you have associations
sact_me -c all

# Only one cluster
sact_me -c jakar

# Filter to one account
sact_me -a punakha_general

# CSV output (paste into Excel)
sact_me -csv -c all

Example Output

USER: rocorral   LOCAL CLUSTER: jakar
-------------------------------------------------------------------------------------------------------------------
Account: jakar_general    Cluster: jakar
QoS                          Prio QoSMaxWall AssocMaxWall QoSMaxTRES    AssocMaxTRES QoSGrpTRES       AssocGrpTRES     Preempt Mode       AllowedPartitions
jakar_medium_general         1    24:00:00    .            .             .              .                 cpu=40            REQUEUE         medium
jakar_small_general          1    12:00:00    .            .                .               .                 cpu=40            REQUEUE          small

What the columns mean

  • QoSMaxWall / AssocMaxWall: max runtime allowed at QoS level vs account level
  • QoSMaxTRES / AssocMaxTRES: resource caps (CPUs, GPUs, memory) from QoS vs account
  • QoSGrpTRES / AssocGrpTRES: group-level quotas (total cores, GPUs)
  • Preempt / Mode: whether this QoS can preempt others (our site uses REQUEUE)
  • AllowedPartitions: partitions that accept this QoS

👉 Slurm enforces the stricter of the QoS and Assoc limits. For example, in the jakar_medium_general row above, QoSMaxWall caps each job at 24:00:00 while AssocGrpTRES cpu=40 caps the account at 40 CPUs in use at once.


📊 3. Accounts → QoS → Partitions

[Diagram: Accounts → QoS → Partitions]


📄 4. Submission Script Templates

A) Punakha – DGX

#!/bin/bash
#SBATCH --job-name=dgx-job
#SBATCH --account=punakha_general
#SBATCH --qos=punakha_dgx_general
#SBATCH --partition=DGX
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --output=slurm-%j.out

srun ./gpu_app

B) Punakha – HGX

#SBATCH --account=punakha_general
#SBATCH --qos=punakha_hgx_general
#SBATCH --partition=HGX

C) Jakar – Medium

#SBATCH --account=jakar_general
#SBATCH --qos=jakar_medium_general
#SBATCH --partition=medium

D) Jakar – Small

#SBATCH --account=jakar_general
#SBATCH --qos=jakar_small_general
#SBATCH --partition=small
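
Whichever template you use, save it to a file (job.slurm below is a placeholder name), submit it, and monitor it with the standard Slurm commands from the cheat sheet at the end of this page:

sbatch job.slurm              # submit the batch script
squeue -u $USER               # watch its state (PENDING / RUNNING)
scontrol show job <jobid>     # full details, including the pending reason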

⚠️ 5. Common Errors & Fixes

| Error | Cause | Fix |
|---|---|---|
| Invalid account or account/partition combination specified | Partition does not allow any of your QoS | Run sact_me and pick a valid partition |
| Invalid qos for job request | QoS not linked to your account | Use a QoS listed for your account |
| PENDING (QOSNotAllowed) | Partition does not allow that QoS | Switch to a permitted QoS |
| PENDING (AssocGrpQosLimit) | Job exceeds CPU/GPU/memory/walltime quota | Reduce request or pick another QoS |
| PENDING (Priority) | Waiting behind higher-priority jobs | Check with sprio -j <jobid> |
| PENDING (ReqNodeNotAvail) | Nodes drained or unavailable | See sinfo -R |

⚖️ 6. Priority & Fairshare

We use multifactor priority with these major factors:

  • Fairshare (account-level)
    Accounts that have used fewer resources recently get higher priority.
  • Age
    Jobs gain priority the longer they wait.
  • QoS/Partition priority
    Certain QoS levels (e.g., partner GPU) carry higher weight.

Check your job’s priority breakdown:

sprio -j <jobid>
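
To see the fairshare standing behind that priority, Slurm's sshare utility (assuming it is available on the login nodes) reports recent usage for your associations:

# Show fairshare factors and recent usage for your own accounts
sshare -u $USER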

⚔️ 7. Preemption Policy

  • PreemptMode = REQUEUE
    Jobs that are preempted are stopped and sent back to the queue. If your application does not checkpoint, the job restarts from the beginning (see the sketch after this list).
  • Who can preempt whom:
    • Partner QoS (e.g., jakar_medium_chemh, paro_medium_physf)
      → can preempt General QoS on medium and large nodes.
    • Education QoS (e.g., jakar_small_education)
      → can preempt General QoS on small nodes.
    • General QoS is lowest priority:
      • Preemptible by Education or Partner QoS as above.
      • Does not preempt others.
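
If your application writes its own checkpoints, a requeue-friendly batch script might look like the sketch below; --requeue and --open-mode=append are standard sbatch options, while my_app and its --resume flag are placeholders:

#SBATCH --requeue                # let Slurm put the job back in the queue if it is preempted
#SBATCH --open-mode=append       # keep appending to the same output file after a restart

# Resume from the newest checkpoint if one exists (checkpoint.dat is a placeholder)
if [ -f checkpoint.dat ]; then
    srun ./my_app --resume checkpoint.dat
else
    srun ./my_app
fi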

🎓 8. Educational Accounts

  • Max walltime = 1 hour
  • Jobs longer than --time=01:00:00 will be rejected.
  • Recommended:
#SBATCH --time=00:59:00

✅ 9. Best Practices

  • Always specify --account, --qos, and --partition.
  • Run sact_me if you’re unsure what’s valid.
  • Request realistic runtimes; avoid padding with excessive hours.
  • For GPU jobs: match --gres=gpu:N and --cpus-per-task to your app’s needs (see the sketch after this list).
  • For long runs: use checkpointing if available; jobs can be preempted.
  • Use srun (not mpirun) unless required by your application stack.
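
As a sketch of the GPU sizing advice above (the counts are placeholders; match them to what your application can actually use):

#SBATCH --gres=gpu:2           # two GPUs on the node
#SBATCH --ntasks=2             # one task per GPU
#SBATCH --cpus-per-task=8      # CPU cores feeding each task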

🔧 10. Useful Commands

# See your associations (accounts + QoS)
sact_me

# Inspect why a job is pending
scontrol show job -dd <jobid>
sprio -j <jobid>

# Check partition status
sinfo -p DGX
sinfo -p medium

# Why nodes unavailable?
sinfo -R

📌 Admin Appendix (for transparency)

  • Priority weights (from slurm.conf):
PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightQOS=5000
PriorityWeightPartition=1000
PriorityWeightJobSize=0
PriorityDecayHalfLife=7-14days
PriorityMaxAge=7-14days
  • Preemption:
    • Mode = REQUEUE
    • Partner QoS (DGX/HGX) preempt General QoS on medium/large nodes
    • Education QoS preempts General QoS on small nodes

🛠️ Troubleshooting

My job failed. What should I do?

  1. Check the SLURM output/error logs (e.g., slurm-<jobid>.out).
  2. Verify your job script for errors.
  3. If still unresolved, contact the HPC support team with your job ID.

My module/software isn’t working. What next?

Unload and reload the module, or try a different version:

module purge
module load <module_name>/<version>
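
It can also help to confirm what is currently loaded and which versions are installed before reloading:

module list                    # modules currently loaded in your session
module avail <module_name>     # versions of a specific package that are installed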

If the issue persists, contact support.


📬 Getting Help

If you need further assistance, contact the HPC support team through the Support page and include your job ID or any error messages.


πŸ“ Quick Commands Cheat Sheet

Here are some of the most common HPC commands you’ll need:

| Purpose | Command Example |
|---|---|
| Submit a batch job | sbatch job.slurm |
| Run an interactive job | srun --pty bash |
| Check running jobs | squeue -u $USER |
| Cancel a job | scancel <jobid> |
| Show job details | scontrol show job <jobid> |
| Check account usage | sacct -u $USER |
| List available modules | module avail |
| Load a module | module load gcc/11.2.0 |
| Unload all modules | module purge |
| Check storage quota (home) | mmlsquota --block-size AUTO home |
| Check storage quota (work) | mmlsquota --block-size AUTO work |

✅ This page is intended as a living FAQ. If you think a question should be added, please suggest it through the General Inquiry form.