PUNAKHA Cluster: AI/ML Resource on GPU Cluster
Overview
The PUNAKHA cluster is the first high-performance computing cluster on campus to offer NVIDIA GPUs for Artificial Intelligence (AI) and Machine Learning (ML) workloads. The cluster was built in initial collaboration with Dr. Moore, Dr. Tosh, and Information Resources (IR).
Cluster Specifications
The cluster currently has two nodes: an NVIDIA DGX system and an HGX system, both based on the NVIDIA Hopper architecture with fourth-generation Tensor Cores (H100 GPUs). The cluster uses Docker and SLURM to schedule jobs, and partitions the NVIDIA H100s into Multi-Instance GPU (MIG) instances, providing substantial yet granular AI/ML capacity through predefined core and memory allocations per GPU. Future investments will continue with the integration of additional DGX, HGX, and PCIe NVIDIA GPUs, along with Sapphire Rapids family Intel Xeon Platinum 8480+ processors.
NVIDIA DGX (Proprietary System)
- GPUs: 8-way NVIDIA H100 built into the motherboard
- Memory: 2TB of DRAM
- Local Disk Space: 15TB
- Network: NVIDIA NDR200/200GbE InfiniBand
- Configuration: One GPU per user
HGX System (Non-Proprietary System)
- GPUs: 4-way NVIDIA H100 with NVLink connections
- Memory: 1TB of DRAM
- Local Disk Space: 15TB
- Network: NVIDIA NDR200/200GbE InfiniBand
- Configuration: MIG (Predefined profiles)
Cluster Summary
| Specification | Details |
|---|---|
| Total Nodes | 2 |
| Total RAM | 3.0 TB |
| Partitions | DGX, HGX |
| GPU Types | H100 |
| Total GPUs | 12 |
| Operating System | Red Hat Enterprise Linux (RHEL) 9.2 |
| Scheduler | SLURM v3 |
| Software Stack | Integrated with OpenHPC version 3 |
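A job submission on a SLURM-scheduled GPU cluster like this one might look like the sketch below. The partition name (`hgx`), the `--gres` syntax, and the resource sizes are assumptions for illustration, not verified cluster settings; check `sinfo` on the login node for the actual partition names.

```shell
#!/bin/bash
#SBATCH --job-name=gpu-test        # hypothetical job name
#SBATCH --partition=hgx            # assumed partition name; verify with `sinfo`
#SBATCH --gres=gpu:1               # request one GPU (or one MIG instance)
#SBATCH --mem=64G                  # example memory request
#SBATCH --time=01:00:00            # example wall-clock limit

# Show which GPU or MIG instance SLURM allocated to this job
nvidia-smi -L
```

Submit with `sbatch job.sh` and monitor the queue with `squeue -u $USER`.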
Multi-Instance GPU (MIG)
Multi-Instance GPU (MIG) expands the performance and value of NVIDIA Hopper-generation GPUs (and newer architectures such as Blackwell). MIG can partition a GPU into as many as seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores. This allows administrators to support every workload, from the smallest to the largest, with guaranteed quality of service (QoS), extending the reach of accelerated computing resources to every user.
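The partitioning described above is managed with `nvidia-smi`'s MIG subcommands. The sequence below is a sketch of how an administrator might carve an H100 into MIG instances; the specific profile ID shown is an assumption and should be confirmed against the profile listing, since IDs vary by GPU model and memory size.

```shell
# List the MIG profiles supported on the GPUs (names and IDs vary by model)
nvidia-smi mig -lgip

# Enable MIG mode on GPU 0 (administrator operation; may require a GPU reset)
sudo nvidia-smi -i 0 -mig 1

# Create two GPU instances and their compute instances in one step.
# Profile ID 9 is used here as an example (e.g. 3g.40gb on an 80 GB H100);
# verify the correct ID for this hardware with `nvidia-smi mig -lgip`.
sudo nvidia-smi mig -cgi 9,9 -C

# List the resulting MIG devices that jobs can be scheduled onto
nvidia-smi -L
```

Each MIG device then appears to SLURM and to user jobs as an independent GPU with its own fixed share of memory and compute.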
Future Path
While the current JAKAR hardware has reached its expansion limit, the software environment continues to be upgraded. The transition to next-generation Intel Sapphire Rapids processors marks the evolution into PARO and PUNAKHA, two new systems designed to maintain UTEP’s leadership in computational research.