PUNAKHA Cluster: AI/ML Resource on GPU Cluster
Overview
The PUNAKHA cluster is the first high-performance computing cluster on campus to offer NVIDIA GPUs for Artificial Intelligence (AI) and Machine Learning (ML) workloads. The cluster was built in initial collaboration with Dr. Moore, Dr. Tosh, and Information Resources (IR).
Cluster Specifications
The cluster currently has two nodes: an NVIDIA DGX system and an HGX system, both based on the NVIDIA Hopper architecture with fourth-generation Tensor Cores (H100 GPUs). The cluster uses Docker and SLURM to schedule jobs, and partitions the NVIDIA H100s into Multi-Instance GPU (MIG) instances, providing substantial yet granular AI/ML capacity through predefined core and memory allocations per GPU. Future investments will continue with the integration of additional DGX, HGX, and PCIe NVIDIA GPUs, along with Sapphire Rapids family Intel Xeon Platinum 8480+ processors.
NVIDIA DGX (Proprietary System)
- GPUs: 8-way NVIDIA H100 built into the motherboard
- Memory: 2TB of DRAM
- Local Disk Space: 15TB
- Network: NVIDIA NDR200/200GbE InfiniBand
- Configuration: One GPU per user
HGX System (Non-Proprietary System)
- GPUs: 4-way NVIDIA H100 with NVLink connections
- Memory: 1TB of DRAM
- Local Disk Space: 15TB
- Network: NVIDIA NDR200/200GbE InfiniBand
- Configuration: MIG (Predefined profiles)
Cluster Summary
| Specification | Details |
|---|---|
| Total Nodes | 2 |
| Total RAM | 3.0 TB |
| Partitions | DGX, HGX |
| GPU Types | H100 |
| Total GPUs | 12 |
| Operating System | Red Hat Enterprise Linux (RHEL) 9.2 |
| Scheduler | SLURM v3 |
| Software Stack | Integrated with OpenHPC version 3 |
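A job submission on a SLURM-scheduled GPU cluster like this one might look like the sketch below. The partition name (`hgx`), the `--gres` syntax, and the resource sizes are assumptions for illustration, not verified cluster settings; check `sinfo` on the login node for the actual partition names.

```shell
#!/bin/bash
#SBATCH --job-name=gpu-test        # hypothetical job name
#SBATCH --partition=hgx            # assumed partition name; verify with `sinfo`
#SBATCH --gres=gpu:1               # request one GPU (or one MIG instance)
#SBATCH --mem=64G                  # example memory request
#SBATCH --time=01:00:00            # example wall-clock limit

# Show which GPU or MIG instance SLURM allocated to this job
nvidia-smi -L
```

Submit with `sbatch job.sh` and monitor the queue with `squeue -u $USER`.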
Multi-Instance GPU (MIG)
Multi-Instance GPU (MIG) expands the performance and value of NVIDIA Hopper-generation GPUs (and newer architectures such as Blackwell). MIG can partition a GPU into as many as seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores. This allows administrators to support every workload, from the smallest to the largest, with guaranteed quality of service (QoS), extending the reach of accelerated computing resources to every user.
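The partitioning described above is managed with `nvidia-smi`'s MIG subcommands. The sequence below is a sketch of how an administrator might carve an H100 into MIG instances; the specific profile ID shown is an assumption and should be confirmed against the profile listing, since IDs vary by GPU model and memory size.

```shell
# List the MIG profiles supported on the GPUs (names and IDs vary by model)
nvidia-smi mig -lgip

# Enable MIG mode on GPU 0 (administrator operation; may require a GPU reset)
sudo nvidia-smi -i 0 -mig 1

# Create two GPU instances and their compute instances in one step.
# Profile ID 9 is used here as an example (e.g. 3g.40gb on an 80 GB H100);
# verify the correct ID for this hardware with `nvidia-smi mig -lgip`.
sudo nvidia-smi mig -cgi 9,9 -C

# List the resulting MIG devices that jobs can be scheduled onto
nvidia-smi -L
```

Each MIG device then appears to SLURM and to user jobs as an independent GPU with its own fixed share of memory and compute.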
Future Path
While the current JAKAR hardware has reached its expansion limit, the software environment continues to be upgraded. The transition to next-generation Intel Sapphire Rapids processors marks the evolution into PARO and PUNAKHA, two new systems designed to maintain UTEP’s leadership in computational research.