PUNAKHA Cluster, AI/ML Resource on GPU Cluster

Overview

The PUNAKHA cluster is the first High Performance Computing (HPC) cluster on campus to offer NVIDIA GPUs for Artificial Intelligence (AI) and Machine Learning (ML) workloads. The cluster was built through an initial collaboration among Dr. Moore, Dr. Tosh, and Information Resources (IR).

Cluster Specifications

The cluster currently has two nodes: an NVIDIA DGX system and an HGX system, both built on the NVIDIA Hopper architecture with fourth-generation Tensor Cores (H100 GPUs). Jobs are scheduled with Docker and SLURM, and the H100s are partitioned into Multi-Instance GPU (MIG) instances to provide substantial yet granular AI/ML capability through predefined core and memory allocations per GPU. Future investments will continue with the integration of additional DGX, HGX, and PCIe NVIDIA GPUs, along with Sapphire Rapids family Intel Xeon Platinum 8480+ processors.
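
As a concrete illustration of this workflow, the minimal Python sketch below submits a hypothetical training script to the scheduler with SLURM's sbatch command. The DGX and HGX partition names appear in the cluster summary further down; the GRES name ("gpu"), the time limit, and the train.py script are assumptions and would need to match the site's actual SLURM configuration.

    import subprocess

    # Minimal sketch: submit a hypothetical training script through SLURM.
    # Assumptions: the GPU GRES is named "gpu", one GPU/MIG instance is enough,
    # and train.py exists in the submission directory.
    cmd = [
        "sbatch",
        "--partition=HGX",           # DGX and HGX partitions (see cluster summary)
        "--gres=gpu:1",              # request a single GPU or MIG instance
        "--time=02:00:00",           # wall-clock limit; site defaults may differ
        "--wrap", "python train.py",
    ]

    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(result.stdout.strip())     # e.g. "Submitted batch job <jobid>"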

NVIDIA DGX (Proprietary System)

  • GPUs: 8-way NVIDIA H100, integrated on the system board
  • Memory: 2TB of DRAM
  • Local Disk Space: 15TB
  • Network: NVIDIA NDR200/200GbE InfiniBand
  • Configuration: One GPU per user

NVIDIA HGX (Non-Proprietary System)

  • GPUs: 4-way NVIDIA H100 with NVLink connections
  • Memory: 1TB of DRAM
  • Local Disk Space: 15TB
  • Network: NVIDIA NDR200/200GbE InfiniBand
  • Configuration: MIG (predefined profiles); see the sketch after this list
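
The hedged sketch below, referenced in the list above, shows how a job can check which device it was given. On many SLURM installations the scheduler exposes the allocation through the CUDA_VISIBLE_DEVICES environment variable, with MIG instances typically appearing as MIG-<uuid> identifiers rather than integer indices; whether PUNAKHA follows exactly this convention is an assumption.

    import os

    # Minimal sketch: report which GPU or MIG instance was handed to this job.
    # Assumption: the site exports the allocation via CUDA_VISIBLE_DEVICES, as
    # common SLURM GPU configurations do.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if not visible:
        print("No GPU is visible to this job")
    elif visible.startswith("MIG-"):
        print(f"Allocated MIG instance: {visible}")
    else:
        print(f"Allocated full GPU (index or UUID): {visible}")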

Cluster Summary

  • Total Nodes: 2
  • Total RAM: 3.0 TB
  • Partitions: DGX, HGX
  • GPU Types: H100
  • Total GPUs: 12
  • Operating System: Red Hat Enterprise Linux (RHEL) 9.2
  • Scheduler: SLURM v3
  • Software Stack: Integrated with OpenHPC version 3

PUNAKHA main site


Multi-Instance GPU (MIG)

Multi-Instance GPU (MIG) expands the performance and value of NVIDIA Blackwell and Hopper™ generation GPUs. MIG can partition the GPU into as many as seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores. This allows administrators to support every workload, from the smallest to the largest, with guaranteed quality of service (QoS), extending the reach of accelerated computing resources to every user.
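
From inside a job running on one of these MIG instances, the isolation is directly observable. The sketch below, which assumes PyTorch is available in the job's container image, prints the devices the job can see; on a MIG-backed allocation the reported memory corresponds to the MIG profile rather than the full H100.

    import torch

    # Minimal sketch, assuming PyTorch is installed in the job's container image.
    # On a MIG-backed allocation only the isolated slice is visible, so the
    # reported memory reflects the MIG profile rather than the full H100.
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            mem_gib = props.total_memory / 1024**3
            print(f"Device {i}: {props.name}, {mem_gib:.1f} GiB, "
                  f"{props.multi_processor_count} SMs")
    else:
        print("CUDA is not available inside this job")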


Future Path

While the current JAKAR hardware has reached its expansion limit, the software environment continues to be upgraded. The transition to next-generation Intel Sapphire Rapids processors marks the evolution into PARO and PUNAKHA, two new systems designed to maintain UTEP’s leadership in computational research.