Maintenance

Planned Maintenance

Most maintenance is performed during regular hours with no interruption to service. System-wide maintenance is usually planned ahead of time and scheduled for Wednesdays from 8:00 AM to 5:00 PM, with at least 10 days' notice. These maintenance periods are planned to occur four times per year.

Maintenance windows are periods when UITS may drain the queues of running jobs and suspend access to the cluster for HPC maintenance.

The notification will describe the nature and extent (partial or full) of the interruptions of HPC services.

System-wide Maintenance

Impacts to job queues

During system-wide maintenance cycles, job queues are impacted before and during maintenance. Submitted jobs whose runtimes would overlap with maintenance are held until maintenance concludes.

Some maintenance cycles require the entire system to be taken offline. In preparation, batch queues will be modified prior to scheduled downtimes to hold jobs which request more wallclock time than remains before the shutdown. Held jobs will be released to run once maintenance concludes.
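The hold rule described above comes down to one comparison: a job can start only if its requested wallclock time ends no later than the shutdown. The following is a minimal sketch of that comparison, not the scheduler's actual logic. The helper name `would_be_held` and the dates are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def would_be_held(start: datetime, walltime: timedelta, shutdown: datetime) -> bool:
    """Return True if a job starting at `start` would still be running at `shutdown`.

    Hypothetical helper for illustration only; not part of any scheduler API.
    """
    return start + walltime > shutdown


# Illustrative values (assumptions): a shutdown at 8:00 AM on a Wednesday,
# checked 24 hours in advance.
shutdown = datetime(2025, 1, 29, 8, 0)
now = datetime(2025, 1, 28, 8, 0)

print(would_be_held(now, timedelta(hours=12), shutdown))  # False: finishes in time
print(would_be_held(now, timedelta(hours=48), shutdown))  # True: overlaps the shutdown
```

Under this model, a job that requests less wallclock time than remains before the shutdown stays eligible to run, while a longer request waits until maintenance concludes.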

Rolling Maintenance

Impacts to job queues

During rolling maintenance cycles, job queues are impacted during and after maintenance. All nodes are drained, meaning they cannot accept new jobs and must allow running jobs to complete before they can be updated, rebooted, and put back online. The system may be slower to accept new jobs for 10 days following these maintenance cycles.

Rolling maintenance cycles allow updates and maintenance tasks to be carried out without a complete system shutdown. Throughout rolling maintenance, nodes stop accepting new jobs while currently running jobs finish uninterrupted. As nodes become vacant, they are taken offline, updated, rebooted, and restored to service. This iterative process minimizes disruption to ongoing computational work while maintenance is underway. During rolling maintenance cycles, job queues may slow down temporarily as nodes await reboot.
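The drain, update, and restore sequence can be illustrated with a deliberately simplified model. The sketch below assumes each node runs at most one job, and the node names, job durations, and one-hour update time are all made-up values. It does not reflect how the cluster is actually configured.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    job_hours_left: float  # hours until the running job finishes (0 if idle)


def return_to_service_times(nodes, update_hours=1.0):
    """Hours from the start of the drain until each node is back in service.

    Simplified model: a drained node accepts no new work, waits for its
    running job to finish, then spends `update_hours` being updated and rebooted.
    """
    return {n.name: n.job_hours_left + update_hours for n in nodes}


# Hypothetical nodes: one idle, one with a short job, one with a long job.
nodes = [Node("node-a", 0.0), Node("node-b", 5.0), Node("node-c", 72.0)]
times = return_to_service_times(nodes)
for name, hours in times.items():
    print(f"{name}: back in service after {hours:.1f} h")
```

In this model, idle nodes return quickly, while a node running a long job cannot take new work until that job finishes and the node has been updated. This is one way to picture why queues can slow down during a rolling cycle.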

Emergency Maintenance

Unavoidable (emergency) downtime may occur as a result of any of the above reasons at almost any time. Such events are rare, and great effort is made to avoid them. When emergency maintenance is needed, the UITS unit responsible for the affected item will give users as much notice as possible and work to resolve the fault as quickly as possible.

Any emergency outages will be announced via email through the hpc-announce@list.arizona.edu mailing list.

Maintenance History

Type: Disruptive Maintenance

Routine software and firmware updates for the compute nodes and storage.
The scheduler will receive a major update. Please report any unexpected scheduler issues after service is restored.
Newer Intel software will be made available.

Type: Rolling Maintenance

Routine software and firmware updates for the compute nodes and storage.
Login nodes OS update to Rocky Linux 9.
Minor Open OnDemand update to improve usability.

Type: Rolling Maintenance

Network and device software updates.
Quarterly OS patching and associated reboots.
cuDNN module fix: The existing cudnn/9.3 module was incorrectly pointing to cuDNN 9.2. During maintenance, it will be updated to correctly point to cuDNN 9.3. If this change disrupts your workflow, you can load cudnn/9.2 instead.
Ticketing system change: all emails sent to hpc-consult will now automatically open a ServiceNow ticket for assignment and tracking purposes.

Type: Rolling Maintenance

Routine patching of all nodes.
Upgrade of Open OnDemand (OOD) to 4.0 with relatively minor changes. Link to Release Note.

Type: Rolling Maintenance

Routine patching of all nodes.
Completion of OS migration. All remaining CentOS 7 Puma nodes were migrated to Rocky Linux 9, and Puma9 was made the default cluster.

Type: Rolling Maintenance

Routine patching of all nodes and storage array.
OS upgrades continue. A block of nodes was migrated from CentOS 7 to Rocky Linux 9, accessible from the login nodes using puma9. More details can be found on our OS Updates page.

Type: Rolling Maintenance

OnDemand graphical jobs limited to four days to support general resource availability.
User portal upgraded to support mobile clients.
New GPU partitions introduced to improve GPU resource availability.

Type: Rolling Maintenance

OnDemand Upgrade.
Gatekeeper moved to EL8 operating system.
Enabled job script storage in Slurm accounting.

Type: Rolling Maintenance

General operating system patches.
Qumulo storage array update.
Slurm configuration improvements.
Metrics (XDMoD) OS upgrade.
RStudio Server support for R version 4.3.2.
OnDemand OS upgrades.