Maintenance

Planned Maintenance

Most maintenance is performed during regular hours with no interruption to service. System-wide maintenance is usually planned ahead of time and scheduled for Wednesdays from 8:00 AM to 5:00 PM, with at least 10 days' notice. These system-wide windows are planned to occur four times per year.

Maintenance windows are periods when UITS may drain the queues of running jobs and suspend access to the cluster for HPC maintenance.

The notification will describe the nature and extent (partial or full) of the interruption to HPC services.

System-wide Maintenance

Impacts to job queues

During system-wide maintenance cycles, job queues are impacted before and during maintenance. Jobs whose requested runtimes would overlap with the maintenance window are held until maintenance concludes.

Some maintenance cycles require the entire system to be taken offline. In preparation, batch queues are modified before the scheduled downtime to hold jobs that request more wallclock time than remains before the shutdown. Held jobs are released to run once maintenance concludes.
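The hold rule above can be illustrated with a short Python sketch. This is not the scheduler's actual logic. The function name `would_be_held`, its parameters, and the example dates are hypothetical, chosen only to show the comparison between requested wallclock time and the time remaining before a shutdown.

```python
from datetime import datetime, timedelta


def would_be_held(earliest_start: datetime,
                  requested_walltime: timedelta,
                  maintenance_start: datetime) -> bool:
    """Return True if a job starting at earliest_start could not finish
    within its requested wallclock time before maintenance begins.

    Hypothetical helper for illustration; the real decision is made by
    the batch scheduler.
    """
    return earliest_start + requested_walltime > maintenance_start


# Assumed example: a maintenance window opening on a Wednesday at 8:00 AM.
maintenance_start = datetime(2024, 1, 31, 8, 0)
# A job that could start 12 hours before the window opens.
earliest_start = datetime(2024, 1, 30, 20, 0)

# Requests 10 hours: fits in the remaining 12 hours, so it can run.
print(would_be_held(earliest_start, timedelta(hours=10), maintenance_start))

# Requests 24 hours: overlaps the window, so it would be held.
print(would_be_held(earliest_start, timedelta(hours=24), maintenance_start))
```

Under these assumptions, requesting a shorter wallclock time is one way a job can still run ahead of a scheduled downtime.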

Rolling maintenance

Impacts to job queues

During rolling maintenance cycles, job queues are impacted during and after maintenance. All nodes are drained, meaning they cannot accept new jobs and must allow running jobs to complete before they can be updated, rebooted, and put back online. The system may be slower to accept new jobs for 10 days following these maintenance cycles.

Rolling maintenance cycles allow updates and maintenance tasks to proceed without a complete system shutdown. During rolling maintenance, nodes stop accepting new jobs while currently running jobs finish uninterrupted. As nodes become vacant, they are taken offline, updated, rebooted, and restored to service. This process minimizes disruption to running jobs while maintenance is underway. Note that job queues may slow down temporarily during rolling maintenance cycles as nodes await reboot.
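As a rough illustration of the drain-then-reboot sequence, the Python sketch below estimates when each drained node could return to service. It is a simplified model, not a description of how the cluster is actually managed. The node names, the `rolling_update_order` function, and the fixed one-hour update time are all assumptions made for the example.

```python
def rolling_update_order(remaining_hours: dict[str, float],
                         update_hours: float = 1.0) -> list[tuple[str, float]]:
    """Estimate when each drained node returns to service.

    remaining_hours maps a node name to the hours left on its running
    jobs (0.0 means the node is already vacant). Each node is updated
    as soon as it becomes vacant, which is assumed to take update_hours.
    Returns (node, hours until back in service), soonest first.
    """
    schedule = []
    # Nodes become vacant in order of their remaining job time.
    for node, hours_left in sorted(remaining_hours.items(), key=lambda kv: kv[1]):
        schedule.append((node, hours_left + update_hours))
    return schedule


# Hypothetical nodes and remaining job times, in hours.
nodes = {"node1": 5.0, "node2": 0.0, "node3": 12.5}

for node, back_online in rolling_update_order(nodes):
    print(f"{node}: back in service after about {back_online} hours")
```

In this model the node running the longest job is the last to be updated, which is one reason the queue can stay slower than usual for some time after a rolling cycle begins.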

Emergency Maintenance

Unavoidable (emergency) downtime may occur at almost any time. Such events are rare, and great effort is made to avoid them. When emergency maintenance is needed, the UITS unit responsible for the affected item will provide as much notice to users as possible and work to resolve the fault as quickly as possible.

Any emergency outages will be announced via email through the hpc-announce@list.arizona.edu mailing list.

Maintenance History

Type: Rolling Maintenance

  • Routine patching of all nodes and storage array.
  • OS upgrades continue. A block of nodes was migrated from CentOS 7 to Rocky 9 and is accessible from the login nodes using puma9. More details can be found on our OS Updates page.

Type: Rolling Maintenance

Type: Rolling Maintenance

  • OnDemand Upgrade.
  • Gatekeeper moved to EL8 operating system.
  • Enabled job script storage in Slurm accounting.

Type: Rolling Maintenance

  • General operating system patches.
  • Qumulo storage array update.
  • Slurm configuration improvements.
  • Metrics (XDMoD) OS upgrade.
  • RStudio Server support for R version 4.3.2.
  • OnDemand OS upgrades.