# Monitoring Jobs and Resources

## Monitoring Jobs
!!! tip
    Some of these commands are specific to UArizona HPC and may not work if invoked on other systems.
Every job has a unique ID that can be used to track its status, view its resource allocation, and check its resource usage. Below is a list of helpful commands you can use for job monitoring.
| Command | Purpose | Example |
|---|---|---|
| `squeue --job=<jobid>` | Retrieves a running or pending job's status | `squeue --job=12345` |
| `squeue --me` | Retrieves all of your running and pending jobs | `squeue --me` |
| `scontrol show job <jobid>` | Retrieves detailed information on a running or pending job | `scontrol show job 12345` |
| `scancel <jobid>` | Cancels a running or pending job | `scancel 12345` |
| `job-history <jobid>` | Retrieves a running or completed job's history in a user-friendly format | `job-history 12345` |
| `seff <jobid>` | Retrieves a completed job's memory and CPU efficiency | `seff 12345` |
| `past-jobs` | Retrieves past jobs run by the user. Can be used with the option `-d <n>` to search for jobs run in the past `<n>` days | `past-jobs -d 5` |
| `job-limits <group_name>` | Views your group's job resource limits and current usage | `job-limits mygroup` |
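
For example, a typical monitoring workflow checks on a job while it is queued or running, then reviews its efficiency after it finishes. The sketch below uses `12345` as a placeholder job ID; substitute the ID that Slurm reported when you submitted your job.

```bash
# List all of your pending and running jobs
squeue --me

# Check the status of one specific job
squeue --job=12345

# After the job completes, review how efficiently it used its allocation
seff 12345

# View the job's history in a user-friendly format
job-history 12345
```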
Slurm Reason Codes¶
If you check a pending job, Slurm displays a message under the field Reason indicating why it is not yet running. Some of these codes are not intuitive, so human-readable translations are provided below, followed by an example of how to display the field.
| Reason | Explanation |
|---|---|
| `AssocGrpCpuLimit` | Your job is not running because your group's CPU limit has been reached |
| `AssocGrpMemLimit` | Your job is not running because your group's memory limit has been reached |
| `AssocGrpCPUMinutesLimit` | Either your group is out of CPU hours, or your job would exhaust your group's remaining CPU hours |
| `AssocGrpGRES` | Your job is not running because your group's GPU limit has been reached |
| `Dependency` | Your job depends on the completion of another job. It will wait in the queue until the target job completes |
| `QOSGrpCPUMinutesLimit` | Your high-priority or qualified hours allocation has been exhausted for the month |
| `QOSMaxWallDurationPerJobLimit` | Your job's time limit exceeds the maximum allowed and the job will never run |
| `Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_or_jobs_in_higher_priority_partitions` | This very long message simply means your job is waiting in the queue until there is enough space for it to run |
| `Priority` | Your job is waiting in the queue behind higher-priority jobs until there are enough resources for it to run |
| `QOSMaxCpuPerUserLimit` | Your job is not running because your per-user CPU limit has been reached |
| `ReqNodeNotAvail, Reserved for maintenance` | Your job's time limit overlaps with an upcoming maintenance window. Run `uptime_remaining` to see when the system will go offline. If you remove and resubmit your job with a shorter walltime that does not overlap with maintenance, it will likely run; otherwise, it will remain pending until after the maintenance window |
| `Resources` | Your job is waiting in the queue until the required resources are available |
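
The Reason field appears in the NODELIST(REASON) column of the default `squeue` output. To query it directly, Slurm's `--Format` option can request it by name, as in the sketch below (`12345` is a placeholder job ID).

```bash
# Show the state and pending reason for one job
squeue --job=12345 --Format="JobID,Name,State,Reason"

# The default listing shows the reason for each of your pending jobs
squeue --me
```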
## Monitoring System Resources

We provide a number of system commands that can be used to view the availability of resources on each cluster. These may be helpful in determining which cluster to use for your analyses; a short usage sketch follows the table below.
| Command | Purpose |
|---|---|
| `nodes-busy` | Displays a visualization of each node on a cluster and its overall usage. Use `nodes-busy --help` for more detailed options |
| `cluster-busy` | Displays a visual overview of each cluster's usage |
| `system-busy` | Displays a text-based summary of a cluster's usage |
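
For example, before choosing where to submit a large job, you might compare utilization from a login node. The sketch below uses only the commands and the `--help` flag documented above; other options vary, so check the help output.

```bash
# Visual overview of every cluster's utilization
cluster-busy

# Per-node view of the cluster you are logged into
nodes-busy

# See additional display options for the per-node view
nodes-busy --help
```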