Monitoring Jobs and Resources
Monitoring Jobs
Tip
Some of these functions are specific to the UArizona HPC, and may not work if invoked on other systems.
Every job has a unique ID that can be used to track its status and to view its resource allocations and resource usage. Below is a list of helpful commands for job monitoring.
Command | Purpose | Example |
---|---|---|
squeue --job=<jobid> | Retrieves a running or pending job's status. | squeue --job=12345 |
squeue --me | Retrieves all of your running and pending jobs. | |
scontrol show job <jobid> | Retrieves detailed information on a running or pending job. | scontrol show job 12345 |
scancel <jobid> | Cancels a running or pending job. | scancel 12345 |
job-history <jobid> | Retrieves a running or completed job's history in a user-friendly format. | job-history 12345 |
seff <jobid> | Retrieves a completed job's memory and CPU efficiency. | seff 12345 |
past-jobs | Retrieves past jobs run by the user. Use the option -d <n> to search for jobs run in the past <n> days. | past-jobs -d 5 |
job-limits <group_name> | Shows your group's job resource limits and current usage. | job-limits mygroup |
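Two of the commands above can be combined into a single status check. The following is a minimal sketch, not a cluster-provided tool: `job_report` is a hypothetical helper name, and the sketch assumes that `squeue` prints nothing (or fails) once a job has left the queue, in which case it falls back to `seff`.

```shell
# job_report is a hypothetical helper, not part of Slurm or the cluster tools.
# It prints the state of a queued job, or the efficiency report of a finished one.
job_report() {
    jr_jobid="$1"
    # -h suppresses the header, -j selects the job, -o %T prints only the job state.
    # Assumption: the output is empty when the job is no longer in the queue.
    jr_state="$(squeue -h -j "$jr_jobid" -o %T 2>/dev/null || true)"
    if [ -n "$jr_state" ]; then
        echo "Job $jr_jobid is $jr_state"
    else
        # The job is not in the queue, so report its efficiency instead.
        seff "$jr_jobid"
    fi
}
```

Once the function is defined in your shell, `job_report 12345` prints either the current state of job 12345 or its `seff` summary.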
Slurm Reason Codes
When you check a pending job, the Reason field sometimes contains a message indicating why the job is not running. Some of these codes are not intuitive, so a human-readable translation is provided below:
Reason | Explanation |
---|---|
AssocGrpCpuLimit | Your job is not running because your group CPU limit has been reached. |
AssocGrpMemLimit | Your job is not running because your group memory limit has been reached. |
AssocGrpCPUMinutesLimit | Either your group is out of CPU hours or your job will exhaust your group's CPU hours. |
AssocGrpGRES | Your job is not running because your group GPU limit has been reached. |
Dependency | Your job depends on the completion of another job. It will wait in queue until the target job completes. |
QOSGrpCPUMinutesLimit | Your high priority or qualified hours allocation has been exhausted for the month. |
QOSMaxWallDurationPerJobLimit | Your job's time limit exceeds the maximum allowed, so the job will never run. |
Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_or_jobs_in_higher_priority_partitions | This very long message simply means your job is waiting in queue until there is enough space for it to run. |
Priority | Your job is waiting in queue until there are enough resources for it to run. |
QOSMaxCpuPerUserLimit | Your job is not running because your per-user CPU limit has been reached. |
ReqNodeNotAvail, Reserved for maintenance | Your job's time limit overlaps with an upcoming maintenance window. Run "uptime_remaining" to see when the system will go offline. If you cancel your job and resubmit it with a shorter walltime that does not overlap with maintenance, it will likely run. Otherwise, it will remain pending until after the maintenance window. |
Resources | Your job is waiting in queue until the required resources are available. |
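The translations in the table above can be attached to your own pending jobs with a small lookup. This is only a sketch: `explain_reason` and `annotate_reasons` are hypothetical helper names, the explanations are shortened paraphrases of the table, and any code not listed in the table falls through to a generic message.

```shell
# explain_reason is a hypothetical helper, not a cluster-provided command.
# It maps a Slurm reason code to a short paraphrase of the table above.
explain_reason() {
    case "$1" in
        AssocGrpCpuLimit)              echo "group CPU limit reached" ;;
        AssocGrpMemLimit)              echo "group memory limit reached" ;;
        AssocGrpCPUMinutesLimit)       echo "group CPU hours exhausted or insufficient" ;;
        AssocGrpGRES)                  echo "group GPU limit reached" ;;
        Dependency)                    echo "waiting on another job to complete" ;;
        QOSGrpCPUMinutesLimit)         echo "high priority or qualified hours exhausted for the month" ;;
        QOSMaxWallDurationPerJobLimit) echo "time limit exceeds the maximum allowed" ;;
        QOSMaxCpuPerUserLimit)         echo "per-user CPU limit reached" ;;
        ReqNodeNotAvail*)              echo "time limit overlaps a maintenance window" ;;
        Priority|Resources|Nodes_required_*) echo "waiting in queue for resources" ;;
        *)                             echo "no translation listed" ;;
    esac
}

# Read "jobid reason" pairs on stdin and print one annotated line per job.
# The reason may contain spaces; read puts the rest of the line into $reason.
annotate_reasons() {
    while read -r jobid reason; do
        echo "$jobid: $reason ($(explain_reason "$reason"))"
    done
}
```

On a cluster, the pairs could come from `squeue --me -h -t PENDING -o "%i %r" | annotate_reasons`, where `%i` is the job ID and `%r` is the reason field.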
Monitoring System Resources
Several system commands are available for viewing the availability of resources on each cluster. These can help you decide which cluster to use for your analyses.
Command | Purpose |
---|---|
nodes-busy | Displays a visualization of each node on a cluster and overall usage. Use nodes-busy --help for more detailed options. |
cluster-busy | Displays a visual overview of each cluster's usage. |
system-busy | Displays a text-based summary of a cluster's usage. |