Documentation
Maintenance Manual

WATcloud Maintenance Manual

This manual outlines the maintenance procedures for various components of WATcloud.

SLURM

To get a general overview of the health of the SLURM cluster, you can run:

sinfo --long

Example output:

Thu Apr 18 17:16:26 2024
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE RESERVATION NODELIST
compute*     up 1-00:00:00 1-infinite   no       NO        all      1     drained             tr-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      1       mixed             thor-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      3        idle             trpro-slurm[1-2],wato2-slurm1

In the output above, tr-slurm1 is in the drained state, which means it is not available for running jobs. thor-slurm1 is in the mix state, which means some jobs are running on it. All other nodes are in the idle state, which means there are no jobs running on them.

To get a detailed overview of the health of the SLURM cluster, you can run:

scontrol show node [NODE_NAME]

The optional NODE_NAME argument can be used to restrict the output to a specific node.

Example output:

> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUEfctv=58 CPUTot=60 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
   NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
   OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
   RealMemory=39140 AllocMem=0 FreeMem=29723 Sockets=60 Boards=1
   CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=compute
   BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
   LastBusyTime=2024-04-16T19:16:13 ResumeAfterTime=None
   CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
   Reason=Performing maintenance on baremetal [root@2024-04-18T17:06:20] 

In the output above, we can see that the reason tr-slurm1 is in the drained state (a.k.a. IDLE+DRAIN) for reason Performing maintenance on baremetal. The Reason field is an arbitrary user-specified string that can be set when performing actions on nodes.

Performing maintenance on a node

Creating a reservation

The first step to performing maintenance on a node is to create a reservation1. This ensures that user-submitted jobs do not get dispatched to the target nodes if they cannot be completed before the maintenance window starts.

scontrol create reservation starttime=<START_TIME> duration=<DURATION_IN_MINUTES> user=root flags=maint nodes=<NODE_NAMES>

The time zone for the starttime argument is the local time zone where the command is run. To see the local timezone, run timedatectl.

Here's an example:

scontrol create reservation starttime=2024-04-30T21:00:00 duration=480 user=root flags=maint nodes=trpro-slurm1,trpro-slurm2

output:

Reservation created: root_4

This command creates a reservation named root_4. The reservation starts on April 30, 2024, at 9:00 PM local time and lasts for 8 hours (480 minutes) for the nodes trpro-slurm1 and trpro-slurm2. During this time, only the root user can run jobs on these nodes. Jobs submitted by other users will be queued until the reservation is over.

To see existing reservations, you can run:

scontrol show reservation

To delete a resservation, you can run:

scontrol delete reservation <RESERVATION_NAME>

Starting maintenance

Before performing maintenance on a node, you should drain the node to ensure no jobs are running on it and no new jobs are scheduled to run on it.

scontrol update nodename="<NODE_NAME>" state=drain reason="Performing maintenance for reason X"

For example:

scontrol update nodename="tr-slurm1" state=drain reason="Performing maintenance on baremetal"

This will drain the node tr-slurm1 (prevent new jobs from running on it) and set the reason to Performing maintenance on baremetal. If there are no jobs running on the node, the node state becomes drained (a.k.a. IDLE+DRAIN in scontrol). If there are jobs running on the node, the node state becomes draining (a.k.a. MIXED+DRAIN in scontrol). In this case, SLURM will wait for the jobs to finish before transitioning the node to the drained state.

Example output from when a node is in the draining state:

> sinfo --long
Thu Apr 18 17:17:35 2024
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE RESERVATION NODELIST
compute*     up 1-00:00:00 1-infinite   no       NO        all      1    draining             tr-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      1       mixed             thor-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      3        idle             trpro-slurm[1-2],wato2-slurm1

> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=1 CPUEfctv=58 CPUTot=60 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
   NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
   OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
   RealMemory=39140 AllocMem=512 FreeMem=29688 Sockets=60 Boards=1
   CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
   State=MIXED+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=compute
   BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
   LastBusyTime=2024-04-18T17:15:30 ResumeAfterTime=None
   CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
   AllocTRES=cpu=1,mem=512M,gres/tmpdisk=300
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
   Reason=Performing maintenance on baremetal [root@2024-04-18T17:16:01]

After jobs finish running on the node, the node will transition to the drained state:

> sinfo --long
Thu Apr 18 17:22:07 2024
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE RESERVATION NODELIST
compute*     up 1-00:00:00 1-infinite   no       NO        all      1     drained             tr-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      1       mixed             thor-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      3        idle             trpro-slurm[1-2],wato2-slurm1

> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUEfctv=58 CPUTot=60 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
   NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
   OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
   RealMemory=39140 AllocMem=0 FreeMem=29688 Sockets=60 Boards=1
   CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=compute
   BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
   LastBusyTime=2024-04-18T17:21:13 ResumeAfterTime=None
   CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
   Reason=Performing maintenance on baremetal [root@2024-04-18T17:16:01]

Once the node is in the drained state, you can perform maintenance on it.

Taking a node out of maintenance mode

To take a node out of maintenance mode, you can run:

scontrol update nodename="<NODE_NAME>" state=resume

For example:

scontrol update nodename="tr-slurm1" state=resume

This will resume the node tr-slurm1 (allow new jobs to run on it) and clear the reason.

Also remember to delete any unexpired reservations (as outlined in Creating a reservation).

Footnotes

  1. The official documentation for reservations is at https://slurm.schedmd.com/reservations.html (opens in a new tab)