WATcloud Maintenance Manual

This manual outlines the maintenance procedures for various components of WATcloud.

General procedure

  1. Plan the maintenance: Prepare a plan for the maintenance, including the start time, end time, and the steps to be taken during the maintenance. Identify the components and services that will be affected. Try to minimize the impact on users by using strategies like rolling updates.
  2. Notify users: If the maintenance will affect users, notify them in advance. Make sure to give users plenty of time to prepare for the maintenance. In general, one week's notice is recommended.
  3. Perform the maintenance: Follow the steps outlined in the maintenance plan. If the maintenance runs over the scheduled end time, notify users of the delay.
  4. Verify the maintenance: After the maintenance is complete, verify that all components are working as expected (including CI pipelines). If there are any issues, address them immediately. Use observability tools to monitor the health of the system.
  5. Notify users: Once the maintenance is complete, update the maintenance announcement to indicate that the maintenance is complete. If there were any issues during the maintenance, provide details on what happened and how it was resolved.

SLURM

This section outlines the maintenance procedures for the SLURM cluster.

Cluster overview

To get a general overview of the health of the SLURM cluster, you can run:

sinfo --long

Example output:

Thu Apr 18 17:16:26 2024
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE RESERVATION NODELIST
compute*     up 1-00:00:00 1-infinite   no       NO        all      1     drained             tr-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      1       mixed             thor-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      3        idle             trpro-slurm[1-2],wato2-slurm1

In the output above, tr-slurm1 is in the drained state, which means it is not available for running jobs. thor-slurm1 is in the mixed state, which means some of its resources are allocated to running jobs. All other nodes are in the idle state, which means no jobs are running on them.
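
If you only want to see nodes that are unavailable and why, one quick option is:

sinfo -R

This lists the reason, user, timestamp, and node list for nodes that are down, drained, or failing.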

To get a detailed overview of nodes in the cluster, you can run:

scontrol show node [NODE_NAME]

The optional NODE_NAME argument can be used to restrict the output to a specific node.

Example output:

> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUEfctv=58 CPUTot=60 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
   NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
   OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
   RealMemory=39140 AllocMem=0 FreeMem=29723 Sockets=60 Boards=1
   CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=compute
   BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
   LastBusyTime=2024-04-16T19:16:13 ResumeAfterTime=None
   CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
   Reason=Performing maintenance on baremetal [root@2024-04-18T17:06:20] 

In the output above, we can see that tr-slurm1 is in the drained state (a.k.a. IDLE+DRAIN) with the reason Performing maintenance on baremetal. The Reason field is an arbitrary user-specified string that can be set when performing actions on nodes.
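
If you only need a node's state and reason, one way to trim the output down (a simple sketch using grep) is:

scontrol show node tr-slurm1 | grep -E "State=|Reason="

This prints just the lines containing the State and Reason fields shown above.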

Performing maintenance on a node

Creating a reservation

The first step to performing maintenance on a node is to create a reservation[1]. This ensures that user-submitted jobs do not get dispatched to the target nodes if they cannot be completed before the maintenance window starts.

scontrol create reservation starttime=<START_TIME> duration=<DURATION_IN_MINUTES> user=root flags=maint nodes=<NODE_NAMES>

The starttime argument is interpreted in the local time zone of the machine where the command is run. To see the local time zone, run timedatectl.
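
For example, on a systemd-based host you can check the configured time zone with:

timedatectl | grep "Time zone"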

Here's an example of creating a reservation:

scontrol create reservation starttime=2024-04-30T21:00:00 duration=480 user=root flags=maint nodes=trpro-slurm1,trpro-slurm2

Output:

Reservation created: root_4

This command creates a reservation named root_4. The reservation starts on April 30, 2024, at 9:00 PM local time and lasts for 8 hours (480 minutes) for the nodes trpro-slurm1 and trpro-slurm2. During this time, only the root user can run jobs on these nodes. Jobs submitted by other users will be queued until the reservation is over.

To see existing reservations, you can run:

scontrol show reservation
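
You can also pass a reservation name to show only that reservation. For example, to inspect the reservation created above:

scontrol show reservation root_4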

To delete a reservation, you can run:

scontrol delete reservation <RESERVATION_NAME>
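
For example, to remove the reservation created earlier:

scontrol delete reservation root_4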

Starting maintenance

Before performing maintenance on a node, you should drain the node to ensure no jobs are running on it and no new jobs are scheduled to run on it.

scontrol update nodename="<NODE_NAME>" state=drain reason="Performing maintenance for reason X"

For example:

scontrol update nodename="tr-slurm1" state=drain reason="Performing maintenance on baremetal"

This will drain the node tr-slurm1 (prevent new jobs from running on it) and set the reason to Performing maintenance on baremetal. If there are no jobs running on the node, the node state becomes drained (a.k.a. IDLE+DRAIN in scontrol). If there are jobs running on the node, the node state becomes draining (a.k.a. MIXED+DRAIN in scontrol). In this case, SLURM will wait for the jobs to finish before transitioning the node to the drained state.
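
To see which jobs are still running on a draining node (and are therefore holding up the transition to the drained state), you can check the queue for that node:

squeue --nodelist=tr-slurm1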

Example output from when a node is in the draining state:

> sinfo --long
Thu Apr 18 17:17:35 2024
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE RESERVATION NODELIST
compute*     up 1-00:00:00 1-infinite   no       NO        all      1    draining             tr-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      1       mixed             thor-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      3        idle             trpro-slurm[1-2],wato2-slurm1

> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=1 CPUEfctv=58 CPUTot=60 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
   NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
   OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
   RealMemory=39140 AllocMem=512 FreeMem=29688 Sockets=60 Boards=1
   CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
   State=MIXED+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=compute
   BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
   LastBusyTime=2024-04-18T17:15:30 ResumeAfterTime=None
   CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
   AllocTRES=cpu=1,mem=512M,gres/tmpdisk=300
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
   Reason=Performing maintenance on baremetal [root@2024-04-18T17:16:01]

After jobs finish running on the node, the node will transition to the drained state:

> sinfo --long
Thu Apr 18 17:22:07 2024
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE RESERVATION NODELIST
compute*     up 1-00:00:00 1-infinite   no       NO        all      1     drained             tr-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      1       mixed             thor-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      3        idle             trpro-slurm[1-2],wato2-slurm1

> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUEfctv=58 CPUTot=60 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
   NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
   OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
   RealMemory=39140 AllocMem=0 FreeMem=29688 Sockets=60 Boards=1
   CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=compute
   BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
   LastBusyTime=2024-04-18T17:21:13 ResumeAfterTime=None
   CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
   Reason=Performing maintenance on baremetal [root@2024-04-18T17:16:01]

Once the node is in the drained state, you can perform maintenance on it.

Taking a node out of maintenance mode

To take a node out of maintenance mode, you can run:

scontrol update nodename="<NODE_NAME>" state=resume

For example:

scontrol update nodename="tr-slurm1" state=resume

This will resume the node tr-slurm1 (allow new jobs to run on it) and clear the reason.
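
To confirm that the node is back in service, you can check its state again:

sinfo --nodes=tr-slurm1

The node should show up as idle (or mixed once jobs start running on it again).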

Also remember to delete any unexpired reservations (as outlined in Creating a reservation).

Footnotes

  1. The official documentation for reservations is at https://slurm.schedmd.com/reservations.html