WATcloud Maintenance Manual
This manual outlines the maintenance procedures for various components of WATcloud.
General procedure
- Plan the maintenance: Prepare a plan for the maintenance, including the start time, end time, and the steps to be taken during the maintenance. Identify the components and services that will be affected. Try to minimize the impact on users by using strategies like rolling updates.
- Notify users: If the maintenance will affect users, notify them in advance. Make sure to give users plenty of time to prepare for the maintenance. In general, one week's notice is recommended.
- Perform the maintenance: Follow the steps outlined in the maintenance plan. If the maintenance runs over the scheduled end time, notify users of the delay.
- Verify the maintenance: After the maintenance is complete, verify that all components are working as expected (including CI pipelines). If there are any issues, address them immediately. Use observability tools to monitor the health of the system.
- Notify users: Once the maintenance is complete, update the maintenance announcement to indicate that the maintenance is complete. If there were any issues during the maintenance, provide details on what happened and how it was resolved.
SLURM
This section outlines the maintenance procedures for the SLURM cluster.
Cluster overview
To get a general overview of the health of the SLURM cluster, you can run:
sinfo --long
Example output:
Thu Apr 18 17:16:26 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
compute* up 1-00:00:00 1-infinite no NO all 1 drained tr-slurm1
compute* up 1-00:00:00 1-infinite no NO all 1 mixed thor-slurm1
compute* up 1-00:00:00 1-infinite no NO all 3 idle trpro-slurm[1-2],wato2-slurm1
In the output above, tr-slurm1 is in the drained state, which means it is not available for running jobs. thor-slurm1 is in the mixed state, which means some jobs are running on it. All other nodes are in the idle state, which means there are no jobs running on them.
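If you only need to know why nodes are unavailable, sinfo can also list down and drained nodes together with the reason, the user who set it, and a timestamp:
sinfo -R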
To get a detailed overview of nodes in the cluster, you can run:
scontrol show node [NODE_NAME]
The optional NODE_NAME argument can be used to restrict the output to a specific node.
Example output:
> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUEfctv=58 CPUTot=60 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
RealMemory=39140 AllocMem=0 FreeMem=29723 Sockets=60 Boards=1
CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=compute
BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
LastBusyTime=2024-04-16T19:16:13 ResumeAfterTime=None
CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
Reason=Performing maintenance on baremetal [root@2024-04-18T17:06:20]
In the output above, we can see that tr-slurm1 is in the drained state (a.k.a. IDLE+DRAIN) with the reason Performing maintenance on baremetal.
The Reason field is an arbitrary user-specified string that can be set when performing actions on nodes.
Performing maintenance on a node
Creating a reservation
The first step to performing maintenance on a node is to create a reservation [1]. This ensures that user-submitted jobs do not get dispatched to the target nodes if they cannot be completed before the maintenance window starts.
scontrol create reservation starttime=<START_TIME> duration=<DURATION_IN_MINUTES> user=root flags=maint nodes=<NODE_NAMES>
The time zone for the starttime argument is the local time zone where the command is run. To see the local time zone, run timedatectl.
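For reference, timedatectl prints something like the following (output trimmed; the values here are illustrative):
Local time: Tue 2024-04-23 14:05:11 EDT
Universal time: Tue 2024-04-23 18:05:11 UTC
Time zone: America/Toronto (EDT, -0400)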
Here's an example of creating a reservation:
scontrol create reservation starttime=2024-04-30T21:00:00 duration=480 user=root flags=maint nodes=trpro-slurm1,trpro-slurm2
Output:
Reservation created: root_4
This command creates a reservation named root_4. The reservation starts on April 30, 2024, at 9:00 PM local time and lasts for 8 hours (480 minutes) on the nodes trpro-slurm1 and trpro-slurm2.
During this time, only the root user can run jobs on these nodes. Jobs submitted by other users will be queued until the reservation is over.
To see existing reservations, you can run:
scontrol show reservation
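For the reservation created above, the output would look roughly like this (abbreviated; the exact set of fields varies between SLURM versions):
ReservationName=root_4 StartTime=2024-04-30T21:00:00 EndTime=2024-05-01T05:00:00 Duration=08:00:00
   Nodes=trpro-slurm[1-2] NodeCnt=2 Flags=MAINT,SPEC_NODES
   Users=root Accounts=(null) State=INACTIVE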
To delete a reservation, you can run:
scontrol delete reservation <RESERVATION_NAME>
Starting maintenance
Before performing maintenance on a node, you should drain the node to ensure no jobs are running on it and no new jobs are scheduled to run on it.
scontrol update nodename="<NODE_NAME>" state=drain reason="Performing maintenance for reason X"
For example:
scontrol update nodename="tr-slurm1" state=drain reason="Performing maintenance on baremetal"
This will drain the node tr-slurm1 (prevent new jobs from running on it) and set the reason to Performing maintenance on baremetal.
If there are no jobs running on the node, the node state becomes drained (a.k.a. IDLE+DRAIN in scontrol). If there are jobs running on the node, the node state becomes draining (a.k.a. MIXED+DRAIN in scontrol). In this case, SLURM will wait for the jobs to finish before transitioning the node to the drained state.
Example output from when a node is in the draining state:
> sinfo --long
Thu Apr 18 17:17:35 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
compute* up 1-00:00:00 1-infinite no NO all 1 draining tr-slurm1
compute* up 1-00:00:00 1-infinite no NO all 1 mixed thor-slurm1
compute* up 1-00:00:00 1-infinite no NO all 3 idle trpro-slurm[1-2],wato2-slurm1
> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=1 CPUEfctv=58 CPUTot=60 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
RealMemory=39140 AllocMem=512 FreeMem=29688 Sockets=60 Boards=1
CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
State=MIXED+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=compute
BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
LastBusyTime=2024-04-18T17:15:30 ResumeAfterTime=None
CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
AllocTRES=cpu=1,mem=512M,gres/tmpdisk=300
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
Reason=Performing maintenance on baremetal [root@2024-04-18T17:16:01]
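While a node is draining, you can check which jobs are still running on it by filtering squeue by node (the -w/--nodelist flag):
squeue -w tr-slurm1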
After jobs finish running on the node, the node will transition to the drained state:
> sinfo --long
Thu Apr 18 17:22:07 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
compute* up 1-00:00:00 1-infinite no NO all 1 drained tr-slurm1
compute* up 1-00:00:00 1-infinite no NO all 1 mixed thor-slurm1
compute* up 1-00:00:00 1-infinite no NO all 3 idle trpro-slurm[1-2],wato2-slurm1
> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUEfctv=58 CPUTot=60 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
RealMemory=39140 AllocMem=0 FreeMem=29688 Sockets=60 Boards=1
CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=compute
BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
LastBusyTime=2024-04-18T17:21:13 ResumeAfterTime=None
CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
Reason=Performing maintenance on baremetal [root@2024-04-18T17:16:01]
Once the node is in the drained state, you can perform maintenance on it.
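If the maintenance requires a reboot, SLURM can optionally manage the reboot itself and return the node to service afterwards. Note that this relies on RebootProgram being configured in slurm.conf; the reason string below is just an example:
scontrol reboot ASAP nextstate=RESUME reason="Kernel upgrade" tr-slurm1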
Taking a node out of maintenance mode
To take a node out of maintenance mode, you can run:
scontrol update nodename="<NODE_NAME>" state=resume
For example:
scontrol update nodename="tr-slurm1" state=resume
This will resume the node tr-slurm1 (allow new jobs to run on it) and clear the reason.
Also remember to delete any unexpired reservations (as outlined in Creating a reservation).
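For example, to remove the reservation created earlier in this manual:
scontrol delete reservation root_4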
Footnotes
- [1] The official documentation for reservations is at https://slurm.schedmd.com/reservations.html