WATcloud Maintenance Manual
This manual outlines the maintenance procedures for various components of WATcloud.
General procedure
- Plan the maintenance: Prepare a plan for the maintenance, including the start time, end time, and the steps to be taken during the maintenance. Identify the components and services that will be affected. Try to minimize the impact on users by using strategies like rolling updates.
- Notify users: If the maintenance will affect users, notify them in advance. Make sure to give users plenty of time to prepare for the maintenance. In general, one week's notice is recommended.
- Perform the maintenance: Follow the steps outlined in the maintenance plan. If the maintenance runs over the scheduled end time, notify users of the delay.
- Verify the maintenance: After the maintenance is complete, verify that all components are working as expected (including CI pipelines). If there are any issues, address them immediately. Use observability tools to monitor the health of the system.
- Notify users: Once the maintenance is complete, update the maintenance announcement to indicate that the maintenance is complete. If there were any issues during the maintenance, provide details on what happened and how it was resolved.
SLURM
This section outlines the maintenance procedures for the SLURM cluster.
Cluster overview
To get a general overview of the health of the SLURM cluster, you can run:
sinfo --long
Example output:
Thu Apr 18 17:16:26 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
compute* up 1-00:00:00 1-infinite no NO all 1 drained tr-slurm1
compute* up 1-00:00:00 1-infinite no NO all 1 mixed thor-slurm1
compute* up 1-00:00:00 1-infinite no NO all 3 idle trpro-slurm[1-2],wato2-slurm1
In the output above, tr-slurm1 is in the drained state, which means it is not available for running jobs.
thor-slurm1 is in the mixed state, which means some jobs are running on it.
All other nodes are in the idle state, which means there are no jobs running on them.
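For a quick per-node summary that includes each node's state and drain reason, you can use a custom sinfo format string. This is a minimal sketch using standard sinfo format specifiers (%N for node name, %T for state, %E for reason):
# List every node with its state and reason string (if any):
sinfo --Node --format="%N %T %E"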
To get a detailed overview of nodes in the cluster, you can run:
scontrol show node [NODE_NAME]
The optional NODE_NAME argument can be used to restrict the output to a specific node.
Example output:
> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUEfctv=58 CPUTot=60 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
RealMemory=39140 AllocMem=0 FreeMem=29723 Sockets=60 Boards=1
CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=compute
BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
LastBusyTime=2024-04-16T19:16:13 ResumeAfterTime=None
CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
Reason=Performing maintenance on baremetal [root@2024-04-18T17:06:20]
In the output above, we can see that tr-slurm1 is in the drained state (a.k.a. IDLE+DRAIN) with the reason Performing maintenance on baremetal.
The Reason field is an arbitrary user-specified string that can be set when performing actions on nodes.
Performing maintenance on a node
Creating a reservation
The first step to performing maintenance on a node is to create a reservation [1]. This ensures that user-submitted jobs do not get dispatched to the target nodes if they cannot be completed before the maintenance window starts.
scontrol create reservation starttime=<START_TIME> duration=<DURATION_IN_MINUTES> user=root flags=maint nodes=<NODE_NAMES>
The time zone for the starttime argument is the local time zone where the command is run.
To see the local timezone, run timedatectl.
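To avoid time zone mistakes, you can generate the starttime string with date. This is a sketch, assuming GNU date (as found on Ubuntu); the target time is illustrative:
# Produce a starttime string like 2024-04-30T21:00:00 from a local time:
date -d "2024-04-30 21:00" +%Y-%m-%dT%H:%M:%S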
Here's an example:
scontrol create reservation starttime=2024-04-30T21:00:00 duration=480 user=root flags=maint nodes=trpro-slurm1,trpro-slurm2
Output:
Reservation created: root_4
This command creates a reservation named root_4. The reservation starts on
April 30, 2024, at 9:00 PM local time and lasts for
8 hours (480 minutes) for the nodes trpro-slurm1 and trpro-slurm2.
During this time, only the root user can run jobs on these nodes.
Jobs submitted by other users will be queued until the reservation is over.
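Jobs that cannot start because of the reservation remain pending, and you can list them with squeue. A sketch; the reason column typically references node availability (e.g., ReqNodeNotAvail), though the exact string may vary by SLURM version:
# Pending jobs show why they are not running in the NODELIST(REASON) column:
squeue --states=PENDING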
To see existing reservations, you can run:
scontrol show reservation
To delete a reservation, you can run:
scontrol delete reservation <RESERVATION_NAME>
Starting maintenance
Before performing maintenance on a node, you should drain the node to ensure no jobs are running on it and no new jobs are scheduled to run on it.
scontrol update nodename="<NODE_NAME>" state=drain reason="Performing maintenance for reason X"For example:
scontrol update nodename="tr-slurm1" state=drain reason="Performing maintenance on baremetal"This will drain the node tr-slurm1 (prevent new jobs from running on it) and set the reason to Performing maintenance on baremetal.
If there are no jobs running on the node, the node state becomes drained (a.k.a. IDLE+DRAIN in scontrol).
If there are jobs running on the node, the node state becomes draining (a.k.a. MIXED+DRAIN in scontrol).
In this case, SLURM will wait for the jobs to finish before transitioning the node to the drained state.
Example output from when a node is in the draining state:
> sinfo --long
Thu Apr 18 17:17:35 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
compute* up 1-00:00:00 1-infinite no NO all 1 draining tr-slurm1
compute* up 1-00:00:00 1-infinite no NO all 1 mixed thor-slurm1
compute* up 1-00:00:00 1-infinite no NO all 3 idle trpro-slurm[1-2],wato2-slurm1
> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=1 CPUEfctv=58 CPUTot=60 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
RealMemory=39140 AllocMem=512 FreeMem=29688 Sockets=60 Boards=1
CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
State=MIXED+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=compute
BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
LastBusyTime=2024-04-18T17:15:30 ResumeAfterTime=None
CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
AllocTRES=cpu=1,mem=512M,gres/tmpdisk=300
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
Reason=Performing maintenance on baremetal [root@2024-04-18T17:16:01]
After jobs finish running on the node, the node will transition to the drained state:
> sinfo --long
Thu Apr 18 17:22:07 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
compute* up 1-00:00:00 1-infinite no NO all 1 drained tr-slurm1
compute* up 1-00:00:00 1-infinite no NO all 1 mixed thor-slurm1
compute* up 1-00:00:00 1-infinite no NO all 3 idle trpro-slurm[1-2],wato2-slurm1
> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUEfctv=58 CPUTot=60 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
RealMemory=39140 AllocMem=0 FreeMem=29688 Sockets=60 Boards=1
CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=compute
BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
LastBusyTime=2024-04-18T17:21:13 ResumeAfterTime=None
CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
Reason=Performing maintenance on baremetal [root@2024-04-18T17:16:01]
Once the node is in the drained state, you can perform maintenance on it.
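If you are scripting the maintenance, you can poll the node state and proceed once it reports drained. A minimal sketch, assuming the node appears in a single partition (the node name is illustrative):
# Wait until the node has finished draining (state becomes "drained").
NODE=tr-slurm1
while [ "$(sinfo --noheader --nodes="$NODE" --format=%T)" != "drained" ]; do
  echo "Waiting for $NODE to drain..."
  sleep 30
done
echo "$NODE is drained; safe to begin maintenance."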
Taking a node out of maintenance mode
To take a node out of maintenance mode, you can run:
scontrol update nodename="<NODE_NAME>" state=resumeFor example:
scontrol update nodename="tr-slurm1" state=resumeThis will resume the node tr-slurm1 (allow new jobs to run on it) and clear the reason.
Also remember to delete any unexpired reservations (as outlined in Creating a reservation).
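To verify that the node is back in service, you can check its state and submit a trivial job to it. A sketch; the node name is illustrative:
# Confirm the node reports idle or mixed rather than drained:
sinfo --nodes=tr-slurm1
# Run a trivial job directly on the node to confirm it accepts work:
srun --nodelist=tr-slurm1 hostname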
Maintenance Email Generator
Fill out the fields below to generate a maintenance announcement email. You can copy the generated email and send it to the mailing list.
Email (Plain Text)
Hi WATcloud compute cluster users,
This is a maintenance announcement. Please see below for details.
SERVICES AFFECTED:
START TIME:
Tuesday, 2025-11-04 10:00 PM (ET)
DURATION:
2 hours
PURPOSE:
------------------------------------------------------------
If you have any questions or concerns, please contact: [email protected]
Discord (Markdown)
**Hi WATcloud compute cluster users,**
**This is a maintenance announcement. Please see below for details.**
**SERVICES AFFECTED:**
**START TIME:**
Tuesday, 2025-11-04 10:00 PM (ET)
**DURATION:**
2 hours
**PURPOSE:**
------------------------------------------------------------
If you have any questions or concerns, please contact: [email protected]
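If you prefer to generate the announcement from a shell instead of the form above, a small script can fill in the template. This is a sketch; all variable values are illustrative, and the contact address is a placeholder for the one shown in the templates above:
#!/bin/bash
# Fill in the maintenance announcement template (all values illustrative).
SERVICES="SLURM compute nodes trpro-slurm1, trpro-slurm2"
START_TIME="Tuesday, 2025-11-04 10:00 PM (ET)"
DURATION="2 hours"
PURPOSE="Firmware upgrades on the baremetal host"
CONTACT="contact@example.com"  # placeholder; use the address from the templates above

cat <<EOF
Hi WATcloud compute cluster users,
This is a maintenance announcement. Please see below for details.

SERVICES AFFECTED:
$SERVICES

START TIME:
$START_TIME

DURATION:
$DURATION

PURPOSE:
$PURPOSE

------------------------------------------------------------
If you have any questions or concerns, please contact: $CONTACT
EOF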
Troubleshooting
SLURM Invalid user id when creating a reservation
If you get Error creating the reservation: Invalid user id when creating a reservation, it may mean that you don't have sufficient permissions to create a reservation. You can try running the command with sudo (using $(which scontrol), since scontrol may not be on root's default PATH). For example:
> sudo $(which scontrol) create reservation starttime=2025-04-27T19:00:00 duration=120 user=root flags=maint nodes=thor-slurm1
Reservation created: root_12
SLURM Requested nodes are busy when creating a reservation
If you get Error creating the reservation: Requested nodes are busy when creating a reservation, it means that the nodes you are trying to reserve are already in use, e.g., there are jobs on the node whose end times fall within the reservation window.
> scontrol create reservation starttime=2025-04-27T19:00:00 duration=120 user=root flags=maint nodes=thor-slurm1
Error creating the reservation: Requested nodes are busy
You can check which jobs are running on the node with:
squeue --nodelist=<NODE_NAME>
For example:
> squeue --nodelist=thor-slurm1
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
156143 compute slurm-sl watcloud R 0:03 1 thor-slurm1
156144 compute slurm-sl watcloud R 0:02 1 thor-slurm1
156133 compute slurm-sl watcloud R 0:23 1 thor-slurm1
You can view the end time of a job with:
scontrol show job <JOB_ID>
For example:
> scontrol show job 156143
JobId=156143 JobName=slurm-slurm-runner-large-41221285582
UserId=watcloud-slurm-ci(1814) GroupId=watcloud-slurm-ci(1814) MCS_label=N/A
Priority=1116 Nice=0 Account=watonomous-watcloud QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:38 TimeLimit=00:30:00 TimeMin=N/A
SubmitTime=2025-04-27T05:39:27 EligibleTime=2025-04-27T05:39:27
AccrueTime=2025-04-27T05:39:29
StartTime=2025-04-27T05:39:29 EndTime=2025-04-27T06:09:29 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-04-27T05:39:29 Scheduler=Main
Partition=compute AllocNode:Sid=10.1.77.47:21
ReqNodeList=(null) ExcNodeList=(null)
NodeList=thor-slurm1
BatchHost=thor-slurm1
NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=4,mem=8G,node=1,billing=4,gres/tmpdisk=16384
AllocTRES=cpu=4,mem=8G,node=1,billing=4,gres/tmpdisk=16384
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=4 MinMemoryCPU=2G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/watcloud-slurm-ci/apptainer.sh
WorkDir=/home/watcloud-slurm-ci
StdErr=/var/log/slurm-ci/slurm-ci-156143.out
StdIn=/dev/null
StdOut=/var/log/slurm-ci/slurm-ci-156143.out
TresPerNode=gres/tmpdisk:16384
TresPerTask=cpu=4
If the job is long-running and you think it will finish before the maintenance starts, you can force-create the reservation with the ignore_jobs flag. For example:
> scontrol create reservation starttime=2025-04-27T19:00:00 duration=120 user=root flags=maint,ignore_jobs nodes=thor-slurm1
Reservation created: root_12
If you do this, make sure the job has actually finished before proceeding with the maintenance.
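As a quick pre-maintenance check, squeue can print expected job end times directly, so you can confirm that nothing will still be running when the window opens. A sketch using standard squeue format specifiers (%i for job id, %e for expected end time):
# List jobs on the node together with their expected end times:
squeue --nodelist=thor-slurm1 --format="%i %e"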
Footnotes
1. The official documentation for reservations is at https://slurm.schedmd.com/reservations.html