WATcloud Maintenance Manual

This manual outlines the maintenance procedures for various components of WATcloud.

General procedure

  1. Plan the maintenance: Prepare a plan for the maintenance, including the start time, end time, and the steps to be taken during the maintenance. Identify the components and services that will be affected. Try to minimize the impact on users by using strategies like rolling updates.
  2. Notify users: If the maintenance will affect users, notify them in advance. Make sure to give users plenty of time to prepare for the maintenance. In general, one week's notice is recommended.
  3. Perform the maintenance: Follow the steps outlined in the maintenance plan. If the maintenance runs over the scheduled end time, notify users of the delay.
  4. Verify the maintenance: After the maintenance is complete, verify that all components are working as expected (including CI pipelines). If there are any issues, address them immediately. Use observability tools to monitor the health of the system.
  5. Notify users: Once the maintenance is complete, update the maintenance announcement to indicate that the maintenance is complete. If there were any issues during the maintenance, provide details on what happened and how it was resolved.

SLURM

This section outlines the maintenance procedures for the SLURM cluster.

Cluster overview

To get a general overview of the health of the SLURM cluster, you can run:

sinfo --long

Example output:

Thu Apr 18 17:16:26 2024
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE RESERVATION NODELIST
compute*     up 1-00:00:00 1-infinite   no       NO        all      1     drained             tr-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      1       mixed             thor-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      3        idle             trpro-slurm[1-2],wato2-slurm1

In the output above, tr-slurm1 is in the drained state, which means it is not available for running jobs. thor-slurm1 is in the mixed state, which means some of its resources are allocated to running jobs. All other nodes are in the idle state, which means no jobs are running on them.
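
If you only want to see nodes that are unavailable and why, one quick option is:

sinfo -R

This lists the reason, user, timestamp, and node list for nodes that are down, drained, or failing.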

To get a detailed overview of nodes in the cluster, you can run:

scontrol show node [NODE_NAME]

The optional NODE_NAME argument can be used to restrict the output to a specific node.

Example output:

> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUEfctv=58 CPUTot=60 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
   NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
   OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
   RealMemory=39140 AllocMem=0 FreeMem=29723 Sockets=60 Boards=1
   CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=compute
   BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
   LastBusyTime=2024-04-16T19:16:13 ResumeAfterTime=None
   CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
   Reason=Performing maintenance on baremetal [root@2024-04-18T17:06:20] 

In the output above, we can see that tr-slurm1 is in the drained state (a.k.a. IDLE+DRAIN) with the reason Performing maintenance on baremetal. The Reason field is an arbitrary user-specified string that can be set when performing actions on nodes.
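
If you only need a node's state and reason, one way to trim the output down (a simple sketch using grep) is:

scontrol show node tr-slurm1 | grep -E "State=|Reason="

This prints just the lines containing the State and Reason fields shown above.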

Performing maintenance on a node

Creating a reservation

The first step to performing maintenance on a node is to create a reservation[1]. This ensures that user-submitted jobs do not get dispatched to the target nodes if they cannot be completed before the maintenance window starts.

scontrol create reservation starttime=<START_TIME> duration=<DURATION_IN_MINUTES> user=root flags=maint nodes=<NODE_NAMES>

The starttime argument is interpreted in the local time zone of the machine where the command is run. To see the local time zone, run timedatectl.
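
For example, on a systemd-based host you can check the configured time zone with:

timedatectl | grep "Time zone"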

Here's an example of creating a reservation:

scontrol create reservation starttime=2024-04-30T21:00:00 duration=480 user=root flags=maint nodes=trpro-slurm1,trpro-slurm2

Output:

Reservation created: root_4

This command creates a reservation named root_4. The reservation starts on April 30, 2024, at 9:00 PM local time and lasts for 8 hours (480 minutes) for the nodes trpro-slurm1 and trpro-slurm2. During this time, only the root user can run jobs on these nodes. Jobs submitted by other users will be queued until the reservation is over.

To see existing reservations, you can run:

scontrol show reservation
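
You can also pass a reservation name to show only that reservation. For example, to inspect the reservation created above:

scontrol show reservation root_4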

To delete a reservation, you can run:

scontrol delete reservation <RESERVATION_NAME>
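
For example, to remove the reservation created earlier:

scontrol delete reservation root_4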

Starting maintenance

Before performing maintenance on a node, you should drain the node to ensure no jobs are running on it and no new jobs are scheduled to run on it.

scontrol update nodename="<NODE_NAME>" state=drain reason="Performing maintenance for reason X"

For example:

scontrol update nodename="tr-slurm1" state=drain reason="Performing maintenance on baremetal"

This will drain the node tr-slurm1 (prevent new jobs from running on it) and set the reason to Performing maintenance on baremetal. If there are no jobs running on the node, the node state becomes drained (a.k.a. IDLE+DRAIN in scontrol). If there are jobs running on the node, the node state becomes draining (a.k.a. MIXED+DRAIN in scontrol). In this case, SLURM will wait for the jobs to finish before transitioning the node to the drained state.
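
To see which jobs are still running on a draining node (and are therefore holding up the transition to the drained state), you can check the queue for that node:

squeue --nodelist=tr-slurm1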

Example output from when a node is in the draining state:

> sinfo --long
Thu Apr 18 17:17:35 2024
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE RESERVATION NODELIST
compute*     up 1-00:00:00 1-infinite   no       NO        all      1    draining             tr-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      1       mixed             thor-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      3        idle             trpro-slurm[1-2],wato2-slurm1

> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=1 CPUEfctv=58 CPUTot=60 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
   NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
   OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
   RealMemory=39140 AllocMem=512 FreeMem=29688 Sockets=60 Boards=1
   CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
   State=MIXED+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=compute
   BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
   LastBusyTime=2024-04-18T17:15:30 ResumeAfterTime=None
   CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
   AllocTRES=cpu=1,mem=512M,gres/tmpdisk=300
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
   Reason=Performing maintenance on baremetal [root@2024-04-18T17:16:01]

After jobs finish running on the node, the node will transition to the drained state:

> sinfo --long
Thu Apr 18 17:22:07 2024
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE RESERVATION NODELIST
compute*     up 1-00:00:00 1-infinite   no       NO        all      1     drained             tr-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      1       mixed             thor-slurm1
compute*     up 1-00:00:00 1-infinite   no       NO        all      3        idle             trpro-slurm[1-2],wato2-slurm1

> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUEfctv=58 CPUTot=60 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
   NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
   OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
   RealMemory=39140 AllocMem=0 FreeMem=29688 Sockets=60 Boards=1
   CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=compute
   BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
   LastBusyTime=2024-04-18T17:21:13 ResumeAfterTime=None
   CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
   Reason=Performing maintenance on baremetal [root@2024-04-18T17:16:01]

Once the node is in the drained state, you can perform maintenance on it.

Taking a node out of maintenance mode

To take a node out of maintenance mode, you can run:

scontrol update nodename="<NODE_NAME>" state=resume

For example:

scontrol update nodename="tr-slurm1" state=resume

This will resume the node tr-slurm1 (allow new jobs to run on it) and clear the reason.
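
To confirm that the node is back in service, you can check its state again:

sinfo --nodes=tr-slurm1

The node should show up as idle (or mixed once jobs start running on it again).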

Also remember to delete any unexpired reservations (as outlined in Creating a reservation).

Footnotes

  1. The official documentation for reservations is at https://slurm.schedmd.com/reservations.html