Machine Usage Guide

This document provides an overview of the machines in the WATcloud compute cluster, including their hardware, networking, operating system, services, and software. It also includes guidelines for using the machines, troubleshooting instructions for common issues, and information about maintenance and outages.

Types of Machines

There are two main types of machines in the cluster: general-use machines and SLURM compute nodes. We will refer to them both as "development machines" in this document.

General-Use Machines

General-use machines are meant for interactive use and are shared among all users in the cluster. Additionally, general-use machines marked as SLURM login nodes (SL) in the machine list can be used to submit jobs to the SLURM cluster.

Instructions for accessing our general-use machines can be found in our SSH documentation.
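For example, connecting to a general-use machine from your terminal looks like the following (the hostname below is a placeholder; substitute one from the machine list in the SSH documentation):

# Connect to a general-use machine over SSH.
# "machine-name" is a placeholder -- use a hostname from the machine list.
ssh your-username@machine-name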

SLURM Compute Nodes

Simple Linux Utility for Resource Management (SLURM) is an open-source job scheduler that allocates resources to jobs on a cluster of computers. It is widely used in HPC[1] environments to provide a fair and efficient way to run jobs on a shared cluster.

Instructions for accessing our SLURM cluster can be found in our SLURM documentation.
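As a quick sketch (the resource values below are illustrative; see the SLURM documentation for the options available on our cluster), an interactive session can be requested with srun and a batch job submitted with sbatch:

# Request an interactive session with 4 CPUs and 8 GB of RAM (values illustrative)
srun --cpus-per-task 4 --mem 8G --pty bash
# Submit a batch script to the queue
sbatch my_job.sh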

Hardware

Most machines in the cluster come with standard workstation hardware, including CPU, RAM, GPU, and storage[2]. In special cases, you can request to have specialized hardware such as FPGAs installed in the machines.

Networking

All machines in the cluster are connected to both the university network (using 10Gbps or 1Gbps Ethernet) and a cluster network (using 40Gbps or 10Gbps Ethernet). The IP address range for the university network is 129.97.0.0/16[3] and the IP address range for the cluster network is 10.0.50.0/24.
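To see which networks a machine is connected to, you can list its interface addresses; an address in 129.97.0.0/16 is on the university network, and one in 10.0.50.0/24 is on the cluster network:

# List all network interfaces and their IP addresses
ip -brief addr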

Operating System

All development machines are virtual machines (VMs)[4]. This setup allows us to easily manage machines remotely and reduce the complexity of the bare-metal OSes.

Services

/home Directory

We run an SSD-backed Ceph[5] cluster to provide distributed storage for machines in the cluster. All development machines share a common /home directory that is backed by the Ceph cluster.

Because SSDs are relatively expensive and large file transfers can slow down the filesystem for all users, the home directory should only be used for storing small files. If you need to store large files (e.g. datasets, videos, ML model checkpoints), please use one of the other storage options below.
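A quick way to spot large files in your home directory that should be moved to another storage option:

# List the 20 largest files and directories under your home directory
du -ah ~ | sort -rh | head -n 20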

/mnt/wato-drive* Directory

We have a few HDD-backed NFS[6] servers that provide large storage for machines in the cluster. These NASes are mounted on all development machines at the /mnt/wato-drive* directories. You can use these mounts to store large files such as datasets and ML model checkpoints.
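For example, moving a dataset out of your home directory might look like this (the specific mount and per-user subdirectory layout below are illustrative; use whichever /mnt/wato-drive* mount has space):

# Copy a dataset to NFS-backed storage, then remove the original
rsync -a --info=progress2 ~/my_dataset/ /mnt/wato-drive2/$USER/my_dataset/
rm -r ~/my_dataset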

/mnt/scratch Directory

Every general-use machine has an SSD-backed local storage pool that is mounted at the /mnt/scratch directory. These storage pools are meant for temporary storage for jobs that require fast and reliable filesystem access, such as storing training data and model checkpoints for ML workloads.

The space on /mnt/scratch is limited and shared between all users. Please make sure to clean up your files frequently (after every job). To promote good hygiene, there is an aggressive soft quota on the /mnt/scratch directory. Please refer to the Quotas page for more information.

Scratch space is available on SLURM compute nodes as well. It is mounted at /tmp and can be requested using the tmpdisk resource.
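A hedged sketch of requesting scratch space along with a job (the exact GRES name and size units are assumptions; see the SLURM documentation for the authoritative syntax):

# Request node-local scratch (mounted at /tmp) with an interactive job.
# Assumes tmpdisk is exposed as a GRES sized in MB -- check the SLURM docs.
srun --gres tmpdisk:10240 --pty bash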

Docker

Every development machine has Docker Rootless[7] installed. On general-use machines, the Docker daemon is automatically started[8] when you log in. On SLURM compute nodes, the Docker daemon needs to be started manually.
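On a compute node, you can start the daemon with the same systemd user service that the general-use machines use:

# Start the rootless Docker daemon manually (e.g. inside a SLURM job)
systemctl --user start docker-rootless
# Confirm that it is running
systemctl --user status docker-rootless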

On general-use machines, the storage location for Docker is set to /var/lib/cluster/users/$UID/docker, where $UID is your user ID. /var/lib/cluster is an SSD-backed storage pool, and there is a per-user storage quota to ensure that everyone has enough space to run their workloads. Please refer to the Quotas page for more information.

S3-compatible Object Storage

We run an S3-compatible object storage service on the Ceph cluster. If you require this functionality, please contact a WATcloud admin to get access.
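Once you have credentials, any S3-compatible client should work. A minimal sketch using the AWS CLI (the endpoint URL below is a placeholder; use the one provided when you are granted access):

# Configure the credentials provided by a WATcloud admin
aws configure set aws_access_key_id YOUR_ACCESS_KEY
aws configure set aws_secret_access_key YOUR_SECRET_KEY
# List your buckets on the cluster's S3 endpoint (placeholder URL)
aws --endpoint-url https://s3.example.internal s3 ls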

Kubernetes

We run a Kubernetes cluster using microk8s. This is mostly for running internal WATcloud services. However, if you require this functionality, please contact a WATcloud admin to get access.

GitHub Actions Runners

We run a GitHub Runner farm on Kubernetes using actions-runner-controller. Currently, it's enabled for the WATonomous organization. If you require this functionality, please reach out to a WATcloud admin to get access.

Software

We try to keep the machines lean and generally refrain from installing software that makes sense for rootless installation or for running in containerized environments.

Examples of software that we install:

  • Docker (rootless)
  • NVIDIA Container Toolkit
  • zsh
  • various CLI tools (e.g. vifm, iperf, moreutils, jq, ncdu)

Examples of software that we do not install:

If there is a piece of software that you think should be installed on the machines, please reach out to a WATcloud team member.

Maintenance and Outages

We try to keep machines in the cluster up and running at all times. However, we need to perform regular maintenance to keep machines up-to-date and services running smoothly. All scheduled maintenance will be announced in infrastructure-support discussions[9]. Emergency maintenance and maintenance that has little effect on user experience will be announced in the #🌩-watcloud-use channel on Discord.

Sometimes, machines in the cluster may go down unexpectedly due to hardware failures or power outages. We have a comprehensive suite of healthchecks and internal monitoring tools[10] to detect these failures and notify us. However, due to the part-time nature of the student team, we may not be able to respond to these failures immediately. If you notice that a machine is down, please ping the WATcloud team on Discord (@WATcloud or @WATcloud Leads, in the #🌩-watcloud-use channel).

To see if a machine is having issues, please visit status.watonomous.ca. The WATcloud team uses this page as a dashboard to monitor the health of machines in the cluster.

Usage Guidelines

  • Use SLURM as much as possible. SLURM streamlines resource allocation. You get a dedicated environment for your job, and you don't have to worry about CPU/memory contention.
  • Be nice
    • If you have a long-running non-interactive process on a general-use machine, please increase its niceness so that interactive programs don't lag (see the sketch after this list).
    • Being nice is as simple as changing ./my_program arg1 arg2 to nice ./my_program arg1 arg2.
  • Clean up after yourself
    • If you are using /mnt/scratch on a general-use machine, please make sure to clean up your files after you are done with them.
    • Please only use /home for small files. Writing large files to /home will significantly slow down the filesystem for all users.
    • /mnt/wato-drive* are large storage pools, but they are not infinite and can fill up quickly with today's large datasets. Please remove unneeded files from these directories.
    • When using Docker on general-use machines, please clean up your Docker images and containers regularly. docker system prune --all is your friend.
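A small sketch combining these guidelines (the per-user scratch path below is an assumption; check where your files actually live):

# Run a long non-interactive job at low priority so interactive users aren't affected
nice -n 19 ./my_program arg1 arg2
# Check how much scratch space you are using (per-user directory layout is an assumption)
du -sh /mnt/scratch/$USER
# Clean up when the job is done
rm -r /mnt/scratch/$USER/my_job
docker system prune --all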

Troubleshooting

This section contains some common issues that users may encounter when using machines in the cluster and their solutions. If you encounter an issue that is not listed here, please reach out.

Permission denied while trying to connect to the Docker daemon

You may encounter this error when trying to run Docker commands on general-use machines:

> docker ps
permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/json": dial unix /var/run/docker.sock: connect: permission denied

This error occurs when your shell doesn't source /etc/profile (zsh is known to do this). Our Docker rootless setup requires that the DOCKER_HOST environment variable be set properly, which is done in /etc/profile. To fix this, please add the following line to your shell's rc file (e.g. ~/.zshrc):

export DOCKER_HOST=unix://${XDG_RUNTIME_DIR}/docker.sock

Remember to restart your shell or source the rc file after making the change.
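You can verify that the variable is set and that the daemon is reachable:

# Should print something like unix:///run/user/<your-uid>/docker.sock
echo "$DOCKER_HOST"
# Should now connect without a permission error
docker ps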

Disk quota exceeded when running Docker commands

You may encounter the following error when running Docker commands on general-use machines:

> docker pull hello-world
...
open /var/lib/cluster/users/$UID/docker/tmp/GetImageBlob3112047691: disk quota exceeded

This means that you have exceeded your allocated storage quota[11]. Here are some commands you can use to free up disk space[12]:

# remove dangling images (images without tags)
docker image prune
# remove all images without an existing container
docker image prune --all
# remove stopped containers
docker container prune
# remove all volumes not used by at least one container
docker volume prune
# remove all stopped containers, dangling images, unused networks, and unused build cache
docker system prune
# same as above, but also removes unused volumes
docker system prune --volumes
# same as above, but also removes unused images
docker system prune --volumes --all
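To see which of these categories is taking up the most space before pruning, you can check Docker's disk usage summary:

# Show Docker disk usage broken down by images, containers, volumes, and build cache
docker system df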

Cannot connect to the Docker daemon

You may encounter this error when trying to run Docker commands on general-use machines:

> docker ps
Cannot connect to the Docker daemon at unix:///run/user/$UID/docker.sock. Is the docker daemon running?

To troubleshoot this issue, please run the following command to confirm whether the Docker daemon is running:

systemctl --user status docker-rootless

A healthy Docker daemon should look like this:

 docker-rootless.service - Docker Application Container Engine (Rootless)
     Loaded: loaded (/usr/lib/systemd/user/docker-rootless.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2024-01-16 05:47:04 UTC; 44s ago
       Docs: https://docs.docker.com/go/rootless/
   Main PID: 2090042 (rootlesskit)
      Tasks: 63
     Memory: 54.7M
        CPU: 587ms
     CGroup: /user.slice/user-1507.slice/[email protected]/app.slice/docker-rootless.service
             ├─2090042 rootlesskit --net=slirp4netns --mtu=65520 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopback --port-driver=builtin --copy-up=/etc --copy-up=/run --propagation=rslave /usr/bin/dockerd-rootless.sh --data-root /var/lib/cluster/users/1507/docker --config-file /etc/docker/daemon.json
             ├─2090053 /proc/self/exe --net=slirp4netns --mtu=65520 --slirp4netns-sandbox=auto --slirp4netns-seccomp=auto --disable-host-loopback --port-driver=builtin --copy-up=/etc --copy-up=/run --propagation=rslave /usr/bin/dockerd-rootless.sh --data-root /var/lib/cluster/users/1507/docker --config-file /etc/docker/daemon.json
             ├─2090072 slirp4netns --mtu 65520 -r 3 --disable-host-loopback --enable-sandbox --enable-seccomp 2090053 tap0
             ├─2090080 dockerd --data-root /var/lib/cluster/users/1507/docker --config-file /etc/docker/daemon.json
             └─2090112 containerd --config /run/user/1507/docker/containerd/containerd.toml

If the daemon failed to start, the output may look like this:

× docker-rootless.service - Docker Application Container Engine (Rootless)
     Loaded: loaded (/usr/lib/systemd/user/docker-rootless.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Tue 2024-01-16 05:15:38 UTC; 9min ago
       Docs: https://docs.docker.com/go/rootless/
    Process: 2049816 ExecStart=/usr/bin/dockerd-rootless.sh --data-root /var/lib/cluster/users/1507/docker --config-file /etc/docker/daemon.json (code=exited, status=1/FAILURE)
   Main PID: 2049816 (code=exited, status=1/FAILURE)
        CPU: 283ms
 
Jan 16 05:15:38 trpro-ubuntu2 systemd[73045]: docker-rootless.service: Scheduled restart job, restart counter is at 3.
Jan 16 05:15:38 trpro-ubuntu2 systemd[73045]: Stopped Docker Application Container Engine (Rootless).
Jan 16 05:15:38 trpro-ubuntu2 systemd[73045]: docker-rootless.service: Start request repeated too quickly.
Jan 16 05:15:38 trpro-ubuntu2 systemd[73045]: docker-rootless.service: Failed with result 'exit-code'.
Jan 16 05:15:38 trpro-ubuntu2 systemd[73045]: Failed to start Docker Application Container Engine (Rootless).

To get more information about why the daemon failed to start, you can view the logs by running:

journalctl --user --catalog --pager-end --unit docker-rootless.service

One of the most common reasons for the Docker daemon to fail to start is that the storage quota has been exceeded. If this is the case, you will see disk quota exceeded in the logs:

Jan 16 05:15:35 trpro-ubuntu2 dockerd-rootless.sh[2049885]: time="2024-01-16T05:15:35.066806684Z" level=info msg="containerd successfully booted in 0.020404s"
Jan 16 05:15:35 trpro-ubuntu2 dockerd-rootless.sh[2049852]: time="2024-01-16T05:15:35.077397993Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
Jan 16 05:15:36 trpro-ubuntu2 dockerd-rootless.sh[2049852]: failed to start daemon: Unable to get the TempDir under /var/lib/cluster/users/1507/docker: mkdir /var/lib/cluster/users/1507/docker/tmp: disk quota exceeded
Jan 16 05:15:36 trpro-ubuntu2 dockerd-rootless.sh[2049827]: [rootlesskit:child ] error: command [/usr/bin/dockerd-rootless.sh --data-root /var/lib/cluster/users/1507/docker --config-file /etc/docker/daemon.json] exited: exit status 1
Jan 16 05:15:36 trpro-ubuntu2 dockerd-rootless.sh[2049816]: [rootlesskit:parent] error: child exited: exit status 1
Jan 16 05:15:36 trpro-ubuntu2 systemd[73045]: docker-rootless.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit UNIT has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Jan 16 05:15:36 trpro-ubuntu2 systemd[73045]: docker-rootless.service: Failed with result 'exit-code'.

For more information about storage quotas, please see the Quotas page.

To free up space for Docker, you can selectively delete files from the /var/lib/cluster/users/$(id -u)/docker directory until you are no longer over quota. However, it can be difficult to determine which files to delete. If you would like to delete all your Docker files (including images, containers, volumes, etc.), which effectively resets your Docker installation, you can delete the entire directory:

rootlesskit rm -r /var/lib/cluster/users/$(id -u)/docker

Here, rootlesskit puts us in the "fake root" environment used by Docker rootless[13], which allows us to manage files owned by subuid/subgid.

If you would prefer to use Docker tools to clean up your Docker installation (e.g. docker system prune --all), you can reach out to a WATcloud team member to temporarily increase your storage quota so that the Docker daemon can be started.

Once you have freed up enough space, you can restart the Docker daemon by running:

systemctl --user restart docker-rootless

Footnotes

  1. High-performance computing (HPC) is the use of supercomputers and parallel processing techniques to solve complex computational problems. Examples of HPC clusters include Cedar and Graham.

  2. Machine specs can be found here.

  3. The IP range for the university network can be found here.

  4. We use Proxmox as our hypervisor.

  5. Ceph is a distributed storage system that provides high performance and reliability.

  6. NFS stands for "Network File System" and is used to share files over a network.

  7. Docker is a platform for neatly packaging software, both for development and deployment. Docker Rootless is a way to run Docker without root privileges.

  8. The Docker daemon is started using a systemd user service.

  9. The GitHub team @WATonomous/watcloud-compute-cluster-users will be notified. Please ensure that you enable notifications to receive these notices.

  10. Please refer to the Observability page to learn more about the tools we use to monitor the cluster.

  11. For more information about storage quotas, including how to check your current quota usage, please see the Quotas page.

  12. For more information about the Docker prune commands, please refer to the Docker manual.

  13. Learn more about rootlesskit here.