Joining WATcloud

  • Automation
  • DevOps
  • Sysadmin
  • HPC
  • Linux
  • Terraform
  • Ansible
  • Infrastructure as Code
  • CI/CD
  • Observability
  • Kubernetes
  • SLURM
  • HomeLab
  • Web Development
  • Robots taking over the world

Does any of the above sound familiar? Are you interested in bringing powerful compute to the masses? Do you want to work with compute infrastructure similar to that used by the world's most well-known companies[1]? If so, we'd love to have you on board!

Who we're looking for

WATcloud is not like a course project, where you can do some work, get a grade, and then forget about it. We provide a service that is always up. We have users who depend on us. We have a responsibility to keep our service running and to keep our users happy. We're looking for people who are passionate about what they do, and who are willing to put in the effort to build and quickly iterate on projects until every aspect is fully automated, reliable, observable, and trivially maintainable[2].

How to apply

The best way to join WATcloud is to start contributing! We have a backlog of projects that we'd like to work on, but the list always grows faster than we can work through it. A few of our projects are self-contained enough that anyone can pick them up. Below is a list of such projects. If you are able to complete one of these projects independently, we will be happy to bring you on board immediately! If you are interested in working on a project that is not listed below, such as hardware projects (e.g. building computers, upgrading networking hardware) and projects that directly affect our infrastructure (Kubernetes, Terraform, Ansible, etc.), please reach out to us on Discord or email [email protected].

Projects

If you can complete one of the projects below, you are guaranteed a spot on the team!

Terraform Provider Rate Limiting

We use a custom Terraform provider for managing outgoing emails. Currently, the provider is a simple wrapper around an SMTP client. The SMTP server we use appears to enforce a rate limit: it returns errors when we try to send more than a few emails in quick succession.

We would like to add rate limiting to the provider. Here's the ticket for this feature.
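One possible shape for the rate limiting is a minimum-interval limiter that spaces out consecutive sends. Terraform providers are typically written in Go, so the Python sketch below only illustrates the pacing logic. The names `MinIntervalLimiter` and `send_all` are hypothetical, the real interval would have to be found by experiment since the server's limit is not known, and the class is not thread-safe as written.

```python
import time


class MinIntervalLimiter:
    """Spaces out actions so that consecutive ones are at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock          # injectable for testing
        self._sleep = sleep          # injectable for testing
        self._next_allowed = None    # no action has happened yet

    def wait(self):
        """Sleep if the previous action was too recent. Returns the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return delay


def send_all(messages, send, limiter):
    """Pass each message to `send`, pacing the calls with `limiter`."""
    for message in messages:
        limiter.wait()
        send(message)
```

In this sketch the first message goes out immediately and each later one waits for the remainder of the interval. A provider that sends from several goroutines or threads would additionally need a lock around `wait`.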

Blog Comment/Vote System

As we prepare to launch the WATcloud blog, we would like to integrate a comment/vote system that allows readers to engage with the content. Some requirements for this system are:

  • Easy to deploy and maintain: We want a system where almost all components can be automatically deployed (Infrastructure as Code).
  • Minimal infrastructure requirements: We want to avoid running servers/databases if possible3.
  • No paid subscriptions: We want to avoid services that require a paid subscription to use. This is because our funding does not allow for recurring costs.

Currently, we are considering the following options:

  1. Giscus

  2. Utterances

  3. A simple like button

Some other options are described in these articles:

The blog is a part of our website. The website source code is available at https://github.com/WATonomous/watcloud-website.

Azure Cost Estimator

We use an Azure nonprofit subscription for several projects, with an annual credit limit. Tracking our usage effectively is challenging due to these limitations in Azure:

  1. The Azure portal only displays current usage without projections or historical trends.
  2. Access to the Azure sponsorship portal is restricted to a single user.

To better manage our resources, we need a tool that provides detailed insights into our Azure credit usage. The ideal tool would:

  1. Display current usage and remaining credits.
  2. Chart historical usage trends.
  3. Project future usage based on past data.
  4. Provide detailed breakdowns by resource for all the above metrics.
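As a rough illustration of items 1 and 3 above, the sketch below linearly extrapolates cumulative spend to the end of the credit period. It makes no Azure API calls. The function name `project_usage`, the input format (a list of dated cumulative-spend samples), and the straight-line model are all assumptions made for the example, and a real tool might prefer a model that accounts for seasonality.

```python
from datetime import date


def project_usage(samples, credit_limit, period_end):
    """Linearly extrapolate cumulative spend to `period_end`.

    samples: list of (date, cumulative_spend) tuples, oldest first.
    credit_limit: total credits available for the period.
    period_end: date on which the credit period ends.
    """
    (first_day, first_spend), (last_day, last_spend) = samples[0], samples[-1]
    elapsed_days = (last_day - first_day).days
    if elapsed_days <= 0:
        raise ValueError("samples must span at least one day")

    daily_burn = (last_spend - first_spend) / elapsed_days
    days_left = max((period_end - last_day).days, 0)
    projected = last_spend + daily_burn * days_left

    return {
        "remaining": credit_limit - last_spend,
        "daily_burn": daily_burn,
        "projected_spend": projected,
        "over_limit": projected > credit_limit,
    }
```

Running the same function once per resource on that resource's own samples would give the per-resource breakdown in item 4.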

We are considering using CAnalyzer, but are open to any other suggestions.

If you don't have access to an Azure subscription, we can give you read-only access to our Azure portal. Please fill out the onboarding form (make sure to enable the Azure section) and let us know.

File Auto-Expiration Tool

This project is currently in the deployment stage. The source code will be made available once deployment is complete.

At WATcloud, we have many shared drives that our users store files on. Some drives, like the scratch drive, are meant for temporary storage. However, users often forget to delete their files, and the drives quickly fill up. We need a tool that can give us a list of files that have not been accessed in a long time, so that we can take appropriate action (e.g. notify the user, then delete the file). This tool should be a lightweight script that we can run on a schedule.

Assume that the drive is 2-5 TiB, backed by NVMe SSD. The filesystem type is flexible, but preferably ext4 or xfs. The tool should have minimal impact on drive lifespan. Please be aware of the different timestamp types (e.g. access time, modification time, inode change time), and how they are accounted for by different filesystems and access methods.
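A minimal sketch of the scanning part, assuming a plain directory walk is acceptable for a drive of this size. It reads only metadata through `stat`, so it never opens file contents. The function name `find_stale_files` and the choice to treat the later of access time and modification time as "last used" are assumptions. That choice matters because on a `noatime` mount the access time is never updated, and on the common `relatime` mount it is updated at most about once a day, so access time alone can be misleading.

```python
import os
import time


def find_stale_files(root, max_age_days, now=None):
    """Yield (path, last_used) for regular files unused for more than `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stack = [root]
    while stack:
        directory = stack.pop()
        try:
            with os.scandir(directory) as it:
                entries = list(it)
        except OSError:
            continue  # unreadable or vanished directory; skip it
        for entry in entries:
            try:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    info = entry.stat(follow_symlinks=False)
                    # Later of access and modification time, in seconds since the epoch.
                    last_used = max(info.st_atime, info.st_mtime)
                    if last_used < cutoff:
                        yield entry.path, last_used
            except OSError:
                continue  # file disappeared between listing and stat
```

For example, `for path, ts in find_stale_files("/mnt/scratch", 30): print(path)` would list candidates, where `/mnt/scratch` is a placeholder path. Notifying users and deleting files are left out on purpose.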

Automatic DNS Failover

This project is currently in the deployment stage. The source code will be made available once deployment is complete.

We host a Kubernetes cluster on our infrastructure and run a number of services on it. The services are exposed via nginx-ingress. Several machines are assigned the same DNS name. For example, we could have s3.watonomous.ca point to all Kubernetes hosts in the cluster (using multiple DNS A records). A client accessing s3.watonomous.ca sends its request to one of the hosts, and nginx-ingress routes the request to the appropriate service. This is a simple way to reduce downtime: if one of n hosts goes down, there's only a 1/n chance that a given client will be affected[4]. However, this is still not ideal. Most clients have no retry mechanism, and it is rarer still for a client to retry with a fresh DNS lookup. We would like a tool that automatically detects when a host goes down and removes its record from the DNS server. This way, clients will be less likely to be affected by a host going down.

We use Cloudflare as our DNS provider. Cloudflare was generous enough to give us a sponsorship that includes Zero-Downtime Failover. This works well for externally-accessible services, but we also have internal services that resolve to IP addresses that are only accessible from within the cluster. This tool will help us achieve a similar[5] level of reliability for internal services.
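The detect-and-remove step described in this section could be sketched as a pure planning function that compares the published A records against a health check. Everything named here is hypothetical: `is_reachable` is a simple TCP connect probe, port 443 is an assumed default, and `min_healthy` is an assumed safety guard. Applying the plan through the DNS provider's API is deliberately left out.

```python
import socket


def is_reachable(ip, port=443, timeout=2.0):
    """Return True if a TCP connection to ip:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def plan_dns_changes(published, candidates, check=is_reachable, min_healthy=1):
    """Decide which A records to add and remove.

    published: IPs currently present in DNS for the name.
    candidates: all host IPs that are allowed to serve the name.
    Returns (to_add, to_remove) as sets of IPs.
    """
    healthy = {ip for ip in candidates if check(ip)}
    if len(healthy) < min_healthy:
        # Too few healthy hosts: more likely a monitoring fault or a total
        # outage than something DNS can fix, so leave the records untouched.
        return set(), set()
    to_remove = set(published) - healthy
    to_add = healthy - set(published)
    return to_add, to_remove
```

Run on a schedule, this also re-adds a host once it recovers. The guard against removing every record is a design choice that trades a slower reaction for protection against a faulty monitor.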

Broken Internal Link Detector

We have a statically-generated Next.js website[6]. Sometimes, we make typos in our hyperlinks. We would like a tool that detects broken internal links. It should run at build time and fail the build if it finds a broken link. The tool should handle links to hashes (e.g. #section) in addition to links to pages. An initial brainstorm of how this could be implemented is available here.
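One way to approach this is to scan the generated HTML after the build, collect every `id` and every `<a href>`, and then check each internal link against the collected pages and anchors. The sketch below assumes the caller has already mapped site paths (such as `/docs/`) to HTML strings. The function name `find_broken_links` is hypothetical, and only `id` attributes are treated as hash targets.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class _PageScanner(HTMLParser):
    """Collects element ids and link targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.anchors = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            self.anchors.add(attrs["id"])
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])


def find_broken_links(pages):
    """pages maps a site path to its HTML. Returns a sorted list of (page, href)."""
    scanned = {}
    for path, markup in pages.items():
        scanner = _PageScanner()
        scanner.feed(markup)
        scanned[path] = scanner

    broken = []
    for path, scanner in scanned.items():
        for href in scanner.links:
            parts = urlsplit(href)
            if parts.scheme or parts.netloc:
                continue  # external link (or mailto:, etc.); out of scope here
            target, fragment = urldefrag(urljoin(path, href))
            target = urlsplit(target).path  # ignore any query string
            if target not in scanned:
                broken.append((path, href))
            elif fragment and fragment not in scanned[target].anchors:
                broken.append((path, href))
    return sorted(broken)
```

A build step could call this and exit with a non-zero status when the returned list is not empty, which would fail the build.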

Linux User Manager

At WATcloud, we use Ansible for provisioning machines and users. However, due to the nature of Ansible, user provisioning is inefficient: the provisioning pipeline's running time scales linearly with the number of users[7]. As of 2023, we have accumulated over 300 users in the cluster, so a single provisioning step takes over 15 minutes. We would like a tool that manages users on a machine and that can replace Ansible for user provisioning. This tool should accept the following arguments:

  • Managed UID range: the range of UIDs that the tool has control over
  • Managed GID range: the range of GIDs that the tool has control over
  • User list (username, UID, password, SSH keys): a list of users that the tool should manage.
  • Group list (groupname, GID, members): a list of groups that the tool should manage.
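The core of such a tool could be a single reconciliation pass that compares the desired user list with the users already on the machine and touches only accounts inside the managed UID range. The sketch below covers users only and tracks just the username and UID. The `User` type and the function `plan_user_changes` are hypothetical names, and passwords, SSH keys, and groups would follow the same pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    username: str
    uid: int


def plan_user_changes(desired, existing, uid_range):
    """Compute which managed users to create, modify, or delete.

    desired: users the tool should manage.
    existing: users currently on the machine (e.g. parsed from /etc/passwd).
    uid_range: (low, high), inclusive; accounts outside it are never touched.
    """
    low, high = uid_range
    for user in desired:
        if not low <= user.uid <= high:
            raise ValueError(f"{user.username}: UID {user.uid} is outside the managed range")

    managed = {u.username: u for u in existing if low <= u.uid <= high}
    unmanaged_names = {u.username for u in existing if not low <= u.uid <= high}
    wanted = {u.username: u for u in desired}

    clashes = sorted(set(wanted) & unmanaged_names)
    if clashes:
        raise ValueError(f"names already used by unmanaged accounts: {clashes}")

    return {
        "create": sorted(name for name in wanted if name not in managed),
        "modify": sorted(
            name for name in wanted if name in managed and wanted[name] != managed[name]
        ),
        "delete": sorted(name for name in managed if name not in wanted),
    }
```

Because the plan is computed in one pass over in-memory data, only accounts that actually differ would need a system call afterwards, which is where a speedup over running a separate command per user per action would be expected to come from.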

Footnotes

  1. Our compute infrastructure is eerily similar to the dev farm used by the Tesla Autopilot team 😱

  2. A project is trivially maintainable if it can be maintained by someone who has never seen the project before, and who has no prior knowledge of the project's internals beyond a high-level overview of its purpose. Most of the time, this involves building something that we can take down and rebuild from scratch by running a single command.

  3. The current website is a statically-generated Next.js site, hosted on GitHub Pages. We would like to keep the infrastructure requirements similar to the current website.

  4. We are assuming that there's perfect DNS round-robin or random selection.

  5. It's slightly less reliable than Cloudflare's Zero-Downtime Failover, since there's a delay between when the host goes down and when the DNS record is removed. However, this is still much better than the current situation, where the DNS record of a downed host is never removed.

  6. The source code of the website is accessible at https://github.com/WATonomous/watcloud-website

  7. Ansible issues a separate command for each action for each user. Even with the pipelining feature, the provisioning pipeline is unbearably slow.