Joining WATcloud
- Automation
- DevOps
- Sysadmin
- HPC
- Linux
- Terraform
- Ansible
- Infrastructure as Code
- CI/CD
- Observability
- Kubernetes
- SLURM
- HomeLab
- Web Development
- Robots taking over the world
Does any of the above sound familiar? Are you interested in bringing powerful compute to the masses? Do you want to work with compute infrastructure similar to those used by the world's most well-known companies1? If so, we'd love to have you onboard!
Who we're looking for
WATcloud is not like a course project, where you can do some work, get a grade, and then forget about it. We provide a service that is always up. We have users that depend on us. We have a responsibility to keep our service running, and to keep our users happy. We're looking for people who are passionate about what they do, and who are willing to put in the effort to build and quickly iterate on projects until every aspect is fully automated, reliable, observable, and trivially maintainable2. Please take a look at our guidelines to get a sense of what we expect from our team members.
How to apply
The best way to join WATcloud is to start contributing! We have a backlog of projects that we'd like to work on, but the list of projects always grows faster than we can work on them. A few of our projects are self-contained enough that anyone can pick them up and work on them. Below is a list of such projects. If you are able to complete one of these projects independently, we will be happy to bring you onboard immediately! If you are interested in working on a project that is not listed below, such as hardware projects (e.g. building computers, upgrading networking hardware) and projects that directly affect our infrastructure (Kubernetes, Terraform, Ansible, etc.), please reach out to us on Discord (opens in a new tab) or email [email protected].
Projects
If you can complete one of the projects below, you are guaranteed a spot on the team!
Automated Power Outage Notification System
Create a system to automate notifications for power outages impacting our server room (CPH, 3rd floor). The system should monitor the PlantOps Service Interruptions page (opens in a new tab) to detect relevant updates and notify cluster users as needed. Notifications should include the date, time, and expected duration of the outage. We will also add additional information to the notification after the MVP, such as the impact on the cluster and any actions users should take. Here's a list of historical announcements (opens in a new tab).
Requirements
- The tool should process historical outage data to demonstrate sufficiently low false positive and false negative rates.
- Inputs to the tool can be the website or any equivalent data source (e.g., email). The tool is responsible for parsing the natural language and generating announcements when necessary.
The deployment of this tool (mechanism to send notifications) can be done with the WATcloud team when the tool is ready for deployment.
WATcloud CLI
As a WATcloud compute cluster user, there are a few things that you might want to do frequently, such as:
- Checking your quota usage (disk, CPU, memory)
- Checking the status of user daemons like Docker rootless
- Checking the status of the cluster (whether nodes are up, whether the cluster is under maintenance)
All of these can be done in user space (no special privileges required) with commands documented in various places in our documentation. However, it can be tedious to run multiple commands to get this information. We would like to have a CLI (command-line interface) tool that can provide all of this information in a single command.
Here are some initial commands the CLI tool should have:
watcloud status
: Get the status of the cluster, show machines that are up/down and whether they are under maintenance.watcloud quota list
: List the quota usage of the user. Can be expanded in the future to submit quota edit requests.watcloud daemon status
: Get the status of user daemons like Docker rootless.
Grammar and Style Checker
We have a lot of documentation, and we want to make sure that it is easy to read and understand. We would like to have a tool that can check our documentation for grammar and style issues. This tool should detect capitalization issues, such as those outlined in the guidelines. Additionally, it would be nice if this tool can check for common grammar mistakes, such as subject-verb agreement, punctuation, and sentence structure. This tool can be run as a part of our CI/CD pipeline and should be used to check all documentation and, optionally, code comments.
Open Source Tickets
Some of our open-source projects have open tickets that anyone can work on. Here's a collection of tickets that we think are suitable for new contributors:
- Support displaying job dependencies via flows (opens in a new tab) (GitHub Actions Tracing (opens in a new tab))
- Support Dark Mode (opens in a new tab) (WATcloud Emails (opens in a new tab))
- Use SMTP pool to send emails (opens in a new tab)
File Auto-Expiration Tool
This project is currently in the deployment stage. The source code will be made available once deployment is complete.
At WATcloud, we have many shared drives that are used by our users to store files. Some drives, like the scratch drive, is meant for temporary storage. However, users often forget to delete their files, and drives quickly fills up. We need a tool that can give us a list of files that have not been accessed in a long time, so that we can take appropriate action (e.g. notify the user, then delete the file). This tool should be a lightweight script that we can run on a schedule.
Assume that the drive is 2-5 TiB, backed by NVMe SSD. The filesystem type is flexible, but preferrably ext4 or xfs. The tool should have minimal impact on drive lifespan. Please be aware of the different timestamp types (e.g. access time, modification time, inode change time), and how they are accounted for by different filesystems and access methods.
Automatic DNS failover
This project is currently in the deployment stage. The source code will be made available once deployment is complete.
We host a Kubernetes cluster on our infrastructure and run a number of services. The services are exposed via nginx-ingress (opens in a new tab). Different machines are assigned the same DNS name. For example, we could have s3.watonomous.ca
point to all Kubernetes hosts in the cluster (using multiple DNS A records), and the client accessing s3.watonomous.ca
would send requests to one of the hosts, and nginx-ingress would route the request to the appropriate service. This is a simple way to reduce downtime, since if one of the hosts goes down, there's only a 1/n
chance that the client will be affected3. However, this is still not ideal. Most clients are not designed with a retry mechanism, and certainly rarer to have a retry mechanism that re-issues DNS lookups. We would like to have a tool that can automatically detect when a host goes down, and remove its DNS record from the DNS server. This way, clients will be less likely to be affected by a host going down.
We use Cloudflare as our DNS provider. Cloudflare was generous enough to give us a sponsorship that included Zero-Downtime Failover (opens in a new tab). This works well for externally-accessible services, but we also have internal services that resolve to IP addresses that are only accessible from the cluster. This tool will help us achieve a similar4 level of reliability for internal services.
Broken External Link Detector
Completed
This project has been completed and the initial version is available here (opens in a new tab) Please don't hesitate to reach out if you have any comments or suggestions!
In broken internal link detector, we implemented a tool that can detect broken internal links in our website. We would like to extend this tool/develop a new tool that can detect broken external links.
Special considerations:
- What happens if the external link is down temporarily? We have no control over how long the link will be down.
- How do we handle links that require authentication? For example, on some member-facing pages, we have links to private GitHub repos. These links will return a 404 if the user is not authenticated. Should we simply whitelist these links, or is there a better way to handle them?
We don't know the answers to these questions, and we'd love to hear your thoughts!
Terraform Provider Rate Limiting/Retry Mechanism
Completed
This project has been completed and the source code is available here (opens in a new tab). Please don't hesitate to reach out if you have any comments or suggestions!
We use a custom Terraform provider (opens in a new tab) for managing outgoing emails. Currently, the provider is a simple wrapper around an SMTP client. The SMTP server we use appears to have a rate limit. It errors (SMTP 421) when we try to send more than a few emails in quick succession.
We would like to add rate limiting or a retry mechanism to the provider. Here's the ticket for this feature (opens in a new tab).
Blog Comment/Vote System
Completed
This project has been completed and the initial version is available here (opens in a new tab). Please don't hesitate to reach out if you have any comments or suggestions!
As we prepare to launch the WATcloud blog, we would like to integrate a comment/vote system that allows readers to engage with the content. Some requirements for this system are:
- Easy to deploy and maintain: We want a system where almost all components can be automatically deployed (Infrastructure as Code).
- Minimal infrastructure requirements: We want to avoid running servers/databases if possible5.
- No paid subscriptions: We want to avoid services that require a paid subscription to use. This is because our funding does not allow for recurring costs.
Currently, we are considering the following options:
-
- A lightweight commenting system that uses GitHub Discussions to manage and store comments.
- Supports comments and reactions.
- Examples: 1 (opens in a new tab), 2 (opens in a new tab), 3 (opens in a new tab)
-
Utterances (opens in a new tab):
- A similar system to Giscus, but uses GitHub Issues to store comments.
- Examples: 1 (opens in a new tab)
-
A simple like button:
- Simple, but often involves running a (minimal) server/database or using an external service that may require a paid subscription.
- Examples: 1 (opens in a new tab), 2 (opens in a new tab)
Some other options are described in these articles:
- https://darekkay.com/blog/static-site-comments/ (opens in a new tab)
- https://getshifter.io/static-site-comments/ (opens in a new tab)
The blog is a part of our website. The website source code can be found here (opens in a new tab).
Broken Internal Link Detector
Completed
This project has been completed and the initial version is available here (opens in a new tab). Please don't hesitate to reach out if you have any comments or suggestions!
We have a statically-generated Next.js website6. Sometimes, we make typos in our hyperlinks. We would like to have a tool that can detect broken internal links. This should be a tool that runs at build-time and fails the build if it detects a broken link. The tool should be able to handle links to hashes (e.g. #section
) in addition to links to pages. An initial brainstorm of how this could be implemented is available here (opens in a new tab).
Linux User Manager
Completed
This project has been completed and the source code is available here (opens in a new tab). Please don't hesitate to reach out if you have any comments or suggestions!
At WATcloud, we use Ansible (opens in a new tab) for provisioning machines and users. However, due to the nature of Ansible, there are a lot of inefficiencies in user provisioning. The provisioning pipeline's running time scales linearly with the number of users7. As of 2023, we have accumulated over 300 users in the cluster. This results in a single provisioning step that takes over 15 minutes. We would like to have a tool that can manage users on a machine, and that can be used in place of Ansible for user provisioning. This tool should be able to accept the following arguments:
- Managed UID range: the range of UIDs that the tool has control over
- Managed GID range: the range of GIDs that the tool has control over
- User list (username, UID, password, SSH keys): a list of users that the tool should manage.
- Group list (groupname, GID, members): a list of groups that the tool should manage.
Azure Cost Estimator
This project is no longer needed because Azure has migrated our nonprofit subscription to a new model that includes cost management features.
We use an Azure nonprofit subscription for several projects, with an annual credit limit. Tracking our usage effectively is challenging due to these limitations in Azure:
- The Azure portal only displays current usage without projections or historical trends.
- Access to the Azure sponsorship portal is restricted to a single user.
To better manage our resources, we need a tool that provides detailed insights into our Azure credit usage. The ideal tool would:
- Display current usage and remaining credits.
- Chart historical usage trends.
- Project future usage based on past data.
- Provide detailed breakdowns by resource for all the above metrics.
We are considering using CAnalyzer (opens in a new tab), but are open to any other suggestions.
If you don't have access to an Azure subscription, we can give you read-only access to our Azure portal.
Please fill out the onboarding-form (make sure to enable the Azure
section)
and let us know (opens in a new tab).
Footnotes
-
Our compute infrastructure is errily similar to the dev farm used by the Tesla Autopilot team 😱. ↩
-
A project is trivially maintainable if it can be maintained by someone who has never seen the project before, and who has no prior knowledge of the project's internals beyond a high-level overview of its purpose. Most of the time, this involves building something that we can take down and rebuild from scratch by running a single command. ↩
-
We are assuming that there's perfect DNS round-robin or random selection. ↩
-
It's slightly less reliable than Cloudflare's Zero-Downtime Failover, since there's a delay between when the host goes down and when the DNS record is removed. However, this is still much better than the current situation, where the DNS record of a downed host is never removed. ↩
-
The current website is a statically-generated Next.js site, hosted on GitHub Pages. We would like to keep the infrastructure requirements similar to the current website. ↩
-
The source code of the website is accessible at https://github.com/WATonomous/watcloud-website (opens in a new tab) ↩
-
Ansible issues a separate command for each action for each user. Even with the pipelining (opens in a new tab) feature, the provisioning pipeline is unbearably slow. ↩