Observability
Observability is the ability to understand the internal state of a system based on its external outputs. The term originated in control theory and has since been adopted by the software industry to describe the same concept in the context of software systems1.
At WATcloud, we have a number of tools to help us understand the state of our systems, detect issues, and optimize processes.
Healthchecks
Healthchecks are periodic checks that verify that a service is running as expected.
Each healthcheck typically outputs a simple up
or down
status.
This section describes the healthcheck infrastructure at WATcloud.
Quick Links
- Status Page (opens in a new tab)
- Healthchecks.io (opens in a new tab)
- Sentry Crons (opens in a new tab)
- Alertmanager (opens in a new tab)
Status Page
The status page (opens in a new tab) is a collection of healthchecks from various sources. The goal of the status page is to provide a single source of truth for the status of all our services. We regularly use the status page as a troubleshooting tool to quickly identify the source of an issue.
Healthchecks.io
Healthchecks.io (opens in a new tab) is a dead man's switch (DMS) (opens in a new tab) service that accepts periodic pings from services. When a service fails to ping the service within a specified time frame, Healthchecks.io marks the service as down. It can be configured to send alerts to various channels. Currently, we receive alerts on Discord.
Sentry Crons
Similar to Healthchecks.io, Sentry Crons (opens in a new tab) is a DMS service. We receive alerts on Discord when a service fails to ping Sentry Crons within a specified time frame.
Alertmanager
Alertmanager (opens in a new tab) is a component of the Prometheus monitoring system. It uses metrics collected by Prometheus to send alerts to various channels. Currently, we receive alerts on Discord.
Metrics
Metrics are quantitative measurements of a system's status and performance. This section describes the metrics infrastructure at WATcloud.
Quick Links
Prometheus
Prometheus (opens in a new tab) is an open-source monitoring and alerting toolkit. It collects metrics from various sources and stores them in a time-series database. We use Prometheus to monitor the health of our systems and to set up alerts (via Alertmanager) for potential issues.
Logs
Logs are records of events that happen in a system. Examples of logs include Linux system logs and Kubernetes container logs. This section describes the logging infrastructure at WATcloud.
Quick Links
Elastic Cloud
Elastic Cloud (opens in a new tab) is a managed Elasticsearch service. It is used to store logs from various sources, including Kubernetes clusters and Linux servers.
Error Tracking
Error tracking is the practice of recording and monitoring errors that occur in a system. This section describes the error tracking infrastructure at WATcloud.
Quick Links
Sentry
Sentry (opens in a new tab) is an open-source error tracking tool. It captures and aggregates errors from various sources, including web applications and backend services. Use cases for Sentry at WATcloud include error monitoring for websites, APIs, and CI pipelines.
Tracing
Tracing is the practice of recording the life cycle of an object. An example of tracing is recording the different stages of a CI pipeline (e.g. start/finish times of each job stage, stages passed/failed).
Currently, WATcloud does not have a tracing system in place. You can following along this internal issue (opens in a new tab) for updates on the status of our tracing infrastructure.
Footnotes
-
See this Wikipedia article (opens in a new tab) for more information. ↩