Observability

Observability is the ability to understand the internal state of a system based on its external outputs. The term originated in control theory and has since been adopted by the software industry to describe the same concept in the context of software systems [1].

At WATcloud, we have a number of tools to help us understand the state of our systems, detect issues, and optimize processes.

Healthchecks

Healthchecks are periodic checks that verify that a service is running as expected. Each healthcheck typically outputs a simple up or down status. This section describes the healthcheck infrastructure at WATcloud.
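As an illustration of the up/down idea, here is a minimal sketch of a healthcheck in Python. It does not describe how any WATcloud check is implemented; `run_check` and `tcp_probe` are hypothetical helper names invented for this example.

```python
import socket
from typing import Callable


def run_check(probe: Callable[[], bool]) -> str:
    """Run a probe and collapse its result into a simple up/down status."""
    try:
        return "up" if probe() else "down"
    except Exception:
        # A probe that cannot complete is treated the same as a failed probe.
        return "down"


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> Callable[[], bool]:
    """Build a probe that succeeds if a TCP connection can be opened (hypothetical helper)."""
    def probe() -> bool:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    return probe
```

A scheduler (cron, a systemd timer, or a Kubernetes CronJob, for example) would call something like `run_check(tcp_probe("localhost", 22))` at a fixed interval and report the result.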

Status Page

The status page is a collection of healthchecks from various sources. The goal of the status page is to provide a single source of truth for the status of all our services. We regularly use the status page as a troubleshooting tool to quickly identify the source of an issue.

Healthchecks.io

Healthchecks.io is a dead man’s switch (DMS) service that accepts periodic pings from services. When a service fails to ping Healthchecks.io within a specified time frame, Healthchecks.io marks that service as down. It can be configured to send alerts to various channels. Currently, we receive alerts on Discord.
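The ping side of a dead man’s switch can be sketched as follows. This assumes the hosted Healthchecks.io ping endpoint at `hc-ping.com` with UUID-based ping URLs; the helper names `ping_url` and `send_ping` are made up for this example, and a self-hosted instance would use a different base URL.

```python
import urllib.request

PING_BASE = "https://hc-ping.com"  # assumption: the hosted Healthchecks.io ping endpoint


def ping_url(check_uuid: str, signal: str = "success") -> str:
    """Build the ping URL for a check; signal is 'start', 'success', or 'fail'."""
    suffixes = {"success": "", "start": "/start", "fail": "/fail"}
    if signal not in suffixes:
        raise ValueError(f"unknown signal: {signal!r}")
    return f"{PING_BASE}/{check_uuid}{suffixes[signal]}"


def send_ping(check_uuid: str, signal: str = "success", timeout: float = 10.0) -> bool:
    """Send a ping; return False rather than raising so a monitoring outage never breaks the job."""
    try:
        with urllib.request.urlopen(ping_url(check_uuid, signal), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are subclasses of OSError.
        return False
```

A job would typically call `send_ping(uuid)` as its last step. If the job crashes or never runs, no ping arrives and the check is eventually marked as down.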

Sentry Crons

Similar to Healthchecks.io, Sentry Crons is a DMS service. We receive alerts on Discord when a service fails to ping Sentry Crons within a specified time frame.

Alertmanager

Alertmanager is a component of the Prometheus monitoring system. It receives the alerts that Prometheus fires when its alerting rules match the collected metrics, then groups them and routes them to various channels. Currently, we receive alerts on Discord.

Metrics

Metrics are quantitative measurements of a system’s status and performance. This section describes the metrics infrastructure at WATcloud.

Prometheus

Prometheus is an open-source monitoring and alerting toolkit. It collects metrics from various sources and stores them in a time-series database. We use Prometheus to monitor the health of our systems and to set up alerts (via Alertmanager) for potential issues.

Logs

Logs are records of events that happen in a system. Examples of logs include Linux system logs and Kubernetes container logs. This section describes the logging infrastructure at WATcloud.

Elastic Cloud

Elastic Cloud is a managed Elasticsearch service. It is used to store logs from various sources, including Kubernetes clusters and Linux servers.
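Logs destined for a search backend such as Elasticsearch are often emitted as structured JSON so that fields can be indexed and queried. The following is a minimal sketch using Python's standard `logging` module. It is not a description of WATcloud's log pipeline; the field names and the helper `make_logger` are assumptions chosen for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line (field names are illustrative)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def make_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger
```

A separate log shipper would then be responsible for forwarding these lines to the log store.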

Error Tracking

Error tracking is the practice of recording and monitoring errors that occur in a system. This section describes the error tracking infrastructure at WATcloud.

Sentry

Sentry is an open-source error tracking tool. It captures and aggregates errors from various sources, including web applications and backend services. Use cases for Sentry at WATcloud include error monitoring for websites, APIs, and CI pipelines.

Tracing

Tracing is the practice of recording the life cycle of a unit of work, such as a request or a job, as it moves through a system. An example of tracing is recording the different stages of a CI pipeline (e.g. start/finish times of each job stage, stages passed/failed).

Currently, WATcloud does not have a tracing system in place. You can follow this internal issue for updates on the status of our tracing infrastructure.
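Since no tracing system is in place, the following is purely illustrative. It sketches the CI pipeline example above in Python by recording the start time, finish time, and pass/fail status of each stage. The `Span` and `Trace` classes are hypothetical and are not tied to any particular tracing product.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Span:
    """One recorded stage: its name, start/finish times, and outcome."""
    name: str
    start: float
    end: Optional[float] = None
    status: str = "running"

    @property
    def duration(self) -> Optional[float]:
        return None if self.end is None else self.end - self.start


@dataclass
class Trace:
    """An ordered collection of spans for one pipeline run."""
    spans: List[Span] = field(default_factory=list)

    @contextmanager
    def stage(self, name: str) -> Iterator[Span]:
        span = Span(name=name, start=time.monotonic())
        self.spans.append(span)
        try:
            yield span
            span.status = "passed"
        except Exception:
            span.status = "failed"
            raise  # the failure still propagates to the caller
        finally:
            span.end = time.monotonic()
```

Each pipeline stage would be wrapped in `with trace.stage("build"): ...`, and the resulting `trace.spans` list holds the timing and status of every stage in order.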

Footnotes

  1. See this Wikipedia article for more information.