Monitoring of Brmbl.io

Brmbl.io Service Level Availability

Service Level Availability (SLA) is the percentage of time during which the platform is in an available state. Other states are degraded and outage.

Each user-facing service has two Service Level Indicators (SLIs): the Apdex score and the error rate. The Apdex score measures service performance (latency). The error rate is the percentage of requests that fail due to an error (usually a 5XX status code).
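
As a rough illustration of the two SLIs, the sketch below computes both from a batch of request samples. It uses the standard Apdex formula (satisfied requests plus half of the tolerating requests, divided by the total); the `Request` type, the threshold values, and the handling of an empty batch are assumptions made for the example, not the platform's actual configuration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_s: float  # observed latency in seconds
    status: int       # HTTP status code


def apdex(requests, satisfied_s, tolerated_s):
    """Standard Apdex: (satisfied + tolerating / 2) / total."""
    if not requests:
        return 1.0  # assumption: no traffic counts as fully satisfied
    satisfied = sum(1 for r in requests if r.latency_s <= satisfied_s)
    tolerating = sum(
        1 for r in requests if satisfied_s < r.latency_s <= tolerated_s
    )
    return (satisfied + tolerating / 2) / len(requests)


def error_rate(requests):
    """Fraction of requests that returned a 5XX status code."""
    if not requests:
        return 0.0  # assumption: no traffic means no errors
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    return errors / len(requests)


# Hypothetical samples and thresholds (0.5 s satisfied, 2.0 s tolerated).
sample = [
    Request(0.1, 200),
    Request(0.2, 200),
    Request(1.5, 200),
    Request(5.0, 500),
]
print(apdex(sample, satisfied_s=0.5, tolerated_s=2.0))
print(error_rate(sample))
```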

A service is considered available when:

  1. The Apdex score of the service is above its Service Level Objective (SLO),
  2. AND The error rate is below its Service Level Objective (SLO).

For example, a web service is available if, within a 1-minute window:

  • At least 95% of requests have a latency within their “satisfactory” threshold
  • AND, less than 0.5% of requests return a 5XX error status response.

A service is unavailable if, within a 1-minute window:

  • Less than 95% of requests have a latency within their “satisfactory” threshold
  • OR, more than 0.5% of requests return a 5XX error status response.

In other words, a service needs to simultaneously meet both of its SLO targets in order to be considered available. If either target is not met, the service is considered unavailable.
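
The rule can be sketched as a single boolean check. The function name and default SLO values below are illustrative assumptions taken from the web service example (95% Apdex, 0.5% error rate); treating a value exactly on the threshold as meeting the SLO is also an assumption.

```python
def is_available(apdex_score, error_rate, apdex_slo=0.95, error_slo=0.005):
    """Return True only when BOTH SLO targets are met for the window.

    Assumption: a value exactly equal to its SLO counts as meeting it.
    """
    return apdex_score >= apdex_slo and error_rate <= error_slo


# Both targets met: available.
print(is_available(apdex_score=0.97, error_rate=0.001))
# Latency is fine but too many 5XX responses: unavailable.
print(is_available(apdex_score=0.97, error_rate=0.02))
# Error rate is fine but latency is degraded: unavailable.
print(is_available(apdex_score=0.90, error_rate=0.001))
```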

The availability score for a service is then calculated as the percentage of time that it is available. The availability scores of all services combined define the platform Service Level Availability (SLA). The SLA number indicates the availability of Brmbl.io over a selected period of time.

For example, if a service becomes unavailable for a 10-minute period, its availability score will be:

  • 99.90% for the week (10 070 minutes of availability out of 10 080 minutes in a week)
  • 99.98% for the month (43 190 minutes of availability out of 43 200 minutes in a 30-day month)
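
The arithmetic behind these figures can be checked with a short sketch. The helper name is hypothetical, and the month is assumed to be 30 days long.

```python
MINUTES_PER_WEEK = 7 * 24 * 60    # 10 080
MINUTES_PER_MONTH = 30 * 24 * 60  # 43 200, assuming a 30-day month


def availability_pct(unavailable_minutes, window_minutes):
    """Percentage of the window during which the service was available."""
    available = window_minutes - unavailable_minutes
    return 100 * available / window_minutes


# A single 10-minute period of unavailability, rounded to two decimals.
print(round(availability_pct(10, MINUTES_PER_WEEK), 2))
print(round(availability_pct(10, MINUTES_PER_MONTH), 2))
```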

Finally, the availability metric for Brmbl.io is calculated as a weighted average availability over the following services (weights in brackets):

  1. web (5)
  2. api (5)
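
A minimal sketch of the weighted average follows, using the weights listed above. The function name and the per-service scores are made up for illustration.

```python
# Weights from the list above.
WEIGHTS = {"web": 5, "api": 5}


def weighted_availability(scores, weights):
    """Weighted average of per-service availability percentages."""
    total_weight = sum(weights[name] for name in scores)
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Hypothetical scores: web was unavailable briefly, api was not.
scores = {"web": 99.90, "api": 100.0}
print(round(weighted_availability(scores, WEIGHTS), 2))
```

With equal weights the result is simply the mean of the two scores; unequal weights would pull the platform figure toward the more heavily weighted service.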

The SLA score can be seen on the [SLA dashboard], and the SLA target is set as an [Infrastructure key performance indicator][kpi].

More details on the definitions of outage and degradation are on the incident-management page.

Monitoring

StatusCake Statistics

For a quick view of the availability and performance history of Brmbl.io, we use StatusCake.

AppSignal Dashboard

We use AppSignal as our main APM solution, which enables exception tracking, traceability, root cause analysis, performance monitoring, alerting, and secondary uptime monitoring.

Monitoring Dashboards

We also collect data using Prometheus, leveraging available exporters such as the k8s agents, the node exporter, and the PostgreSQL exporter, and we build whatever else is necessary. The data is visualized in graphs and dashboards built with Grafana.

  • A developer AWS IAM account is required for access
  • Highly Available setup
  • Some alerting feeds from this setup
  • Separated from the public for security and availability reasons

Adding Dashboards

To learn how to set up a new graph or dashboard using Grafana, take a look at the following resources:

Grafana dashboards can be added to our monitoring stack via our infrastructure definition (IaC):

Need access to add a dashboard? Ask the team lead from our infrastructure team.

Logs

Network, system, and application logs are processed, stored, and searched in AWS CloudWatch.

AWS GuardDuty is also used to monitor application activity, system and network authentication events, security events, etc.

Our runbook outlines how we log our infrastructure.

Instrumenting Elixir to Monitor Performance

Elixir applications can be instrumented to measure performance with the Telemetry library.

References