Monitoring of Brmbl.io
Service Level Availability (SLA) is the percentage of time during which the platform is in an available state. Other states are degraded and outage.
Each user-facing service has two Service Level Indicators (SLIs): the Apdex score and the error rate.
The Apdex score is generally a measure of the service performance (latency).
The error rate measures the percentage of requests that fail due to an error (usually a 5XX status code).
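As a rough illustration of the two SLIs, the sketch below computes both from a batch of request samples. The `Request` type, the function names, and the 1-second threshold are hypothetical and not taken from our tooling. It also assumes the simplified reading of Apdex used in the examples on this page (the share of requests within the "satisfactory" latency threshold); the classic Apdex formula additionally gives half credit to "tolerating" requests.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_s: float  # observed latency in seconds
    status: int       # HTTP status code


def satisfied_ratio(requests, satisfactory_threshold_s):
    """Share of requests whose latency is within the satisfactory threshold."""
    if not requests:
        return 1.0  # assumption: an empty window counts as fully satisfied
    satisfied = sum(1 for r in requests if r.latency_s <= satisfactory_threshold_s)
    return satisfied / len(requests)


def error_rate(requests):
    """Share of requests that returned a 5XX status code."""
    if not requests:
        return 0.0  # assumption: an empty window has no errors
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    return errors / len(requests)


sample = [
    Request(0.2, 200),
    Request(0.3, 200),
    Request(1.5, 200),  # too slow for a 1-second threshold
    Request(0.1, 500),  # fast, but a server error
]
print(satisfied_ratio(sample, satisfactory_threshold_s=1.0))
print(error_rate(sample))
```

Note that the two indicators are independent: a request can be fast and still count as an error, as the last sample shows.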
A service is considered available when:
- The Apdex score of the service is above its Service Level Objective (SLO),
- AND The error rate is below its Service Level Objective (SLO).
An example of an available web service, within a 1-minute window:
- At least 95% of requests have a latency within their “satisfactory” threshold
- AND, less than 0.5% of requests return a 5XX error status response.
A service is unavailable if, within a 1-minute window:
- Less than 95% of requests have a latency within their “satisfactory” threshold
- OR, more than 0.5% of requests return a 5XX error status response.
In other words, a service needs to simultaneously meet both of its SLO targets in order to be considered available. If either target is not met, the service is considered unavailable.
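The rule above can be sketched as a single predicate evaluated once per 1-minute window. The function name and defaults are hypothetical; the default thresholds are the 95% and 0.5% figures from the web service example, and real SLO values differ per service. How a window that sits exactly on the error-rate threshold is classified is an assumption here.

```python
def is_available(apdex, error_rate, apdex_slo=0.95, error_rate_slo=0.005):
    """Return True only when both SLO targets are met in the same window.

    apdex and error_rate are fractions between 0 and 1.
    Assumption: the Apdex target is met at "at least" the SLO, while the
    error rate must be strictly below its SLO.
    """
    return apdex >= apdex_slo and error_rate < error_rate_slo


print(is_available(apdex=0.97, error_rate=0.001))  # both targets met
print(is_available(apdex=0.93, error_rate=0.001))  # latency target missed
print(is_available(apdex=0.97, error_rate=0.010))  # error target missed
```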
The availability score for a service is then calculated as the percentage of time that it is available. The availability scores of all services, combined, define the platform Service Level Availability (SLA). The SLA number indicates the availability of Brmbl.io over a selected period of time.
For example, if a service becomes unavailable for a 10-minute period, the availability score will be:
- 99.90% for the week (10 070 minutes of availability out of 10 080 minutes in a week)
- 99.98% for the month (43 190 minutes of availability out of 43 200 minutes in a 30-day month)
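The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the function and constant names are hypothetical, and the month is assumed to have 30 days.

```python
MINUTES_PER_WEEK = 7 * 24 * 60    # 10 080
MINUTES_PER_MONTH = 30 * 24 * 60  # 43 200, assuming a 30-day month


def availability_pct(unavailable_minutes, period_minutes):
    """Percentage of the period during which the service was available."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return 100.0 * (period_minutes - unavailable_minutes) / period_minutes


weekly = availability_pct(10, MINUTES_PER_WEEK)    # about 99.901
monthly = availability_pct(10, MINUTES_PER_MONTH)  # about 99.977
print(f"week: {weekly:.3f}%  month: {monthly:.3f}%")
```

The same 10-minute outage costs less on the monthly score simply because the denominator is larger.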
Finally, the availability metric for Brmbl.io is calculated as a weighted average availability over the following services (weights in brackets):
The SLA score can be seen on the [SLA dashboard], and the SLA target is set as an [Infrastructure key performance indicator][kpi].
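The weighted average used for the platform-level score can be sketched as follows. The service names, scores, and weights in the example are purely hypothetical placeholders and do not reflect the actual services or weights used for Brmbl.io.

```python
def weighted_availability(scores, weights):
    """Weighted average of per-service availability scores (in percent).

    scores  maps service name -> availability percentage
    weights maps service name -> relative weight
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical services and weights, for illustration only.
example_scores = {"web": 99.90, "api": 99.50, "jobs": 100.0}
example_weights = {"web": 5, "api": 3, "jobs": 2}
platform_sla = weighted_availability(example_scores, example_weights)
print(f"{platform_sla:.2f}%")
```

A heavier weight means an outage in that service pulls the platform score down further than the same outage in a lightly weighted service.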
More details on the definitions of outage and degradation are on the incident-management page.
For a quick view of the availability and performance history of Brmbl.io, we use StatusCake.
We use AppSignal as our main APM solution, which enables exception tracking, traceability, root cause analysis, performance monitoring, alerting, and secondary uptime monitoring.
We also collect data using Prometheus, leveraging available exporters such as the k8s agents and the node and PostgreSQL exporters, and we build whatever else is necessary. The data is visualized in graphs and dashboards built with Grafana.
- A developer AWS IAM account is required for access
- Highly Available setup
- Some alerting feeds from this setup
- Separated from the public for security and availability reasons
To learn how to set up a new graph or dashboard using Grafana, take a look at the following resources:
Grafana dashboards can be added to our monitoring stack via our infrastructure definition (IaC):
- [Canned Grafana Dashboards](https://grafana.com/grafana/dashboards)
- Helm charts in Pulumi
- Adding a Grafana dashboard in Pulumi
Need access to add a dashboard? Ask the team lead from our infrastructure team.
Network, system, and application logs are processed, stored, and searched in AWS CloudWatch.
AWS GuardDuty is also used to monitor application activity, system and network authentication events, security events, etc.
Our runbook outlines how we log our infrastructure.
Elixir can be instrumented to measure performance with the