Incident Management

If you’re a Bramble team member and are looking to alert Reliability Engineering about an availability issue with Brmbl.io, please find quick instructions to report an incident here: Reporting an Incident.

If you’re a Bramble team member looking for who is currently the Engineer On Call (EOC), please see the Who is the Current EOC? section.

If you’re a Bramble team member looking for the status of a recent incident, please see the incident board. For detailed information about incident status changes, please see the Incident Workflow section.


SOC 2 Criteria: CC2.2, CC2.3, CC4.2, CC5.1, CC7.3, CC7.5, CC9.1

ISO 27001 Annex A: A.16

Keywords: Impact, Security impact level, Report template, Incident

Purpose

This security incident response policy is intended to establish controls to ensure detection of security vulnerabilities and incidents, as well as quick reaction and response to security breaches. 

This document also provides implementing instructions for security incident response, to include definitions, procedures, responsibilities, and performance measures (metrics and reporting mechanisms).

Scope

A key objective of Bramble’s Information Security Program is to focus on detecting information security weaknesses and vulnerabilities so that incidents and breaches can be prevented wherever possible. Bramble is committed to protecting its employees, customers, and partners from illegal or damaging actions taken by others, either knowingly or unknowingly. Despite this, incidents and data breaches are likely to happen; when they do, Bramble is committed to rapidly responding to them, which may include identifying, containing, investigating, resolving, and communicating information related to the breach.

This policy requires that all users report any perceived or actual information security vulnerability or incident as soon as possible using the contact mechanisms prescribed in this document. In addition, Bramble must employ automated scanning and reporting mechanisms that can be used to identify possible information security vulnerabilities and incidents. If a vulnerability is identified, it must be resolved within a set period of time based on its severity. If an incident is identified, it must be investigated within a set period of time based on its severity. If an incident is confirmed as a breach, a set procedure must be followed to contain, investigate, resolve, and communicate information to employees, customers, partners and other stakeholders.

Within this document, the following definitions apply:

  • Information Security Vulnerability:

A vulnerability in an information system, information system security procedures, or administrative controls that could be exploited to gain unauthorized access to information or to disrupt critical processing.

  • Information Security Incident:

A suspected, attempted, successful, or imminent threat of unauthorized access, use, disclosure, breach, modification, or destruction of information; interference with information technology operations; or significant violation of information security policy.

Roles And Responsibilities

Bramble’s Security Officer is responsible for updating, reviewing, and maintaining this policy.

Policy

  • All users must report any system vulnerability, incident, or event pointing to a possible incident to the Security Officer as quickly as possible, but no later than 24 hours after discovery.
  • Incidents must be reported by sending an email message with details of the incident.
  • Users must be trained on the procedures for reporting information security incidents or discovered vulnerabilities, and on their responsibility to report such incidents. Failure to report information security incidents shall be considered a security violation and will be reported to the Human Resources (HR) Manager for disciplinary action.
  • Information and artifacts associated with security incidents (including but not limited to files, logs, and screen captures) must be preserved in the event that they need to be used as evidence of a crime.
  • All information security incidents must be responded to through the incident management procedures defined below.

Incident Response

Incidents are anomalous conditions that result in—or may lead to—service degradation or outages. These events require human intervention to avert disruptions or restore service to operational status. Incidents are always given immediate attention.

The goal of incident management is to organize chaos into swift incident resolution. To that end, incident management provides:

  1. well-defined roles and responsibilities and workflow for members of the incident team,
  2. control points to manage the flow of information and the resolution path,
  3. a root-cause analysis (RCA),
  4. and a post-incident review where lessons and techniques are extracted and shared.

When an incident starts, the automation sends a message in the #incident-management channel containing a link to the per-incident Slack channel for chat-based communication, the incident issue for permanent records, and the Situation Room Zoom link (also in all incident channel descriptions) for incident team members to join for synchronous verbal and screen-sharing communication.

Ownership

There is only ever one owner of an incident, and only the owner of the incident can declare it resolved. At any time the incident owner can engage the next role in the hierarchy for support. Except when Brmbl.io is not functioning correctly, the incident issue should be assigned to the current owner.

Roles and Responsibilities

It’s important to clearly delineate responsibilities during an incident. Quick resolution requires focus and a clear hierarchy for delegation of tasks. Preventing overlaps and ensuring a proper order of operations is vital to mitigation.

EOC - Engineer On Call
The EOC is usually the first person alerted; expectations for the role are in the Handbook for on-call, and the checklist for the EOC is in our runbooks. If another party has declared an incident, once the EOC is engaged the EOC owns the incident. The EOC can escalate a page in PagerDuty to engage the IMOC and CMOC.
Who: The Reliability Team Engineer On Call is generally an SRE and can declare an incident. They are part of the “SRE 8 Hour” on-call shift in PagerDuty.

IMOC - Incident Manager On Call
The IMOC is engaged when incident resolution requires coordination from multiple parties. The IMOC is the tactical leader of the incident response team, not a person performing technical work. The IMOC assembles the incident team by engaging individuals with the skills and information required to resolve the incident.
Who: The Incident Manager is an Engineering Manager, Staff Engineer, or Director from the Reliability team. The IMOC rotation is currently in the “SRE Managers” PagerDuty schedule.

CMOC - Communications Manager On Call
The CMOC disseminates information internally to stakeholders and externally to customers across multiple media (e.g. Bramble issues, Twitter, status.brmbl.io, etc.).
Who: The Communications Manager is generally a member of the support team at Bramble. Notifications to the Incident Management - CMOC service in PagerDuty will go to the rotations set up for CMOC.

These definitions imply several on-call rotations for the different roles.

Shared Incident Responsibilities

Incident Status Updates - EOC/IMOC

  1. During an active incident, the EOC is initially responsible for posting regular status updates in the Current Status section of the incident issue description. These updates should summarize the current customer impact of the incident and actions we are taking to mitigate the incident.
    1. These updates should occur at regular intervals based on the severity of the incident. Refer to Frequency of Updates for frequency guidelines.
    2. These status updates are used to:
      1. Help construct a detailed incident timeline to be used in Root Cause Analysis.
      2. Ensure CMOC has up to date and accurate information to communicate to customers, executives and other stakeholders.
      3. Ensure others in the company can track the state of the incident and the impact it is having on customers.
  2. Once an IMOC has been engaged in the incident, these responsibilities shift to the IMOC.

Incident Timeline Updates - EOC/IMOC

  1. During an active incident, the EOC is initially responsible for ensuring that actions and events relevant to the issue and its resolution are captured in the timeline. These timeline updates should be captured in the Timeline section of the incident issue description, but can be captured in a comment thread, if rapid capture of events is needed. If capturing these events in comments on the incident issue, utilize the same format as the Timeline section of the incident issue.
  2. Once an IMOC has been engaged in the incident, these responsibilities are shared with the IMOC, with the IMOC taking the initiative to capture timeline items so that the EOC is free to work on mitigation. The EOC should therefore update the IMOC in the incident call with items relevant to the timeline.

Engineer on Call (EOC) Responsibilities

  1. As EOC, your highest priority for the duration of your shift is the stability of Brmbl.io.
  2. The SSOT for who is the current EOC is the Bramble Production service definition in PagerDuty.
  3. Alerts that are routed to PagerDuty need to be acknowledged within 15 minutes, otherwise they will be escalated to the on-call IMOC.
    1. Alert-manager alerts in Slack #alerts and #alerts-general are an important source of information about the health of the environment and should be monitored during working hours.
    2. If the PagerDuty alert noise is too high, your task as an EOC is clearing out that noise by either fixing the system or changing the alert.
    3. If you are changing the alert, it is your responsibility to explain the reasons behind it and inform the next EOC that the change occurred.
    4. Each event (may be multiple related pages) should result in an issue in the production tracker. See production queue usage for more details.
  4. If sources outside of our alerting are reporting a problem, and you have not received any alerts, it is still your responsibility to investigate. Declare a low severity incident and investigate from there.
    1. Low severity (S3/S4) incidents (and issues) are cheap, and will allow others a means to communicate their experience if they are also experiencing the issue.
    2. “No alerts” is not the same as “no problem”
  5. Brmbl.io is a somewhat complex system. It is ok to not fully understand the underlying issue or its causes. However, if this is the case, as EOC you should engage with the IMOC to find a team member with the appropriate expertise.
    1. Requesting assistance does not mean relinquishing EOC responsibility. The EOC is still responsible for the incident.
    2. The Bramble Organizational Chart (WIP) and the Bramble Team Page, which lists areas of expertise for team members, are important tools for finding the right people.
  6. As soon as an S1/S2 incident is declared, join The Situation Room Permanent Zoom. The Zoom link is in the #incident-management topic.
    1. Bramble works in an asynchronous manner, but incidents require a synchronous response. Our collective goal is high availability of 99.9% and beyond, which means that the timescales over which communication needs to occur during an incident are measured in seconds and minutes, not hours.
  7. Keep in mind that a Brmbl.io incident is not an “infrastructure problem”. It is a company-wide issue, and as EOC, you are leading the response on behalf of the company.
    1. If you need information or assistance, engage with Engineering teams. If you do not get the response you require within a reasonable period, escalate through the IMOC.
    2. As EOC, require that those who may be able to assist join the Zoom call, and ask them to post their findings in the incident issue or the active incident Google doc. Debugging information kept only in Slack will be lost, so this should be strongly discouraged.
  8. By acknowledging an incident in PagerDuty, the EOC is implying that they are working on it. To further reinforce this acknowledgement, post a note in Slack that you are joining The Situation Room Permanent Zoom as soon as possible.
    1. If the EOC believes the alert is incorrect, comment on the thread in #production. If the alert is flappy, create an issue and post a link in the thread. This issue might end up being a part of RCA or end up requiring a change in the alert rule.
  9. Be inquisitive. Be vigilant. If you notice that something doesn’t seem right, investigate further.
  10. After the incident is resolved, the EOC should review the comments and ensure that the corrective actions are added to the issue description, regardless of the incident severity. If it has a ~review-requested label, the EOC should start the incident review; in some cases this may be a synchronous review meeting or an asynchronous review, depending on what is requested by those involved with the incident.

Guidelines on Security Incidents

At times, we have a security incident where we may need to take actions to block a certain URL path or part of the application. This list is meant to help the Security Engineer On-Call and EOC decide when to engage help and post to status.brmbl.io.

If any of the following are true, it would be best to engage an Incident Manager:

  1. There is an S1/P1 report or security incident.
  2. An entire path or part of the functionality of the Brmbl.io application must be blocked.
  3. There has been unauthorized access to a Brmbl.io production system.

In some cases, we may choose not to post to status.brmbl.io. The following are examples where we may skip a post/tweet; in some cases, this helps protect the security of managed instances until we have released the security update.

  1. If a partial block of a URL is possible, for example to exclude problematic strings in a path.
  2. If there is no usage of the URL in the last week based on searches in our logs for Brmbl.io.

Incident Manager on Call (IMOC) Responsibilities

  1. When the IMOC is engaged on an incident, they are responsible for keeping the Current Status section of the incident issue regularly updated.
  2. The SSOT for who is the current IMOC is the Bramble Production - IMOC service definition in PagerDuty.
  3. The IMOC should monitor ongoing incidents and engage with the incident if it escalates to a user-impacting (S1 or S2) incident.
  4. The IMOC should engage if requested by the EOC. The IMOC incident checklist is in the runbooks.
  5. For non-critical issues, or critical (S1, S2) issues with a short duration, the IMOC may also take on the role of CMOC.
    • Due to limited people on the IMOC rotation, there may be times of the day when the CMOC (if available; see How to engage the CMOC) is a more friendly choice.
  6. The IMOC should ensure that the appropriate team members from other teams engage within an appropriate amount of time.
    1. During a user-impacting incident, the appropriate response time is measured in minutes.
    2. If the IMOC is unable to obtain a response through Slack channels, they should escalate to a manager or director to obtain assistance.
  7. The IMOC evaluates information provided by team members, lends technical direction, and coordinates troubleshooting efforts.
  8. If applicable, coordinate the incident response with business contingency activities.
  9. In the event of a Severity 1 incident that has been running for an hour or more, or that appears likely to be long-running, page Infrastructure leadership via email at severity-1@brmbl.pagerduty.com or via the Bramble Production - Severity 1 Escalation service in PagerDuty (app or website) with a link to the incident.
  10. After the incident is resolved, the IMOC is responsible for conducting the post-incident review.
  11. For high severity bugs that affect customers, the IMOC is responsible for making sure Incident Reviews are coordinated with other teams in Engineering and go through the complete Incident Review process.

To engage the IMOC: either run /pd trigger in Slack, then select the “Bramble Production - IMOC” service, or create an incident in the PagerDuty page for the service.
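
For illustration only, the same escalation could be triggered programmatically through the PagerDuty Events API v2. This is a minimal sketch, assuming the “Bramble Production - IMOC” service has an Events API v2 integration; the routing key and incident URL are placeholders, not real credentials, and this is not the tooling the handbook prescribes.

import requests

def page_imoc(summary: str, incident_url: str, routing_key: str) -> str:
    """Trigger a PagerDuty alert against the IMOC service and return the dedup key."""
    payload = {
        "routing_key": routing_key,  # Events API v2 integration key (placeholder)
        "event_action": "trigger",
        "payload": {
            "summary": summary,  # e.g. "S1 incident needs an Incident Manager"
            "source": "incident-management-handbook",
            "severity": "critical",
            "custom_details": {"incident_issue": incident_url},
        },
    }
    resp = requests.post("https://events.pagerduty.com/v2/enqueue", json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["dedup_key"]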

Communications Manager on Call (CMOC) Responsibilities

For serious incidents that require coordinated communications across multiple channels, the IMOC will rely on the CMOC for the duration of the incident.

The Bramble support team staffs an on-call rotation via the Incident Management - CMOC service in PagerDuty. They have a section in the support handbook for getting new CMOC team members up to speed.

During an incident, the CMOC will:

  1. Be the voice of Bramble during an incident by updating our end-users, internal parties, and executive-level managers through updates to our status page hosted by Status.io.
  2. Update the status page at regular intervals in accordance with the severity of the incident.
  3. Notify Bramble stakeholders (e-group, customer success, and the community team) of the current incident and reference where to find further information. Provide an additional update when the incident is mitigated.

How to engage the CMOC?

If, during an incident, the EOC or IMOC decides to engage the CMOC, they should do so by paging the on-call person:

  • Directly from PagerDuty, using the Incident Management - CMOC rotation schedule. This can be done by navigating to the Incidents page in PagerDuty and creating a new incident with Incident Management - CMOC selected as the Impacted Service. NOTE: CMOC coverage in many timezones does not include weekends. 24x7 coverage for CMOC is being worked on.

Corrective Actions

Corrective Actions (CAs) are work items that we create as a result of an incident. They are designed to prevent or reduce the likelihood and/or impact of an incident recurrence.

Corrective Actions should be related to the incident issue to help with downstream analysis.

Best practices and examples when creating a Corrective Action issue:

  • They should be SMART: Specific, Measurable, Achievable, Relevant and Time-bounded.
  • Avoid creating CAs that:
    • Are too generic (the most typical mistake, as opposed to Specific).
    • Only fix incident symptoms.
    • Introduce more human error.
    • Will not help to keep the incident from happening again.
  • Examples: (taken from several best-practices Postmortem pages)
    • Badly worded: Fix the issue that caused the outage
      Better (Specific): Handle invalid postal code in user address form input safely
    • Badly worded: Investigate monitoring for this scenario
      Better (Actionable): Add alerting for all cases where this service returns >1% errors
    • Badly worded: Make sure engineer checks that database schema can be parsed before updating
      Better (Bounded): Add automated presubmit check for schema changes

Runbooks

Runbooks are available for engineers on call. The project README contains links to checklists for each of the above roles.

Who is the Current EOC?

The current EOC is available in PagerDuty.

When to Contact the Current EOC

The current EOC can be contacted via the @sre-oncall handle in Slack, but please only use this handle in the following scenarios.

  1. You are conducting a production change via our Change Management process and as a required step need to seek the approval of the EOC.
  2. For all other concerns please see the Getting Assistance section.

The EOC will respond as soon as they can to the usage of the @sre-oncall handle in Slack, but depending on circumstances, may not be immediately available. If it is an emergency and you need an immediate response, please see the Reporting an Incident section.

Reporting an Incident

If you are a Bramble team member and would like to report a possible incident related to Brmbl.io and have the EOC paged in to respond, choose one of the reporting methods below. Regardless of the method chosen, please stay online until the EOC has had a chance to come online and engage with you regarding the incident. Thanks for your help!

Report an Incident via Slack

Notify the #triage-production channel in Bramble’s Slack and link to the open incident issue. If you suspect the issue is an emergency, page the engineer on-call (@eoc), not the incident manager or communications manager. You do not need to decide if the problem is an incident, and should err on the side of paging the engineer on-call if you are not sure. We have triage steps below to make sure we respond appropriately. Reporting high severity bugs via this process is the preferred path so that we can make sure we engage the appropriate engineering teams as needed.

Field Description
Title Give the incident as descriptive a title as you can. Please prepend the title with a date in the format YYYY-MM-DD.
Severity If unsure about the severity to choose, but you are seeing a large amount of customer impact, please select S1. More details here: Incident Severity.
Tasks: page the on-call engineer If you’d like to page the on-call engineer, please check this box. If in doubt, err on the side of paging if there is significant disruption to the site.
Tasks: page on-call managers You can page the incident and/or communications managers on-call.

As well as opening a Bramble incident issue, open a dedicated incident Slack channel. Link to all of these resources in the main #incident-management channel. Please note that unless you’re an SRE, you may not be able to post in #incident-management directly. Please join the dedicated Slack channel, created and linked as a result of the incident declaration, to discuss the incident with the on-call engineer.
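
As an illustration of that announcement step, the sketch below posts the incident links to #incident-management using the Slack Web API. It is not the actual incident automation; the bot token is a placeholder and the sketch assumes a bot with the chat:write scope and access to the channel.

from slack_sdk import WebClient

# Placeholder token; a real bot would need the chat:write scope and channel access.
client = WebClient(token="xoxb-placeholder-token")

def announce_incident(issue_url: str, incident_channel: str, zoom_url: str) -> None:
    """Post the incident issue, dedicated Slack channel, and Situation Room links in one message."""
    client.chat_postMessage(
        channel="#incident-management",
        text=(
            ":rotating_light: New incident declared.\n"
            f"Issue: {issue_url}\n"
            f"Channel: {incident_channel}\n"
            f"Situation Room: {zoom_url}"
        ),
    )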

Report an Incident via Email

Email bramble-production-eoc@brmbl.pagerduty.com. This will immediately page the Engineer On Call.

Definition of Outage vs Degraded vs Disruption

This is a first revision of the definition of Service Disruption (Outage), Partial Service Disruption, and Degraded Performance per the terms on Status.io. Data is based on the graphs from our Key Service Metrics Dashboard.

Outage and Degraded Performance incidents are defined as follows:

  1. Degraded: any sustained 5-minute period where a service is below its documented Apdex SLO or above its documented error ratio SLO.
  2. Outage (Status = Disruption): a sustained 5-minute error rate above the Outage line on the error ratio graph.

SLOs are documented in the runbooks/rules.
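
A minimal sketch of how the Degraded / Disruption definitions above could be evaluated against a 5-minute window of 1-minute samples. The threshold values in the example are illustrative placeholders; the authoritative SLOs live in the runbooks/rules.

from dataclasses import dataclass

@dataclass
class ServiceSLO:
    apdex_slo: float            # degraded when Apdex is below this
    error_ratio_slo: float      # degraded when error ratio is above this
    outage_error_ratio: float   # the "Outage line" on the error ratio graph

def classify(apdex_samples: list[float], error_ratio_samples: list[float], slo: ServiceSLO) -> str:
    """Classify a sustained 5-minute window of 1-minute samples for one service."""
    if all(e > slo.outage_error_ratio for e in error_ratio_samples):
        return "Service Disruption (Outage)"
    if all(a < slo.apdex_slo for a in apdex_samples) or all(
        e > slo.error_ratio_slo for e in error_ratio_samples
    ):
        return "Degraded Performance"
    return "Operational"

# Example with illustrative thresholds:
# classify([0.990] * 5, [0.002] * 5, ServiceSLO(0.995, 0.005, 0.02)) -> "Degraded Performance"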

To check whether Brmbl.io is Degraded or Disrupted, we look at the graphs in the Key Service Metrics Dashboard.

A Partial Service Disruption is when only part of the Brmbl.io services or infrastructure is experiencing an incident. Examples of partial service disruptions are instances where Brmbl.io is operating normally except there are:

  1. delayed CI/CD pending jobs
  2. high severity bugs affecting a particular feature like Reports
  3. abuse or degradation on one instance or node affecting a subset of customers; this would be visible on our k8s service metrics

High Severity Bugs

In the case of high severity bugs, we prefer that an incident issue is still created via Reporting an Incident. This will give us an incident issue on which to track the events and response.

In the case of a high severity bug that is in an ongoing or upcoming deployment, please follow the steps to Block a Deployment.

Security Incidents

If an incident may be security related, engage the Security Engineer on-call by using /security in Slack. More detail can be found in Engaging the Security Engineer On-Call.

Communication

Information is an asset to everyone impacted by an incident. Properly managing the flow of information is critical to minimizing surprise and setting expectations. We aim to keep interested stakeholders apprised of developments in a timely fashion so they can plan appropriately.

This flow is determined by:

  1. the type of information,
  2. its intended audience,
  3. and timing sensitivity.

Furthermore, avoiding information overload is necessary to keep every stakeholder’s focus.

To that end, we will have:

  1. a dedicated Zoom call for all incidents. A link to the Zoom call can be found in the topic for the #incident-management room in Slack.
  2. a Google Doc as needed for multiple user input based on the shared template
  3. a dedicated #incident-management channel for internal updates
  4. regular updates to status.brmbl.io
  5. dissemination to various media (e.g. Twitter)
  6. a dedicated repo for issues related to Production separate from the queue that holds Infrastructure’s workload: namely, issues for incidents and changes.

Status

We manage incident communication using our status site, which updates status.brmbl.io. Incidents have severity and status and are updated by the incident owner.

Definitions and rules for transitioning state and status are as follows.

State Definition
Investigating The incident has just been discovered and there is not yet a clear understanding of the impact or cause. If an incident remains in this state for longer than 30 minutes after the EOC has engaged, the incident should be escalated to the IMOC.
Identified The cause of the incident is believed to have been identified and a step to mitigate has been planned and agreed upon.
Monitoring The step has been executed and metrics are being watched to ensure that we’re operating at a baseline. If there is a clear understanding of the specific mitigation leading to resolution and high confidence in the fact that the impact will not recur it is preferable to skip this state.
Resolved The incident is closed and status is again Operational.

Status can be set independent of state. The only time these must align is when an incident is resolved, at which point the status returns to Operational.

Status Definition
Operational The default status before an incident is opened and after an incident has been resolved. All systems are operating normally.
Degraded Performance Users are impacted intermittently, but the impact is not observed in metrics, nor reported, to be widespread or systemic.
Partial Service Disruption Users are impacted at a rate that violates our SLO. The IMOC must be engaged, and monitoring to resolution is required if the disruption lasts longer than 30 minutes.
Service Disruption This is an outage. The IMOC must be engaged.
Security Issue A security vulnerability has been declared public and the security team has asked to publish it.

Severities

Incident Severity

Incident severities encapsulate the impact of an incident and scope the resources allocated to handle it. Detailed definitions are provided for each severity, and these definitions are reevaluated as new circumstances become known. Incident management uses our standardized severity definitions, which can be found under availability severities.

Alert Severities

  1. Alert severities do not necessarily determine incident severities. A single incident can trigger a number of alerts at various severities, but the determination of the incident’s severity is driven by the above definitions.
  2. Over time, we aim to automate the determination of an incident’s severity through service-level monitoring that can aggregate individual alerts against specific SLOs.
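
As a sketch of the aggregation described above, the following derives an incident severity from the set of firing alerts. The alert shape and the burn-rate thresholds are hypothetical; actual severity selection follows the availability severities definitions, not this code.

def incident_severity(firing_alerts: list[dict]) -> str:
    """Derive an incident severity from firing alerts.

    Each alert is assumed to look like {"service": "web", "slo_burn_rate": 14.0};
    the thresholds below are illustrative, not the real severity definitions.
    """
    worst_burn = max((a.get("slo_burn_rate", 0.0) for a in firing_alerts), default=0.0)
    if worst_burn >= 14.0:
        return "S1"  # error budget is burning very quickly
    if worst_burn >= 6.0:
        return "S2"
    if worst_burn >= 1.0:
        return "S3"
    return "S4"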

Incident Workflow

Summary

In order to effectively track specific metrics and have a single pane of glass for incidents and their reviews, specific labels are used. The below workflow diagram describes the path an incident takes from open to closed. All S1 incidents require a review; other incidents can also be reviewed, as described here.

Bramble uses the Incident Management feature of GitLab. Incidents are reported and closed when they are resolved. A resolved incident means the degradation has ended and will not likely re-occur.

If there is additional follow-up work that requires more time after an incident is resolved and closed (like a detailed root cause analysis or a corrective action), a new issue may need to be created and linked to the incident issue. It is important to add as much information as possible as soon as an incident is resolved, while the information is fresh; this includes a high-level summary and a timeline where applicable.

Assignees

The EOC and the IMOC, at the time of the incident, are the default assignees for an incident issue. They are the assignees for the entire workflow of the incident issue.

Labeling

The following labels are used to track the incident lifecycle from active incident to completed incident review.

Workflow Labeling

In order to help with attribution, we also label each incident with the Infrastructure Service (Service::) and Group (group::) scoped labels.

Label Workflow State
~Incident::Active Indicates that the incident labeled is active and ongoing. Initial severity is assigned when it is opened.
~Incident::Mitigated Indicates that the incident has been mitigated, but immediate post-incident activity may be ongoing (monitoring, messaging, etc.). A mitigated issue means there is the potential for the impact to return. It may be appropriate to leave an incident mitigated while there is an alert silence with an expiration set.
~Incident::Resolved Indicates that SRE engagement with the incident has ended and the condition that triggered the alert has been resolved. Incident severity is re-assessed to determine whether the initial severity is still correct; if it is not, it is changed to the correct severity. Once an incident is resolved, it should be closed.
~Incident::Review-Completed Indicates that an incident review has been completed; this should be added to an incident after the review is completed if it has the ~review-requested label.

Root Cause Labeling

Labeling incidents with similar causes helps develop insight into overall trends and, when combined with Service attribution, improves understanding of Service behavior. Indicating a single root cause is desirable; in cases where there appear to be multiple root causes, indicate the root cause which precipitated the incident.

The EOC, as DRI of the incident, is responsible for determining root cause.

The current Root Cause labels are listed below. In order to support trend awareness these labels are meant to be high-level, not too numerous, and as consistent as possible over time.

Root Cause Description
~RootCause::Software-Change feature or other code change
~RootCause::Feature-Flag a feature flag toggled in some way (off or on or a new percentage or target was chosen for the feature flag)
~RootCause::Config-Change configuration change, other than a feature flag being toggled
~RootCause::SPoF the failure of a service or component which is an architectural SPoF (Single Point of Failure)
~RootCause::Malicious-Traffic deliberate malicious activity targeted at Bramble or customers of Bramble (e.g. DDoS)
~RootCause::Saturation failure resulting from a service or component which failed to scale in response to increasing demand (whether or not it was expected)
~RootCause::External-Dependency resulting from the failure of a dependency external to Bramble, including various service providers. Use of other causes (such as ~RootCause::SPoF or ~RootCause::Saturation) should be strongly considered for most incidents.
~RootCause::Release-Compatibility forward- or backwards-compatibility issues between subsequent releases of the software running concurrently, and sharing state, in a single environment (e.g. Canary and Main stage releases). They can be caused by incompatible database DDL changes, canary browser clients accessing non-canary APIs, or by incompatibilities between Redis values read by different versions of the application.
~RootCause::Security an incident where the SIRT team was engaged, generally via a request originating from the SIRT team or in a situation where Reliability has paged SIRT to assist in the mitigation of an incident not caused by ~RootCause::Malicious-Traffic
~RootCause::Flaky-Test an incident, usually a deployment pipeline failure found to have been caused by a flaky QA test
~RootCause::Indeterminate when an incident has been investigated, but the root cause continues to be unknown and an agreement has been formed to not pursue any further investigation.

“Needs” labeling

The following labels are added and removed by triage-ops automation depending on whether the corresponding label has been added. A sketch of this automation follows the table below.

Needs Label Description
~NeedsRootCause Will be added/removed automatically based on there being a ~RootCause::* label
~NeedsService Will be added/removed automatically based on there being a ~Service::* label
~NeedsCorrectiveActions Will be added/removed automatically based on there being at least one link in the Corrective Actions section of the issue description
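
The sketch below shows one way the add/remove behaviour described above could be implemented against the GitLab issues API; it is illustrative, not the actual triage-ops code. The base URL, project ID, and token are placeholders, it only covers the two rules driven by scoped labels, and it omits the Corrective Actions link check.

import requests

GITLAB_API = "https://gitlab.example.com/api/v4"   # placeholder base URL
NEEDS_RULES = {
    "NeedsRootCause": "RootCause::",
    "NeedsService": "Service::",
}

def sync_needs_labels(project_id: int, issue_iid: int, token: str) -> None:
    """Add or remove Needs* labels based on the presence of the matching scoped label."""
    headers = {"PRIVATE-TOKEN": token}
    issue = requests.get(
        f"{GITLAB_API}/projects/{project_id}/issues/{issue_iid}", headers=headers, timeout=10
    ).json()
    labels = issue["labels"]
    add, remove = [], []
    for needs_label, prefix in NEEDS_RULES.items():
        has_scoped = any(label.startswith(prefix) for label in labels)
        if has_scoped and needs_label in labels:
            remove.append(needs_label)       # scoped label present, Needs label no longer needed
        elif not has_scoped and needs_label not in labels:
            add.append(needs_label)          # scoped label missing, flag the issue
    if add or remove:
        requests.put(
            f"{GITLAB_API}/projects/{project_id}/issues/{issue_iid}",
            headers=headers,
            params={"add_labels": ",".join(add), "remove_labels": ",".join(remove)},
            timeout=10,
        )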

Required Labeling

These labels are always required on incident issues.

Label Purpose
~incident Label used for metrics tracking and immediate identification of incident issues.
~Service::* Scoped label for service attribution. Used in metrics and error budgeting.
~Severity::* Scoped label for severity assignment. Details on severity selection can be found in the availability severities section.
~RootCause::* Scoped label indicating root cause of the incident.

Optional Labeling

In certain cases, additional labels will be added as a mechanism to add metadata to an incident issue for the purposes of metrics and tracking.

Label Purpose
~incident-type::automated traffic The incident occurred due to activity from security scanners, crawlers, or other automated traffic
~incident-type::deployment related Indicates that the incident is a failing deployment or that the incident was caused by a deployment. Failures may be caused by failing tests, application bugs, or pipeline problems. Incidents during deploys may be the result of disconnects or other deploy-related errors.
~group::* Any development group(s) related to this incident
~review-requested Indicates that the incident would benefit from undergoing additional review. All S1 incidents are required to have a review. Additionally, anyone, including the EOC, can request an incident review on any severity issue. Although the review will help to derive corrective actions, it is expected that corrective actions are filed whether or not a review is requested. If an incident does not have any corrective actions, this is probably a good reason to request a review for additional discussion.

Workflow Diagram


graph TD
  A(Incident is declared) --> |initial severity assigned - EOC and IMOC are assigned| B(Incident::Active)
  B --> |Temporary mitigation is in place, or an alert silence is added| C(Incident::Mitigated)
  B --> D
  C --> D(Incident::Resolved)
  D --> |severity is re-assessed| D
  D -.-> |for review-requested incidents| E(Incident::Review-Completed)
  • As soon as an incident transitions to Incident::Resolved the incident issue will be closed
  • All Severity::1 incidents will automatically be labeled with review-requested

Incident Board

The board which tracks all Brmbl.io incidents from active to reviewed is located here.

Near Misses

A near miss, “near hit”, or “close call” is an unplanned event that has the potential to cause an incident but does not actually result in one.

Background

In the United States, the Aviation Safety Reporting System has been collecting reports of close calls since 1976. Due to near miss observations and other technological improvements, the rate of fatal accidents has dropped about 65 percent. source

As John Allspaw states:

Near misses are like a vaccine. They help the company better defend against more serious errors in the future, without harming anyone or anything in the process.

Handling Near Misses

When a near miss occurs, we should treat it in a similar manner to a normal incident.

  1. Open an incident issue, if one is not already opened. Label it with the severity label appropriate to the incident it would have caused, had the incident actually occurred. Label the incident issue with the ~Near Miss label.
  2. Corrective actions should be treated in the same way as those for an actual incident.
  3. Ownership of the incident review should be assigned to the team-member who noticed the near-miss, or, when appropriate, the team-member with the most knowledge of how the near-miss came about.

Periodic Evaluation

It is important to note that the processes surrounding security incident response should be periodically reviewed and evaluated for effectiveness. This also involves appropriate training of resources expected to respond to security incidents, as well as the training of the general population regarding Bramble’s expectation for them, relative to security responsibilities. The incident response plan is tested annually.