Engineering Workflow
This document explains the workflow for anyone working with issues in Bramble.
Flow
Products at Bramble are built using GitHub Flow, with Continuous Deployment of the master branch and Feature Flag-Driven Releases.
We have specific rules around code review.
Reverting a merge request
In line with our values of short toes, making two-way-door decisions and bias for action, anyone can propose to revert a merge request. When deciding whether an MR should be reverted, the following should be true:
- Something broke and there is no acceptable workaround. Examples of this include:
  - A feature broke and is categorized as ~severity::1 or ~severity::2. See severity labels.
  - Master broke.
  - There are failing migrations.
- There are no dependencies on the change. For example, a database migration has not been run on production.
Reverting merge requests that add non-functional changes and don’t remove any existing capabilities should be avoided in order to prevent designing by committee.
The intent of a revert is never to place blame on the original author. Additionally, it is helpful to inform the original author so they can participate as a DRI on any necessary follow up actions.
Broken master
If you notice that pipelines for the master branch of the Brmbl.io app are failing (red) or broken (green as a false positive), returning the build to a passing state takes priority over all other development work, since anything we do while tests are broken may break existing functionality or introduce new bugs and security issues.
What is a broken master?
All tests (unit, integration, e2e) that fail on master are treated as ~"master:broken".
Any test failure or flakiness (either false positive or false negative) impedes productivity for all of engineering and for our release processes.
If a change causes new test failures, the fix to the test should be made in the same Merge Request.
The cost of fixing test failures increases exponentially as time passes because we use pipelines for merged results. Auto-deploys, as well as security releases, depend on master being green.
Our aim should be to keep master free from failures, not to fix master only after it breaks.
Broken master service level objectives
There are two phases for fixing a ~"master:broken" issue, each with a target SLO to clarify the urgency. The resolution phase depends on the completion of the triage phase.
Phase | Service level objective | DRI |
---|---|---|
Triage | 4 hours from the initial master pipeline failure until a ~"master:broken" issue is assigned | Engineering Productivity team |
Resolution | 4 hours from assignment to DRI until issue is closed | Merge request author or team of merge request author |
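The two SLO windows in the table can be turned into concrete deadlines. The following is a minimal sketch, not an existing Bramble tool: the function and constant names are hypothetical, and only the two 4-hour windows come from the table.

```python
from datetime import datetime, timedelta

# SLO windows taken from the table above.
TRIAGE_SLO = timedelta(hours=4)
RESOLUTION_SLO = timedelta(hours=4)


def triage_deadline(first_failure_at: datetime) -> datetime:
    """Latest time by which a ~"master:broken" issue should be assigned."""
    return first_failure_at + TRIAGE_SLO


def resolution_deadline(assigned_at: datetime) -> datetime:
    """Latest time by which the issue should be closed, counted from assignment."""
    return assigned_at + RESOLUTION_SLO


def slo_breached(deadline: datetime, now: datetime) -> bool:
    """True once the current time is past the phase deadline."""
    return now > deadline
```

Because the resolution clock starts at assignment rather than at the first failure, the worst case that still meets both SLOs is 8 hours from the initial pipeline failure to a closed issue.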
Additional details about the phases are listed below.
Broken master escalation
If a ~"master:broken" issue is blocking your team (such as creating a security release) then you should:
- See if there is a current ~"master:broken" issue with a DRI.
- Check discussion on the failure notifications in #triage-ci on Slack. If there isn’t a discussion, ask in #triage-ci if there’s anyone investigating the issue you are looking at.
Triage broken master
The Development team is the triage DRI, responsible for monitoring master pipeline failures and for identifying and communicating ~"master:broken" issues.
Triage DRI Responsibilities
- Monitor
  - Pipeline failures are sent to #triage-ci and will be reviewed by the team. These reactions will be applied by the triage DRI to signal current status:
    - :eyes: signals the triage DRI is investigating a failing pipeline.
    - :boom: signals the pipeline contains a new failure. The triage DRI will create a new ~"master:broken" issue and reply in the thread with a link to the issue.
    - :fire_engine: signals the pipeline is failing due to a known issue. The triage DRI will reply in the thread with a link to the existing issue(s).
    - :retry: signals a system failure (e.g., Docker failure) is responsible and a retry has been triggered.
- Identification
  - Create an issue based on:
    - master failing for a non-flaky reason: create an issue with the following labels: ~"master:broken", ~"Engineering Productivity", ~priority::1, ~severity::1.
    - master failing for a flaky reason that cannot be reliably reproduced: create an issue with the following labels: ~"failure::flaky-test", ~"Engineering Productivity", ~priority::2, ~severity::2.
  - Identify the merge request that introduced the failures.
  - Assign the issue to the author of the merge request that caused the ~"master:broken" if they are available at the moment. If the author is not available, mention the team Engineering Manager and seek assistance in the #team-engineering Slack channel.
    - Ask for assistance in the #team-engineering Slack channel if there is no merge request that caused the ~"master:broken".
- Communication
  - Communicate ~"master:broken" in #team-engineering.
- (Optional) Pre-resolution
  - If the triage DRI believes that there’s an easy resolution by either:
    - Reverting a particular merge request.
    - Making a quick fix (for example, one line or a few similar simple changes in a few lines).
    The triage DRI can create a merge request, assign to any available maintainer, and ping the resolution DRI with a @username FYI message. Additionally, a message can be posted in #team-engineering to get a maintainer to take a look at the fix ASAP.
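The labelling rule in the identification step can be sketched as a small lookup. This is an illustration only: the function name and the reaction dictionary are hypothetical, while the label and emoji names are the ones listed above.

```python
# Reactions the triage DRI applies in #triage-ci, keyed by emoji name.
TRIAGE_REACTIONS = {
    "eyes": "triage DRI is investigating a failing pipeline",
    "boom": "pipeline contains a new failure; a new master:broken issue is created",
    "fire_engine": "pipeline is failing due to a known issue",
    "retry": "system failure; a retry has been triggered",
}


def triage_labels(is_flaky: bool) -> list[str]:
    """Return the labels for a new issue raised from a failing master pipeline."""
    if is_flaky:
        # Flaky failure that cannot be reliably reproduced.
        return ["failure::flaky-test", "Engineering Productivity", "priority::2", "severity::2"]
    # Non-flaky failure: master is genuinely broken.
    return ["master:broken", "Engineering Productivity", "priority::1", "severity::1"]
```

For example, `triage_labels(is_flaky=False)` yields the ~"master:broken" label set with ~priority::1 and ~severity::1.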
Resolution of broken master
The merge request author of the change that broke master is the resolution DRI. In the event the merge request author is not available, the team of the merge request author will assume the resolution DRI responsibilities. If a DRI has not acknowledged or signaled working on a fix, any developer can take ownership using the reaction guidance below and assume the resolution DRI responsibilities.
Responsibilities of the resolution DRI
- Prioritize resolving ~"master:broken" over new bug/feature work. Resolution options include:
  - Default: Revert the merge request which caused the broken master. If a revert is performed, create an issue to reinstate the merge request and assign it to the author of the reverted merge request. Reverts can go straight to maintainer review and require 1 maintainer approval. The maintainer can request additional review/approvals if the revert is not trivial.
  - Quarantine the failing test if you can confirm that it is flaky (e.g. it wasn’t touched recently and passed after retrying the failed job).
  - Create a new merge request to fix the failure if revert is not possible or would introduce additional risk. This should be treated as a ~priority::1 ~severity::1 issue. To ensure efficient review of the fix, the merge request should only contain the minimum change needed to fix the failure. Additional refactor or improvement to the code should be done as a follow up.
    - Remove the ~"master:broken" label from the issue and apply ~"failure::flaky-test".
- Apply the ~"Pick into auto-deploy" label (along with the needed ~"severity::1" and ~"priority::1") to make sure deployments are unblocked.
- Reactions by the resolution DRI in #team-engineering should follow this guidance:
  - :eyes: applied by the resolution DRI (or backup) to signal acknowledgment.
  - :construction: applied by the resolution DRI to signal that work is in progress on a fix.
  - :white_check_mark: applied by the resolution DRI to signal the fix is complete.
- Communicate in #team-engineering when the fix is in master.
- When the master build was failing and the underlying problem was quarantined, reverted, or given a temporary workaround, but the root cause still needs to be discovered, create a new issue with the ~"master:needs-investigation" label.
Responsibilities of authors and maintainers
Once the resolution DRI announces that master is fixed:
- Maintainers should start a new Pipeline for Merged Results (for canonical MRs) and enable “Merge When Pipeline Succeeds” (MWPS).
Merging during broken master
Merge requests cannot be merged to master until the broken pipeline is fixed and passing again.
This is because we need to try hard to avoid introducing new failures, since it’s easy to lose confidence if master stays red for a long time.
In the rare case where a merge request is urgent and must be merged immediately, team members can follow the process below to have a merge request merged during a broken master.
Criteria for merging during broken master
Merging while master is broken can only be done for:
- Merge requests that need to be deployed to Brmbl.io to alleviate an ongoing production incident.
- Merge requests that fix broken master issues (we can have multiple broken master issues ongoing).
How to request a merge during a broken master
First, ensure the latest pipeline has completed less than 2 hours ago (although it is likely to have failed due to brmbl-io/app using pipelines for merged results).
Next, make a request on Slack:
- Post in the #team-engineering Slack channel.
- In your post, outline why the merge request is urgent.
- Make it clear that this would be a merge during a broken master, and optionally add a link to this page in your request.
Instructions for the maintainer
A maintainer who sees a request to merge during a broken master must follow this process.
Note, if any part of the process below disqualifies a merge request from being merged during a broken master, then the maintainer must inform the requestor as to why in the merge request (and optionally in the Slack thread of the request).
First, assess the request:
- Add the :eyes: emoji to the Slack post so other maintainers know it is being assessed. We do not want multiple maintainers to work on fulfilling the request.
- Assess whether the merge request is urgent or not. If in doubt, ask the requestor for more details in the merge request about why it is urgent.
Next, ensure that all the following conditions are met:
- The latest pipeline has completed less than 2 hours ago (although it is likely to have failed due to brmbl-io/app using pipelines for merged results).
- All of the latest pipeline failures also happen on master.
- There is an issue labelled ~"master:broken" for every failure. See the “Triage DRI Responsibilities” steps above for more details.
Next, add a comment to the merge request mentioning that the merge request will be merged during a broken master, and link to the ~"master:broken" issue(s). For example:
Merge request will be merged while `master` is broken.
Failure in <JOB_URL> happens in `master` and is being worked on in <ISSUE_URL>.
Next, merge the merge request:
- If the “Merge” button is enabled (this is unlikely), then click it.
- Otherwise, you must:
  - Unset the “Pipelines must succeed” setting for the brmbl-io/app project.
  - Click the “Merge” button.
  - Set the “Pipelines must succeed” setting to be on again.
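The conditions a maintainer checks before merging during a broken master can be sketched as a single function that returns the reasons a request does not qualify, which is also what the maintainer has to report back to the requestor. This is a minimal sketch: `MergeRequestState`, its fields, and the job names in the usage example are hypothetical, and the data would have to be gathered by hand or from whatever tooling is available.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_PIPELINE_AGE = timedelta(hours=2)


@dataclass
class MergeRequestState:
    """Hypothetical snapshot of the facts a maintainer checks by hand."""
    is_urgent: bool                  # incident fix or broken-master fix
    pipeline_finished_at: datetime   # completion time of the latest MR pipeline
    mr_failures: set[str]            # failing job names on the MR pipeline
    master_failures: set[str]        # failing job names on master
    failures_with_issue: set[str]    # failures that have a master:broken issue


def blocking_reasons(mr: MergeRequestState, now: datetime) -> list[str]:
    """Return the reasons the MR cannot be merged; an empty list means it qualifies."""
    reasons = []
    if not mr.is_urgent:
        reasons.append("merge request is not urgent")
    if now - mr.pipeline_finished_at >= MAX_PIPELINE_AGE:
        reasons.append("latest pipeline completed 2 or more hours ago")
    if not mr.mr_failures <= mr.master_failures:
        reasons.append("some failures do not happen on master")
    if not mr.mr_failures <= mr.failures_with_issue:
        reasons.append("some failures have no master:broken issue")
    return reasons
```

A usage example with made-up job names:

```python
now = datetime(2024, 1, 1, 12, 0)
state = MergeRequestState(True, datetime(2024, 1, 1, 11, 0), {"unit"}, {"unit", "e2e"}, {"unit"})
print(blocking_reasons(state, now))  # an empty list: the request qualifies
```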
Security Issues
Security issues are managed and prioritized by the security team. If you are assigned to work on a security issue in a milestone, you need to ensure your code and solution are reviewed by an Application Security engineer before the issue can be closed.
If you find a security issue in Bramble, create a confidential issue mentioning the relevant security and engineering managers, and post about it in #security.
Basics
- Start working on an issue you’re assigned to. If you’re not assigned to any issue, find the issue with the highest priority and relevant label you can work on, and assign it to yourself. [You can use this query, which sorts by priority for the started milestones][priority-issues].
- If you need to schedule something or prioritize it, apply the appropriate labels (see Scheduling issues).
- If you are working on an issue that touches on areas outside of your expertise, be sure to mention someone in the other group(s) as soon as you start working on it. This allows others to give you early feedback, which should save you time in the long run.
- When you start working on an issue:
- Add the workflow::in dev label to the issue.
- Create a merge request (MR) by clicking on the Create merge request button in the issue. This creates an MR with the labels, milestone and title of the issue. It also relates the just created MR to the issue.
- Assign the MR to yourself.
- Work on the MR until it is ready, it meets our definition of done, and the pipeline succeeds.
- Edit the description and click on the Remove the Draft: prefix from the title button.
- Assign it to a reviewer(s). When assigning, also @mention them in the comments, requesting a review.
- (Optionally) Unassign yourself from the MR. Some may find that leaving the MR assigned to themselves makes it easier to track the MRs they are responsible for, using the built-in MR button/notification icon in the GitLab navigation bar.
- Change the workflow label of the issue to workflow::in review. If multiple people are working on the issue or multiple workflow labels might apply, consider breaking the issue up. Otherwise, default to the workflow label farthest away from completion.
- Potentially, a reviewer offers feedback and assigns back to the author.
- The author addresses the feedback and this goes back and forth until all reviewers approve the MR.
- After approving, the reviewer in each category unassigns themselves and assigns the suggested maintainer in their category.
- Maintainer reviews take place, with any back and forth as necessary, and attempt to resolve any open threads.
- The last maintainer to approve the MR follows the Merging a merge request guidelines.
- (Optionally) Change the workflow label of the issue to workflow::verification, to indicate all the development work for the issue has been done and it is waiting to be deployed and verified. We will use this label in cases where the work was requested to be verified by product OR we determined we need to perform this verification in production.
- You are responsible for the issues assigned to you. This means it has to ship with the milestone it’s associated with. If you are not able to do this, you have to communicate it early to your manager and other stakeholders (e.g. the product manager, other engineers working on dependent issues). In teams, the team is responsible for this (see Working in Teams). If you are uncertain, err on the side of overcommunication. It’s always better to communicate doubts than to wait.
- You (and your team, if applicable) are responsible for:
- Ensuring that your changes apply cleanly.
- The testing of a new feature or fix, especially right after it has been merged and packaged.
- Creating any relevant feature or API documentation.
- Shipping secure code (see Security is everyone’s responsibility).
- Once a release candidate has been deployed to the staging environment, please verify that your changes work as intended. We have seen issues where bugs did not appear in development but showed in production.
Be sure to read general guidelines about issues and merge requests.
Convention over Configuration
Avoid adding configuration values in the application settings or in gitlab.yml. Only add configuration if it is absolutely necessary. If you find yourself adding parameters to tune specific features, stop and consider how this can be avoided. Are the values really necessary? Could constants be used that work across the board? Could values be determined automatically?
See Convention over Configuration for more discussion.
Choosing Something to Work On
Start working on things with the highest priority in the current milestone. The priority of items is defined by labels in the repository, and you can sort issues by priority.
After sorting by priority, choose something that you’re able to tackle and that falls under your responsibility. That means that if you’re a frontend developer, you work on something with the label frontend.
To filter very precisely, you could filter all issues for:
- Milestone: Started
- Assignee: None (issue is unassigned)
- Label: Your label of choice. For instance CI/CD, Discussion, Quality, frontend, or Platform
- Sort by priority
[Use this link to quickly set the above parameters][priority-issues]. You’ll still need to filter by the label for your own team.
If you’re in doubt about what to work on, ask your lead. They will be able to tell you.
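The filter described above (started milestone, unassigned, a chosen label, sorted by priority) can be expressed as a short function. This is a sketch over a hypothetical, simplified `Issue` record, not a call into any real issue tracker API; it assumes priority 1 is the highest.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Issue:
    """Hypothetical, simplified issue record; not a real API object."""
    title: str
    priority: int                      # assumption: 1 is the highest priority
    milestone_started: bool
    assignee: Optional[str] = None
    labels: set[str] = field(default_factory=set)


def pick_candidates(issues: list[Issue], label: str) -> list[Issue]:
    """Unassigned issues in a started milestone with the given label, highest priority first."""
    matching = [
        issue for issue in issues
        if issue.milestone_started and issue.assignee is None and label in issue.labels
    ]
    return sorted(matching, key=lambda issue: issue.priority)
```

The first element of the returned list is the natural candidate to assign to yourself, provided you are able to tackle it.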
Working with Brmbl.io
Performance Data
There is extensive monitoring available for Brmbl.io. For more on this and related tools, see the monitoring handbook.
Error Reporting
- AppSignal is our error reporting tool
Scheduling Issues
Bramble has to be selective in working on particular issues. We have a limited capacity to work on new things. Therefore, we have to schedule issues carefully.
Our Product Manager is responsible for scheduling all issues including features, bugs, and tech debt. Product Managers alone determine the prioritization, but others are encouraged to influence the PMs’ decisions. The UX Lead and Engineering Leads are responsible for allocating people and making sure things are done on time. Product Managers are not responsible for these activities; they are not project managers.
Direction issues are the big, prioritized new features for each release. They are limited to a small number per release so that we have plenty of capacity to work on other important issues, bug fixes, etc.
If you want to schedule an Accepting merge requests issue, please remove the label first.
Any scheduled issue should have a team label assigned, and at least one type label.
Requesting Something to be Scheduled
To request scheduling an issue, ask the Product Manager.
We have many more requests for great features than we have capacity to work on. There is a good chance we’ll not be able to work on something. Make sure the appropriate labels (such as customer) are applied so every issue is given the priority it deserves.
Product Development Timeline
While deployments to Brmbl.io are more frequent than regular major/minor releases, teams (Product, UX, Development, Quality) continually work on issues according to their respective workflows.
There is no specified process whereby a particular person should be working on a set of issues in a given time period.
However, there are usually specific deadlines that should inform team workflows and prioritization.
Updating Issues Throughout Development
Team members use labels to track issues throughout development. This gives visibility to other developers, product managers, and designers, so that they can adjust their plans during a monthly iteration. An issue should follow these stages:
- workflow::in dev: A developer indicates they are developing an issue by applying the in dev label.
- workflow::in review: A developer indicates the issue is in code review and UX review by removing the in dev label, and applying the in review label.
- workflow::verification: A developer indicates that all the development work for the issue has been done and is waiting to be deployed and verified.
When the issue has been verified and everything is working, it can be closed.
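The label progression above behaves like a small state machine in which exactly one workflow label is present at a time. A minimal sketch, assuming labels are handled as plain strings in a set (the function name is hypothetical):

```python
WORKFLOW_STAGES = ["workflow::in dev", "workflow::in review", "workflow::verification"]


def advance_workflow(labels: set[str]) -> set[str]:
    """Replace the current workflow label with the next stage, if there is one."""
    current = [label for label in labels if label in WORKFLOW_STAGES]
    if len(current) != 1:
        raise ValueError("expected exactly one workflow label")
    index = WORKFLOW_STAGES.index(current[0])
    if index == len(WORKFLOW_STAGES) - 1:
        # Already at the last stage; closing the issue is the next step.
        return set(labels)
    return (labels - {current[0]}) | {WORKFLOW_STAGES[index + 1]}
```

Removing the old label and applying the new one in a single step mirrors the instruction to remove in dev when applying in review. The sketch always moves to the next stage, so it does not model skipping verification where that stage is not needed.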
Use Group Labels and Group Milestones
When working in GitLab (and in particular, the brmbl-io GitLab group), use group labels and group milestones as much as you can. It is easier to plan issues and merge requests at the group level, and doing so exposes ideas across projects more naturally. If you have a project label, you can promote it to a group label. This will merge all project labels with the same name into the one group label. The same is true for promoting project milestones to group milestones.
Technical debt
We definitely don’t want our technical debt to grow faster than our code base. To prevent this from happening we should consider not only the impact of the technical debt but also its contagion. How big will this problem become over time, and how fast? Is it likely a bad piece of code will be copy-pasted for a future feature? In the end, the amount of resources available is always less than the amount of technical debt to address.
To help with prioritization and decision-making here, we recommend thinking about contagion as the interest rate of the technical debt. There is a great comment from the internet about it:
You wouldn’t pay off your $50k student loan before first paying off your $5k credit card and it’s because of the high interest rate. The best debt to pay off first is one that has the highest loan payment to recurring payment reduction ratio, i.e. the one that reduces your overall debt payments the most, and that is usually the loan with the highest interest rate.
Technical debt is prioritized like other technical decisions in product groups by product management.
Technical debt that spans groups, or falls in the gaps between them, should be brought up for globally optimized prioritization in retrospectives or directly with the appropriate member of the Product Leadership team. Additional avenues for addressing technical debt outside of product groups are Rapid Action issues and working groups.
UX debt
Sometimes there is an intentional decision to deviate from the agreed-upon MVC, which sacrifices the user experience. When this occurs, the Product Designer creates a follow-up issue and labels it UX debt to address the UX gap in subsequent releases.
For the same reasons as technical debt, we don’t want UX debt to grow faster than our code base.
These issues are prioritized like other technical decisions in product groups by product management.
As with technical debt, UX debt should be brought up for globally optimized prioritization in retrospectives or directly with the appropriate member of the Product Leadership team.
UI polish
UI polish issues are visual improvements to the existing user interface, touching mainly aesthetic aspects of our UI foundations. UI polish issues generally capture improvements related to color, typography, iconography, and spacing. We apply the UI polish label to these issues. UI polish issues don’t introduce functionality or behavior changes to a feature.
Examples of UI polish
- Aesthetic improvements: removing unnecessary borders from a UI, updating the background color of an element, fixing the font size of a heading element.
- Misalignment of text, buttons, etc.: because in many cases nothing is actually broken, these improvements are considered UI polish, although they could also be considered bugs.
- Incorrect spacing between UI elements: when two interface elements are using inconsistent spacing values, such as 10px instead of 8px. It could also be considered technical debt. Note that if two interface elements have zero space between them, it’s an obvious bug.
- Visual inconsistencies across different product areas: visual inconsistencies could occur when we have a series of buttons on a particular view. For example, when 3/4 of them have been migrated to use a Tailwind component, and 1/4 of them are still using a deprecated button, resulting in a visual inconsistency. This is considered UI polish.
What is not UI polish
- Functional inconsistency related to the experience: for example, using a manual action to add an assignee automatically shows the assignee in the sidebar but using a manual action to add a weight to an issue does not automatically show the weight in the sidebar. This is not currently considered UI polish. It would be considered a UX issue.
- Improving visibility of system status: status indicator improvements are experience improvements and are not classified as UI polish.
- Even when updating something that is purely visual, such as a status icon, to improve the meaning the user has of what they are viewing, we are trying to improve the experience of that user.
Monitor Merge Request Trends
Open merge requests sometimes become idle (not updated by a human in more than a month). Once a month, engineering managers will receive an idle MR triage issue that includes all (non-WIP/Draft) MRs for their group and use it to determine if any action should be taken (such as nudging the author/reviewer/maintainer). This assists in getting merge requests merged in a reasonable amount of time (which we track as the metric MTTM: Mean Time to Merge).
Open merge requests may also have other properties that indicate that the engineering manager should research them and potentially take action to improve efficiency. One key property is the number of threads, which, when high, may indicate a need to update the plan for the MR or that a synchronous discussion should be considered. Another property is the number of pipelines, which, when high, may indicate a need to revisit the plan for the MR. These metrics are not yet included in an automatically created triage issue.
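The idle rule for the monthly triage issue can be sketched as a filter over open merge requests. This is an illustration only: the `MergeRequest` record is hypothetical, and "a month" is approximated here as 30 days, which is an assumption rather than a documented threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

IDLE_AFTER = timedelta(days=30)  # assumption: "a month" approximated as 30 days


@dataclass
class MergeRequest:
    """Hypothetical, simplified merge request record."""
    title: str
    last_human_update: datetime
    is_draft: bool = False


def idle_merge_requests(mrs: list[MergeRequest], now: datetime) -> list[MergeRequest]:
    """Non-draft MRs with no human update for longer than IDLE_AFTER."""
    return [
        mr for mr in mrs
        if not mr.is_draft and now - mr.last_human_update > IDLE_AFTER
    ]
```

Draft MRs are excluded to match the "(non-WIP/Draft)" qualifier, and only human updates count, so bot activity would need to be filtered out before `last_human_update` is populated.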
Security is everyone’s responsibility
Security is our top priority. Our team is raising the bar on security every day to protect users’ data and make Bramble a safe place for our customers’ data. There are many lines of code, so we shift security left in the Software Development LifeCycle (SDLC) with DevSecOps.
Being able to start the security review process earlier in the software development lifecycle means we will catch vulnerabilities earlier, and mitigate identified vulnerabilities before the code is merged. We are fixing the obvious security issues before every merge, and therefore, scaling the security review process. Our workflow includes a check and validation by the reviewers of every merge request, thereby enabling developers to act on identified vulnerabilities before merging.
As part of that process, developers are also empowered to reach out to the security specialists in the team to discuss the issue at that stage, rather than later on, when mitigating vulnerabilities becomes more expensive. After all, security is everyone’s job.
Rapid Engineering Response
From time to time, the engineering team must act quickly in response to urgent issues. This section describes how the engineering team handles certain kinds of such issues.
Scope
Not everything is urgent. See below for a non-exclusive list of things that are in-scope and not in-scope. As always, use your experience and judgment, and communicate with others.
- In Scope
  - Last-minute release blocking bug or security patch before an imminent release.
  - High severity (severity::1/priority::1) security issues. Refer to security severity and priority.
  - Highest priority and severity customer issues based on the priority and severity definitions.
- Not In Scope
  - An operational issue of Brmbl.io or a self managed customer environment. This falls under the on-call process.
  - Self developed and maintained tools that are not officially supported products by Bramble.
  - Feature request by a specific customer.
Process
- Person requesting Rapid Engineering Response creates an issue supplying all known information and applies priority and severity (or security severity and priority) to the best of their ability.
- Person requesting Rapid Engineering Response raises the issue to their own manager and the subject matter domain engineering manager (or the delegation if OOO).
- In case a specific group cannot be determined, raise the issue to the Director of Engineering (or the delegation if OOO) of the section.
- In case a specific section cannot be determined, raise the issue to the Sr. Director of Development (or the delegation if OOO).
- The engineering sponsor (subject matter Manager, Director, and/or Sr. Director) invokes all stakeholders of the subject matter as a rapid response task force to determine the best route of resolution:
- Engineering manager(s)
- Product Management
- QA
- UX
- Docs
- Security
- Support
- Adjust priority and severity or security severity and priority if necessary, and work collaboratively on the determined resolution.
Infradev
The infradev process is established to identify issues requiring priority attention in support of SaaS availability and reliability. These escalations are intended to be primarily asynchronous, as timely triage and attention are required. In addition to primary management through the issues, any gaps, concerns, or critical triage are handled in our regular Bramble SaaS Infrastructure meetings.
Scope
The infradev issue board is the primary focus of this process.
Roles and Responsibilities
Infrastructure
- Nominate issues by adding the Infradev label.
- Assess Severity and Priority and apply the corresponding label as appropriate.
- Provide as much information as possible to assist development engineering troubleshooting.
Development
- Development directors are responsible for triaging Infradev issues regularly by following the triage process below.
- Development managers are encouraged to triage issues regularly as well.
- Development managers collaborate with their counterpart Product Managers to refine, schedule, and resolve Infradev issues.
- Usually, issues are nominated as Infradev issues by SREs or Managers in the Infrastructure Department. Development engineers/managers are not expected to nominate Infradev issues.
  - However, when it’s necessary to spin off new issues from an existing Infradev issue, development engineers and managers may also apply the Infradev label to the new issues.
  - When development engineers and managers split off new Infradev issues, they must apply Severity and Priority labels to the new issues. The labels should correspond to the importance of the follow-on work.
Product Management
- Product Managers perform holistic prioritization of both product roadmap and Infradev issues as one unified backlog.
- Product Managers collaborate with their counterpart Development Managers to refine, schedule, and resolve Infradev issues.
Triage Process
Issues are nominated to the board through the inclusion of the label infradev and will appear on the infradev board.
- Review issues in the Open column. Look for issues within your Stage/Group/Category, but also for those which lack a clear assignment or where the assignment may need correction.
- Review the severity on the issue to validate appropriate prioritization.
- Ensure that the issue clearly explains the problem, the (potential) impact on Brmbl.io’s availability, and ideally, clearly defines a proposed solution to the problem.
- Assign a Development Manager and a Product Manager to any issue where the Milestone or the label workflow::ready for development is missing.
  - Development Manager and Product Manager collaborate on the assigned issue(s) for prioritization and planning.
  - Development Manager and Product Manager unassign themselves once the issue is planned for an iteration, i.e. associated with a Milestone and the label workflow::ready for development.
- All Issues should be prioritized into the appropriate workflow stage. It is the intent to maintain no Open (un-triaged) items.
Issues with ~infradev ~severity::1 ~priority::1 ~production request labels applied require immediate resolution.
Additionally, an automated status report is generated in the brmbl-io/infradev-reports issue tracker. A new report is opened weekly, and updated regularly. The report categorizes each infradev issue according to several criteria, and can help with the triage and prioritization process.
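Two of the checks from the triage process above, whether an issue still needs a Development Manager and Product Manager assigned for planning and whether it requires immediate resolution, can be sketched as simple predicates. The `InfradevIssue` record and function names are hypothetical; the label names are the ones used in this section.

```python
from dataclasses import dataclass, field
from typing import Optional

READY_LABEL = "workflow::ready for development"
IMMEDIATE_LABELS = {"infradev", "severity::1", "priority::1", "production request"}


@dataclass
class InfradevIssue:
    """Hypothetical, simplified issue record."""
    title: str
    milestone: Optional[str] = None
    labels: set[str] = field(default_factory=set)


def needs_planning(issue: InfradevIssue) -> bool:
    """True when the milestone or the ready-for-development label is missing."""
    return issue.milestone is None or READY_LABEL not in issue.labels


def needs_immediate_resolution(issue: InfradevIssue) -> bool:
    """True when all four escalation labels are applied."""
    return IMMEDIATE_LABELS <= issue.labels
```

An issue for which `needs_planning` returns False has both a Milestone and the workflow::ready for development label, which is the point at which the managers unassign themselves.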
A Guide to Creating Effective Infradev Issues
Triage of infradev issues should occur asynchronously. For maximum efficiency, please ensure the following so that your infradev issues gain maximum traction.
- Clearly state the scope of the problem, and how it affects Brmbl.io. Examples could include:
- Reliability issues: the problem could cause a widespread outage or degradation on Brmbl.io.
- Saturation issues: the problem could lead to increased saturation or latency issues due to resource over-utilization.
- Service-level degradation: the problem is causing our service-level monitoring to degrade, impacting the overall SLA of Brmbl.io and potentially leading to SLA violations.
- Unnecessary alerts: the problem does not have a major impact on users, but is leading to extraneous alerts, impacting the ability of SREs to effectively triage incidents due to alerting noise.
- Problems which extend the time to diagnosis of incidents: for example, issues which degrade the observability of Brmbl.io, swallow user-impacting errors or logs, etc. These could lead to incidents taking much longer to clear, and impacting availability.
- Deficiencies in our public APIs which lead to customers compensating by generating substantially more traffic to get the required results.
- Quantify the effect of the problem to help ensure that correct prioritization occurs.
- Include costs to availability.
- Include the number of times alerts have fired owing to the problem, how much time was spent dealing with the problem, and how many people were involved.
- Include screenshots of visualization from Grafana or Kibana.
- Always include a permalink to the source of the screenshot so that others can investigate further.
- Provide a clear, unambiguous, self-contained solution to the problem. Do not add the infradev label to architectural problems, vague solutions, or requests to investigate an unknown root-cause.
- Ensure scope is limited. Each issue should be able to be owned by a single stage group team and should not need to be broken down further. Single task solutions are best.
- Ensure a realistic severity is applied: review the availability severity label guidelines and ensure that applied severity matches. Always ensure all issues have a severity, even if you are unsure.
- If possible, include ownership labels for more effective triage. The product categories can help determine the appropriate stage group to assign the issue to.
- Cross-reference links to Production Incidents, PagerDuty Alerts, Slack Alerts and Slack Discussions, to help ensure that the team performing the triage has all the available data.
- By adding “Related” links on the infradev issue, the Infradev Status Report will display a count of the number of production incidents related to each infradev issue, for easier and clearer prioritization.
- Ensure that the issue title is accurate, brief and clear. Change the title over time if you need to keep it accurate.
- By adding an infradev label to an issue, you are assuming responsibility and becoming the sponsor/champion of the issue.
- Provide a method for validating that the original issue still exists
- Sometimes infradev issues will resolve on their own, or are resolved as a side-effect of an unrelated change.
- In the infradev issue description, provide a clear way of checking whether the problem still exists.
- Having a way of checking validity can save on a great deal of back-and-forth discussion between Infradev Triage participants including Engineering Managers, Directors and Product Managers and make space for other non-resolved issues to get scheduled sooner.
- Ideally, provide a link to a Thanos query or an ELK query and clear instructions on how to interpret the results to determine whether the problem is still occurring.
- Alternatively, provide clear instructions on how to recreate or validate the problem.
- If an issue has been resolved, use the following process:
- Reassign the issue back to the author, or an appropriate owner, requesting that they confirm the resolution, and close the issue if they concur. If not, they should follow up with a note and unassign themselves.