Development Escalation Process
This page outlines the development team on-call process and guidelines for developing the rotation schedule for handling infrastructure incident escalations.
Expectation
Development engineers are expected to act as technical consultants and to collaborate with the teams that requested the on-call escalation so they can troubleshoot together. The development engineer is not expected to be solely responsible for resolving the escalation.
Escalation Process
Scope of Process
- This process is designed for the following issues:
- Brmbl.io operational emergencies raised by the Infrastructure, Security, and Support teams.
- Engineering emergencies raised by Engineering teams, where an imminent deployment or release is blocked.
- This process is NOT a path to reach the development team for non-urgent issues that the Infrastructure, Security, and Support teams run into. Such issues can be moved forward by:
- Labelling with security and mentioning the @brmbl-io/security/appsec team, which is notified as part of the Application Security Triage rotation
- Labelling with infradev, which raises the issue to the Infra/Dev triage board
- Raising to the respective product stage/group Slack channel, or
- Asking the #is-this-known Slack channel
- This process provides 24x7 coverage.
Examples of qualified issues
- Production issue examples:
- Brmbl.io: DB failover and degraded Brmbl.io performance
- Brmbl.io: Severity 1 vulnerability that is being actively exploited or has a high likelihood of being exploited, and that puts the confidentiality, availability, and/or integrity of customer data in jeopardy.
- Engineering emergency examples:
- A deployment issue that will cause other deployment failures.
- Brmbl.io deployment or a security release is blocked due to pipeline failure.
- A severity::1 regression found in a recent release
Examples of non-qualified issues
- Production issue examples:
- Brmbl.io: Errors when importing a file
- Brmbl.io: Last minute security patch to be included in an upcoming release
- Engineering issue examples:
- A priority::1/severity::1 enhancement of CI
- A priority::1/severity::1 fix to API
- Non release blocking QA failures on staging
Process Outline
NOTE: The on-call engineer need not announce the beginning or end of their shift in #dev-escalation unless there is an active incident (check the channel's chat history to find out whether one is in progress). Many engineers have very noisy notifications enabled for that channel, and such announcements are essentially false positives that make them check the channel unnecessarily.
Weekdays (UTC)
Weekdays will leverage PagerDuty to select an on-call, with these criteria:
- Incidents will be escalated by PagerDuty, which should randomly select a backend engineer (BE) who is currently eligible based on their working hours.
- During an incident, a randomly selected BE should have the option to pass the incident to another BE if they are urgently needed elsewhere.
- Engineers who are eligible to be on-call during weekend shifts should be deprioritized in this process.
Escalation
- SRE et al. raise the issue in #dev-escalation.
- BE responds to the PagerDuty thread with 👀.
- If the primary does not respond, a secondary will be notified.
- PagerDuty will continue trying up to 6 different BEs, with a preference for those who do not take weekend shifts.
- BE triages the issue and works towards a solution.
- If necessary, BE reaches out to domain experts.
If no BE responds to the bot, PagerDuty will then notify the Engineering Manager (EM), who will need to find someone available and announce this in the escalation thread. As an EM:
- Check whether some of the engineers pinged belong to your group and see whether they are available to help
- Try to find someone available from your group
- If the search is positive, leave a message in the thread as an acknowledgement that the engineer from your group will be looking into the issue
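The weekday selection rules above can be sketched as follows. This is an illustration only, not a description of how PagerDuty is configured: the `Engineer` fields, the `paging_order` helper, and the strict "engineers without weekend shifts first" ordering are assumptions made for the example. The limit of 6 attempts comes from the process above.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

MAX_ATTEMPTS = 6  # the process tries up to 6 different BEs before the EM is notified


@dataclass(frozen=True)
class Engineer:
    """Hypothetical record; the field names are assumptions for this sketch."""
    name: str
    in_working_hours: bool      # currently eligible due to working hours
    takes_weekend_shifts: bool  # deprioritized for weekday pages


def paging_order(engineers: List[Engineer],
                 rng: Optional[random.Random] = None) -> List[Engineer]:
    """Return the BEs to page, in order, for one weekday escalation."""
    if rng is None:
        rng = random.Random()
    eligible = [e for e in engineers if e.in_working_hours]
    preferred = [e for e in eligible if not e.takes_weekend_shifts]
    deprioritized = [e for e in eligible if e.takes_weekend_shifts]
    # Random selection within each group; the preferred group is always paged first.
    rng.shuffle(preferred)
    rng.shuffle(deprioritized)
    return (preferred + deprioritized)[:MAX_ATTEMPTS]
```

If the returned list is exhausted without an acknowledgement, the escalation falls through to the Engineering Manager as described above.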
Weekends and Holidays (UTC)
Weekend and holiday on-call will continue to use the existing on-call process as defined in PagerDuty (PD).
Escalation
- SRE et al. note an issue in #dev-escalation.
- SRE raises an incident in EOC in PagerDuty.
- BE triages the issue and works towards a solution.
- If necessary, BE reaches out to domain experts.
First response time SLOs
OPERATIONAL EMERGENCY ISSUES ONLY
- Brmbl.io: Development engineers provide initial response (not solution) in both #dev-escalation and the tracking issue within 15 minutes.
- In the case of a tie between production (Brmbl.io) and engineering issues, the production issue takes priority. The preferred action is to either back out or roll back to the point before the offending MR.
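As a small illustration of the 15-minute first-response target, the check below compares the time an escalation was raised with the time of the first response. The function name and the choice to treat exactly 15 minutes as meeting the target are assumptions for this sketch.

```python
from datetime import datetime, timedelta

FIRST_RESPONSE_SLO = timedelta(minutes=15)


def met_first_response_slo(raised_at: datetime, first_response_at: datetime) -> bool:
    """True if the initial response (not the solution) arrived within the SLO."""
    return first_response_at - raised_at <= FIRST_RESPONSE_SLO
```

Both timestamps should be in the same timezone (or both timezone-aware) so the subtraction is meaningful.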
Required Slack Channel and Notification Settings
- All on-call engineers, managers, distinguished engineers, fellows (who are not co-founders) and directors are required to join #dev-escalation.
- On-call engineers are required to add a phone number that they can be reached on during their on-call schedule to the on-call sheet.
- On-call engineers are required to turn on Slack notifications during regular working hours. Please refer to Notification Settings for details.
- Similarly, managers and directors of on-duty engineers are encouraged to do the same so that they stay informed. When necessary, managers and directors will help find domain experts.
- Hint: turn on Slack email notifications while on duty to make doubly sure nothing falls through the cracks.
Rotation Scheduling
Important: Sign-ups for weekdays and weekends are required as a backup, while PagerDuty is used as the primary scheduling tool and entry point for weekday escalations.
Guidelines
Assignments
On-call work comes in four-hour blocks, aligned to UTC:
- 0000 - 0359
- 0400 - 0759
- 0800 - 1159
- 1200 - 1559
- 1600 - 1959
- 2000 - 2359
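To make the block boundaries concrete, here is a minimal sketch that maps a timestamp to the UTC block containing it. The function name `shift_block` is an assumption; the labels follow the list above.

```python
from datetime import datetime, timezone

SHIFT_HOURS = 4  # each on-call block lasts four hours


def shift_block(moment: datetime) -> str:
    """Return the label of the UTC shift block containing `moment`."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    utc = moment.astimezone(timezone.utc)
    start = (utc.hour // SHIFT_HOURS) * SHIFT_HOURS
    end = start + SHIFT_HOURS - 1
    return f"{start:02d}00 - {end:02d}59"
```

For example, `shift_block(datetime(2024, 3, 1, 9, 30, tzinfo=timezone.utc))` returns `"0800 - 1159"`.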
One engineer must be on-call at all times. This means that each year, we must allocate 2,190 4-hour shifts.
The total number of shifts is divided among the eligible engineers. This is the minimum number of shifts any one engineer is expected to do.
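The arithmetic behind these figures is 6 shifts per day × 365 days = 2,190 shifts. The sketch below reproduces it and derives a per-engineer minimum; the headcount passed in is hypothetical, and rounding up is an assumption made here so that every shift is covered.

```python
import math

SHIFTS_PER_DAY = 24 // 4  # six 4-hour blocks per day
DAYS_PER_YEAR = 365       # a non-leap year, matching the 2,190 figure


def shifts_per_year(days: int = DAYS_PER_YEAR) -> int:
    """Total 4-hour shifts needed for continuous coverage."""
    return days * SHIFTS_PER_DAY


def minimum_shifts_per_engineer(eligible_engineers: int,
                                days: int = DAYS_PER_YEAR) -> int:
    """Shifts each engineer must take if the total is split evenly (rounded up)."""
    if eligible_engineers <= 0:
        raise ValueError("need at least one eligible engineer")
    return math.ceil(shifts_per_year(days) / eligible_engineers)
```

With a hypothetical pool of 100 eligible engineers, `minimum_shifts_per_engineer(100)` gives 22 shifts per year. A leap year adds six shifts, for 2,196 in total.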
In general, engineers are free to choose which shifts they take across the year. They are free to choose shifts that are convenient for them, and to arrange shifts in blocks if they prefer. A few conditions apply:
- No engineer should be on call for more than 3 shifts in a row (12 hours), with 1-2 being the norm
- No engineer should take more than 12 shifts (48 hours) per week, with 10 shifts (40 hours) being the usual maximum.
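The two limits above can be checked mechanically. In this sketch each shift is identified by a running index of 4-hour blocks (0, 1, 2, ...), and a week is taken to be 42 consecutive blocks starting at index 0; both conventions, and the function names, are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List

MAX_CONSECUTIVE_SHIFTS = 3  # 12 hours in a row
MAX_SHIFTS_PER_WEEK = 12    # 48 hours per week
SHIFTS_PER_WEEK = 7 * 6     # 42 four-hour blocks in a week


def longest_run(shift_indexes: Iterable[int]) -> int:
    """Length of the longest run of back-to-back shifts."""
    longest = current = 0
    previous = None
    for index in sorted(set(shift_indexes)):
        current = current + 1 if previous is not None and index == previous + 1 else 1
        longest = max(longest, current)
        previous = index
    return longest


def violations(shift_indexes: Iterable[int]) -> List[str]:
    """Return the conditions that one engineer's claimed shifts break."""
    shifts = set(shift_indexes)
    problems = []
    if longest_run(shifts) > MAX_CONSECUTIVE_SHIFTS:
        problems.append("more than 3 shifts in a row")
    per_week = Counter(index // SHIFTS_PER_WEEK for index in shifts)
    if any(count > MAX_SHIFTS_PER_WEEK for count in per_week.values()):
        problems.append("more than 12 shifts in a week")
    return problems
```

An empty list means the claimed shifts respect both hard limits; the "usual" norms of 1-2 shifts in a row and 10 shifts per week are not enforced here.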
Most on-call shifts will take place within an engineer’s normal working hours.
Scheduling and claiming specific shifts is done in PagerDuty.
Eligibility
All development backend and fullstack engineers who have been with the company for at least 3 months are eligible.
Exceptions (i.e. exempted from on-call duty):
- Where the law or regulation of the country/region poses restrictions. According to the legal department:
- There are countries with laws governing hours that can be worked.
- This would not be an issue in the U.S.
- At this point we would only be looking into countries where 1) we have legal entities, as those team members are employees, or 2) team members are hired as employees through one of our PEO providers. For everyone else, team members are contracted as independent contractors, so general employment law would not apply.
Nomination
Engineers normally claim shifts themselves in PD.
To ensure we get 100% coverage, the schedule is fixed one month in advance.
Engineers claim shifts between two and three months in advance. When signing up, fill the cell with your full name as it appears in the team members list, your Slack display name, and your phone number with country code. The same instruction is posted in the header of the schedule spreadsheet.
At the start of each month, engineering managers look at the schedule for the following month (e.g. on 1st March, they would be considering the schedule for April, while engineers are claiming slots in May). If any gaps or uncovered shifts are identified, the EMs will assign those shifts to engineers. The assignment should take into account:
- How many on-call hours an engineer has done (i.e., how many of their allocated hours are left)
- Upcoming leave
- Any other extenuating factors
- Respecting an assumed 40-hour working week
- Respecting an assumed 8-hour working day
- Respecting the timezones engineers are based in
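The planning timeline can be expressed as simple month arithmetic: at the start of a given month, EMs review the following month and engineers are claiming slots in the month after that. The helper names and the dictionary keys below are assumptions for this sketch.

```python
from datetime import date
from typing import Dict, Tuple


def month_after(year: int, month: int, offset: int) -> Tuple[int, int]:
    """Return (year, month) that lies `offset` months after the given month."""
    index = year * 12 + (month - 1) + offset
    return index // 12, index % 12 + 1


def planning_months(today: date) -> Dict[str, Tuple[int, int]]:
    """Months under EM review and open for claims, as seen from `today`."""
    return {
        "em_review": month_after(today.year, today.month, 1),
        "open_for_claims": month_after(today.year, today.month, 2),
    }
```

For 1st March 2024 this yields April 2024 for EM review and May 2024 for claims, matching the example above; the year rollover in November and December is handled by the month index.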
In general, engineers who aren’t signing up to cover on-call shifts will be the ones who end up being assigned shifts that nobody else wants to cover, so it’s best to sign up for shifts early!
Relay Handover
- Since the engineers who are on call may change frequently, responsibility for being available rests with them. Missing an on-call shift is a serious matter.
- During an ongoing escalation, no engineer should finish their on-call duties until they have paged the engineer taking over from them and confirmed that engineer is present, or they have notified someone who is able to arrange a replacement. They do not have to find a replacement themselves, but they need confirmation from someone that a replacement will be found.
- When an ongoing escalation is handed over to an incoming on-call engineer, the current on-call engineer summarizes the full context of ongoing issues, such as but not limited to:
- Current status
- What was attempted
- What to explore next if any clue
- Anything that helps bring the next on-call engineer up to speed quickly
These summary items should be in written format in the following locations:
- Existing threads in #dev-escalation
- Incident tracking issues
- This shall be completed at the end of shifts to hand over smoothly.
- For current Infrastructure issues and status, refer to Infra/Dev Triage board.
- For current Production incident issues and status, refer to Production Incidents board.
- If an incident is ongoing at the time of handover, outgoing engineers may prefer to remain on-call for another shift. This is acceptable as long as the incoming engineer agrees, and the outgoing engineer is on their first or second shift.
- If you were involved in an incident which has been mitigated during your shift, leave a note about your involvement in the incident issue and link to it in the #dev-escalation Slack channel, indicating you participated in the issue, as an informational hand-off to future on-call engineers.
Coordinator
Given the administrative overhead, one engineering director or manager is responsible for coordinating the schedule for each month. Nomination follows the same self-nomination approach: on each month's tab in the schedule spreadsheet, directors and managers are encouraged to sign up in the Coordinator column. There is one director or manager per month.
Responsibility
The coordinator will:
- Remind engineers to sign up, by:
- Posting reminders to the #team-engineering channel in Slack
- Asking managers in #eng-managers to remind team-members in 1-1s
- Utilizing appropriate mailing lists to contact engineers by email
- Assign folks to unfilled slots when needed (do your own due diligence when this action is necessary). Use purple text in the spreadsheet to indicate this was an assigned slot.
- Coordinate temporary changes or special requests that cannot be resolved by engineers themselves.
- After assigning unfilled slots and accommodating special requests the coordinator should update PagerDuty to schedule shifts.
Additional Notes for Weekend Shifts
Eligible engineers are encouraged to explore the options that work best for their personal situations in lieu of weekend shifts. When on-call, you have the following possibilities:
- Swap weekend days and weekdays.
- Swap hours between weekend days and weekdays.
- Take up to double the time off for any time worked during the weekend when the above two options don’t work with your personal schedule.
- When an engineer is in standby mode (i.e. not paged) during the weekend shift, they can take 1.25x time-off.
- When an engineer is in call-back mode (i.e. being paged) during the weekend shift, they can take double the time-off.
- For those who reside in Australia, please refer to these guidelines of time in lieu in the handbook.
- Please create an OOO event in PTO by Roots and choose On-Call Time in Lieu.
- Other alternatives that promote work-life balance and have the least impact to your personal schedule.
With the above alternatives we want to make sure we comply with local labor laws and not surpass the restricted weekly working hours (ranging from 38 to 60 hours) and offer enough rest time for the engineers who sign up on weekend on-call shifts.
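As a worked example of the time-off multipliers above (1.25x for standby, 2x for call-back), the sketch below computes the time in lieu for one weekend shift. The function name and the split of a shift into standby and call-back hours are assumptions; local rules, such as the separate guidelines for Australia, are not modelled.

```python
STANDBY_MULTIPLIER = 1.25  # on call but not paged
CALLBACK_MULTIPLIER = 2.0  # actively paged


def time_in_lieu_hours(standby_hours: float, callback_hours: float) -> float:
    """Hours of time off earned for weekend on-call work."""
    if standby_hours < 0 or callback_hours < 0:
        raise ValueError("hours must not be negative")
    return standby_hours * STANDBY_MULTIPLIER + callback_hours * CALLBACK_MULTIPLIER
```

A 4-hour weekend shift spent entirely on standby earns 5 hours off; if one of those hours was spent responding to a page, the total is 3 × 1.25 + 1 × 2 = 5.75 hours.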
Resources
Shadowing A Whole Shift
To get an idea of what’s expected of an on-call engineer and how often incidents occur it can be helpful to shadow another shift. To do this simply identify a time-slot that you’d like to shadow in the on-call schedule and contact the primary to let them know you’ll be shadowing. Ask them to invite you to the calendar event for this slot. During the shift keep an eye on #dev-escalation for incidents and observe how the primary follows the process if any arise.
Notification Settings
To make the First Responder process effective, the engineer on-call must configure their notifications to give them the best chance of noticing and responding to an incident.
These are the recommended settings. Your mileage may vary.
Slack Notifications
- Within Slack, open “Preferences”.
- Under “Notify me about…”, select one of the first two options; we recommend “Direct messages, mentions & keywords”. Do not choose “Nothing”.
- If you check “Use different settings for my mobile devices”, follow the same rule above.
- Scroll down to “Notification Schedule”.
- Under “Allow notifications”, enter your work schedule. For example: Weekdays, 9 am to 5 pm. Pagerslack relies on this to decide whether or not to page a person.
- Scroll down to “Sound & appearance”.
- Choose settings that ensure you won’t miss messages. We recommend:
- Select a “Notification sound”.
- Check “Bounce Slack’s icon when receiving a notification”.
- Use your preference for the other settings. The “Channel-specific notifications” are particularly helpful to mute noisy channels that you don’t need to be interrupted for.
macOS Notifications
- Under “System Preferences”, select “Notifications”.
- Scroll down to find “Slack”.
- Enable “Allow Notifications from Slack”.
- For “Slack alert style”, we recommend “Alerts” so you need to dismiss them. “Banners” might also work for you. Do not select “None”.
- Enable “Play sound for notifications”, particularly if you chose “Banners” above.
- Use your preference for the other settings.
iOS Notifications
- Under “Settings”, open “Notifications”.
- Scroll down to find “Slack”.
- Enable “Allow Notifications from Slack”.
- Under “ALERTS”, enable “Lock Screen” and “Banners”.
- Enable “Sounds”.
- Use your preference for the other settings.