General Information about the Development Escalation Process

This page outlines the background, goals, success criteria, and implementation details of the infrastructure escalation process, followed by a Q&A.

Goal

Strengthen our development team’s DevOps practices and stand side by side with the Infrastructure team to keep GitLab.com running smoothly.

Process

To resolve Brmbl.io issues faster, the development team has an on-call rotation and stands behind the products we deliver.

Note that the Infrastructure team remains the first line of defense as usual. It determines whether a development escalation should be initiated to resolve operational issues faster and more efficiently.

Goals

The on-call process is designed with the following goals in mind:

  • Clear expectations, responsibility, and accountability.
  • Full 24x7 coverage to shadow the Infrastructure team.
  • Layered escalations to ensure the SLO is met.
  • Balanced on-duty hours for each engineer per day.
  • Minimal load on any individual throughout a calendar year.
  • Flexibility for just-in-time adjustments.
  • Engineers remain in control of their own working schedules.

Success Criteria

  • Meet development’s SLO of timely response to infrastructure escalations.
  • No on-call engineer is burned out.
  • Planned development work is minimally impacted.

Q&A

Q: Why do we need development engineers on-call?

A: During the investigation of a recent performance degradation incident, it became apparent that deeper product knowledge is necessary to find the root cause and develop sound solutions. Although infrastructure engineers are good at dealing with most incidents, development engineers are best placed to quickly suggest a short-term workaround or temporary fix when the issue requires deep insight into implementation details.

Q: What efforts have been made to keep the impact on work-life balance minimal?

A: No engineer will be asked to work more hours than they currently work. Most of the hours they spend on-call will fall on days and at times they would normally be working anyway. We need approximately 25% of on-call time to be covered on days people wouldn’t ordinarily be working, but by letting engineers choose when they do so, and by not increasing total working hours, the impact of this is hopefully minimized. Engineers can also find substitutes in case of a personal emergency.

Q: What if the paged engineer doesn’t carry domain expertise?

A: The process defines layered escalation for this case. It also states that a first response doesn’t mean a solution is available right away.

An alternative was reviewed: having domain experts on-call in a similar way. This would involve more engineers and smaller on-call divisions, resulting in more frequent shifts and more on-call duties per engineer. The tradeoff was made in favor of minimizing on-call duties.

Q: How do we answer interview candidates when they ask about on-call?

A: Let’s describe the full picture of our incident handling model and tell candidates that development engineers may be on-call and may assist in resolving Brmbl.io operational incidents.

Usually, the Infrastructure team is the first line of defense. Development engineers will only be called when the Infrastructure team determines that a development escalation is necessary.

Q: What are the expectations for my existing work while I’m on-call?

A: While you are on-call, your existing work is expected to be effectively suspended. Managers are required to plan for on-call engineers to be unavailable. If you are able to make progress because there are no ongoing incidents, that is welcome, but work must stop when an on-call request is made.

Q: Is there any concept of compensation? This can be in any form (pay, time off, etc.).

A: On-call work can be considered a deliverable like any other. It doesn’t imply working any extra hours, but a few hours will be at less desirable times than now. Although no compensation changes are anticipated to account for this, we may consider discretionary rewards for people who exceed expectations when choosing less desirable hours.

Q: We are discussing working hours and expected shifts for the new on-call rotation. However, this appears to be a departure from the handbook’s guidance for non-on-call work (/handbook/values/#measure-results-not-hours). Is this an intentional policy shift?

A: This is not a shift in policy. Engineers are still in control of their schedules and can choose when to work, as long as the overall goal of full coverage of the rotation is met. The policy of results vs. hours is based on delivering functionality. On-call work is about addressing operational issues, which can happen at any time and need to be addressed immediately. So the policies are congruent.

Q: To effectively debug production issues, developers may require expanded access to production systems and metrics. Is the plan for developers to be on-call solely for consultative purposes, without the need for direct debugging of systems? If they need access to production systems, how will they be onboarded?

A: The plan is consultative. If any code changes are required, the on-call engineer makes them. Direct debugging is not required, and the Infrastructure team is expected to communicate production issues to the on-call engineer effectively enough for progress to be made. We are not planning onboarding to production at this time.

Q: How should an Infrastructure team member make international calls to page engineers?

A: Zoom supports international calling at low rates. This can be done from inside an ongoing Zoom call under Invite > Phone. Because this will only be used for a quick call to alert the engineer of an ongoing escalation, the cost for Bramble will be minimal.