Engineering Manager series part 6: systems health and how to create a DevOps culture
In today’s post we sit down with Mattias Björnheden, Director of Engineering, to discuss the fifth and final component of our Engineering Manager framework: systems health.
Get a full overview of our Engineering Framework from Udi, our CTO, and don’t miss Simone’s post on productivity, Rodrigo’s post on team health, Oliver’s post on stakeholder happiness, and Mathieu’s post on business impact.
GetYourGuide is an online marketplace for amazing travel activities with systems that handle millions of events every day. To manage this in a way that delights our customers, we rely on a cloud-based infrastructure running many different applications. Keeping these systems in good health requires technical skills, good organization, and a sense of responsibility from all of our engineers. The Engineering Manager, together with the team, is responsible for making sure all of these things come together, which brings us to the fifth component of our Engineering Manager framework: Systems Health.
In this blog post, we take a closer look at what managing systems health means in practice for Engineering Managers at GetYourGuide.
Know what systems your team owns
Responsibility starts with ownership. While we strive to have an open environment where any engineer can contribute to any application as needed, we also need to know who will take ownership of the quality of that application. This sounds easier than it is. In a fast-growing organization, teams change often as more people join, but functionality that may not be the focus of new feature development still needs to work properly.
Our Engineering Principles state: “The team developing a service is responsible for defining service level agreements (SLAs) and supporting the service in production.” For the Engineering Manager, this means initiating a discussion with the team and product owner on whether the service is business-critical and whether the team can support and operate it long-term. As an illustration, the checkout flow is viewed as business-critical, and losing it for even a minute is a bad outage. On the flip side, the translation system is something we always aim to keep running, but it’s less disastrous if translations get delayed for a few minutes. So, this system can be run at a lower service level.
Ownership is often straightforward for a new service a team is developing, but there are always exceptions. The team developing a new service may also have to own a couple of older services from a since-disbanded team. While we are moving away from monolithic applications too big to be owned by one team, we still have them. In these cases, the Engineering Manager is expected to help figure out which parts are the team’s responsibility.
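One way to make ownership and service levels explicit is to keep them in a small registry that the team reviews regularly. The sketch below is only an illustration: the team names, the `legacy-reports` service, and the availability targets are hypothetical and do not describe our actual setup.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Service:
    name: str
    owner: Optional[str]        # owning team, or None if ownership is unclear
    business_critical: bool
    availability_target: float  # hypothetical SLA target, e.g. 0.9995


# Hypothetical entries for illustration only.
SERVICES = [
    Service("checkout", "team-a", True, 0.9995),
    Service("translations", "team-b", False, 0.99),
    Service("legacy-reports", None, False, 0.99),
]


def unowned(services: List[Service]) -> List[str]:
    """Names of services that no team has taken ownership of yet."""
    return [s.name for s in services if s.owner is None]


def allowed_downtime_minutes(service: Service, days: int = 30) -> float:
    """Downtime budget implied by the availability target over `days` days."""
    return (1 - service.availability_target) * days * 24 * 60


if __name__ == "__main__":
    print("Needs an owner:", unowned(SERVICES))
    for s in SERVICES:
        print(s.name, round(allowed_downtime_minutes(s), 1), "minutes per 30 days")
```

Turning a target into a downtime budget makes the discussion with the product owner concrete: a lower target for translations leaves a much larger budget than the one for checkout.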
Find and track the right metrics for your systems
Once the team has created clarity around which systems it owns, the next step is to gain insight into the actual health of these systems. The codebase itself may provide some hints regarding the state of a system, and, at GetYourGuide, we also use a number of additional tools to gain insight into real operational performance. We use Datadog to monitor all of our systems and we collect logs and errors into Kibana, Sentry, and other applications.
Taken together, these tools can provide insight into how our applications are doing, but we first need to know what to look for. The Engineering Manager must work with the team to define and track useful metrics for critical services owned by the team. The metrics can be technical, such as the amount of disk space used by an application, but we tend to aim for more business-related metrics, like the number of checkouts in the last hour. In the end, healthy systems are about the customer being able to do what they intended to do.
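As a rough illustration of a business-related health check, the sketch below flags a problem when checkouts in the last hour fall well below what is expected for that hour. The function name, the expected volume, and the 50% threshold are assumptions made for this example; in practice a check like this would be configured in the monitoring tool rather than written by hand.

```python
def checkout_volume_is_low(
    checkouts_last_hour: int,
    expected_checkouts: float,
    min_ratio: float = 0.5,
) -> bool:
    """Return True when checkouts drop below `min_ratio` of the expected volume.

    `expected_checkouts` is assumed to come from historical data for the
    same hour and weekday. With no expectation (zero or negative), the
    check stays quiet instead of alerting on noise.
    """
    if expected_checkouts <= 0:
        return False
    return checkouts_last_hour < expected_checkouts * min_ratio


if __name__ == "__main__":
    # 40 checkouts against an expected 100 is below the 50% threshold.
    print(checkout_volume_is_low(40, 100))
    # 80 checkouts against an expected 100 is within the normal range.
    print(checkout_volume_is_low(80, 100))
```

A check like this says nothing about the cause, which is the point: it fires whenever customers cannot do what they came to do, regardless of which technical component failed.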
Finding the right metrics takes some time, and the manager can help speed things up by running some brainstorming sessions with the team to figure out what to monitor. Another way to move fast is to ask people at other companies what they track and see if there is something we should pick up. In the end, there will be one or two metrics we find out about the hard way: by catching something late and learning from it. In this case, the manager needs to cultivate a team culture that focuses on what was learned and not what went wrong.
Alerting and incident management
Even with a solid base design and a good view of the health of systems, incidents will happen. GetYourGuide is constantly making improvements to the product, and we accept some risk in order to find the big wins. A key metric we look at in connection with incidents is how fast we are back on track, or the mean time to recovery.
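Mean time to recovery is simply the average duration between the start of an incident and its resolution. A minimal sketch of the calculation, assuming incidents are recorded as hypothetical (started, resolved) timestamp pairs:

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Incident = Tuple[datetime, datetime]  # (started, resolved)


def mean_time_to_recovery(incidents: List[Incident]) -> Optional[timedelta]:
    """Average time from incident start to resolution, or None if no incidents."""
    if not incidents:
        return None
    total = sum(
        (resolved - started for started, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


if __name__ == "__main__":
    # Made-up incidents for illustration: one lasting 30 minutes, one 90 minutes.
    incidents = [
        (datetime(2020, 1, 6, 10, 0), datetime(2020, 1, 6, 10, 30)),
        (datetime(2020, 1, 20, 14, 0), datetime(2020, 1, 20, 15, 30)),
    ]
    print(mean_time_to_recovery(incidents))
```

Tracking this number over time shows whether changes to alerting and on-call practice are actually shortening recovery.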
To help us recover quickly, teams have monitors connected to PagerDuty alerts and we use an on-call structure across the organization to immediately get the right person on the job. During an incident, the on-call person drives the initial work. Unless the Engineering Manager is on-call at the time, they will usually take a backseat. For larger incidents, we appoint an incident manager to handle coordination.
After every severe incident, we create a post-mortem document focused on:
What we can learn to make sure this type of incident can be avoided in the future.
What we can do, if anything, to shorten our recovery times.
We follow best practices around post-mortems by avoiding blame and focusing instead on learning and solutions. During follow-up, the Engineering Manager has a big role to play in making sure the team takes care of post-mortem actions in a timely manner.
Healthy systems increase our speed of execution
A lot of focus in engineering is on finding ways to increase our speed of software delivery in order to ship things that help our customers have incredible travel experiences. A key to moving fast is being able to focus and execute quickly on a problem. If your team is constantly interrupted by failing systems and unplanned firefighting, it is not moving very fast in practice.
By keeping a reasonable bar for quality and “knowing our systems’ health at any given time”, another engineering principle, we can avoid unplanned work and move faster. Problems are detected early, and, as a manager, you can help your team tackle them as part of the normal sprints.
Building a DevOps culture
Through well-defined ownership, clear service level agreements, and good monitoring, we have put most of the pieces in place for establishing a true DevOps culture where teams take end-to-end ownership of their services. The last, and most important, piece is the people in the teams and their leads. Our entire Engineering organization continuously strives to improve the quality of the systems we deploy and to hone our operations skills so we can quickly overcome any unforeseen issues that occur.
With a passion for in-depth ownership, shipping with quality, and operating systems with pride, our Engineering Managers excel in maintaining our systems’ health, the final component of the Engineering Manager framework.