Software has undoubtedly become more complex over the last two decades. In 2000, it was estimated that there were 3.4 million internet users. A few years ago, Google was doing that many searches every minute. While not all websites operate at the scale of Google, expectations of all software have increased tremendously. To meet these expectations, organizations have had to invest in the reliability and operability of their systems. Zero-downtime deployments, high resilience to failure, and ease of restoring service are all key characteristics of any software system today.
More than ever, we need teams to think about building the right thing for our customers while also building to the right level of scalability and quality from the beginning. A new role is rising in popularity to support these needs: the Site Reliability Engineer, or SRE for short. The role was popularized by the 2016 book Site Reliability Engineering, which describes how Google runs its production systems.
While testing and test automation roles have a lot in common with SRE roles, they differ in some key areas. If you are an automation engineer looking to transition into an SRE role, just as I did last year, here are a few things I would recommend focusing on.
Automate more than tests
Google writes that “SRE is what you get when you treat operations as if it’s a software problem”. This means applying software engineering principles, such as automating repeatable tasks, test automation, and source control of changes, to operations challenges. Operational use cases being transformed by software engineering include server configuration and maintenance, dynamically scaling servers based on customer load, building and releasing software through automated Continuous Delivery, and supporting business continuity through activities like automated database backups and restore testing. If you are already an automation engineer, you have the mindset and skills to bring to this role: you know how to identify the high-value activities to automate and how to build maintainable solutions. Extend this to activities like building test environments, or even writing scripts that reduce cost by spinning down those same test environments outside business hours.
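As a rough illustration of the last idea, here is a minimal Python sketch of the scheduling decision behind spinning test environments down outside business hours. The business hours (Monday to Friday, 08:00 to 18:00) and the function names are assumptions made for this example, and the actual start and stop calls to a cloud provider are deliberately left out.

```python
from datetime import datetime, time

# Assumed business hours for this sketch: Monday-Friday, 08:00-18:00 local time.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)


def should_be_running(now: datetime) -> bool:
    """Return True if a test environment should be up at `now`."""
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and BUSINESS_START <= now.time() < BUSINESS_END


def plan_action(now: datetime, currently_running: bool) -> str:
    """Decide what a scheduled job should do: 'start', 'stop', or 'none'."""
    wanted = should_be_running(now)
    if wanted and not currently_running:
        return "start"
    if not wanted and currently_running:
        return "stop"
    return "none"
```

A scheduled job could call `plan_action(datetime.now(), currently_running)` every few minutes and hand the result to whatever tooling your team already uses to start and stop environments.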
Assess risk at scale
One of the activities SREs participate in is capacity planning. If you are an automation engineer who participates in performance testing, you may have already had these conversations. How fast can your service respond to requests? How many requests can your service handle in a given second or minute before errors show up? How many requests can your service handle before it crashes? These numbers can help shape capacity planning, but you also need to branch out further. You need to ask questions about your upstream and downstream dependencies: how will you handle unexpected load from them, or unexpected failures in them? You will need to think about recovering from downtime and what happens to all the requests you didn’t handle. Will your service be bombarded by queued requests being retried? Or do you start fresh with new requests and risk losing customers over the ones that were dropped? People like to say “one in a million” to indicate rare events. But there are almost 10 million people in London, so a one-in-a-million event would happen to roughly ten of them, which puts the need to handle “rare” events in perspective.
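To make these questions concrete, here is a small back-of-the-envelope sketch in Python. The function names and the example traffic figures are illustrative assumptions, not measurements from any real system, and the model is deliberately simple (constant arrival rate, every missed request retried).

```python
def expected_rare_events(total_events: int, one_in: int) -> float:
    """Expected number of occurrences of a 'one in N' event."""
    return total_events / one_in


def queued_requests(arrival_rate_per_s: float, outage_s: float) -> float:
    """Requests that pile up during an outage if every one is retried later."""
    return arrival_rate_per_s * outage_s


def drain_time_s(backlog: float, capacity_per_s: float,
                 arrival_rate_per_s: float) -> float:
    """Seconds needed to clear a backlog while new traffic keeps arriving."""
    spare = capacity_per_s - arrival_rate_per_s
    if spare <= 0:
        # No spare capacity: the backlog never drains.
        return float("inf")
    return backlog / spare


# Assumed figures: 200 requests/s arriving, a 5-minute outage,
# and a service that can handle 250 requests/s once it is back.
backlog = queued_requests(200, 300)
recovery = drain_time_s(backlog, 250, 200)
```

With these assumed numbers, a five-minute outage leaves 60,000 queued requests and takes 20 minutes to drain, four times longer than the outage itself. That is the kind of result that should prompt a conversation about retry policies and spare capacity.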
Tirelessly focus on end user experience
Software teams engage in all sorts of activities to approximate user experience, from creating personas to running user testing. One role of an SRE is to help true end user experience play a part in everyday decisions by software engineering teams. The tool SREs use for this is called the Service Level Objective, or SLO for short. An SLO is a target, measured through key indicators, which can alert teams to declining user experience. If you already write or design end-to-end tests that verify key user journeys, this is a great start. The ability for users to complete key functions is a part of their success, but so is the responsiveness of the site and the accuracy of the data returned. Therefore, as an SRE you will need to understand what user expectations are and how your team can measure them using production data, all without infringing on user privacy. The feedback from measuring these SLOs then feeds into team prioritization between defect management and shipping new features.
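As an illustration, here is a minimal sketch of how an availability-style SLO might be evaluated from request counts. The 99.9% target, the names, and the "error budget" framing are assumptions for this example. Error budgets are a common SRE practice for turning an SLO into a prioritization signal, but this is not presented as the one correct way to compute them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float               # measured fraction of good events
    target: float            # the objective, e.g. 0.999
    budget_remaining: float  # 1.0 = untouched, 0.0 = spent, < 0 = overspent
    met: bool


def evaluate_slo(good_events: int, total_events: int, target: float) -> SloReport:
    """Compare measured good/total events against an SLO target."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")

    sli = good_events / total_events
    allowed_bad = (1 - target) * total_events  # the error budget, in events
    actual_bad = total_events - good_events
    budget_remaining = 1 - actual_bad / allowed_bad
    return SloReport(sli, target, budget_remaining, sli >= target)
```

For example, `evaluate_slo(999_500, 1_000_000, 0.999)` describes a service that met its target with about half of its error budget left. A team in that position might keep shipping features, while a negative `budget_remaining` would be a signal to prioritize reliability work.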
Don’t be a bottleneck
Finally, SREs are not there to be gatekeepers for their organizations. At most organizations, SREs make up a small part of the engineering headcount, and in many ways success as an SRE is measured by how well other software engineers learn to build with reliability and scalability in mind. This most likely sounds very familiar if you are a test engineer. A big measure of success for testers is the ability to build a testing mindset into each and every software engineer. As an SRE you want to do the same thing, but with operability. Show engineers the value in writing good logs and metrics to monitor their changes in production. Encourage ongoing evaluation of performance to catch both slow declines and sudden impacts from new features. And most importantly, build relationships where people know the value you bring as an SRE and are comfortable reaching out for support when large-scale or risky production changes are being considered.
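One small, practical way to show the value of good logs is to make them structured, so they can be searched and turned into metrics. Below is a minimal sketch using Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `fields` attribute, and the field names are assumptions made for this example rather than an established convention.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # `fields` is a hypothetical attribute passed in through `extra=`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def log_request(logger: logging.Logger, route: str, status: int,
                duration_ms: float) -> None:
    """Log one handled request with fields that can be aggregated later."""
    logger.info(
        "request handled",
        extra={"fields": {"route": route, "status": status,
                          "duration_ms": duration_ms}},
    )
```

Attaching `JsonFormatter` to a handler and calling `log_request(logger, "/checkout", 200, 87.5)` would emit one JSON line per request. Avoid putting user-identifying data in these fields, in keeping with the privacy point made earlier.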
In conclusion, the SRE mindset is very similar to that of testing and test automation, and the two roles are distinct enough to work well in tandem. Many testers and automation engineers work tirelessly to build high-quality functionality for their users, while their SRE counterparts work equally hard to support the ecosystem that software needs in order to run at scale. As a tester, you have a chance to either reach out and work more closely with the SREs in your organization, or learn more about networking and server maintenance and join them. You are in a great place, because both roles will continue to bring user and business value for a long time.