COVID-19 rapidly accelerated e-commerce, almost shutting down in-person shopping during the lockdown. Ordering from Amazon became second nature. The online giant recorded approximately 33% growth, or 50 million new Amazon Prime members.
While there are concerns that consumer spending will fall short this year, Adobe’s 2022 holiday shopping predictions indicate that online discounts are already playing a big part in creating record sales. Retailers, in turn, are concerned about their sites’ availability. Are your stakeholders saying your service is running too slowly? Look at the latency metric and you may not see any change from the week before. With no real increase in latency, your team cannot tell what “running too slow” means.
Revisiting Technology Vendor Contracts
Page latency is not new. Many retailers have increased their website conversion by improving page load time. The problem with some sites is that no one has really defined what counts as fast enough and what counts as too slow. And, of course, every stakeholder has different expectations of what the service should deliver. Internet traffic surged during COVID-19, and networks responded at different levels. While a surge is not unusual, the rapidity of growth was staggering. Globally, traffic in April 2020 grew 38%, with upstream traffic skyrocketing by up to 123.18% before leveling off.
Service level objectives (SLOs) can create a common understanding of service capabilities and help define site requirements from the start. Unlike service level agreements (SLAs), which are engineering-driven, SLOs are a business decision that requires input from multiple stakeholders. SLOs allow retailers to match IT delivery models to business outcomes.
Uptime is only one of an organization’s business objectives. First, service level indicators (SLIs) must be measured to identify users’ expectations of reliability. When you combine these individual data points into your SLO, then you’re set to meet your team’s business goals.
SLOs for Existing and New Services
Once your SLOs have been established, you should review all your existing and new services to make sure that your systems are running according to the objective. Moving from SLAs to SLOs isn’t simply a rebranding exercise.
SLOs become the target for the availability of the system. Instead of looking at every individual data point, you can now look at the SLIs over a certain period, define an event and a success criterion, and report, for example, that 99% of operations succeeded in less than five seconds, if that’s your goal. The SLO indicates whether a large enough portion of your SLI events were good. At 99%, there’s still a gap, but aiming for 100% reliability is unrealistic.
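As a rough illustration of that calculation, the sketch below counts the good events in a measurement window and compares the resulting ratio with a 99% target. The `RequestEvent` shape, the function names, and the sample numbers are assumptions made up for this example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class RequestEvent:
    """One SLI event: a single user-facing operation (hypothetical shape)."""
    succeeded: bool
    latency_seconds: float


def is_good(event, max_latency_seconds=5.0):
    """Success criterion: the operation succeeded within the latency threshold."""
    return event.succeeded and event.latency_seconds < max_latency_seconds


def sli_ratio(events, max_latency_seconds=5.0):
    """Fraction of good events in the window; an empty window counts as fully good."""
    if not events:
        return 1.0
    good = sum(1 for e in events if is_good(e, max_latency_seconds))
    return good / len(events)


def meets_slo(events, target=0.99, max_latency_seconds=5.0):
    """True if the share of good events reaches the SLO target."""
    return sli_ratio(events, max_latency_seconds) >= target


# Illustrative window: 1,000 requests, 5 too slow and 3 failed.
window = (
    [RequestEvent(True, 0.8)] * 992
    + [RequestEvent(True, 7.5)] * 5
    + [RequestEvent(False, 1.2)] * 3
)
print(f"SLI: {sli_ratio(window):.3%}, SLO met: {meets_slo(window)}")
```

The point of the design is that no single slow request matters on its own; only the ratio over the whole window is compared with the objective.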
No retailer can guarantee that their system will be up and running 100% of the time. There are too many factors you can’t influence. Aiming for a higher number may also not serve your business case, as it adds unnecessary cost. Additionally, if your users do not expect that level of reliability from your service, it’s not prudent to make the investment. Your 1% gap, known as an error budget, is the number of negative events your system can withstand before violating your SLO: negative events that happen but are completely acceptable.
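The arithmetic behind an error budget can be sketched in a few lines. The function names and the traffic figures below are hypothetical, chosen only to show how a 99% target translates into a count of tolerated bad events.

```python
import math
from fractions import Fraction


def error_budget(total_events, slo_target):
    """Number of bad events allowed in the window before the SLO is violated."""
    # Convert via str() so 0.99 becomes exactly 99/100 and float rounding
    # cannot shave an event off the budget.
    target = Fraction(str(slo_target))
    return math.floor(total_events * (1 - target))


def budget_remaining(total_events, bad_events, slo_target):
    """Bad events still tolerable in this window; negative means the SLO is violated."""
    return error_budget(total_events, slo_target) - bad_events


# Illustrative month: 1,000,000 requests against a 99% SLO, 4,200 of them bad.
total, bad = 1_000_000, 4_200
print("allowed bad events:", error_budget(total, 0.99))
print("budget remaining:", budget_remaining(total, bad, 0.99))
```

With these assumed numbers the window tolerates 10,000 bad events, so 4,200 failures leave more than half of the budget unspent.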
Using SLOs and error budgets also means less noise for engineers. Engineers are often paged too much, which results in alert fatigue. With alerts tied to the error budget, you’ll have happier engineers who receive fewer alerts, are not paged in the middle of the night, and do not have to deal with alerts about small website spikes.
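One way to picture the reduction in noise is an alert rule that ignores small spikes and pages only once a set share of the error budget has been used. This is a minimal sketch under assumptions: the `should_page` function and the 75% threshold are hypothetical choices for illustration, not a recommended policy.

```python
import math
from fractions import Fraction


def should_page(total_events, bad_events, slo_target=0.99, page_at_fraction=0.75):
    """Page only once a set share of the window's error budget is consumed."""
    budget = math.floor(total_events * (1 - Fraction(str(slo_target))))
    if budget == 0:
        # No budget at all in this window: any bad event is worth a page.
        return bad_events > 0
    consumed = Fraction(bad_events, budget)
    return consumed >= Fraction(str(page_at_fraction))


# A small spike: 40 bad requests out of 100,000 uses 4% of the budget.
print(should_page(100_000, 40))
# A sustained problem: 800 bad requests uses 80% of the budget.
print(should_page(100_000, 800))
```

Under these assumptions the small spike stays silent and only the sustained problem wakes someone up.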
Implementing SLOs
While reliability is the single most important consideration when setting SLOs, teams also face internal development challenges. Engineers would like to prioritize paying down technical debt by reworking existing services, while the business side demands new features. SLOs should be based on your user expectations, not an arbitrary value.
Falling short of your reliability objectives also means that you’re not meeting user expectations. And not meeting user expectations has a direct business impact, because unhappy users might move away from your service. Make this direct link between resolving tech debt and having a positive business impact: resolving tech debt brings reliability back up to the value that was defined. This approach will help you prioritize between tech debt and new feature development.
Getting started with SLOs for existing services is a good jumping-off point. Most likely, there will already be informal expectations of what a service delivers. Document the services and think about what users expect. You won’t have to deliver anything near 100%. Making the informal formal, taking the user’s perspective, and not aiming for perfection are the areas to focus on when defining an SLO that delivers value. Once the objective is defined, deciding how often to measure, deciding what you want to achieve, and measuring the actual reliability of your service are the next logical steps to knowing how well you’re delivering against the objective.