Digital disruptions have reached alarming levels. Incident response in modern application environments is frequent, time-consuming and labor-intensive. Our team has first-hand experience dealing with the far-reaching impacts of these disruptions and outages, having spent decades in IT Ops. PagerDuty recently published a study [1] that shines a light on how broken our existing incident response systems and practices are. The recent CrowdStrike debacle is further evidence of this. Even with all the investment in observability, AIOps, automation and playbooks, things aren’t improving. In some ways, they are worse: we are collecting more and more data and we are overloaded with tooling, which creates confusion among users and teams who struggle to understand the holistic environment and all of its interdependencies. With a mean resolution time of 175 minutes, each customer-impacting digital incident costs both time and money. The industry needs to reset and revisit current processes so we can evolve and change the trajectory.
The Impact of Outages and Application Downtime
Outages erode customer trust. 90% of IT leaders report that digital disruptions have reduced customer confidence. Protecting sensitive data, ensuring swift service restoration and providing real-time customer updates are essential for maintaining trust when digital incidents happen. After an incident, thorough, action-oriented postmortems are critical to preventing recurrences. And, at risk of reinforcing the obvious, IT organizations need to put operational practices in place to keep outages from happening in the first place.
Yet even though IT leaders understand the implications for customer confidence, incident frequency continues to rise. 59% of IT leaders report an increase in customer-impacting incidents, and it is not going to get better unless we change the way we observe and mitigate problems in our applications.
Automation Can Help, But Adoption is Slow
Despite the growing threat, many organizations are lagging in incident response automation:
- Over 70% of IT leaders report that key incident response tasks are not yet fully automated
- 38% of responders’ time is spent dealing with manual incident response processes
- Organizations with manual processes take on average three hours 58 minutes to resolve customer-impacting incidents, compared to two hours 40 minutes for those with automated processes.
It doesn’t take an IT expert to know that having responders spend more than a third of their time on manual processes is a waste of resources. And those that have automated operations are still taking almost three hours to resolve incidents. Why is incident response still so slow?
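For a rough sense of scale, the two resolution times quoted above can be converted to minutes and compared. This is only back-of-the-envelope arithmetic on the study figures already cited; the variable names are illustrative.

```python
# Resolution times quoted in the list above, converted to minutes.
manual_minutes = 3 * 60 + 58      # three hours 58 minutes
automated_minutes = 2 * 60 + 40   # two hours 40 minutes

# Time saved per incident, in minutes and as a share of the manual figure.
saved_minutes = manual_minutes - automated_minutes
saved_fraction = saved_minutes / manual_minutes

print(f"manual: {manual_minutes} min, automated: {automated_minutes} min")
print(f"saved per incident: {saved_minutes} min ({saved_fraction:.0%})")
```

Automation trims roughly a third off the resolution time, yet the automated figure is still well over two and a half hours.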
It is not just about process automation. We also need to accelerate decision automation, driven by a deep understanding of the state of applications and infrastructure.
Causal AI for DevOps: The Missing Link
Causal AI for DevOps promises a bridge between observability and automated digital incident response. By ‘Causal AI for DevOps’ I mean causal reasoning software that applies machine learning (ML) to automatically capture cause-and-effect relationships. Causal AI has the potential to help dev and ops teams better plan for changes to code, configurations or load patterns, so they can stay focused on achieving service-level and business objectives instead of firefighting.
With Causal AI for DevOps, many of the incident response tasks that are currently manual can be automated:
- When service entities are degraded or failing and affecting other entities that make up business services, causal reasoning software surfaces the relationship between the problem and the symptoms it is causing.
- The team with responsibility for the failing or degraded service is immediately notified so they can get to work resolving the problem. Some problems can be remediated automatically.
- Notifications can be sent to end users and other stakeholders, letting them know that their services are affected along with an explanation for why this occurred and when things will be back to normal.
- Postmortem documentation is automatically generated.
- There are no more complex triage processes that would otherwise involve multiple teams and managers to orchestrate. Digital incidents and outages are reduced, and root cause analysis is automated, so DevOps teams spend less time troubleshooting and more time shipping code.
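To make the first item in the list above more concrete, here is a deliberately simplified sketch of separating a problem from its symptoms. It is not how any particular product works: it swaps real causal inference for a plain dependency-graph heuristic, and the service names, the `DEPENDS_ON` map, and both helper functions are hypothetical.

```python
from collections import deque

# Hypothetical dependency map: each entity lists the entities it depends on.
DEPENDS_ON = {
    "checkout-ui": ["checkout-api"],
    "checkout-api": ["payments", "inventory"],
    "payments": ["postgres"],
    "inventory": ["postgres"],
    "postgres": [],
}

def likely_root_causes(degraded, depends_on):
    """Return degraded entities none of whose direct dependencies are degraded."""
    return sorted(
        entity for entity in degraded
        if not any(dep in degraded for dep in depends_on.get(entity, []))
    )

def affected_by(root, depends_on):
    """Return every entity that depends, directly or transitively, on root."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {}
    for entity, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(entity)
    # Breadth-first walk over the inverted edges.
    seen, queue = set(), deque([root])
    while queue:
        for parent in dependents.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)

# Assumed observation: these entities are currently reporting degradation.
degraded = {"checkout-ui", "checkout-api", "payments", "postgres"}
roots = likely_root_causes(degraded, DEPENDS_ON)
for root in roots:
    print(root, "->", affected_by(root, DEPENDS_ON))
```

In this toy setup, the only degraded entity with no degraded dependency is the database, so it is flagged as the likely problem and the services above it are treated as symptoms. A real causal model would also have to weigh evidence, timing and uncertainty, which this sketch ignores.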
It is Time for Automated Incident Response
It is time to shift from manual to automated incident response. Causal AI for DevOps can help teams prevent outages, reduce risk, cut costs and build sustainable customer trust. This is a topic we care about at Causely, where we are building a Causal Reasoning Platform to help organizations ensure continuous application reliability and eliminate human troubleshooting. You can learn more about us and our platform at Causely.io.