From Reactive to Predictive Reliability: Architecting AI-Driven Incident Management Pipelines for Enterprise-Scale SRE
DOI: https://doi.org/10.47363/JAICC/ICAICC/2025(4)46
Keywords: Predictive Reliability, Architecting AI-Driven
Abstract
Traditional reactive approaches to reliability are becoming inadequate due to the escalating complexity of IT systems encompassing
microservices, multi-cloud strategies, and extensive container orchestration. In this talk, I will share how AI-driven incident
management pipelines transform reactive workflows into predictive, self-healing systems. We will start by exploring the core AIOps
pillars (data ingestion, anomaly detection, event correlation, root-cause analysis, and automated remediation) and see how they
map to SRE principles like error budgets and Service Level Objectives. You will discover a reference architecture that unifies
multi-cloud observability, integrates open-source and commercial tools, and embeds machine learning models into CI/CD and ticketing
pipelines. Along the way, I will highlight strategies for combating alert fatigue, handling model drift, and optimizing costs at
scale. By the end, you will understand how to harness predictive insights to reduce your MTTD/MTTR, minimize toil, and deliver
enterprise-grade reliability with confidence.
License
Copyright (c) 2025 Journal of Artificial Intelligence & Cloud Computing

This work is licensed under a Creative Commons Attribution 4.0 International License.