From Reactive to Predictive Reliability: Architecting AI-Driven Incident Management Pipelines for Enterprise-Scale SRE

Authors

  • Arun Pandiyan Perumal, Mountain House, CA, USA

DOI:

https://doi.org/10.47363/JAICC/ICAICC/2025(4)46

Keywords:

Predictive Reliability, Architecting AI-Driven

Abstract

Traditional reactive approaches to reliability are becoming inadequate as IT systems grow more complex, spanning microservices,
multi-cloud strategies, and extensive container orchestration. In this talk, I will share how AI-driven incident management
pipelines transform reactive workflows into predictive, self-healing systems. We will start by exploring the core AIOps pillars
(data ingestion, anomaly detection, event correlation, root-cause analysis, and automated remediation) and see how they map to
SRE principles such as error budgets and Service Level Objectives (SLOs). You will see a reference architecture that unifies
multi-cloud observability, integrates open-source and commercial tools, and embeds machine learning models into CI/CD and
ticketing pipelines. Along the way, I will highlight strategies for combating alert fatigue, handling model drift, and
optimizing costs at scale. By the end, you will understand how to use predictive insights to reduce MTTD and MTTR, minimize
toil, and deliver enterprise-grade reliability with confidence.

Author Biography

  • Arun Pandiyan Perumal, Mountain House, CA, USA

Published

2025-05-10

How to Cite

From Reactive to Predictive Reliability: Architecting AI-Driven Incident Management Pipelines for Enterprise-Scale SRE. (2025). Journal of Artificial Intelligence & Cloud Computing, 4(3), 1-1. https://doi.org/10.47363/JAICC/ICAICC/2025(4)46
