From Reactive to Predictive Reliability: Architecting AI-Driven Incident Management Pipelines for Enterprise-Scale SRE

Arun Pandiyan Perumal

doi:10.47363/JAICC/ICAICC/2025(4)46

Authors

Arun Pandiyan Perumal, Mountain House, CA, USA

DOI:

https://doi.org/10.47363/JAICC/ICAICC/2025(4)46

Keywords:

Predictive Reliability, Architecting AI-Driven

Abstract

Traditional reactive approaches to reliability are becoming inadequate as IT systems grow more complex, spanning
microservices, multi-cloud strategies, and extensive container orchestration. In this talk, I will share how AI-driven incident
management pipelines transform reactive workflows into predictive, self-healing systems. We will start by exploring the core AIOps
pillars (data ingestion, anomaly detection, event correlation, root-cause analysis, and automated remediation) and see how they
map to SRE principles like error budgets and Service Level Objectives. You will discover a reference architecture that unifies
multi-cloud observability, integrates open-source and commercial tools, and embeds machine learning models into CI/CD and ticketing
pipelines. Along the way, I will highlight strategies for combating alert fatigue, handling model drift, and optimizing costs at
scale. By the end, you will understand how to harness predictive insights to reduce your MTTD and MTTR, minimize toil, and deliver
enterprise-grade reliability with confidence.
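
To make the link between the anomaly detection pillar and SRE error budgets concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the architecture presented in the talk: the `Slo` class, the function names, the request-based budget, and the z-score threshold of 3.0 are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Slo:
    """Hypothetical request-based SLO (names are assumptions for this sketch)."""
    target: float          # e.g. 0.999 means 99.9% of requests must succeed
    window_requests: int   # total requests expected in the SLO window


def error_budget(slo: Slo) -> float:
    """Number of failed requests the SLO tolerates over the window."""
    return (1.0 - slo.target) * slo.window_requests


def budget_remaining(slo: Slo, failed_requests: int) -> float:
    """Fraction of the error budget left (negative once it is overspent)."""
    return 1.0 - failed_requests / error_budget(slo)


def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `threshold` standard deviations
    from the mean of `history` (a simple z-score check, assumed here as a
    stand-in for whatever detection model a real pipeline would use)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return latest != history[0]
    return abs(latest - mean(history)) / sigma > threshold
```

A pipeline built along these lines might page a human only when a metric is anomalous and the remaining budget is low, which is one possible way to reduce alert fatigue; the right thresholds would depend on the service.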

Author Biography

Arun Pandiyan Perumal, Mountain House, CA, USA

Published

2025-05-10

Issue

Vol. 4 No. 3 (2025): Conference Proceedings: International Conference on Artificial Intelligence and Cloud Computing (ICAICC-2025)

Section

Articles

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

From Reactive to Predictive Reliability: Architecting AI-Driven Incident Management Pipelines for Enterprise-Scale SRE. (2025). Journal of Artificial Intelligence & Cloud Computing, 4(3), 1-1. https://doi.org/10.47363/JAICC/ICAICC/2025(4)46
