AI-Powered Root Cause Analysis for Production Alerts
At PayPal, the SRE team troubleshoots production alerts from roughly 2,500 applications and services. Every alert carries an inherent urgency, and at times we are swamped with alerts that all require attention at once.
In this talk, we will share how we have started employing machine learning from the ground up to give our platform the ability to predict the probable root cause of an alert.
We will also elaborate on how we use existing troubleshooting results (from traditional programming) in machine learning to help improve the accuracy of the prediction, and cover the design, workings, and methodology of the experimental trials we ran to identify the best model. The model we built is integrated with our platform and reports the root cause in real time. It has been showing promising results and is a game changer for SREs.
This presentation will mainly walk you through the journey of how we built the machine learning models and deployed them in production.