Resilience and Chaos Engineering in the Cloud
At Hotels.com (part of Expedia Group) we run microservices and infrastructure in production at large scale. Where applications previously ran on fixed hosts for their entire lifetime, moving our services to AWS and Kubernetes has presented us with a whole new set of challenges that we must be prepared for.
Every production incident impacts not only our revenue but also our customers' trust. In an effort to build resilience into our highly performant and highly scalable services, we at Expedia Group explored processes and tools to stress and 'break' our systems on purpose, without impacting production.
In this talk, we'll walk you through the following:
- Parallels between Resilience Engineering in tech and in other industries
- Why resilience matters and how we can get better at it
- Why we need Chaos Engineering, with practical examples in the cloud and on Kubernetes
- State of the art and current limitations