Resiliency Testing in Cloud Infrastructure for Distributed Systems
Abstract
Resiliency testing validates whether distributed cloud systems can withstand disruptions without compromising critical services. As enterprises increasingly adopt cloud-native architectures, resiliency becomes a cornerstone of operational excellence. This paper explores strategies for designing and executing resiliency tests, emphasizing architectural awareness, technology stack considerations, failover mechanisms, regional redirection, staged testing, layered backups, and customer-centric validation. Drawing on principles from site reliability engineering (SRE), chaos engineering, and distributed systems theory, it provides a comprehensive framework for organizations seeking to ensure high availability and fault tolerance. The paper is based on resiliency testing experience with a cloud-based interoperable video conferencing solution.