Resiliency Testing in Cloud Infrastructure for Distributed Systems

Ravikiran Karanjkar

doi:10.15662/IJRPETM.2022.0504007

Published: 2022-08-03

Keywords:

Resiliency testing, cloud infrastructure, distributed systems, fault tolerance, chaos engineering, failover mechanisms, site reliability engineering (SRE), high availability

Ravikiran Karanjkar

Quality Assurance Manager, Amazon Inc., USA

Abstract

Resiliency testing validates whether distributed cloud systems can withstand disruptions without compromising critical services. As enterprises increasingly adopt cloud-native architectures, resiliency becomes a cornerstone of operational excellence. This paper explores strategies for designing and executing resiliency tests, emphasizing architectural awareness, technology stack considerations, failover mechanisms, regional redirection, staged testing, layered backups, and customer-centric validation. Drawing on principles from site reliability engineering (SRE), chaos engineering, and distributed systems theory, it provides a comprehensive framework for organizations seeking to ensure high availability and fault tolerance. The work is based on resiliency testing experience with a cloud-based interoperable video conferencing solution.

Issue

Vol. 5 No. 4 (2022): The International Journal of Research Publications in Engineering, Technology and Management

Section

Articles

How to Cite

Karanjkar, R. (2022). Resiliency Testing in Cloud Infrastructure for Distributed Systems. International Journal of Research Publications in Engineering, Technology and Management (IJRPETM), 5(4), 7142-7144. https://doi.org/10.15662/IJRPETM.2022.0504007

