Resiliency testing plays an important role in the successful performance of applications on the cloud. At least 77 percent of enterprises have at least one application, or a portion of their enterprise applications, running on cloud infrastructure. However, research shows that around two-thirds of these companies haven’t experienced any significant benefits in scalability, savings, or IT modernisation.
While cloud infrastructure does have the potential to make businesses leaner, more efficient, and secure, several underlying risks undermine its ability to do so. These risks range from a failure in performance of the migrated systems to underutilised business potential. Quite often, performance issues persist, especially in cases where on-premises applications and legacy systems are migrated without optimisation and testing.
For the purpose of this article, let’s review one of the most common failures seen in applications with high availability requirements. A closer look at the causes and possible methods of prevention could be eye-opening.
High availability is the ability of a system to provide smooth, uninterrupted service no matter what the conditions are. This is one of the most appreciated advantages of migrating to the cloud. High availability is typically achieved through an integrated approach involving high levels of automation and monitoring, load balancing, clustering, early detection of imminent failures, and automated failover to a secondary system in the event the primary fails. Importantly, all these precautionary activities take place at the infrastructure layer.
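To make the idea of automated failover concrete, here is a minimal Python sketch. It assumes a prioritised list of endpoints and a health-check function; the endpoint names and the `first_healthy` helper are hypothetical illustrations, not any cloud provider's API.

```python
from typing import Callable, Sequence


def first_healthy(endpoints: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first endpoint that passes its health check, in priority order."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint available")


# Hypothetical endpoints; the primary is simulated as being down.
status = {"primary.example.internal": False, "secondary.example.internal": True}
active = first_healthy(list(status), lambda name: status[name])
print(active)  # traffic is routed to the secondary
```

In a real deployment this decision is usually made by a load balancer or DNS health check rather than by application code, which is why it is described as an infrastructure-layer concern.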
Often, when migrating an on-premises system to the cloud, customers overlook the robustness and resiliency of their system on the assumption that the cloud provider is completely responsible for assuring over 99% uptime and availability. This assumption is only partly true. While cloud infrastructure providers do employ various strategies to guarantee high availability, application architecture and design for resiliency remain the responsibility of the customer. Since most customers are unaware of this requirement and are ill-prepared for resiliency, system availability is often compromised despite having the best cloud infrastructure provider at their service.
Application resilience refers to the ability of an application to continually provide and maintain acceptable levels of service even in the face of challenges and less than ideal operating conditions. In simpler words, it is an application’s ability to be prepared for disruptions and unforeseen changes to the environment. This includes its ability to recover from faults and, under extreme conditions, to degrade gracefully rather than fail outright.
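Fault recovery and graceful degradation can be illustrated with a short Python sketch. It assumes a simple policy of retrying with exponential backoff and then returning a reduced-quality fallback; the `call_with_fallback` helper and the cached-playlist example are hypothetical, and real applications may choose different policies.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(operation: Callable[[], T], fallback: Callable[[], T],
                       attempts: int = 3, base_delay: float = 0.1) -> T:
    """Try operation up to `attempts` times, then degrade to the fallback."""
    for attempt in range(attempts):
        try:
            return operation()  # recovered (or never failed)
        except Exception:
            if attempt < attempts - 1:
                # Back off before retrying: base_delay, 2x, 4x, ...
                time.sleep(base_delay * 2 ** attempt)
    return fallback()  # graceful degradation instead of an error


def always_down() -> str:
    raise ConnectionError("backend unreachable")  # simulated persistent fault


print(call_with_fallback(always_down, lambda: "cached playlist", base_delay=0.0))
```

The caller always receives an answer: the live result when the fault is transient, and a degraded but acceptable one when it is not.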
Whether an application is migrated from an on-premises environment to the cloud or has been developed natively, it runs the risk of failure during operation if it hasn’t been architected and remediated for resilience before going live. To realise the full benefits of cloud infrastructure, resiliency best practices are imperative.
Recently, the Swedish music streaming service Spotify experienced a rare outage during which its customers were unable to listen to their favourite songs for nearly an hour. The outage was caused by failures in several microservices triggered by a network problem. Unfortunately, the system had not been adequately architected for early detection of failure and recovery, or tested for resilience. This resulted in a cascade of downstream failures.
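One widely used pattern for containing this kind of cascade is the circuit breaker, which stops calling a failing dependency for a while so that callers fail fast instead of piling up. The Python sketch below is a generic illustration under assumed thresholds; it is not a description of how Spotify's systems work, and the class and parameter names are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # Open (or re-open after a failed trial) once the threshold is hit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each downstream call in `breaker.call(...)` means a struggling service is given time to recover, and its callers can switch to a fallback as soon as `CircuitOpenError` is raised.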
Resiliency testing checks a system’s capacity to remain operational without disrupting service under live conditions. One of the main challenges in resiliency testing is deciding how the cloud application can be tested, evaluated, and characterised. Conventional testing cannot adequately reveal application resiliency issues for the following reasons:
In late 2015, a major outage shook the foundations of Amazon Web Services and resulted in hours of downtime for most of the giant tech companies that depended on its service. However, Netflix emerged relatively unscathed with just a few minutes of downtime. The secret of Netflix’s success was the Netflix Simian Army, which ran outage and system failure simulations that alerted their cloud engineers to impending failure modes. Since then, this scenario has become a well-researched case study, forever replacing cloud migration complacency with a new norm: failure is the rule, not the exception.
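The idea of running failure simulations can be sketched in a few lines of Python. This is a toy fault-injection harness in the spirit of such tools, not Netflix's actual implementation: the `with_faults` wrapper, the `recommendations` function, and the 50 percent failure rate are all hypothetical.

```python
import random


class InjectedFault(Exception):
    """Raised by the harness to simulate a dependency outage."""


def with_faults(operation, failure_rate, rng):
    """Wrap operation so that a fraction of calls fail, simulating an outage."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated outage")
        return operation(*args, **kwargs)
    return wrapped


def recommendations(fetch):
    """Hypothetical code under test: degrade to a static list on failure."""
    try:
        return fetch()
    except Exception:
        return ["top-charts"]


rng = random.Random(42)  # seeded so the experiment is repeatable
flaky_fetch = with_faults(lambda: ["personalised-mix"], failure_rate=0.5, rng=rng)
results = [recommendations(flaky_fetch) for _ in range(100)]
degraded = sum(r == ["top-charts"] for r in results)
print(f"{degraded} of 100 calls served the degraded response; none raised")
```

The point of such an experiment is the assertion it supports: even with half of the dependency calls failing, every request still receives an acceptable response.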
In the cloud, application resiliency is both a challenge and a necessity because of the multi-tier, heavily integrated technology infrastructure and the distributed systems involved. The interplay of these varying elements can often cause surprising hiccups and outages even if the cloud infrastructure provider keeps its side of the bargain. Therefore, cloud engineers have to watch for imminent failure by using the following strategies to test, evaluate and characterise application resilience:
Organisations can be better prepared to reap the benefits of the cloud by adopting an architecture-driven testing approach that provides insights into cloud application resiliency well before the application goes live, leaving sufficient time to perform necessary remediations.