With the recent release of some additional features and the availability of AWS Fault Injection Simulator in more regions, we thought it would be good to cover off this awesome new AWS feature.
What’s the problem with my AWS Architecture?
AWS provide the tools, patterns and services needed to build highly available, performant and resilient systems and platforms for your workloads. To give operations staff confidence in the design of an architecture, companies usually perform some kind of infrastructure testing, usually referred to as Chaos Engineering. These tests focus on trying to break the underlying architecture at specific points to test its resiliency, performance, scalability and availability.
The main goal of this approach is to identify poor architectural practices, or applications that don’t handle load, restarts or connection dropouts well.
So what’s new?
Up until now this has required either third-party tooling, a bespoke solution written by the infrastructure/platforms team, or a simple manual, repetitive process of terminating resources in the AWS console. This is often time consuming and involves installing custom tooling/services on production workloads to simulate scenarios such as high CPU or memory usage to stress test an environment. Furthermore, this approach carries the risk of targeting the wrong workloads or running poorly tested scenarios.
Fear not! AWS have a solution…AWS Fault Injection Simulator to the rescue! Now you can use an AWS managed service to throw everything you have at your AWS solutions. Some of these are shown in the screenshot below…
As you can see, there are some interesting scenarios you can throw at your workloads. Remember when you needed to test what would happen to your workload if it were to lose connectivity with your database? I do…it would involve going into the AWS console and manually changing security groups to block access in an attempt to simulate a network failure. Well, now you don’t need to do that – you can use AWS FIS to send a Network-Blackhole action to your EC2 instance to simulate the database connection dropping – all without having to manually amend production security group rules!
Some other very interesting actions available are:
- CPU Stress Test
- Kill Process
- Memory Stress
These make what was once a laborious task so much simpler. AWS FIS also makes this process a lot safer by letting you define Targets for these actions to run against. Now you can specify which AWS resources to run your test scenarios against by targeting things like ARN, tag or resource type – ensuring you’re only running Chaos Engineering against specified workloads.
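To make the idea of Targets a little more concrete, here is a minimal Python sketch that builds the kind of experiment template you might hand to the FIS `create_experiment_template` API. It is an illustration under assumptions, not a definitive setup: the function name, the target and action names (`taggedInstances`, `cpuStress`), the default region and the role ARN in the usage note are all hypothetical, and the CPU stress is assumed to run through the AWS-provided `AWSFIS-Run-CPU-Stress` SSM document.

```python
import json


def build_cpu_stress_template(role_arn, tag_key, tag_value,
                              region="eu-west-1", duration_minutes=2):
    """Build a request payload for an FIS experiment template.

    The target is selected purely by tag, so only instances that have
    been explicitly labelled for chaos testing can be affected.
    All names and defaults here are illustrative assumptions.
    """
    return {
        "description": "CPU stress on one tagged EC2 instance",
        "roleArn": role_arn,
        # No automatic stop condition in this sketch.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "taggedInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                # Pick a single matching instance rather than all of them.
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "cpuStress": {
                "actionId": "aws:ssm:send-command",
                "parameters": {
                    "documentArn": (
                        f"arn:aws:ssm:{region}::document/AWSFIS-Run-CPU-Stress"
                    ),
                    # Document parameters are passed as a JSON string.
                    "documentParameters": json.dumps(
                        {"DurationSeconds": str(duration_minutes * 60)}
                    ),
                    "duration": f"PT{duration_minutes}M",
                },
                # Map the action's "Instances" slot to the target above.
                "targets": {"Instances": "taggedInstances"},
            }
        },
    }
```

With boto3 installed and credentials configured, a payload like this could be passed along the lines of `boto3.client("fis").create_experiment_template(**build_cpu_stress_template(role_arn, "chaos-ready", "true"))`, where the tag key and value are whatever your team uses to mark workloads as safe to test.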
More recently, AWS FIS has introduced both container and Spot actions to help test your containerised (ECS & EKS) workloads.
This is a great feature that lets you properly test your ECS and EKS platforms, specifically those that run on Spot Instances, rather than trying to engineer the solution yourself through the console or Kubernetes administration commands.
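As a rough illustration of the Spot actions, the sketch below builds a target and an action that could be dropped into the `targets` and `actions` sections of an experiment template. The helper names and the tag used for selection are hypothetical; the action ID, the `SpotInstances` target key and the `durationBeforeInterruption` parameter reflect our understanding of the FIS Spot interruption action and are worth checking against the current action reference before use.

```python
def build_spot_target(tag_key, tag_value, percent=50):
    """Select a percentage of running Spot Instances carrying a given tag."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return {
        "resourceType": "aws:ec2:spot-instance",
        "resourceTags": {tag_key: tag_value},
        # Only consider instances that are currently running.
        "filters": [{"path": "State.Name", "values": ["running"]}],
        "selectionMode": f"PERCENT({percent})",
    }


def build_spot_interruption_action(target_name, notice_minutes=2):
    """Interrupt the targeted Spot Instances after a short notice period."""
    # Assumption: the notice period must sit between 2 and 15 minutes.
    if not 2 <= notice_minutes <= 15:
        raise ValueError("notice_minutes must be between 2 and 15")
    return {
        "actionId": "aws:ec2:send-spot-instance-interruptions",
        "parameters": {"durationBeforeInterruption": f"PT{notice_minutes}M"},
        "targets": {"SpotInstances": target_name},
    }
```

Selecting by percentage rather than by a fixed list of instances keeps the experiment meaningful as the cluster scales up and down.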
AWS FIS also allows you to monitor the status of a test scenario and integrate with Amazon CloudWatch to kick off tests at certain times or events. It’s also possible to chain actions together to recreate the fully fledged, real-world infrastructure problems that you might anticipate.
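One way to picture the chaining of actions is through the `startAfter` field of an experiment template action, which names the actions that must finish first. The sketch below is a small pure-Python helper, with hypothetical function and action names, that turns an unordered set of actions into a sequence; it also shows how a CloudWatch alarm could be expressed as a stop condition, assuming you already have an alarm ARN to hand.

```python
def chain_actions(actions, order):
    """Return a copy of `actions` where each one starts after the previous.

    `actions` maps action names to action definitions and `order` lists
    the names in the sequence they should run. The input is not modified.
    """
    chained = {}
    previous = None
    for name in order:
        action = dict(actions[name])  # shallow copy so the input is untouched
        if previous is not None:
            action["startAfter"] = [previous]
        chained[name] = action
        previous = name
    return chained


def alarm_stop_condition(alarm_arn):
    """Stop the experiment if the given CloudWatch alarm goes into ALARM."""
    return {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
```

For example, `chain_actions(actions, ["cpuStress", "killProcess", "rebootInstances"])` would leave the first action to start immediately and make each later action wait for the one before it.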
Automate AWS FIS
Finally, the true power of AWS FIS comes from its ability to be integrated into CI/CD pipelines as part of your build and deployment mechanisms. For example, when you build your Docker images and deploy them onto your test ECS cluster, you can add another step that runs your ECS resiliency test to ensure the application can handle reboots, network connectivity problems, processes being killed on EC2 instances, high CPU, high memory usage and so on.
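A pipeline step of this kind could look something like the following Python sketch, which starts an experiment from an existing template and polls until it reaches a final state. It is a simplified illustration: the function name, polling interval and timeout are assumptions, and the client is passed in so the step can be exercised with a stand-in object as well as a real boto3 FIS client.

```python
import time
import uuid

# Experiment states after which no further progress is expected.
TERMINAL_STATES = {"completed", "stopped", "failed"}


def run_experiment(fis_client, template_id, poll_seconds=15,
                   timeout_seconds=1800, sleep=time.sleep):
    """Start an FIS experiment and wait for it to finish.

    Returns the final status string. Raises TimeoutError if the
    experiment is still in progress after `timeout_seconds`.
    """
    response = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token for this run
        experimentTemplateId=template_id,
    )
    experiment_id = response["experiment"]["id"]

    waited = 0
    while True:
        state = fis_client.get_experiment(id=experiment_id)["experiment"]["state"]
        if state["status"] in TERMINAL_STATES:
            return state["status"]
        if waited >= timeout_seconds:
            raise TimeoutError(f"experiment {experiment_id} did not finish in time")
        sleep(poll_seconds)
        waited += poll_seconds
```

In a real pipeline you might call `run_experiment(boto3.client("fis"), template_id)` and fail the build unless the result is `"completed"`, ideally alongside your own application health checks, since a completed experiment only tells you the faults were injected, not that the application coped with them.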
Building this kind of test into your CI/CD pipelines will undoubtedly give product owners and operations staff within your organisation a real sense of confidence that your applications are highly performant and your platforms well architected.