Title
AWS re:Invent 2023 - Improve application resilience with AWS Fault Injection Service (ARC317)
Summary
- Adrian Hornsby, a principal engineer with the AWS Reliability team, and Iris, a senior product manager, presented on improving application resilience using AWS Fault Injection Service (formerly known as AWS Fault Injection Simulator).
 - They discussed the high cost of downtime for enterprises and the importance of resilience in maintaining a good reputation and avoiding financial losses.
 - The concept of resilience was broken down into four pillars: anticipating, monitoring, responding, and learning.
 - AWS's Resilience Lifecycle Framework was introduced, emphasizing the continuous process of improving system resilience.
 - The importance of fault isolation boundaries, such as regions and availability zones (AZs), was highlighted, along with the concept of static stability in system design.
 - AWS Fault Injection Service (FIS) was presented as a tool for resilience testing, allowing controlled experiments to inject faults into systems to uncover hidden issues and improve operational practices.
 - Iris introduced new features in FIS, including Scenarios for predefined experiment templates and multi-account experiments.
 - Two new scenarios were announced: "AZ Availability: Power Interruption" and "Cross-Region: Connectivity", designed to test multi-AZ and multi-region applications.
 - The session concluded with a call to practice resilience and provided resources for further learning.
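
To make the idea of a controlled FIS experiment more concrete, here is a minimal sketch of what an experiment template might look like when built as a Python dictionary. It is not taken from the session. The function name, tag key and value, account ID, role ARN, and alarm ARN are all hypothetical placeholders. The field names and the `aws:ec2:stop-instances` action follow the shape of the FIS experiment template format as we understand it, so check them against the current FIS documentation before use.

```python
import json


def build_stop_instances_template(role_arn, alarm_arn, az,
                                  tag_key="Env", tag_value="test"):
    """Return a dict shaped like an FIS experiment template.

    Hypothetical helper for illustration only: it stops half of the
    running, tagged EC2 instances in one AZ and restarts them after
    five minutes.
    """
    return {
        "description": f"Stop half of the tagged EC2 instances in {az}",
        "roleArn": role_arn,  # IAM role that FIS assumes (placeholder)
        "targets": {
            "tagged-instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "filters": [
                    {"path": "Placement.AvailabilityZone", "values": [az]},
                    {"path": "State.Name", "values": ["running"]},
                ],
                "selectionMode": "PERCENT(50)",
            }
        },
        "actions": {
            "stop-instances": {
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration: restart the instances after 5 minutes
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "tagged-instances"},
            }
        },
        # Guardrail: FIS halts the experiment if this alarm fires
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "tags": {"Name": "stop-instances-sketch"},
    }


template = build_stop_instances_template(
    role_arn="arn:aws:iam::111122223333:role/fis-experiment-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:high-error-rate",
    az="us-east-1a",
)
print(json.dumps(template, indent=2))
```

With boto3 installed and credentials configured, a dictionary of this shape could in principle be passed to `boto3.client("fis").create_experiment_template(clientToken=..., **template)`. The stop condition matters most here: it ties the experiment to an alarm so the fault injection stays controlled.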
 
Insights
- Downtime is costly: enterprises can lose hundreds of thousands to millions of dollars per hour.
 - Resilience is not just about technology; it involves culture, mechanisms, and tools.
 - The Resilience Lifecycle Framework is a holistic approach to improving system resilience, which can be entered at any stage.
 - Fault isolation boundaries are crucial for minimizing the impact of failures, and statically stable designs let systems absorb a failure or traffic surge without depending on control plane operations.
 - AWS FIS is a powerful tool for resilience testing, allowing users to simulate faults and improve their systems' reliability and performance.
 - The introduction of Scenarios in FIS simplifies the process for customers to start testing their applications' resilience.
 - The new multi-AZ and multi-region scenarios in FIS enable customers to test complex applications and ensure they can handle real-world failure modes.
 - Building resilience is an ongoing process that requires regular practice and testing to ensure systems can withstand and recover from failures.
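
The static stability point above can be illustrated with a short capacity calculation. This is a sketch under stated assumptions, not a figure from the session: it assumes load spreads evenly across AZs and that the goal is to survive the loss of one AZ without launching new capacity, which would require the control plane. The function name and the peak value of 12 instances are hypothetical.

```python
import math


def per_az_capacity(peak_instances, az_count, az_failures_tolerated=1):
    """Instances to pre-provision in each AZ so the surviving AZs can
    carry peak load without scaling during the failure."""
    if az_count <= az_failures_tolerated:
        raise ValueError("need more AZs than tolerated AZ failures")
    surviving = az_count - az_failures_tolerated
    return math.ceil(peak_instances / surviving)


peak = 12  # hypothetical number of instances needed at peak load
for azs in (2, 3):
    per_az = per_az_capacity(peak, azs)
    total = per_az * azs
    print(f"{azs} AZs: {per_az} per AZ, {total} total "
          f"({total / peak:.0%} of peak)")
```

Under these assumptions, spreading across three AZs needs 150% of peak capacity while two AZs need 200%. That over-provisioning is the price of not relying on control plane operations during a failure.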