How to build a Well-Architected SaaS solution that’s resilient to failure

Oct 2, 2020 | Announcements, Migration, MSP

Is your business part of the fast-growing Software-as-a-Service (SaaS) market? This market is on fire: Gartner recently cited SaaS as the largest segment (at 41%) of the worldwide public cloud market, forecasting nearly $105 billion in 2020 and growth to nearly $141 billion in 2022. That’s an 11.3% CAGR between 2019 and 2022.

A key factor driving this growth during 2020 is the global pandemic, as companies shift to remote work and collaboration, schools rely on SaaS for online classes, and consumers use SaaS applications for entertainment and communicating with friends and family. With so much dependence on SaaS, it’s more important than ever for SaaS solutions to be resilient to failure.

Resilience to failure, the ability to recover from infrastructure or service disruptions, is a primary component of the Reliability pillar of the AWS Well-Architected Framework. In this blog post, we’ll dive into how to architect a SaaS workload on AWS to be resilient to failure by applying Well-Architected best practices.

In his blog post, “10 Lessons from 10 Years of Amazon Web Services,” Amazon’s VP & CTO, Werner Vogels, says, “Failures are a given and everything will eventually fail over time … We needed to build systems that embrace failure as a natural occurrence even if we did not know what the failure might be. Systems need to keep running even if the ‘house is on fire.’”

At nClouds, we know that AWS Well-Architected is essential to achieving a customer-first culture. When we partner with our clients to build or improve a SaaS infrastructure so that it’s resilient to failure, we apply the best-practices guidance of the AWS Well-Architected Framework:

  • Implement redundancy to assure availability. In addition to including multiple AWS Availability Zones (AZs) with redundant instances in each AZ, we often implement redundancy in our clients’ architecture by using:
    • Multiple AWS Direct Connect or VPN tunnels between separately deployed private networks.
    • Amazon Virtual Private Cloud (Amazon VPC) endpoints to privately connect the VPC to AWS services.
    • AWS Transit Gateway to route traffic across multiple networks when we build networking connections using Amazon VPC peering, AWS Direct Connect, or an AWS Virtual Private Network (AWS VPN).
  • Design scalable and reliable workloads. We apply AWS services to obtain or scale resources, including:
    • Amazon Simple Storage Service (Amazon S3)
    • AWS Auto Scaling
    • AWS Software Development Kits (AWS SDKs)
    • Amazon CloudFront
    • AWS Lambda (Lambda)
    • Amazon DynamoDB
    • AWS Fargate
    • Amazon Route 53

    When building a microservices architecture, we use Amazon API Gateway (API Gateway) to handle acceptance and processing of up to hundreds of thousands of concurrent API calls. In a distributed system, we make component interactions asynchronous (when possible) by using Amazon Simple Queue Service (Amazon SQS) queues or Elastic Load Balancing. For event-driven architecture, we use Amazon EventBridge. When high-throughput, push-based, many-to-many messaging is required, we use Amazon Simple Notification Service (Amazon SNS).
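
To make the difference between queue-based and push-based messaging concrete, here is a minimal in-process sketch. It uses only the Python standard library and does not call any AWS API: the `Queue` and `Topic` classes, the queue names, and the event payload are hypothetical stand-ins for an Amazon SQS queue and an Amazon SNS topic, chosen purely for illustration.

```python
from collections import deque


class Queue:
    """Point-to-point buffer: each message is consumed by one worker."""

    def __init__(self, name):
        self.name = name
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Return None when the queue is empty instead of blocking.
        return self._messages.popleft() if self._messages else None

    def __len__(self):
        return len(self._messages)


class Topic:
    """Push-based fan-out: every subscribed queue gets its own copy."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, queue):
        self._subscribers.append(queue)

    def publish(self, message):
        for queue in self._subscribers:
            queue.send(message)


# Two independent consumers (hypothetical services) subscribe to one topic.
billing = Queue("billing")
email = Queue("email")
signups = Topic()
signups.subscribe(billing)
signups.subscribe(email)

# The producer publishes and moves on; it never waits for a consumer.
for tenant in ("tenant-a", "tenant-b"):
    signups.publish({"event": "signup", "tenant": tenant})

# The billing consumer drains its queue at its own pace.
processed = []
while len(billing) > 0:
    processed.append(billing.receive()["tenant"])
```

In this sketch the email queue still holds both messages after billing has finished, so a slow or failed consumer does not block the producer or the other consumer. That isolation is the property that asynchronous interactions are meant to provide.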

  • Improve mean time to recovery (MTTR). To handle an unexpected spike in demand, we use Amazon API Gateway to throttle requests; we can buffer requests using Amazon SQS or Amazon Kinesis. To control and limit retry calls, we use AWS SDKs. We use stateless services when possible so that servers can be replaced at will without causing an availability impact:
    • Amazon Elastic Compute Cloud (Amazon EC2)
    • AWS Fargate
    • Lambda
    • Elastic Load Balancing
    • Amazon Route 53
    • Amazon SQS
    • Amazon Kinesis
  • Monitor workload components. We use the nOps cloud management platform to monitor all workload components and external endpoints from remote locations. nOps’ monitoring and real-time alerts focus on:
    • Critical changes to resources, configurations, and security groups, with associated timelines.
    • Functionality of Amazon DynamoDB on-demand backup and restore.
    • Enablement of automated backups of Amazon RDS database instances.
    • Operability of Amazon Elastic Block Store (Amazon EBS) volumes with snapshots, to provide a baseline for new volumes or data backup.
    • Encryption of data at rest.
    • Backup policies for resources.
    • Amazon EBS volumes with public snapshots.
    • All scheduled events.
  • Use automation. To avoid error-prone human elements, we use AWS services, including:
    • AWS CodePipeline to automate the build, test, and deploy phases of release pipelines.
    • AWS CodeDeploy to automate software deployments to a variety of compute services.
    • AWS CloudFormation to automate operations and bring up new environments.
    • AWS OpsWorks’ scripted configuration to reduce errors and provide fine-grained permissions to improve control.
    • Amazon Aurora Serverless’s auto-scaling configuration to start up, shut down, and scale capacity up or down.
    • AWS Systems Manager to automate operational tasks across AWS resources.
    • AWS Backup to automatically back up data across AWS services.
    • AWS Shield to provide Distributed Denial of Service (DDoS) protection via inline mitigations that minimize application downtime and latency.
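
As an illustration of the retry control mentioned under “Improve mean time to recovery” above, the following is a minimal sketch of capped exponential backoff with jitter, a common way to limit retry calls. It is not the AWS SDKs’ own implementation: `TransientError`, `backoff_delay`, `call_with_retries`, and the default values for `base`, `cap`, and `max_attempts` are assumptions made for this example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a throttled call)."""


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Random delay between 0 and min(cap, base * 2**attempt) seconds."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(operation, max_attempts=4, sleep=time.sleep):
    """Run operation(), retrying transient failures a limited number of times."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Retry budget exhausted: surface the failure.
            sleep(backoff_delay(attempt))


# Usage: a hypothetical operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("throttled")
    return "ok"


delays = []  # Record the delays instead of actually sleeping.
result = call_with_retries(flaky, sleep=delays.append)
```

Capping both the delay and the number of attempts keeps a struggling dependency from being flooded with retries, and the random jitter spreads out clients that would otherwise retry at the same moment.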


In conclusion


When it comes to failure in the architecture of your SaaS product, the question isn’t whether something will fail, because something inevitably will at some point. It’s about building a Well-Architected solution that’s resilient to failure. You can mitigate failure by implementing redundancy to assure availability, designing scalable and reliable workloads, improving MTTR, monitoring workload components, and using AWS services for automation.

Need help building the right architecture for your SaaS solution on AWS? The nClouds team is here to help with that and all your AWS infrastructure requirements.




nClouds is a cloud-native services company that helps organizations maximize site uptime, performance, stability, and support, bringing out the best of their people and technology using AWS.