Part 2 – General Design Principles
AWS Well-Architected Framework Design Principles
In Part 1 of this multi-part series, we explored what the AWS Well-Architected Framework is all about and why it is such an invaluable tool for any company with new or existing workloads on AWS. The ability to learn from the trials and tribulations of thousands of other AWS customers, so you don’t have to go through the same growing pains to learn best practices for yourself, can save you time, money, and a lot of frustration.
This series takes a layered approach to our deep-dive exploration of the AWS Well-Architected Framework.
If you haven’t read Part 1 of this series yet, take a moment now so you have the important foundational context as we peel back the layers and build on our understanding of the framework and all its parts.
Also, if you haven’t already, sign up for our newsletter at the bottom of this post!
You’ll get notified about the next parts to this series and get access to other exciting Cloud and DevOps content!
So, now onto the good stuff! The AWS Well-Architected Framework and the General Design Principles that are at its core.
These principles are the core of the framework and important for us to understand. They’re our blueprint or ‘guiding light’ for our cloud design journey. They help to make sure we’re on the right path when building in the AWS cloud.
So, let’s learn what these principles are all about and how we can make use of them when reviewing our own AWS workloads.
There are six General Design Principles of the AWS Well-Architected Framework:
- Stop guessing your capacity needs
- Test systems at production scale
- Automate to make architectural experimentation easier
- Allow for evolutionary architectures
- Drive architectures using data
- Improve through game days
Ok, simple enough. Even if you’re new to cloud computing and AWS, these likely sound like some good practices and sound advice. But why of all things are these the core principles of the well-architected framework? What is AWS trying to get us to think about when making cloud design choices through these general design principles?
Taking our same layered approach to this series, let’s go through each of these to help us understand them better.
Stop Guessing your Capacity Needs
When designing infrastructure to support your IT workloads, you need to consider the resource capacity necessary to handle the current and future load on these systems. Getting this wrong can lead to very expensive idle resources that do nothing except eat into your operating budgets.
In traditional IT environments, you’d also have thousands or maybe millions of dollars tied up as capital expenses in your data center for the equipment to support your planned capacity requirements.
Or, on the other hand, maybe costs are a big concern, leading to conservative forecasts of your infrastructure’s capacity needs to avoid the potential for money wasted on idle IT infrastructure.
What if there’s some unexpected increase in user traffic or overall system load as the business grows faster than expected?
IT capacity planning is a guessing game. A dangerous game at that.
You could be wasting thousands of dollars on resources that aren’t needed anymore. Or, even worse, you could be putting your critical business applications at risk, all because you don’t have the IT capacity to handle sudden spikes in load on your systems.
Instead, we should be designing our workloads to take advantage of the on-demand usage model and scaling capabilities of AWS.
When our cloud workloads are designed to be flexible and leverage the auto-scaling and other automation capabilities of AWS, we no longer have to guess our capacity needs.
With the AWS Well-Architected Framework, we can build cloud designs that automatically adapt to changes in the capacity needs of your business. Resources scale out to meet your unexpected load increases and scale back in when demand decreases.
Avoid costly guesses on capacity; build your cloud infrastructure to adapt in real time to changes in system demand.
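As a rough illustration of what “no more guessing” means in practice, here is a minimal Python sketch of the proportional target-tracking logic that services like AWS Auto Scaling apply for you. The CPU target, instance counts, and scaling limits below are all hypothetical numbers for illustration only.

```python
import math

def desired_capacity(current_instances: int, avg_cpu_percent: float,
                     target_cpu_percent: float = 50.0,
                     min_instances: int = 2, max_instances: int = 20) -> int:
    """Instance count needed to bring average CPU back near the target."""
    # Target tracking scales proportionally to the ratio of actual load to target load.
    needed = math.ceil(current_instances * (avg_cpu_percent / target_cpu_percent))
    # Never scale below the floor or above the ceiling.
    return max(min_instances, min(max_instances, needed))

# Unexpected traffic spike: 4 instances averaging 90% CPU
print(desired_capacity(4, 90.0))   # scales out to 8 instances
# Demand drops overnight: 8 instances idling at 10% CPU
print(desired_capacity(8, 10.0))   # scales back in to the floor of 2 instances
```

Instead of a capacity forecast made months in advance, the environment continuously reacts to the load it actually observes.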
Test Systems at Production Scale
Testing changes to important business infrastructure and applications is critical before making those changes in your production environment. But more often than not, test environments look very different from the production environment’s configuration.
There may be significant differences in the networking design, the resources being used (type, size, configuration), and the overall scale of the test environment.
But there’s often a good reason why test environments are not mirrored to production. Typically, that reason is cost.
Some production environments can be massive. They may consist of hundreds or thousands of servers and large databases to power your production workload demand. Running a second copy of our production environment all the time as a test environment essentially doubles our production infrastructure costs. The expense associated with this often forces companies to run very small, scaled-down test environments to verify important infrastructure or application changes.
If the infrastructure differs, how can you confidently say the successful testing you did in the test environment will also be successful in production?
What if we consider performance aspects here? If you have a small functional test environment, how do you know the application or infrastructure will work properly at production scale?
This General Design Principle is guiding us to do testing at the full production scale so we don’t have those blind spots inherent with doing testing in smaller functional test environments.
Well, ok, sure, this sounds great. Of course, everyone would love to have production-scale test environments. Remember though that the barrier to doing this in the first place was cost!
It’s unlikely you’d be able to justify having two or more production-like environments running all the time to perform your testing against.
However, in the AWS cloud, you can again take advantage of the on-demand resources and the fact that you only pay for what you use. If we then leverage infrastructure-as-code (IaC) and other automations, we could likely have full production-scale environments built from scratch in a matter of minutes.
So while a full production-scale environment sounds cost prohibitive, if we say our automated test runs take 10 minutes to complete, our new reality in the AWS cloud may look something like this:
- Deploy the test environment using IaC and other pipeline automations (10 minutes)
- Run performance tests against the test environment (10 minutes)
- Delete the test environment using IaC and other pipeline automations (5 minutes)
Now if the resources that comprise our production-scale environment cost $100 an hour to run, we could run our performance tests against a full mirror of our production environment setup, at the same scale, for under $50.
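Putting those numbers together (using the same hypothetical $100/hour rate and timings from the example above):

```python
# Back-of-the-envelope cost of the short-lived, production-scale test
# environment described above. The $100/hour rate and the timings are the
# hypothetical figures from the example.

HOURLY_RATE = 100.0    # USD per hour for the full production-scale environment

deploy_minutes = 10    # deploy test environment via IaC / pipeline automation
test_minutes = 10      # run the performance tests
teardown_minutes = 5   # delete the test environment

total_hours = (deploy_minutes + test_minutes + teardown_minutes) / 60
cost = total_hours * HOURLY_RATE
print(f"Run time: {total_hours * 60:.0f} minutes, cost: ${cost:.2f}")
# → Run time: 25 minutes, cost: $41.67
```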
Not too bad at all considering the benefits of testing against this type of full production environment.
This of course is a simple example here, but even if your testing takes a few hours to complete, there is still tremendous value in using a full environment like this compared to the costs involved.
If we contrast those costs with what it would cost using a traditional on-premises datacenter environment, imagine the capital and operating expense tied up in operating a production-scale environment for testing purposes. We’d be looking at tens of thousands of dollars or more for large environments.
With AWS, we can take advantage of on-demand resources to quickly build out production-scale test environments, run our tests and validations, then shut it all down, paying only for the brief time we used them.
In fact, this whole testing lifecycle of standing up the test environment, running the tests, verifying the results, then shutting it all down, can be automated end-to-end. You’re now able to test more often, at production scale, for very little cost considering the value and potential speed and reliability this could add to your product feature releases and infrastructure changes.
And with all this focus on automation here, what a great segue into the next design principle…
Automate to Make Architectural Experimentation Easier
A lot of the concepts we just explored with the test at production scale design principle carry over here as well.
This general design principle uses the same idea of being able to leverage automation, combined with the on-demand and pay-as-you-go usage and pricing models of the AWS cloud, to achieve a lot of flexibility with your cloud infrastructure.
With AWS, there are no long-term commitments necessary for using most AWS services. This lets you try things out with little risk. This flexibility, with no long-term cost implications, facilitates easy experimentation within your AWS cloud environments.
If you’ve ever wondered how switching storage solutions from Amazon Elastic Block Store volumes to Amazon Elastic File System would impact your application, or what switching the instance types or sizes of your Amazon RDS database would do to the cost-to-performance ratio of running your databases…
Well, try it out and see!
To tie this in with the previous testing-at-scale principle again: leveraging your IaC and automation tools, you could quickly deploy a new test environment with the changes you would like to try out. Then you can quickly and reliably experiment with a variety of configurations and designs. All this experimentation can help you make data-driven decisions about which design is the most optimal for your workload and the business objectives at play.
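For example, if three experiments each produced a cost and a throughput figure, a quick script can rank the designs by cost-to-performance. The instance names below are real RDS instance classes, but every price and benchmark number is invented for illustration.

```python
# Hypothetical experiment results: hourly cost and measured throughput for
# three candidate database configurations. All numbers are made up.
experiments = {
    "db.m5.large":  {"hourly_cost": 0.171, "requests_per_sec": 1200},
    "db.m5.xlarge": {"hourly_cost": 0.342, "requests_per_sec": 2100},
    "db.r5.large":  {"hourly_cost": 0.240, "requests_per_sec": 1900},
}

def cost_per_million_requests(hourly_cost: float, requests_per_sec: float) -> float:
    """Cost to serve one million requests: lower is a better cost-to-performance ratio."""
    requests_per_hour = requests_per_sec * 3600
    return hourly_cost / requests_per_hour * 1_000_000

# Rank the candidates from best to worst cost-to-performance.
ranked = sorted(experiments.items(),
                key=lambda kv: cost_per_million_requests(**kv[1]))
for name, stats in ranked:
    print(f"{name}: ${cost_per_million_requests(**stats):.3f} per million requests")
```

Run the experiments, feed in the real numbers, and the winner falls out of the data rather than out of a guess.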
Again, in the AWS cloud, you don’t have to wait weeks or months to purchase and rack-and-stack expensive IT hardware like you would with a traditional on-premises datacenter environment. You’re not buying and waiting for physical servers, storage systems, and networking gear before you can try new configurations and infrastructure designs.
For the failed experiments, you can quickly shut down those resources through automation, and you stop paying for them.
Allow for Evolutionary Architectures
Our next general design principle of the AWS Well-Architected Framework is, well, a bit of an evolution of the ones we’ve just looked at so far.
Frequent experimentation with your cloud workload designs, combined with the ability to test those designs at full production scale, gives you the confidence that you have the optimal design for your needs and makes it easier to evolve your existing cloud architecture designs.
So while this general design principle may be seen as almost a byproduct of the experimentation and testing principles, the key thing here is allowing for evolutionary architectures of your cloud workloads.
Your business evolves over time. Your applications evolve, your products evolve, and your customers evolve. Factor in the rapid changes in the cloud technology ecosystem and the blistering pace at which AWS releases new features or completely new service offerings, and it’s hard to imagine that a static architecture that works for your business today will meet your needs tomorrow.
You need to take advantage of the latest technology innovations and change your cloud architectures accordingly to keep up with new business demands, continuously optimizing for better reliability, security, operational ease, lower costs, performance efficiency, and a reduced environmental impact from your cloud infrastructure.
With the AWS cloud, you’re able to quickly experiment and test with the latest technologies for very little cost, allowing your business to adapt quickly to changing requirements with little risk.
Drive Architectures Using Data
In the AWS cloud, your infrastructure design, and the resources deployed from it, can all be expressed as code. You can use infrastructure-as-code to deploy your environments and then collect a tremendous amount of data from the logs and performance metrics of all the infrastructure and service resources you use.
Similar to the stop-guessing-your-capacity-needs design principle, using all the data sources available to you in the AWS cloud helps you avoid guessing about your architectural designs as well.
If you collect all the performance data from your infrastructure resources and applications running in the cloud, you can continuously analyze this information to support current and future architectural decisions for your AWS cloud workloads.
By leveraging this data available to you, you’re able to quickly identify problem areas where you can make improvements, narrow down design solutions, and experiment to see if they improve things through testing at production scale, then evolve your architecture based on data-driven decisions.
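As a small sketch of this loop, imagine per-request latencies pulled from your monitoring data (the service names, latency numbers, and SLO below are all made up for illustration). Computing a percentile against a service-level objective immediately tells you where to focus your next experiment:

```python
import statistics

# Hypothetical per-request latencies in milliseconds for two services,
# standing in for metrics you would pull from your monitoring tooling.
latency_ms = {
    "checkout-api": [120, 135, 128, 900, 140, 132, 880, 125, 131, 950],
    "catalog-api":  [45, 52, 48, 50, 47, 51, 49, 46, 53, 50],
}

SLO_P95_MS = 300  # hypothetical service-level objective for 95th-percentile latency

for service, samples in latency_ms.items():
    p95 = statistics.quantiles(samples, n=20)[18]  # 19th of 19 cut points = 95th percentile
    status = "investigate" if p95 > SLO_P95_MS else "ok"
    print(f"{service}: p95={p95:.0f} ms -> {status}")
```

Here the data, not a hunch, flags the checkout service as the area worth experimenting on, and the same metrics later confirm whether the redesign actually helped.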
Improve through Game Days
You can implement a continuous cycle of improvement of your AWS cloud designs through scheduled game day events at your company.
So what’s a game day?
Within the context of this design principle, game days are simulated events where you can test various production incident scenarios in a controlled setting. These events often bring visibility to a number of areas for improvement across the company.
An example of a game day scenario may be where you get your cloud operations team and an incident manager together to facilitate the event, then throw out a production event scenario to the group and see what happens.
The event scenario is best kept a secret from the operations team until the game day starts so they don’t have time to prepare and they’re forced to think on their feet using their current skills and company resources and processes to navigate the situation.
“Ok, team, we’ve been hit by a ransomware attack that has left the production database unavailable, with its data encrypted and corrupted.”
Now, how this game day plays out will vary greatly from company to company, but it can quickly find a number of holes with your cloud architecture designs, skill gaps, business processes and procedures, and everything in between.
Do we have a standby database to fail over to? Do we know how to do that failover?
But wait, the data is replicated between the databases so the tables are all corrupt on the standby as well…
Do we have backups available? How recent are they?
Do we have a restore point recent enough that we minimize data loss and impact to the business?
How long will it take to deploy a new database and restore from a backup?
Do we have automation or infrastructure-as-code to deploy the database, or do we have to do this manually?
Do we know how to do this manually and ensure the configuration is set exactly like the current production database’s?
What parts of the business are impacted during this time? How do we notify our customers?
If this happened at 3AM, do we have a sound on-call process to contact folks that need to be involved? Are our employee records and contact information up-to-date?
These are just the tip of the iceberg of some of the questions that may come up for something like this game day event. Of course, you can, and should, pick different scenarios as you continue these scheduled game day events in the future.
“Ok team, this month, unfortunately due to a natural disaster, our main AWS region is down, and it’s estimated it will be 3–4 days before it recovers.”
Since these are simulated events, game days are often quite fun despite the somewhat ominous scenarios at hand. Along with all the people, process, and technology improvement areas these events uncover, they’re great team-building exercises as well!
What went well?
Where can you improve?
Perhaps you’ll need to rethink your cloud architecture designs to survive the event in a real-life scenario. Maybe you need procedural runbooks for the operations team so they can quickly reference tested procedures for things like database restores, site failovers, or other situations.
Game days may sound silly, but they’re a great way to find areas of improvement across your business and build operational experience across a number of teams at the company, and they’re a great team-building exercise that opens up communication channels within the organization.
If one of these game day scenarios happens for real in production, you’ll be glad you’ve been through it before and are much better prepared.
This general design principle of the AWS Well-Architected Framework embodies an entire discipline or practice commonly known as Chaos Engineering.
Game days are essentially where these chaos engineering principles and other AWS general design principles can be applied.
Through game days, you’re forced to experiment and test your systems through the simulated events.
You can continuously evolve your cloud architecture designs by collecting data from your cloud environment and processes through the post-mortems.
You can then use that information to stop guessing where you need to improve. You can find ways to improve production incident detection, response, and remediation times by ensuring you’re collecting the right log and performance metric data and have the automation in place to enable self-healing architectures, avoiding slow manual remediation tasks.
Don’t forget to sign up for our notifications so you won’t miss out on the next parts of this series and many more Cloud and DevOps articles to help you with your cloud journey.
Learn more about the AWS Well-Architected Framework here.
How can Autimo help?
We work with customers to go much deeper than the AWS Well-Architected Framework, tightly integrating with your teams to learn about your business, your challenges, and goals.
We don’t take these generic reviews at face value. We understand your business and team dynamics and can help tailor specific action plans that matter the most to you.
Autimo can work with your teams to take the results of the AWS Well-Architected review, prioritize them, and simply engage as a trusted advisor with your team as they work through the design improvement tasks. If more help is needed, Autimo can directly augment your existing team by providing AWS experts and project management capabilities to help accelerate project deliverables.