Part 5 – Cost Optimization Pillar Best Practices
Cost Optimization in AWS
In Part 5 of our deep-dive into the AWS Well-Architected Framework, we begin exploring the best practice areas of the Cost Optimization pillar.
Check out parts 1-4 of this series first if you haven’t already. We’ve been taking a layered approach with this deep-dive series on the AWS Well-Architected Framework, and the earlier parts of the series provide the baseline understanding of the framework itself and the design principles of this pillar.
If you missed those earlier articles, don’t fall behind again. Sign up for our newsletter to get the upcoming parts of this series and future articles on interesting Cloud and DevOps topics.
To quickly recap, Part 4 of this series introduced the five design principles of the Cost Optimization pillar of the AWS Well-Architected Framework.
AWS Well-Architected Cost Optimization Pillar
Design Principles
- Implement Cloud Financial Management
- Adopt a Consumption Model
- Measure Overall Efficiency
- Stop Spending Money on Undifferentiated Heavy Lifting
- Analyze and Attribute Expenditure
Along with these design principles, the Cost Optimization pillar is made up of five best practice areas.
AWS Well-Architected Cost Optimization Pillar
Best Practice Areas
- Practice Cloud Financial Management
- Expenditure and Usage Awareness
- Cost-Effective Resources
- Manage Demand and Supply Resources
- Optimize Over Time
Let’s start looking at these best practice areas one by one and learn what they’re all about.
Practice Cloud Financial Management
This best practice area outlines the importance of developing financial management capabilities focused on your overall cloud strategy. A mature cloud financial management program helps ensure the various internal teams operating in your cloud infrastructure align with the overall business and financial objectives, and optimizes the overall value you get from your cloud infrastructure investment.
The review question AWS provides as part of this best practice area is simply:
How do you implement cloud financial management?
So what are the best practices that we can reference to help implement cloud financial management?
Establish a Cost Optimization Function
This really comes down to building a team, or identifying an existing internal team, to act as your task force focused on cost optimization. Understanding all the business dynamics, financial and accounting aspects, and technical specifics that make up your cloud infrastructure costs takes a wide variety of skill sets and expertise.
Creating a cloud financial management team and the overall program can take time, but the important thing is to start. The earlier you have some form of cloud financial management in place, the easier it is to ensure your cloud costs are under control and those costs are providing the best overall value to your company.
Establish a Partnership Between Finance and Technology
AWS guides us to ensure finance and technology teams meet and collaborate regularly on overall business objectives. We want both areas to understand the current and long-term objectives of the business, and to understand current and future cost and usage details from both the technology and accounting perspectives.
Establish Cloud Budgets and Forecasts
With a strong partnership between the finance and technology areas of your business, you can better understand your cost and usage data and make more accurate budgeting and forecasting decisions. The flexibility and on-demand nature of the cloud can lead to highly variable infrastructure costs. As a result, your organization’s budgeting processes need to account for this variability and utilize tooling that dynamically adapts to frequently changing usage data, assisting you in making more accurate budgeting decisions.
Implement Cost Awareness in Your Organizational Processes
You need to think about how to drive cost awareness. You may want to consider a variety of checkpoints as part of your application or infrastructure development releases to ensure cost estimates have been performed and approved. There is often a training and education component here, ensuring existing staff and new hires are aware of these processes and approach cloud architecture decisions with cost awareness top-of-mind.
Report and Notify on Cost Optimization
For this best practice, AWS guides us to take advantage of tools like AWS Budgets to set up automated alerts for our cost and usage targets. Work together with your assembled Cloud Financial Management team to regularly review your cloud usage data, then measure and report on this data.
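As a concrete illustration, here’s a minimal sketch using boto3 to create a monthly cost budget that emails an alert at 80% of the limit. The account ID, budget name, amount, and email address are placeholders you’d replace with your own:

```python
import boto3

budgets = boto3.client("budgets")

# Hypothetical values -- substitute your own account ID, limit, and contacts.
budgets.create_budget(
    AccountId="123456789012",
    Budget={
        "BudgetName": "monthly-cloud-spend",
        "BudgetLimit": {"Amount": "10000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            # Alert when actual spend crosses 80% of the budgeted amount.
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finops@example.com"}
            ],
        }
    ],
)
```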
Monitor Cost Proactively
With the potential for highly-variable cost and usage, waiting for those reactive alerts and notifications is often too late. Implementing analytics and easy-to-digest visual dashboards of your AWS costs can help your teams stay proactive about monitoring your cost and usage. Mishaps like runaway recursive AWS Lambda functions or other application and infrastructure issues are often easy to catch with the right tooling and proactive monitoring in place, allowing your teams to quickly remediate them before they end up costing tens of thousands of dollars.
Proactively catching unusual usage is certainly important, but you can also leverage the monitoring and dashboard data to showcase teams that have done well at reducing cloud infrastructure waste in their departments, lowering AWS costs as a result. This recognition can help foster cost-conscious teams and even create friendly competition to see who can run the leanest, most highly-optimized AWS designs in their department or team.
Keep Up To Date with New Service Releases
The cloud moves fast, especially AWS. Constant feature updates, new service launches, and new technology advancements happen at a dizzying pace. Many of these releases have the potential to deliver significant savings to your business through lower prices, or even to fundamentally change your overall workload architecture: incorporating a new service offering can improve overall efficiency and result in lower costs.
Expenditure and Usage Awareness
With the agility AWS provides customers to provision resources across hundreds of service offerings at seemingly unlimited scale, the potential to lose track of all these resources is very real. In organizations with tens of thousands of AWS resources across a variety of development, test, and production environments, spread out over tens or hundreds of AWS accounts, it can be difficult to keep track of your cloud costs and ultimately attribute those costs back to the resource owners.
Tagging your AWS resources is a foundational step for the best practices in this area. AWS tags add resource metadata and business context that help you understand ownership and cost attribution, allow for more advanced analytics and dashboarding of your resource usage, and enable a number of security and automation capabilities.
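To make that concrete, here’s a small boto3 sketch applying a standardized set of tags to an EC2 instance. The tag keys and values are hypothetical examples of a scheme you might standardize on:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical tagging scheme -- align the keys with your own
# cost-allocation standard.
ec2.create_tags(
    Resources=["i-0123456789abcdef0"],  # placeholder instance ID
    Tags=[
        {"Key": "CostCenter", "Value": "CC-1234"},
        {"Key": "Owner", "Value": "platform-team"},
        {"Key": "Environment", "Value": "production"},
        {"Key": "Project", "Value": "checkout-service"},
    ],
)
```

Keep in mind that tag keys must also be activated as cost allocation tags in the Billing console before they show up in your cost and usage data.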
Let’s take a look through each of the framework review questions here and understand the best practices they intend to help AWS customers implement for their cloud workloads.
How do you govern usage?
Develop Policies Based on Your Organization Requirements
AWS suggests creating resource lifecycle policies here, covering the full end-to-end lifecycle of each resource: well-defined policies on how your resources are created, modified, and decommissioned, as well as how they are managed while running. Having these policies in place can help you get a handle on your AWS costs, and helps with many other security and governance aspects too.
Implement Goals and Targets
Having cost optimization goals can help drive your organization’s cloud workload designs and operations practices in the right direction. With the goals clearly defined and communicated with the teams involved, they become more cost-aware and make design decisions with costs top of mind.
With the goals defined, you can leverage tools like AWS Budgets, and other custom automation tooling to help track your progress through measurable targets.
Implement an Account Structure
As your AWS usage grows, you’ll often want to implement a multi-account structure. The strategy of a multi-account design is a whole topic on its own, but in the Cost Optimization context, you’ll want an AWS account structure that maps to your organizational structure or overall business functions.
When workloads are split into accounts that align with different internal business units, determining ownership and who is responsible for those costs becomes much easier.
Implement Groups and Roles
While groups and roles are typically discussed from a security and access control standpoint, implementing groups and roles that align with your business functions, team structures, and governance policies provides another layer of control. These controls determine who can create and modify your cloud infrastructure, and therefore who can cause significant changes to your resources and, in turn, substantial changes to your costs.
For those people on your team who do require permissions in your AWS accounts to create and modify resources, ensure they are brought to the table as part of your Cloud Financial Management program and have the training and tools they need to stay cost-aware of the actions they can perform within your cloud infrastructure.
Implement Cost Controls
AWS offers hundreds of services. The chances that you will use every single one of them in all your AWS accounts are near zero. So why not put some guardrails in place to prevent unexpected charges from services and resource types your organization has no intention of ever using? There are a number of access controls and AWS services at your disposal to help restrict what resources can be created or modified, providing another layer of protection against unexpected AWS costs.
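One way to implement such a guardrail is a service control policy (SCP) applied through AWS Organizations. Below is a sketch; the denied services and policy name are hypothetical examples:

```python
import json
import boto3

org = boto3.client("organizations")

# Hypothetical SCP denying services this organization never intends to use.
deny_unused_services = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": ["redshift:*", "sagemaker:*"],  # example services only
            "Resource": "*",
        }
    ],
}

org.create_policy(
    Name="deny-unused-services",
    Description="Guardrail against unexpected charges from unused services",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(deny_unused_services),
)
```

The policy only takes effect once it’s attached to the relevant organizational units or accounts.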
Track Project Lifecycle
Thinking back to the AWS Well-Architected Framework general design principles, AWS customers are encouraged to experiment, fail fast, and allow for evolutionary architectures. But with all this change and experimentation happening, it’s important to track the overall project lifecycle to ensure you’re not paying for any unnecessary resources that didn’t get cleaned up.
How do you monitor usage and cost?
Configure Detailed Information Sources
AWS guides us here to use the AWS Cost and Usage Report (CUR), and Cost Explorer tools with hourly granularity settings to get detailed cost and usage information.
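For example, here’s a minimal sketch of pulling hourly cost data from the Cost Explorer API with boto3. Hourly granularity must first be enabled in your Cost Explorer settings (and carries an additional charge), and the dates are placeholders:

```python
import boto3

ce = boto3.client("ce")

# HOURLY granularity requires opting in via the Cost Explorer settings
# and incurs an additional charge for the hourly data.
response = ce.get_cost_and_usage(
    TimePeriod={
        "Start": "2022-06-01T00:00:00Z",  # placeholder dates
        "End": "2022-06-02T00:00:00Z",
    },
    Granularity="HOURLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for result in response["ResultsByTime"]:
    for group in result["Groups"]:
        print(result["TimePeriod"]["Start"],
              group["Keys"][0],
              group["Metrics"]["UnblendedCost"]["Amount"])
```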
Identify Cost Attribution Categories
Here you’d want to consider your organizational structure or finance-based cost centers to help attribute cloud spend to the appropriate groups within your organization.
Establish Organizational Metrics
To aid with monitoring usage and costs, establish what metrics are important to the business for the workload being evaluated. Determine what these metrics are; you can then reference them against your cost and usage data to derive insights about your cloud costs relative to the business value they deliver.
Configure Billing and Cost Management Tools
To help with your cloud cost monitoring and alerting, you can set up AWS Cost Explorer and AWS Budgets. Set the budgets according to the business objectives and reasonable cost expectations of the workload, and ensure alerts are sent to the appropriate people if cost and usage start climbing.
Add Organizational Information to Cost and Usage
For this best practice, AWS suggests that you define a standardized tagging scheme based on your organization’s cost center, org structure, or other workload specifics that would help further identify and measure your cloud cost and usage at more granular levels.
Allocate Costs Based on Workload Metrics
Using the AWS Cost and Usage Report (CUR) data, combined with your workload tagging metadata you can analyze your usage and costs through tools like Amazon Athena to help accomplish detailed chargeback or showback of costs mapped to specific business outcomes.
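As a rough sketch of what that analysis can look like, here’s a boto3 call that runs an Athena query grouping CUR costs by a cost-allocation tag. The database, table, and S3 output location are placeholders, and CUR exposes each activated tag as a column named resource_tags_user_<key>:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical CUR table and tag column -- CUR exposes each activated
# cost-allocation tag as a column named resource_tags_user_<key>.
query = """
SELECT resource_tags_user_cost_center,
       SUM(line_item_unblended_cost) AS total_cost
FROM cur_database.cur_table
WHERE year = '2022' AND month = '6'
GROUP BY resource_tags_user_cost_center
ORDER BY total_cost DESC
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "cur_database"},  # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
)
```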
How do you decommission resources?
Track Resources Over Their Lifetime
Here, you want to leverage your tagging and other resource identifiers to put a system in place that tracks your cloud resources over their entire lifecycle.
Implement a Decommissioning Process
This best practice guides customers to develop a process to first identify, and then decommission any orphaned resources. The metrics you use to determine orphaned resources can vary, but you should have a way to determine when they are not being used anymore. When the resources are flagged as orphaned or not being used anymore, have a system in place that shuts those resources down to reduce any wasted cloud spend.
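A minimal sketch of such a system is below, assuming "orphaned" means average CPU utilization under 2% for the past two weeks. The threshold, lookback window, and the choice to stop rather than terminate are all illustrative:

```python
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

IDLE_CPU_THRESHOLD = 2.0  # hypothetical definition of "orphaned"
LOOKBACK = timedelta(days=14)

now = datetime.now(timezone.utc)
reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]

for reservation in reservations:
    for instance in reservation["Instances"]:
        instance_id = instance["InstanceId"]
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=now - LOOKBACK,
            EndTime=now,
            Period=86400,  # one datapoint per day
            Statistics=["Average"],
        )["Datapoints"]
        if stats and all(dp["Average"] < IDLE_CPU_THRESHOLD for dp in stats):
            # Stop (rather than terminate) so the decision is reversible.
            print(f"Flagging idle instance {instance_id} for decommissioning")
            ec2.stop_instances(InstanceIds=[instance_id])
```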
Decommission Resources
In the cloud, gone are the days of being stuck with expensive IT hardware sitting in data centers consuming your CapEx and OpEx. In AWS, you can quickly decommission your infrastructure.
No longer need those EC2 instances from a failed project? Shut them down.
Have your traffic demands decreased and you can get by with ten percent less compute capacity for your workload? Great, decommission those unnecessary resources.
In the beginning, even an occasional scheduled manual audit to clean up resources is better than paying for resources you don’t need for eternity. Ideally, though, you’d leverage an automated process to detect and decommission any unnecessary resources.
Decommission Resources Automatically
Continuing on with the previous best practice, we want to ensure our cloud workload can support the graceful shutdown of any unused resources that are no longer required. There are seemingly countless ways to monitor and scale your AWS workloads and leverage automation to handle the resource lifecycle, but you want to ensure that the application can handle the decommissioning of its workload components that are no longer utilized.
Cost-Effective Resources
As cost is a large part of an overall “Well-Architected” workload, picking the most optimal instances and resources for your needs is important to the overall cost efficiency of your design. Further, AWS offers a number of managed services, taking on the heavy lifting of operating and maintaining things like databases or container orchestration platforms. Leveraging these managed services can greatly reduce your overall operating costs when you factor in the Total Cost of Ownership (TCO) compared to running it all yourself.
AWS also has a number of pricing options to best fit your cost and flexibility requirements. Using EC2 Spot instances can save up to 90% of your compute costs. Then you can evaluate if AWS Savings Plans or Reserved Instances are a good fit for your business. There’s potential to save up to 75% compared to the typical on-demand pricing rates.
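For example, requesting Spot capacity for a workload that can tolerate interruptions is a small change to a normal instance launch. Here’s a sketch with boto3, using a placeholder AMI ID and instance type:

```python
import boto3

ec2 = boto3.client("ec2")

# Request Spot capacity instead of On-Demand -- the workload must
# tolerate a two-minute interruption notice.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {"SpotInstanceType": "one-time"},
    },
)
```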
What questions should we ask to know whether we’re making the most cost-effective resource choices for our cloud workloads?
How do you evaluate cost when you select services?
Identify Organizational Requirements for Cost
With any cloud workload design, the overall architecture typically has many tradeoffs.
How important is cost in the workload design?
Sure, you could design your workload to run across multiple availability zones and multiple AWS regions, with active-active disaster recovery for near-instant failover, but is that necessary, or just too cost-prohibitive? You’ll often need to work with key stakeholders across the organization to find the right mix of tradeoffs and stay within budgetary constraints.
Perform a Thorough Analysis of Each Component
Dive deep and look at each resource that is part of your overall cloud workload architecture, and perform a TCO analysis of that resource. If the resource has a large potential cost, spend a good chunk of time on a thorough review as part of the TCO analysis. Perhaps it makes sense to look at a managed service offering instead, or maybe a different database technology could be used. The cost aspects of the TCO analysis can have a big impact on the workload design choices being made. Take your time and ensure your team has made the most optimal resource selections based on all the business and technology factors in scope.
Select Software with Cost Effective Licensing
Strongly consider open-source software to help avoid expensive software licensing costs. If commercial software licenses are absolutely required, pay very close attention to the fine print of the licenses involved. Many commercial software licenses are crafted so that the license is tied to a hardware attribute (virtual or physical) like CPU processor sockets. These types of licenses greatly limit your flexibility of resource choices and your ability to scale in a cost-effective way.
Perform Cost Analysis for Different Usage Over Time
Going back to the framework’s General Design Principles, we should “allow for evolutionary architectures”. Your products can change and your business demands can change, resulting in very different requirements and traffic loads on your workload. You should continuously evaluate each component of your workload to make sure you are still using the most optimal resources based on current load metrics. Some resources that were the most cost-effective for handling large traffic volumes may no longer be the optimal choice if your usage gets cut in half. Maybe a simple scale in or out of resources is needed, maybe you vertically scale to a more or less powerful instance type, or maybe you redesign the workload completely. The key point here is to periodically evaluate your workload components to see if yesterday’s choices are still the most cost-effective for today.
How do you meet cost targets when you select resource type, size and number?
Perform Cost Modeling
In this best practice, AWS guides you to perform cost modeling across your entire workload. The cost modeling exercise should test workload components under varying load amounts to establish the most cost-effective configurations. While you’re at it, model what costs may look like if your load goes beyond predicted levels, or comes in under expectations. Do your infrastructure costs scale linearly with load, or are there exponential cost increases as your infrastructure scales to handle it? Ideally, your cost per unit of load would decrease as load increases; at worst, costs should scale roughly linearly with load. If cost modeling indicates exponential growth of your cloud resource costs as load increases, the overall workload design or the resource technologies being used may need to be reconsidered.
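As a toy illustration of the linear-versus-exponential question, here’s a small sketch comparing monthly cost at different load levels under two hypothetical cost functions. The rates are purely illustrative:

```python
# Toy cost model -- the base cost and per-unit rates are purely illustrative.
def linear_cost(requests_per_sec: float) -> float:
    """Cost grows proportionally with load (e.g., stateless horizontal scaling)."""
    return 500 + 2.0 * requests_per_sec

def superlinear_cost(requests_per_sec: float) -> float:
    """Cost grows faster than load (e.g., cross-node chatter forcing
    ever-larger instances as the cluster scales)."""
    return 500 + 0.05 * requests_per_sec ** 1.8

for load in (100, 500, 1000, 5000):
    print(f"{load:>5} req/s  linear: ${linear_cost(load):>10,.0f}"
          f"  superlinear: ${superlinear_cost(load):>12,.0f}")
```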
Select Resource Type, Size, and Number Based on Data
AWS offers us a tremendous amount of resource choices, especially with compute, database, and storage-based resources. Picking the best fit from all the choices available should be a data-driven decision based on the specific workload dynamics. Leverage data from previous environments if available, or perform load tests with predicted traffic levels to determine the overall workload performance characteristics. Then use this data to understand the specific compute, memory, storage, and networking dynamics and pick the AWS resource type, size, and quantity that maximize your cost-to-performance ratio.
Select Resource Type, Size, and Number Automatically Based on Metrics
Similar to the above best practice, we want to maximize our resource cost-to-performance ratio, but here AWS suggests continuously monitoring our workload’s key performance metrics and making automated decisions about the resource selections that best fit the current demand.
A basic implementation of this best practice may be simply leveraging EC2 auto-scaling capabilities to horizontally scale your compute resources based on load changes. A more sophisticated implementation may leverage extensive workload metric data and machine learning capabilities to make automated decisions and changes to resource types, sizes, storage, provisioned IOPS settings or more. The benefit of using automation here is the speed at which you can implement changes based on workload demand changes. Performing scaling activities and resource selections manually can be time-consuming, potentially leaving you with higher than necessary cloud usage costs by using suboptimal resources during the time it takes you to review, decide, and implement changes.
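Here’s a sketch of that basic implementation: attaching a target-tracking scaling policy to an existing Auto Scaling group with boto3. The group name and the 50% CPU target are placeholders:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep average CPU near the target by automatically adding or removing
# instances -- the group name and target value are placeholders.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```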
How do you use pricing models to reduce cost?
Perform Pricing Model Analysis
After performing your cost modeling exercises at varying load levels, look at which resources will run continuously for extended periods of time. If there are EC2 instances that need to stay running 24 hours a day, or databases that will always be up, you may be able to take advantage of Reserved Instances or Savings Plans-based committed discounts. If areas of your workload are batch-based, short-lived, or can tolerate the loss of compute resources from time to time, you may be able to leverage EC2 Spot Instances and take advantage of significant savings compared to on-demand rates.
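As a starting point for this analysis, you can ask the Cost Explorer API for AWS’s own Savings Plans purchase recommendation. Here’s a sketch with boto3, where the term, payment option, and lookback period are example choices:

```python
import boto3

ce = boto3.client("ce")

# Ask Cost Explorer for a Compute Savings Plans recommendation based on
# the last 60 days of usage -- term and payment option are example choices.
rec = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
    LookbackPeriodInDays="SIXTY_DAYS",
)

summary = rec["SavingsPlansPurchaseRecommendation"].get(
    "SavingsPlansPurchaseRecommendationSummary", {}
)
print("Estimated monthly savings:",
      summary.get("EstimatedMonthlySavingsAmount"))
```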
Implement Regions Based on Cost
AWS Region selection is a much bigger topic, but in the context of the Cost Optimization pillar and this specific best practice, AWS guides us to look at our region selection carefully. AWS resource usage and network utilization-based costs can vary significantly from region to region. There may be tradeoffs to weigh between region proximity to your workload’s end-users to reduce latency, your disaster-recovery strategy, compliance and other data-residency governance aspects, and costs.
Keep in mind that not all AWS regions have the same resources available. Some services may not exist at all, or may have limited resource types available compared to other regions. Often region selection is quickly narrowed down based on your data governance or other policy-based compliance factors, then on whether the resources you require even exist in a given region. Once those are determined, you can start looking at the cost tradeoffs of the remaining region options compared to user experience impacts and other factors.
Select Third Party Agreements with Cost Efficient Terms
Leveraging the scaling and flexibility of the AWS cloud can offer tremendous cost advantages, especially over traditional on-premises IT infrastructure. If you utilize a variety of third-party software and other Software-as-a-Service-based integrations with your workload, carefully consider your licensing and contract agreements. You want to ensure your agreements allow for flexible scaling of your workload in AWS, and that the licensing or other consumption costs of these third-party integrations remain cost-effective as you scale your workload in or out.
Perform Pricing Model Analysis at the Master Account Level
In larger multi-account organizations, take advantage of the consolidated billing and AWS Organizations based account structures to evaluate your cost and usage at the “root” management level account. Considering the overall resource consumption across all the AWS accounts and their different workloads may influence your Reserved Instance or Savings Plan commitments. Keep the holistic AWS cost and usage of all your linked AWS accounts in mind for these committed discount decisions rather than look at a specific workload in isolation for commitment-based discounts.
How do you plan for data transfer charges?
Perform Data Transfer Modeling
Understand your workload traffic predictions and perform a cost analysis across the workload and each of its components to understand the cost implications of network traffic.
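As a toy example of such a model, the sketch below estimates monthly transfer charges from predicted traffic volumes. The per-GB rates are illustrative only; always check current AWS pricing for your regions:

```python
# Illustrative per-GB rates only -- check current AWS pricing for your regions.
RATE_INTERNET_EGRESS = 0.09   # data transfer out to the internet, per GB
RATE_CROSS_AZ = 0.01          # charged per GB in each direction

monthly_egress_gb = 50_000     # predicted responses served to end users
monthly_cross_az_gb = 200_000  # predicted chatter between AZ-replicated services

egress_cost = monthly_egress_gb * RATE_INTERNET_EGRESS
cross_az_cost = monthly_cross_az_gb * RATE_CROSS_AZ * 2  # both directions

print(f"Internet egress: ${egress_cost:,.0f}/month")
print(f"Cross-AZ traffic: ${cross_az_cost:,.0f}/month")
```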
Select Components to Optimize Data Transfer Costs
AWS directs us to design our workload and make component selections to reduce data transfer costs. In practice, there are often many considerations and tradeoffs to consider here, but at the least, you should be aware of the data transfer costs involved with your workload design and look at ways to optimize this. Perhaps the design uses multiple availability zones when there is actually no business need or application requirement for highly-available resources.
Again, many factors come into play with selecting AWS components but don’t forget to include data transfer costs as part of that decision process.
Implement Services to Reduce Data Transfer Costs
Similar to the above, we want to understand our networking requirements and have accurate data transfer cost models to work with when making decisions about AWS service selections. AWS provides many connectivity options to customers, and these can have significant performance and cost impacts on the overall workload. For example, if your workload serves a lot of static web content, using Amazon CloudFront as a content delivery network to provide a caching layer can significantly reduce overall bandwidth and data transfer costs. If you expect a lot of data transfer between the workload being reviewed and your office locations, or your on-premises data centers if you operate a hybrid-cloud infrastructure model, leveraging AWS Direct Connect may offer a large reduction in your data transfer costs.
Manage Demand and Supply Resources
When you move to the cloud, you pay only for the resources you actually use. Resources are supplied on demand, rather than you paying for servers sitting idle. This makes your applications more cost-efficient and reduces the cost of short-term projects.
With AWS, you have the ability to make automatic adjustments to your resource usage on a massive scale. Using techniques such as automatic scaling based on demand (or time), you can increase or decrease the number of servers you use to meet the workload demands. If you can plan for and anticipate necessary adjustments, such as an increase in activity around an upcoming holiday, you can provision just enough capacity to meet demand, saving money while still ensuring your servers meet your workload’s needs.
The questions below help keep these supply and demand considerations top of mind when reviewing your workload.
How do you manage demand, and supply resources?
Perform an Analysis on the Workload Demand
For this best practice, AWS guides us to do our workload analysis over a period of time. This longer-term, ongoing demand analysis will start to pull in data that covers a variety of demand trends that occur for the workload. Your business may have daily peaks and low-demand periods, maybe there are month-end spikes, or perhaps your business has seasonal aspects, with demand higher at certain times of the year. Or, if you’re in the retail industry, sales events like Black Friday need to be analyzed carefully to ensure your workload will scale to meet extreme demand.
Implement a Buffer or Throttle to Manage Demand
Having a loosely coupled workload design to help implement areas where you can queue or throttle spikes in demand can help balance your workload scale with client-side expectations. Being able to smooth out those sharp spikes of load and process them at a more efficient rate can help limit unnecessary infrastructure scaling activity. It is important to know what is an acceptable delay to processing requests from the client-side applications or end-users, and configure your infrastructure scaling trigger points based on these metrics or service level objectives. If your processing queue backlog grows too large, or other processing delay metrics reach a certain limit that is no longer acceptable, you can scale out your workload to quickly relieve the processing backlog, then scale back down when that demand spike backpressure is relieved.
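A minimal sketch of this buffering pattern with Amazon SQS is below, scaling workers out when the queue backlog exceeds an acceptable threshold. The queue URL, backlog limit, and Auto Scaling group name are placeholders:

```python
import boto3

sqs = boto3.client("sqs")
autoscaling = boto3.client("autoscaling")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-queue"  # placeholder
BACKLOG_LIMIT = 10_000  # hypothetical service-level objective on queue depth

# Producers absorb demand spikes by writing to the queue instead of
# calling workers directly.
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody='{"order_id": 42}')

# A monitoring job checks the backlog and scales workers out when the
# acceptable processing delay is exceeded.
attrs = sqs.get_queue_attributes(
    QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
)
backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

if backlog > BACKLOG_LIMIT:
    autoscaling.set_desired_capacity(
        AutoScalingGroupName="queue-workers-asg",  # placeholder
        DesiredCapacity=10,
        HonorCooldown=False,
    )
```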
Supply Resources Dynamically
As mentioned above, we want to leverage all the auto-scaling capabilities of AWS along with our metric collection around key workload functions to make automated data-driven decisions around scaling activities. Having automation in place can help ensure the number of resources provisioned is meeting the current real-time workload demands. You don’t want to pay for resources that you don’t need, but want to ensure your applications are meeting the current load demands.
Depending on your workload specifics, automated scaling may end up being too much of a reactive approach. For very sudden load spikes, it can take some time to trigger the scaling activity, provision the necessary resources, then get them into a state where they’re ready to service the demand requests. How long this takes depends on a large number of factors, but you may also want to consider time-based scaling events if your workload demand pattern analysis shows consistent trends. You can use this time-based scaling method to pre-provision new resources ahead of time so that your workload is ready to handle the expected load. If your demand trends are very consistent, you can stay just a step ahead of the demand curve so your workload is prepared to handle it, without having any significant amount of idle resources waiting for the demand increase to occur.
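Here’s a sketch of that time-based approach: a scheduled Auto Scaling action that pre-provisions capacity ahead of a known weekday morning peak. The group name, schedule, and capacities are placeholders derived from your own demand pattern analysis:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Scale out ahead of a known weekday morning peak -- all values are
# placeholders based on your own demand pattern analysis.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web-tier-asg",
    ScheduledActionName="weekday-morning-scale-out",
    Recurrence="30 7 * * MON-FRI",  # cron format, 07:30 UTC
    MinSize=4,
    MaxSize=20,
    DesiredCapacity=8,
)
```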
Optimize Over Time
Cost optimization is never done. There are so many factors that can change with your workloads over time. New or changing applications that you’re running for your business, changes in traffic demand, different third-party integrations, or even different governance or compliance factors can greatly change how your workload operates in AWS. Then, if we layer on the firehose of new features and services AWS frequently releases, along with technological advancements in processors and storage media, the optimal resource choices made for the workload initially likely won’t be the optimal choices six months or a year later.
How do you evaluate new services?
Develop a Workload Review Process
AWS guidance for this best practice suggests defining a clear process for the workload review task. The amount of effort put into each review, and how frequently reviews are performed, should reflect the potential benefits to the business. There’s no need to review a small workload every month when it accounts for less than 1% of your overall AWS cost and usage. However, if you have a core workload that makes up a significant portion of your overall AWS costs, and those costs keep climbing, you’ll likely want to keep a closer eye on it and review it quarterly or monthly.
Develop your review process and schedule accordingly.
Review and Analyze this Workload Regularly
Frequent workload evaluation, along with a review of new AWS services, technologies, instance type families, and features, is needed to ensure your workload evolves to take advantage of the latest options, maximizing the cost-to-performance ratio of your AWS infrastructure investment and reducing your overall cloud spend.
Well, that concludes our first look at the best practices contained within one of the AWS Well-Architected Framework pillars.
As you can see, there are often many best practices to consider. All of them should be considered when evaluating a workload running on AWS, but the importance of any individual best practice may vary depending on the workload itself and the dynamics of your business objectives.
Keep in mind that these are the best practices of a single pillar of the Well-Architected Framework, Cost Optimization. There are five other pillars whose design principles and best practices you’ll need to understand in order to do a thorough Well-Architected Review of a given workload running in AWS.
In the next part of this series, we’ll switch to a new pillar of the AWS Well-Architected Framework, and get an introduction to the newest Sustainability pillar that was recently added to the framework.
Don’t miss the next part! Sign up for our newsletter to get all the latest insights from the Autimo team!
Learn more about the AWS Well-Architected Framework here.
If you need to get started with your AWS Well-Architected Review today, the Autimo team is here to help!
How can Autimo help you with cloud cost management?
We work with customers to go much deeper than the AWS Well-Architected Framework. We’ll tightly integrate with your teams to learn about your business, your challenges, and goals.
We strive to understand your business and team dynamics, helping to tailor the specific action plans that matter most to you.
Our team can work with your internal groups to take the results of the AWS Well-Architected review, prioritize them, and engage as a trusted advisor as your team works through the design improvement tasks. If more help is needed, Autimo can directly augment your existing team by providing AWS experts and project management capabilities to help shorten project timelines.
Want to learn more about cloud cost management?