When does it make financial sense to host services/applications in the public cloud and when in a colo? In this post, I will investigate the economics of building a cluster in a carrier hotel (colo) vs public cloud, e.g. Amazon Web Services (AWS).
Adding resources on demand makes it a lot easier to achieve better resource utilization in the cloud. At the same time, a naive comparison shows that a CPU cycle or a GB of RAM costs less on a physical server than on a server in AWS. Yet physical servers require hosting, connectivity, and a support team. The cost of the team is a fixed cost, while the other two are variable costs. Naturally, as the infrastructure grows, the relative cost of people diminishes. So where is the crossover point after which running in-house infra, including the support team, is cheaper than employing a public cloud?
The material in this post was part of a DevOps Days Vancouver 2017 presentation by Gordon Klok; for more information, check out his talk on YouTube.
For additional details on the data used in the model refer to the spreadsheet.
Puzzled by that question, I built an analytical cost model for each of the hosting options. After running that model for the load of a more or less typical organization providing internet services, I found that:
When the monthly recurring cost (MRC) of running the infrastructure in the cloud exceeds $120,000/month, there is potential for cost reduction by running the infrastructure in a colo space.
Now let me describe how I got there. The model I employed had two parts.
The first part focused on the demand for the infrastructure (computational, network and storage). The second part translated that demand into the capacities needed in each of the environments. At the end, the cost of running those capacities was calculated. In the model, I kept increasing the demand until I found the crossover point. Then I pushed the load further to observe how the difference in costs kept growing.
In the first part, I tried to reflect the needs of a more or less typical organization running internet services/applications. The numbers used for quantifying the workloads came from personal interviews with a few industry practitioners and from past experience; such a workload should not favor either of the environments. Note that for a specific organization the parameters of the workload, and consequently the projected cost, may differ.
As for the services workload, I assumed that a container management system (e.g. Kubernetes or Mesos) and distributed storage (e.g. Ceph) were in place. The cost of supporting those was factored into the cost of the staff.
I assumed that the load was gradually increasing throughout the year and that the growth trend could be predicted. Yet the accuracy of forecasting the load for a specific time was quite limited. The forecast was normally distributed around the expected value with a standard deviation of 20%, meaning that only about 68% of the time the forecast was within 20% of reality.
Given the uncertainties in the workload prediction, the capacity needed for handling the same load differs greatly depending on whether the cluster is hosted in a colo or in AWS. The difference lies in the inherently different ways resources are provisioned in AWS and in a colo.
One of the most attractive parts of the cloud is the ability to acquire resources quickly and on demand when the load changes. As a result, the allocated capacity can follow the demand curve quite closely, as shown in Figure 2.
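The 68% figure follows from the normal distribution: it is the probability of landing within one standard deviation of the mean. A minimal sketch, assuming the forecast error is normal with a relative standard deviation of 20% (the function name `prob_within` is mine, not part of the model):

```python
import math

def prob_within(rel_tolerance: float, rel_sigma: float = 0.20) -> float:
    """Probability that a normally distributed forecast error stays within
    +/- rel_tolerance of the expected load, for a relative std dev rel_sigma."""
    z = rel_tolerance / rel_sigma          # tolerance expressed in sigmas
    return math.erf(z / math.sqrt(2))      # P(|X| < z*sigma) for a normal X

for tol in (0.20, 0.40, 0.60):
    print(f"within +/-{tol:.0%}: {prob_within(tol):.1%}")
```

A 20% tolerance is one sigma, which gives roughly 68%. Widening the tolerance to two or three sigmas shows how much headroom is needed before forecast misses become rare.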
Now, let’s talk a bit about AWS provisioning. AWS offers several types of instances: on-demand, reserved for 1 year or 3 years, spot instances and so on. Using reserved instances offers substantial savings (up to 40%), yet requires a rather long commitment of 1 to 3 years. It would be quite rational to switch from on-demand instances to reserved ones once there is certainty that they would be fully utilized. This can be done by following a simple rule: “If a server of a specific instance type has been used for more than 3 months, then it should be converted to a reserved instance”. I modeled this effect by making 50% of the instances 1y reserved, and the other 50% on-demand.
In a colo, the capacities are usually fixed for about a year (the most typical provisioning interval). This is because ordering the h/w, wiring up servers, setting up the network, and doing burn-in tests takes time. Besides that, there are also budgeting cycles in the organization: money needs to be allocated, colo expansions need to be scheduled and so on. Since the provisioning interval for colos is one year long, the preceding forecast has to cover a year as well. The longer the forecasting interval, the harder it is to predict the load accurately, as requirements may change, user behavior may shift and so on. As a result:
A long re-provisioning interval, typically one year, demands substantial colo over-provisioning to compensate for unexpected spikes in the load.
An example of such a provisioning process is illustrated in Figure 2. In practice, clusters are often allocated capacities three times larger than the expected load. In this model, the colo was provisioned so that the expected average load takes up 30% of the capacity. In other words, the colo is built out to roughly 3.3 times the expected average load.
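To see what the 50/50 split means for the compute bill, here is a minimal sketch of the blended rate. It works with a normalized on-demand rate of 1.0 rather than real prices, and it takes the 40% discount as the upper bound quoted above; the discount actually used in the model is not stated here, so treat that default as an assumption.

```python
def blended_rate(on_demand_rate: float,
                 reserved_share: float = 0.5,
                 reserved_discount: float = 0.40) -> float:
    """Average hourly rate for a fleet that mixes reserved and on-demand
    instances. reserved_discount=0.40 is the 'up to 40%' upper bound."""
    reserved_rate = on_demand_rate * (1.0 - reserved_discount)
    return (reserved_share * reserved_rate
            + (1.0 - reserved_share) * on_demand_rate)

# Normalized on-demand rate of 1.0, so the result reads as a fraction
# of the all-on-demand price.
print(blended_rate(1.0))
```

With half of the fleet reserved, the fleet-wide saving is at most half of the reserved discount, so the mix chosen for the model is fairly conservative.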
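The relation between the 30% figure and the 3.3x overbuild is a simple reciprocal. A small sketch, assuming that "provisioned up to 30%" means a target average utilization of 30% (that reading, and the function names, are my assumptions):

```python
def overbuild_factor(target_utilization: float) -> float:
    """Capacity to build per unit of expected average load."""
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    return 1.0 / target_utilization

def required_capacity(expected_load: float,
                      target_utilization: float = 0.30) -> float:
    """Capacity to provision, in the same units as expected_load."""
    return expected_load * overbuild_factor(target_utilization)

print(round(overbuild_factor(0.30), 2))   # capacity per unit of load
print(round(required_capacity(100), 1))   # e.g. load equivalent to 100 instances
```

The same function shows how sensitive the colo cost is to forecasting quality: a team that can safely run at 50% utilization needs only twice the expected load.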
The last piece of the model reflects the cost as a function of the capacities. More specifically, it predicts the cost of acquiring CPU, memory, storage and backup space in these two different environments.
Modeling the cost of running a cluster in AWS is a rather straightforward procedure, though it requires some due diligence. In the model I considered the following items:
- EC2 nodes
- EBS Volumes and provisioned IOPS
- Load balancing
- Data transfer
- Backups to S3
Calculating the expected MRC of the infra in AWS mostly takes patience and information about the workload. Amazon did a great job of putting together a nice online calculator. I did not use that exact calculator but replicated the relevant parts of it using the pricing information published on the AWS pricing website.
A typical mistake I tried to avoid is looking only at the cost of the EC2 instances. Such an oversimplification results in a significant cost underestimation, which can have serious consequences, especially if you base your budget on those numbers.
Figure 3 shows the breakdown of the expected charges for 500 m4.2xlarge instances, where 50% of them are 1y reserved and the rest are on-demand. As one can see from the chart, once I added other costs such as IOPS, data transfer and so on, the expected AWS bill almost doubled. Data transfer and provisioned IOPS come at a rather substantial price. At the same time, the relative cost of load balancing and S3 backups is quite low.
Forgetting to factor in charges for ELB, IOPS, data transfer and backups results in a significant underestimate of the expected AWS bill.
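The size of that underestimate is easy to check once the line items are in one place. The sketch below uses placeholder values in arbitrary units (EC2 normalized to 100); they are hypothetical and are not the figures from the spreadsheet, they only illustrate the bookkeeping.

```python
# Hypothetical monthly line items in arbitrary units (EC2 = 100).
# Placeholders for illustration only, NOT the numbers behind Figure 3.
line_items = {
    "ec2": 100,
    "ebs_and_iops": 45,
    "data_transfer": 40,
    "load_balancing": 3,
    "s3_backups": 5,
}

def total_bill(items: dict) -> float:
    """Sum of all monthly line items."""
    return sum(items.values())

def underestimate_factor(items: dict, counted=("ec2",)) -> float:
    """How many times larger the full bill is than the counted subset."""
    counted_total = sum(items[name] for name in counted)
    return total_bill(items) / counted_total

print(total_bill(line_items))
print(underestimate_factor(line_items))
```

Plugging real numbers from a bill or from the pricing pages into `line_items` gives the factor by which an EC2-only budget would be off.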
When it comes to building clusters in a colo one would face the largest expenses in:
- salaries for the operations people
- computational and storage capacities
- power and real estate in a colo
- network fabric and uplink
- rails, racks and power distribution units (PDUs)
Figure 4 shows the breakdown of the expected monthly costs of running the equivalent of the 500 m4.2xlarge instances. It won’t be a surprise that a large fraction of the costs goes towards paying the salaries of the staff. The cost of the computing power is the second biggest expense, and the cost of the colo real estate and power comes third. It is interesting to see that the cost of data transfer (uplink) is a lot smaller than in AWS. The load balancing comes at zero cost, as s/w based load balancing is assumed to be in place, e.g. it can be provided by Kubernetes.
In the next posts, I will discuss in greater detail the software and hardware stack assumed in the model. So don’t forget to subscribe.
So Colo or AWS?
After I’d put together cost models for both options, I started increasing the expected demand for resources to see how the expected costs would trend. For convenience, I expressed the infrastructure demand as the number of m4.2xlarge instances needed for handling it in AWS. Note that the backup, data transfer, and storage load remained proportional to that count.
As expected, hosting smaller clusters of fewer than 100 instances is significantly cheaper in AWS [Fig. 5]. However,
If one runs more than 160 m4.2xlarge instances (or spends more than $120k/month on AWS), this might be a good time to start thinking about moving to a colo.
This is the point where the higher cost of the computational resources in AWS outweighs the annual investment in people. The further the load increases, the larger the difference in cost.
If one’s bill is more than $250k/month, then up to 50% savings can be achieved by switching from AWS to a colo.
The savings come from the substantially lower costs of computational power and data transfer. The upfront people costs are quite substantial, so for smaller deployments of fewer than 100 instances it makes no economic sense to use colos.
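The crossover logic can be sketched with a deliberately simplified linear model: a fixed monthly cost (mostly people) plus a cost per instance-equivalent. The AWS rate of $750 per instance-equivalent comes from dividing $120k by 160 instances. The colo parameters are hypothetical, chosen only so that the sketch lands on the 160-instance crossover reported above; they are not the values from the spreadsheet, and the real model is not strictly linear.

```python
def monthly_cost(instances: float, fixed: float, per_instance: float) -> float:
    """Linear cost model: fixed monthly cost plus a per-instance cost."""
    return fixed + per_instance * instances

def crossover_instances(colo_fixed: float, colo_per_instance: float,
                        aws_per_instance: float, aws_fixed: float = 0.0):
    """Instance count above which the colo is cheaper, or None if never."""
    if aws_per_instance <= colo_per_instance:
        return None  # the colo never catches up in this simple model
    return (colo_fixed - aws_fixed) / (aws_per_instance - colo_per_instance)

AWS_PER_INSTANCE = 120_000 / 160   # all-in $/month, derived from the post
COLO_FIXED = 80_000                # hypothetical: staff and other fixed costs
COLO_PER_INSTANCE = 250            # hypothetical: h/w, power, space, uplink

n = crossover_instances(COLO_FIXED, COLO_PER_INSTANCE, AWS_PER_INSTANCE)
print(n)
for count in (100, 160, 400):
    aws = monthly_cost(count, 0.0, AWS_PER_INSTANCE)
    colo = monthly_cost(count, COLO_FIXED, COLO_PER_INSTANCE)
    print(count, aws, colo)
```

Below the crossover the fixed people cost dominates and AWS wins; above it the gap grows linearly with the instance count, which is why the savings keep increasing with the bill.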
The opportunity of reducing the costs for larger clusters may get a lot of people excited, so I would like to warn in advance about some catches:
- In case the demand for the infra cannot be predicted for the provisioning interval (1 year or so), switching to a colo would create too great of a risk of an overload.
- The peak-to-mean ratio for the yearly load should stay below 5. If the expected peak is a lot higher than the average, then average utilization cannot stay high, meaning that most of the resources in the colo will sit idle while the organization is still paying for them. Meanwhile, in AWS the capacity can follow the load a lot more closely, meaning that there will be far fewer idle resources. As a result, in such a scenario the crossover point would be much further out.
- Your organization should be comfortable with the lead times related to hiring people and a colo cluster build-out. Hiring a good engineer might take up to 6 months, and ordering h/w might take another month or two. The transition to a colo should be a more planned action, where all the future users understand and are comfortable with the lead times.
All the data, including the details of the infra load, the h/w employed for the colo cluster model and so on, is available as a CSV file.
Stay tuned and follow @pax_automa for blog updates.