Optimizing the costs of an IT infrastructure while improving its security and effectiveness.
Cloud services are very powerful, but they are also very easy to use in the wrong way.
In this article we analyze the errors we found in the cloud resource management of a company acquired by one of our customers: errors caused by incorrect engineering assessments and by flawed DevOps processes.
Over the years these errors caused a huge and unnecessary waste of energy and money, had a negative impact on the security of the IT infrastructure, and lengthened the deployment time of the platform, which was still carried out manually.
Until our involvement, the company had never addressed the issue: first and foremost, it had never detected the problem, since everything was still operational and functioning, and the cost of the infrastructure could be absorbed by high revenues.
Very often companies and teams are reluctant to change the status quo, whether out of fear of causing damage and downtime, or simply out of habit and inertia. A correct and accurate assessment, however, can:
- Save millions of euros/dollars
- Optimize the infrastructure and its security
- Improve work and collaboration between teams
Our client acquired a company and its platforms, and asked us to evaluate and optimize its infrastructure.
This company delivered its services from the AWS cloud using hundreds of virtual machines (EC2 instances) running Windows, along with several RDS instances. Running the company on these resources was expensive: almost a million dollars a year!
After a detailed analysis we classified the problems into two categories: infrastructural errors and engineering errors.
Infrastructural errors:
1. Oversizing and wrong instance types: the same quality of service offered to users could be guaranteed by a smaller number of EC2 instances, of different types, organized more efficiently.
We analyzed the resource usage (CPU, RAM) of each EC2 instance over a period of time, which allowed us to reduce the resources allocated to each one. Moreover, many of these instances could be converted to burstable t2 instances, halving the allocated resources and, consequently, the management costs.
This simple operation allowed us to halve the monthly cost of each EC2 instance.
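The rightsizing logic described above can be sketched as a simple heuristic. The thresholds below are hypothetical placeholders, not the ones used in the actual assessment; in practice the figures come from profiling each instance over an observation window.

```python
# Illustrative rightsizing heuristic (hypothetical thresholds, not an AWS API):
# given average and peak utilization sampled over an observation window,
# suggest whether an instance can be downsized or moved to a burstable type.

def rightsizing_suggestion(avg_cpu_pct, peak_cpu_pct, avg_ram_pct):
    """Return a hypothetical recommendation based on utilization percentages."""
    if avg_cpu_pct < 20 and peak_cpu_pct < 60 and avg_ram_pct < 40:
        # Low baseline with occasional modest peaks: a burstable (t2) type fits.
        return "convert to burstable t2 instance"
    if avg_cpu_pct < 40 and avg_ram_pct < 50:
        # Consistently underused: try one size smaller in the same family.
        return "downsize within the same family"
    return "keep current size"

# Example: a machine idling at 10% CPU / 30% RAM with peaks at 45% CPU.
print(rightsizing_suggestion(10, 45, 30))  # -> convert to burstable t2 instance
```

The point of encoding the rule is repeatability: the same criteria can be applied to hundreds of instances instead of judging each one by eye.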
2. Organization of S3 buckets: some buckets pointed at each other, forming cycles that sharply increased the number of requests (and therefore the cost) needed to obtain the data.
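Cycles like these are easy to miss by inspection but easy to find programmatically. As a sketch, the bucket references can be modeled as a hypothetical directed graph (a dict mapping each bucket to the buckets it points at) and checked with a depth-first search:

```python
# Sketch: detecting reference cycles between S3 buckets. The graph below is a
# hypothetical dict mapping each bucket to the buckets it points at; DFS with
# a "currently visiting" set finds any cycle.

def find_cycle(graph):
    """Return one list of buckets forming a cycle, or None if acyclic."""
    visiting, visited = set(), set()

    def dfs(node, path):
        visiting.add(node)
        path.append(node)
        for nxt in graph.get(node, []):
            if nxt in visiting:              # back edge: a cycle was found
                return path[path.index(nxt):]
            if nxt not in visited:
                found = dfs(nxt, path)
                if found:
                    return found
        visiting.discard(node)
        visited.add(node)
        path.pop()
        return None

    for node in list(graph):
        if node not in visited:
            cycle = dfs(node, [])
            if cycle:
                return cycle
    return None

buckets = {"assets": ["cache"], "cache": ["assets"], "logs": []}
print(find_cycle(buckets))  # -> ['assets', 'cache']
```

Once a cycle is identified, one of its edges can be removed or replaced with a direct reference to the data's true origin.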
3. Unassociated Elastic IPs: some public IP addresses were not associated with any instance. This is a cost in itself, because AWS charges for Elastic IPs that are allocated but not in use.
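Finding these addresses is a one-liner over the address list. The records below mimic the shape of the EC2 `describe_addresses` response (the field names `PublicIp` and `AssociationId` are the real AWS ones; the data is invented): an address without an `AssociationId` is allocated but attached to nothing.

```python
# Sketch: flagging unassociated Elastic IPs from describe_addresses-style
# records (invented sample data, real AWS field names).

addresses = [
    {"PublicIp": "203.0.113.10", "AssociationId": "eipassoc-1"},
    {"PublicIp": "203.0.113.11"},                 # allocated, never attached
    {"PublicIp": "203.0.113.12", "AssociationId": "eipassoc-2"},
]

unused = [a["PublicIp"] for a in addresses if "AssociationId" not in a]
print(unused)  # -> ['203.0.113.11']
```

Each address in `unused` can then be released, which stops the recurring charge immediately.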
4. Wrong volume management: several test instances in the infrastructure had large volumes attached, and each volume carries a monthly cost proportional to its size.
In addition, hundreds of volumes were not attached to any instance at all.
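The waste from unattached volumes is straightforward to quantify. A minimal sketch, using an invented inventory and a hypothetical per-GB price (real EBS prices vary by region and volume type):

```python
# Sketch: estimating the monthly cost of EBS volumes not attached to any
# instance. The price per GB-month is a hypothetical placeholder.

PRICE_PER_GB_MONTH = 0.10  # hypothetical figure, for illustration only

volumes = [
    {"id": "vol-1", "size_gb": 500, "attached": True},
    {"id": "vol-2", "size_gb": 200, "attached": False},
    {"id": "vol-3", "size_gb": 300, "attached": False},
]

wasted_gb = sum(v["size_gb"] for v in volumes if not v["attached"])
print(wasted_gb * PRICE_PER_GB_MONTH)  # -> 50.0 (dollars per month)
```

Multiplied by hundreds of forgotten volumes, this line item alone becomes significant.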
5. Use of inappropriate volume types: the infrastructure had a very high number of io1 volumes, i.e. volumes able to sustain very high, constant IOPS over time; half of the volume management cost came from them. Machine profiling revealed that only a few instances actually needed to handle such traffic (e.g. the master database server). After studying the incoming traffic to these machines, we realized that gp2 volumes were ideal: when necessary they can withstand IOPS peaks for limited periods, while under normal conditions their baseline is enough for the average incoming traffic.
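The io1-to-gp2 decision follows from the gp2 performance model: gp2 volumes have a baseline of 3 IOPS per GiB (minimum 100) and, below roughly 1 TiB, can burst to about 3,000 IOPS on accumulated credits. A sketch of the fit check (figures reflect gp2 behaviour as documented at the time; verify against current AWS docs):

```python
# Sketch of the gp2 performance model behind the io1 -> gp2 migration:
# baseline 3 IOPS/GiB (min 100), burst ceiling ~3,000 IOPS below ~1 TiB.

def gp2_baseline_iops(size_gib):
    return max(100, 3 * size_gib)

def gp2_can_serve(size_gib, avg_iops, peak_iops):
    """True if average load fits the baseline and peaks fit the burst ceiling."""
    burst_ceiling = 3000 if size_gib < 1000 else gp2_baseline_iops(size_gib)
    return avg_iops <= gp2_baseline_iops(size_gib) and peak_iops <= burst_ceiling

# A 200 GiB volume: 600 baseline IOPS, able to burst to ~3,000.
print(gp2_can_serve(200, 400, 2500))  # -> True
print(gp2_can_serve(200, 900, 2500))  # -> False (average exceeds baseline)
```

Volumes whose average sits under the baseline and whose peaks fit under the burst ceiling gain nothing from io1's provisioned IOPS premium.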
Engineering errors:
1. Windows licenses: the application server ran ColdFusion with IIS on Windows Server. Our solution was to dockerize ColdFusion with Apache on Linux servers, avoiding the cost of Windows licenses.
2. MSSQL licenses: similarly to the previous point, the dockerized Linux version of MSSQL was sufficient to deliver the service.
3. Inefficient deployment: deployment was done by extracting a compressed file, containing the updated code, from an FTP server. The process was unnecessarily expensive and slow. We re-engineered the deployment process: since all the applications and the database were containerized, we could apply the know-how we had acquired working with very large organizations. The process we designed includes a QA environment where changes must be approved before reaching production, which brings all the benefits of such a pipeline: greater user satisfaction and shorter time-to-market.
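The QA gate at the heart of the new process can be modeled as a small state machine. This is an illustrative sketch with invented names, not the actual pipeline code: a build must be deployed to QA and explicitly approved before promotion to production is allowed.

```python
# Illustrative model of the QA approval gate in a deployment pipeline
# (hypothetical class and method names, for illustration only).

class Build:
    def __init__(self, version):
        self.version = version
        self.stage = "built"

    def deploy_to_qa(self):
        self.stage = "qa"

    def approve(self):
        if self.stage != "qa":
            raise RuntimeError("only builds in QA can be approved")
        self.stage = "approved"

    def promote_to_production(self):
        # The gate: unapproved builds can never reach production.
        if self.stage != "approved":
            raise RuntimeError("build not approved in QA")
        self.stage = "production"

b = Build("1.4.2")
b.deploy_to_qa()
b.approve()
b.promote_to_production()
print(b.stage)  # -> production
```

In a real pipeline the same gate is enforced by the CI/CD tool (a manual approval step), but the invariant is identical: no path to production bypasses QA.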
The application on the application server consumed a lot of RAM; the dev team simply asked the operations team to increase the RAM on the application servers. This produced additional costs that could have been avoided by investigating the application's performance and working on refactoring.
1. It is important to question your infrastructure and be critical of what you have: given the number of resources involved and the complexity of cloud services, even small mistakes inevitably lead to high costs.
2. Downsizing your infrastructure does not mean delivering a worse service, but simply eliminating unnecessary costs caused by unnoticed errors and by a lack of knowledge of the tools in use.
Why is CI/CD important?
1. Greater user satisfaction: the QA environment helps catch bugs before the code reaches production, supporting an excellent service.
2. Faster time-to-market: automating the deployment of new features of your products/services eliminates the downtime of manual deployment. Publishing faster means receiving user feedback sooner and, consequently, improving products/services faster.
Finally, after applying all the improvements described above, our client now saves one million dollars a year.