DevOps has taken center stage in the world of software development, and most companies are following suit by automating their development and deployment processes. By combining IT operations with software development practices, DevOps can significantly shorten the software development lifecycle. It also delivers frequent updates, features and fixes that stay aligned with current business objectives.
Needless to say, DevOps is a complicated process, and there is no one-size-fits-all solution that will work for every company and every team. What’s more, it can be difficult to correctly gauge the success of your DevOps efforts if you are not familiar with the basic DevOps metrics and what they mean for your business. The four main metrics you should be keeping a keen eye on are: deployment frequency, lead time for changes, restore service time, and change fail percentage.
Deployment Frequency
A signature characteristic of DevOps-driven software development is the possibility of continuous or frequent deployment, which is especially beneficial for cloud-based and high-traffic websites. Small batch development and agile feedback allow for daily software update deployments, in some cases even several times a day. Frequent releases lead to better automation and fewer errors, and help build product trust.
Deployment frequency is a valuable indicator of DevOps efficiency, developer capability, tooling effectiveness, response time and, above all, team cohesiveness. This metric usually has very low variability and is easy to measure. Crucially, deployment frequency deals with production rather than staging environments. This means that you are measuring the real-world value your updates deliver to end users; not what you planned or intended to deploy but for some reason did not.
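As a minimal sketch of how this can be tracked, the snippet below counts production deployments per calendar week from a list of deployment timestamps. The timestamps and their source are hypothetical; in practice you would export them from your CI/CD or deployment tooling.

```python
from collections import Counter
from datetime import datetime

# Hypothetical input: ISO-8601 timestamps of successful production deployments,
# exported from your CI/CD or deployment tooling.
deploy_timestamps = [
    "2024-03-04T10:12:00",
    "2024-03-04T16:40:00",
    "2024-03-06T09:05:00",
    "2024-03-07T14:22:00",
]

def deployments_per_week(timestamps):
    """Count production deployments per ISO calendar week."""
    weeks = Counter()
    for ts in timestamps:
        year, week, _ = datetime.fromisoformat(ts).isocalendar()
        weeks[(year, week)] += 1
    return dict(weeks)

print(deployments_per_week(deploy_timestamps))
# {(2024, 10): 4} -> four production deployments in week 10
```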
Lead Time for Changes
The lead time for changes is most commonly defined as the time needed for committed code to start running successfully in production. This stat speaks to the health and efficiency of your deployment pipeline. To work out your lead time for changes, first record the exact time when each revision is started in AWS CodePipeline; then, as soon as the final action of the deployment pipeline runs for that revision, record that timestamp as well.
The difference between these two timestamps is your lead time for changes. A single lead time does not provide a lot of information when viewed in isolation, but averaging these numbers over a longer period of time can provide valuable insight. There are many factors that can affect your lead time for changes, including the amount of technical debt, code and architecture complexity, and the overall competence of your team and quality of teamwork. Because it takes so many different variables into account, it is one of the most holistic metrics out there. If your lead time for changes is too long, that is probably a sign that you have a weak link somewhere in your development chain.
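If your pipeline lives in AWS CodePipeline, a rough sketch of this calculation might look like the snippet below, which uses boto3 to average the duration of successful pipeline executions. The pipeline name is a placeholder, and the sketch assumes the pipeline is triggered directly by each commit; otherwise it measures pipeline start to completion rather than the full commit-to-production lead time.

```python
import boto3

codepipeline = boto3.client("codepipeline")

def average_lead_time_seconds(pipeline_name):
    """Average duration of successful pipeline executions, in seconds.

    Assumes the pipeline is triggered by each commit, so execution start
    approximates the moment the revision entered the pipeline.
    """
    response = codepipeline.list_pipeline_executions(pipelineName=pipeline_name)
    durations = [
        (summary["lastUpdateTime"] - summary["startTime"]).total_seconds()
        for summary in response["pipelineExecutionSummaries"]
        if summary["status"] == "Succeeded"
    ]
    return sum(durations) / len(durations) if durations else None

# "my-delivery-pipeline" is an example name.
print(average_lead_time_seconds("my-delivery-pipeline"))
```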
Restore Service Time
Also known as MTTR, or mean time to recover, this metric tells you how much time, on average, it takes to restore services after they go down. The most common way of obtaining this data in AWS is to test key use cases in production by running automated synthetic tests. Any resulting failures are captured, and you can then track how long it takes for the same test to run successfully again.
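As a simple illustration, the sketch below derives MTTR from a hypothetical history of synthetic test runs: each outage starts with the first failing run and ends with the next passing run, and MTTR is the average of those recovery windows.

```python
from datetime import datetime
from statistics import mean

# Hypothetical synthetic-test history: (timestamp, passed) pairs in time order,
# exported from your monitoring or canary tooling.
test_runs = [
    (datetime(2024, 3, 4, 10, 0), True),
    (datetime(2024, 3, 4, 10, 5), False),   # outage begins
    (datetime(2024, 3, 4, 10, 10), False),
    (datetime(2024, 3, 4, 10, 35), True),   # restored after 30 minutes
    (datetime(2024, 3, 5, 9, 0), False),    # second outage
    (datetime(2024, 3, 5, 9, 20), True),    # restored after 20 minutes
]

def mean_time_to_recover(runs):
    """Average time from the first failing run to the next passing run."""
    recoveries = []
    failure_started = None
    for timestamp, passed in runs:
        if not passed and failure_started is None:
            failure_started = timestamp
        elif passed and failure_started is not None:
            recoveries.append((timestamp - failure_started).total_seconds())
            failure_started = None
    return mean(recoveries) if recoveries else None

print(mean_time_to_recover(test_runs) / 60, "minutes")  # -> 25.0 minutes
```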
Beyond showing how quickly your system can recover, MTTR is also a great learning tool. Collaboration is at the core of DevOps culture, and having your entire team work together to diagnose failures, introduce better tests and repair broken software will improve overall cooperation, quality and team confidence. As your DevOps maturity grows, you should expect to see your MTTR shrink.
Change Fail Percentage
A failed deployment is most commonly defined as one that results in degraded or impaired service and requires immediate correction. The remediation can take the form of a rollback, a fix-forward or a patch. It is important to note that changes that fail to deploy are not counted as change failures, as they do not represent systemic failures.
The change failure rate measures what percentage of changes result in failure. It answers the question: “What percentage of deployments result in broken builds or service outages?” To compute an accurate change failure rate, you need to track each deployment and note whether it was successful or not. By doing this, you gain insight into the ratio of successful to broken deployments to production.
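A minimal sketch of this bookkeeping is shown below; the deployment log and its field names are hypothetical, standing in for whatever record you keep of production deployments and the incidents they cause.

```python
# Hypothetical deployment log: each entry records whether the deployment
# degraded service and required remediation (rollback, fix-forward, or patch).
deployments = [
    {"id": "d-101", "caused_incident": False},
    {"id": "d-102", "caused_incident": True},
    {"id": "d-103", "caused_incident": False},
    {"id": "d-104", "caused_incident": False},
]

def change_failure_rate(deploys):
    """Percentage of production deployments that led to degraded service."""
    if not deploys:
        return None
    failed = sum(1 for d in deploys if d["caused_incident"])
    return 100 * failed / len(deploys)

print(f"{change_failure_rate(deployments):.0f}%")  # -> 25%
```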
The change failure rate is inversely related to the quality of your DevOps process. If you note a consistently high or rising failure rate, this is a clear indication of serious problems in the overall DevOps effort. A DORA report from 2018 states that top performers have a change failure rate of 0-15%, while low performers range from 46% to 60%.
Metrics to Avoid
While the four metrics above will help you assess and improve your DevOps culture and structures, there are plenty of metrics out there that could cause more harm than good. When transitioning to DevOps, some traditional metrics become obsolete. Mean Time Between Failures (MTBF) is one example: with deliveries happening far more frequently, failures are to be expected, and how often they occur should not drive your business decisions.
Conflict metrics should also be avoided, because they tend to promote individual contributions as opposed to team performance, and can even result in teams being pitted against each other. Some examples of this are ranking employees based on broken builds and rewarding high performers who do not collaborate.
The final type of metric to avoid in DevOps is commonly known as the vanity metric, which promotes speed or quantity over quality and value. Examples include any metrics that capture the number of lines of code deployed, the number of bugs reported and fixed, or the speed of new commits. It is easy to fall into the habit of collecting a metric just because it is easy to collect, but this can lead to negative behaviors and misleading information.