Wednesday 20 March 2013

Release Batching - When is it a Problem?

I have seen quite a few flame-ups, particularly in the Kanban community, around release batching. It is a subject that I have spent a considerable amount of my career addressing.

Most IT Operations teams abhor and resist change. Change usually brings with it uncertainty and the potential for sleepless nights over a failed service. As it is often impossible to replicate the exact production environment in a test setting, there is a real risk that changes have not been tested adequately. There is also the problem that changes often require a service outage, directly affecting customers.

What makes this worse is that most software engineers, and IT Operations people too, are less than diligent about using configuration management to track the changes they make. It is always far easier to hack a change directly into a production system than to write it, check it in, package it and release it properly while managing all the dependencies. But such changes cause configuration drift, so that even in the ideal situation where development and test hardware is identical to production, the environments still behave differently.
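As a deliberately simplified illustration of catching that drift, here is a minimal sketch in Python. It assumes the release pipeline emits a JSON manifest of file checksums alongside each package; the manifest name and deployment path are invented for the example. Anything that was hacked directly onto the box no longer matches what was checked in.

```python
import hashlib
import json
from pathlib import Path


def sha256(path: Path) -> str:
    """Checksum of a file as it actually exists on the server."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(manifest_file: str, deploy_root: str) -> list:
    """Compare what is deployed against the manifest produced from the
    version-controlled release package, returning the paths that differ."""
    manifest = json.loads(Path(manifest_file).read_text())
    drifted = []
    for rel_path, expected_checksum in manifest.items():
        deployed = Path(deploy_root) / rel_path
        if not deployed.exists() or sha256(deployed) != expected_checksum:
            drifted.append(rel_path)
    return drifted


if __name__ == "__main__":
    # Hypothetical manifest and install location; any file that was
    # "fixed in place" on the production box shows up here.
    for path in find_drift("release-manifest.json", "/opt/myapp"):
        print(f"drifted from the release package: {path}")
```

Run routinely, a check like this turns drift from a silent source of "works in test, fails in production" surprises into something visible and fixable.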

Release batching is the way many organizations try to solve this conundrum. By piling all of the changes into one big release, the number of change events goes down. This gives false security to the IT Operations people, who feel they can staff those few events heavily and field all the failures at once. It comforts the testers, who feel they can take the time to test through all the various changes. It also allows the developers to be a bit lazier about the way they check in code and manage builds, because there are long stretches in which to code and to fix broken builds.

But all of these supposed benefits not only hide dysfunction, they take value away from the business. The longer code sits waiting to be shipped, the longer it is not being used to provide value. While the code sits, the assumptions that led to it being written go untested, and even if they were correct at the time, the moment may have passed and the market may have moved on.

The idea that fewer, bigger changes means less downtime is also flawed. A bigger change, by definition, means more has changed, and more changes mean more things that can potentially go wrong. It also means that when something does go wrong, the haystack you have to dig through to find the problem is far bigger. The idea that QA will catch all defects, especially in large changes, is flawed too. The number of defects found gauges how effective your QA team is at finding them, measured against a total defect count that is never known. It is not a particularly good indicator of the number of defects left in the code, the quality or maintainability of the code, or even the problems you might encounter when releasing it. And because changes can interact with one another, batching lots of them together means an exponential increase in the number of potential test cases required to find issues.
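To put a rough number on that growth: if the changes in a batch can interact, every subset of them is a potential combination to exercise. A tiny sketch in plain Python, with illustrative release sizes, shows how quickly the space blows up:

```python
from math import comb

# With n independent changes in a release, any subset of them may interact,
# so the combination space grows as 2^n; even restricting attention to
# pairwise interactions still grows quadratically.
for n in (1, 2, 5, 10, 20):
    print(f"{n:>2} changes -> {comb(n, 2):>3} pairs, {2**n:>9} possible combinations")
```

A release of two changes has four combinations to reason about; a batch of twenty has over a million, which no QA pass can realistically cover.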

While pull mechanisms such as Kanban help capture an understanding of flow within the development phase, release batching counteracts much of the benefit by hiding dysfunctions under a false security blanket. Difficulties in configuration management and automated deployment are solvable, and can be tackled incrementally to reduce release cycle time. By moving progressively towards continuous delivery, ever smaller changes can be understood, tested and released, allowing very quick feedback and even faster rollback when problems occur. Changes become far more atomic, risk becomes better understood and easier to manage, and pull can ultimately be achieved across the entire product lifecycle.
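To make the rollback point concrete, here is a minimal sketch of what a small-batch release loop can look like. The install and health-check functions are placeholders invented for the example, standing in for a real package install and smoke test; the point is that each change is verified on its own and undone immediately if it fails.

```python
import random


def install(version: str) -> None:
    # Placeholder for the real deployment step (package install, symlink flip, etc.).
    print(f"installing {version}")


def health_check() -> bool:
    # Placeholder for a real smoke test against the running service.
    return random.random() > 0.2


def release(current: str, candidate: str) -> str:
    """Release one small change; roll back immediately if the smoke test fails."""
    install(candidate)
    if health_check():
        print(f"{candidate} is live")
        return candidate
    install(current)  # The change is small and atomic, so rollback is cheap.
    print(f"{candidate} failed its health check; rolled back to {current}")
    return current


if __name__ == "__main__":
    live = "1.0.0"
    # Many small releases rather than one big batch: each one is easy to
    # verify, easy to reason about, and easy to undo.
    for candidate in ("1.0.1", "1.0.2", "1.0.3"):
        live = release(live, candidate)
```

A failed release in this model costs one small change and one quick rollback, rather than an all-night hunt through a quarter's worth of batched work.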