Monday, February 9, 2015

The Three Rules of the DevOps Game

“Of course we do agile development,” she told me. “That’s just table stakes. What we need to do now is learn how to play the DevOps game. We need to know how to construct a deployment pipeline, how to keep test automation from turning into a big ball of mud, whether micro-services are just another fad, what containers are all about. We need to know if outsourcing our infrastructure is a good long term strategy and what happens to DevOps if we move to the cloud.”

Ask the right questions

Imagine something we will call the IT stack. At one end of the stack is hardware and at the other end customers get useful products and services. The game is to move things through the stack in a manner that is responsive, reliable, and sustainable. The first order of business is to understand what responsive, reliable, and sustainable mean in your world. Then you need to be the best in your field at providing products and services that strike the right balance between responsiveness, reliability and sustainability.

1. What does it mean to be Responsive?

In many industries, responsive has come to mean devising and delivering features through the entire IT stack in a matter of minutes or hours. From hosted services to bank trading desks, the ability to change software on demand has become an expected practice. In these environments, a deployment pipeline is essential. Teams have members from every part of the IT stack. Automation moves features from idea, to code, to tested feature, to integrated capability, to deployed service very quickly.
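The essay stays tool-agnostic, so here is a minimal, hypothetical sketch of the idea in Python: a change moves through a sequence of automated stages (idea, to code, to tested feature, to integrated capability, to deployed service) and stops at the first gate that fails. The stage names and checks are illustrative, not drawn from any particular CI tool.

```python
# Minimal sketch of a deployment pipeline as a sequence of automated gates.
# Stage names and the shape of a "change" are illustrative assumptions.

def run_pipeline(change, stages):
    """Run a change through each stage in order; stop at the first failure."""
    for name, check in stages:
        if not check(change):
            return f"stopped at {name}"
    return "deployed"

stages = [
    ("commit",    lambda c: c["compiles"]),
    ("test",      lambda c: c["tests_pass"]),
    ("integrate", lambda c: c["integration_ok"]),
    ("deploy",    lambda c: True),  # automated deployment is the final stage
]

print(run_pipeline(
    {"compiles": True, "tests_pass": True, "integration_ok": True},
    stages,
))  # prints "deployed"
```

The point of the sketch is the structure, not the checks: every change takes the same automated path, so the pipeline itself encodes what "done" means for the whole IT stack.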

Companies that live in this fast-moving world invest in tools to manage, test, and deploy code, tools to maintain infrastructure, and tools to monitor production environments. In this world, automation is essential for rapid delivery, comprehensive testing, and automated recovery when (not if, but when) things go wrong.

On the other end of the spectrum are industries where responsiveness is a distant second to safety: avionics, medical devices, chemical plant control systems. Even here, software is expected to evolve, just more slowly. Consider Saab’s Gripen, a small reconnaissance and fighter jet with a purchase and operational cost many times lower than any comparable fighter. Over the past decade, the core avionics systems of the Gripen have been updated at approximately the same rate as major releases of the Android operating system. Moreover, Gripen customers can swap out tactical modules and put in new ones at any time, with no impact on the flight systems. This “smartphone architecture” extends the useful life of the Gripen fighter by creating subsystems that use well-proven technology and are able to change independently over time. In the slow-moving aircraft world, the Gripen is a remarkably responsive system.

2. What does it mean to be Reliable?

There are two kinds of people in the world – optimists and pessimists – the risk takers and the risk-averse – those who chase gains and those who fear loss. Researcher Tory Higgins calls the two world views “promotion-focus” and “prevention-focus”. If we look at the IT stack, one end tends to be populated with promotion-focus people who enjoy creating an endless flow of new capabilities. [Look! It works!] As you move toward the other end of the stack, you find an increasing number of prevention-focused people who worry about safety and pay a lot of attention to the ways things could go wrong. They are sure that anything which CAN go wrong eventually WILL go wrong.

These cautious testers and operations people create friction, which slows things down. The slower pace tends to frustrate promotion-focused developers. To resolve this tension, a simple but challenging question must be answered: What is the appropriate trade-off between responsiveness and safety FOR OUR CUSTOMERS AT THIS TIME? Depending on the answer, the scale may tip toward a promotion-focused mindset or a prevention-focused mindset, but it is never appropriate to completely dismiss either mindset.

Consider Jack, whose team members were so frustrated with the slow pace of obtaining infrastructure that they decided to deploy their latest update in the cloud. Of course they used an automated test harness, and they appreciated how fast their tests ran in the cloud. Once all of the tests passed, the team deployed a cloud-based solution to a tough tax calculation problem. A couple of nights later, Jack had just put his children to bed when the call came: “A lot of customers are complaining that the system is down.” He got on his laptop and rebooted the system, praying that no one had lost data in the process. Around midnight another call came: “The complaints are coming in again. Maybe you had better check on things regularly until we can look at it in the morning.” It was a sleepless night – something new for Jack. These were the kinds of problems that operations used to handle, but since operations had been bypassed, it fell to the development team to monitor the site and keep the service working. This was a new and unpleasant experience. First thing in the morning, the team members asked an operations expert to join them. They needed help discovering and dealing with all of the ways that their “tested, integrated, working” cloud-based service could fail in actual use.

The cause of the problem turned out to be a bit of code that expected the environment to behave in a particular way, and in certain situations the cloud environment behaved differently. The team decided to use containers to ensure a stable environment. They also set up a monitoring system so they could see how the system was operating and get early warnings of unusual behavior. They discovered that their code had more dependencies on outside systems than they knew about, and they hoped that monitoring would alert them to the next problem before it impacted customers. The team learned that all of this extra work brought its own friction, so they asked operations to give them a permanent team member to advise them and help them deploy safely – whether to internal infrastructure or to the cloud.
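As a rough illustration of the kind of early-warning monitoring the team added, here is a small Python sketch that compares a live metric (response time, say) against a rolling baseline and flags unusual behavior before customers start calling. The window size and threshold factor are arbitrary assumptions, not a recommendation.

```python
# Sketch of a lightweight early-warning monitor: alert when the latest
# reading is far above the recent baseline. Parameters are illustrative.
from collections import deque

class Monitor:
    def __init__(self, window=5, factor=3.0):
        self.recent = deque(maxlen=window)  # rolling window of readings
        self.factor = factor                # how far above baseline is "unusual"

    def observe(self, response_ms):
        """Record a reading; return True if it looks abnormal."""
        if len(self.recent) == self.recent.maxlen:
            baseline = sum(self.recent) / len(self.recent)
            if response_ms > baseline * self.factor:
                return True  # raise the alarm before customers complain
        self.recent.append(response_ms)
        return False

m = Monitor()
for _ in range(5):
    m.observe(100)       # normal readings build the baseline
print(m.observe(500))    # prints True: a spike well above baseline
```

Real monitoring systems are far richer than this, but the design choice is the same one the team made: watch the system's behavior continuously so the next environmental surprise shows up as an alert, not a midnight phone call.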

Of course no one was in mortal danger when Jack’s system locked up – because it wasn’t guiding an aircraft or pacing a heartbeat. So it was fine for his team to learn the hard way that a good dose of prevention-focus is useful for any system, even one running in the cloud. But you do not want to put naive teams in a position where they can generate catastrophic results.

It is essential to understand the risk of any system in terms of: 1) probability of failure, 2) ability to detect failure, 3) resilience in recovering from failure, 4) level of risk that can be tolerated, and 5) remediation required to keep the risk acceptable. Note that you do not want this understanding to come solely from people with a prevention-focused mindset (e.g., auditors) nor solely from people with a promotion-focused mindset. Your best bet is to assemble a mixed team that can strike the right balance – for your world – between responsiveness and reliability.

3. What does it mean to be Sustainable?

We know that technology does not stand still; in fact, most technology grows obsolete relatively quickly. We know that the reason our systems have software is so that they can evolve and remain relevant as technology changes. But what does it take to create a system in which evolution is easy, inexpensive and safe? A software-intensive system that readily accepts change has two core characteristics – it is understandable and it is testable.

a. What does it mean to be understandable?

If a system is going to be safely changed, then members of a modest sized team[1] must be able to wrap their minds around the way the system works. In order to understand the implications of a change, this team should have a clear understanding of the details of how the system works, what dependencies exist, and how each dependency will be impacted by the change.

An understandable system is bounded. Within the boundaries, clarity and simplicity are essential because the bounded system must never outgrow the team’s capacity to understand it, even as the team members change over time. The boundaries must be hardened and communication through the boundaries must be limited and free of hidden dependencies.

Finally, the need for understanding is fractal. As bounded sub-systems are wired together, the resulting system must also be understandable. As we create small, independently deployable micro-services, we must remember that these small services will eventually get wired together into a system, and a lot of micro-things with multiple dependencies can rapidly add up to a complex, unintelligible system. If a system – at any level – is too complex to be understood by a modest sized team, it cannot be safely modified or replaced; it is not renewable.

b. What does it mean to be testable?

A testable system, sub-system, or service is one that can be tested both within its boundaries and at each interface with outside systems. For example, consider service A, which runs numbers through a complex algorithm and returns a result. The team responsible for this service develops a test harness along with their code that ensures the service returns the correct answer given expected inputs. The team also creates a contract which clearly defines acceptable inputs, the rate at which it can accept inputs, and the format and meaning of the results it returns. The team documents this contract by writing contract tests, which are made available to any team that wishes to invoke the service. Assume that service B would like to use service A. Then the team responsible for service B must place the contract tests from service A in its automated test suite and run them any time a change is made. If the contract tests for service A are comprehensive and the testing of service B always includes the latest version of these tests, then the dependency between the services is relatively safe.
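The contract-test idea can be sketched in a few lines of Python. Everything here is hypothetical: service A's algorithm is a stand-in, and the point is the shape of the arrangement, in which the provider publishes the test list and every consumer runs it on every change.

```python
# Hypothetical sketch of a published contract for "service A".
# The provider ships CONTRACT_TESTS; consumers run them in their own suites.

def service_a(x):
    """Stand-in implementation; the contract, not this body, is the point."""
    return x * x + 1

CONTRACT_TESTS = [
    # (input, expected result) pairs defining acceptable inputs and outputs
    (0, 1),
    (2, 5),
    (-3, 10),
]

def run_contract_tests(service):
    """A consumer calls this against the version of service A it depends on."""
    return all(service(x) == expected for x, expected in CONTRACT_TESTS)

print(run_contract_tests(service_a))  # prints True
```

Service B would wire `run_contract_tests` into its own automated suite, so any drift between what service A promises and what service B assumes is caught at test time rather than in production.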

Of course it’s not that simple. What if service A wants to change its interface? Then it is expected to maintain two interfaces, an old version and a new version, until service B gets around to upgrading to the new interface. And every service invoking service A is expected to keep track of which version it is certified to use.

Then again, service A might want to call another service – let’s say service X – and so service A must pass all of the contract tests for service X every time it makes a change. And since service X might branch off a new version, service A has to deal with multi-versioning on both its input and its output boundaries.
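One way to picture the multi-version obligation, as a rough Python sketch with invented names: service A keeps both interface versions alive side by side, and each consumer records the version it is certified to use.

```python
# Illustrative sketch of service A maintaining two interface versions
# while consumers migrate. Version names and formats are assumptions.

def service_a_v1(x):
    return x * x + 1  # old interface: returns a bare number

def service_a_v2(x):
    return {"result": x * x + 1, "units": "dimensionless"}  # richer format

INTERFACES = {"v1": service_a_v1, "v2": service_a_v2}

# Each consuming service records which version it is certified against.
CERTIFIED = {"service_b": "v1", "service_c": "v2"}

def call_service_a(consumer, x):
    """Route a consumer's call to the interface version it is certified for."""
    version = CERTIFIED[consumer]
    return INTERFACES[version](x)

print(call_service_a("service_b", 2))  # prints 5 (old format)
```

Retiring `v1` means first moving every entry in the certification table to `v2`; the sketch makes visible why the bookkeeping multiplies as services sit on both sides of versioned boundaries.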

If you have trouble wrapping your head around the last three paragraphs, you probably appreciate why it is extremely difficult to keep an overall system with multiple services in an understandable, testable state at all times. Complexity tends to explode as a system grows, so the battle to keep systems understandable and testable must be fought constantly over the lifetime of any product or service.

A Reference Architecture

Over the last couple of decades, the most responsive, reliable, renewable systems seem to have platform-application architectures. (The smartphone is the most ubiquitous example.) Platforms such as Linux, Android, and Gripen avionics focus on simplicity, low dependency, reliability, and slow evolution. They become the base for swappable applications which are required to operate in isolation, with minimum dependencies. Applications are small (members of a modest sized team can get their heads around a phone app), self-sufficient (apps generally contain their own data or retrieve it through a hardened interface), and easy to change (but every change has to be certified). If an app becomes unwieldy or obsolete, it is often easiest to discard it and create a new one. While this may appear to be a bit wasteful, it is the ability of a platform-app architecture to easily throw out old apps and safely add new ones that keeps the overall ecosystem responsive, fault tolerant, and capable of evolving over time.

So these are the three rules of the DevOps game: Be responsive. Be reliable. Be sure your work is sustainable.

[1] What is a modest sized team? We have found that in hardware-software environments, a team the size of a military platoon (three squads) is often a good size for major sub-systems. Robin Dunbar found in his research that a hunting group (30-40 people) brings the diversity of skills necessary to achieve a major objective. See the essays “Before There Was Management” and “The Scaling Dilemma.”