LeanEssays: 2015

Wednesday, August 19, 2015

Friction

One third of the fuel that goes into a car is spent overcoming friction. By comparison, an electric car loses half as much energy - one sixth - to friction. Who knew electric cars had such an advantage?

Friction is the force that resists motion when the surface of one object comes into contact with the surface of another. You can imagine parts moving against one another in cars, but do you ever wonder what happens when your products and services come in contact with customers? Might this create some friction? Could there be competing offerings that create considerably less friction? If so, you can be sure your customers will find the low friction offering more attractive than yours.

Friction in the Customer Journey

Think of friction as the cognitive overhead that a system places on those who use it. Let’s consider the friction involved in taking a taxi from an airport to a hotel. When I arrive at most airports, I get in a taxi queue, heeding the conspicuous warnings not to ride with unauthorized drivers. When I reach the front of the queue, I take the next taxi in line, and I assume that the cost is the same no matter which taxi I take. But this is not true in Stockholm, where taxis can charge any rate they wish simply by posting it in the window. Nor is it true in many other locations, so I have learned to research the taxi systems in every city I visit. That’s cognitive load. I also bring enough local currency to pay for a taxi ride to my hotel and check on whether a tip is expected. More cognitive load.

Uber set out to remove the friction from taking a taxi by reimagining the entire experience, from hailing to routing to paying; from drivers and cars to insurance and regulations. By removing as many points of friction as possible for riders, Uber has become wildly popular in a very short time. In January 2015, four years after launch, Uber reported that its revenue in its home city of San Francisco had grown to three times the size of the entire taxi market in that city. Uber has recently opened a robotics center in Pittsburgh and joins Google in working to create a practical driverless car. Its intent is to bring the cost and convenience of ride services to a point where owning a car becomes the expensive option.

Full Stack Startups

Uber is among the largest of a new crop of startups – investor Chris Dixon calls them full stack startups – that bypass incumbents and reinvent the entire customer experience from start to finish with the aim of making it as frictionless as possible. Full stack startups focus on creating a world that works the way it should work, given today’s technology, rather than optimizing the way it does work, given yesterday’s mental models. Because these companies are creating a new end-to-end experience, they rarely leverage existing capabilities aimed at serving their market; they develop a “full stack” of new capabilities.

"The challenge with the full stack approach is you need to get good at many different things: software, hardware, design, consumer marketing, supply chain management, sales, partnerships, regulation, etc. The good news is that if you can pull this off, it is very hard for competitors to replicate so many interlocking pieces." Chris Dixon

Large companies have the same full stack of capabilities as startups, but these capabilities lie in different departments and there is friction at every department boundary. Moreover, incumbents are deeply invested in the way things work today, so large incumbent companies are usually incapable of truly reinventing a customer journey. As hard as they try to be innovative, incumbents tend to be blind to the friction embedded in the customer journey that they provide today.

Consider banks. They have huge, complex back end systems-of-record that are expensive to maintain and keep secure. But customers expect mobile access to their bank accounts, so banks have added “front end teams” to build portals (mobile apps) to access the back end systems. Typically banks end up with what Gartner calls “Bimodal IT.” One part of IT handles the backend systems using traditional processes, while a different group uses different processes to deliver web and mobile apps. As a result, the front end teams are not able to reimagine the customer journey; they are locked into the practices and revenue models embedded in the back end systems. So in the end, banks have done little to change the customer journey, the fee structure, or anything else fundamental to banking.

For example, in the US it is nearly impossible for me to transfer money to my granddaughter’s bank account without physically mailing a check or paying an exorbitant wire transfer fee. Not only that, but I cannot use my chip-and-pin card at many places in Europe because US banks don’t let me enter a pin with my card (they still depend on signatures!), while unstaffed European kiosks always require a pin. Anyone who banks in Europe would find US banking practices archaic. I find that they generate a lot of friction and I expect that they cost a lot of money.

Creative Friction

Why do banks adopt Bimodal IT? According to Gartner, “There is an inherent tension between doing IT right and doing IT fast.” I respectfully disagree; there is nothing inherently wrong about being fast. In fact, when software development is done right, speed, quality and low cost are fully compatible. Hundreds of enterprises, including Amazon and Google (whose systems manage billions of dollars of revenue every month), have demonstrated that the safest approach to software development is automated, adaptive, and fast.

It is true that there is tension between different disciplines: front end and back end; dev and ops; product and technology. But the best way to leverage these tensions is not to separate the parties, but to put them together on the same team with a common goal. You will never have a great product, or a great process, without making tradeoffs – that is the nature of difficult engineering problems. If your teams lack multiple perspectives on a problem, they will be unable to make consistently good tradeoff decisions, and their results will be mediocre.

Friction in the Code

The Prussian general and military theorist Carl von Clausewitz (1780-1831) thought of friction as the thing which tempers the good intentions of generals with the reality of the battlefield. He was thinking of the friction caused by boggy terrain that horses cannot cross, soldiers exhausted by heat and heavy burdens, fog that obscures enemy positions, supplies that don’t keep pace with military movements. He noted that battalions are made up of many individuals moving at different rates with different amounts of confusion and fear, each one affecting the others around him in unpredictable ways. It is impossible for the thousands of individual agents on the battlefield to behave exactly according to a theoretical plan, Clausewitz wrote. Unless generals have actually experienced war, he said, they will not be able to account for the accumulated friction created by all of these agents interacting with each other and their environment.

Anyone who has ever looked closely at a large code base would be forgiven for thinking that Clausewitz was writing about software systems. Over time, any code base acquires lots of moving parts, and more and more friction develops between these parts, until eventually the situation becomes hopeless and the system is either replaced or abandoned. Unless, of course, the messy parts are systematically cleaned up and friction is kept in check. But who is allowed to take time for this sort of refactoring if the decision-makers have never written any code, never been surprised by hidden dependencies, never been bitten by the unintended consequences of seemingly innocuous changes?

Failure

Not long ago the New York Stock Exchange was shut down for half a day due to “computer problems.” It’s not uncommon for airline reservation systems to suffer from “computer problems” so severe that planes are grounded. But we don’t expect to hear about “computer problems” at Twitter or Dropbox or Netflix or similar systems – maybe they had problems a few years ago, but they seem to be reasonably reliable these days. The truth is, cloud-based systems fail all the time, because they are built on unreliable hardware running over unreliable communication links. So they are designed to fail, to detect failure, and to recover quickly, without interrupting or corrupting the services they provide. They appear to be reliable because their robust failure detection and recovery mechanisms isolate users from the unreliable infrastructure.

The first hint of this approach was Google’s early strategy for building a server farm. They used cheap off-the-shelf components that would fail at a known rate, and then they automated failure detection and recovery. They replicated server contents so nothing was lost during a failure, and they automated the monitoring, detection, and recovery process. Amazon built its cloud with the same philosophy – they knew that at the scale they intended to pursue, everything would fail sooner rather than later, so automated failure detection and recovery had to be designed into the system.

Designing failure recovery into a system requires a special kind of software architecture and approach to development. To compensate for unreliable communication channels, messaging is usually asynchronous and on a best-efforts basis. Because servers are expected to fail, interfaces are idempotent so you get the same results on a retry as you get the first time. Since distributed data may not always match, software is written to deal with the ambiguities and produce eventual consistency.
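To make the idea of an idempotent interface concrete, here is a minimal Python sketch. The FlakyLedger class, its deposit method, and the request IDs are hypothetical names invented for this illustration; they are not taken from any real system. The sketch assumes the caller attaches a unique ID to each request, so the service can recognize a retry and return the original result instead of applying the change twice.

```python
class FlakyLedger:
    """Hypothetical service whose replies are sometimes lost in transit."""

    def __init__(self, replies_to_drop):
        self.replies_to_drop = replies_to_drop  # how many replies will be "lost"
        self.balance = 0
        self.applied = {}  # request_id -> balance recorded when first applied

    def deposit(self, request_id, amount):
        # Idempotency: apply each request ID at most once.
        if request_id not in self.applied:
            self.balance += amount
            self.applied[request_id] = self.balance
        # Simulate the reply being lost AFTER the work was done.
        if self.replies_to_drop > 0:
            self.replies_to_drop -= 1
            raise TimeoutError("reply lost")
        return self.applied[request_id]


def deposit_with_retry(ledger, request_id, amount, attempts=10):
    """Retry until a reply arrives; safe only because deposit() is idempotent."""
    for _ in range(attempts):
        try:
            return ledger.deposit(request_id, amount)
        except TimeoutError:
            continue
    raise RuntimeError("no reply after %d attempts" % attempts)


ledger = FlakyLedger(replies_to_drop=3)
deposit_with_retry(ledger, "req-1", 100)  # three replies are lost, the fourth arrives
deposit_with_retry(ledger, "req-2", 25)
print(ledger.balance)  # 125: the retried deposit was applied only once
```

Without the request ID check, each of the three lost replies would have triggered another deposit, and the caller would have no way to know.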

Fault tolerance is not a new concept. Back in the days before solid state components, computer hardware was expected to fail, so vast amounts of time and energy were dedicated to failure detection and recovery. My first job was programming the Number 2 ESS (Electronic Switching System) being built in Naperville, IL by Bell Labs about the time I got out of college. This system, built out of discrete components prior to the days of integrated circuits, had a design goal of a maximum downtime of two hours in forty years. The hardware was completely duplicated and easily half of the software was dedicated to detecting faults, switching out the bad hardware, and identifying the defective discrete component so it could be replaced. This allowed a system built on unreliable electronic components to match the reliability of the electro-mechanical switching systems that were commonly in use at the time.

Situational Awareness

Successful cloud-based systems have a LOT of moving parts – that pretty much comes as a byproduct of success. With all of these parts moving around, designing for failure hardly seems like an adequate explanation for the robustness of these systems. And it isn’t. At the heart of a reliable cloud-based system are small teams (you might call them “full stack” teams) of people who are fully responsible for their piece of the system. They pay attention to how their service is performing, they fix it when it fails, and they continuously improve it to better serve its consumers.

Full stack teams that maintain end-to-end responsibility for a software service do not fit the model we used to have of the “right” way to develop software. These are not project teams that write code according to spec, turn it over to testing, and disband once it’s tossed over the wall to operations. They are engineering teams that solve problems and make frequent changes to the code they are responsible for. Code bases created and maintained by full stack teams are much more resilient than the large and calcified code bases created by the project model precisely because people pay attention to (and change!) the internal workings of “their” code on an on-going basis.

Limited Surface Area

Clearly, many small teams making independent changes to a large code base can generate a lot of Clausewitzian friction. But since friction occurs when the surfaces of two objects come in contact with each other, strictly limiting the surface area of the code exposed by each team can dramatically reduce friction. In cloud-based systems, services are designed to be as self-contained as possible and interactions with other services are strictly limited to hardened interfaces. Teams are expected to limit changes to the surface area (interfaces) of their code and proactively test any changes that might make it through that surface to other services.

Modern software development includes automated testing strategies and automated deployment pipelines that take the friction out of the deployment process, making it practical and safe to independently deploy small services. Containers are used to standardize the surface area that services expose to their environment, reducing the friction that comes from unpredictable surroundings. Finally, when small changes are made to a live system, the impact of each change is monitored and measured. Changes are typically deployed to a small percentage of users (limiting the deployment surface area), and if any problems are detected small changes can be rolled back quickly. We know that the best way to change a complex system is to probe and adapt, and we know that software systems are inherently complex. This explains why the small rapid deployments common in cloud-based systems turn out to be much safer and more robust than the large releases that we used to think were the “right” way to deliver software.
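Deploying to a small percentage of users and rolling back on trouble can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular company's tooling: the function names, the feature name, and the 2% error threshold are all hypothetical. The sketch hashes the user ID so that each user lands in the same bucket on every request.

```python
import hashlib


def in_rollout(user_id, feature, percent):
    """Deterministically place a user in one of 100 buckets for this feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def should_roll_back(errors, requests, max_error_rate=0.02):
    """Signal a rollback when the exposed group's error rate is too high."""
    if requests == 0:
        return False  # no traffic yet, nothing to judge
    return errors / requests > max_error_rate


# Expose the (hypothetical) new checkout flow to roughly 5% of users.
users = [f"user-{i}" for i in range(1000)]
exposed = [u for u in users if in_rollout(u, "new-checkout", 5)]
print(len(exposed), "of", len(users), "users see the change")

# Feed in monitoring counts from the exposed group to decide what happens next.
print(should_roll_back(errors=5, requests=100))  # True: 5% exceeds the 2% limit
```

Because the bucket depends only on the feature name and user ID, raising the percentage later adds new users without removing the ones who already have the change.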

Shared Learning

Do you ever wonder how the sophisticated testing and deployment tools used at companies like Netflix actually work? Would you like to know how Netflix stores and analyzes data or how it monitors the performance of its platform? Just head over to the Netflix Open Source Center on GitHub; it’s all there for you to see – and use if you’d like. Want to analyze a lot of data? You will undoubtedly consider Hadoop, originally developed at Yahoo! based on Google research papers, open sourced through Apache, and now at the core of many open source tools that abstract its interface and extend its capability.

The world-wide software engineering community has developed a culture of sharing intellectual property, in stark contrast to the more common practice of keeping innovative ideas and novel tools proprietary. The rapid growth of large, reliable, secure software systems can be directly linked to the fact that software engineers routinely contribute to and build upon the work of their world-wide colleagues. Because of this, methods and tools for building highly reliable complex software systems have advanced extraordinarily quickly and are widely available.

Friction in the Process

Between 2004 and 2010, the FBI tried twice to develop an electronic case management system, and it failed both times, squandering hundreds of millions of dollars. The UK’s National Health Service lost similar amounts of money on a patient booking system that was eventually abandoned, and multiple billions of pounds on a patient record system that never worked. In 2012 Sweden decided to scrap and rewrite PUST, a police automation system that actually worked quite well, but not well enough for those who chose to have it rewritten the “right” way. The rewrite never worked and was eventually abandoned, an expensive fiasco that left the police without any system at all.

I could go on and on – just about every country has its story about an expensive government-funded computer system that cost extraordinary amounts of money and never actually worked. The reason? Broadly speaking, these fiascoes are caused by the process most governments use to procure software systems – a high friction process with a very high rate of failure.

One country that does not have an IT fiasco story is Estonia, probably the most automated country in the world. A few years ago British MP Francis Maude visited Estonia to find out how they managed to implement such sophisticated automation on a small budget. He discovered that Estonia automated its government because it had such a small budget, and properly automated government services are much less expensive than their manual counterparts.

Estonia’s process is simple: small internal teams work directly with consumers, understand their journey, and remove the friction. Working software is delivered in small increments to a small number of consumers, adjustments are made to make it work better, and once things work well the new capability is rolled out more broadly. Then another capability is added in the same manner, and thus the system grows steadily in small steps over time. (Incidentally, when this process is used, it is almost impossible to spend a lot of money only to find out the system doesn’t work.)

The UK government formed a consortium with Estonia and three other countries (called the Digital 5) to “provide a focused forum to share best practice [and] identify how to improve the participants’ digital services.” Maude started up the UK’s Government Digital Service, where small internal teams focus on making the process of obtaining government information and services as frictionless as possible. If you want to see how the UK Government Digital Service actually works, check out its Design Principles, which summarize a new mental model for creating digital services, and the Governance approach, which outlines an effective, low friction software development process.

The HealthCare.gov fiasco in the US in 2013 led to the creation of the US Digital Service, which is working in partnership with the UK’s Government Digital Service to rethink government software development and delivery strategies. The US Digital Services Playbook is a great place for any organization to find advice on implementing a low friction development process.

DIGITAL SERVICE PLAYS:

Understand what people need

Address the whole experience, from start to finish

Make it simple and intuitive

Build the service using agile and iterative practices

Structure budgets and contracts to support delivery

Assign one leader and hold that person accountable

Bring in experienced teams

Choose a modern technology stack

Deploy in a flexible hosting environment

Automate testing and deployments

Manage security and privacy through reusable processes

Use data to drive decisions

Default to open

US Digital Services Playbook

The New Mental Model

The UK government changed – seemingly overnight – from high friction processes orchestrated by procurement departments to small internal teams governed by simple metrics. Instead of delivering “requirements” that someone else thinks up, teams are required to track four key performance indicators and figure out how to move these metrics in the right direction over time.

UK Digital Service’s four core KPIs:

Cost per transaction

User satisfaction

Completion rate

Digital take-up

See Gov.UK’s Performance Dashboard.
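To show how little machinery four metrics like these require, here is a small Python sketch. The definitions in it are my assumptions for illustration (for example, cost per transaction as total cost divided by completed transactions across all channels); the official definitions may differ, and the ServiceStats fields and sample numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class ServiceStats:
    """Hypothetical counts for one service over one reporting period."""
    total_cost: float          # full cost of running the service
    digital_started: int       # transactions begun online
    digital_completed: int     # transactions finished online
    other_completed: int       # transactions finished by phone, post, or in person
    satisfied_responses: int   # survey answers rated satisfied or better
    survey_responses: int      # all survey answers


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when there is no data yet."""
    return numerator / denominator if denominator else None


def core_kpis(stats):
    """Compute the four KPIs under the assumed definitions described above."""
    all_completed = stats.digital_completed + stats.other_completed
    return {
        "cost_per_transaction": ratio(stats.total_cost, all_completed),
        "user_satisfaction": ratio(stats.satisfied_responses, stats.survey_responses),
        "completion_rate": ratio(stats.digital_completed, stats.digital_started),
        "digital_take_up": ratio(stats.digital_completed, all_completed),
    }


stats = ServiceStats(
    total_cost=50000.0,
    digital_started=10000,
    digital_completed=8000,
    other_completed=2000,
    satisfied_responses=450,
    survey_responses=500,
)
for name, value in core_kpis(stats).items():
    print(name, value)
```

With these invented numbers, a team can see at a glance that one in five people who start online do not finish, which is exactly the kind of friction the metrics are meant to expose.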

This is an entirely new mental model about how to develop effective software – one that removes all of the intermediaries between an engineering team and its consumers. It is a model that makes no attempt to define requirements, make estimates, or limit changes; instead it assumes that digital services are best developed through experimentation and require on-going improvement.

This is the mental model used by those who developed the first PUST system in Sweden, the one that was successful and appreciated by the police officers who used it. But unfortunately, conventional wisdom said it was not developed the “right” way, so the working system was shut down and rebuilt using the old mental model. And thus Sweden snatched failure from the jaws of success, proving once again that when it comes to developing interactive services, the old mental model simply Does. Not. Work.

Unexpected Points of Friction

It turns out that when governments move from the old mental model to the new mental model, many of the things that were considered “good” or “essential” in the past turn out to be “questionable” or “to be avoided” going forward. It’s a bit jarring to look at the list of good ideas that should be abandoned, but when you consider the friction that these ideas generate, it’s easier to see why forward-looking governments have eliminated them.

1. Requirements generate friction. The concept that requirements are specified by [someone] and implemented by “the team” has to be abandoned. Rather a team of engineers should explore hypotheses, testing and modifying ideas until they are proven or abandoned. Engineering teams should be expected to figure out how to make a positive impact on business metrics within valid constraints.

2. Handovers generate friction. The engineering team should have direct contact with at least a representative sample of the people whose journey they are automating. Just about any intermediary is problematic, whether the go-between is a procurement officer, business analyst, or product owner.

3. Organizational boundaries generate friction. There is a reason why the UK and US use internal teams to develop Digital Services. Going through a procurement office creates insurmountable friction – especially when procurement is governed by laws passed in the days of the old mental model. The IT departments of enterprises often generate similar friction, especially when they are cost centers.

4. Estimates generate friction. Very little useful purpose is served by estimates at the task level. Teams should have a good idea of their capacity by measuring the rate at which they complete their current work or the time it takes work to move through their workflow. Teams should be asked "What can be completed in this time-frame?" rather than "How long will this take?" The UK’s Government Digital Service funds service development incrementally, with a general time limit for each phase. If a service does not fall within the general time boundaries, it is usually broken down into smaller services.

5. Multitasking generates friction. Teams should do one thing at a time and get it done, because task switching burns up a lot of cognitive overhead. Moreover, partially done work that has been put aside gums up the workflow and slows things down.

6. Backlogs generate friction. A long "to do" list takes time to compile and time to prioritize, while everything on the list grows old and whoever put it there grows impatient. Don't prioritize - decide! Either the capacity exists to do the work, or it doesn't. Teams need only three lists: Now, Next, and Never. There is no try.
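The capacity question in item 4 ("What can be completed in this time-frame?") can be answered from measured throughput rather than task estimates. Here is a minimal Python sketch, assuming the team records how many work items it finishes each week. The function name and the sample history are hypothetical, and the low-to-high range is a deliberately crude stand-in for a proper forecast.

```python
from statistics import mean


def forecast_items(weekly_throughput, weeks):
    """Return (low, expected, high) item counts for the next `weeks` weeks.

    weekly_throughput: items actually completed in each recent week.
    """
    low = min(weekly_throughput) * weeks        # as slow as the worst recent week
    expected = mean(weekly_throughput) * weeks  # the recent average
    high = max(weekly_throughput) * weeks       # as fast as the best recent week
    return low, expected, high


history = [4, 6, 5, 7, 3]  # items finished in each of the last five weeks
low, expected, high = forecast_items(history, weeks=4)
print(f"In 4 weeks: about {expected} items, likely between {low} and {high}")
```

Nobody had to estimate a single task to produce that answer; the team only had to count what it finished.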

If Governments can do it, so can Enterprises

If governments can figure out how to design award-winning services [GOV.UK won the Design Museum Design of the Year Award in 2013] while moving quickly and saving money, surely enterprises can do the same. But first there is a lot of inertia to overcome. Once upon a time, governments assumed that obtaining software systems through a procurement process was essential, because it would be impossible to hire the people needed to design and develop these systems internally. They were wrong. They assumed that having teams scattered about in various government agencies would lead to a bunch of unconnected one-off systems. They were wrong. They were afraid that without detailed requirements and people accountable for estimates, there would be no governance. They were wrong. Once they abandoned these assumptions and switched to the low friction approach pioneered by Estonia, governments got better designs, more satisfied consumers, lower cost, and far more predictable results.

Your organization can reap the same benefits, but first you will have to check your assumptions at the door and question some comforting things like requirements, estimates, IT departments, contracts, backlogs – you get the idea. Read the US Digital Services Playbook. Could you run those 13 plays in your organization? If not, you need to uncover the assumptions that are keeping you in the grasp of the old mental model.

Thursday, July 16, 2015

Pitfalls of Agile Transformations

“We are a conservative company, so we are just starting our agile transformation,” the manager told me. “But we expect big things from it: faster delivery, easier recruiting, happier customers.”

“Interesting objectives,” I thought to myself. “Something I might have heard ten years ago.” It struck me that the reason an organization opts for late adoption is to learn from those who go first – from the companies that bushwhacked through the agile swamp a decade ago, or the organizations that followed a few years later. I wondered how much of what we have learned in the last decade will inform this budding agile transformation. I sensed that the answer was “not enough.”

Once you get past the sales pitches and confirmation biases, it doesn’t take much research to discover that agile and Scrum don’t have such a great track record. In the First Round Review article I'm Sorry, But Agile Won't Fix Your Products, Adam Pisoni, co-founder and former CTO of Yammer, contends that “While SCRUM did manage to rein in impulsive managers, it ended up being used more to exert tighter control over engineers’ work.” In The Failure of Agile, Andy Hunt, an original signatory of the Agile Manifesto, writes “Agile methods themselves have not been agile. Now there’s an irony for you.” Both of these pieces complain that agile does not provide real empowerment – one of several persistent problems we have observed in many organizations as they adopt agile practices.

Every organization undertaking an agile transformation imagines that the problems with other agile implementations will not plague THEIR transformation. If they hire the right consultants and use the best practices, they assume they will be fine. This kind of wishful thinking only lengthens the list of mediocre agile transformations. It would be more useful to understand the most predictable problems with agile implementations and actively help your organization avoid them.

With this in mind, I offer three questions you might ask to expose some of the typical ways in which agile disappoints, along with the best current approaches for avoiding these common agile pitfalls.

Question 1: Should you use Scrum or Continuous Delivery?

This may come as a surprise, but quite frankly, Scrum says nothing about how to develop software, nothing about how to deliver defect-free code and nothing about techniques for faster production releases. Other agile methodologies – especially the long lost Extreme Programming – have more to say on these topics, but most agile transformations reserve little time for improving the actual work involved in generating top notch software. Yet without a solid foundation in the technology that produces great systems, agile is pretty hollow.

The technical heart of agile is embodied in the practices articulated by Jez Humble and Dave Farley in Continuous Delivery: acceptance test-driven development; automated builds, automated testing, automated database migration, and automated deployment; everyone checks their code into the mainline at least daily (there are no branches!); the mainline is ALWAYS production ready and is deployed very frequently (daily is slow); release is by switch rather than by deployment. If you aren’t heading toward these or similar technical practices and you think you are doing an agile transformation, think again. Agile without a strong technology base is usually a mistake.
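"Release by switch rather than by deployment" means the new code is already running in production but stays dark until someone turns it on. A minimal Python sketch of the idea follows. The FeatureSwitches class, the switch name, and the discount rule are hypothetical, and the in-memory store is an assumption; the sketch stands in for whatever configuration service a real team would use.

```python
class FeatureSwitches:
    """Hypothetical in-memory switch store; a stand-in for shared configuration."""

    def __init__(self):
        self._on = set()

    def turn_on(self, name):
        self._on.add(name)

    def turn_off(self, name):
        self._on.discard(name)

    def is_on(self, name):
        return name in self._on


def quote_cents(base_cents, switches):
    """Both code paths are deployed; the switch decides which one customers get."""
    if switches.is_on("loyalty-discount"):
        return base_cents * 90 // 100  # new behavior: 10% off
    return base_cents                  # existing behavior


switches = FeatureSwitches()
print(quote_cents(10000, switches))    # 10000: deployed but not released

switches.turn_on("loyalty-discount")   # the release: no new deployment needed
print(quote_cents(10000, switches))    # 9000

switches.turn_off("loyalty-discount")  # the rollback is just as cheap
print(quote_cents(10000, switches))    # 10000
```

Separating deployment from release is what makes it safe to merge unfinished work into the mainline every day.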

Start your agile transformation by acknowledging that software development is a deeply technical endeavor leading to highly complex systems. These systems behave like all complex systems – if you smash them with a big change, all bets are off – you cannot predict the results. The only way to have predictable, stable code bases is to modify them with small probes, observe the results, modify the code and probe again. [Incidentally, a small probe is not two weeks of work; it’s more like two hours of work.] If deploying small probes to live systems is not at the core of your agile transformation strategy, you are missing today’s most reliable tools for delivering stable systems with predictable results.

Yes, this means writing a lot more code. It means tests as code, infrastructure as code, deployment as code. It means no one writes production code until there is an acceptance test for it, written in an executable language. It means teams can pretend they are working in a cloud because the infrastructure they need is always available and can be provisioned as needed. It means that whole teams (which include everyone from product to operations) retain responsibility for their code even after it goes live. And it means that the most common way teams decide what to do next is to examine feedback from the effects of their work in actual use.
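Here is what an acceptance test "written in an executable language" might look like, sketched in Python with the standard unittest module. The Account class and transfer function are hypothetical examples invented for this illustration. In practice the two tests would be written first and would fail until the production code beneath them existed.

```python
import unittest


class Account:
    """Hypothetical account holding a balance in whole currency units."""

    def __init__(self, balance):
        self.balance = balance


def transfer(source, target, amount):
    """Move money between accounts; refuse overdrafts and non-positive amounts."""
    if amount <= 0 or amount > source.balance:
        raise ValueError("transfer refused")
    source.balance -= amount
    target.balance += amount


class TransferAcceptanceTest(unittest.TestCase):
    """Each test states one thing the customer expects, in executable form."""

    def test_transfer_moves_money(self):
        sender, receiver = Account(100), Account(0)
        transfer(sender, receiver, 30)
        self.assertEqual((sender.balance, receiver.balance), (70, 30))

    def test_overdraft_is_refused_and_nothing_changes(self):
        sender, receiver = Account(10), Account(0)
        with self.assertRaises(ValueError):
            transfer(sender, receiver, 30)
        self.assertEqual((sender.balance, receiver.balance), (10, 0))
```

Saved in a file, these tests run with python -m unittest, which means a build pipeline can run them on every check-in.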

The technology enabling Continuous Delivery should be at the core of any modern agile transformation because it has proven to be the safest way for an organization to gain and maintain control of complex software systems. If your agile transition team does not understand this technology, then you are probably trying to switch to agile without adequate technical leadership. This is not a good strategy.

Admittedly, Continuous Delivery is technically challenging, but no more so than the many other challenges that technical teams deal with every day. In fact, we have found that almost without exception, software engineers love to work in a Continuous Delivery environment because of the challenge, the discipline, the clarity, and the immediate feedback. One financial services company told us that in the three years since their (large) IT department switched to Continuous Delivery, they have had zero turnover, except for emigration. Their transformation resulted in the most desirable jobs in the area.

Question 2: Do you hire Developers or Engineers?

What title do you use for people who solve problems with software? Years upon years ago, I was called a programmer and that was a high status job. But once waterfall processes placed analysts between programmers and their customers, the programmers were no longer expected to analyze customer problems and solve them. The title “programmer” was downgraded to a second class job which mostly involved coding what someone else wrote in a specification. Over time a new term – developers – came into use and referred to a more holistic job. But then, agile processes placed a product owner between developers and customers, so developers were no longer expected to analyze customer problems and solve them. Instead, they were given a prioritized list of relatively small stories to estimate, code, and (hopefully) test.

If you visit Silicon Valley these days you will find that software developers have been replaced by software engineers. We can only hope that those smart people who have this title will be presented with complete problems and expected to engineer a solution. They will not be given specs, because whoever wrote the spec designed the solution. They will not be given stories, because whoever wrote the stories designed the solution. They will be given real problems – customer problems, business problems, technical problems – and asked to engineer a solution. They will be expected to implement the solution within valid constraints and take responsibility for its success. Silicon Valley companies understand that this is the kind of job that attracts the best engineers.

If you want more effective recruiting in today’s very tight talent market, don’t look for software developers or mention your agile transformation. Look for software engineers and reliability engineers and make it clear that you expect them to engineer effective solutions to meaningful problems. Then make sure that your agile transformation makes this challenging work the responsibility of your engineers, because most agile methodologies place it elsewhere.

Question 3: How will you handle dependencies?

I was astonished when I heard that after Amazon completed its switch to services, the company no longer used central databases. How could this possibly work? I thought it was self-evident that a single system of record is fundamental to the success of an enterprise – so how could Amazon possibly survive without a central database? Either the information about abandoning central databases was wrong or Amazon was doing something that defied all conventional wisdom.

It turns out that the second was correct – Amazon had discovered something so obvious that it had escaped us for decades: A central database is one humongous dependency generator. Ouch! Take a look at Sam Newman’s book Building Microservices – where the case is made that dependencies are among the greatest evils in software development and central databases are among the most pernicious creators of dependencies in the software world. It’s eye-opening.

These days we see a lot of companies building microservices – Netflix and realestate.com.au and Gilt and many more. Why? Because when they experience extremely high volume, the code that handles this volume needs constant attention and tuning. The only way to make that happen at scale is to adopt a structure which allows individual teams to deploy their code – live to production – independently of other teams. A microservice is exactly that – code owned by one (small) team that designs, monitors, maintains, and deploys the service – independent of other teams and other code.

If this sounds a lot like something you’ve heard of before, that’s because independent module deployment has been the dream of software development just about forever. A couple decades ago, object-oriented programming promised this nirvana, but it never quite delivered. Now microservices are making the same promise, and there are instances of them working pretty well. Of course, microservices are rather new and the jury is still out. (See Martin Fowler’s summary of Microservices.) But we know that for very high volume systems, independent deployment appears to be mandatory and microservices seem to be the architecture of choice. Clearly microservices are a viable way – but not the only way – to handle dependencies.

No matter what kind of system you have, dependencies must be dealt with or else they will eventually haunt you. The Google code base started out as a monolith which rapidly developed many dependencies, but fortunately, Google's engineers understood the danger. So they developed a dependency matrix to keep track of code interactions, and whenever code was pushed to the test framework, the new code and all of its dependencies were tested together – immediately. If the test found problems, the code was reverted and everyone involved was notified. New code was system-tested thousands of times a day, which required a massive environment with a lot of automation. But it worked infinitely better than manually testing large changes because it identified the precise cause of potential problems before they happened. As expensive as it seems, it turns out that testing each small change with its complete stack of dependent code is better, cheaper, safer and faster than testing big batch releases the way we used to in the past.
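Testing a change together with everything that depends on it can be sketched in a few lines. This is only an illustration, not Google's actual tooling: the module names and the `DEPENDS_ON` map are hypothetical. The function walks the reverse dependency graph to find every module that would need retesting when one module changes.

```python
from collections import defaultdict, deque

# Hypothetical dependency map: each module lists the modules it depends on.
DEPENDS_ON = {
    "checkout": ["cart", "payments"],
    "cart": ["catalog"],
    "payments": [],
    "catalog": [],
    "search": ["catalog"],
}


def affected_modules(changed, depends_on):
    """Return the changed module plus every module that depends on it,
    directly or transitively."""
    # Invert the map so we can walk from a module to its dependents.
    dependents = defaultdict(set)
    for module, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(module)

    affected = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# A change to "catalog" pulls in cart, search, and (through cart) checkout.
print(sorted(affected_modules("catalog", DEPENDS_ON)))
```

The smaller the change, the smaller this set tends to be, which is one reason testing each small change stays affordable.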

“But how do we get from our legacy systems to that ideal state?” we are often asked. Well, that is precisely the question your agile transformation should answer. There are plenty of places to look for ideas, because this is a path many companies have taken. To get started, Martin Fowler's Strangler Application provides a general pattern for migrating away from legacy code, and several case studies can be found here. However, there are no canned answers for dealing with legacy code; the problems are quite specific to each situation. You need good engineers to take up the challenge supported by leadership that appreciates the importance of the issue. But the bottom line is that if an agile transformation does not provide a path from smashing your system with big releases to probing it with tiny bits of code, you have more homework to do before you get started.
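As a rough sketch of the strangler idea, imagine a thin routing layer that sends already-migrated requests to new code and everything else to the legacy system. The handler names and the `/invoices` prefix are invented for illustration; a real migration would route at whatever seam the legacy system offers.

```python
def legacy_handler(path):
    """Stand-in for the existing system (hypothetical)."""
    return f"legacy system handled {path}"


def new_invoice_service(path):
    """Stand-in for a newly extracted piece of functionality (hypothetical)."""
    return f"new service handled {path}"


# Routes migrated so far; this table grows as the legacy system shrinks.
MIGRATED_PREFIXES = {"/invoices": new_invoice_service}


def route(path):
    """Send migrated paths to new code and everything else to legacy."""
    for prefix, handler in MIGRATED_PREFIXES.items():
        if path.startswith(prefix):
            return handler(path)
    return legacy_handler(path)


print(route("/invoices/42"))
print(route("/orders/7"))
```

Each entry added to the table is a small, independently deployable step, which is the opposite of a big-bang rewrite.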

We have learned a lot about how to deal with dependencies over the last few years. We can do it with an architecture that isolates dependencies – perhaps microservices – or by automatically testing the complete system of dependent code after every small change. We know we should NOT deal with dependencies by consuming the last third of a release cycle with system testing (and fixing) the way we used to in the waterfall days. And we know it does not make sense to automate tests just to make this back-end testing go faster – a mistake we have seen frequently that you want to avoid. Test automation should be aimed at defect prevention, not defect discovery. Preventing defects as the code is written pays for itself. Many times over. Every time.

Ask the Right Questions

If you are one of those conservative organizations that is just getting around to an agile transformation, be sure you ask the right questions before you take the leap. Remember that typical agile practices are just table stakes. You need to know how to play the complex systems game, a deeply technical game played by very smart engineers. Don’t insult their intelligence if you want to engage them.

Understand that dependencies cause most defects and fragile code bases, and they also lead to tangled organizational structures. Really. If you’re skeptical, check out Conway’s Law. Get your technical and architectural act together, as well as your strategy for dealing with dependencies, before you begin. This may prompt you to consider an organizational change as part of the transformation.

When you are ready to start, be sure to articulate the specific business goals the agile transition will help achieve and how you will measure the agile transition’s contribution to these goals. Then challenge your smart people to figure out how to move those metrics – and your transition will be off to a good start.

As an industry, we know how to do this. Your colleagues have done it. You may as well avoid the pitfalls they have discovered. Start by asking a few questions.

Friday, June 5, 2015

Lean Software Development: The Backstory

We were in a conference room near the Waterfront in Cape Town. “I just lost a crown from one of my teeth,” my husband Tom declared just before I was scheduled to open the conference. Someone at our table responded, “You’re lucky, Cape Town has some of the best dentists in the world.” It didn’t feel very lucky; Cape Town was the first stop on a ten week trip to Africa, Europe, and Australia.

The situation was eerily familiar. A year earlier a chip had cracked off of my tooth as I ate a pizza in Lima, the first stop of a ten week trip to South America. I ate gingerly during the rest of the trip, worried that the tooth would crack further. Luckily I made it back home with no pain and little additional damage. Once there, it took three days to get a dentist appointment. The dentist made an impression of the gap in my tooth and fashioned a temporary crown. “This will have to last for a week or two,” she said. “If it falls out, just stick it back in and be more careful what you eat.” Luckily the temporary crown held, and ten days later a permanent crown arrived from the lab. Two weeks after we arrived home, my tooth was fixed.

We were scheduled to be in Cape Town for only two days. How was Tom going to get a crown replaced in two days? A small committee formed. Someone did a phone search; apparently the Waterfront was a good place to find dentists. A call was made. “You can go right now – the dental office is nearby. Do you want someone to walk you over?” As Tom headed out the door with an escort, I got ready for my presentation. Half way through the talk, I saw Tom return and signal that all was well.

“I lost a part of my tooth, not just the crown,” Tom told me after the talk. “I’m supposed to return at 3:30 this afternoon; I should have a new crown by the end of the day.” The dentist had a mini-lab in his office. Instead of making a temporary crown, he used a camera to take images of the broken tooth and adjacent teeth. The results were combined into a 3D model of the crown to which the dentist made a few adjustments. Then he selected a ceramic blank that matched the color of Tom’s teeth and put it in a milling machine. With the push of a button, instructions to make the crown were loaded into the machine. Cutters whirled and water squirted to keep the ceramic cool. Ten minutes later the crown was ready to cement in place. Ninety minutes after he arrived that afternoon and eight hours after the incident, Tom walked out of the dental office with a new permanent crown. It cost approximately the same amount as my crown had cost a year earlier.

Lean is about Flow Efficiency

The book This is Lean (Modig and Ahlström, 2013) describes “lean” as a relentless focus on efficiency – but not the kind of efficiency that cuts staff and money, nor the kind of efficiency that strives to keep every resource busy all of the time. In fact, a focus on resource efficiency will almost always destroy overall efficiency, the authors contend, because fully utilized machines (and people) create huge traffic jams, which end up creating a lot of extra work. Instead, Modig and Ahlström demonstrate that lean is about flow efficiency – that is, the efficiency with which a unit of work (a flow unit) moves through the system.

Consider our dental experience. It took two weeks for me to get a new crown, but in truth, only an hour and a half of that time was needed to actually fix the tooth; the rest of the time was mostly spent waiting. My flow efficiency was 1.5÷336 (the number of hours in two weeks) = 0.45%. On the other hand, Tom’s tooth was replaced in eight hours – 42 times faster – giving him a flow efficiency of 1.5÷8 = 18.75%.
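The arithmetic is simple enough to check in a few lines. The function name and variable names below are just for illustration; the numbers are the ones from the two dental visits.

```python
def flow_efficiency(value_adding_hours, elapsed_hours):
    """Fraction of elapsed time spent actually working on the flow unit."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return value_adding_hours / elapsed_hours


HOURS_PER_WEEK = 7 * 24

mine = flow_efficiency(1.5, 2 * HOURS_PER_WEEK)  # two weeks of elapsed time
toms = flow_efficiency(1.5, 8)                   # eight hours of elapsed time

print(f"Two-week crown: {mine:.2%}")                # 0.45%
print(f"Same-day crown: {toms:.2%}")                # 18.75%
print(f"Speed-up: {2 * HOURS_PER_WEEK / 8:.0f}x")   # 42x
```

Note that the value-adding time is identical in both cases; only the waiting changes.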

In my case, the dental system was focused on the efficiency of the lab’s milling machine – no doubt an expensive piece of equipment. But add up all of the extra costs: a cast of the crown for the lab, a temporary crown for me, two separate hour-long sessions with the dentist, plus all of the associated logistics – scheduling, shipping, tracking, etc. In Tom’s case, the dental system was focused on the speed with which it could fix his tooth – which was good for us, because a long wait for a crown was not an option. True, the milling machine in the dentist’s office sits idle much of each day. (The dentist said he has to replace two crowns a day to make it economically feasible.) But when you add up the waste of temporary crowns, the piles of casts waiting for a milling machine, and the significant cost of recovering from a mistake – an idle milling machine makes a lot of sense.

What does flow efficiency really mean? Assume you have a camera and efficiency means keeping the camera busy – always taking a picture of some value-adding action. Where do you aim your camera? In the case of resource efficiency, the camera is aimed at the resource – the milling machine – and keeping it busy is of the utmost importance. In the case of flow efficiency, the camera is on the flow unit – Tom – and work on replacing his crown is what counts. The fundamental mental shift that lean requires is this: flow efficiency trumps resource efficiency almost all of the time.

Lean Product Development: The Predecessor

During the 1980’s Japanese cars were capturing market share at a rate that alarmed US automakers. In Boston, both MIT and Harvard Business School responded by launching extensive studies of the automotive industry. In 1990 the MIT research effort resulted in the now classic book The Machine that Changed the World: the Story of Lean Production (Womack et al., 1990), which gave us the term “lean.” A year later, Harvard Business School published Product Development Performance (Clark and Fujimoto, 1991) and the popular book Developing Products in Half the Time (Smith and Reinertsen, 1991) was released. These two 1991 books are foundational references on what came to be called “lean product development,” although the term “lean” would not be associated with product development for another decade.

Clark and Fujimoto documented the fact that US and European volume automotive producers took three times as many engineering hours and 50% more time to develop a car compared to Japanese automakers, yet the Japanese cars had substantially higher quality and cost less to manufacture. Clearly the Japanese product development process produced better cars faster and at lower cost than typical western development practices of the time. Clark and Fujimoto noted that the distinguishing features of Japanese product development paralleled features found in Japanese automotive production. For example, Japanese product development focused on flow efficiency, reducing information inventory, and learning based on early and frequent feedback from downstream processes. By contrast, product development in western countries focused on resource efficiency, completing each phase of development before starting the next, and following the original plan with as little variation as possible.

In 1991 the University of Michigan began its Japan Technology Management Program. Over the next several years, faculty and associate members included Jeffrey Liker, Allen Ward, Durward Sobek, John Shook, and Mike Rother. This group has published numerous books and articles on lean thinking, lean manufacturing, and lean product development, including The Toyota Product Development System (Morgan and Liker, 2006), and Lean Product and Process Development (Ward, 2007). The second book summarizes the essence of lean product development this way:

Understand that knowledge creation is the essential work of product development.
Charter a team of responsible experts led by an entrepreneurial system designer.
Manage product development using the principles of cadence, flow, and pull.

It is important to recognize that even though lean product development is based on the same principles as lean production, the practices surrounding development are, quite frankly, not the same as those considered useful in production. In fact, transferring lean practices from manufacturing to development has led to some disastrous results. For example, lean production emphasizes reducing variation – exactly the wrong thing to do in product development. The western practice of following a plan and measuring variance from a plan is often justified by the slogan “Do it right the first time.” Unfortunately, this approach does not allow for learning; it confines designs to those conceived when the least amount of knowledge is available. A fundamental practice in lean product development is to create variation (not avoid it) in order to explore the impact of multiple approaches. (This is called set-based engineering.)

The critical thing to keep in mind is that knowledge creation is the essential work of product development. While lean production practices support learning about and improving the manufacturing process, their goal is to minimize variation in the product. This is not appropriate for product development, where variation is an essential element of the learning cycles that are the foundation of good product engineering. Thus instead of copying lean manufacturing practices, lean product development practices must evolve from a deep understanding of fundamental lean principles adapted to a development environment.

Lean Software Development: A Subset of Lean Product Development

In 1975, computers were large, expensive, and rare. Software for these large machines was developed in the IT departments of large companies and dealt largely with the logistics of running the company – payroll, order processing, inventory management, etc. But as mainframes morphed into minicomputers, personal computers, and microprocessors, it became practical to enhance products and services with software. Then the internet began to invade the world, and it eventually became the delivery mechanism for a large fraction of the software being developed today. As software moved from supporting business process to enabling smart products and becoming the essence of services, software engineers moved from IT departments to line organizations where they joined product teams.

Today, most software development is not a stand-alone process, but rather a part of developing products or services. Thus lean software development might be considered a subset of lean product development; certainly the principles that underpin lean product development are the same principles that form the basis of lean software development.

Agile and Lean Software Development: 2000 - 2010

It’s hard to believe these days, but in the mid 1990’s, developing software was a slow and painful process found in the IT departments of large corporations. As the role of software expanded and software engineers moved into line organizations, reaction against the old methods grew. In 1999, Kent Beck proposed a radically new approach to software development in the book “Extreme Programming Explained” (Beck, 1999). In 2001 the Agile Manifesto (Beck et al., 2001) gave this new approach a name – “Agile.”

In 2003, the book Lean Software Development (Poppendieck, 2003) merged lean manufacturing principles with agile practices and the latest product development thinking, particularly from the book Managing the Design Factory (Reinertsen, 1997). Lean software development was presented as a set of principles that form a theoretical framework for developing and evolving agile practices:

Eliminate waste
Amplify learning
Decide as late as possible
Deliver as fast as possible
Empower the team
Build quality in
See the whole

Although the principles of lean software development are consistent with lean manufacturing and (especially) lean product development, the specific practices that emerged were tailored to a software environment and aimed at the flaws in the prevailing software development methodologies. One of the biggest flaws at the time was the practice of moving software sequentially through the typical stages of design, development, test, and deployment – with handovers of large inventories of information accumulating at each stage. This practice left testing and integration at the end of the development chain, so defects went undetected for weeks or months before they were discovered. Typical sequential processes reserved a third of a release cycle for testing, integration, and defect removal. The idea that it was possible to “build quality in” was not considered a practical concept for software.

To counter sequential processes and the long integration and defect removal phase, agile software development practices focused on fast feedback cycles in these areas:

Test-driven development: Start by writing tests (think of them as executable specifications) and then write the code to pass the tests. Put the tests into a test harness for ongoing code verification.
Continuous integration: Integrate small increments of code changes into the code base frequently – multiple times a day – and run the test harness to verify that the changes have not introduced errors.
Iterations: Develop working software in iterations of two to four weeks; review the software at the end of each iteration and make appropriate adjustments.
Cross-functional teams: Development teams should include customer proxies and testers as well as developers to minimize handovers.
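To make the first two practices concrete, here is a minimal sketch using Python's standard `unittest` module. The `shipping_cost` rule is a made-up example, not something from any real product. In test-driven development the test class would be written first, fail, and then drive the two-line implementation; in continuous integration the same harness would run on every small change.

```python
import unittest


def shipping_cost(order_total):
    """Hypothetical rule: orders of 50 or more ship free, others cost 5."""
    return 0 if order_total >= 50 else 5


class ShippingCostSpec(unittest.TestCase):
    """Written first; each test reads as one line of the specification."""

    def test_orders_of_50_or_more_ship_free(self):
        self.assertEqual(shipping_cost(50), 0)

    def test_smaller_orders_pay_flat_fee(self):
        self.assertEqual(shipping_cost(49.99), 5)


# The "test harness": load the specifications and run them on demand.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ShippingCostSpec)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("all specifications pass:", result.wasSuccessful())
```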

During its first decade, agile development moved from a radical idea to a mainstream practice. This was aided by the widespread adoption of Scrum, an agile methodology which institutionalized the third and fourth practices listed above, but unfortunately omitted the first two practices.

The Difference between Lean and Agile Software Development

When it replaced sequential development practices typical at the time, agile software development improved the software development process most of the time – in IT departments as well as product development organizations. However, the expected organizational benefits of agile often failed to materialize because agile focused on optimizing software development, which frequently was not the system constraint. Lean software development differed from agile in that it worked to optimize flow efficiency across the entire value stream “from concept to cash.” (Note the subtitle of the book Implementing Lean Software Development: From Concept to Cash (Poppendieck, 2006)). This end-to-end view was consistent with the work of Taiichi Ohno, who said:

“All we are doing is looking at the time line, from the moment the customer gives us an order to the point when we collect the cash. And we are reducing that time line by removing the non-value-added wastes.” (Ohno, 1988. p ix)

Lean software development came to focus on these areas:

Build the right thing: Understand and deliver real value to real customers.
Build it fast: Dramatically reduce the lead time from customer need to delivered solution.
Build the thing right: Guarantee quality and speed with automated testing, integration and deployment.
Learn through feedback: Evolve the product design based on early and frequent end-to-end feedback.

Let’s take a look at each principle in more detail:

1. Understand and deliver real value to real customers.

A software development team working with a single customer proxy has one view of the customer interest, and often that view is not informed by technical experience or feedback from downstream processes (such as operations). A product team focused on solving real customer problems will continually integrate the knowledge of diverse team members, both upstream and downstream, to make sure the customer perspective is truly understood and effectively addressed. Clark and Fujimoto call this “integrated problem solving” and consider it an essential element of lean product development.

2. Dramatically reduce the lead time from customer need to delivered solution.

A focus on flow efficiency is the secret ingredient of lean software development. How long does it take for a team to deploy into production a single small change that solves a customer problem? Typically it can take weeks or months – even when the actual work involved consumes only an hour. Why? Because subtle dependencies among various areas of the code make it probable that a small change will break other areas of the code; therefore it is necessary to deploy large batches of code as a package after extensive (usually manual) testing. In many ways the decade of 2000-2010 was dedicated to finding ways to break dependencies, automate the provisioning and testing processes, and thus allow rapid independent deployment of small batches of code.

3. Guarantee quality and speed with automated testing, integration and deployment.

It was exciting to watch the expansion of test-driven development and continuous integration during the decade of 2000-2010. First these two critical practices were applied at the team level – developers wrote unit tests (which were actually technical specifications) and integrated them immediately into their branch of the code. Test-driven development expanded to writing executable product specifications in an incremental manner, which moved testers to the front of the process. This proved more difficult than automated unit testing, and precipitated a shift toward testing modules and their interactions rather than end-to-end testing. Once the product behavior could be tested automatically, code could be integrated into the overall system much more frequently during the development process – preferably daily – so software engineers could get rapid feedback on their work.

Next the operations people got involved and automated the provisioning of environments for development, testing, and deployment. Finally teams (which now included operations) could automate the entire specification, development, test, and deployment processes – creating an automated deployment pipeline. There was initial fear that more rapid deployment would cause more frequent failure, but exactly the opposite happened. Automated testing and frequent deployment of small changes meant that risk was limited. When errors did occur, detection and recovery were much faster and easier, and the team became a lot better at it. Far from increasing risk, it is now known that deploying code frequently in small batches is the best way to reduce risk and increase the stability of large complex code bases.

4. Evolve the product design based on early and frequent end-to-end feedback.

To cap these remarkable advancements, once product teams could deploy multiple times per day they began to close the loop with customers. Through canary releases, A/B testing, and other techniques, product teams learned from real customers which product ideas worked and how to fine tune their offerings for better business results.
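One common building block behind both canary releases and A/B tests is a stable way to decide which users see the new version. The sketch below is one possible approach, assuming users have a string or integer ID; the function name, the experiment labels, and the hashing scheme are illustrative choices rather than any particular company's method.

```python
import hashlib


def assign_variant(user_id, experiment, treatment_share=0.5):
    """Deterministically place a user in 'treatment' or 'control'."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# An A/B test splits traffic roughly in half ...
print(assign_variant(1234, "new-checkout"))
# ... while a canary release exposes only a small share of users.
print(assign_variant(1234, "new-search-backend", treatment_share=0.01))
```

Because the assignment depends only on the user and the experiment name, a user sees the same variant on every visit, and raising `treatment_share` gradually widens a canary without reshuffling who is already in it.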

When these four principles guided software development in product organizations, significant business-wide benefits were achieved. However, IT departments found it difficult to adopt the principles because they required changes that lay beyond the span of control of most IT organizations.

Lean Software Development: 2010 - 2015

2010 saw the publication of two significant books about lean software development. David Anderson’s book Kanban (Anderson, 2010) presented a powerful visual method for managing and limiting work-in-process (WIP). Just at the time when two week iterations began to feel slow, Kanban gave teams a way to increase flow efficiency while providing situational awareness across the value stream. Jez Humble and Dave Farley’s book Continuous Delivery (Humble and Farley, 2010) walked readers through the steps necessary to achieve automated testing, integration and deployment, making daily deployment practical for many organizations. A year later, Eric Ries’s book The Lean Startup (Ries, 2011) showed how to use the rapid feedback loop created by continuous delivery to run experiments with real customers and confirm the validity of product ideas before incurring the expense of implementation.
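The core mechanism of a WIP limit fits in a small class. This is a toy model with invented column names and limits, not a description of any Kanban tool: a column accepts new work only when it has capacity, so work is pulled rather than pushed.

```python
class KanbanColumn:
    """A board column that refuses new work once its WIP limit is reached."""

    def __init__(self, name, wip_limit):
        self.name = name
        self.wip_limit = wip_limit
        self.items = []

    def has_capacity(self):
        return len(self.items) < self.wip_limit

    def pull(self, item, source):
        """Pull an item from an upstream column only if there is capacity."""
        if not self.has_capacity():
            return False
        source.items.remove(item)
        self.items.append(item)
        return True


ready = KanbanColumn("Ready", wip_limit=10)
ready.items = ["A", "B", "C"]
in_progress = KanbanColumn("In progress", wip_limit=2)

# Try to start all three items; the third stays in "Ready".
results = [in_progress.pull(item, ready) for item in list(ready.items)]
print(results)       # [True, True, False]
print(ready.items)   # ['C']
```

The refused pull is the point: it makes the bottleneck visible instead of hiding it in a growing pile of half-finished work.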

Over the next few years, the ideas in these books became mainstream, and agile software development gradually moved past its limitations (a software-only perspective and iteration-based delivery) to cover a wider part of the value stream and a more rapid flow. A grassroots movement called DevOps worked to make automated provision-code-build-test-deployment pipelines practical. Cloud computing arrived, providing easy and automated provisioning of environments. Cloud elements (virtual machines, containers), services (storage, analysis, etc.) and architectures (microservices) made it possible for small services and applications to be easily and rapidly deployed. Improved testing techniques (simulations, contract assertions) have made error-free deployments the norm.

The State of Lean Software Development in 2015

Today’s successful internet companies have learned how to optimize software development over the entire value stream. They create full stack teams that are expected to understand the consumer problem, deal effectively with tough engineering issues, try multiple solutions until the data shows which one works best, and maintain responsibility for improving the solution over time. Large companies with legacy systems have begun to take notice, but they struggle with moving from where they are to the world of thriving internet companies.

Lean principles are a big help for organizations that want to move from old development techniques to modern software approaches. For example, (Calçado, 2015) shows how classic lean tools – Value Stream Mapping and problem solving with Five Whys – were used to increase flow efficiency at Soundcloud, leading over time to a microservices architecture. In fact, focusing on flow efficiency is an excellent way for an organization to discover the most effective path to a modern technology stack and development approach.

For traditional software development, flow efficiency is typically lower than 10%; agile practices usually bring it up to 30 or 40%. But in thriving internet companies, flow efficiency approaches 70% and is often quite a bit higher. Low flow efficiencies are caused by friction – in the form of batching, queueing, handovers, delayed discovery of defects, as well as misunderstanding of consumer problems and changes in those problems during long resolution times. Improving flow efficiency involves identifying and removing the biggest sources of friction from the development process.

Modern software development practices – the ones used by successful internet companies – address the friction in software development in a very particular way. The companies start by looking for the root causes of friction, which usually turn out to be 1) misunderstanding of the customer problem, 2) dependencies in the code base and 3) information and time lost during handovers and multitasking. Therefore they focus on three areas: 1) understanding the consumer journey, 2) architecture and automation to expose and reduce dependencies, and 3) team structures and responsibilities. Today (2015), lean development in software usually focuses on these three areas as the primary way to increase efficiency, assure quality, and improve responsiveness in software-intensive systems.

1. Understand the Customer Journey.

Software-intensive products create a two-way path between companies and their consumers. A wealth of data exists about how products are used, how consumers react to a product’s capabilities, opportunities to improve the product, and so on. Gathering this data and analyzing it has become an essential capability for companies far beyond the internet world: car manufacturers, mining equipment companies, retail stores and many others gather and analyze “Big Data” to gain insights into consumer behavior. The ability of companies to understand their consumers through data has changed the way products are developed (Porter, 2015). No longer do product managers (or representatives from “the business”) develop a roadmap and give a prioritized list of desired features to an engineering team. Instead, data scientists work with product teams to identify themes to be explored. Then the product teams identify consumer problems surrounding the theme and experiment with a range of solutions. Using rapid deployment and feedback capabilities, the product team continually enhances the product, measuring its success by business improvements, not feature completion.

2. Architecture and Automation.

Many internet companies, including Amazon, Netflix, eBay, realestate.com.au, Forward, Twitter, PayPal, Gilt, Bluemix, Soundcloud, The Guardian, and even the UK Government Digital Service have evolved from monolithic architectures to microservices. They found that certain areas of their offerings need constant updating to deal with a large influx of customers or rapid changes in the marketplace. To meet this need, relatively small services are assigned to small teams which then split their services off from the main code base in such a way that each service can be deployed independently. A service team is responsible for changing and deploying the service as often as necessary (usually very frequently), while ensuring that the changes do not break any upstream or downstream services. This assurance is provided by sophisticated automated testing techniques as well as automated incremental deployment.

Other internet companies, including Google and Facebook, have maintained existing architectures but developed sophisticated deployment pipelines that automatically send each small code change through a series of automated tests with automatic error handling. The deployment pipeline culminates in safe deployments which occur at very frequent intervals; the more frequent the deployment, the easier it is to isolate problems and determine their cause. In addition, these automation tools often contain dependency maps so that feedback on failures can be sent directly to the responsible engineers and offending code can be automatically reverted (taken out of the pipeline in a safe manner).
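In outline, such a pipeline might behave like the sketch below. Everything here is hypothetical (the stage names, the shape of a "change", the owner lookup); it is meant only to show the control flow of stopping at the first failing stage, reverting the change, and telling the responsible team.

```python
def run_pipeline(change, stages, owners):
    """Run each stage in order; on the first failure, revert and report."""
    for name, check in stages:
        if not check(change):
            return {
                "deployed": False,
                "reverted": True,
                "failed_stage": name,
                "notify": owners.get(change["module"], "unknown owner"),
            }
    return {"deployed": True, "reverted": False,
            "failed_stage": None, "notify": None}


# Hypothetical stages: each is a name plus a function returning True or False.
stages = [
    ("unit tests", lambda c: c["unit_ok"]),
    ("integration tests", lambda c: c["integration_ok"]),
]
owners = {"search": "search-team"}

good = {"module": "search", "unit_ok": True, "integration_ok": True}
bad = {"module": "search", "unit_ok": True, "integration_ok": False}

print(run_pipeline(good, stages, owners))
print(run_pipeline(bad, stages, owners))
```

When every change is this small, a failure points at exactly one change and one owner, which is what makes automatic reverting safe.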

These architectural structures and automation tools are a key element in a development approach that uses Big Data combined with extremely rapid feedback to improve the consumer journey and solve consumer problems. They are most commonly found in internet companies, but are being used in many others, including organizations that develop embedded software. (See case study, below.)

3. Team Structures and Responsibilities.

When consumer empathy, data analytics and very rapid feedback are combined, there is one more point of friction that can easily reduce flow efficiency. If an organization has not delegated responsibility for product decisions to the team involved in the rapid feedback loop, the benefits of this approach are lost. In order for such feedback loops to work, teams with a full stack of capabilities must be given responsibility to make decisions and implement immediate changes based on the data they collect. Typically such teams include people with product, design, data, technology, quality, and operations backgrounds. They are responsible for improving a set of business metrics rather than delivering a set of features. An example of this would be the UK Government Digital Service (GDS), where teams are responsible for delivering improvements in four key areas: cost per transaction, user satisfaction, transaction completion rate, and digital take-up.
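To make the idea of a team owning metrics rather than features concrete, here is a minimal sketch of how the four measures named above might be computed from raw counts. The formulas, the `ServiceStats` fields, and the sample numbers are assumptions made for illustration; they are not taken from GDS documentation.

```python
from dataclasses import dataclass


@dataclass
class ServiceStats:
    """Hypothetical raw counts for one service over a reporting period."""
    total_cost: float             # total cost of running the service
    digital_started: int          # transactions begun online
    digital_completed: int        # transactions finished online
    other_channel_completed: int  # transactions finished by phone, post, etc.
    satisfied_responses: int      # survey responses rated as satisfied
    survey_responses: int         # all survey responses


def metrics(s: ServiceStats) -> dict:
    """Compute the four measures under the assumed definitions above."""
    total_completed = s.digital_completed + s.other_channel_completed
    return {
        "cost_per_transaction": s.total_cost / total_completed,
        "user_satisfaction": s.satisfied_responses / s.survey_responses,
        "completion_rate": s.digital_completed / s.digital_started,
        "digital_take_up": s.digital_completed / total_completed,
    }


sample = ServiceStats(
    total_cost=50_000.0,
    digital_started=10_000,
    digital_completed=8_000,
    other_channel_completed=2_000,
    satisfied_responses=450,
    survey_responses=500,
)
result = metrics(sample)
```

A team measured this way can judge any change by whether these numbers move, regardless of which features it shipped.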

It is interesting to note that UK law makes it difficult to base contracts on such metrics, so GDS staffs internal teams with designers and software engineers and makes them responsible for the metrics. Following this logic to its conclusion, the typical approach of IT departments – contracting with their business colleagues to deliver a pre-specified set of features – is incompatible with full stack teams responsible for business metrics. In fact, it is rare to find separate IT departments in companies founded after the mid-1990s (which includes virtually all internet companies). Instead, these newer companies place their software engineers in line organizations, reducing the friction of handovers between organizations.

In older organizations, IT departments often find it difficult to adopt modern software development approaches because they have inherited monolithic code bases intertwined with deep dependencies that introduce devious errors and thwart independent deployment of small changes. One major source of friction is the corporate database, once considered essential as the single source of truth about the business, but now under attack as a massive dependency generator. Another source of friction is outsourced applications, where even small changes are difficult and knowledge of how to make them no longer resides in the company. But perhaps the biggest source of friction in IT departments is the distance between their technical people and the company’s customers. Because most IT departments view their colleagues in line businesses as their customers, the technical people in IT lack a direct line of sight to the real customers of the company. Therefore insightful trade-offs and innovative solutions struggle to emerge.

The Future of Lean Software Development

The world-wide software engineering community has developed a culture of sharing innovative ideas, in stark contrast to the more common practice of keeping intellectual property and internally developed tools proprietary. The rapid growth of large, reliable, secure software systems can be directly linked to the fact that software engineers routinely contribute to and build upon the work of their world-wide colleagues through open source projects and repositories like GitHub. This reflects the longstanding practices of the academic world but is strikingly unique in the commercial world. Because of this intense industry-wide knowledge sharing, methods and tools for building highly reliable complex software systems have advanced extraordinarily quickly and are widely available.

As long as the software community continues to leverage its knowledge-sharing culture it will continue to grow rapidly, because sophisticated solutions to seemingly intractable problems eventually emerge when many minds are focused on the problem. The companies that will benefit the most from these advances are the ones that not only track new techniques as they are being developed, but also contribute their own ideas to the knowledge pool.

As microservice architectures and automated deployment pipelines become common, more companies will adopt these practices, some earlier and some later, depending on their competitive situation. The most successful software companies will continue to focus like a laser on delighting customers, improving the flow of value, and reducing risks. They will develop (and release as open source) an increasingly sophisticated set of tools that make software development easier, faster, and more robust. Thus a decade from now there will be significant improvements in the way software is developed and deployed. The Lean principles of understanding value, increasing flow efficiency, eliminating errors, and learning through feedback will continue to drive the evolution, but the term “lean” will disappear as it becomes “the way things are done.”

— Case Study —

Hewlett Packard LaserJet Firmware

The HP LaserJet firmware department had been the bottleneck of the LaserJet product line for a couple of decades, but by 2008 the situation had turned desperate. Software was increasingly important for differentiating the printer line, but the firmware department simply could not keep up with the demand for more features. Department leaders tried to spend their way out of the problem, but more than doubling the number of engineers did little to help. So they decided to engineer a solution to the problem by reengineering the development process.

The starting point was to quantify exactly where all the engineers’ time was going. Fully half of the time went to updating existing LaserJet printers or porting code between different branches that supported different versions of the product. A quarter of the time went to manual builds and manual testing, yet despite this investment, developers had to wait for days or weeks after they made a change to find out if it worked. Another twenty percent of the time went to planning how to use the five percent of time that was left to do any new work. The reengineered process would have to radically reduce the effort needed to maintain existing firmware, while seriously streamlining the build and test process. The planning process could also use some rethinking.

It’s not unusual to see a technical group use the fact that they inherited a messy legacy code base as an excuse to avoid change. Not in this case. As impossible as it seemed, a new architecture was proposed and implemented that allowed all printers – past, present and even future – to operate off of the same code branch, determining printer-specific capabilities dynamically instead of having them embedded in the firmware. Of course this required a massive change, but the department tackled one monthly goal after another and gradually implemented the new architecture. But changing the architecture would not solve the problem if the build and test process remained slow and cumbersome, so the engineers methodically implemented techniques to streamline that process. In the end, a full regression test – which used to take six weeks – was routinely run overnight. Yes, this involved a large amount of hardware, simulation and emulation, and yes it was expensive. But it paid for itself many times over.

During the recession of 2008 the firmware department was required to return to its previous staffing levels. Despite a 50% headcount reduction, there was a 70% reduction in cost per printer program once the new architecture and automated provisioning system were in place in 2011. At that point there was a single code branch and twenty percent of engineering time was spent maintaining the branch and supporting existing products. Thirty percent of engineering time was spent on the continuous delivery infrastructure, including build and test automation. Wasted planning time was reclaimed by delaying speculative decisions and making choices based on short feedback loops. And there was something to plan for, because over forty percent of the engineering time was available for innovation.
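The before and after time allocations quoted in this case study can be laid side by side with a little arithmetic. The percentages come from the text; the category labels are paraphrased, and treating "over forty percent" as exactly 40 is an assumption that makes the results lower bounds.

```python
# Share of engineering time before the reengineering, in percent (from the essay).
before = {
    "maintaining and porting existing code": 50,
    "manual builds and testing": 25,
    "planning": 20,
    "new work": 5,
}

# Share of engineering time in 2011, in percent (from the essay).
after = {
    "maintaining the single branch": 20,
    "continuous delivery infrastructure": 30,
    "innovation": 40,  # the essay says "over forty percent"; 40 is a lower bound
}

# The "before" figures account for all of the engineers' time.
assert sum(before.values()) == 100

# Time not itemized in the "after" figures (planning and anything else): at most 10%.
unaccounted_after = 100 - sum(after.values())

# Share of time available for new work grew by at least this factor.
innovation_gain = after["innovation"] / before["new work"]
```

Under these figures the share of time available for new work grew at least eightfold, while the remaining, un-itemized time shrank to ten percent or less.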

This multi-year transition was neither easy nor cheap, but it absolutely was worth the effort. If you would like more detail, see (Gruver et al, 2013).

A more recent case study of how the software company Paddy Power moved to continuous delivery can be found in (Chen, 2015). In this case study the benefits of continuous delivery are listed: improved customer satisfaction, accelerated time to market, building the right product, improved product quality, reliable releases, and improved productivity and efficiency. There is really no downside to continuous delivery. Of course it is a challenging engineering problem that can require significant architectural modifications to existing code bases as well as sophisticated pipeline automation. But technically, continuous delivery is no more difficult than other problems software engineers struggle with every day. The real stumbling block is the change in organizational structure and mindset required to achieve serious improvements in flow efficiency.

— End Case Study —

Credit

This essay is a preprint of the author’s original manuscript of a chapter to be published in Netland and Powell (eds) (2016) "Routledge Companion to Lean Management".

References

Anderson, David. Kanban, Blue Hole Press, 2010

Beck, Kent. Extreme Programming Explained, Addison-Wesley, 2000

Beck, Kent et al. Manifesto for Agile Software Development, http://agilemanifesto.org/, 2001

Calçado, Phil. How we ended up with microservices. http://philcalcado.com/2015/09/08/how_we_ended_up_with_microservices.html

Chen, Lianping. "Continuous Delivery: Huge Benefits but Challenges Too" IEEE Software 32 (2). 50-54. 2015

Clark, Kim B. and Takahiro Fujimoto. Product Development Performance, Harvard Business School Press, 1991

Gruver, Gary, Mike Young, and Pat Fulghum. A Practical Approach to Large-Scale Agile Development, Pearson Education, 2013

Humble, Jez and David Farley. Continuous Delivery, Addison-Wesley Professional, 2010

Modig, Niklas, and Par Ahlstrom. This is Lean, Stockholm: Rheologica Publishing, 2012

Morgan, James M and Jeffrey K Liker. The Toyota Product Development System, Productivity Press, 2006

Ohno, Taiichi. Toyota Production System, English, Productivity, Inc. 1988, published in Japanese in 1978

Poppendieck, Mary and Tom. Lean Software Development, Addison Wesley, 2003

Poppendieck, Mary and Tom. Implementing Lean Software Development, Addison Wesley, 2006

Porter, Michael E. and James E. Heppelmann. How Smart, Connected Products are Transforming Companies, Harvard Business Review 93 (10), 97-112, 2015

Reinertsen, Donald G. Managing the Design Factory, The Free Press, 1997

Ries, Eric. The Lean Startup, Crown Business, 2011

Smith, Preston G. and Donald G. Reinertsen. Developing Products in Half the Time, Van Nostrand Reinhold/co Wiley, 1991

Ward, Allen. Lean Product and Process Development, Lean Enterprise Institute, 2007

Womack, James P., Daniel T. Jones, and Daniel Roos. The Machine That Changed the World; the Story of Lean Production, Rawson & Associates, 1990

Monday, February 9, 2015

The Three Rules of the DevOps Game

“Of course we do agile development,” she told me. “That’s just table stakes. What we need to do now is learn how to play the DevOps game. We need to know how to construct a deployment pipeline, how to keep test automation from turning into a big ball of mud, whether micro-services are just another fad, what containers are all about. We need to know if outsourcing our infrastructure is a good long term strategy and what happens to DevOps if we move to the cloud.”

Ask the right questions

Imagine something we will call the IT stack. At one end of the stack is hardware and at the other end customers get useful products and services. The game is to move things through the stack in a manner that is responsive, reliable, and sustainable. The first order of business is to understand what responsive, reliable, and sustainable mean in your world. Then you need to be the best in your field at providing products and services that strike the right balance between responsiveness, reliability and sustainability.

1. What does it mean to be Responsive?

In many industries, responsive has come to mean devising and delivering features through the entire IT stack in a matter of minutes or hours. From hosted services to bank trading desks, the ability to change software on demand has become an expected practice. In these environments, a deployment pipeline is essential. Teams have members from every part of the IT stack. Automation moves features from idea, to code, to tested feature, to integrated capability, to deployed service very quickly.

Companies that live in this fast-moving world invest in tools to manage, test, and deploy code, tools to maintain infrastructure, and tools to monitor production environments. In this world, automation is essential for rapid delivery, comprehensive testing, and automated recovery when (not if, but when) things go wrong.

On the other end of the spectrum are industries where responsiveness is a distant second to safety: avionics, medical devices, chemical plant control systems. Even here, software is expected to evolve, just more slowly. Consider Saab’s Gripen, a small reconnaissance and fighter jet with a purchase and operational cost many times lower than any comparable fighter. Over the past decade, the core avionics systems of the Gripen have been updated at approximately the same rate as major releases of the Android operating system. Moreover, Gripen customers can swap out tactical modules and put in new ones at any time, with no impact on the flight systems. This “smartphone architecture” extends the useful life of the Gripen fighter by creating subsystems that use well-proven technology and are able to change independently over time. In the slow-moving aircraft world, the Gripen is a remarkably responsive system.

2. What does it mean to be Reliable?

There are two kinds of people in the world – optimists and pessimists – the risk takers and the risk averse – those who chase gains and those who fear loss. Researcher Tory Higgins calls the two world views “promotion-focus” and “prevention-focus”. If we look at the IT stack, one end tends to be populated with promotion-focused people who enjoy creating an endless flow of new capabilities. [Look! It works!] As you move toward the other end of the stack, you find an increasing number of prevention-focused people who worry about safety and pay a lot of attention to the ways things could go wrong. They are sure that anything which CAN go wrong eventually WILL go wrong.

These cautious testers and operations people create friction, which slows things down. The slower pace tends to frustrate promotion-focused developers. To resolve this tension, a simple but challenging question must be answered: What is the appropriate trade-off between responsiveness and safety FOR OUR CUSTOMERS AT THIS TIME? Depending on the answer, the scale may tip toward a promotion-focused mindset or a prevention-focused mindset, but it is never appropriate to completely dismiss either mindset.

Consider Jack, whose team members were so frustrated with the slow pace of obtaining infrastructure that they decided to deploy their latest update in the cloud. Of course they used an automated test harness, and they appreciated how fast their tests ran in the cloud. Once all of the tests passed, the team deployed a cloud-based solution to a tough tax calculation problem. One evening a couple nights later, Jack had just put his children to bed when the call came: “A lot of customers are complaining that the system is down.” He got on his laptop and rebooted the system, praying that no one had lost data in the process. Around midnight another call came: “The complaints are coming in again. Maybe you had better check on things regularly until we can look at it in the morning.” It was a sleepless night – something Jack was not familiar with. These were the kinds of problems that operations used to handle, but since operations had been bypassed, it fell to the development team to monitor the site and keep the service working. This was a new and unpleasant experience. First thing in the morning, the team members asked an operations expert to join them. They needed help discovering and dealing with all of the ways that their “tested, integrated, working” cloud-based service could fail in actual use.

The cause of the problem turned out to be a bit of code that expected the environment to behave in a particular way, and in certain situations the cloud environment behaved differently. The team decided to use containers to ensure a stable environment. They also set up a monitoring system so they could see how the system was operating and get early warnings of unusual behavior. They discovered that their code had more dependencies on outside systems than they knew about, and they hoped that monitoring would alert them to the next problem before it impacted customers. The team learned that all of this extra work brought its own friction, so they asked operations to give them a permanent team member to advise them and help them deploy safely – whether to internal infrastructure or to the cloud.

Of course no one was in mortal danger when Jack’s system locked up – because it wasn’t guiding an aircraft or pacing a heartbeat. So it was fine for his team to learn the hard way that a good dose of prevention-focus is useful for any system, even one running in the cloud. But you do not want to put naive teams in a position where they can generate catastrophic results.

It is essential to understand the risk of any system in terms of: 1) probability of failure, 2) ability to detect failure, 3) resilience in recovering from failure, 4) level of risk that can be tolerated, and 5) remediation required to keep the risk acceptable. Note that you do not want this understanding to come solely from people with a prevention-focused mindset (e.g., auditors) nor solely from people with a promotion-focused mindset. Your best bet is to assemble a mixed team that can strike the right balance – for your world – between responsiveness and reliability.

3. What does it mean to be Sustainable?

We know that technology does not stand still; in fact, most technology grows obsolete relatively quickly. We know that the reason our systems have software is so that they can evolve and remain relevant as technology changes. But what does it take to create a system in which evolution is easy, inexpensive and safe? A software-intensive system that readily accepts change has two core characteristics – it is understandable and it is testable.

a. What does it mean to be understandable?

If a system is going to be safely changed, then members of a modest sized team[1] must be able to wrap their minds around the way the system works. In order to understand the implications of a change, this team should have a clear understanding of the details of how the system works, what dependencies exist, and how each dependency will be impacted by the change.

An understandable system is bounded. Within the boundaries, clarity and simplicity are essential because the bounded system must never outgrow the team’s capacity to understand it, even as the team members change over time. The boundaries must be hardened and communication through the boundaries must be limited and free of hidden dependencies.

Finally, the need for understanding is fractal. As bounded sub-systems are wired together, the resulting system must also be understandable. As we create small, independently deployable micro-services, we must remember that these small services will eventually get wired together into a system, and a lot of micro-things with multiple dependencies can rapidly add up to a complex, unintelligible system. If a system – at any level – is too complex to be understood by a modest sized team, it cannot be safely modified or replaced; it is not renewable.

b. What does it mean to be testable?

A testable system, sub-system, or service is one that is testable both within its boundaries and at each interface with outside systems. For example, consider Service A which runs numbers through a complex algorithm and returns a result. The team responsible for this service develops a test harness along with their code which assures that the service returns the correct answer given expected inputs. It also creates a contract which clearly defines acceptable inputs, the rate it can accept inputs, and the format and meaning of the results it returns. The team documents this by writing contract tests which are made available to any team that wishes to invoke the service. Assume that service B would like to use service A. Then the team responsible for service B must place the contract tests from service A in its automated test suite and run the tests any time a change is made. If the contract tests for service A are comprehensive and the testing of service B always includes the latest version of these tests, then the dependency between the services is relatively safe.

Of course it’s not that simple. What if service A wants to change its interface? Then it is expected to maintain two interfaces, an old version and a new version, until service B gets around to upgrading to the new interface. And every service invoking service A is expected to keep track of which version it is certified to use.

Then again, service A might want to call another service – let’s say service X – and so service A must pass all of the contract tests for service X every time it makes a change. And since service X might branch off a new version, service A has to deal with multi-versioning on both its input and its output boundaries.

If you have trouble wrapping your head around the last three paragraphs, you probably appreciate why it is extremely difficult to keep an overall system with multiple services in an understandable, testable state at all times. Complexity tends to explode as a system grows, so the battle to keep systems understandable and testable must be fought constantly over the lifetime of any product or service.
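The contract-test and versioning arrangement described in this section can be sketched in code. This is a toy model under stated assumptions: the functions `service_a_v1`, `service_a_v2`, and `service_b`, the squaring "algorithm", and the in-process registries are all hypothetical stand-ins for real networked services and their published test suites.

```python
from typing import Callable, Dict, List


# --- Service A (provider) ---------------------------------------------------
def service_a_v1(x: float) -> float:
    """Old interface: returns the bare result."""
    return x * x


def service_a_v2(x: float) -> Dict[str, float]:
    """New interface: wraps the result in a record."""
    return {"result": x * x}


# Service A keeps both versions alive until every consumer has upgraded.
SERVICE_A: Dict[str, Callable] = {"v1": service_a_v1, "v2": service_a_v2}


# --- Contract tests published by team A, one set per interface version ------
def contract_v1(call: Callable) -> None:
    assert call(3.0) == 9.0
    assert call(0.0) == 0.0


def contract_v2(call: Callable) -> None:
    assert call(3.0) == {"result": 9.0}


CONTRACT_TESTS: Dict[str, List[Callable]] = {
    "v1": [contract_v1],
    "v2": [contract_v2],
}


# --- Service B (consumer) ---------------------------------------------------
CERTIFIED_VERSION = "v1"  # B records which version of A it is certified to use


def service_b(x: float) -> float:
    """Calls service A through the certified interface and adds one."""
    return SERVICE_A[CERTIFIED_VERSION](x) + 1.0


def run_b_test_suite() -> bool:
    """B runs A's contract tests for its certified version on every change."""
    for test in CONTRACT_TESTS[CERTIFIED_VERSION]:
        test(SERVICE_A[CERTIFIED_VERSION])
    # B's own test, which depends on A behaving as the contract promises.
    return service_b(3.0) == 10.0
```

Even in this toy, adding a service X behind service A would mean another registry, another certified version, and another set of contract tests to run, which hints at how quickly the bookkeeping grows.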

A Reference Architecture

Over the last couple of decades, the most responsive, reliable, renewable systems seem to have platform-application architectures. (The smartphone is the most ubiquitous example.) Platforms such as Linux, Android, and Gripen avionics focus on simplicity, low dependency, reliability, and slow evolution. They become the base for swappable applications which are required to operate in isolation, with minimum dependencies. Applications are small (members of a modest sized team can get their heads around a phone app), self-sufficient (apps generally contain their own data or retrieve it through a hardened interface), and easy to change (but every change has to be certified). If an app becomes unwieldy or obsolete, it is often easiest to discard it and create a new one. While this may appear to be a bit wasteful, it is the ability of a platform-app architecture to easily throw out old apps and safely add new ones that keeps the overall ecosystem responsive, fault tolerant, and capable of evolving over time.

So these are the three rules of the DevOps game: Be responsive. Be reliable. Be sure your work is sustainable.

[1] What is a modest sized team? We have found that in hardware-software environments, a team the size of a military platoon (three squads) is often a good size for major sub-systems. Robin Dunbar found in his research that a hunting group (30-40 people) brings the diversity of skills necessary to achieve a major objective. See the essays “Before there was Management” and “The Scaling Dilemma.”