tag:blogger.com,1999:blog-22294685627744926532024-03-05T03:05:02.029-06:00LeanEssaysMary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comBlogger61125tag:blogger.com,1999:blog-2229468562774492653.post-85587256928338628962022-06-11T13:24:00.018-05:002022-06-13T14:41:53.937-05:00 When Demand Exceeds Capacity<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhys9FvBIdAinqZTAJ5sgkcCE7CtKGdoJYLMRgA1zQnsoGu08K_VpFb60MXSyoqeePptNH-L5qqOAPm_oV73weKnnrcTCgVKcwBcJmCYojkZS9s8gn0meONIoX6l4JqKo4BVHlEYmwq1CILuqsZuoYQsGhzDa8TGBm8Ueo9VxdspwC2nUE9349UYlyNgA" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="" data-original-height="524" data-original-width="480" height="266" src="https://blogger.googleusercontent.com/img/a/AVvXsEhys9FvBIdAinqZTAJ5sgkcCE7CtKGdoJYLMRgA1zQnsoGu08K_VpFb60MXSyoqeePptNH-L5qqOAPm_oV73weKnnrcTCgVKcwBcJmCYojkZS9s8gn0meONIoX6l4JqKo4BVHlEYmwq1CILuqsZuoYQsGhzDa8TGBm8Ueo9VxdspwC2nUE9349UYlyNgA=w244-h266" width="244" /></a></div><p></p><p>We are getting solar panels on our roof. Someday. We signed a contract last November and the original estimate for installation was May. Since no one does outside work in the winter here in Minnesota, this was about two months after the earliest possible installation date. But May has come and gone, and we still do not have an installation date. Why? The electrical design for our 40-year-old house is complex and the young solar company has limited electrical design capability. So does our mature electrical utility company. Between the two of them, our project has been held up for a long time. </p><p>Solar systems have become widely popular since the electrical grid meltdown in Texas and the sharp rise in energy prices. 
So, it is not a surprise that the nascent solar industry is dealing with far more demand than the young solar companies and staid utility companies are equipped to handle. It is difficult to find enough talented people to do the complex system design, project management, and installation of solar systems. Thus, solar customers experience long delays that cost us months of potential savings, while the companies suffer from months of delayed revenue. </p><p>Our solar company has an eager sales force and a handful of project managers. Once contracts are signed, sales people hand them off to project managers so they can focus on pursuing more sales. Since the rate of signed contracts greatly exceeds the rate of completed installations, the number of customers assigned to each project manager is constantly increasing. It is impossible for a project manager to manage dozens of complex solar projects at the same time, but that is what they are expected to do. </p><p>Clearly, our solar company is unwilling to reduce its sales rate to match its project completion rate and unable to increase the completion rate to match the rate at which contracts are signed. Thus, the length of time a project takes continually increases, putting customers and project managers in a lose-lose situation. In fact, the solar industry is hardly the first industry in the world to experience far more demand than it has the capacity to supply; many other industries have navigated these waters. Typically, companies that do not learn how to manage excess demand end up failing because they generate annoyed customers faster than they generate revenue. Companies that became giants in rapidly growing industries – for example, Airbnb, AWS, Google, Netflix, and SpaceX – learned quickly to increase their delivery capacity to match the rate at which they generated sales. Successful solar companies will have to learn the same thing.</p><p>How do companies rapidly increase delivery capacity? 
Hiring more people might help, but it is not nearly enough. Young companies in a growth industry must figure out how to complete more jobs faster while doing less work or they will not be able to keep up with the rapid increase in demand. There are three essential strategies that successful industry giants have used to increase efficiency which other companies would do well to understand.</p><p><span style="color: #45818e; font-size: medium;"><b>1. Focus on Customer Outcomes</b></span></p><p>The first strategy for increasing efficiency is to pay attention to the metrics that drive behavior. Startups tend to measure success by counting the number of sales, while more efficient companies tend to measure the number of satisfied customers. When success is measured by the number of satisfied customers rather than booked revenue, several good things happen. The company learns not to accept work at a faster rate than it can complete work, because doing so does little more than stretch out the time it takes for its people to be successful while lengthening the queue of increasingly impatient customers. Work teams also learn to focus on completing installations as quickly as possible, because more installations will be seen as better performance. Finally, teams are unlikely to develop products that customers are not interested in – a very common source of wasted effort in startup companies.</p><p>It is extraordinarily difficult for an ambitious young company to turn away people who are beating a path to its door. Everyone’s instinct is to hire more salespeople and let the work pour in. But let’s face it, the most important thing that a young company needs to learn is how to COMPLETE work quickly, not how to ACCEPT work quickly. If it opens the doors wide and lets everyone in, there will be so many customers milling about that workers will be distracted trying to deal with the chaos rather than focused on getting work done. 
The most important thing for that young company to do at this point is to stop spending so much time on managing excess work and start concentrating on how to get jobs done as fast as they arrive.</p><p>Every growth industry has gone through this learning curve. Once upon a time, mail order catalogs (Sears, for example) accepted orders as fast as they could and shipped them in about two weeks. Because of the delay, they had to deal with people changing their mind and a complex order change system was necessary. Then L.L.Bean had a novel idea – why not ship products within 24 hours of receiving the order? It worked – customers were delighted, and no order change system was needed. Today almost all fulfillment systems work on the same principle – focus on building the capacity to ship product at the same rate orders arrive. Once shipping capacity matches demand, the company can ship immediately after an order arrives, delighting customers and reducing congestion while giving customers little time to change their minds. </p><p><span style="color: #45818e; font-size: medium;"><b>2. Manage FLOW, not tasks</b></span></p><p>The second strategy for increasing efficiency is counterintuitive; it requires a shift in the company’s mental model of how work gets done. Consider our solar company. As a small startup, it thinks of every project as a separate entity and assigns a project manager to sequence the steps in an installation so all necessary work gets done. Unfortunately, this approach is not readily scalable, and even if it were, there are not enough experienced project managers available to hire. Industry leaders do things differently; they manage packages of items as they move through their system, not individual projects or tasks. </p><p>What does it mean to manage groups of items? Consider the shipping industry. Until the 1950’s, ships were loaded with individual items packed carefully to keep the ship stable. 
Each item had to be traced through every port as it moved off the ship and onto another ship or a train or a truck. Then the shipping container was invented, and it was no longer necessary to manage individual items moving through the supply chain. Instead, shippers managed containers as they moved from origin to destination. </p><p>Some years ago, we bought a sofa that was heavily discounted because of a missing headrest. A few months later the store called to say it was filling a container of furniture from their supplier in Norway and they could add our missing headrest for a small fee. It would have been too complex (and expensive) to order and track the headrest individually, but it was simply assigned to the container in which many items moved as a package from Norway to Minnesota. The cost of shipping was negligible. It's easy to see how containers created a reduction in complexity that dramatically reduced the cost of shipping goods and led to substantial changes in the global economy. </p><p>How do you group and manage workflow in a growing solar company? First, you think of a contract as a number of steps that need to be completed. Certainly, some of the steps can be done in parallel – for example, solar panel layout can be done in parallel with electrical design. Other steps have prerequisites that must be completed first. You organize work so that each step has its own responsible team, its own queue, and its own completion rate. As each new contract is signed, it is coded with the various steps it must move through, including any required sequencing. Then you release the contract to the starting step (or steps). When a step is done, work should automatically move to the next step(s) until all work is complete. Your job is to manage the FLOW of work through the system rather than the progress of individual items through their various steps.</p><p>How do you manage the flow of work through a system? 
Start by managing the amount of time a job spends waiting for attention in the queue of each step. Queueing theory is clear: if you limit the length of a queue and send work through in FIFO (first-in-first-out) order, then time in a queue is the number of items in the queue divided by the average completion rate. For example, if you limit the items in a work queue to twenty and the team completes an average of two items per day, work will wait in the queue – if it is FIFO – for no more than ten days. (Note: If a FIFO queue is not acceptable, then the queue is too long! Make it shorter.) </p><p>Once a job reaches the end of a queue, it should flow quickly through the work step. The best approach is for a responsible team to begin a job, focus on it until it is done, then go on to the next job. Focusing on one thing at a time improves the quality of the work, eliminates wasted time due to multitasking, and gets everything done faster. (See Single-Responsibility Teams, below.) </p><p>If a step in the workflow is completed inside of your company, you have a lot of control over the time it takes work to move through the step (queue time + work time). Your goal is to create short, predictable completion times for every internal step. If a step occurs outside your company (the utility company, for example, or the city permitting process) you do not have control of the time it takes, but you can measure it – you should track average time and variation. Note that it is important to separate internal steps from external steps – otherwise variations in the external completion times will make it impossible to control internal completion times.</p><p>It is also important to ensure that downstream steps have capacity in their queues to handle work as it is completed upstream. If this sounds complicated, an analogy might be helpful: Consider an order coming into Amazon. 
Once upon a time, orders were handled by a central database which directed each step in the processing of orders as they moved through the company. But in the early 2000’s Amazon outgrew this process; they could not buy large enough database computers to keep up with surges in demand. Amazon needed a new approach, so it abandoned the central database and thought of each order as moving through a series of independent services. When an order is placed, it is coded with an initial chain of necessary services (Do we need to do a credit check? Does the product require special shipping? ...) and sent on its way through the various processing steps. A service can divert the order or add additional steps (Oops, product is out of stock! ...) and downstream processes are expected to have the capacity to accept work from upstream processes. (If the product can be picked from a warehouse, it should be possible to package and ship it immediately.)</p><p>Armed with a set of steps each contract needs and the time each step will take, you, like Amazon, should be able to predict the expected installation date very early – ideally when the contract is signed. You should also have a handle on variation due to outside queues and dependencies such as parts availability and supply chain delays. You can send the contract into your workflow, track how well it is coming along and flag exceptions as soon as they occur. </p><p>In addition, you can spot bottlenecks that need to be improved and generally manage the capacity of your organization. You can be alerted when you need to limit sales to avoid overcommitment: <i>Do not accept work faster than the rate at which work moves through your bottleneck process.</i> Focus on the bottleneck until it is improved, at which point, of course, a new bottleneck will appear. Find the next bottleneck and limit sales of projects that use it to the rate the bottleneck can handle. Repeat... 
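The queue-time arithmetic and the bottleneck rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a real system; the step names, queue limits, and completion rates are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    queue_limit: int        # cap on the step's FIFO queue (limits WIP)
    items_per_day: float    # average completion rate of the step's team

    @property
    def max_queue_days(self) -> float:
        # Little's law for a capped FIFO queue:
        # worst-case wait = queue length / average completion rate
        return self.queue_limit / self.items_per_day

# Hypothetical workflow, treated sequentially as a worst case
steps = [
    Step("electrical design", queue_limit=20, items_per_day=2.0),
    Step("panel layout",      queue_limit=10, items_per_day=2.5),
    Step("installation",      queue_limit=8,  items_per_day=1.0),
]

# Worst-case lead time: wait in every queue, then be worked on at each step
lead_time = sum(s.max_queue_days + 1 / s.items_per_day for s in steps)

# The bottleneck is the slowest step; never accept work faster than it can go
bottleneck = min(steps, key=lambda s: s.items_per_day)

print(f"Worst-case lead time: {lead_time:.1f} days")
print(f"Bottleneck: {bottleneck.name} at {bottleneck.items_per_day} items/day")
```

With numbers like these in hand, a sales limit falls out directly: never sign contracts faster than the bottleneck's completion rate, or every queue upstream of it will grow without bound.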
</p><p>By far the biggest benefit of managing the flow of groups of items is this: You will get a lot more done with a lot less effort. How can this be? When you no longer have half-done work clogging up your system, demanding attention, and distracting workers, everyone can spend much more of their time actually getting useful work done. </p><p><span style="color: #45818e; font-size: medium;"><b>3. Form Single-Responsibility Teams</b></span></p><p>The third big change you should make is to charter single-responsibility teams, organized and led by single-responsibility leaders. What is a single-responsibility team? It is a team that has one and only one thing to do, a team that can focus all its efforts on accomplishing one thing at a time and doing that one thing well. Be careful to minimize dependencies among these teams; each team should be able to begin and complete its work independently; teams should not need to coordinate with or get permission from managers or other teams. Finally, work should not appear in the queue of a single-responsibility team until everything is in place for that work to be done, otherwise the team will have to finish someone else’s work before they can focus on their own. </p><p>How is it possible to organize complex work into pieces that can be accomplished by single-responsibility teams? While there is no such thing as the “best” organizational structure for companies in diverse domains, it is easy to find examples that demonstrate the concept. Let’s start with AWS (Amazon Web Services). Over the past decade AWS has displaced many traditional enterprise software vendors with an array of independent services, each managed by a single-responsibility team and strung together through hardened interfaces. No one has ever seen an AWS roadmap; instead, teams identify (or are given) a customer problem and they start by writing a “press release” describing how their solution will solve the problem. 
Then they proceed to develop the solution, testing it with customers as often as possible. AWS has released multiple new enterprise-level services every year since 2012, each aimed at solving a specific customer problem. These services have gradually lured companies away from the complex, integrated packages of traditional vendors, often prompting AWS customers to organize their own work around autonomous, focused teams.</p><p>Now let’s look at SpaceX, which is organized very differently from AWS. SpaceX is fundamentally an engineering organization; its teams are organized around the components of booster rockets that launch things into space. There is a team for each stage of the booster (e.g. payload, stage 1, stage 2, engine cluster…) with sub-teams for each sub-component of the stage (Merlin engine, landing legs…). </p><p>You might wonder: How do teams know what to do and when to do it? How do they coordinate with other teams? How do they know they have done a good job? SpaceX operates on the principle of responsibility, which says that each team / sub-team is expected to understand the role its component must play in the next launch and have it ready by the scheduled launch date (perhaps a couple of months away). The team, led by a responsible engineer, has one job: Ensure that the component works as well as possible during the launch and does its job as part of the overall system. </p><p>How does SpaceX manage FLOW? For a large, multi-team development effort involving both hardware and software, managing flow is usually accomplished by scheduling a series of frequent integration events. The purpose of each event is to take a well-defined step forward in the overall development effort: to learn what works, locate unintended integration effects, and discover ways to improve the system. A series of integration events sets the cadence for all teams involved in the development effort. 
In the case of SpaceX, a test launch is scheduled every few months throughout a development cycle; importantly, launch dates Do. Not. Change. Each launch is an integration event that moves all teams forward together, creating a steady flow of real progress toward a working system. </p><p>Of necessity, based on the structure of a booster, SpaceX teams have more dependencies than AWS teams. They also have a clear responsibility to understand and account for those dependencies. Since SpaceX teams work to meet fixed launch deadlines, their systems sometimes fail during a launch. Failures are part of the process, but teams must carefully document launch behavior to be sure that the cause of any failure is identified and eliminated. Believe it or not, this approach to development is much faster and less costly (and usually safer!) than trying to think through every failure mode and eliminate any possibility of failure before attempting a test launch. </p><p>Does SpaceX focus on customer outcomes? The company undertook the difficult challenge of developing a reusable booster because it was clear that sending things into space was too expensive for many potential customers. Reusing boosters was an essential step in dramatically reducing the cost of a launch, significantly increasing the number of customers with access to space. This need to reduce launch costs has driven most major booster development decisions at SpaceX.</p><p>SpaceX’s reusable booster has reduced the cost of launching a kilo of payload into space by a factor of 7, while spending an order of magnitude less money on development than previous booster rockets.</p><p><span style="color: #45818e; font-size: large;"><b>Summary</b></span></p><p>A lot of companies have more work than they can possibly handle – and that is not a good thing. It creates long delays, excess complexity, and eventually, dissatisfied customers. 
The secret is to limit demand to capacity and then increase capacity to meet additional demand by creating a steady flow of work while teams focus on a single responsibility. You also need to stop doing work that does not contribute to great customer outcomes: Stop multitasking, stop prioritizing, get rid of backlogs, stop shepherding individual tasks. If companies as large and successful as AWS and SpaceX can organize work around customer outcomes, steady flow, and focused, single-responsibility teams – then so can you. </p><p><br /></p>Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comEden Prairie, MN, USA44.8546856 -93.4707859999999916.544451763821158 -128.62703599999998 73.164919436178849 -58.31453599999999tag:blogger.com,1999:blog-2229468562774492653.post-41971255248143326152020-05-20T09:21:00.000-05:002020-06-19T22:51:23.069-05:00What’s Wrong With Training Wheels?<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3EoZXeS2NoTRy_EkL5t2yT43hc0LHDq-6iKOOa5fdm7MA554TMDrcl5cnzcMIBblHwjOQ3-MdukKYJo6OzEkNK5UeiDy-fktVdGrm5SKgEtkX1d6hqkOYdtnum5FM7U0R6AT6Rl5qBy4y/s1600/Bike.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"></a><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3EoZXeS2NoTRy_EkL5t2yT43hc0LHDq-6iKOOa5fdm7MA554TMDrcl5cnzcMIBblHwjOQ3-MdukKYJo6OzEkNK5UeiDy-fktVdGrm5SKgEtkX1d6hqkOYdtnum5FM7U0R6AT6Rl5qBy4y/s1600/Bike.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="643" data-original-width="776" height="165" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3EoZXeS2NoTRy_EkL5t2yT43hc0LHDq-6iKOOa5fdm7MA554TMDrcl5cnzcMIBblHwjOQ3-MdukKYJo6OzEkNK5UeiDy-fktVdGrm5SKgEtkX1d6hqkOYdtnum5FM7U0R6AT6Rl5qBy4y/s200/Bike.jpg" width="200" /></a></div>
The boy looked to be four or five years old. He was laboriously pushing the pedals of his bike, hands gripping the high handlebars. His mom walked slowly beside him.<br />
<br />
I bit my tongue as they passed – I wanted to tell the mom that training wheels are so last century! That my grandson is not yet three years old, but he is whizzing around the neighborhood on his balance bike. His dad can barely jog fast enough to keep up.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.blogger.com/video.g?token=AD6v5dyhcu4waPDGvwx4SapjIP6RqqADrR90qviDcU9jolmqAxuwpD5vlfyWTfBTJHXVUQXEqicB5J1cGQk7MAlTuA' class='b-hbp-video b-uploaded' frameborder='0'></iframe></div>
A balance bike is short, the handlebars are low; my grandson straddles the center bar, pushing the ground to scoot forward. When a slight downhill propels the bike forward, his feet come up and he coasts downhill. When he falls, it’s not very far to the ground. He learned intuitively how to keep his bike balanced and eventually how to steer and stop. When the time comes to add pedals, he will already know the basics of bike riding, which he pretty much learned on his own.<br />
<br />
Once you’ve seen a two-year-old buzzing around on a balance bike, you know the four-year-old struggling with training wheels is using the wrong process to learn to ride a bike. It’s much more important to learn balance, steering and stopping than it is to learn how to pedal.<br />
<br />
When I observe teams using Scrum ceremonies that are two decades old, when I see roles that effectively put proxies between the engineering team and the problem to be solved, when I see a company struggling with a scaling process – I see training wheels. It’s time to call out these practices for what they are – processes that focus on the wrong thing, that remove feedback from the system, that distract teams from learning simpler, faster, more relevant ways to develop software.<br />
<br />
We are living through a Black Swan event – one that has taught many organizations how to turn on a dime and reconfigure their supply chains, their customer interaction models, their product delivery approaches. Today every store near me offers curbside delivery, supported by a lot of rapid software changes. You can bet these changes were not in the plans a couple of months ago, but they happened, and they happened fast. It feels like a lot of local companies gave their teams balance bikes and let them learn how to scoot, coast, fall down, steer, stop. Whatever it takes – just get curbside pickup working – NOW! Eventually they’ll add the pedals.<br />
<br />
2020 has dawned on a decade that will be focused on resilience, adaptability, and rapid response. We need to give our teams balance bikes, not training wheels – but what does a balance bike look like?<br />
<br />
First of all, our balance bike operates best without dependencies, and we must acknowledge that dependencies are an architectural problem. Too many companies are trying to solve the dependency problem with process solutions, rather than tackle the real problem – their architecture, both the system architecture and the organizational architecture.<br />
<br />
Next, our balance bike riders need to learn to take control of their environment – to balance, to fall, to steer, to brake. We need to stop giving our software teams training wheels: tasks and priorities. It’s time for them to tackle real problems, make critical trade-offs, recover from mistakes, make adjustments, and keep moving.<br />
<br />
Eventually our teams will be ready for pedals – so they can move fast, far, and consistently whenever they want. Software engineers need an execution environment that supports a deployment pipeline, staged releases with automatic rollbacks, and feedback directly to the engineering team so it can figure out what to do next.<br />
<br />
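The staged-release-with-automatic-rollback idea above can be sketched in a few lines. This is only an illustration of the control flow; the stage fractions, error threshold, and function names are all invented, and a real pipeline would gate each stage on live telemetry rather than a precomputed list:

```python
# Fraction of traffic exposed to the new release at each stage
STAGES = [0.01, 0.10, 0.50, 1.0]

def healthy(error_rate: float, threshold: float = 0.02) -> bool:
    """A stand-in health check: is the observed error rate acceptable?"""
    return error_rate < threshold

def staged_release(observed_error_rates):
    """Widen the release stage by stage; roll back on the first bad signal.

    Assumes one observed error rate per stage, in order.
    """
    for stage, error_rate in zip(STAGES, observed_error_rates):
        if not healthy(error_rate):
            # Automatic rollback: stop the rollout at this stage
            return ("rolled_back", stage)
        # Healthy: widen the release to the next stage
    return ("released", 1.0)
```

The point of the structure is that the engineering team, not a separate operations gatekeeper, owns both the rollout decision and the feedback that drives it.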
The fact is, this kind of balance bike exists in the software world; one example would be Amazon Web Services, and there are many others. We need to get a message through to ‘parents’ that there is a better way to implement software changes than the legacy practices of the 1990’s or the dated practices of the early 2000’s. It’s time to recognize that agile training wheels actually get in the way of the resilience, adaptability, and rapid response demanded by this new era.<br />
<div>
________________<br />
<br />
Thanks to Joshua Kerievsky, who originated the training wheels analogy. </div>
Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comtag:blogger.com,1999:blog-2229468562774492653.post-84853365107562169422019-07-14T11:25:00.001-05:002019-07-15T10:29:10.978-05:00Grown-Up Lean<h4>
<div style="text-align: center;">
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_6GcdnNGAzu46k5hqjSJv8o0jBRlDRujkn_pFmqIcgSAPVenazu3nQiCliKD_FVM3fWiF7hcfKYV7UTheWdqyg4H2BDJOL3jop5AaDXDhJfFJbHJzv6GEsiOrAnECTElUqKtC03WmX17p/s1600/grownup2.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="641" data-original-width="1032" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_6GcdnNGAzu46k5hqjSJv8o0jBRlDRujkn_pFmqIcgSAPVenazu3nQiCliKD_FVM3fWiF7hcfKYV7UTheWdqyg4H2BDJOL3jop5AaDXDhJfFJbHJzv6GEsiOrAnECTElUqKtC03WmX17p/s320/grownup2.png" width="320" /></a></div>
<br />
<i><span style="font-size: large;">Lean was introduced to software a couple of decades ago. </span></i></div>
<i><div style="text-align: center;">
<i><span style="font-size: large;">How are they getting along?</span></i></div>
</i></h4>
<div style="text-align: center;">
<span style="font-size: x-small;">This working paper was submitted as a chapter in <i>The International Handbook of Lean Organization</i>, Cambridge University Press, Forthcoming.</span></div>
<br />
<h2 style="text-align: center;">
<span style="color: #073763; font-size: x-large;"> The Nature of Software</span></h2>
<div style="text-align: center;">
<i><span style="color: #073763;">“Do not go where the path may lead, <br />
go instead where there is no path and leave a trail”</span></i></div>
<div style="text-align: center;">
<span style="color: #073763; font-size: x-small;">-- Ralph Waldo Emerson</span></div>
<div style="text-align: center;">
<br /></div>
It’s May 27, 1997. The Internet has been open to the public for six years. Linux is six years old. Amazon is three. Google doesn’t exist. The dotcom bubble hasn’t happened.<br />
<br />
In Würzburg, Germany, Eric Raymond presents an essay called "The Cathedral and the Bazaar"<span style="font-size: x-small;">[1]</span> at the Linux Kongress. He describes “some surprising theories about software engineering”:<br />
<blockquote class="tr_bq">
<i>I discuss these theories in terms of two fundamentally different development styles, the "cathedral" model of most of the commercial world versus the "bazaar" model of the Linux world. I show that these models derive from opposing assumptions about the nature of the software-debugging task. I then make a sustained argument from the Linux experience for the proposition that “Given enough eyeballs, all bugs are shallow”, suggest productive analogies with other self-correcting systems of selfish agents, and conclude with some exploration of the implications of this insight for the future of software.</i></blockquote>
The implications were clear:<br />
<blockquote class="tr_bq">
<i>Perhaps in the end the open-source culture will triumph not because cooperation is morally right…. but simply because the commercial world cannot win an evolutionary arms race with open-source communities that can put orders of magnitude more skilled time into a problem.</i></blockquote>
The democratization of programming arrived with the public Internet in 1991, and within a decade it became clear that the old model for developing software was obsolete. No longer was it practical for experts to write requirements and send them to a support group where programmers wrote code and testers wrote corresponding tests and then reconciled the two versions of the requirements; finally, after weeks, months, or even years, a big batch of new code was released to consumers (a.k.a. ‘users’). This ‘process’ never really worked, but the commercial world had not yet found a replacement.<br />
<br />
However, the open source world figured out a better way to develop software. Eric Raymond was right – it was not about writing the code, it was about ‘the software-debugging task’. It’s easy to write bug-free code in isolation; most bugs are caused by the way one piece of correct software interacts with another piece of correct software. As a code base grows large, potential interactions grow exponentially, and it quickly becomes impossible to test every interaction, or even predict which interactions might cause defects. In the open source world, staffed completely by volunteers, there was no attempt to test for every potential problem before making a change to a live code base – no one would volunteer to do the work. On the contrary, volunteers were motivated by seeing their contribution working right away. So small changes were submitted, reviewed, and integrated into the live code base as quickly as possible. If a bug surfaced, there were plenty of eyeballs to see the problem, limit the damage, find the cause, and fix it. Plus, the offending code was probably the latest submission, so the person whose code triggered the problem was usually identified and would be embarrassed. Open source was (and is) known to be a brutal but effective training ground for software engineers.<br />
<br />
<h3 style="text-align: center;">
Example: Amazon</h3>
One of the earliest commercial companies to figure out the nature of ‘software-debugging’ was Amazon. As the company outgrew its traditional cathedral-style software architecture in the early 2000’s, the leadership team felt that the growing pains could be addressed with better communication between teams. But CEO Jeff Bezos disagreed. He believed that the only way to grow seriously large was to have many independent (selfish) agents making local decisions – essentially a Bazaar-style organizational architecture. Bezos declared that teams should be small enough to be fed with two pizzas, and these teams should operate independently. It took some years to evolve to a software architecture that supported such teams, but eventually small, independent services owned by two-pizza teams made up the core of Amazon’s infrastructure. Customer-focused metrics were used to guide a team’s performance, and teams were expected to work both autonomously and asynchronously to improve customer outcomes. Initially this created havoc in operations, which was responsible for any problems that surfaced once code ‘went live’. But the infrastructure VP invented ways for engineering teams to self-provision hardware and self-deploy software, which made it possible for teams to retain responsibility for any problems their services encountered once they went ‘live’, not just during development.<br />
<br />
Once software engineers realized they might be awakened in the middle of the night if their code created a problem, they became very good at keeping bugs out of their code. Three strategies emerged:<br />
<ol>
<li>Teams hardened their service interfaces, effectively isolating their service from unintended interactions with the rest of the system. These interfaces, called APIs (Application Programming Interfaces), were contracts between the service and its consumers or suppliers. No interactions or data exchanges were allowed except through APIs, which reduced the number of possible interactions to a manageable number and provided testing surfaces for every interaction.</li>
<li>If you give software engineers manual work, their first instinct is to automate it. So, when a small team which included software engineers became responsible for testing a service and its interfaces, you can bet the job was quickly automated.</li>
<li>Teams released software early and often. They did this because they could release at any time, so why not now? After all, just as in open source, seeing the results of your work is motivating.</li>
</ol>
There you have it: ownership, isolation, automation, and fast feedback turn out to be among the best strategies we have for keeping software working correctly.<br />
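The interface-hardening strategy in the first point can be sketched in a few lines of code. This is a minimal, hypothetical example (the service name and methods are invented, not Amazon’s): consumers touch the service only through its published API, never its internal state, so every allowed interaction has a small, testable surface.

```python
# A minimal sketch (illustrative names) of a hardened service interface:
# consumers interact only through the published API, never with internals.

class InventoryService:
    """Hypothetical two-pizza-team service; its state is private."""

    def __init__(self):
        self._stock = {}          # internal state: never touched from outside

    # --- the API: the only contract consumers may rely on ---
    def add_stock(self, sku: str, qty: int) -> None:
        if qty < 0:
            raise ValueError("qty must be non-negative")
        self._stock[sku] = self._stock.get(sku, 0) + qty

    def reserve(self, sku: str, qty: int) -> bool:
        """Reserve qty units; returns False if not enough stock."""
        if self._stock.get(sku, 0) < qty:
            return False
        self._stock[sku] -= qty
        return True

# Because the surface is small, every interaction can be tested directly.
svc = InventoryService()
svc.add_stock("widget", 5)
assert svc.reserve("widget", 3) is True
assert svc.reserve("widget", 3) is False   # only 2 left
```

Because nothing outside the class may read or write `_stock` directly, the number of possible interactions collapses to the two API methods – which is exactly what makes automated testing of every interaction feasible.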
<br />
Once Amazon figured out how to make this all work (which took years), it leveraged the knowledge by selling its internal services under the brand AWS (Amazon Web Services). In 2018 AWS was a $25 billion / year business, growing at a very fast clip. Much of this growth comes from large enterprises that discover they cannot win an arms race against an architecture and strategy that manages complex systems orders of magnitude more efficiently than they can.<br />
<br />
<h3 style="text-align: center;">
Example: Google</h3>
Another company that learned the nature of ‘software debugging’ early in its life was Google. From the beginning, Google hired ‘software engineers’, because they were looking for people who could figure out how to “organize the world’s information and make it universally accessible and useful”<span style="font-size: x-small;">[2]</span> and solve the technical problems that came with such an aggressive mission. The earliest technical problems centered on how to store all that data, and then how to search it. Fast.<br />
<br />
In 1988, Berkeley scientists David A. Patterson, Garth Gibson, and Randy H. Katz presented the paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)"<span style="font-size: x-small;">[3]</span> at the ACM SIGMOD Conference. They stunned the computer-savvy world by suggesting that a redundant array of inexpensive disks promised “improvements of an order of magnitude in performance, reliability, power consumption, and scalability” over single large expensive disks. (In other words, a bazaar-style hardware architecture was vastly superior to a cathedral-style architecture.) Berkeley is a close neighbor of Stanford, where Google was born. In hindsight, it is not surprising that Google started its life using a redundant array of inexpensive hardware to store the data it gathered while crawling the Internet. In 2003 and 2004 Google engineers released three groundbreaking papers: "Web Search for a Planet: The Google Cluster Architecture",<span style="font-size: x-small;">[4]</span> "The Google File System",<span style="font-size: x-small;">[5]</span> and "MapReduce: Simplified Data Processing on Large Clusters".<span style="font-size: x-small;">[6]</span> These papers explained their approach to managing a vast array of inexpensive hardware, decomposing massive amounts of data into clusters, storing it redundantly, searching it in place and returning results almost instantly. At the heart of this approach to infrastructure are the core strategies of isolation, redundancy, fault detection, and automation. This was (and is) a truly impressive engineering accomplishment.<br />
<br />
Applying its core infrastructure principles to complex software systems, Google’s approach to maintaining the quality of its rapidly growing code base used these strategies:<br />
<ol>
<li><b>Ownership:</b> Software engineers were responsible for the quality of their code. Test engineers were available to help engineering teams create and use tools for debugging their software.</li>
<li><b>Isolation:</b> Google developed an understanding of the boundaries of its systems by creating a dependency map. Testing of a section of code could then be confined to that section and its dependencies.</li>
<li><b>Redundancy:</b> Engineers created two machine-readable versions of system behavior by writing automated tests (test code should be considered a specification), and then writing the code to pass the tests. This is like double-entry bookkeeping, a practice that uses redundancy to ensure accuracy.</li>
<li><b>Fault Detection:</b> Tests are put into a test harness that is run automatically whenever code is checked into the code repository to ensure that the new code works and has not broken any tests that used to pass. In addition, real-time behavior monitoring is used to detect and respond to anomalous behavior, whether due to software, hardware, network, load, or some other issue.</li>
<li><b>Feedback/Learning:</b> From the beginning, Google adopted the open source model of releasing ‘early and often’, using the ‘Beta’ label to signal that things would change frequently. They treated consumers as co-developers, enticing them to explore the site daily to check out new features.</li>
<li><b>Automation: </b>Google developed a host of automated tools to deploy changes to a limited audience, run A/B tests, monitor the health of its systems, find sources of defects, etc.</li>
</ol>
Of course, the reality is more complicated than this simplified description, but you get the idea. See <i>How Google Tests Software</i>,<span style="font-size: x-small;">[7]</span> by James Whittaker, Jason Arbon, and Jeff Carollo for more information.<br />
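The redundancy strategy above – two machine-readable versions of system behavior – is easy to illustrate. In this hedged sketch (the function and its numbers are invented for illustration, not Google code), the test is written first as an executable specification, and the implementation is the second “entry” that must agree with it, like double-entry bookkeeping.

```python
# Redundancy in practice: the test is a machine-readable specification,
# written before the code; the code is the second "entry" that must agree.
# (Names and values here are illustrative only.)

def spec_price_with_tax():
    # The specification, expressed as executable examples:
    assert price_with_tax(100.0, 0.07) == 107.0
    assert price_with_tax(0.0, 0.07) == 0.0

def price_with_tax(net: float, rate: float) -> float:
    """Implementation written to satisfy the specification above."""
    return round(net * (1.0 + rate), 2)

spec_price_with_tax()   # the two "books" balance
```

If either version is changed without the other, the harness catches the disagreement immediately – which is exactly the accuracy guarantee redundancy is meant to provide.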
<br />
<h3 style="text-align: center;">
The Lean Approach to Software</h3>
It’s hard to count the many times that someone told me “software development is not like manufacturing.” I agree; I’ve developed software and I’ve worked in manufacturing and I assure you, they are very different. Attempts to apply lean production tools and practices to a development process have a dismal track record. Copying practices from one context to another is always problematic, but using operational practices in a development environment is particularly awkward and not recommended.<br />
<br />
Software development does have one thing in common with manufacturing; they are both seriously complex systems. (Anyone who thinks manufacturing is 'simple' has never been there.) One reason lean works in manufacturing is because it is an effective way to manage complexity. If you go far enough up the chain to lean’s first principles, you will find that they apply to software complexity also, but they don’t apply in the same way.<br />
<br />
I have observed that lean organizations consistently exhibit certain characteristics, which I would consider first principles:<br />
<ol>
<li><b>Customer Focus </b></li>
<li><b>Rapid Flow</b></li>
<li><b>Systematic Learning</b></li>
<li><b>Built-in Quality</b></li>
<li><b>Respect for People</b></li>
<li><b>Long-Term / Whole System Perspective</b></li>
</ol>
If you match these principles to the software engineering approaches of Amazon and Google, you will find that they are quite ‘lean’, even if the companies do not use that term. They start with customers. They release early and often, resulting in rapid feedback. They combine this feedback with data-driven approaches to adapt their offerings. They leverage redundancy and automation to make sure their code – and data centers – remain stable, secure, and resilient. They have a culture of respect for engineers, and of long-term thinking.<br />
<br />
Let’s look at how these principles might be applied differently in the same company. If you think of lean as a learning system, then the principle of systematic learning is a good place to start. At AWS (Amazon Web Services), the most important thing to learn is WHAT to build. They search for answers to questions such as: What matters to customers? What causes friction in the customer experience? What can we do to make customers feel awesome? What current – and future – technologies can we use to lower their costs? Based on the answers to these questions, Amazon introduced a service called Lambda in 2014 that responds to events quickly and inexpensively. Lambda replaced the need for customers to pay for servers sitting around listening for events to occur – reducing the cost (and Amazon’s revenue) for event-driven systems by a factor of 5 to 10 (!). Customers said WOW! and a whole new category of cloud services was born.<br />
<br />
At an Amazon fulfillment center, systematic learning focuses on HOW to improve the process of packing, shipping, and handling returns. The questions to ask might be: How long does it take from order to shipping and can we shorten this time? How can we package items faster, cheaper, with fewer materials and less stress on people? How accurately can we predict delivery dates and how well do we keep our delivery promises? How can we improve our delivery predictability? Can we make delivery easier for shippers? Is there a better way to handle returns that would reduce friction for customers and sellers?<br />
<br />
Built-in quality at AWS is very different than built-in quality at a fulfillment center because the underlying causes of error are not the same; they are not even similar. A Poka-yoke (mistake-proofing) system in a warehouse might be a scale that weighs each package and checks that it matches the weight of what is supposed to be shipped. A Poka-yoke system at AWS might be the use of Specification by Example to create a way to automatically validate the software’s behavior.<br />
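The warehouse poka-yoke described above can be sketched directly. This is an illustrative model only – the catalog weights and tolerance are invented numbers, not anything Amazon publishes – but it shows how a simple redundant check makes a packing mistake visible before the box ships.

```python
# A sketch of the warehouse poka-yoke described above: compare a package's
# measured weight against the expected weight, within a small tolerance.
# Catalog weights and the tolerance are invented for illustration.

EXPECTED_WEIGHT = {"book": 0.5, "lamp": 1.8}   # kg, hypothetical catalog

def weight_check(items, measured_kg, tolerance_kg=0.05):
    """Return True if the package passes the mistake-proofing check."""
    expected = sum(EXPECTED_WEIGHT[item] for item in items)
    return abs(expected - measured_kg) <= tolerance_kg

assert weight_check(["book", "lamp"], 2.31) is True    # within tolerance
assert weight_check(["book", "lamp"], 1.80) is False   # a book is missing
```

The software equivalent – Specification by Example – works the same way: an independent, automated description of correct behavior catches the error at the moment it is made, not downstream.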
<br />
<h4>
When are ‘requirements’ not required?</h4>
One basic principle of lean is that learning through systematic problem-solving is everyone’s job, all the time. Managers are mentors who help people and teams learn how to learn.<span style="font-size: x-small;">[8]</span> If you look at the way software used to be developed – somebody came up with a list of ‘requirements’ which were ‘implemented’ by programmers – it’s easy to see that this was not a lean process because the requirements, or ‘scope’, were fixed at the outset; learning was not allowed.<br />
<br />
Early attempts to apply lean tools in software development processes often used the mantra ‘Get it Right the First Time’ to insist on a complete, accurate, unchangeable description of ‘Right’ before starting a project. This approach failed to ask the basic question: are we making the right thing? It ignored the fact that for most software projects, ‘requirements’ represented little more than a guess at what needed to be done to achieve the purpose of the project.<span style="font-size: x-small;">[9]</span> After all, those ‘requirements’ were often written by someone who had little technical background, limited understanding of the problem domain, and no responsibility for achieving the purpose of the project. In addition, this ‘scope’ was fixed at a time when the least possible information was available. Since no learning was expected, software engineers were required to do a lot of work they suspected was unnecessary, while being asked to make trade-offs that prioritized short term feature delivery over clean, high quality, robust code. This lack of respect for the time and expertise of software engineers discouraged engagement and made retention difficult.<br />
<br />
A production view of software development is fundamentally flawed. When you apply lean to a development process, you are looking for ways to learn as much as possible about the customer problem and potential technical solutions, so you finalize product content as late as possible. For software-intensive products and services, the modern approach is to continuously deliver small changes in capability in order to set up short, direct feedback loops between the engineering team and its customers. As a bonus, this is an excellent technique for managing complexity and assuring the quality, resilience, and adaptability of a product over time.<br />
<br />
<h3 style="text-align: center;">
The Roots of Lean Product Development</h3>
During the 1980s, when it became apparent that Japanese automotive companies were making higher quality, lower cost cars than US automotive companies, Boston rivals MIT and Harvard Business School started programs to investigate the situation. MIT established the International Motor Vehicle Program, which produced the 1990 best-seller <i>The Machine that Changed the World: The Story of Lean Production</i><span style="font-size: x-small;">[10]</span> by James P. Womack, Daniel T. Jones, and Daniel Roos. Womack and Jones went on to establish Lean training and consulting organizations in the US and Europe.<br />
<br />
Across the Charles River, Harvard Business School was also looking into the automotive industry, and in 1991 it published <i>Product Development Performance: Strategy, Organization, and Management in the World Auto Industry</i> by Kim B. Clark and Takahiro Fujimoto. This book did not become a best-seller, but it did provide a summary of how lean principles work differently in automotive product development.<span style="font-size: x-small;">[11]</span> For example, the book equates short production throughput time to short development lead time. Work-in-process inventory in production is comparable to information inventory between development steps. While pull systems in production are triggered by downstream demand, they are triggered by downstream market introduction dates in development. Flexibility to changes in volume and product mix in production are the same as flexibility to changes in design, schedule, and cost targets in development. Continuous improvement in production equates to frequent, incremental innovations in development.<br />
<br />
The most important findings in the book were:<br />
<ol>
<li>The development processes of high performing companies focused on fast time-to-market, excellent product quality, and high engineering productivity (i.e. the number of hours and level of resources needed to move a vehicle from concept to market.)</li>
<li>High performing development programs were led by a strong product manager who started as the product concept champion and then led the development effort (as ‘chief engineer’), continually reinforcing the concept vision with the engineering teams as they designed the vehicle.</li>
<li>High performing development processes were organized by forming integrated product teams – relatively small cross-functional teams with members from product planning, product engineering, and process engineering. These teams engaged in continual problem-solving cycles focused on specific vehicle capabilities. They enabled a high degree of information exchange between upstream and downstream processes throughout the development cycle, which contributed to shorter lead times, higher product quality, and greater engineering productivity – in short, higher development performance.</li>
</ol>
So, there you have it, a good summary of three important characteristics of ‘lean’ product development, written about 30 years ago, before ‘lean’ became a commonly used term. It turns out that the third characteristic, ‘integrated product teams’, is especially important. Today, most software-intensive products and services are developed by such teams, although they probably have a different name: cross-functional teams or multi-discipline teams or full stack teams. These teams create a rapid flow of high-quality prototypes or deliverables which generate feedback to improve the design – an approach that has proven to be far more productive than sequential development.<br />
<br />
The second characteristic has also proven to be important. Most modern software-intensive products and services – including open source projects, startup company products, AWS services, and SpaceX rockets – have a strong (entrepreneurial) leader who champions the product vision.<br />
<br />
<h4>
Example: SpaceX</h4>
On September 14, 2017, SpaceX posted a video on YouTube called "<a href="https://www.youtube.com/watch?v=bvim4rsNHkQ" target="_blank">How Not to Land an Orbital Rocket Booster</a>";<span style="font-size: x-small;">[12]</span> it shows crash after crash during attempted landings of rocket boosters. As you might guess, the video ends with success – the first successful landing, and later the first successful landing on a drone ship. But think about it: Why would a company showcase so many failures?<br />
<br />
SpaceX was founded in 2002 with the goal of making access to space affordable by designing and launching low cost orbital rockets. The company has kept engineering cost low through a rapid design process that values learning by doing. As SpaceX Launch Director John Muratore explained,<span style="font-size: x-small;">[13]</span> “Because we can design-build-test at low cost, we can afford to learn through experience rather than consuming schedule attempting to anticipate all possible system interactions.”<br />
<br />
SpaceX has also kept launch costs low through a program of recovery and reuse of rocket boosters and other parts. The first thing the company had to learn was how to land rocket boosters under their own power so that they could be reused. It took a lot of trial and error before the first booster landed successfully, but learning through experience (including crash landings) was much faster and far less expensive than the anticipate-everything-in-advance approach. So, SpaceX is rightfully proud of its engineering approach: it works. It’s faster, better, and cheaper to learn by doing instead of learning before engineering starts. Learning is what engineering is all about.<br />
<br />
Muratore says “SpaceX operates on the Philosophy of Responsibility. No engineering process in existence can replace this for getting things done right, efficiently.”<span style="font-size: x-small;">[14]</span> What is the Philosophy of Responsibility? It means that engineers are responsible for the design and engineering of a component, and for ensuring that their component operates properly and does its job as part of the overall system. So, let’s say a rocket booster crashes into the ocean rather than landing on a drone ship. Engineers know that they have 24 hours to report on what caused the failure and outline a plan to keep that thing from ever happening again. Thus, every launch is heavily instrumented with video and data-transmitting devices; these are not for advertising, they exist to provide detailed feedback to the engineering team so they can improve the design.<br />
<br />
When SpaceX was learning how to land rocket boosters, it scheduled a test launch every couple of months. Each responsible engineer knew that the launch would happen, and their part had better be ready. The launch date Would. Not. Move. The next launch date effectively pulled the work of the integrated product teams, each focused on getting its component ready for that launch.<br />
<br />
Through the principle of responsibility and the practice of frequent integration tests, SpaceX has developed launch capability at a cost that is an order of magnitude lower than that of the companies that developed rockets under government contract. In addition, SpaceX’s cost to launch a kilo of payload is about an order of magnitude lower than the current cost for other large rockets. It should be no surprise that SpaceX’s low engineering and launch costs are threatening the existence of its competitors.<br />
<br />
SpaceX is a good example of the essence of lean product development – small, responsible teams learn through a series of rapid experiments. Perfect launches are not the goal – at least not at first. Crashes are to be expected – but make sure the damage is limited and be prepared to determine the cause and never let it happen again. The goal is not perfect launches, it’s learning. As any good musician knows, practice time is the time to push the limits and make mistakes. If you never make any mistakes, you never learn.<br />
<br />
<h3 style="text-align: center;">
Making the Shift to Digital</h3>
If your organization was not born digital, it may be considering a shift toward digital in order to leverage technologies such as artificial intelligence, augmented reality, ubiquitous Internet, and more. If digital startups are entering your market or competitors are making the shift to digital, you may have no choice; the ability to compete in a digital world is becoming necessary for survival. If this sounds familiar, check out Mark Schwartz’s book <i>War and Peace and IT</i>,<span style="font-size: x-small;">[15] </span>which summarizes the mindset shift necessary for companies to make the shift to digital. He discusses the ‘contractor model’ of IT – where the IT department is viewed as a contractor receiving specifications from ‘the business’ – and shows why this arm’s length relationship is obsolete in the digital age. In chapter 3 (Agility and Leanness) he introduces DevOps, a set of technical practices based on cross-functional teams and heavy automation that effectively does away with the tradeoff between speed and control – you can have both.<br />
<br />
In the <i>Harvard Business Review</i> article “Building the AI-Powered Organization,”<span style="font-size: x-small;">[16]</span> Tim Fountaine, Brian McCarthy, and Tamim Saleh contend that a successful move to digital involves aligning a company’s culture, structure, and ways of working. Three fundamental shifts are required:<br />
<ol>
<li>From siloed work to interdisciplinary collaboration.</li>
<li>From experience-based, leader-driven decision-making to data-driven decision-making at the front line.</li>
<li>From rigid and risk-averse to agile, experimental, and adaptable.</li>
</ol>
Both of these references, and many more, confirm Clark and Fujimoto’s description of a high-performance development organization:<br />
<ol>
<li>Integrated product teams include product design, product engineering, and process engineering</li>
<li>Product leaders create a product vision that enables teams to make detailed decisions</li>
<li>The product is developed through rapid problem-solving cycles by multi-disciplinary teams</li>
</ol>
Digital natives like SpaceX, AWS, and Google have always worked this way. You might call this lean; you might call it digital, but in any case, it is the way good software engineering is done these days.<br />
<br />
<h2 style="text-align: center;">
<span style="color: #073763; font-size: x-large;"> The Nature of Lean</span></h2>
<div style="text-align: center;">
<i><span style="color: #073763;">“Friction is the concept which distinguishes real war from war on paper.” </span></i></div>
<div style="text-align: center;">
<span style="color: #073763; font-size: x-small;">-- Carl von Clausewitz</span></div>
<div style="text-align: center;">
<br /></div>
The operational focus of lean is to eliminate ‘waste’ – all the extra work that does not add value. For software engineering, we prefer to use the word ‘friction’ (instead of ‘waste’) to describe the stuff that annoys people and slows down processes, but it’s essentially the same thing. We spend our time trying to reduce friction in the consumer experience, friction within our products, and friction in our processes. But if you prefer the word ‘waste’, feel free to substitute it for ‘friction’.<br />
<br />
Last winter we had an ice storm here in Minnesota, and it was impossible to walk down our driveway to get the mail. It was impossible to drive our car into our garage. We were surrounded by a moat of glare ice until we spread sand on the driveway to add some friction. It’s easy to understand that there are times when friction is necessary; it is also easy to realize that in general, the less friction the better.<br />
<br />
During its early years, Amazon focused intently on removing friction from the customer experience and from the experience of third-party sellers. Amazon’s most enduring innovations as a young company came from imagining ways to give customers and merchants ‘superpowers’, making their experience with Amazon as friction-free as possible.<span style="font-size: x-small;">[17]</span><br />
<br />
This is what good product design is all about: walk in the shoes of customers, learn to see the friction in their journey, and find ways to reduce that friction. Lean product development adds one more dimension: focus on reducing customer friction as rapidly and smoothly as possible. In order to do this, the development process needs to be low friction also. That means looking inside the development workflow to find and reduce any friction that slows things down, reduces quality, or incurs unnecessary cost. In this section we’ll look at the four biggest sources of friction when creating software-intensive products and services.<br />
<br />
<h3 style="text-align: center;">
Friction #1: Inefficient Flow</h3>
A huge source of friction for many people is rush hour traffic. They know how long it would take them to get to and from work if the roads were empty, and a commute time that’s much longer is annoying. To get a measure of how efficient a commute is, divide the ideal commute time (with empty roads) by the actual commute time. If the commute has no delays, it is 100% efficient. A 50% efficiency means the commute takes twice as long as it needs to. 10% efficiency means it takes 10 times longer to get home than it would without rush hour traffic. A look at Google Maps during the 10% efficient commute would show a lot of red. That’s friction.<br />
<br />
How long does it take a market opportunity to commute through your development process, from concept to cash? How fast could it travel if there were no backlogs, no loopbacks, no red spots on the process map? To measure the flow efficiency of your process, divide the fastest possible commute time by the typical commute time; that tells you how much of the time you are actually working on a problem as it moves through your process. The flow efficiency of a typical software development process is around 10%. The flow efficiency of a lean development process should be over 50%.<br />
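The flow-efficiency measure described above is simple enough to compute directly. The sketch below just encodes the definition from the commute analogy; the sample numbers are illustrative, echoing the typical 10% and the lean target of over 50%.

```python
# Flow efficiency as defined above: the fastest possible "commute" time
# (value-adding work) divided by the actual elapsed time, concept to cash.

def flow_efficiency(touch_time_days: float, elapsed_days: float) -> float:
    """Fraction of elapsed time actually spent working the item."""
    return touch_time_days / elapsed_days

# A feature needing 5 days of real work that takes 50 calendar days:
assert flow_efficiency(5, 50) == 0.10    # typical process: ~10%
assert flow_efficiency(25, 50) == 0.50   # lean target: over 50%
```

Measuring a few recent work items this way – real work time versus calendar time – is usually enough to reveal where the red spots on the process map are.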
<br />
Most companies measure how efficiently they use their resources rather than how efficiently they chase a market opportunity. But that’s like a city measuring the efficiency of its road system by counting how many cars it can fit on its roads rather than how fast the traffic moves. What’s more important – full roads or faster commutes? What’s more important – busy engineers or the capacity to rapidly seize a market opportunity?<br />
<br />
In the book <i>This is Lean</i>,<span style="font-size: x-small;">[18]</span> Niklas Modig and Pär Åhlström make the case that the essence of Lean is a bias for flow efficiency over resource efficiency. When a company competes in the digital world this makes a lot of sense, because technology changes so fast and opportunities are so fleeting that time to market is critical. But a couple of decades ago, the typical software development process emphasized resource efficiency (keep people and equipment fully utilized to minimize cost) because time-to-market did not seem particularly important.<br />
<br />
Then in the early 2000’s, agile and lean ideas began making inroads into the way software was designed, created, and maintained. Extreme Programming<span style="font-size: x-small;">[19]</span> contained the roots of technical disciplines such as continuous integration and automated testing. Scrum<span style="font-size: x-small;">[20]</span> emphasized iterations. Kanban<span style="font-size: x-small;">[21]</span> improved flow management by limiting work-in-process. Twenty years is a long time in a rapidly moving field such as software, and in those two decades Extreme Programming has faded from sight even as its practices became widely accepted and expanded. Kanban charts continue to be used by many teams to visualize and manage their workflow. But Scrum has failed to evolve fast enough. Shortcomings such as unlimited backlogs, relatively long iterations, product owners as proxies, and silence on technical disciplines earn Scrum a ‘not recommended’ rating for lean practitioners today.<br />
<br />
<h4>
Reduce Friction: Continuous Delivery / DevOps</h4>
Dramatic advances in software engineering workflow can be traced to the 2010 book <i>Continuous Delivery</i><span style="font-size: x-small;">[22]</span> by Jez Humble and David Farley. This is arguably one of the most influential books in changing the workflow paradigm from a focus on resource efficiency to a focus on flow efficiency. It laid out in detail the technologies and processes that would enable large enterprises to safely change their production code very frequently, even continuously.<br />
<br />
At the time (2010), Amazon was deploying changes to production an average of every 11 seconds. Google was deploying changes multiple times a day. Most digital startups were using similar rapid processes and not experiencing much difficulty. The cloud was gaining traction. And yet, most enterprises thought rapid delivery was an anomaly – certainly serious enterprises that valued stability would not engage in such dangerous practices. But <i>Continuous Delivery</i> debunked the myth that speed meant sloppiness and introduced the concept that high speed and high discipline go hand in hand.<br />
<br />
The basic idea of continuous delivery is to create a workflow that has no interruptions from the time a development team chooses to work on a feature until it is ready to be ‘deployed’ (‘go live’); in many cases it actually goes live immediately and automatically. Of course, this means the operations people who used to receive software releases ‘over the wall’ must be continually involved – they must be part of the team. The combined team is called a DevOps team, and the term DevOps has become almost synonymous with continuous delivery.<span style="font-size: x-small;">[23]</span><br />
<br />
A second important practice of continuous delivery is this: batches of code are no longer accumulated (as they used to be) on branches that must be merged later, because merging batches of code invariably exposes interaction problems that must be found and fixed. Collecting a batch of software prior to testing – even a two-week batch – makes finding the cause of defects much too hard.<br />
<br />
Instead, the practice of trunk-based development<span style="font-size: x-small;">[24]</span> has replaced branches. All code resides on a single main branch (the trunk), where continuous integration with the entire code base is possible. Code under development is checked into the repository very frequently, triggering an automated test harness to run; if the tests don’t pass, the new code is rolled back or reverted, leaving the trunk in an error-free state. If the tests pass, the new code moves down a continuous integration / continuous deployment (CI/CD) pipeline which applies increasing layers of integration and more sophisticated automated tests (for example, security tests).<br />
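The trunk-gating policy can be modeled in a few lines. This is a toy simulation of the policy only – real pipelines use CI servers and version control, and the "tests" here are stand-ins – but it captures the invariant: a change joins the trunk only if the whole harness passes, otherwise it is reverted and the trunk stays green.

```python
# A toy model of trunk-based development's gate: every check-in runs the
# test harness; a failing change is reverted so the trunk stays deployable.
# (This simulates the policy only; the "tests" are illustrative stand-ins.)

def integrate(trunk, change, test_harness):
    """Apply a change to the trunk, keeping it only if all tests pass."""
    candidate = trunk + [change]
    if all(test(candidate) for test in test_harness):
        return candidate          # change accepted: trunk advances
    return trunk                  # change reverted: trunk still green

no_duplicates = lambda code: len(code) == len(set(code))
trunk = ["feature_a"]
trunk = integrate(trunk, "feature_b", [no_duplicates])   # passes
trunk = integrate(trunk, "feature_a", [no_duplicates])   # fails: reverted
assert trunk == ["feature_a", "feature_b"]
```

The key design choice is that the trunk is never allowed to hold a known-bad state; every other CI/CD mechanism builds on that invariant.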
<br />
The objective is to be sure that the trunk is always ready for deployment, and if it is not, a virtual ‘Andon cord’ is pulled and the highest priority of the team is to return it to a deployable state. Depending on the context, the code may be deployed as soon as it reaches the end of the pipeline (typical of an online environment), or deployment may be delayed for domain reasons. A compromise practice is to deploy code as soon as it reaches the end of the pipeline, but with new features turned off. This provides a final robustness test, and at a convenient later time individual features can be turned on (and off) with a software switch. As an added benefit, the switches enable A/B testing as well as targeted ‘canary’ releases that limit the impact of any problems and allow rapid rollback if necessary.<br />
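The feature-switch practice above can be sketched as follows. This is a hedged, minimal illustration – the flag store, flag names, and the percentage-bucketing rule are all invented, not any particular vendor's API – but it shows how deployed-but-dark code can be turned on for a targeted slice of users and instantly turned off again.

```python
# A minimal sketch of feature switches: code ships with features off, then
# individual features are turned on for a percentage of users (canary).
# The flag store and bucketing rule here are illustrative assumptions.

import hashlib

FLAGS = {"new_checkout": 10}     # feature -> % of users who see it

def is_enabled(feature: str, user_id: str) -> bool:
    """Deterministically bucket a user into the rollout percentage."""
    pct = FLAGS.get(feature, 0)  # unknown features default to "off"
    digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < pct

# The same user always gets the same answer, which enables stable canary
# cohorts and A/B comparisons; setting a flag to 0 is an instant rollback.
assert is_enabled("new_checkout", "u42") == is_enabled("new_checkout", "u42")
assert is_enabled("unknown_feature", "u42") is False
```

Deterministic hashing (rather than random sampling per request) is what keeps a given user's experience consistent while the rollout percentage is ramped up or down.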
<br />
Developing a robust CI/CD pipeline and managing controlled roll-outs is challenging, high discipline work. It involves a lot of automation and is usually accompanied by a change in system architecture, organizational structure, and incentives (more on that later). It is the technical enabler of lean in software engineering, and today, a decade after the <i>Continuous Delivery</i> book was published, it is the way modern software is built.<br />
<br />
<h4>
Reduce Friction: Limit Work to Capacity</h4>
Integrating and releasing software in big batches is not the only practice that slows down software engineering workflow. Another significant source of friction is the failure to limit work to capacity. Just about every software engineering organization on the planet has more requests for work than it can accommodate, so this is a universal problem.<br />
<br />
If an organization wants to limit work to capacity, the first question it should ask is: On average, how many ‘things’ get released to production in a unit of time (quarter, month, week, day)? Most software engineering organizations can answer this question or easily find the data. And most organizations find that for small- or medium-sized efforts, their output rate is more or less the same over time. However, even when they know their output rate, many organizations fail to limit the amount of work they accept to the amount of work they can finish. Instead, they accept work and then put it in a ‘backlog’ that is subject to endless prioritization. It’s clear that they can’t do all the work; they just don’t want to say “no”.<br />
<br />
<h4>
See Friction: Backlogs</h4>
The obvious way to limit work to capacity is to use a pull system that accepts work items at the same rate as they are completed. For example, if an average of two items are deployed every day, then no more than two items per day should be accepted. Work items should not be put on a backlog for later prioritization, they should be accepted or rejected as quickly as practical. Teams do not need a long list of work to be done, and requestors do not need to be left wondering whether (and when) their problem will be resolved. A small, limited buffer may be necessary to absorb variation in input flow, and some capacity may be held open for urgent work, but that small amount of friction is like putting sand on ice.<br />
<br />
Backlogs, on the other hand, tend to generate a huge amount of friction at many points in the development process, unnecessary friction that dramatically slows down every item that goes through the process. A better approach is to respond to work requests immediately with one of two responses: “Yes, we can do what you requested, and you can expect delivery by [insert a valid promise date].” or “Unfortunately, we do not have the capacity to do what you are requesting.” That’s it. Learn to say “no.” Immediately. Customers appreciate it. Teams love it. Everything gets done a lot faster. It works.<br />
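The accept-or-reject pull discipline can be sketched in a few lines. The buffer size and labels are illustrative; the point is that the only queue is a small buffer, and everything beyond it gets an immediate "no":

```python
from collections import deque

class PullSystem:
    """Sketch of a pull system: work is accepted only while a small buffer
    has room; everything else is rejected immediately instead of being
    parked on a backlog for later prioritization."""

    def __init__(self, buffer_size=2):
        self.buffer_size = buffer_size   # small buffer absorbs input variation
        self.buffer = deque()
        self.done = []

    def request(self, item):
        if len(self.buffer) < self.buffer_size:
            self.buffer.append(item)
            return "accepted"            # with a valid promise date
        return "rejected"                # say "no" now, not months later

    def complete_next(self):
        # Finishing an item frees capacity, pulling in the next request
        if self.buffer:
            self.done.append(self.buffer.popleft())
```

With a fixed buffer, the promise date for an accepted item is simply the completion rate times the queue position, which is why the promise can be made on the spot.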
<br />
<h3 style="text-align: center;">
Friction #2: Dependencies</h3>
Arguably the biggest friction-generator in software systems is dependencies – one part of the code depends on another part of the code, or most likely many other parts of the code. These dependencies create a complex web of interactions that quickly become impossible to trace. Bugs show up as insidious unintended consequences of these interactions after the software is deployed. Experience has taught us that finding all the unintended consequences is impossible, no matter how much testing is done. It is well understood that software systems are inherently complex,<span style="font-size: x-small;">[25]</span> and that tightly coupled complex systems will eventually fail.<span style="font-size: x-small;">[26]</span><br />
<br />
The key to solving this problem lies in the words ‘tightly coupled.’ Loosely coupled systems can be very robust. Consider the Internet. No one worries about my web site accidentally changing the balance of my bank account, even though I can display them both on my computer at the same time. Or consider a smart phone. My weather app cannot accidentally add things to my shopping list.<br />
<br />
The key is to eliminate dependencies rather than cater to them. For decades, enterprises have attempted to coordinate their disparate applications through an enterprise database, but that database became a dependency generator. Changes to the data format in one application – even small changes – meant changing every application that used the same data. Then each of these applications had to be tested both separately and together in a newly built system. If an error was found, the process had to be repeated, often many times. Since we’re talking about slow and expensive manual testing, this could go on for a long time.<br />
<br />
<h4>
Reduce Friction: Federated Architecture </h4>
Theoretically, we know that a federated architecture will address this problem, but practically speaking, enterprises did not do an effective job of adopting federated architectures until around the year 2000. That’s when newly minted internet companies tried to grow systems many times larger than any enterprise could manage. Without a new paradigm for system architecture, scaling was extraordinarily difficult, so many failed. It wasn’t until Google began publishing papers on scalable infrastructure and AWS started selling it that practical ways to break the crippling dependencies of our enterprise systems began to emerge.<br />
<br />
The strategy for breaking dependencies, in a nutshell, is to take the bazaar approach to both system architecture and development teams. Small, independent teams own a small service – called a microservice these days. Services communicate with other services through hard boundaries, with API (Application Programming Interface) contracts prescribing their interaction. Integration testing is done at service boundaries, dramatically limiting the number of interactions that need testing, while simultaneously clarifying responsibility for correct performance.<br />
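A consumer-side contract check at a service boundary might look like this sketch. The order fields are hypothetical, invented for illustration, not from any real service:

```python
# Hypothetical contract for an 'orders' service; field names are invented.
ORDER_CONTRACT = {"order_id": str, "total_cents": int, "status": str}

def meets_contract(response, contract):
    """Boundary test: a consumer verifies only the agreed API contract,
    never the service's internals, so the owning team is free to change
    its implementation as long as the contract still holds."""
    return all(
        field in response and isinstance(response[field], expected_type)
        for field, expected_type in contract.items()
    )
```

Testing each boundary against its contract replaces the combinatorial explosion of whole-system integration tests with a small, fixed set of checks per service.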
<br />
Do not think of a microservice architecture as a flat layer of tiny services. Consolidator services aggregate smaller services into higher-level services, usually resulting in layers of consolidated services. Related services that are likely to change together are often grouped together (this is called a ‘bounded context’). All teams are expected to understand their role in the larger context and work toward shared goals. Very rapid releases through a CI/CD pipeline provide continuous feedback to the teams on their progress, pulling any needed adjustments from each team.<br />
<br />
<h4>
Reduce Friction: Sync and Stabilize</h4>
Hardware systems that rely heavily on software also use bounded contexts, but they are usually defined by the hardware components. Consider SpaceX’s Falcon Heavy sitting on the launch pad about to send satellites into space. At the top is the payload, next is the second stage, and at the bottom are three first stage rocket boosters. Each first stage booster has nine Merlin engines, a fuel storage/dispensing system, and four landing legs. A landing leg is a component; it has a team made up of both hardware and software engineers, led by a responsible engineer. Their job is to make sure the landing legs work properly and do their job as part of the overall system. Let’s say they are working on an improved design that will hook the rocket into place after landing on a drone ship. They know when the next launch is scheduled, and they know the date will not move. This launch date ‘pulls’ their work as well as their coordination with the drone ship landing pad team. They perform static tests of the hooking system, which go well, but the true test happens when the rocket attempts a landing on the drone ship after the launch. Each launch is a test to synchronize all the components and stabilize their combined performance.<br />
<br />
This ‘Sync and Stabilize’<span style="font-size: x-small;">[27]</span> approach has long been used in the development of software-intensive hardware systems, and it works. Here is an email I received recently that shows its benefits:<span style="font-size: x-small;">[28]</span><br />
<blockquote class="tr_bq">
<i>You may recall that you spent a day with us last June. In one of the sessions I described to you a big challenge we had: which was to achieve success in a critical warehouse project involving dozens of software development teams (30+). The project had extremely tight timescales, complex integration requirements, additional software to be developed and a large number of unknowns. The good news was that a test facility was being prepared for us to use, including the relevant automation /robotics. But how should we use it? </i></blockquote>
<blockquote class="tr_bq">
<i>In the session you recommended we use a 'Synch and Stabilize' demo approach, which you described to those assembled. This email is to let you know that we did indeed do what you suggested - and it has proved revolutionary for us. Pretty much the next day we started organizing our first planning session, which involved 50-60 teams. Our first demo was in September. We have now run 6 demos and we have been making excellent progress. There is no going back! As you would have predicted, the approach has yielded many benefits e.g. alignment, communication, sense of commitment, teams helping other teams plus teams inspiring other teams etc.</i></blockquote>
<br />
<h3 style="text-align: center;">
Friction #3: Cost Centers</h3>
In the 1960’s, IT was largely an in-house back-office function focused on process automation and cost reduction. Today, digital technology plays a significant strategic and revenue role in most companies and is deeply integrated with business functions. Digital natives (companies born in the last two decades) typically do not have IT departments, but in industries that were born before the Internet, IT departments are still commonly found. And where they exist, IT departments are usually cost centers; that is, performance is measured by cost containment and/or reduction. Since a key cost driver of IT departments is salaries, good performance usually means doing more work with fewer (or less expensive) people. Thus, IT incentives have historically been stacked in favor of resource efficiency: keep everyone fully utilized and outsource work to lower salaried regions. The fact that these are two of the best ways to decrease flow efficiency carried little weight.<br />
<br />
Back in the mid 1980’s, before ‘lean’ came into our lexicon, Just-in-Time (JIT) was gaining traction in manufacturing companies. JIT always drove inventories down sharply, giving companies a much faster response time when demand changed. However, accounting systems count inventory as an asset, so any significant reduction in inventory had a negative impact on the balance sheet. Balance sheet metrics made their way into senior management metrics, so successful JIT efforts tended to make senior managers look bad. Often senior management metrics made their way down into the metrics of manufacturing organizations, and when they did, efforts to reduce inventory were half-hearted at best. A generation of accountants had to retire before serious inventory reduction was widely accepted as a good thing.<span style="font-size: x-small;">[29]</span><br />
<br />
Returning to the present, being a cost center means that IT performance is judged – from an accounting perspective – solely on cost management. Frequently these accounting metrics make their way into the performance metrics of senior managers, while contributions to business performance tend to be deemphasized or absent. As the metrics of senior managers make their way down through the organization, a culture of cost control develops, with scant attention paid to improving overall business performance. Help in delivering business results is appreciated, of course, but is rarely rewarded, and rarer still is the cost center that voluntarily accepts responsibility for business results.<br />
<br />
In addition, cost center projects are normally capitalized until they are “done” (they reach “final operating capability”) and are turned over to production and maintenance.<span style="font-size: x-small;">[30]</span> But when an organization adopts modern software practices such as continuous delivery (or DevOps), the concept of final operating capability – not to mention maintenance – disappears. This creates a big dilemma because it's no longer clear when, or even if, software development should be capitalized. Moving expenditures from capitalized to expensed not only changes whose budget the money comes from; it can have tax consequences as well. And what happens when all that capitalized software (which, by the way, is an asset) vanishes? Just as in the days when JIT was young, continuous delivery has introduced a paradigm shift that messes up the balance sheet.<br />
<br />
But the balance sheet problem is not the only issue; depreciation of capitalized software can wreak havoc as well. In manufacturing, the depreciation of a piece of process equipment is charged against the unit cost of products made on that equipment. The more products that are made on the equipment, the less cost each product has to bear. So, there is strong incentive to keep machines running, flooding the plant with inventory that is not currently needed. In a similar manner, the depreciation of software makes it almost impossible to ignore its sunk cost, which often drives sub-optimal usage, maintenance and replacement decisions.<br />
<br />
Capitalization of development creates a hidden bias toward large projects over incremental or continuous delivery, making it difficult to look favorably upon lean development practices. Hopefully we don't have to wait for another generation of accountants to retire before delivering software rapidly is considered a good thing.<br />
<br />
<h4>
See Friction: Life in a Cost Center</h4>
Cost Centers have another problem: they can be demoralizing. You aren’t on the A team that creates awesome customer journeys and brings in revenue, you’re on the B team that writes code and consumes resources. No matter how well the business performs, you’ll never get credit. Your budget is unlikely to increase when times are good, but when times are tight, it will be the first to be cut. Should you have a good idea, it had better not cost anything, because you can’t spend money to make money. If you think that a bigger monitor would make you more efficient, good luck making your case. Yet if your colleagues in trading suggest larger monitors will help them generate more revenue, the big screens will show up in a flash.<span style="font-size: x-small;">[31]</span> It’s no wonder that IT departments have found it challenging to attract and retain good people, especially in the face of a world-wide shortage of software engineers.<br />
<br />
<h4>
Reduce Friction: Cost of Delay</h4>
There are two sides to an investment – how much it costs and how much benefit (cost reduction or added revenue) it will generate. The fact that these are guesses about the future doesn’t stop them from being used to make decisions. So, we may as well assume that the cost and benefit projections used to justify an investment are correct and use them to calculate a third number: the cost of delay.<span style="font-size: x-small;">[32]</span> How much is the cost being increased and how much of the benefit is being lost for each day of delay? What would be the cost savings and added benefits if a valuable feature were delivered early?<br />
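A back-of-the-envelope cost-of-delay calculation uses exactly the same projections that justified the investment. All the figures below are invented for illustration:

```python
def cost_of_delay_per_day(projected_benefit, benefit_period_days, daily_burn_rate=0.0):
    """Rough cost of one day of delay: the benefit lost per day, plus any
    cost still being incurred while the work waits. Inputs are the same
    guesses already used to justify the investment."""
    return projected_benefit / benefit_period_days + daily_burn_rate

# e.g. a feature projected to return $730,000 over two years, with a
# $4,000/day team burn rate while the project waits:
cod = cost_of_delay_per_day(730_000, 365 * 2, 4_000)   # $5,000 per day
```

Once that number exists, the question in the paragraph below has a concrete form: should the team be allowed to spend up to $5,000 to deliver a day earlier? The arithmetic says yes; cost-center accounting usually says no.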
<br />
If accounting is going to drive decisions, then why not calculate the time value of money along with cost and benefit calculations, and use the result to invest in flow efficiency? A development team – even one in a cost center – should be able to spend the cost of a day’s delay in order to deliver the benefit a day earlier. Unfortunately, it’s a rare event when a development team gets to spend even one day’s cost of delay.<br />
<br />
<h3 style="text-align: center;">
Friction #4: Proxies</h3>
We make a mistake when we put proxies between an engineering team and its customers, and yet we do it all the time. When colleagues in “The Business” request new features from IT, they do not ask for improved business outcomes, they request capabilities that may or may not produce the desirable business results. These proxies for business outcomes detach engineering team members from the purpose of their work.<br />
<br />
Consider this: Jeff Dean, co-inventor of Google’s amazing data storage and search capabilities (mentioned earlier), left DEC Research labs in 1999 to join a startup called Google. "Ultimately, it was this frustration of being one level removed from real users using my work that led me to want to go to a startup," Dean says.<span style="font-size: x-small;">[33] </span>Good software engineers share that desire to have an immediate connection with customers, but such connections are rarely found in IT departments, or in the contracting organizations they mimic.<br />
<br />
There are a lot of proxies in our development processes. “The Business” is a proxy. A product owner is described as a proxy in Scrum Guides. Projects typically start after the deliverables have been specified and are considered successful if cost, schedule, and scope targets are met; thus, project metrics are proxies for the outcomes envisioned by those who funded the project. Project teams are often not told about the desired project outcomes, generally have no way to influence those outcomes, and are almost never responsible for them. In most projects, team members never hear about the actual outcomes after the project is ‘delivered’.<br />
<br />
Proxies create friction at many points in a development process. Multiple handovers slow things down and lose a lot of information. The engineering team lacks firsthand experience with the problem to be solved, so amateurs end up designing technical solutions to technical problems. Their guesses at solutions are turned into requirements with no attempt to validate them. Feedback loops – should they exist at all – are far too long.<br />
<br />
And that is perhaps the worst thing about proxies. We know that proxy metrics drive teams to excel at what is measured – feature delivery, for example – rather than what is desired – business outcomes. We also know that proxy metrics are almost never validated against the desired business outcomes, and that the majority of the features and functions in a bespoke software system are neither needed nor likely to be used.<span style="font-size: x-small;">[34]</span> We know that building the wrong thing is the biggest waste in software engineering and the best way to build the right thing is to validate the business impact of each feature as we deploy it. And we have all of the tools in our toolbox to be able to create the rapid feedback loops necessary for such validation. So why would anyone use proxy metrics rather than business outcomes to measure development performance?<br />
<br />
<h4>
Reduce Friction: Full Stack Teams</h4>
In 2002, John Rossman was hired by Amazon to lead the launch of third-party services. He began by using Amazon’s standard approach – write a press release set in the future which describes the experience of future customers. His press release read: “A seller, in the middle of the night without talking to anyone, can register, list an item, fulfill an order, and delight a customer as though Amazon the retailer had done it.”<span style="font-size: x-small;">[35]</span> That’s it. No requirements or other proxies, just a powerful statement that succinctly defined the responsibility, constraints, and expected outcomes of the team that would work on this service. It also defined the composition of the team that would bring the service to life: everyone necessary to start up a new line of business.<br />
<br />
The most successful technology companies today establish relatively small, full stack teams and challenge them with interesting problems. These teams have:<br />
<ol>
<li>A clear description of the team’s mission (responsibility), constraints, and expected outcomes.</li>
<li>A leader (responsible engineer, product manager) who guides the team toward good decisions.</li>
<li>An immediate connection with their consumers, minimum dependencies on other teams, freedom to act autonomously and asynchronously within constraints, and full responsibility for outcomes.</li>
</ol>
Full stack teams develop a product, component, or service, while maintaining a clear understanding of their role and responsibility within the larger system. They are supported by experts or leaders in competency areas that are particularly important in their industry or market.<br />
<br />
<h4>
Case Study: ING Netherlands</h4>
In 2015 the employees at ING Netherlands headquarters – over 3,000 people from marketing, product management, channel management, and IT development – were told that their jobs had disappeared. Their old departments would no longer exist; small squads would replace them, each with end-to-end responsibility for making an impact on a focused area of the business. Existing employees would fill the new jobs, but they needed to apply for the positions.<span style="font-size: x-small;">[36]</span><br />
<br />
It was a bold move for the Netherlands bank. The leaders were giving up their traditional hierarchy, detailed planning and “input steering” (giving directions). Instead they would trust empowered teams, informal networks, and “output steering” (responding to feedback) to move the bank forward. The bank was not in trouble; it did not really need to go through such a dramatic change. What prompted this bet-your-company experiment?<br />
<br />
The change had been years in the making. After initial experiments in 2010, the IT organization put aside waterfall development in favor of agile teams. As successful as this change was, it did not make much difference to the bank, so Continuous Delivery and DevOps teams were added to increase feedback and stability. But still, there was not enough impact on business results. Although there were ample opportunities for business involvement on the agile teams and input into development priorities, the businesses were not organized to take full advantage of the agile IT organization. Eventually, according to Ron van Kemenade (CIO of ING Netherlands from 2010 until he became CIO of ING Bank in 2013):<span style="font-size: x-small;">[37]</span><br />
<blockquote class="tr_bq">
<i>The business took it upon itself to reorganize in ways that broke down silos and fostered the necessary end-to-end ownership and accountability. Making this transition … proved highly challenging for our business colleagues, especially culturally. But I tip my hat to them. They had the guts to do it.</i></blockquote>
The leadership team at ING Netherlands had examined its business model and come to an interesting conclusion: their bank was no longer a financial services company; it was a technology company in the financial services business. The days of segmenting customers by channel were over. The days of push marketing were over. Thinking forward, they understood that winning companies would use technology to provide simple, attractive customer journeys across multiple channels. This was true for companies in the media business, the search business, most retail businesses, and it was certainly true for companies in the financial services business. Moreover, expectations for engaging customer interactions were not being set by banks – they were being set by media and search and retail companies. Banks had to meet these expectations just to stay in the online game.<br />
<br />
ING Netherlands’ leadership team decided to look to other technology companies, rather than banks, for inspiration. For example, on a trip to the Google IO developers conference Ron van Kemenade was impressed by the amazing number of enthusiastic, engaged engineers at Google. He realized that such enthusiasm could not surface in his company, because the culture did not value good engineering.<br />
<br />
The leaders at ING Netherlands decided to investigate how top technology companies attract talented people and come up with engaging products. Through concentrated visits to some of the most attractive technology companies, they saw a common theme – these companies did not have traditional enterprise IT departments even though they were much bigger than any bank. Nor did they have much of a hierarchical structure. Instead, they were organized in teams – or squads – that had a common purpose, worked closely with customers, and decided for themselves how they would accomplish their purpose.<br />
<br />
ING Netherlands decided that if it was going to be a successful technology company and attract talented engineers, it had to be organized like a technology company. Studying the best technology companies convinced them that they needed to change – and the change had to include the whole company, not just IT. The bank had already modularized its architecture, streamlined and automated provisioning and deployment, moved to frequent deployments, and formed agile teams. But this was done within the IT department rather than across the organization, and the results were not exceptional. Now it was time to create a digital company across all functions.<br />
<br />
They chose to adopt an organizational structure in which small teams – ING calls them squads – accept end-to-end responsibility for a consumer-focused mission. Squads are expected to make their own decisions based on a shared purpose, the insight of their members, and rapid feedback from their work. Squads are grouped into tribes of perhaps 150 people that share a value stream (e.g. mortgages), and within each tribe, chapter leads provide competency leadership. Along with the new organizational structure, ING’s leadership team worked to create a culture that values technical excellence, experimentation, and customer-centricity.<br />
<br />
We visited ING Netherlands in 2017. We found a small group of very experienced lean ‘sensei’ who had worked at the bank for many years. They showed us a strategy deployment room that had the bank’s long-term strategy at the top, with about a dozen focus areas just below. Each focus area was connected to one or more strategic initiatives, and each strategic initiative had a few problems below it, typically with an A3 document attached. Teams were given these problems to solve, and they could walk into the room at any time to see their problem’s context in the company’s overall strategy.<br />
<br />
We talked to members of several teams and found them uniformly delighted at their new way of working. We heard repeatedly: “You should have seen us a year ago – it’s so much better now!” One team showed us how they decided what to work on. They had a list of customer frustrations gleaned from various sources, including artificial intelligence tools scanning social media. When they were ready to attack a new problem, they picked the top frustration on the list. Their leader, a designer, helped the team design candidates for a better experience, test them, and implement the best ones.<br />
<br />
<h4>
Case Study: Spark New Zealand</h4>
About the same time, managers from Spark New Zealand also visited ING. Spark’s managing director, Simon Moutter, says that ING’s success helped guide his company through a similar change:<span style="font-size: x-small;">[38]</span><br />
<blockquote class="tr_bq">
<i>I was impressed when we visited ING; I thought ING’s model was structured, performance driven, and very applicable in our context – “agile for grown-ups,” if you like. It was less about beanbags and foosball tables and more about real delivery action, and that gave me confidence that there was an outcome that – if we could deliver it – would make a big and enduring difference.</i></blockquote>
Spark New Zealand subsequently made the switch to integrated teams with impressive results:<br />
<blockquote class="tr_bq">
<i>Spark is seen as a positive company, an innovative company, …. Over the past two years or so, we’ve been winning a range of business awards, a number of which we weren’t even getting nominated for before. We also have a degree of execution excellence now that has been noticed by investors. We say it, we do it. </i></blockquote>
<blockquote class="tr_bq">
<i>The success is showing up in the “hard” numbers; our mobile market share is up eight percentage points, to 40 percent, since 2013—a huge turnaround. </i></blockquote>
<br />
<h2 style="text-align: center;">
<span style="color: #073763; font-size: x-large;"> 20/20 Vision</span></h2>
<div style="text-align: center;">
<span style="color: #073763;"><i>“The future is here, it’s just not evenly distributed.”</i> </span><br />
<span style="color: #073763; font-size: x-small;">-- William Gibson</span></div>
<div style="text-align: center;">
<br /></div>
You do not have to look far to see lean principles being applied – or being pursued – in the design and engineering of software-intensive products. The term ‘lean’ may not be used, but the shift to digital has become a strategic necessity for a large number of companies. If you look closely at successful digital companies, they look rather ‘lean’. They obsess over customers. They create an engaging engineering culture. Full stack teams deliver early and often and learn from feedback. Infrastructure and products are stable, secure, and resilient. This is lean. This is the future. It’s just not (yet) evenly distributed.<br />
<br />
__________<br />
<h4>
<b>Footnotes</b></h4>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">[1] The text of Eric Raymond’s presentation can be found here: <a href="https://firstmonday.org/article/view/578/499" target="_blank">https://firstmonday.org/article/view/578/499</a>. A later version of <i>The Cathedral and the Bazaar</i> was published by O’Reilly Media in 1999.</div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">[2] <a href="https://www.google.com/search/howsearchworks/mission/" target="_blank">https://www.google.com/search/howsearchworks/mission/</a></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">[3] “A Case for Redundant Arrays of Inexpensive Disks (RAID)” by David A. Patterson, Garth Gibson, and Randy H. Katz, ACM SIGMOD Conference, June 1988.</div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">[4] “Web Search for a Planet: The Google Cluster Architecture” by Luiz André Barroso, Jeffrey Dean, and Urs Hölzle, IEEE Micro, March–April 2003. Available at <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/googlecluster-ieee.pdf" target="_blank">https://static.googleusercontent.com/media/research.google.com/en//archive/googlecluster-ieee.pdf</a></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">[5] “The Google File System” by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, presented at ACM SOSP ’03, October 19–22, 2003. Available at <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf" target="_blank">https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf</a></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">[6] “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat, presented at the ACM/USENIX Symposium on Operating System Design and Implementation (OSDI), 2004. Available at <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf" target="_blank">https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf</a></div>
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[6]</span></span><!--[endif]--></span> “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat, presented at ACM/USENIX Symposium on Operating System Design and Implementation (OSDI), 2004. Available at <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf" target="_blank">https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf</a><o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[7]</span></span><!--[endif]--></span> <i>How Google Tests Software </i>by James A. Whittaker, Jason Arbon, and Jeff Carollo, Addison-Wesley Professional, March, 2012<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[8]</span></span><!--[endif]--></span> See <i>Managing to Learn: Using the A3 Management Process to Solve Problems, Gain Agreement, Mentor and Lead </i>by John Shook and Jim Womack, Lean Enterprises Inst Inc, June, 2008<b><o:p></o:p></b></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[9]</span></span><!--[endif]--></span> See “Online Experimentation at Microsoft” by Ron Kohavi, Thomas Crook, and Roger Longbotham, presented at the ACM Knowledge Discovery & Data Mining (KDD) Conference, 2009. Available at <span style="text-indent: -0.25in;"><a href="https://exp-platform.com/Documents/ExPThinkWeek2009Public.pdf" target="_blank">https://exp-platform.com/Documents/ExPThinkWeek2009Public.pdf</a></span></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[10]</span></span><!--[endif]--></span> <i>The Machine That Changed the World; the Story of Lean Production</i> by James P. Womack, Daniel T. Jones, and Daniel Roos.<b><i> </i></b>Rawson & Associates, 1990<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[11]</span></span><!--[endif]--></span> <i style="text-indent: 0px;">Product Development Performance: Strategy, Organization, and Management in the World Auto Industry</i><span style="text-indent: 0px;"> by Kim B. Clark and Takahiro Fujimoto</span>, Harvard Business School Press, 1990. See pg. 172.<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[12]</span></span><!--[endif]--></span> How not to land an orbital rocket booster <a href="https://www.youtube.com/watch?v=bvim4rsNHkQ" target="_blank">https://www.youtube.com/watch?v=bvim4rsNHkQ</a> published by SpaceX, Sept 14, 2017<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[13]</span></span><!--[endif]--></span> John Muratore, SpaceX Launch Director, American Institute of Aeronautics and Astronautics (AIAA) 2012 Complex Aerospace Systems Exchange. Available at <a href="http://store.xitricity.skydreams.ws/s3j95uj8a.pdf" target="_blank">http://store.xitricity.skydreams.ws/s3j95uj8a.pdf </a><o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[14]</span></span><!--[endif]--></span> Ibid<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[15]</span></span><!--[endif]--></span> <i>War and Peace and IT: Business Leadership, Technology, and Success in the Digital Age</i> by Mark Schwartz, IT Revolution Press, May, 2019<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[16]</span></span><!--[endif]--></span> “Building the AI-Powered Organization: Technology isn’t the biggest challenge. Culture is.” by Tim Fountaine, Brian McCarthy, and Tamim Saleh, <i>Harvard Business Review</i>, July-August, 2019<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[17]</span></span><!--[endif]--></span> <i>Think Like Amazon: 50 1/2 Ideas to Become a Digital Leader</i> by John Rossman, McGraw-Hill Education, April 2019. See Idea 26: Innovate by Reducing Friction.<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[18]</span></span><!--[endif]--></span> <i>This is Lean: Resolving the Efficiency Paradox</i> by Niklas Modig and Pär Åhlström, Rheologica Publishing, November, 2012<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[19]</span></span><!--[endif]--></span> <i>Extreme Programming Explained</i> by Kent Beck, Addison-Wesley, 2000<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[20]</span></span><!--[endif]--></span> <i>Agile Software Development with SCRUM</i><b> </b>by Ken Schwaber and Mike Beedle, Pearson, October 2001<b><o:p></o:p></b></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[21]</span></span><!--[endif]--></span> <i>Kanban: Successful Evolutionary Change for Your Technology Business </i>by David J. Anderson, Blue Hole Press, April 2010<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[22]</span></span><!--[endif]--></span> <i>Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation</i> by Jez Humble and David Farley, Addison-Wesley Professional, 2010<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[23]</span></span><!--[endif]--></span> <i>The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations</i> by Gene Kim, Jez Humble, Patrick Debois, and John Willis, IT Revolution Press, October, 2016<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[24]</span></span><!--[endif]--></span> See <a href="https://trunkbaseddevelopment.com/" target="_blank">https://trunkbaseddevelopment.com/</a><o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[25]</span></span><!--[endif]--></span> “No Silver Bullet: Essence and Accidents of Software Engineering” by Frederick Brooks, IFIP Tenth World Computing Conference, Amsterdam, NL, 1986<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[26]</span></span><!--[endif]--></span> <i>Normal Accidents: Living With High-risk Technologies </i>by Charles Perrow, Basic Books, 1985<i><o:p></o:p></i></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[27]</span></span><!--[endif]--></span> Michael Cusumano popularized the term Sync and Stabilize in his book <i>Microsoft Secrets: How the World's Most Powerful Software Company Creates Technology, Shapes Markets and Manages People, </i>by Michael A. Cusumano, Free Press, December 1998.</div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10pt; line-height: 14.2667px;">[28]</span></span></span> Used with permission.</div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[29]</span></span><!--[endif]--></span> The 1962 book “The Structure of Scientific Revolutions” by Thomas Kuhn discussed how significant paradigm shifts in science do not take hold until a generation of scientists brought up with the old paradigm finally retire.<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[30]</span></span><!--[endif]--></span> From “What is Digital Intelligence” by Sunil Mithas and F. Warren McFarlan, IEEE Computing Edge, November 2017. Pg.9.</div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[31]</span></span><!--[endif]--></span> Thanks to Nick Larsen: “Does Your Employer See Software Development as a Cost Center or a Profit Center?” <a href="https://stackoverflow.blog/2017/02/27/employer-see-software-development-cost-center-profit-center/" target="_blank">https://stackoverflow.blog/2017/02/27/employer-see-software-development-cost-center-profit-center/</a><o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[32]</span></span><!--[endif]--></span> The idea of “Cost of Delay” was introduced by Preston G. Smith and Donald G. Reinertsen in their book <i>Developing Products in Half the Time</i>, Van Nostrand Reinhold, 1991; second edition, Wiley, 1997.<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[33]</span></span><!--[endif]--></span> “If Xerox Parc Invented the PC, Google Invented the Internet” by Cade Metz, <i>Wired</i>, August, 2012. Available at <a href="https://www.wired.com/2012/08/google-as-xerox-parc/" target="_blank">https://www.wired.com/2012/08/google-as-xerox-parc/</a><o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[34]</span></span><!--[endif]--></span> Standish Group Study Reported at XP2002 by Jim Johnson, Chairman<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[35]</span></span><!--[endif]--></span> <i>Think Like Amazon: 50 1/2 Ideas to Become a Digital Leader</i> by John Rossman, McGraw-Hill Education, April 2019. See Idea 45: The Future Press Release.<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[36]</span></span><!--[endif]--></span> From: ING’s agile transformation, an interview with Peter Jacobs, CIO of ING Netherlands, and Bart Schlatmann, former COO of ING Netherlands, in McKinsey Quarterly, January 2017. See also: Software Circus Cloudnative Conference keynote by Peter Jacobs. (Peter Jacobs, replaced Ron van Kemenade as CIO of ING Netherlands in 2013.)<o:p></o:p></div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span class="MsoEndnoteReference"><!--[if !supportFootnotes]--><span class="MsoEndnoteReference"><span style="font-family: "calibri" , sans-serif; font-size: 10.0pt; line-height: 107%;">[37]</span></span><!--[endif]--></span> From: Building a Cutting-Edge Banking IT Function, An Interview with Ron van Kemenade, CIO ING Bank, by Boston Consulting Group. See also talks by Ron van Kemenade: Nothing Beats Engineering Talent…The AGILE Transformation at ING and The End of Traditional IT.</div>
<div class="MsoEndnoteText" style="margin-bottom: 3.0pt; margin-left: .25in; margin-right: 0in; margin-top: 0in; text-indent: -.25in;">
<span style="font-family: "calibri" , sans-serif; font-size: 14.6667px;">[38] </span>“All in: From recovery to agility at Spark New Zealand” From interviews with Spark New Zealand’s Simon Moutter, Jolie Hodson, and Joe McCollum by McKinsey’s David Pralong, Jason Inacio, and Tom Fleming, McKinsey Quarterly, June 2019</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br /></div>
Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comtag:blogger.com,1999:blog-2229468562774492653.post-39362334098600801562019-06-07T17:54:00.001-05:002019-06-10T11:47:42.067-05:00Lean Lunch Lines<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitvEg5I_Kav1-zY6IWVM73G2Qs22KcHHE8O9OnxVibjFJ0eq6kGqyuXhIC7uKN1-6gxkJyL8SR_H1_D5ACWkd-WOtUgzhbXRxQl3SV8ZkNG2hbFFr1kvmIugohcmfgGvEaUSWtHdJmVW8K/s1600/Lineup.png" imageanchor="1" style="clear: left; display: inline !important; float: left; margin-bottom: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="398" data-original-width="423" height="188" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitvEg5I_Kav1-zY6IWVM73G2Qs22KcHHE8O9OnxVibjFJ0eq6kGqyuXhIC7uKN1-6gxkJyL8SR_H1_D5ACWkd-WOtUgzhbXRxQl3SV8ZkNG2hbFFr1kvmIugohcmfgGvEaUSWtHdJmVW8K/s200/Lineup.png" width="200" /></a><br />
<span style="color: #0b5394; font-size: large;">Budapest</span><br />
“This doesn’t look good,” Tom said, pointing out six signs high on the wall, side-by-side. Three said “Pork and Beef.” Three said “Fish and Vegan.” Below each sign a serving station was being set up. Lunch break started in half an hour. There were over 2000 people to feed, and the pouring rain meant any outdoor food options were unlikely to attract much traffic.<br />
<br />
We weren’t the only ones who noticed. Lines began to form at the food stations; the queues for meat grew especially long. We joined one of the three long lines and moved slowly to the front. At the food station we had several choices to make – so it took a while to be served.<br />
<br />
I did the math. Each meat station was serving about 4 people per minute, so the three meat stations might serve roughly 750 people in an hour – maybe 1000 if the service got faster. From the short lines at the fish/vegan stations, I inferred that they might account for 15-20% of the demand, leaving over 1500 people to be served by the three lines offering meat. The 90-minute lunch break was probably not going to be long enough.<br />
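The arithmetic behind that estimate is simple enough to sketch. The numbers below are the rough observations from the paragraph above (not measurements), so treat the result as an order-of-magnitude check:

```python
# Rough capacity check for the Budapest lunch lines.
# Estimates: ~2000 attendees, ~80% wanting meat,
# 3 meat stations, each serving ~4 people per minute.
attendees = 2000
meat_share = 0.80
meat_stations = 3
rate_per_station = 4  # people served per minute, per station

meat_demand = attendees * meat_share                        # ~1600 people
capacity_per_hour = meat_stations * rate_per_station * 60   # 720 people/hour
minutes_needed = meat_demand / (meat_stations * rate_per_station)

print(f"{meat_demand:.0f} people in the meat lines")
print(f"{minutes_needed:.0f} minutes to serve them all")    # ~133 minutes
```

At roughly 133 minutes, even the 90-minute lunch break plus a 30-minute delay could not absorb the queue.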
<br />
Sure enough, just before the sessions were set to resume, the following tweet announced that the afternoon talks would be delayed by a half hour:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifPPEp7wDFvIfLBba6ef92HMbmlRUzsBOgXs8sJKe034tVEMrr9Ga3RniV1-yfuIYfX5WTizqzAdyOJ9kvDfK5PYO5FAr_46vB6ideJG7rRL9a90MU5I6ow7grwc9cMID0AAkQsGE6_UHi/s1600/30+minute+delay.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="128" data-original-width="871" height="57" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifPPEp7wDFvIfLBba6ef92HMbmlRUzsBOgXs8sJKe034tVEMrr9Ga3RniV1-yfuIYfX5WTizqzAdyOJ9kvDfK5PYO5FAr_46vB6ideJG7rRL9a90MU5I6ow7grwc9cMID0AAkQsGE6_UHi/s400/30+minute+delay.PNG" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
At a conference whose attendees pride themselves on agility and thus should understand queuing theory, this was a big disappointment. In Lean we have a mantra: “Go and See.” It means go to the place where delays are happening and see for yourself what is going on. I can’t help thinking that if an observant person with authority to change things had taken a close look at the lines on the first day, it might have been possible to improve the process and keep the conference on schedule. For example, they might have switched two of the three fish/vegan lines to meat, or perhaps they could have served all types of food from all six food stations, or shortened the time it took to serve a meal.</div>
<br />
The conference had a second day, and I assumed that lessons had been learned and the queues would be shorter. As lunch time approached, four more signs were posted high on a wall at the far end of the room, perpendicular to the first six. Two said “Pork and Beef.” Two said “Fish and Vegan.” So now there were five long lines for the 80% or so of the attendees who preferred meat, and five short lines for the others. An observant caterer would not have added more fish/vegan lines; with ten lines in total, no more than two should have been devoted to the meatless meals preferred by perhaps 20% of the attendees. Of course, that ratio was not precise; although I preferred meat, for example, I opted for the very short line serving fish, and I’m sure I was not alone.<br />
<br />
The real test of lunch line flow is this: How long do attendees have to wait in line to obtain the food of their choice? At a conference where organizers understand queuing theory, attendees should not have to wait in a food queue for more than 10 minutes – or maybe up to 15 minutes at peak times. Asking people to stand in line for longer periods shows a lack of respect for their time.<br />
<br />
<span style="color: #0b5394; font-size: large;">Zürich</span><br />
The following week we attended DevOps Days in Zürich. There were roughly 400 attendees, and as lunch approached I noticed two stations being stocked with food. I wondered how many people a station might serve per minute. If it was four per minute (as in Budapest) and there were only two stations serving lunch, I speculated that it could take 50 minutes to serve everyone – and that’s a long time for anyone to stand in line.<br />
<br />
But I was wrong. Julia Faust, who helped plan the lunch, explained to me: “DevOps is about FLOW, so we want lunch lines that flow. We have four food stations – not two – and we have the same food at each station so anyone can get in any line. We limited the number of food offerings to be sure service is fast; we expect each station to serve about ten people per minute. We think that the four food stations can feed up to 40 people per minute, so 400 people could be fed in ten to fifteen minutes. We have also placed appetizers around on the tables so people can eat something before getting in line. We are hoping that the lines will form gradually and remain very short.”<br />
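Both estimates – my pessimistic guess and Julia’s plan – fall out of the same back-of-the-envelope formula, using the rates quoted above:

```python
def minutes_to_serve(people, stations, rate_per_station):
    """Worst-case time to serve everyone, assuming the lines never go idle."""
    return people / (stations * rate_per_station)

# Two stations at the Budapest rate of ~4 people per minute each:
print(minutes_to_serve(400, 2, 4))   # 50.0 minutes
# Four stations with a streamlined menu, ~10 people per minute each:
print(minutes_to_serve(400, 4, 10))  # 10.0 minutes
```

Doubling the stations and more than doubling the service rate cut the worst case from 50 minutes to 10.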
<br />
Sure enough, there were almost no lines at the food stations, and everyone was served within fifteen minutes. Perhaps no one noticed that there was more time for networking and lunch-time gatherings, but to me it was clear that the organizers of this conference understood queuing theory and respected the time of attendees.<br />
<br />
<span style="color: #0b5394; font-size: large;">Lyon</span><br />
We went to the MiXiT conference in Lyon the following week, where about 1000 attendees were expecting lunch. I was delighted to see that people were able to help themselves to food rather than have it served. I never could understand why most European conferences I attend find it necessary to have someone serve food and pour coffee. After all, just about every European hotel I stay at seems to have a breakfast buffet, so there's nothing inherently difficult about self-service meals.<br />
<br />
There were two long food tables, one on each side of the lunchroom. Again, I did the math. To feed 1000 people in 15 minutes, the required service rate (the inverse of the takt time) for each table would have to be about 33 people per minute. With a line on each side of each table, each of the four lines would need to serve about 17 people per minute. But my calculation was wrong – there would be only one line per table, not two, so serving everyone within 15 minutes would require each line to move 33 people per minute. Clearly this was not going to happen.<br />
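Here is that calculation spelled out; the two-lines-per-table case is the hypothetical arrangement I was about to argue for:

```python
# Takt arithmetic for the Lyon lunch: 1000 attendees, 15-minute target.
attendees = 1000
target_minutes = 15

required_rate = attendees / target_minutes  # ~66.7 people/minute overall
per_table = required_rate / 2               # ~33.3 with two tables, one line each
per_line_two_sided = per_table / 2          # ~16.7 if each table had two lines

print(f"one line per table:  {per_table:.0f} people/minute per line")
print(f"two lines per table: {per_line_two_sided:.0f} people/minute per line")
```

Halving the per-line rate by serving from both sides of each table is the whole argument in one division.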
<br />
“Why not pull the tables away from the wall a bit further and allow people to get food from both sides?” I asked the gentleman in charge of lunch.<br />
<br />
“That’s not possible,” he replied. “We need access to one side of each table for replenishment.”<br />
<br />
“But,” I said, “Then the lines would move twice as fast.”<br />
<br />
“They are fast enough,” he said. “Yesterday the lines were 35 minutes long; they don't need to be faster.”<br />
<br />
I saw tables stacked high with boxes and bags of food, with a long line of people moving past each table, picking up three or four individually packaged items. On the other side, a few people watched and occasionally replenished one stack of food or another. They could easily have interrupted a line to add a depleted item – this happens all the time at breakfast buffets. There was no reason (other than habit) to limit each table to one line. A conference that respects its attendees should optimize lunch lines for their convenience, and find other ways to accommodate the people serving the food.<br />
<br />
<span style="color: #0b5394; font-size: large;">A Footnote on Diversity</span><br />
Every software conference I attend broadcasts a policy encouraging diversity. I welcome that because I am different from most attendees – I am 75 years old (and proud of it). But somehow, my kind of diversity has not been considered at most conferences. Consider the first conference we stopped at this spring – Agile Lean Ireland. There were virtually no chairs except in rooms where talks were held; everyone was expected to stand during coffee and lunch breaks. So Tom and I ate – usually alone – in a conference room.<br />
<br />
Lunch was served at multiple locations strung out down a long hallway. The station furthest from the conference rooms opened first, to encourage people to move to the most remote station. This might have been a good idea, except for one thing – I found myself swept up in a swarm of attendees racing down the hallway to get served first. After a short time, I just stopped – I had gone far enough. I turned to the servers at a nearby lunch station (which was not yet opened) and said “Give me food. Now. I can’t go any further.”<br />
<br />
The gentleman I spoke to was about to refuse, but his wiser companion indicated he should go ahead. As he served lunch for Tom and me, I smiled gratefully at the woman who had broken the rules. Then she said to the nearby people hoping to get some food, “This location is not yet open. You have to keep moving to the end of the hall.” Sigh.<br />
<br />
Long walks, long lines, no chairs, and toilets up or down stairs are all indications that a conference does not really welcome older, less agile attendees. Of the four conferences that Tom and I attended in April and May, DevOps Days in Zürich was the only one which had none of these limitations, and thus made us feel the most welcome.<br />
<div>
<br /></div>
Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comtag:blogger.com,1999:blog-2229468562774492653.post-78543182731340004122019-04-04T20:45:00.000-05:002019-04-28T13:48:42.602-05:00What If Your Team Wrote the Code for the 737 MCAS System?<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiei8ltx3QtC-zZbMFqF2LoBg9i4ASiD0WAr0aKhaSEdxCCuHHStxn4j_IWU4VTmBARIhKQEkQpViY8hl6e3oIH-AWD2ZzSmaZyAUo7qMImQ2aFmML_kfSaYnpEFJznyvqrkIzedUEy0Tdr/s1600/kisspng-emergency-safety-kill-switch-panic-button-push-but-emergency-5abffce774ca83.2553726415225315594784+%25281%2529.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="600" data-original-width="600" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiei8ltx3QtC-zZbMFqF2LoBg9i4ASiD0WAr0aKhaSEdxCCuHHStxn4j_IWU4VTmBARIhKQEkQpViY8hl6e3oIH-AWD2ZzSmaZyAUo7qMImQ2aFmML_kfSaYnpEFJznyvqrkIzedUEy0Tdr/s200/kisspng-emergency-safety-kill-switch-panic-button-push-but-emergency-5abffce774ca83.2553726415225315594784+%25281%2529.png" width="200" /></a>The 737 has been around for a half century, and over that time airplanes have evolved from manual controls to fly-by-wire systems. As each new generation of 737 appeared, the control system became more automated, but there was a concerted effort to maintain the “feel” of the previous system so pilots did not have to adapt to dramatically different mental models of how to control a plane. When electronic signals replaced manual controls an “Elevator Feel Shift System” was added to simulate the resistance pilots felt when using manual controls and provide feedback through the feel of the control stick (yoke). A stall warning mechanism was also added – it was designed to catch the pilot’s attention if a stall seemed imminent, alerting the pilot to push forward on the yoke and thus increase the thrust and lower the nose a bit.<br />
<br />
Enter a new version of the 737 (the 737 MAX) – rushed to market to counter a serious competitive threat. To make the plane more energy efficient, new (larger) engines were added. Since the landing gear could not reasonably be extended to allow for the larger engines, they were positioned a bit further forward and higher on the wing – causing instability problems under certain flight conditions. Boeing addressed this instability with the MCAS system – a modification of the (already certified and proven) Elevator Feel Shift System that would automatically lower the nose when an imminent stall is detected, rather than alerting the pilot. Of course, airplanes have been running on auto-pilot for years, so a little bit of automatic correction while in manual mode is not a radical concept. The critical safety requirement here is not redundancy, because the pilot is expected to override an autopilot system if warranted. The critical safety requirement is that if an autopilot system goes haywire, the pilots are alerted in time to use a very obvious and practiced process to override it.<br />
<br />
Two 737 MAX airplanes have crashed, and the MCAS system has been implicated as a potential cause of both. Based on preliminary reports, it appears that MCAS, operating with a (single) faulty sensor and persistently reversing the pilots’ override, may eventually be found to have contributed to the disasters.<br />
<br />
Hindsight is always much clearer than foresight, and we know that predicting all possible behaviors of complex systems is impossible. And yet, I wonder if the people who wrote the code for the MCAS system were involved in the systems engineering that led to its design. More to the point, as driverless vehicles and sophisticated automated equipment become increasingly practical, what is the role of software engineering in assuring that these systems are safe?<br />
<br />
Back in the day, I wrote software which controlled large roll-goods manufacturing processes. I worked in an engineering department where no one entertained the idea of separating design from implementation. <b>WE</b> were the engineers responsible for understanding, designing, and installing control systems. A suggestion that someone else might specify the engineering details of our systems would not have been tolerated. One thing we knew for sure – we were responsible for designing safe systems, and we were not going to delegate that responsibility to anyone else. Another thing we knew for sure was that anything that could go wrong would eventually go wrong – so every element of our systems had to be designed to fail safely; every input to our system was suspect; and no output could be guaranteed to reach its destination. And because my seasoned engineering colleagues were suspicious of automation, they added manual (and very visible) emergency stop systems that could easily and quickly override my automated controls.<br />
<br />
But then something funny happened to software. Managers (often lacking coding experience or an engineering background) decided that it would be more efficient if one group of people focused on designing software systems while another group of people actually wrote the code. I have never understood how this could possibly work, and quite frankly, I have never seen it succeed in a seriously complex environment. But there you have it – for a couple of decades, common software practice has been to separate design from implementation, distancing software engineers from the design of the systems they are supposed to implement.<br />
<br />
Returning to the 737 MAX MCAS system, while it’s not useful to speculate how the MCAS software was designed, it’s useful to imagine how your team – your organization – would approach a similar problem. Suppose your team was tasked with modifying the code of a well-proven, existing system to add a modest change to overcome a design problem in a new generation of the product. How would it work in your environment? Would you receive a spec that said: When a stall is detected (something the software already does), send an adjustment signal to bring the nose down? And would you write the code as specified, or would you ask some questions – such as “What if the stall signal is wrong, and there really isn’t a stall?” Or “Under what conditions do we NOT send an adjustment signal?” Or “When and how can the system be disabled?”<br />
<br />
If you use the title “Software Engineer,” the right answer should be obvious, because one of the primary responsibilities of engineers is the safety of people using the systems they design. Any professional engineer knows that they are responsible for understanding how their part of a system interacts with the overall system and being alert to anything that might compromise safety. So if you call yourself an engineer, you should be asking questions about the safety of the system before you write the code.<br />
<br />
It doesn’t matter that the Elevator Feel Shift System has been working well for many years – the fact is that this system has always depended on the reading from a single sensor, and that sensor can – and WILL – malfunction. In the earlier versions of the Elevator Feel Shift System, a single sensor was not critical, because the system provided a warning to pilots, who then took corrective action if needed; pilots can detect and ignore a false signal. But when there is no pilot in the loop and the software is supposed to automatically correct upon sensing a stall, it would be a good idea to make sure that the stall is real before a correction is made. At the very least, there should be an easy, intuitive, and PERMANENT override of the system if it malfunctions. And yes, this override should leave the plane in a manageable state. If this were your team, would you dig deep enough to discover that the stall signal depended on a single sensor? Would you discuss whether there were limits to the extent of its response or conditions under which the system should not respond?<br />
<br />
Possibly a more serious problem with the MCAS system is that it apparently resets after five seconds of normal operation, and thus can push the nose down repeatedly. Would your team have considered such a scenario? Would you have thought through the conditions under which a nose-down command would be dangerous, and how the system could be disabled? It might not be your job to train pilots on how to use a system, but it is the job of engineers to build systems that are easy and intuitive to override when things go wrong.<br />
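The questions above can be made concrete. The sketch below is purely hypothetical – it is not Boeing's design, and every name and threshold in it is invented for illustration – but it shows the kind of guard logic a team might debate: trust a stall indication only when independent sensors agree, and make the pilot override a permanent latch rather than something that re-arms after five seconds.

```python
# Purely hypothetical guard logic -- NOT Boeing's design. All names and
# thresholds (STALL_AOA_DEG, AGREEMENT_DEG) are invented for illustration.

STALL_AOA_DEG = 15.0   # assumed angle-of-attack stall threshold
AGREEMENT_DEG = 2.0    # max disagreement before sensor data is distrusted

class StallProtection:
    """Issues nose-down commands only on credible, non-overridden stall signals."""

    def __init__(self):
        # Once the pilots override, the override latches: it never resets on
        # its own (contrast with a system that re-arms after five seconds).
        self.pilot_override = False

    def override(self):
        self.pilot_override = True

    def nose_down_command(self, aoa_left, aoa_right):
        if self.pilot_override:
            return False                      # pilots stay in control
        if abs(aoa_left - aoa_right) > AGREEMENT_DEG:
            return False                      # sensors disagree: warn, don't act
        return min(aoa_left, aoa_right) > STALL_AOA_DEG
```

In this sketch, one faulty sensor reading 40 degrees while the other reads 5 fails the agreement check, so the system would alert the pilots rather than repeatedly trim the nose down on bad data.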
<br />
The demand for control software is going to increase significantly as we move into an era of driverless vehicles and increasingly automated equipment. These systems must be safe – and there are many opinions on how to make sure they are safe. I would propose that although good processes are important for safety, good engineering is the fundamental discipline that makes systems increasingly safe. Processes that separate design from implementation get in the way of good engineering and are not appropriate for complex technical systems.<br />
<br />
Software engineers need to understand what civil engineers learn as undergraduates – safety is not someone else’s job; it is the responsibility of every engineer involved in the design and implementation of a system whose failure might cause harm. If your team is not ready to accept this responsibility, then call yourselves developers or programmers or technicians – but not engineers.<br />
<div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br /></div>
Mary Poppendieck | 2019-01-19
<h2>An Interview</h2>
<div class="separator" style="clear: both; text-align: center;">
</div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEii2ZCYd8ZMkgVrfDZS5bFtEHml0-QOX-P2bnIqUzxHwSThHC5hzjJiHtOiXG0Yw4am_IFKze7OpEwf-d0Ifx3CzhaEKxVske9qTAgMwR-LAFIlnp9uV5RAxsBCdZyLocWN8A8hFUWqgZS0/s1600/Writing.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="422" data-original-width="413" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEii2ZCYd8ZMkgVrfDZS5bFtEHml0-QOX-P2bnIqUzxHwSThHC5hzjJiHtOiXG0Yw4am_IFKze7OpEwf-d0Ifx3CzhaEKxVske9qTAgMwR-LAFIlnp9uV5RAxsBCdZyLocWN8A8hFUWqgZS0/s320/Writing.png" width="313" /></a>Recently I was asked to complete an interview via e-mail. I found the questions quite interesting - so I decided to post them here.<br />
<br />
<i>Caution:</i> The answers are brief and lack context. Some of them are probably controversial, and the interview format didn't provide space for going below the surface. Send me an e-mail (mary@poppendieck.com) if you'd like to explore any of these topics further.<br />
<h3>
</h3>
<div>
<br /></div>
<h3>
When did you first start applying Lean to your software development work? Where did you get the inspiration from?</h3>
I think it’s important to set the record straight – most early software engineering was done in a manner we now call ‘Lean.’ My first job as a programmer was working on the Number 2 Electronic Switching System when it was under development at Bell Telephone Labs. Not long after that, I was assisting a physicist doing research into high-energy particle tracing. The computer I worked on was a minicomputer that he scrounged up from a company that had gone bankrupt. With a buggy FORTRAN compiler and a lot of assembly language, we controlled a film scanner that digitized thousands of frames of bubble chamber film, projected the results into three-dimensional space, and identified unique events for further study.<br />
<br />
My next job was designing automated vehicle controls in an advanced engineering department of General Motors. From there I moved to an engineering department in 3M where we developed control systems for the big machines that make tape. In every case, we used good engineering practices to solve challenging technical problems.<br />
<br />
In a very real sense, I believe that lean ideas are simply good engineering practices, and since I began writing code in good engineering departments, I have always used lean ideas when developing software.<br />
<br />
<h3>
</h3>
<h3>
From the organizations you've worked with, what have been some of the most common challenges associated with Lean transformations?</h3>
Far and away the most common problem occurs when companies head into a transformation for the sake of the transformation, instead of clearly and crisply identifying the business outcomes that are expected as a result of the transformation. You don’t do agile to do agile. You don’t do lean to do lean. You don’t do digital to do digital. You do these things to create a more engaging work environment, earn enough money to support that environment, and build products or services that truly delight customers.<br />
<br />
So an organization that sets out on a transformation should be looking at these questions:<br />
<ol>
<li>Is the transformation unlocking the potential of everyone who works here? How do we know?</li>
<li>Are we creating products and services that customers love and will pay for? How do we know?</li>
<li>Are we creating the reputation and revenue necessary to sustain our business over the long run?</li>
</ol>
<h3>
</h3>
<h3>
There's lots of talk now around scaled Agile frameworks such as SAFe, Nexus, LESS, etc. with mixed results. How do you approach the challenge of scaling this way of working?</h3>
Every large agile framework that I know of is an excuse to avoid the difficult and challenging work of sorting out the organization’s system architecture so that small agile teams can work independently. You do not create smart, innovative teams by adding more process, you create them by breaking dependencies.<br />
<br />
What we have learned from the Internet and from the Cloud is very simple – really serious scale can only happen when small teams independently leverage local intelligence and creativity. Companies that think scaled agile processes will help them scale will discover that these processes are not the right path to truly serious scale.<br />
<br />
<h3>
</h3>
<h3>
One of the common complaints from developers on Agile teams is that they don't feel connected to customers, and there is sometimes a feeling of working on outputs rather than customer outcomes. </h3>
This is the essential problem of organizations that consider agile a process rather than a way to empower teams to do their best work. The best way to fix the problem is to create a direct line of sight from each team to its consumers.<br />
<br />
When the Apple iPhone was being developed, small engineering teams worked in short cycles that were aimed at a demo of a new feature. Even though the demo group was limited due to security, it was representative of future consumers. Each team was completely focused on making the next demo more pleasing and comfortable for their audience than the last one. These quick feedback loops over two and a half years led directly to a device that pretty much everyone loved. [1]<br />
<br />
<h3>
</h3>
<h3>
At our meetup last year, you spoke about resisting proxies, and one of those proxies is the Product Owner. What alternative approaches have you seen work for Lean or Agile teams, as opposed to having a Product Owner?</h3>
Why do software engineers need someone to come up with ideas for them? Ken Kocienda was a software engineer who ‘signed up’ to be responsible for developing the iPhone’s keypad. In the book Creative Selection [1], he describes how he developed the design, algorithms, and heuristics that created a seamless experience when typing on the iPhone keyboard, even though it was too small for most people’s fingers.<br />
<br />
Similarly, at SpaceX, every component has a ‘responsible engineer’ who figures out how to make that component do its proper job as part of the launch system. John Muratore, SpaceX Launch Director, says “SpaceX operates on a philosophy of Responsibility – no engineering process in existence can replace this for getting things done right, efficiently.” [2]<br />
<br />
The Chief Engineer approach is common in engineering departments from Toyota to GE Healthcare. It works very well. There is nothing about software that would exempt it from the excellent results you get when you give engineers the responsibility of understanding and solving challenging problems.<br />
<br />
<h3>
</h3>
<h3>
What is the most common thing you've seen recently which is slowing down organizations' <i>concept-to-cash</i> loop?</h3>
Friction. For example, the dependencies generated by the big back end of a banking system are a huge source of friction for product teams. The first thing organizations need to do is to learn how to recognize friction and stop thinking of it as necessary. When Amazon moved to microservices (from 2001 to 2006) the company had to abandon the idea that transactions are managed by a central database – which was an extremely novel idea at the time.<br />
<br />
Over time, Amazon learned how to recognize friction and reduce it. Today, Amazon Web Services (AWS) launches a new enterprise-level service perhaps once a month and about two new features per day. Even more remarkably, AWS has worked at a similar pace for over a decade. If you look closely, an Amazon service is owned by a small team led by someone who has 'signed up' to be responsible for delivering and supporting a service that addresses a distinct customer need at a price that is extremely attractive yet provides enough revenue to sustain the service over time.<br />
_____<br />
Footnotes:<br />
<br />
1. See <i><a href="http://www.amazon.com/exec/obidos/ASIN/B079DVT6VP/poppendieckco-20" target="_blank">Creative Selection</a></i> by Ken Kocienda.<br />
<br />
2. John Muratore, <i><a href="https://www.aiaa.org/uploadedFiles/Events/Conferences/2012_Conferences/2012-Complex-Aerospace-Systems-Exchange-Event/Detailed_Program/CASE2012_2-4_Muratore_presentation.pdf" target="_blank">System Engineering: A Traditional Discipline in a Non-traditional Organization</a></i>, talk at the 2012 Complex Aerospace Systems Exchange Event.
Mary Poppendieck | 2018-01-22
<h2>Official Intelligence</h2>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfu_jq6Vz6GQO15_QY1MIkmbkgDxOzCzDJkJeFAaQ7v43w-Q3CmXZACy72ksdjda0422qgJRJAxFsTFngdy5FFdPbDhOB4JwKvm5WV1zCrB0j8lwKD1F8kNXRyFnC7fhgMgaNwhM98sp8Y/s1600/Picture1.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="265" data-original-width="264" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfu_jq6Vz6GQO15_QY1MIkmbkgDxOzCzDJkJeFAaQ7v43w-Q3CmXZACy72ksdjda0422qgJRJAxFsTFngdy5FFdPbDhOB4JwKvm5WV1zCrB0j8lwKD1F8kNXRyFnC7fhgMgaNwhM98sp8Y/s200/Picture1.png" width="198" /></a></div>
<br />
Every morning I pick up a small black remote, push a button and quietly say, “Alexa, turn on Mary’s Desk.” In the distance, I hear “Ok” and my desk lights come on. I never imagined that I would use voice control. I don’t like shouting at devices and don’t like announcing what I’m doing to those around me. My experience with voice in cars and smartphones has been mediocre. An Echo sat in our house for three years before I began talking to it.<br />
<br />
Our home has been highly automated since the 1980s, but my low-voltage desk lamps are not compatible with our automation system. Last fall we connected our system to Alexa and bought an Alexa-compatible power strip for the lamps. Voilà! We could control everything by voice – in fact, voice was the <i>only </i>way to control my desk lamps remotely. So I <i>had </i>to talk to a device. After I got used to it, I tried Alexa’s shopping list and found it convenient. Then I discovered that Alexa’s timers are well suited to cooking, especially with full hands. Soon I got an Echo Spot so I could see the timers as they counted down – I was hooked.<br />
<br />
<h3>
<span style="font-size: x-large;">A Killer App for IoT</span></h3>
When devices are scattered throughout a physical space, controlling them by voice is a killer app for the Internet of Things [1] – but only if a simple control standard is embedded in every device. For the past couple of years, Amazon has been making it very easy for just about any Wi-Fi enabled device to be controlled through an Alexa skill. Better yet, taking a page out of the old Intel playbook, Amazon sells inexpensive kits and supplies testing support, so designers can easily embed microphones and Alexa intelligence inside their devices. True, these devices compete directly with Amazon Echos, but as a platform company, Amazon understands that this is a good thing.<br />
<br />
Amazon was once considered the weakest of the voice assistant competitors, which include Google, Apple, Samsung, Microsoft, and Facebook. Most voice assistants are, well, assistants. They reserve movie tickets, arrange transportation, find answers to questions. Amazon’s Echo has a different focus: always-on listening and hands-free control of an exploding array of internet devices. This turned out to be a good choice. Google, playing fast follower, quickly introduced Google Home, while Microsoft’s Cortana is being integrated with Alexa. With its first mover advantage, Alexa has captured a commanding share of the voice control market, which could eventually become a fourth ‘pillar’ of Amazon’s success.[2]<br />
<br />
I have to wonder: Is it an accident that Amazon discovered the most attractive use of voice while its biggest competitors were heading in a different direction? Or is there something about Amazon that gives it an edge, something that we might learn from? How does such a massively large company foster the kind of innovation that leads to completely new markets?<br />
<br />
<h2>
<span style="font-size: x-large;">Day 1</span></h2>
In his <a href="https://www.amazon.com/p/feature/z6o9g6sysxur57t" target="_blank">letter to shareholders</a> last spring, Jeff Bezos explained his longstanding mantra “It’s still Day 1” by describing Day 2: “Day 2 is stasis. Followed by irrelevance. Followed by excruciating, painful decline. Followed by death. And that is why it is always Day 1.” Then Bezos laid out his principles for keeping the vitality of Day 1 alive:<br />
<br />
<ol>
<li>True Customer Obsession</li>
<li>A Skeptical View of Proxies</li>
<li>Eager Adoption of External Trends</li>
<li>High-Velocity Decision Making</li>
</ol>
<br />
<h3>
<span style="font-size: large;">True Customer Obsession</span></h3>
We’ve heard so much about Amazon’s customer obsession that it can get boring. After all, doesn’t every company focus on customers?<br />
<br />
Actually, no. Most executives lose a lot more sleep over profits, or shareholders, or competitors than they do worrying about customers. Imagine you worked at an airline, for example, and you had an idea of how to make customers really happy: “Let’s eliminate the baggage fees!” Your manager frowns at you and says: “Have you any idea how much money those baggage fees bring into this airline every month?” And that would be the last customer-obsessed suggestion you would make.<br />
<br />
But consider the Amazon team that came up with Lambda. Some customers report up to an order of magnitude reduction in cost when they switch to Lambda. Yet the Lambda team did not have to answer the sobering question: “Do you know how much revenue Lambda might cannibalize?” Everyone understood that lower prices are good for customers, so they are good for Amazon.<br />
<br />
How does customer obsession get all the way from a statement in a shareholder letter to the actions of front line employees? Amazon does this by creating a direct line of sight between small teams and the customers they are supposed to be obsessed with, then making the teams responsible for improving the lives of those customers in some way.<br />
<br />
<h4>
Too Big to Communicate</h4>
Around 2001, Amazon’s growth was outstripping the capability of its internal systems to keep up. The leadership team came to a pretty standard conclusion – better communication was needed. Jeff Bezos was wise enough to realize that if communication was the problem, the solution had to be less communication, not more. He wanted the company to grow much larger, and if communication was impeding growth at this early stage, they had better figure out how to operate with a lot less of it.<br />
<br />
How did the Internet grow so large? Through a lot of independent agents following their own agendas. How does Open Source software grow? The same way. Bezos decided that Amazon should transition to the independent agent model by organizing into small, independent teams. “If you can arrange to do big things with a multitude of small teams – that takes a lot of effort to organize, but if you can figure that out – the communication on those small teams will be very natural and easy,”[3] Bezos observed.<br />
<br />
What, exactly, does Bezos mean by a team? At Amazon: [3,4]<br />
<br />
<ol>
<li>Teams are groups of 6-12 people with a leader who acts something like a team CEO. The leader often recruits the rest of the team, and members usually stay with a team for two or more years.</li>
<li>Teams are ‘separable’ [separated organizationally] and ‘single-threaded’ [work on a single thing]. </li>
<li>Teams are responsible for a measurable set of external outcomes, usually focused on customers. </li>
<li>Teams decide internally both what they will work on and how they do the work.</li>
<li>Dependencies between teams are kept to an absolute minimum.</li>
</ol>
<br />
Once Amazon decided to structure a company composed of small, autonomous teams responsible for small, independent services, it had to figure out how to build an extensible infrastructure with these teams. That took a lot of time and experimentation, but in the end, it worked. In fact, it worked so well that Amazon decided to sell the infrastructure rather than keep it proprietary. And thus, we have Amazon Web Services (AWS), also known as ‘the Cloud’. As more and more companies move to the cloud they would be wise to understand that before it was a system architecture, the Cloud was an organizational architecture designed to streamline communication.<br />
<br />
<h4>
The Cathedral and the Bazaar</h4>
You are probably saying to yourself about now, “Cloud architectures are fine for a digital world, but how can they possibly work for a large company?” When I first heard about AWS, I asked the same question. To echo Eric Raymond’s “<a href="http://www.unterstein.net/su/docs/CathBaz.pdf" target="_blank">The Cathedral and the Bazaar</a>,” [5]<br />
<blockquote class="tr_bq">
I used to believe there was a certain critical complexity above which a centralized, a priori approach to running a company was required. I thought that successful large companies were built like cathedrals, carefully crafted by individual wizards or small bands of magicians who orchestrated successful strategies. </blockquote>
<blockquote class="tr_bq">
Amazon’s style of organization – assemble hundreds upon hundreds of autonomous teams that decide for themselves what they are going to work on – seemed to resemble a great babbling bazaar of differing agendas and approaches out of which a coherent and stable business could seemingly emerge only by a succession of miracles. </blockquote>
<blockquote class="tr_bq">
The fact that this bazaar style seems to work, and work well, came as a distinct shock. As I learned more about AWS, I worked hard at trying to understand why the Amazon world not only didn’t fly apart in confusion but seemed to go from strength to strength at a speed barely imaginable to cathedral-builders.</blockquote>
<h4>
Knowledge Workers</h4>
In 1999, Peter Drucker published the paper "<a href="http://forschungsnetzwerk.at/downloadpub/knowledge_workers_the_biggest_challenge.pdf" target="_blank">Knowledge-Worker Productivity: The Biggest Challenge</a>." The productivity gains of the 20th century, he noted, applied to manual labor. In the 21st century, the challenge will be increasing knowledge worker productivity, and this requires an approach opposite to the one we have been using for manual labor productivity. He wrote: [6]<br />
<blockquote class="tr_bq">
“Knowledge-worker productivity requires that the knowledge worker is both seen and treated as an ‘asset’ rather than a ‘cost.’ It requires that knowledge workers <i><b>want </b></i>to work for the organization in preference to all other opportunities.” </blockquote>
<blockquote class="tr_bq">
“Knowledge workers [unlike manual workers] <i><b>own </b></i>the means of production. That knowledge between their ears is a totally portable and enormous capital asset.” </blockquote>
<blockquote class="tr_bq">
“Economic theory and most business practice sees manual workers as a <i><b>cost</b></i>. To be productive, knowledge workers must be considered a <i><b>capital asset</b></i>. Costs need to be controlled and reduced. Assets need to be made to grow.” </blockquote>
Drucker pointed out that there are many jobs which we might consider manual labor that involve a lot of knowledge work. One good example would be retail clerks, the subject of Zeynep Ton’s book "<a href="https://www.amazon.com/exec/obidos/ASIN/0544114442/poppendieckco-20" target="_blank">The Good Jobs Strategy</a>." Ton illustrates how several retail chains have generated strong growth and higher than normal profits by paying well, training extensively, and expecting their intelligent employees to create a great experience for customers.<br />
<br />
Knowledge workers are everywhere in our companies, and they represent a huge opportunity for improved performance, if only we learn how to see them as assets and grow their potential. <br />
<br />
<h4>
Volunteers</h4>
In his 1999 book "<a href="https://www.amazon.com/exec/obidos/ASIN/0887309992/poppendieckco-20" target="_blank">Management Challenges for the 21st Century</a>," Peter Drucker pointed out that knowledge workers must be managed as if they were volunteers, because in fact, they <i><b>are </b></i>volunteers. During my last three years at 3M, I used the company’s then-classic ‘bootlegging’ approach to enlist the efforts of dozens of scientists and engineers, and led a volunteer team developing a product called ‘Light Fiber’ and the process needed to manufacture it. I learned a lot about what it takes to lead a large team of volunteer knowledge workers, and it boils down to this: understand what energizes every person on the team and arrange for each person to do as much of what energizes them as possible. This works particularly well because people tend to be energized by what they are good at, and focusing people’s work on what they are good at creates a win-win situation.<br />
<br />
The Light Fiber team met every Wednesday morning before regular work hours (when everyone was free). I supplied breakfast and a couple dozen people showed up to coordinate their efforts – every week, for three years! The meeting was essentially a forum where everyone got to brag about their accomplishments and make promises to each other about what they would do in the future. Even though there were no work assignments, only promises, the team accomplished amazing things.<br />
<br />
<h4>
Promises</h4>
I was reminded of this experience by the book "<a href="https://www.amazon.com/exec/obidos/ASIN/1491917873/poppendieckco-20" target="_blank">Thinking in Promises</a>" by Mark Burgess. He defines a promise as a public declaration of intention by an agent. Agents can only make promises for themselves – they cannot make promises for (impose intentions on) other agents. Agents communicate about what is necessary to achieve shared goals, and then make promises to each other about their intention to contribute to the shared goal. Trust develops when agents are observed routinely keeping promises. Agents can make promises contingent on the trusted behavior of other agents, but they need a fallback plan, since the best of intentions can go awry. This sounded very familiar to me – it is a good description of what happened every week at our Light Fiber meeting. And I agree with Burgess that a system built on promises can be very reliable and robust.<br />
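As a toy illustration (my own sketch, not Burgess's formalism – the class and function names are invented), the core of the model fits in a few lines: an agent can only make promises for itself, and trust is simply the observed fraction of its promises kept.

```python
# A toy model of promise-based coordination, loosely inspired by
# "Thinking in Promises." Names here are illustrative, not from the book.
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    made: int = 0   # promises declared
    kept: int = 0   # promises observed fulfilled

    def promise(self, body):
        """An agent declares an intention only for itself, never for others."""
        self.made += 1
        return Promise(promiser=self, body=body)

@dataclass
class Promise:
    promiser: Agent
    body: str
    fulfilled: bool = False

    def keep(self):
        self.fulfilled = True
        self.promiser.kept += 1

def trust(agent):
    """Trust grows from an agent's observed history of kept promises."""
    return agent.kept / agent.made if agent.made else 0.0
```

Notice that there is no way for one agent to impose an obligation on another – the only operations are declaring your own intention and keeping it, which is exactly what made the weekly Light Fiber forum work.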
<br />
Think of the Bazaar approach as a marketplace where knowledge workers can find the best places to utilize their strengths. The currency of the Bazaar is the promises made to colleagues and the trust built up by promises that are kept. Companies that function as Bazaars have discovered a secret: “Peer pressure is much more powerful than a concept of a boss. Many, many times more powerful.”[7] When people make promises to colleagues and customers, they feel a personal commitment to keep the promise. When managers impose obligations on their teams, there are many points of failure.<br />
<br />
<h3>
<span style="font-size: large;">A Skeptical View of Proxies</span></h3>
Jeff Bezos’ second principle for vitality is to take a skeptical view of proxies. What are proxies? Bezos cites process as a typical example – he doesn’t think “I followed the process” is a good excuse for poor results.<br />
<br />
The most vexing proxies in the development world are the project metrics of cost, schedule, and scope. Teams that can focus directly on the desired outcome usually perform a lot better than teams constrained by these proxies. For IT departments, ‘The Business’ is a proxy. For many businesses, profits are a proxy for delighted customers. [8] Be skeptical.<br />
<br />
Even if someone does their homework and proves that delivering specific proxy results will surely deliver the desired end results, a direct line of sight to the desired outcomes is much better. Why? 1) Things change, but proxies prevent the team from changing accordingly. 2) Proxies tend to mask the intent or purpose of the work, diminishing engagement. 3) Proxies interfere with feedback and therefore slow things down.<br />
<br />
<h3>
<span style="font-size: large;">High-Velocity Decision Making</span></h3>
This brings us to another of Bezos’ principles for maintaining company vitality: high-velocity decision making. Fast decisions are local decisions, because when a decision must be made immediately, there is no time to push it up the chain of command. In military organizations, where high-velocity decision making is a matter of life and death, front line units make local decisions based on situational awareness and their understanding of command intent. Wise commanders get very good at communicating their intent and the desired end state, so that the rapid decisions made on the front lines will be good decisions. This well-tested approach is probably the best model we have for making fast decisions in rapidly changing environments.<br />
<br />
There are three things that get in the way of high-velocity decisions at the local (team) level:<br />
<br />
<ol>
<li><b>Proxies </b>rather than a clear understanding of the desired end state.</li>
<li><b>Permission </b>required from management or other teams.</li>
<li><b>Punishment </b>if the decision is wrong.</li>
</ol>
<br />
We’ve already discussed proxies.<br />
<br />
<h4>
Permission</h4>
The <a href="https://puppet.com/resources/whitepaper/state-of-devops-report" target="_blank">State of DevOps</a> report in 2017 found that the best performing teams are those that operate without the need to obtain permission from outside the team. Obviously, wise managers should try to step back and let teams make their own decisions. But there is a bigger issue here. All too often, there are significant dependencies between teams, requiring multiple teams to coordinate their actions across a large set of interconnections. This was once the reason for long delays between software releases, but we now know that breaking dependencies is a far better strategy than catering to them.<br />
<br />
Dependencies can be subtle, and are usually based on the system architecture. In the late 20th century, companies spent years pursuing the holy grail of integration, only to discover that integrated systems create a legacy of intertwined processes. This interconnectedness makes it almost impossible for teams to make changes without getting permission from a lot of other teams. So much for high-velocity decision making.<br />
<br />
<h4>
Punishment</h4>
Some years ago, 3M’s visionary CEO, William McKnight, made clear what he thought about punishing mistakes: [9]<br />
<blockquote class="tr_bq">
“As our business grows, it becomes increasingly necessary to delegate responsibility and to encourage men and women to exercise their initiative. This requires considerable tolerance. Those men and women, to whom we delegate authority and responsibility, if they are good people, are going to want to do their jobs in their own way.” </blockquote>
<blockquote class="tr_bq">
“Mistakes will be made. But if a person is essentially right, the mistakes he or she makes are not as serious in the long run as the mistakes management will make if it undertakes to tell those in authority exactly how they must do their jobs.” </blockquote>
<blockquote class="tr_bq">
“Management that is destructively critical when mistakes are made kills initiative. And it’s essential that we have many people with initiative if we are to continue to grow.”</blockquote>
<br />
<h3>
<span style="font-size: large;">Eager Adoption of External Trends</span></h3>
To wrap up our discussion of Day 1 principles, let’s consider why Bezos thinks it’s important to adopt external trends. He says that if you pay attention to how external trends are likely to affect you and your customers, you will have a tail wind that can push you toward some interesting opportunities.<br />
<br />
Here is an example: Today’s big trends are artificial intelligence (AI) and machine learning – voice recognition is a trendy use of artificial intelligence. Verbal interaction has not been a large part of Amazon’s retail and web services businesses, but that did not keep the company from making very strong investments in voice technologies. By late 2017, the AWS re:Invent conference centered on voice recognition and machine learning embedded in devices at the edge of the cloud. You could feel the tail wind.<br />
<br />
Let’s assume you want to take Bezos’ advice and embrace today’s big trend – artificial intelligence. You might start by asking: Can artificial intelligence be used to help understand customers better? [We know of a company using Watson to filter social media comments in order to find key customer frustrations.] Could it be used to improve the development process? [Think automation on steroids.] What could machine learning do to increase the reliability of deployed systems? [All the data needed to discover the causes of crashes is probably in logs somewhere.] If you are not asking yourself these questions, you’re asking for a headwind.<br />
<br />
<h3>
<span style="font-size: x-large;">Official Intelligence</span></h3>
When (not if) you find uses for artificial intelligence, it’s time to hit the pause button. If you find jobs that could be replaced by smart machines, you should ask yourself: Why? Why aren’t the intelligent people who currently do those jobs being challenged to think, to find innovative solutions to problems, to be obsessed with customers? Your challenge, should you accept it, is not to embrace artificial intelligence, it is to uncover the official intelligence that is going to waste in your organization. Don’t worry about how AI might reduce the cost of development, focus on how it might be used to leverage the knowledge and creativity of all those intelligent people in your organization.<br />
<br />
As Peter Drucker pointed out, we know how to make manual labor more productive – in fact, artificial intelligence can be quite helpful there. But the real advances of the 21st century will come when we figure out how to make sure that everyone in our organization is officially considered an intelligent person. Officially intelligent people don’t ask permission, they make promises. Officially intelligent people don’t need proxies, they are challenged with the end game. Officially intelligent people have jobs that are augmented by artificial intelligence, not replaced by it.<br />
<br />
Unleashing the potential of all the bright, creative people in our organizations is the central challenge of the digital age. <br />
<br />
_________________________<br />
Footnotes:<br />
<ol>
<li>See<a href="http://adage.com/article/digitalnext/alexa-killer-app/307533/" target="_blank"> Alexa, the Killer App</a></li>
<li>The first three pillars are Marketplace, Prime, and AWS. See <a href="https://www.fool.com/investing/2017/01/05/alexa-could-be-amazons-fourth-pillar.aspx" target="_blank">Alexa Could Be Amazon's "Fourth Pillar"</a> and <a href="https://www.computerworld.com/article/3247791/operating-systems/why-amazon-is-the-new-microsoft.html" target="_blank">Why Amazon is the new Microsoft</a></li>
<li>See <a href="https://www.geekwire.com/2017/leadership-advice-amazon-keeps-managers-focused-competing-many-industries/" target="_blank">Leadership advice: How Amazon maintains focus while competing in so many industries at once</a>. The video is worth watching.</li>
<li>See <a href="http://blog.jasoncrawford.org/two-pizza-teams" target="_blank">Amazon’s “two-pizza teams”: The ultimate divisional organization</a></li>
<li>I take great liberties paraphrasing Eric Raymond’s “<a href="http://www.unterstein.net/su/docs/CathBaz.pdf" target="_blank">The Cathedral and the Bazaar</a>” </li>
<li><a href="http://forschungsnetzwerk.at/downloadpub/knowledge_workers_the_biggest_challenge.pdf" target="_blank">Knowledge-Worker Productivity: The Biggest Challenge</a> by Peter F. Drucker, California Management Review, vol. 41, no. 2, Winter 1999. Italics from the original.</li>
<li>In <a href="https://www.amazon.com/exec/obidos/ASIN/0316346624/poppendieckco-20" target="_blank">The Tipping Point</a>, Malcolm Gladwell wrote about Gore and Associates, a well-known Bazaar company. Gladwell attributes this quote to Jim Buckley of Gore and Associates.</li>
<li>I wrote more about proxies in this post: <a href="http://www.leanessays.com/2017/11/the-cost-center-trap.html" target="_blank">The Cost Center Trap</a></li>
<li><a href="https://solutions.3m.com/wps/portal/3M/en_US/3M-Company/Information/Resources/History/?PC_Z7_RJH9U52300V200IP896S2Q3223000000_assetId=1319210372704" target="_blank">McKnight Principles</a></li>
</ol>
<br />
<br />
<br />
Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comtag:blogger.com,1999:blog-2229468562774492653.post-90871443656850133632017-11-05T01:44:00.001-05:002017-11-06T09:21:48.010-06:00The Cost Center Trap<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOYvbS4J9T9gvivyzk6mk48-zzs4slTPWvdjaOm2btD-cuL5v8fc_LnuFaMqQ7m3Dzta7BjrzgOLGJwAe8CPjyPp0dUZ1eNV9yy5d5D59hbs2uojPlZycH_JkH4WHpywndgu_6_ltIsmEX/s1600/cost+center+trap.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="403" data-original-width="524" height="153" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOYvbS4J9T9gvivyzk6mk48-zzs4slTPWvdjaOm2btD-cuL5v8fc_LnuFaMqQ7m3Dzta7BjrzgOLGJwAe8CPjyPp0dUZ1eNV9yy5d5D59hbs2uojPlZycH_JkH4WHpywndgu_6_ltIsmEX/s200/cost+center+trap.png" width="200" /></a>In the 1960’s, IT was largely an in-house back-office function focused on process automation and cost reduction. Today, IT plays a significant strategic and revenue role in most companies, and is deeply integrated with business functions. By 2010, over 50% of firms’ capital spending was going to IT, up from 10-15% in the 1960’s.<span style="font-size: xx-small;">[1]</span> But one thing hasn't changed since the 1960’s: IT has always been considered a cost center. You are probably thinking "Why does this matter?" Trust me, cost center accounting can be a big trap.<br />
<br />
Back in the mid 1980’s, Just-in-Time (JIT) was gaining traction in manufacturing companies. JIT always drove inventories down sharply, giving companies a much faster response time when demand changed. However, accounting systems count inventory as an asset, so any significant reduction in inventory had a negative impact on the balance sheet. Balance sheet metrics made their way into senior management metrics, so successful JIT efforts tended to make senior managers look bad. Often senior management metrics made their way down into the metrics of manufacturing organizations, and when they did, efforts to reduce inventory were half-hearted at best. A generation of accountants had to retire before serious inventory reduction was widely accepted as a good thing.<span style="font-size: xx-small;">[2]</span><br />
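The balance-sheet effect is simple arithmetic. Here is a minimal sketch, with invented numbers, of why an inventory-as-asset rule made a successful JIT effort look bad:

```python
# Hypothetical balance-sheet figures, in dollars.
cash = 1_000_000
inventory_before_jit = 4_000_000
inventory_after_jit = 1_000_000   # JIT drives inventory down sharply

# Inventory counts as an asset, so cutting it shrinks total assets.
assets_before = cash + inventory_before_jit
assets_after = cash + inventory_after_jit

print(assets_before)  # 5000000
print(assets_after)   # 2000000: the factory responds faster to demand,
                      # but the metric senior managers saw got worse
```

The operational gain – faster response to changing demand – never appears on the balance sheet, which is exactly why metrics derived from it penalized JIT.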
<br />
Returning to the present, being a cost center means that IT performance is judged – from an accounting perspective – solely on cost management. Frequently these accounting metrics make their way into the performance metrics of senior managers, while contributions to business performance tend to be deemphasized or absent. As the metrics of senior managers make their way down through the organization, a culture of cost control develops, with scant attention paid to improving overall business performance. Help in delivering business results is appreciated, of course, but rarely is it rewarded, and rarer still is the cost center that voluntarily accepts responsibility for business results.<br />
<br />
Now let’s add an Agile transformation to this cost center culture. Let’s assume that the transformation is supposed to bring benefits such as faster time to market, more relevant products, and better customer experiences. And let’s assume that the cost center metrics do not change, or if they do change, process metrics such as number of agile teams and speed of deployment are added. I’ll wager that very few of those agile teams are likely to focus on improving overall business performance. The incentives send a clear message: business performance is not the responsibility of a cost center.<br />
<br />
Being in a cost center can be demoralizing. You aren’t on the A team that brings in revenue, you’re on the B team that consumes resources. No matter how well the business performs, you’ll never get credit. Your budget is unlikely to increase when times are good, but when times are tight, it will be the first to be cut. Should you have a good idea, it had better not cost anything, because you can’t spend money to make money. If you think that a bigger monitor would make you more efficient, good luck making your case. Yet if your colleagues in trading suggest larger monitors will help them generate more revenue, the big screens will show up in a flash.<span style="font-size: xx-small;">[3]</span><br />
<br />
Let’s face it, unless there are mitigating circumstances, IT departments that started out as cost centers are going to remain cost centers even when the company attempts a digital transformation. What kind of mitigating circumstances might help IT escape the cost center trap?<br />
<ol>
<li><b>There is serious competition from startups. </b><br />Startups develop their software in profit centers; they haven’t learned about cost centers yet. And in a competitive battle, a profit center will beat a cost center every time.</li>
<li><b>IT is recognized as a strategic business driver.</b><br />You would think that a digital transformation would be undertaken only after a company has come to realize the strategic value of digital technology, but this is not the case. IT has been treated as if it were an outside contractor for so long that it is difficult for company leaders to think of IT as a strategic business driver, integral to the company's success going forward.<b><br /></b></li>
<li><b>A serious IT failure has had a huge impact on business results.</b><br />When it becomes clear exactly how dependent a profit center is on a so-called cost center, people in the profit center are often motivated to share their pain with IT. Smart IT departments will use this opportunity to share the gain also.</li>
</ol>
Many people in the Agile movement preach that teams should have responsibility for the outcomes they produce and the impact of those outcomes. But responsibility starts at the top and is passed down to teams. When IT is managed as a cost center with cost objectives passed down through the hierarchy, it is almost impossible for team members from IT to assume responsibility for the business outcomes of their work. When IT metrics focus on cost control, digital transformations tend to stall.<br />
<br />
Every ‘full stack team’ working on a digital problem should have ‘full stack responsibility’ for results, and that responsibility should percolate up to the highest managers of every person on the team. Business results, not cost, should receive the focused attention of every member of the team, and every incentive that matters should be aimed at reinforcing this focus.<br />
<br />
<h2>
<span style="color: #0b5394; font-size: large;">The Capitalization Dilemma</span></h2>
Let’s return to the surprising assertion that in 2010, over 50% of firms’ capital spending was going to IT.<span style="font-size: xx-small;">[1]</span> One has to wonder what was being capitalized. Yes, there were plenty of big data centers that were no doubt capitalized, since the movement to the cloud was just beginning. But in addition to that, a whole lot of spending on software development was also being capitalized. And herein lies the seeds of another undue influence of accounting policies over IT practices.<br />
<br />
Software development projects are normally capitalized until they are “done” – that is, they reach “final operating capability” and are turned over to production and maintenance.<span style="font-size: xx-small;">[1]</span> But when an organization adopts continuous delivery practices, the concept of final operating capability – not to mention maintenance – disappears. This creates a big dilemma because it's no longer clear when, or even if, software development should be capitalized. Moving expenditures from capitalized to expensed not only changes whose budget the money comes from, it can have tax consequences as well. And what happens when all that capitalized software (which, by the way, is an asset) vanishes? Just as in the days when JIT was young, continuous delivery has introduced a paradigm shift that messes up the balance sheet.<br />
<br />
But the balance sheet problem is not the only issue; depreciation of capitalized software can wreak havoc as well. In manufacturing, the depreciation of a piece of process equipment is charged against the unit cost of products made on that equipment. The more products that are made on the equipment, the less cost each product has to bear. So there is a strong incentive to keep machines running, flooding the plant with inventory that is not currently needed. In a similar manner, the depreciation of software makes it almost impossible to ignore its sunk cost, which often drives sub-optimal usage, maintenance, and replacement decisions.<br />
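The incentive works like simple division. A sketch with made-up numbers of how per-unit depreciation rewards overproduction:

```python
# Hypothetical: a machine depreciates $100,000 per year, regardless of use.
ANNUAL_DEPRECIATION = 100_000

def unit_depreciation_cost(units_produced: int) -> float:
    """Depreciation charged against each unit made on the machine this year."""
    return ANNUAL_DEPRECIATION / units_produced

# Running the machine flat out makes each unit look cheaper on paper,
# even if the extra units just pile up as unneeded inventory.
print(unit_depreciation_cost(10_000))  # 10.0 dollars per unit
print(unit_depreciation_cost(50_000))  # 2.0 dollars per unit
```

Capitalized software creates the same pressure: the sunk cost sits on the books, biasing decisions toward “using up” the asset rather than toward what the business actually needs.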
<br />
Capitalization of development creates a hidden bias toward large projects over incremental delivery, making it difficult to look favorably upon agile practices. Hopefully we don't have to wait for another generation of accountants to retire before delivering software rapidly, in small increments, is considered a good thing.<br />
<br />
To summarize, the cost center trap and the capitalization dilemma both create a chain reaction:<br />
<ol>
<li>Accounting drives metrics.<br />⇩</li>
<li>Metrics drive culture.<br />⇩</li>
<li>Culture eats process for lunch.</li>
</ol>
The best way to avoid this is to break the chain at the top – in step 1. Stop letting accounting drive metrics. Alternatively, if accounting metrics persist at the senior management level, then break the chain at step 2 – do not pass accounting metrics down the reporting chain; do not let them drive culture. When teams focus on improving the performance of the overall business, accounting metrics should move in the right direction on their own; if they don't then clearly something is wrong with the accounting metrics.<br />
<br />
<h2>
<span style="color: #0b5394; font-size: large;">Beware of Proxies</span></h2>
This year Jeff Bezos's annual letter to Amazon shareholders<span style="font-size: xx-small;">[4]</span> listed four essentials that help big companies preserve the vitality of a startup: customer obsession, a skeptical view of proxies, the eager adoption of external trends, and high-velocity decision making. These seem pretty clear, except maybe the second one: a skeptical view of proxies. Just what are proxies? Bezos explains:<br />
<blockquote class="tr_bq">
“A common example is process as proxy. Good process serves you so you can serve customers. But if you’re not watchful, the process can become the thing. This can happen very easily in large organizations. The process becomes the proxy for the result you want. You stop looking at outcomes and just make sure you’re doing the process right. Gulp.”</blockquote>
<blockquote class="tr_bq">
“Another example: market research and customer surveys can become proxies for customers – something that’s especially dangerous when you’re inventing and designing products.”</blockquote>
Here are some common proxies we find in software development:<br />
<blockquote class="tr_bq">
Accounting metrics are proxies, and not very good ones at that, because they encourage local sub-optimization. </blockquote>
<blockquote class="tr_bq">
Project metrics – cost, schedule, and scope – are proxies. Worse, these proxies are rarely validated against actual outcomes. </blockquote>
<blockquote class="tr_bq">
“The Business” is a proxy for customers. Generally speaking, so is the product owner.</blockquote>
Proxies should be resisted, Bezos argues, if you want a vibrant startup culture in your company. But without proxies, how do you manage the dynamic and increasingly important IT organization? You make a habit of measuring what really matters – skip the proxies and focus on outcomes and impact.<br />
<br />
In his excellent book, “A Seat at the Table,”<span style="font-size: xx-small;">[5]</span> Mark Schwartz proposes that IT governance and oversight should begin with strategic business objectives and produce investment themes that accomplish these objectives. IT leaders fund teams to produce desirable outcomes that will have impact on the strategic objectives. Note that these outcomes are not proxies; they are real, measurable progress toward the strategic objective. Regular reviews of teams’ progress – quantified by these measurable outcomes – provide leaders with insight, flexibility, and an appropriate level of control. At the same time, detailed decisions are made by the people closest to customers after careful investigation, experimentation, and learning.<br />
<br />
Schwartz concludes: "this approach can focus IT planning, reduce risk, eliminate waste, and provide a supportive environment for teams engaged in creating value."<span style="font-size: xx-small;">[5]</span> What's not to like?<br />
<br />
______________________<br />
Footnotes:<br />
<br />
[1] From <a href="http://ieeexplore.ieee.org/document/8012308/" target="_blank">“What is Digital Intelligence” by Sunil Mithas and F. Warren McFarlan, IEEE Computing Edge, November 2017.</a> Pg.9.<br />
<br />
[2] The 1962 book <a href="https://www.amazon.com/exec/obidos/ASIN/0226458121/poppendieckco-20" target="_blank">“The Structure of Scientific Revolutions” by Thomas Kuhn</a> discussed how significant paradigm shifts in science do not take
hold until a generation of scientists brought up with the old paradigm finally
retire.<br />
<br />
[3] Thanks to Nick Larsen. <a href="https://stackoverflow.blog/2017/02/27/employer-see-software-development-cost-center-profit-center/" target="_blank">Does Your Employer See Software Development as a Cost Center or a Profit Center?</a><br />
<br />
[4] <a href="https://www.amazon.com/p/feature/z6o9g6sysxur57t" target="_blank">Jeff Bezos - Letter to Shareholders - April 12, 2017</a><br />
<br />
[5] <a href="http://www.amazon.com/exec/obidos/ASIN/1942788118/poppendieckco-20" target="_blank">"A Seat at the Table" by Mark Schwartz</a><br />
Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comtag:blogger.com,1999:blog-2229468562774492653.post-68037452278326918542017-09-23T04:30:00.000-05:002017-11-05T10:29:13.391-06:00The Only Country in the World<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsuA4-j95-SA6kPOnw0JUtGrOy2fP3ddCUxUcYm1hYTAb_U5IHDqwTOBmjHZUC8RP3et6-aIV3L8ZZWDPkZGsyJBVJUsEqYXg_uLSOVZsTi5Ct_4R4agTKW2_fUEsg_GDiNPNfFganKFnn/s1600/World.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="721" data-original-width="850" height="169" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsuA4-j95-SA6kPOnw0JUtGrOy2fP3ddCUxUcYm1hYTAb_U5IHDqwTOBmjHZUC8RP3et6-aIV3L8ZZWDPkZGsyJBVJUsEqYXg_uLSOVZsTi5Ct_4R4agTKW2_fUEsg_GDiNPNfFganKFnn/s200/World.png" width="200" /></a></div>
Software systems that interact with people speak volumes about the people who designed them. In particular, software systems used by travelers often send a clear message: “This is the only country in the world. If you are an outsider, you are not welcome.”<br />
<br />
Let’s start with the US. If you want to buy gas at the pump – as almost anyone who travels by car needs to do occasionally – and you don’t have a US zip code, you are out of luck. The gas pumps will require a five-digit zip code that matches your home zip code, which, of course, you don’t have. US software systems for purchasing gas are very clear: If you don’t live in the US, you can’t buy gas here.<br />
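A hypothetical sketch makes the design choice visible; the function names and patterns below are illustrative, not any vendor's actual code:

```python
import re

def us_only_postal_check(code: str) -> bool:
    """The pump's apparent rule: exactly five digits (a US zip code)."""
    return bool(re.fullmatch(r"\d{5}", code))

def traveler_friendly_postal_check(code: str) -> bool:
    """A looser rule that also admits common international formats,
    e.g. UK 'SW1A 1AA', Canadian 'K1A 0B1', Dutch '1012 AB'."""
    return bool(re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9 -]{1,9}", code))

print(us_only_postal_check("55401"))              # True: Minneapolis resident
print(us_only_postal_check("SW1A 1AA"))           # False: London visitor, no gas
print(traveler_friendly_postal_check("SW1A 1AA")) # True
```

The real pump also matches the code against the card's billing address, but the rejection of every non-US traveler happens before that step ever runs – the validation rule itself assumes there is only one country.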
<br />
Not that it’s easy for me to buy gas in Europe, because there I need a chip-and-pin card. But credit card companies in the US have settled on chip-and-signature cards, effectively preventing me from purchasing gas at a pump in Europe. My European friends have no sympathy, since they can’t purchase gas in the US either.<br />
<br />
Of course, the problems with my chip-and-signature card do not lie in the gas pump software, but in the choice made by US credit card issuers to use signature as the authentication method. We all know that signature authentication is a joke which leads to a far less secure credit card, but in addition, it prevents me from using the pin authentication systems that are common outside the US. My credit card company has issued me a chip-and-signature card that they claim is a “travel card” – which would be true if the US were the only country in the world. But should I happen to travel to another country, not only are gas pumps off limits to my chip-and-signature card, but I can’t purchase train or bus tickets – or anything sold at a kiosk.<br />
<br />
There are other countries where software systems used by travelers are limited to residents. For example, in the Netherlands, train tickets are typically purchased through a bank account which – you guessed it – must be at a Netherlands bank. Earlier this year, I was unable to purchase NS train tickets online with a credit card; I had to get a colleague in the Netherlands to purchase online tickets and email them to me. I didn't want to chance getting tickets once I arrived, since I understand there are very few NS ticket kiosks usable by outsiders.<br />
<br />
In the UK, there are very nice train discount schemes; for example, two people traveling together can get serious discounts. The catch is that they must first purchase a discount card with pictures of the two travelers, which can easily be obtained online, but must be mailed to a UK address. True, it is possible to obtain a discount card at a train station, but not at Heathrow – and where do you suppose most travelers arrive? Unlucky travelers without a UK address must pay full price for (expensive) Heathrow Express tickets, and then stand in line at Paddington with the proper paper applications and photographs to get a discount card.<br />
<br />
Attention UK software designers: did it occur to you that some people don’t have a UK address? How hard would it be to charge a bit more for shipping to addresses outside the UK?<br />
<br />
You would not think that Sweden would belong to the club of countries with software systems designed as if it were the only country in the world. But when we arrived at Arlanda airport on a Friday night and tried to buy the special weekend two-for-one ticket on the Arlanda Express, it was not on the kiosk menu. (Yep, I was using a chip-and-pin card – my debit card!). I searched and searched and finally saw the message stuck to the kiosk below the screen: A recent change had been made: now the special discount ticket could only be purchased through the Arlanda Express app or online, not at the kiosk.<br />
<br />
Reading between the lines, this is clearly an attempt to limit the best Arlanda Express ticket pricing to Swedish residents. "Not so!" the software designers probably argued. "Anyone can load the app and buy a ticket." How am I supposed to load an app, validate my payment method, and buy a ticket before the train leaves – all without internet access? I complained to the train conductor, who said he thought the scheme was terrible – he has listened to complaints from countless deeply annoyed visitors to Sweden – would I please complain directly to customer service? As I was composing my complaint email on the train, I had to listen to messages about how hard Arlanda Express was working to make our experience wonderful. Yes, but only if you happen to live in Sweden.<br />
<br />
We took a taxi to our Stockholm hotel from the train station and tried to pay the driver in cash, only to learn that our Swedish money was out of date and no longer legal tender. So I asked at the hotel desk how to change the old notes into new ones. The person at reception was very helpful – she told me that I could mail the cash in with an on-line form and the money would be deposited in my bank account, even if it were a “foreign” account. I was skeptical. And sure enough, when we looked at the form (which was in Swedish) we found that only bank accounts with IBAN numbers would work. Those of us from countries without IBAN numbers are apparently too foreign to merit a convenient way to get our money back, even though we are the most likely people to have the old currency.<br />
<br />
Clearly there are far too many software systems in the travel industry that are built as if the local country were the only country in the world. This is a plea to all the software teams designing systems that might be used by travelers from another country – or might be used by your customers when they travel to another country – have you built your system as if your country were the only country in the world? Why not try a few use cases for travelers from, and to, other countries? We exist, you know, and we’re getting tired of arrogance embedded in software.<br />
<div>
<br /></div>
Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comtag:blogger.com,1999:blog-2229468562774492653.post-34202101230458487002017-01-14T13:26:00.001-06:002019-07-14T11:55:16.122-05:00The End of Enterprise IT
<div style="text-align: left;">
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaj2fyF6fGwASG9Y3xoN4UHJ9w9z-hPuI7vmfFZLwS2BipgUpRLZBu-9vZKWEtrDdP8_6h3ubZAl6ulmqFnqECG7QN-dQmbUNsFRA4_PQAiUvlaAEK9AN0yG5RWNtjtOdl-E6eM6KYzAmk/s1600/left+behind.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="578" data-original-width="563" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaj2fyF6fGwASG9Y3xoN4UHJ9w9z-hPuI7vmfFZLwS2BipgUpRLZBu-9vZKWEtrDdP8_6h3ubZAl6ulmqFnqECG7QN-dQmbUNsFRA4_PQAiUvlaAEK9AN0yG5RWNtjtOdl-E6eM6KYzAmk/s200/left+behind.png" width="194" /></a></div>
In 2015, the employees at ING Netherlands headquarters – over 3,000 people from marketing, product management, channel management, and IT development – were told that their jobs had disappeared. Their old departments would no longer exist; small squads would replace them, each with end-to-end responsibility for making an impact on a focused area of the business. Existing employees would fill the new jobs, but they needed to apply for the positions.<span style="font-size: xx-small;">[1]</span></div>
<br />
It was a bold move for the Netherlands bank. The leaders were giving up their traditional hierarchy, detailed planning, and “input steering” (giving directions). Instead, they would trust empowered teams, informal networks, and “output steering” (responding to feedback) to move the bank forward. The bank was not in trouble; it did not really need to go through such a dramatic change. What prompted this bet-your-company experiment?<br />
<br />
The change had been years in the making. After initial experiments in 2010, the IT organization put aside waterfall development in favor of agile teams. As successful as this change was, it did not make much difference to the bank, so Continuous Delivery and DevOps teams were added to increase feedback and stability. But still, there was not enough impact on business results. Although there were ample opportunities for business involvement on the agile teams and input into development priorities, the businesses were not organized to take full advantage of the agile IT organization. Eventually, according to Ron van Kemenade (CIO of ING Netherlands from 2010 until he became CIO of ING Bank in 2013):<span style="font-size: xx-small;">[2]</span> <br />
<blockquote class="tr_bq">
The business took it upon itself to reorganize in ways that broke down silos and fostered the necessary end-to-end ownership and accountability. Making this transition … proved highly challenging for our business colleagues, especially culturally. But I tip my hat to them. They had the guts to do it.</blockquote>
The leadership team at ING Netherlands had examined its business model and come to an interesting conclusion: their bank was no longer a financial services company, it was a technology company in the financial services business. The days of segmenting customers by channel were over. The days of push marketing were over. Thinking forward, they understood that winning companies would use technology to provide simple, attractive customer journeys across multiple channels. This was true for companies in the media business, the search business, most retail businesses, and it was certainly true for companies in the financial services business. Moreover, expectations for engaging customer interactions were not being set by banks – they were being set by media and search and retail companies. Banks had to meet these expectations just to stay in the online game.<br />
<br />
ING Netherlands’ leadership team decided to look to other technology companies, rather than banks, for inspiration. For example, on a trip to the Google IO developers conference Ron van Kemenade was impressed by the amazing number of enthusiastic, engaged engineers at Google. He realized that such enthusiasm could not surface in his company, because the culture did not value good engineering.<br />
<br />
Let’s be clear, engineering is about using technology to solve tough problems; problems like how can we process a mortgage with a minimum of hassle for customers? How can we reduce the cost of currency exchange and still make a profit? How might we leverage Europe’s movement to open API’s for our customers’ advantage? These are the kinds of questions that are best answered by a small team of crack engineers working closely with people who deeply understand the customer journey. But at ING Netherlands, technology improvements were being worked out by people in the commercial business who would then tell the engineers what to develop. Not only is this a poor way to attract top engineers, it is the wrong way to create innovative solutions inspired by the latest technology.<br />
<br />
The leaders at ING Netherlands decided to investigate how top technology companies attract talented people and come up with engaging products. Through concentrated visits to some of the most attractive technology companies, they saw a common theme – these companies did not have traditional enterprise IT departments even though they were much bigger than any bank. Nor did they have much of a hierarchical structure. Instead, they were organized in teams – or squads – that had a common purpose, worked closely with customers, and decided for themselves how they would accomplish their purpose.<br />
<br />
ING Netherlands decided that if it was going to be a successful technology company and attract talented engineers, it had to be organized like a technology company. Studying the best technology companies convinced them that they needed to change – and the change had to include the whole company, not just IT. The bank had already modularized its architecture, streamlined and automated provisioning and deployment, moved to frequent deployments, and formed agile teams. But this was done within the IT department rather than across the organization, and the results were not exceptional. Now it was time to create a digital company across all functions.<br />
<br />
They chose to adopt an organizational structure in which small teams – ING calls them squads – accept end-to-end responsibility for a consumer-focused mission. Squads are expected to make their own decisions based on a shared purpose, the insight of their members, and rapid feedback from their work. Squads are grouped into tribes of perhaps 150 people that share a value stream (e.g. mortgages), and within each tribe, chapter leads provide functional leadership. Along with the new organizational structure, ING’s leadership team worked to create a culture that values technical excellence, experimentation, and customer-centricity.<br />
<br />
So how well did this major organizational change work? Certainly, it was not without problems. Some people did not want to work in the new environment, and there was not necessarily a role for everyone. So there were layoffs. But the people who stayed were intrigued by the new way of working and quickly became acclimated to their new jobs.<br />
<br />
Another problem involved answering the question “What makes a ‘good’ engineer?” For this the bank adopted the Dreyfus model of skill acquisition (novice, advanced beginner, competent, proficient, and expert). It set up an internal ‘academy’ of classes – usually taught by senior engineers – to help everyone develop the skills needed for a future in a technology company.<br />
<br />
Perhaps the biggest issue is one that anyone with a background in organizational development would expect – creating alignment across the many autonomous teams has been a formidable challenge. The bank needs to make major changes and develop breakthrough innovations; but these require coordinated action across multiple, supposedly autonomous, teams. Even the top technology companies ING bank studied have not really solved this problem. Ron van Kemenade summarized the problem this way:<span style="font-size: xx-small;">[2]</span><br />
<blockquote class="tr_bq">
We had assumed that alignment would occur naturally because teams would view things from an enterprise-wide perspective rather than solely through the lens of their own team. But we’ve learned that this only happens in a mature organization, which we’re still in the process of becoming.</blockquote>
Centrally driven program management is now used to arbitrate priority conflicts and create alignment, while standardization of back end systems (e.g. data centers) and support functions helps maintain the operational excellence and regulatory compliance necessary at a large bank.<br />
<br />
Despite the challenges, ING Netherlands views its new organizational structure as a significant success with sizable benefits to the company. The strategy adopted by ING Netherlands – an organizational structure composed of small, integrated teams, along with an emphasis on simple customer journeys, automated processes, and highly skilled engineers – is expected to spread to other parts of ING Bank.<br />
<br />
The moral of this story is simple: agile transformations are not about transforming IT, they are about transforming organizations. If you are going through an agile transformation in your IT department, you are thinking too narrowly. Digitization must be an organization-wide experience.<br />
_________<br />
Footnotes:<br />
<br />
<div class="MsoEndnoteText">
<span class="MsoEndnoteReference">[1]</span> From: <a href="http://www.mckinsey.com/industries/financial-services/our-insights/ings-agile-transformation" target="_blank">ING’s agile transformation</a>,
an interview with Peter Jacobs, CIO of ING Netherlands, and Bart Schlatmann, former COO of ING Netherlands, in McKinsey Quarterly, January 2017. See also: <a href="https://www.youtube.com/watch?v=jCQsTmCeoC8&t=154s" target="_blank">Software Circus Cloudnative Conference keynote by Peter Jacobs</a>. (Peter Jacobs replaced Ron van Kemenade as CIO of ING Netherlands in 2013.)</div>
<div class="MsoEndnoteText">
<br /></div>
<div class="MsoEndnoteText">
<span class="MsoEndnoteReference">[2]</span> From: <a href="https://www.bcgperspectives.com/content/interviews/technology-digital-financial-institutions-ron-van-kemenade-building-cutting-edge-banking-it-function/" target="_blank">Building a Cutting-Edge Banking IT Function</a>, An Interview with Ron van Kemenade, CIO ING Bank, by Boston Consulting Group. See also talks by Ron van Kemenade: <a href="https://www.youtube.com/watch?v=EKEaMeaiZOw&t=1482s" target="_blank">Nothing Beats Engineering Talent…The AGILE Transformation at ING</a> and <a href="https://www.youtube.com/watch?v=rHDs-GkgbKQ&t=1566s" target="_blank">The End of Traditional IT</a>.</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br /></div>
Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comtag:blogger.com,1999:blog-2229468562774492653.post-91806139642782429862016-09-30T15:51:00.000-05:002016-09-30T17:21:57.359-05:00The Two Sides of Teams<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeoXONiV6xO6tg-pW96_RKVKcJBod6S_K7E6DrfMyUBBQIWySc_Locwxh1MYRB9P0sdrSyuMB5UT9smud0oN53u6TdPoO75RTBggldGGoTGV8O27VLdyFKtbMuSCpr-8AN7yLtO9rqABNg/s1600/deliberation.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeoXONiV6xO6tg-pW96_RKVKcJBod6S_K7E6DrfMyUBBQIWySc_Locwxh1MYRB9P0sdrSyuMB5UT9smud0oN53u6TdPoO75RTBggldGGoTGV8O27VLdyFKtbMuSCpr-8AN7yLtO9rqABNg/s1600/deliberation.png" /></a></div>
<h2>
<span style="font-size: large;">Collective wisdom outweighs individual insights</span></h2>
Most of us believe that collective wisdom outweighs individual insights – or do we?<br />
<br />
Perhaps the biggest shortcoming of agile development practices is the way in which teams decide what to do. What product should be built? What features are most important? What consumer experiences will work best? These are the most important questions for the success of any product, and yet for the longest time, answering these questions has not been considered the responsibility of the development team or the DevOps team.<br />
<br />
Historically, someone with the role of business analyst, project manager, or product manager made the critical decisions about what to build. Or maybe some third party wrote a specification. While the technical team might question or push back on product decisions, too often the ideas and priorities were expected to come from outside. For example, the Scrum Product Owner role is often implemented in a way that favors individual insight over collective wisdom when it comes to critical product ideas and priorities.<br />
<br />
Until recently, there hasn’t been a practical process for tapping into the collective wisdom of everyone on the development team when making key product decisions. But now there is: it’s called the “<a href="http://www.amazon.com/exec/obidos/ASIN/B010MH1DAQ/poppendieckco-20" target="_blank">Design Sprint</a>.”[1] Combining a design thinking approach with the timeboxing of an agile sprint, this is a process that captures the collective wisdom of a diverse group of people. During the five-day process, the group not only makes critical product decisions, it creates prototypes and validates hypotheses with real customers as part of the process. <br />
<br />
Design sprints were developed by Google Ventures to help the companies in its portfolio uncover a variety of product ideas and quickly sort the good ideas out from the mediocre ones. Design sprints have been used at hundreds of companies with amazing success. While the Lean Startup approach starts by building a Minimum Viable Product (MVP) to test ideas, design sprints are a way to avoid building the MVP until you are sure you are starting with a good idea. They help you sort through a lot more ideas before starting to code.<br />
<br />
Where do all those good ideas come from? Design sprints do not depend on individuals or roles to generate ideas; the ideas are generated and validated by a diverse team tackling a tough problem. The insights of engineering and operations and support are combined with those of product and business and marketing to create true collective wisdom.<br />
<br />
There are a couple of roles in a design sprint; one is a “decider.” The decider generally avoids making any decisions unless called upon by teams that do not have enough information to make the decision themselves, yet need to make a choice in order to proceed. In a small company, this might be the CEO; in a larger company it is more likely a product manager. But let’s be clear – the decider is a leader who articulates a vision and strategy, but she does not usually come up with ideas, set priorities or select features. That is what teams do.<br />
<br />
Another recommended role is someone Google calls a “sprintmaster” – a facilitator who plans, leads, and follows up on a five-day design sprint. This person is almost always a designer, because the facilitator’s job is to help teams use design thinking and design techniques to answer key product questions. For example, on the second day of the sprint, everyone develops their own ideas through a series of individual sketches; on the third day, teams review the sketches jointly and create a storyboard for a prototype – or maybe a few prototypes. On the fourth day, the prototypes are created, usually with design tools. On the fifth day, the prototypes are tested with real consumers as the team observes. When most of the people on a team have no design experience, it helps to have a designer lead them through the design process.<br />
<br />
Really good teams generate a lot of ideas. These ideas are quickly validated with real consumers and perhaps 10 or 20% of the ideas survive. This low survival rate is a good thing; investigating a lot of ideas dramatically increases the chances that one of them will be a winner. The trick is to have a very fast way for teams to generate, validate, and select the ideas that are worth pursuing – and the design sprint provides one good option.<br />
<br />
Of course, success requires a lot more than a diverse team and a good process.<br />
<br />
<h2>
<span style="font-size: large;">Deliberation Makes a Group Dumber</span></h2>
Most of us would be surprised by the idea that deliberation makes a group dumber. But that is the conclusion reached by respected authors Cass Sunstein and Reid Hastie in their sobering book <a href="http://www.amazon.com/exec/obidos/ASIN/B00O4CRR9C/poppendieckco-20" target="_blank">Wiser: Getting Beyond Groupthink to Make Groups Smarter</a>. The two set out to study the cognitive biases of teams, and found that groupthink plays a bigger role in group decision-making than most of us realize.<br />
<br />
There is no advantage in diversity on a team if those who are in the minority – those who are different or soft-spoken or are working in their second language – do not feel comfortable about sharing their unique perspective. Yet Sunstein and Hastie note that in most groups, deliberation is very likely to suppress insights that diverge from the first ideas expressed (anchoring bias) or the majority viewpoint (conformity bias).<br />
<br />
Brainstorming has come under criticism – for good reason – as a technique that favors talkative and confident team members over thoughtful members and those with undeveloped hunches. Brainwriting[2] is an alternative to brainstorming that gives each person time to think individually about the problem at hand and come up with ideas based on their unique background. Brainwriting is used on the second day of a design sprint, when individuals sketch their solution to the chosen problem. This gives everyone the time and space to develop their ideas, as well as a way to have these ideas anonymously presented to and discussed by the group.<br />
<br />
After a brainwriting exercise, a group will have generated maybe 40% more ideas than it would with brainstorming. Typically, a technique such as dot voting is used to prioritize the many ideas and select the best ones to pursue. Unfortunately, this is another technique that favors groupthink. Voting is likely to weed out hunches and fragile ideas before they have time to develop, so outlier ideas that come from those who think differently tend to be lost in a voting process.<br />
<br />
The lean approach to product development is pretty much the opposite of voting. Instead of narrowing options early, the lean strategy is to pursue multiple ideas that span the design space, gradually eliminating the ones that do not work. In a lean world, teams would not prioritize and select the most popular ideas after brainwriting – selection at this stage would be premature. Instead, teams would identify several very different ideas to explore, making sure to include outliers.<br />
<br />
It is important to ensure that the ideas which survive the selection process span a wide range of possibilities – otherwise much of the benefit of brainwriting is lost. One way to do this is to select ideas that have a champion eager to pursue the idea and one or two people interested in joining her. If small sub-teams are encouraged to explore the possibilities of outlier ideas, the group is more likely to benefit from its diversity. By giving those with minority opinions not only the opportunity to present their ideas but also the time and space to give their ideas a try, a much wider variety of ideas will be seriously considered.<br />
<br />
Consider this example: Matthew Ogle joined Spotify’s New York office in early 2015. For years he had been working on the problem of helping people discover appealing music, most recently in his own startup. He joined a Spotify team developing a discovery page, but he thought the process involved too much work – he thought discovery should be automatic. This was a radical idea at Spotify – so luckily, Ogle’s team did not vote on whether it ought to be pursued, because it would probably have died.<br />
<br />
Instead, Ogle joined Edward Newett, an engineer and expert at deep learning who was experimenting with the idea of a discovery playlist, to explore the possibility. When Ogle realized that algorithms could generate a playlist that was uncannily well matched to his tastes, he knew they were on to a good idea. The next step was to find a way to check out these magic playlists with more people.<br />
<br />
They tried an unusual approach – they generated playlists matched to Spotify employees’ tastes and sent them out with an email asking for feedback. Almost everyone loved their playlist, and it became clear that this idea was a winner. Through a lot of quick experiments, the idea was improved, and soon playlists were delivered to a few customers under the name “Discover Weekly.” As it scaled up, Discover Weekly proved to be wildly popular and has become a dramatic success.<br />
<br />
<h2>
<span style="font-size: large;">The Two Sides of Teams</span></h2>
There are two sides to teams. There is the side that needs to make its own decisions and the side that can turn decision-making into groupthink. There is the side that wants to leverage diversity and the side that tends to ignore the input from team members who are different. The point is this: if you believe in collective wisdom, be sure to collect all of the wisdom that is available. If you look closely and honestly at your current processes and team dynamics, you might be surprised at how much wisdom is locked in the minds of individuals who don’t feel comfortable participating in the give and take of a dynamic team.<br />
<br />
____________________________<br />
Footnotes:<br />
<br />
<span class="MsoFootnoteReference">[1]</span><span style="font-family: "calibri" , sans-serif; font-size: 11pt; line-height: 107%;"> See: <a href="http://www.amazon.com/exec/obidos/ASIN/B010MH1DAQ/poppendieckco-20">Sprint:
How to Solve Big Problems and Test New Ideas in Just Five Days</a> by Jake Knapp, John Zeratsky, and Braden
Kowitz. For a quick “how to” summary, see: <a href="https://developers.google.com/design-sprint/downloads/DesignSprintMethods.pdf">https://developers.google.com/design-sprint/downloads/DesignSprintMethods.pdf</a></span><br />
<br />
<span class="MsoFootnoteReference">[2]</span><span style="font-family: "calibri" , sans-serif; font-size: 11.0pt; line-height: 107%;"> <a href="https://www.fastcompany.com/3033567/agendas/brainstorming-doesnt-work-try-this-technique-instead">Brainstorming
Doesn't Work; Try This Technique Instead</a></span><br />
<div>
<br /></div>
Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comtag:blogger.com,1999:blog-2229468562774492653.post-2166996252400826482016-06-16T11:55:00.002-05:002016-06-16T12:25:39.699-05:00Integration Does. Not. Scale.<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-ywb8wj6PeqpEVWCLVeo2-w2FhfduXhm_M91_6y_GyfgNqOKiLlxWJh0jZwXN9DGgM5SDp5J4F-8luLgCrEp6uv6fGfvsAl7hBmCFU72nO9VCuqdikwbNpSec8xuaD5SVfOOfL5EeMDbG/s1600/back+office.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><span style="font-family: "arial" , "helvetica" , sans-serif;"><img border="0" height="177" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-ywb8wj6PeqpEVWCLVeo2-w2FhfduXhm_M91_6y_GyfgNqOKiLlxWJh0jZwXN9DGgM5SDp5J4F-8luLgCrEp6uv6fGfvsAl7hBmCFU72nO9VCuqdikwbNpSec8xuaD5SVfOOfL5EeMDbG/s200/back+office.png" width="200" /></span></a></div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">In times past, there was a difference between the front office of a business – designed to make a good impression – and the back office – a utilitarian place where most of the routine work got done. The first (and for a long time the predominant) use of computers in business centered around automating back office processes, so of course, IT was relegated to the back office. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">As businesses grew, various back office functions developed their own computer systems – one for purchasing, one for payroll, one for manufacturing, and so on. The manufacturing system in vogue when I was in a factory was called MRP – Material Requirements Planning. As time went on, MRP systems were expanded to the supply chain, and then to the rest of the business, where they acquired the name ERP – Enterprise Resource Planning.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Over time it became obvious that the disparate systems for each function were handling the same data in different ways, making it difficult to coordinate across functions. So IT departments worked to create a single data repository, which quite often resided in the ERP system. The ERP suite of tools expanded to include most back office processes, including customer relationship management, order processing, human resources, and financial management. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">The good news was that now all the enterprise data could be found in the single database managed by the ERP system. The bad news was that the ERP system became complex and slow. Even worse, enterprise processes had to either conform to “best practices” supported by the ERP suite or the ERP system had to be customized to support unique processes. In either case, these changes took a long time.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">ERP Systems Meet Digital Organizations </span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">As enterprise IT focused on implementing ERP suites and developing an authoritative system of record, the Internet became a platform for a whole new category of software, spawning new business models that did not fit into the traditional processes managed by ERP systems. Here are a few examples:</span><br />
<br />
<ol>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Many software offerings that used to be sold as products are now being sold “as a service”. However, ERP systems were designed to manage the manufacture and distribution of physical products; they don’t generally manage subscription services. </span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Some companies (Google for example) give away their services and sell advertising. Other companies (such as EBay and Airbnb) create platforms that unite consumers with suppliers, often disrupting traditional industries. In a platform business, the most critical processes focus on driving network effects by facilitating interactions between buyers and sellers. Although ERP systems can manage both suppliers and customers, they usually do not focus on the interactions between them.</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">The Internet of Things (IoT) brings real time data into many processes, changing the way they are best executed. For example, predictive maintenance of heavy equipment can be scheduled based on sensor data, resulting in better outcomes for customers and thus for the enterprise. ERP suites are intended to support standard practices; they struggle to support processes that change dynamically in response to digital input.</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Capitalizing on the availability of data generated by products, companies are moving to selling business outcomes rather than individual products (GE is an example). When you are selling engine thrust or lighting costs, rather than engines or lightbulbs, processes need to be focused on the customer context. ERP systems generally focus on internal processes.</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">ERP systems are supposed to provide a single, integrated record of important enterprise data, but that data rarely includes dynamic product performance data, information about consumer characteristics and preferences, or other information that has come to be called “Big Data”. This kind of information is becoming an extremely valuable resource, but there isn’t room in ERP databases to store and manage the massive amount of interesting data that is available.</span></li>
</ol>
<span style="font-family: "arial" , "helvetica" , sans-serif;">In summary, digitization is bringing the back office much closer to the front office, providing the data for dynamic decision-making, and substituting short feedback loops and data-driven interactions for “best practices.” Since enterprise ERP suites were not built for speed or rapidly changing processes, they are increasingly being supplemented with other systems that manage critical enterprise processes.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Postmodern ERP</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">In the last few years, in the wake of the success of Salesforce.com, many cloud-based software services have become available. Some target the entire enterprise (NetSuite for example), but many are focused on particular areas (e.g. human resources) or particular industries (e.g. construction). These services are finding an eager audience – even in companies that have existing ERP systems. Today, about 30% of the spend for IT systems is coming from business units outside of IT [<a href="http://www.enterprisetech.com/2015/06/17/bring-shadow-it-out-of-the-dark-gartner-tells-tech/" target="_blank">1</a>]. If they cannot get the software they need from their IT departments, business leaders are likely to purchase cloud-based services instead.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">The cloud reduces dependence on a company’s IT department, so it has become quite easy for various areas of the enterprise to independently adopt “best-of-breed” solutions specifically targeted at their needs, rather than use a single ERP suite across the enterprise. These best-of-breed systems are usually selected by line business leaders and hosted in the cloud. They tend to be faster to implement and more responsive to changing business situations than the enterprise ERP suite – partly because they are decoupled from the rest of the enterprise. Gartner calls the movement from a single ERP suite to a collection of ERP modules from multiple vendors “Postmodern ERP”[<a href="https://www.gartner.com/doc/3241819?srcId=1-3931087981" target="_blank">2</a>]. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Gartner warns that a multi-vendor ERP approach can lead to significant integration problems, and recommends that multiple vendors should not be used until the integration issues are sorted out. Of course, business leaders want to know why integration is important. IT departments typically respond that the ERP’s central database is the enterprise system-of-record; other ERP modules – financial reporting, for example – depend on this database for critical data. Without an integrated database, how will the rest of the enterprise be able to operate? How will the accounting department produce its required financial reports? </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Integration Does. Not. Scale</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">But hold on. There are plenty of very large companies that work remarkably well – and produce financial reports on time – without an integrated system-of-record. In fact, internet-scale companies have discovered that integration does not scale. If we go back to the year 2000, we find that Amazon.com had a traditional architecture – a big front end and a big back end – which got slower and slower as volume grew. In the early 2000s, Amazon abandoned its integrated back-end database in favor of independent services that manage their own data and communicate with each other exclusively through clearly defined interfaces. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">If we have learned one thing from internet-scale players, it’s that true scale is not about integration, it is about federation. Amazon runs a massive order fulfillment business on a platform built out of small, independently deployable, horizontally scalable services. Each service is owned by a responsible team that decides what data the service will maintain and how that data will be exposed to other services. Netflix operates with the same architecture, as do many other internet-scale companies. In fact, adopting federated services is a proven approach for organizations that wish to scale beyond their current limitations. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Let’s revisit the enterprise where business units prefer to run best-of-breed ERP modules to handle the specific needs of their business. This enterprise has two choices: </span><br />
<br />
<ol>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Integrate the various ERP modules and store their data in a single ERP database.</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Coordinate independently-maintained enterprise data through API contracts. </span></li>
</ol>
<br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">The problem with the first option is that integration creates dependencies across the enterprise. Each time a data definition in the central database is added or changed, every software module that uses the database must be updated to match the new schema. This makes the integrated database a massive dependency generator; the result is a monolithic code base where changes are slow and painful. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Enterprises that want to move fast will select the second option. They will move to a federated architecture in which each module owns and maintains its own data, with data moving between modules through well-defined, stable interfaces. As radical as this approach may seem, internet-scale businesses have been living with services and local data stores for quite a while now, and they have found that managing interface contracts is no more difficult than managing a single, integrated database.</span><br />
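To make the federated pattern concrete, here is a minimal Python sketch of one module owning its data and exposing it only through an agreed-upon contract. The service, module, and field names are hypothetical, invented for illustration rather than taken from any particular ERP product:

```python
from dataclasses import dataclass

# Hypothetical API contract: the orders module owns its data and exposes
# only this agreed-upon shape to the rest of the enterprise.
@dataclass(frozen=True)
class OrderSummary:
    order_id: str
    total_cents: int
    status: str

class OrdersService:
    """Owns its own data store; other modules never touch it directly."""
    def __init__(self):
        self._db = {}  # private to this service

    def place_order(self, order_id: str, total_cents: int) -> None:
        self._db[order_id] = {"total_cents": total_cents, "status": "open"}

    # The contract method: the only way other modules see order data.
    def get_summary(self, order_id: str) -> OrderSummary:
        row = self._db[order_id]
        return OrderSummary(order_id, row["total_cents"], row["status"])

class FinanceModule:
    """Consumes order data via the contract, not via a shared database."""
    def __init__(self, orders: OrdersService):
        self._orders = orders

    def revenue_cents(self, order_ids) -> int:
        return sum(self._orders.get_summary(i).total_cents for i in order_ids)
```

Because the finance module sees only `OrderSummary`, the orders team can reorganize its private store at will without a lockstep schema migration across the enterprise.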
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">What Scales</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Assume that every team responsible for a process can choose its own best-of-breed software module and is responsible for maintaining its own data in appropriately secure data stores. Then maintaining an authoritative source of data becomes an API problem, not a database problem. When the system-of-record for each process is contained within its own modules, new modules can be added to handle software-as-a-service, two-sided platforms, data from IoT sensors, customer outcomes, or other new business models as they evolve. These modules will exchange a limited amount of data through well-defined APIs with the credit, order fulfillment, human resources, and financial modules. Internally, the new modules will collect, store, and act upon as much unstructured data and real-time information as may be useful. More importantly, these modules can be updated at any time, independent of other modules in the system. In addition, they can be replicated horizontally as scale demands. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">It is the API contract, not the central database, that assures each part of the company looks at the same data in the same way. Make no mistake, these API contracts are extremely important and must be carefully vetted by each data provider with all of its consumers. API contracts take the place of database schemas, and data providers must ensure that their data meets the standards of a valid system-of-record. However, changes to an API contract are handled differently than most database schema changes. Each change creates a new version of the API; both old and new versions remain valid while other software modules are gradually updated to use the new version. A wise API versioning strategy eliminates the tight coupling that makes database changes so slow and cumbersome. Federation scales – while a central database approach does not – because with a well-defined API strategy, individual modules do not depend on other modules, so each module can be deployed independently and (usually) scaled horizontally. </span><br />
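A minimal Python sketch of such a versioning strategy, with hypothetical resource and field names: v1 and v2 of the same customer resource remain valid side by side while consumers migrate at their own pace:

```python
# Hypothetical example: v1 exposes a single "name" field; v2 splits it into
# "first_name"/"last_name". Both versions stay valid while consumers migrate.
def customer_v1(record: dict) -> dict:
    return {"id": record["id"],
            "name": f'{record["first"]} {record["last"]}'}

def customer_v2(record: dict) -> dict:
    return {"id": record["id"],
            "first_name": record["first"],
            "last_name": record["last"]}

# The provider routes by version instead of forcing a lockstep schema change.
VERSIONS = {"v1": customer_v1, "v2": customer_v2}

def get_customer(version: str, record: dict) -> dict:
    return VERSIONS[version](record)
```

A consumer still on v1 keeps working unchanged; once every consumer has moved to v2, the provider can retire the v1 handler on its own schedule.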
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">When you think of Enterprise ERP as a federation of independent modules communicating via APIs (rather than through a shared database), the problems with multi-vendor ERP systems fade because the system-of-record is no longer a massive dependency generator that requires lockstep deployments. With a federated approach, business leaders can move fast and experiment with different systems as they become available, and still synchronize critical enterprise data with the rest of the company. In addition, similar processes in different parts of the enterprise can use different applications to meet their unique needs without the significant tailoring expense encountered when a single ERP suite is imposed on the entire enterprise.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">What about Standardization?</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Won’t separate ERP modules lead to different processes in different parts of the enterprise? Yes, certainly. But the question is – under what circumstances are standard processes important? In the days of manual back office processes, there was a lot of labor-intensive work: drafting, accounting, phone calls, people moving paperwork from one desk to another. Standardization in this kind of operating environment made sense and could lead to significant efficiencies. But in a digitized world, the important thing is not uniformity; it is rapid and continuous improvement in each business area. Different processes for different problems in different contexts can be a very good thing.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Jeff Bezos agrees; he believes that the only path to serious scale is to have a lot of independent agents making their own decisions about the best way to do things. This belief was a key factor in the birth of Amazon Web Services, a $10 billion business that keeps on growing. Amazon began its journey away from a big back end by creating small, cross-functional teams with end-to-end responsibility for a service. These teams designed their own processes to fit their particular environment. Amazon then developed a software architecture and data center infrastructure that allowed these teams to operate and deploy independently. The rest is history.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">In Conclusion</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">It is time for enterprise processes to become federated instead of integrated. This is not a new path – embedded software has used a similar architecture for decades. Today, almost every successful internet-scale business has adopted some type of federated approach because it is the only way to scale beyond the limitations of the enterprise. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">As digitization brings back-office teams closer to consumers and providers, they must join with their front-office colleagues and form teams that are fully capable of designing and improving a process or a line of business. These “full stack” teams should be responsible for managing their own practices, technology and data, meeting industry standards for their particular areas. They should communicate with other areas of the enterprise on demand through well-defined interfaces. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">The good news is that you </span><span style="font-family: "arial" , "helvetica" , sans-serif;">can</span><span style="font-family: "arial" , "helvetica" , sans-serif;"> </span><span style="font-family: "arial" , "helvetica" , sans-serif;">gradually migrate to a federation from almost any starting point, including an enterprise-wide ERP system. Even better, as IT moves from enforcing compliance with the company’s ERP system to brokering interface contracts and ensuring data security, it becomes a business enabler rather than a bottleneck. And best of all, responsible full stack teams that solve their own problems will create attractive jobs for talented engineers and give business units control over their own digital destiny. </span><br />
<div>
<br /></div>
Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comtag:blogger.com,1999:blog-2229468562774492653.post-17176611442817471742016-02-10T17:06:00.000-06:002016-03-11T00:13:34.940-06:00The New Technology Stack<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimLQs6RiYi4cDpn5OSXe2HSioLBR8EPPvr4SIm39FJ0KfLL7mPn4OLkBIj7xI29jBO8f3yAWUBX6NziX54xCYO-aZnqcyiBXolxCm7h59EzFDjcojSGQm2NYlbthTo3zGLB4TSc-f8Q1zS/s1600/Evolution+of+IT.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimLQs6RiYi4cDpn5OSXe2HSioLBR8EPPvr4SIm39FJ0KfLL7mPn4OLkBIj7xI29jBO8f3yAWUBX6NziX54xCYO-aZnqcyiBXolxCm7h59EzFDjcojSGQm2NYlbthTo3zGLB4TSc-f8Q1zS/s400/Evolution+of+IT.png" width="400" /></a><br />
<br />
Over the last two decades, the software technology stack has undergone a rapid evolution, as this <a href="http://www.slideshare.net/dotCloud/why-docker2bisv4" target="_blank">diagram from Docker.io</a> lays out.<br />
<br />
<br />
The evolution continues. Today’s world of smart phones is giving way to tomorrow’s world of smart devices with sensors and actuators and not much more. The app layer will only get thinner.<br />
<br />
If you think this trend will not affect your organization, think again. Tony Scott, CIO of the US federal government, <a href="http://www.cio.com/article/2996268/cloud-computing/us-cio-tells-it-leaders-to-trust-the-cloud.html" target="_blank">advised CIO’s throughout the country to move to the cloud as fast as possible</a>. Why? Because the large cloud providers can provide more secure, less expensive, and more reliable infrastructure than most organizations can provide for themselves. Major industries, from banking to health care, are discovering the benefits of moving to the cloud. Thin apps and assembled services running on off-premises hardware will soon become the norm for most organizations, probably even yours.<br />
<br />
What does the cloud have to do with software development? Quite a bit, it turns out. In the cloud:<br />
<br />
<b>1. The development team is responsible for product design.</b><br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>Assembling services is a dynamic process, not a one-time affair.<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>The thin app is often the only differentiator in the stack.<br />
<div>
<br />
<b>2. The development team is responsible for its own infrastructure.</b><br />
<span class="Apple-tab-span" style="white-space: pre;"> When infrastructure is code, one </span>team does it all:<br />
design/code/test/deploy/monitor/maintain.<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>Keeping things running is a new challenge for many software engineers.</div>
<div>
<br />
<b>3. Apps must be immune to infrastructure and service failure.</b><br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>Stateless designs replace object-oriented designs.<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>Distributed, immutable data sets replace databases.<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>Things get done through producer/consumer chains.</div>
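The third point above can be sketched in a few lines of Python: a producer emits immutable event records onto a queue, and a stateless consumer derives its output purely from those events, so the chain can be killed and restarted without corrupting anything. All the names and the fee rule are illustrative, not from any particular system:

```python
from collections import deque
from typing import NamedTuple

class OrderEvent(NamedTuple):   # immutable record: consumers cannot mutate it
    order_id: str
    amount_cents: int

def producer(orders) -> deque:
    """Emit one immutable event per order onto a queue."""
    return deque(OrderEvent(oid, cents) for oid, cents in orders)

def consumer(queue: deque) -> list:
    """Stateless step: output is derived purely from the input events,
    so the whole chain can be restarted without losing hidden state."""
    results = []
    while queue:
        event = queue.popleft()
        fee = event.amount_cents // 10      # e.g. apply a hypothetical 10% fee
        results.append((event.order_id, fee))
    return results
```

In a real cloud deployment the deque would be a durable message service, but the design principle is the same: no step depends on mutable shared state.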
<div>
<br /></div>
<div>
So here’s the point: Practices designed for the problems of 1995 are not going to work for the problems of 2020. We need to frame today’s and tomorrow’s problems in a way that helps us to identify and tackle them effectively; we need to use fundamental principles to help us ask the right questions. [1]<br />
<br />
What are the right questions? Consider this guidance from Taiichi Ohno, the father of Lean:<br />
<blockquote class="tr_bq">
All we are doing is<b> looking at the time line</b>, from the moment the customer gives us an order to the point when we collect the cash. And we are <b>reducing the time line</b> by <b>reducing the non-value adding wastes</b>.</blockquote>
In the product development world, our timeline starts with a consumer problem instead of a customer order:<br />
<blockquote class="tr_bq">
We <b>look at the time line </b>from the moment our consumers experience a problem until that problem is resolved. And we <b>reduce the time line </b>by <b>reducing the non-value adding friction</b>.</blockquote>
The technology stack of 1995 generated different kinds of friction than you will find in a modern technology stack. When banks moved to mobile apps a few years ago, they discovered that app development requires an agile approach because the underlying platforms change all the time. While the old technology stack resisted agile practices, the cloud demands them. There is no place for large projects or long release cycles in the new technology stack; agile development is simply table stakes - you need it to play the cloud game.<br />
<br />
The new technology stack produces its own friction, a different kind of friction than was typically found in the old stack. This friction is particularly strong in organizations moving from the old to the new technology stack because the transition brings a lot of change to software development. Unfortunately, that change is not always well supported by the organization or welcomed by the software engineers.<br />
<blockquote class="tr_bq">
<b>Friction Generator </b><b>#1</b><b>:</b> Since the new technology stack virtually requires small deployments, the development team can - and should - become deeply involved in designing differentiated products using tight feedback loops. In short, the development team becomes a product team. But frequently this product team does not have the right people (designers, for example), the authority, or the process to make dynamic product decisions. Too often development teams are told what to develop, rather than being asked to move business measures in the right direction. A lot of friction can occur if the organizational structure does not support the concept of fully responsible product teams.</blockquote>
<blockquote class="tr_bq">
<b>Friction Generator #2: </b>The development team must engineer solutions to quality, reliability and resilience issues that arise after deployment. This requires a different mindset than was common with the old technology stack, when the development team sent their code to the ops department, whose job it was to keep the system running. In the cloud, a team procures and releases to its own infrastructure, and there is no one else to deal with the inevitable problems that occur. Product teams must have the capability, the charter, and the mindset to accept 24/7 responsibility for their deployed code.</blockquote>
<blockquote class="tr_bq">
<b>Friction Generator #3: </b>The new technology stack is designed to be fault tolerant, not failure proof. This means that any service or app must be able to fail and get restarted at any time, and not produce problems due to these interruptions. But writing "restartable" code [idempotent modules with immutable data sets] is new to most software engineers and is rarely taught in schools. Software engineers skilled at writing code for the new technology stack are in short supply and demand is intense. Good leadership, training, and support are required to help interested software engineers transition to the new languages and paradigms needed to thrive in the cloud.</blockquote>
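One common way to make code "restartable" is an idempotent handler keyed by message id, so a message redelivered after a crash and restart has no extra effect. A minimal Python sketch, with invented names (in production the dedup record and balances would live in a durable store, not in-memory collections):

```python
# The set and dict stand in for durable storage so the idea fits in a few lines.
processed_ids = set()
balances = {}

def handle_deposit(msg_id: str, account: str, cents: int) -> None:
    """Idempotent: redelivering the same message after a restart is a no-op."""
    if msg_id in processed_ids:
        return  # duplicate delivery: already applied, do nothing
    balances[account] = balances.get(account, 0) + cents
    processed_ids.add(msg_id)
```

Because the handler checks the message id before applying the deposit, the service can be failed and restarted at any point and the at-least-once delivery of the queue still yields exactly-once effects.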
<blockquote class="tr_bq">
<b>Friction Generator #4:</b> The old technology stack and associated batch processes encouraged extensive outsourcing, leaving many IT departments without software engineers or even data centers. Today, as software drives differentiation, many firms are attempting to bring software technology back in-house. But they often lack the management experience, organizational structure and personnel policies necessary to attract and retain the skilled software and reliability engineers they need for the new technology stack. </blockquote>
Today, almost every business has to face the fact that their most serious competition is likely to come from companies living in the new technology stack, unencumbered by the old way of doing things. Governments and non-profits must realize that the people they serve have their expectations set by experiences with the cloud. If your organization is living in the old paradigm, it’s time to move on; big back end systems are rapidly becoming the COBOL of the 21st century.<br />
<br />
To assess the current situation, take a look at the value stream – the stream of activities that deliver value to customers – and identify areas of friction. In the modern technology stack, friction generators tend to be either deeply technical or highly organizational in nature, as you can see from the discussion above. Unfortunately, these are not usually the problems that companies tackle when they move to modern software development. Why? Quite often the organizational structure is so entrenched that changing it is not considered. Or perhaps the people leading the transition do not understand the underlying technology and the problems presented by the new stack. In either case, the underlying problem becomes an elephant in the room that everyone ignores, while easier challenges - like adopting agile processes - are taken up.<br />
<br />
It is important to confront the deep-seated friction generators that people would rather ignore. Start by talking about the elephant, and then actively imagine what your world would be like without that elephant. Once you have a clear vision of the future, you can work out how to move constantly toward that vision by eliminating the most pernicious friction generators, one step at a time. This approach has helped teams and organizations around the world make steady progress in the right direction, and eventually the steady progress adds up to amazing accomplishments.<br />
<br />
Identifying, addressing, and overcoming challenging problems is one of the most engaging activities there is. People thrive when their day-to-day work involves getting good at conquering meaningful challenges. Companies do much better when they wake up the sleeping giant in each employee by encouraging them to reduce the friction that gets in the way of delivering value to customers.<br />
<br />
If your company is not the highly successful leader-in-its-field that you hoped it would be (and no company ever is), then waiting around for things to change is not likely to make the situation better. Round up your colleagues and assess the situation. Find the elephant in the room and imagine what things would be like if it were gone. And then – since you are smart engineers – you need to engineer a way to get that elephant out of the room. Quit waiting for someone else to do this for you. You’re on.<br />
______________________<br />
Footnote:<br />
1. One proven set of principles for tackling tough technology problems is the <a href="http://www.poppendieck.com/" target="_blank">Lean principles</a>: <i>Focus on Customers, Energize Workers, Reduce Friction, Enhance Learning, Increase Flow, Build Quality In, Keep Getting Better.</i><br />
<div>
<br /></div>
</div>
Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comtag:blogger.com,1999:blog-2229468562774492653.post-13492850791821904922016-01-22T00:35:00.001-06:002016-03-10T23:55:42.599-06:00Five World-Changing Software Innovations<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAF06VGjIM23zOAYOMU7QeZhkphqY3dfh_skCJ4WJeCrP2kB4AsP2ehyzi8XjEwVOAyzK0dCIxIOt9evF4KmEkBuVouhO7rD-uPeBASvS4MTOYre7lowGSmHioABv4BwVB_Q38px2tSM4H/s1600/Hurdle.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAF06VGjIM23zOAYOMU7QeZhkphqY3dfh_skCJ4WJeCrP2kB4AsP2ehyzi8XjEwVOAyzK0dCIxIOt9evF4KmEkBuVouhO7rD-uPeBASvS4MTOYre7lowGSmHioABv4BwVB_Q38px2tSM4H/s200/Hurdle.png" width="193" /></a><span style="font-family: "arial" , "helvetica" , sans-serif;">On the 15th anniversary of the <a href="http://agilemanifesto.org/" target="_blank">Agile Manifesto</a>, let's look at what else was happening while we were focused on spreading the Manifesto<span style="font-family: "arial" , sans-serif; font-size: 13.5pt; line-height: 107%;">’</span>s ideals. There have been some impressive advances in software technology since Y2K:</span><br />
<b style="font-family: Arial, Helvetica, sans-serif;"> 1. The Cloud</b><br />
<b style="font-family: Arial, Helvetica, sans-serif;"> 2. Big Data</b><br />
<b style="font-family: Arial, Helvetica, sans-serif;"> 3. Antifragile Systems</b><br />
<b style="font-family: Arial, Helvetica, sans-serif;"> 4. Content Platforms</b><br />
<b style="font-family: Arial, Helvetica, sans-serif;"> 5. Mobile Apps </b><br />
<br />
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;">The Cloud</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">In 2003 Nicholas Carr’s controversial article “IT Doesn’t Matter” was published in <i>Harvard Business Review</i>. He claimed that “the core functions of IT </span><span style="font-family: "arial" , "helvetica" , sans-serif;">– </span><span style="font-family: "arial" , "helvetica" , sans-serif;">data storage, data processing, and data transport” had become commodities, just like electricity, and they no longer provided differentiation. It’s amazing how right – and how wrong – that article turned out to be. At the time, perhaps 70% of an IT budget was allocated to infrastructure, and that infrastructure rarely offered a competitive advantage. On the other hand, since there was nowhere to purchase IT infrastructure as if it were electricity, there was a huge competitive advantage awaiting the company that figured out how to package and sell such infrastructure. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">At the time, IT infrastructure was a big problem – especially for rapidly growing companies like Amazon.com. Amazon had started out with the standard enterprise architecture: a big front end coupled to a big back end. But the company was growing much faster than this architecture could support. CEO Jeff Bezos believed that the only way to scale to the level he had in mind was to create small autonomous teams. Thus by 2003, Amazon had restructured its digital organization into small (two-pizza) teams, each with end-to-end responsibility for a service. Individual teams were responsible for their own data, code, infrastructure, reliability, and customer satisfaction.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Amazon’s infrastructure was not set up to deal with the constant demands of multiple small teams, so things got chaotic for the operations department. This led Chris Pinkham, head of Amazon’s global infrastructure, to propose developing a capability that would let teams manage their own infrastructure – a capability that might eventually be sold to outside companies. As the proposal was being considered, Pinkham decided to return to South Africa where he had gone to school, so in 2004 Amazon gave him the funding to hire a team in South Africa and work on his idea. By 2006 the team’s product, Elastic Compute Cloud (EC2), was ready for release. It formed the kernel of what would become Amazon Web Services (AWS), which has since grown into a multi-billion-dollar business.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Amazon has consistently added software services on top of the hardware infrastructure – services like databases, analytics, access control, content delivery, containers, data streaming, and many others. It’s sort of like an IT department in a box, where almost everything you might need is readily available. Of course Amazon isn’t the only cloud company – it has several competitors.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">So back to Carr’s article – Does IT matter? Clearly the portion of a company’s IT that could be provided by AWS or similar cloud services does not provide differentiation, so from a competitive perspective, it doesn’t matter. If a company can’t provide infrastructure that matches the capability, cost, accessibility, reliability, and scalability of the cloud, then it may as well outsource its infrastructure to the cloud.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Outsourcing used to be considered a good cost reduction strategy, but often there was no clear distinction between undifferentiated context (that didn’t matter) and core competencies (that did). So companies frequently outsourced the wrong things – critical capabilities that nurtured innovation and provided competitive advantage. Today it is easier to tell the difference between core and context: if a cloud service provides it then anybody can buy it, so it’s probably context; what’s left is all that's available to provide differentiation. In fact, one reason why “outsourcing” as we once knew it has fallen into disfavor is that today, much of the outsourcing is handled by cloud providers. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">The idea that infrastructure is context and the rest is core helps explain why internet companies do not have IT departments. For the last two decades, technology startups have chosen to divide their businesses along core and infrastructure lines rather than along technology lines. They put differentiating capabilities in the line business units rather than relegating them to cost centers, which generally works a lot better. In fact, many IT organizations might work better if they were split into two sections, one (infrastructure) treated as a commodity and the rest moved into (or changed into) a line organization. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;">Big Data</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">In 2001 Doug Cutting released Lucene, a text indexing and search program, under the Apache software license. Cutting and Mike Cafarella then wrote a web crawler called Nutch to collect interesting data for Lucene to index. But now they had a problem – the web crawler could index 100 million pages before it filled up the terabyte of data they could easily fit on one machine. At the time, managing large amounts of data across multiple machines was not a solved problem; most large enterprises stored their critical data in a single database running on a very large computer. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">But the web was growing exponentially, and when companies like Google and Yahoo set out to collect all of the information available on the web, currently available computers and databases were not even close to big enough to store and analyze all of that data. So they had to solve the problem of using multiple machines for data storage and analysis. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">One of the bigger problems with using multiple machines is the increased probability that one of the machines will fail. </span><span style="font-family: "arial" , "helvetica" , sans-serif;">Early in its history, Google decided to accept the fact that at its scale, hardware failure was inevitable, so it should be managed rather than avoided. This was accomplished by software that monitored each computer and disk drive in a data center, detected failure, kicked the failed component out of the system, and replaced it with a new component. This process required keeping multiple copies of all data, so when hardware failed the data it held was available in another location. Since recovering from a big failure carried more risk than recovering from a small failure, the data centers were stocked with inexpensive PC components that would experience many small failures. The software needed to detect and quickly recover from these “normal” hardware failures was perfected as the company grew. </span><br />
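The replicate-and-recover scheme described above can be modeled in a few lines of Python. This is a toy model of the general technique (keep three copies of every chunk; when a node fails, restore the replication factor from healthy nodes), not Google's actual implementation:

```python
REPLICAS = 3  # typical replication factor mentioned in the GFS paper

def place_chunks(chunk_ids, nodes):
    """Assign each chunk to REPLICAS distinct nodes, round-robin."""
    placement = {}
    for i, cid in enumerate(chunk_ids):
        placement[cid] = {nodes[(i + k) % len(nodes)] for k in range(REPLICAS)}
    return placement

def handle_failure(placement, failed, healthy):
    """Kick the failed node out and re-replicate under-replicated chunks."""
    for cid, holders in placement.items():
        holders.discard(failed)
        for node in healthy:
            if len(holders) == REPLICAS:
                break
            if node != failed:
                holders.add(node)  # copy the chunk to a healthy node
    return placement
```

Because every chunk survives on at least two other nodes, losing a machine never loses data; the system just copies the affected chunks somewhere else and carries on.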
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">In 2003 Google employees published two seminal papers describing how the company dealt with the massive amounts of data it collected and managed. <i><a href="http://static.googleusercontent.com/media/research.google.com/en//archive/googlecluster-ieee.pdf" target="_blank">Web Search for a Planet: The Google Cluster Architecture</a></i> by Luiz André Barroso, Jeffrey Dean, and Urs Hölzle described how Google managed its data centers with their inexpensive components. <i><a href="http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf" target="_blank">The Google File System</a> </i>by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung described how the data was managed by dividing it into small chunks and maintaining multiple copies (typically three) of each chunk across the hardware. I remember that my reaction to these papers was “So that’s how they do it!” And I admired Google for sharing these sophisticated technical insights. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Cutting and Cafarella had approximately the same reaction. Using the Google File System as a model, they spent 2004 working on a distributed file system for Nutch. The system abstracted a cluster of storage into a single file system running on commodity hardware, used relaxed consistency, and hid the complexity of load balancing and failure recovery from users. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">In the fall of 2004, the next piece of the puzzle – analyzing massive amounts of stored data – was addressed by another Google paper: <i><a href="http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf" target="_blank">MapReduce: Simplified Data Processing on Large Clusters</a> </i>by Jeffrey Dean and Sanjay Ghemawat. Cutting and Cafarella spent 2005 rewriting Nutch and adding MapReduce, which they released as Apache Hadoop in 2006. At the same time, Yahoo decided it needed to develop something like MapReduce, and settled on hiring Cutting and building Apache Hadoop into software that could handle its massive scale. Over the next couple of years, Yahoo devoted a lot of effort to converting Apache Hadoop – open source software – from a system that could handle a few servers to a system capable of dealing with web-scale databases. In the process, their data scientists and business people discovered that Hadoop was as useful for business analysis as it was for web search. </span><br />
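The canonical example in the MapReduce paper is counting words. The three phases – map, shuffle, reduce – can be sketched on a single machine in a few lines of Python; the real system's contribution was distributing exactly this pattern across thousands of machines with automatic failure recovery.

```python
from collections import defaultdict
from itertools import chain

def map_phase(documents):
    # map: emit a (word, 1) pair for every word in every document
    return chain.from_iterable(
        ((word, 1) for word in doc.split()) for doc in documents)

def shuffle(pairs):
    # shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: combine each key's values into a single result
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the cat sat", "the cat ran"]
word_counts = reduce_phase(shuffle(map_phase(docs)))
# word_counts == {"the": 2, "cat": 2, "sat": 1, "ran": 1}
```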
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">By 2008, most web scale companies in Silicon Valley – Twitter, Facebook, LinkedIn, etc. – were using Apache Hadoop and contributing their improvements. Then startups like Cloudera were founded to help enterprises use Hadoop to analyze their data. What made Hadoop so attractive? Until that time, useful data had to be structured in a relational database and stored on one computer. Space was limited, so you only kept the current value of any data element. Hadoop could take unlimited quantities of unstructured data stored on multiple servers and make it available for data scientists and software programs to analyze. It was like moving from a small village to a megalopolis – Hadoop opened up a vast array of possibilities that are just beginning to be explored.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">In 2011 Yahoo found that its Hadoop engineers were being courted by the emerging Big Data companies, so it spun off Hortonworks to give the Hadoop engineering team their own Big Data startup to grow. By 2012, Apache Hadoop (still open source) had so many data processing appendages built on top of the core software that MapReduce was split off from the underlying distributed file system. The cluster resource management that used to be in MapReduce was replaced by YARN (Yet Another Resource Negotiator). This gave Apache Hadoop another growth spurt, as MapReduce joined a growing number of analytical capabilities that run on top of YARN. Apache Spark is one of those analytical layers which supports data analysis tools that are more sophisticated and easier to use than MapReduce. Machine learning and analytics on data streams are just two of the many capabilities that Spark offers – and there are certainly more Hadoop tools to come. The potential of Big Data is just beginning to be tapped. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">In the early 1990s, Tim Berners-Lee worked to ensure that CERN made his underlying code for HTML, HTTP, and URLs available on a royalty-free basis, and because of that we have the World Wide Web. Ever since, software engineers have understood that the most influential technical advances come from sharing ideas across organizations, allowing the best minds in the industry to come together and solve tough technical problems. Big Data is as capable as it is because Google and Yahoo and many other companies were willing to share their technical breakthroughs rather than keep them proprietary. In the software industry we understand that we do far better as individual companies when the industry as a whole experiences major technical advances. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;">Antifragile Systems</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">It used to be considered unavoidable that as software systems grew in age and complexity, they became increasingly fragile. Every new release was accompanied by fear of unintended consequences, which triggered extensive testing and longer periods between releases. However, the “failure is not an option” approach is not viable at internet scale – because things <i>will </i>go wrong in any very large system. Ignoring the possibility of failure – and focusing on trying to prevent it – simply makes the system fragile. When the inevitable failure occurs, a fragile system is likely to break down catastrophically.[1] </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Rather than trying to prevent failure, it is much more important to identify and contain failures, then recover with a minimum of inconvenience to consumers. Every large internet company has figured this out. Amazon, Google, Etsy, Facebook, Netflix and many others have written or spoken about their approach to failure. Each of these companies has devoted a lot of effort to creating robust systems that can deal gracefully with unexpected and unpredictable situations.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Perhaps the most striking among these is Netflix, which has a good number of reliability engineers despite the fact that it has no data centers. Netflix’s approach was described in 2013 by Ariel Tseitlin in the article <i><a href="https://queue.acm.org/detail.cfm?id=2499552" target="_blank">The Antifragile Organization: Embracing Failure to Improve Resilience and Maximize Availability</a></i>. The main way Netflix increases the resilience of its systems is by regularly inducing failure with a “Simian Army” of monkeys: Chaos Monkey does some damage twice an hour, Latency Monkey simulates instances that are sick but still working, Conformity Monkey shuts down instances that don’t adhere to best practices, Security Monkey looks for security holes, Janitor Monkey cleans up clutter, Chaos Gorilla simulates failure of an AWS availability zone, and Chaos Kong might take a whole Amazon region offline. I was not surprised to hear that during a recent failure of an Amazon region, Netflix customers experienced very little disruption.</span><br />
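The idea behind Chaos Monkey – randomly kill instances in production to prove the fleet routes around them – can be illustrated with a toy model. The class, the instance names, and the load-balancing logic here are invented for the sketch; only the kill-and-survive pattern is Netflix's.

```python
import random

class Fleet:
    """A toy redundant service: any healthy instance can answer."""
    def __init__(self, n):
        self.instances = {f"i-{k}": True for k in range(n)}

    def healthy(self):
        return [i for i, up in self.instances.items() if up]

    def handle_request(self):
        # a load balancer would route around dead instances
        survivors = self.healthy()
        if not survivors:
            raise RuntimeError("total outage")
        return f"served by {random.choice(survivors)}"

def chaos_monkey(fleet):
    # terminate one randomly chosen healthy instance, Chaos Monkey style
    victim = random.choice(fleet.healthy())
    fleet.instances[victim] = False
    return victim

fleet = Fleet(5)
chaos_monkey(fleet)
response = fleet.handle_request()  # still served, despite the kill
```

Running the monkey continuously in production forces every team to build services that pass this test all the time, not just during a drill.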
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">A Simian Army isn’t the only way to induce failure. Facebook’s motto “Move Fast and Break Things” is another approach to stressing a system. In 2015, Ben Maurer of Facebook published <i><a href="http://queue.acm.org/detail.cfm?id=2839461" target="_blank">Fail at Scale</a> </i>– a good summary of how internet companies keep very large systems reliable despite failure induced by constant change, traffic surges, and hardware failures. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Maurer notes that the primary goal for very large systems is not to prevent failure – this is both impossible and dangerous. The objective is to find the pathologies that amplify failure and keep them from occurring. Facebook has identified three failure-amplifying pathologies: </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<b><span style="font-family: "arial" , "helvetica" , sans-serif;">1.<span class="Apple-tab-span" style="white-space: pre;"> </span>Rapidly deployed configuration changes</span></b><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">Human error is amplified by rapid changes, but rather than decrease the number of deployments, companies with antifragile systems move small changes through a release pipeline. Here changes are checked for known errors and run in a limited environment. The system quickly reverts to a known good configuration if (when) problems are found. Because the changes are small and gradually introduced into the overall system under constant surveillance, catastrophic failures are unlikely. In fact, the pipeline increases the robustness of the system over time.</span><br />
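The deploy-watch-revert cycle of such a pipeline can be sketched as follows. This is not Facebook's tooling: the error-rate threshold, the config shape, and the function names are assumptions made for illustration.

```python
def release(change, config, error_rate_after):
    """Push a small config change through a limited canary stage;
    revert automatically if the observed error rate degrades."""
    baseline = config.copy()
    config.update(change)            # deploy to a limited environment
    if error_rate_after > 0.01:      # surveillance threshold (assumed)
        config.clear()
        config.update(baseline)      # automatic revert to known good
        return "reverted"
    return "promoted"

config = {"timeout_ms": 200}
status_bad = release({"timeout_ms": 5}, config, error_rate_after=0.20)
status_ok = release({"timeout_ms": 250}, config, error_rate_after=0.001)
```

Because each change is small, a revert discards almost nothing, and the "known good" state is never far behind the latest deploy.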
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<b><span style="font-family: "arial" , "helvetica" , sans-serif;">2.<span class="Apple-tab-span" style="white-space: pre;"> </span>Hard dependencies on core services</span></b><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">Core services fail just like anything else, so code has to be written with that in mind. Generally, hardened APIs that embody best practices are used to invoke these services. Core services and their APIs are gradually improved by intentionally injecting failure into a core service to expose weaknesses that are then corrected as failure modes are identified.</span><br />
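One common way such hardened APIs contain a sick core service is a circuit breaker: after repeated errors, stop calling the service and fail fast to a fallback instead of hammering it. A minimal sketch, with the threshold, service, and fallback all invented for illustration:

```python
class CircuitBreaker:
    """Wrap a core-service call so its failures stay contained:
    after `threshold` consecutive errors, skip the call and serve
    a fallback until the service recovers."""
    def __init__(self, call, fallback, threshold=3):
        self.call, self.fallback, self.threshold = call, fallback, threshold
        self.failures = 0

    def request(self, *args):
        if self.failures >= self.threshold:   # circuit open: fail fast
            return self.fallback(*args)
        try:
            result = self.call(*args)
            self.failures = 0                 # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            return self.fallback(*args)

def flaky_core_service(key):
    raise TimeoutError("core service down")   # simulated hard failure

breaker = CircuitBreaker(flaky_core_service,
                         fallback=lambda key: f"cached:{key}")
results = [breaker.request("profile") for _ in range(5)]
```

The caller always gets an answer (here, a stale cached value), and the failing dependency is shielded from a retry storm while it recovers.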
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<b><span style="font-family: "arial" , "helvetica" , sans-serif;">3.<span class="Apple-tab-span" style="white-space: pre;"> </span>Increased latency and resource exhaustion</span></b><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">Best practices for avoiding the well-known problem of resource exhaustion include managing server queues wisely and having clients track outstanding requests. It’s not that these strategies are unknown, it’s that they must become common practice for all software engineers in the organization. </span><br />
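Both practices – a server queue with a hard bound, and a client that limits its own outstanding requests – can be sketched together. All class names and limits below are illustrative, not any particular company's implementation.

```python
from collections import deque

class BoundedServer:
    """Reject new work instead of letting the queue grow without
    bound; an ever-growing queue means rising latency for everyone."""
    def __init__(self, max_queue):
        self.queue = deque()
        self.max_queue = max_queue

    def submit(self, request):
        if len(self.queue) >= self.max_queue:
            return "rejected"        # shed load early, fail fast
        self.queue.append(request)
        return "accepted"

class Client:
    """Track outstanding requests and stop sending past a limit."""
    def __init__(self, server, max_outstanding):
        self.server = server
        self.max_outstanding = max_outstanding
        self.outstanding = 0

    def send(self, request):
        if self.outstanding >= self.max_outstanding:
            return "throttled"       # client-side backpressure
        if self.server.submit(request) == "accepted":
            self.outstanding += 1
            return "sent"
        return "rejected"

server = BoundedServer(max_queue=2)
client = Client(server, max_outstanding=3)
outcomes = [client.send(i) for i in range(4)]
```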
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Well-designed dashboards, effective incident response, and after-action reviews that implement countermeasures to prevent recurrence round out Facebook's toolkit for keeping its very large systems reliable.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">We now know that fault tolerant systems are not only more robust, but also less risky than systems which we attempt to make failure-free. Therefore, common practice for assuring the reliability of large-scale software systems is moving toward software-managed release pipelines which orchestrate frequent small releases, in conjunction with failure induction and incident analysis to produce hardened infrastructure.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;">Content Platforms</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Video is not new; television has been around for a long time, film for even longer. As revolutionary as film and TV have been, they push content to a mass audience; they do not inspire engagement. An early attempt at visual engagement was the PicturePhone of the 1970’s – a textbook example of a technical success and a commercial disaster. They got the PicturePhone use case wrong – not many people really wanted to be seen during a phone call. Videoconferencing did not fare much better – because few people understood that video is not about improving communication, it’s about sharing experience. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">In 2005, amidst a perfect storm of increasing bandwidth, decreasing cost of storage, and emerging video standards, three entrepreneurs – Chad Hurley, Steve Chen, and Jawed Karim – tried out an interesting use case for video: a dating site. But they couldn’t get anyone to submit “dating videos,” so they accepted any videos clips people wanted to upload. They were surprised at the videos they got: interesting experiences, impressive skills, how-to lessons – not what they expected, but at least it was something. The YouTube founders quickly added a search capability. This time they got the use case right and the rest is history. Video is the printing press of experience, and YouTube became the distributor of experience. Today, if you want to learn the latest unicycle tricks or how to get the back seat out of your car, you can find it on YouTube. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">YouTube was not the first successful content platform. Blogs date back to the late 1990s, when they began as diaries on personal web sites shared with friends and family. Then media companies began posting breaking news on their web sites to get their stories out before their competitors. Blogger, one of the earliest blog platforms, was launched just before Y2K and acquired by Google in 2003 – the same year WordPress was launched. As blogging popularity grew over the next few years, the use case shifted from diaries and news articles to ideas and opinions – and blogs increasingly resembled magazine articles. Those short diary entries meant for friends were more like scrapbooks; they came to be called tumblelogs or microblogs. And – no surprise – separate platforms for these microblogs emerged: Tumblr in 2006 and Twitter in 2007.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">One reason why blogs drifted away from diaries and scrapbooks is that alternative platforms emerged aimed at a very similar use case – which came to be called social networking. MySpace was launched in 2003 and became wildly popular over the next few years, only to be overtaken by Facebook, which was launched in 2004. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Many other public content platforms have come (and gone) over the last decade; after all, a successful platform can usually be turned into a significant revenue stream. But the lessons learned by the founders of those early content platforms remain best practices for two-sided platforms today:</span><br />
<br />
<ol>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><b>Get the use case right on both sides of the platform.</b> Very few founders got both use cases exactly right to begin with, but the successful ones learned fast and adapted quickly. </span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><b>Attract a critical mass to both sides of the platform. </b>Attracting enough traffic to generate network effects requires a dead simple contributor experience and an addictive consumer experience, plus a receptive audience for the initial release.</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><b>Take responsibility for content even if you don’t own it.</b> In 2007 YouTube developed ContentID to identify copyrighted audio clips embedded in videos and make it easy for contributors to comply with attribution and licensing requirements. </span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><b>Be prepared for and deal effectively with stress.</b> Some of the best antifragile patterns came from platform providers coping with extreme stress such as the massive traffic spikes at Twitter during natural disasters or hectic political events.</span></li>
</ol>
<br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">In short, successful platforms require insight, flexibility, discipline, and a lot of luck. Of course, this is the formula for most innovation. But don't forget</span><span style="font-family: "arial" , "helvetica" , sans-serif;"> </span><span style="font-family: "arial" , "helvetica" , sans-serif;">–</span><span style="font-family: "arial" , "helvetica" , sans-serif;"> no matter how good your process is, you still need the luck part. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;">Mobile Apps</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">It’s hard to imagine what life was like without mobile apps, but they did not exist a mere eight years ago. In 2008 both Apple and Google released content platforms that allowed developers to get apps directly into the hands of smart phone owners with very little investment and few intermediaries. By 2014 (give or take a year, depending on whose data you look at) mobile apps had surpassed desktops as the path people take to the internet. It is impossible to ignore the importance of the platforms that make mobile apps possible, or the importance of the paradigm shift those apps have brought about in software engineering. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Mobile apps tend to be small and focused on doing one thing well – after all, a consumer has to quickly understand what the app does. By and large, mobile apps do not communicate with each other, and when they do it is through a disciplined exchange mediated by the platform. Their relatively small size and isolation make it natural for each individual app to be owned by a single, relatively small team that accepts the responsibility for its success. As we saw earlier, Amazon moved to small autonomous teams a long time ago, but it took a significant architectural shift for those teams to be effective. Mobile apps provide a critical architectural shift that makes small independent teams practical, even in monolithic organizations. And they provide an ecosystem that allows small startups to compete effectively with those organizations. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">The nature of mobile apps changes the software development paradigm in other ways as well. As one bank manager told me, “We did our first mobile app as a project, so we thought that when the app was released, it was done. But every time there was an operating system update, we had to update the app. That was a surprise! There are so many phones to test and new features coming out that our apps are in a constant state of development. There is no such thing as maintenance – or maybe it's all maintenance.”</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">The small teams, constant updates, and direct access to the deployed app have created a new dynamic in the IT world: software engineers have an immediate connection with the results of their work. App teams can track usage, observe failures and track metrics – then make changes accordingly. More than any other technology, mobile platforms have fostered the growth of small, independent product teams </span><span style="font-family: "arial" , "helvetica" , sans-serif;">– </span><span style="font-family: "arial" , "helvetica" , sans-serif;">with end-to-end responsibility </span><span style="font-family: "arial" , "helvetica" , sans-serif;">–</span><span style="font-family: "arial" , "helvetica" , sans-serif;"> that use short feedback loops to constantly improve their offering. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Let’s return to luck. If you have a large innovation effort, it probably has a 20% chance of success at best. If you have five small, separate innovation efforts, each with a 20% chance of success, you have a much better chance that one of them will succeed – as long as they are truly autonomous and are not tied to an inflexible back end or flawed use case. Mobile apps create an environment where it can be both practical and advisable to break products into small, independent experiments, each owned by its own “full stack” team.[2] The more of these teams you have pursuing interesting ideas, the more likely it is that some of the ideas will become the innovative offerings that propel your company into the future. </span><br />
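The arithmetic behind that portfolio effect is worth making explicit: five independent 20% bets succeed (at least once) far more often than one 20% bet.

```python
# One big bet: a 20% chance of success.
p_single = 0.20

# Five independent small bets: success means at least one pays off,
# i.e. the complement of all five failing.
p_at_least_one = 1 - (1 - p_single) ** 5
# 1 - 0.8**5 = 1 - 0.32768 = 0.67232, roughly a two-in-three chance
```

The independence assumption is the catch – which is exactly why the teams must be autonomous and not chained to a shared inflexible back end.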
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;">What about “Agile”?</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">You might notice that “Agile” is not on my list of innovations. And yet, agile values are found in every major software innovation since the Agile Manifesto was articulated in 2001. Agile development does not cause innovation; it is meant to create the conditions necessary for innovation: flexibility and discipline, customer understanding and rapid feedback, small teams with end-to-end responsibility. Agile processes do not manufacture insight and they do not create luck. That is what people do. </span><br />
____________________________<br />
Footnotes:<br />
1. “the problem with artificially suppressed volatility is not just that the system tends to become extremely fragile; it is that, at the same time, it exhibits no visible risks… Such environments eventually experience massive blowups… catching everyone off guard and undoing years of stability or, in almost all cases, ending up far worse than they were in their initial volatile state. Indeed, the longer it takes for the blowup to occur, the worse the resulting harm…” <a href="http://www.amazon.com/exec/obidos/ASIN/B0083DJWGO/poppendieckco-20" target="_blank">Antifragile</a>, Nassim Taleb p 106<br />
<br />
2. A full stack team contains all the people necessary to make things happen in not only the full technology stack, but also in the full stack of business capabilities necessary for the team to be successful.

Friction<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-neCR7Cu0Mj3Kt_Gc2BIGFl_321Z2jZ8sKvY2emlEuFJaEbvZtS8a7X_KvoLOnMeqQg0dSaLRnW5A9ElKLkVOJRvE5kNn9p3vTMN5XeuKlqKvAZaHgw0DDHZmo88u0ZDPar6LmjchvAdT/s1600/Friction.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="176" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-neCR7Cu0Mj3Kt_Gc2BIGFl_321Z2jZ8sKvY2emlEuFJaEbvZtS8a7X_KvoLOnMeqQg0dSaLRnW5A9ElKLkVOJRvE5kNn9p3vTMN5XeuKlqKvAZaHgw0DDHZmo88u0ZDPar6LmjchvAdT/s200/Friction.jpg" width="200" /></a></div>
One third of the fuel that goes into a car is spent overcoming friction. By comparison, an electric car <a href="http://www.sciencedaily.com/releases/2012/01/120112095853.htm" target="_blank">loses half as much energy – one sixth – to friction</a>. Who knew electric cars had such an advantage?<br />
<br />
Friction is the force that resists motion when the surface of one object comes into contact with the surface of another. You can imagine parts moving against one another in cars, but do you ever wonder what happens when your products and services come in contact with customers? Might this create some friction? Could there be competing offerings that create considerably less friction? If so, you can be sure your customers will find the low friction offering more attractive than yours.<br />
<br />
<h2>
<span style="font-size: x-large;">
Friction in the Customer Journey</span></h2>
Think of friction as the <a href="http://techcrunch.com/2013/04/20/cognitive-overhead/" target="_blank">cognitive overhead</a> that a system places on those who use it. Let’s consider the friction involved in taking a taxi from an airport to a hotel. When I arrive at most airports, I get in a taxi queue, heeding the conspicuous warnings not to ride with unauthorized drivers. When I reach the front of the queue, I take the next taxi in line, and I assume that the cost is the same no matter which taxi I take. But this is not true in Stockholm, where taxis can charge any rate they wish simply by posting it in the window. Nor is it true in many other locations, so I have learned to research the taxi systems in every city I visit. That’s cognitive load. I also bring enough local currency to pay for a taxi ride to my hotel and check on whether a tip is expected. More cognitive load.<br />
<br />
Uber set out to remove the friction from taking a taxi by reimagining the entire experience, from hailing to routing to paying; from drivers and cars to insurance and regulations. By removing as many points of friction as possible for riders, Uber has become wildly popular in a very short time. In January 2015, four years after launch, <a href="http://www.businessinsider.com/uber-revenue-san-francisco-2015-1" target="_blank">Uber reported</a> that its revenue in its home city of San Francisco had grown to three times the size of the entire taxi market in that city. Uber has recently opened a robotics center in Pittsburgh and joins Google in working to create a practical driverless car. Its intent is to bring the cost and convenience of ride services to a point where owning a car becomes the expensive option.<br />
<br />
<h3>
Full Stack Startups</h3>
Uber is among the largest of a new crop of startups – investor Chris Dixon calls them <a href="http://www.inc.com/george-arison/4-lessons-for-building-a-full-stack-startup.html" target="_blank">full stack startups</a> – that bypass incumbents and reinvent the entire customer experience from start to finish with the aim of making it as frictionless as possible. Full stack startups focus on creating a world that <a href="https://twitter.com/levie/status/370776444013510656" target="_blank">works the way it <i><b>should </b></i>work</a>, given today’s technology, rather than optimizing the way it <i><b>does </b></i>work, given yesterday’s mental models. Because these companies are creating a new end-to-end experience, they rarely leverage existing capabilities aimed at serving their market; they develop a “full stack” of new capabilities.<br />
<blockquote class="tr_bq">
<i><span style="color: #660000;">"The challenge with the full stack approach is you need to get good at many different things: software, hardware, design, consumer marketing, supply chain management, sales, partnerships, regulation, etc. The good news is that if you can pull this off, it is very hard for competitors to replicate so many interlocking pieces." <a href="http://cdixon.org/2014/03/15/full-stack-startups/">Chris Dixon</a></span></i></blockquote>
Large companies have the same full stack of capabilities as startups, but these capabilities lie in different departments and there is friction at every department boundary. Moreover, incumbents are deeply invested in the way things work today, so large incumbent companies are usually incapable of truly reinventing a customer journey. As hard as they try to be innovative, incumbents tend to be blind to the friction embedded in the customer journey that they provide today.<br />
<blockquote class="tr_bq">
<span style="color: #660000;"><i>Consider banks. They have huge, complex back end systems-of-record that are expensive to maintain and keep secure. But customers expect mobile access to their bank accounts, so banks have added “front end teams” to build portals (mobile apps) to access the back end systems. Typically banks end up with what Gartner calls “Bimodal IT.” One part of IT handles the backend systems using traditional processes, while a different group uses different processes to deliver web and mobile apps. As a result, the front end teams are not able to reimagine the customer journey; they are locked into the practices and revenue models embedded in the back end systems. So in the end, banks have done little to change the customer journey, the fee structure, or anything else fundamental to banking.</i></span></blockquote>
<blockquote class="tr_bq">
<span style="color: #660000;"><i>For example, in the US it is nearly impossible for me to transfer money to my granddaughter’s bank account without physically mailing a check or paying an exorbitant wire transfer fee. Not only that, but I cannot use my chip-and-pin card at many places in Europe because US banks don’t let me enter a pin with my card (they still depend on signatures!), while unstaffed European kiosks always require a pin. Anyone who banks in Europe would find US banking practices archaic. I find that they generate a lot of friction and I expect that they cost a lot of money.</i></span></blockquote>
<h3>
Creative Friction</h3>
Why do banks adopt Bimodal IT? <a href="http://www.gartner.com/imagesrv/cio/pdf/cio_agenda_execsum2014.pdf" target="_blank">According to Gartner</a>, “There is an inherent tension between doing IT right and doing IT fast.” I respectfully disagree; there is nothing inherently wrong about being fast. In fact, when software development is done right, speed, quality, and low cost are fully compatible. Hundreds of enterprises, including Amazon and Google (whose systems manage billions of dollars of revenue every month), have demonstrated that the safest approach to software development is automated, adaptive, and fast.<br />
<br />
It is true that there is tension between different disciplines: front end and back end; dev and ops; product and technology. But the best way to leverage these tensions is not to separate the parties, but to put them together on the same team with a common goal. You will never have a great product, or a great process, without making tradeoffs – that is the nature of difficult engineering problems. If your teams lack multiple perspectives on a problem, they will be unable to make consistently good tradeoff decisions, and their results will be mediocre.<br />
<br />
<h2>
<span style="font-size: x-large;">
Friction in the Code</span></h2>
The Prussian general and military theorist Carl von Clausewitz (1780-1831) thought of friction as the thing which tempers the good intentions of generals with the reality of the battlefield. He was thinking of the friction caused by boggy terrain that horses cannot cross, soldiers exhausted by heat and heavy burdens, fog that obscures enemy positions, supplies that don’t keep pace with military movements. He noted that battalions are made up of many individuals moving at different rates with different amounts of confusion and fear, each one affecting the others around him in unpredictable ways. It is impossible for the thousands of individual agents on the battlefield to behave exactly according to a theoretical plan, <a href="http://www.gutenberg.org/files/1946/1946-h/1946-h.htm" target="_blank">Clausewitz wrote</a>. Unless generals have actually experienced war, he said, they will not be able to account for the accumulated friction created by all of these agents interacting with each other and their environment.<br />
<br />
Anyone who has ever looked closely at a large code base would be forgiven for thinking that Clausewitz was writing about software systems. Over time, any code base acquires lots of moving parts and increasing amounts of friction develops between these parts, until eventually the situation becomes hopeless and the system is either replaced or abandoned. Unless, of course, the messy parts are systematically cleaned up and friction is kept in check. But who is allowed to take time for this sort of refactoring if the decision-makers have never written any code, never been surprised by hidden dependencies, never been bitten by the unintended consequences of seemingly innocuous changes?<br />
<br />
<h3>
Failure</h3>
Not long ago the New York Stock Exchange was shut down for half a day due to “computer problems.” It’s not uncommon for an airline reservation system to suffer from “computer problems” so severe that planes are grounded. But we don’t expect to hear about “computer problems” at Twitter or Dropbox or Netflix or similar systems – maybe they had problems a few years ago, but they seem to be reasonably reliable these days. The truth is, cloud-based systems fail all the time, because they are built on unreliable hardware running over unreliable communication links. So they are <b><i>designed </i></b>to fail, to detect failure, and to recover quickly, without interrupting or corrupting the services they provide. They <b><i>appear </i></b>to be reliable because their robust failure detection and recovery mechanisms isolate users from the unreliable infrastructure.<br />
<br />
The first hint of this approach was Google’s early strategy for building a server farm. They used cheap off-the-shelf components that would fail at a known rate, and then they automated failure detection and recovery. They replicated server contents so nothing was lost during a failure, and they automated the monitoring, detection, and recovery process. Amazon built its cloud with the same philosophy – they knew that at the scale they intended to pursue, everything would fail sooner rather than later, so automated failure detection and recovery had to be designed into the system.<br />
<br />
Designing failure recovery into a system requires a special kind of software architecture and approach to development. To compensate for unreliable communication channels, messaging is usually asynchronous and on a best-efforts basis. Because servers are expected to fail, interfaces are idempotent so you get the same results on a retry as you get the first time. Since distributed data may not always match, software is written to deal with the ambiguities and produce eventual consistency.<br />
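As a rough illustration of why idempotent interfaces make retries safe, here is a minimal Python sketch; the store, the names, and the simulated failures are all hypothetical, not taken from any real system.

```python
import random

class FlakyStore:
    """A key-value store whose network calls fail at random,
    standing in for a server on an unreliable channel."""
    def __init__(self, failure_rate=0.5):
        self.data = {}
        self.failure_rate = failure_rate

    def put(self, key, value):
        if random.random() < self.failure_rate:
            raise ConnectionError("simulated network failure")
        # Idempotent: writing the same key/value twice leaves the
        # store in exactly the same state as writing it once.
        self.data[key] = value
        return self.data[key]

def put_with_retry(store, key, value, max_attempts=50):
    """Retry until the idempotent write succeeds. A retry after a
    'failed' call that actually landed on the server is harmless."""
    for _ in range(max_attempts):
        try:
            return store.put(key, value)
        except ConnectionError:
            continue
    raise RuntimeError("gave up after %d attempts" % max_attempts)
```

Because the write is idempotent, the client never needs to know whether a timed-out call actually reached the server; it simply retries until it gets an answer.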
<blockquote class="tr_bq">
<i><span style="color: #660000;">Fault tolerance is not a new concept. Back in the days before solid state components, computer hardware was expected to fail, so vast amounts of time and energy were dedicated to failure detection and recovery. My first job was programming the Number 2 ESS (Electronic Switching System) being built in Naperville, IL by Bell Labs about the time I got out of college. This system, built out of discrete components prior to the days of integrated circuits, had a design goal of a maximum downtime of two hours in forty years. The hardware was completely duplicated and easily half of the software was dedicated to detecting faults, switching out the bad hardware, and identifying the defective discrete component so it could be replaced. This allowed a system built on unreliable electronic components to match the reliability of the electro-mechanical switching systems that were commonly in use at the time.</span></i></blockquote>
<h3>
Situational Awareness</h3>
Successful cloud-based systems have a LOT of moving parts – that pretty much comes as a byproduct of success. With all of these parts moving around, designing for failure hardly seems like an adequate explanation for the robustness of these systems. And it isn’t. At the heart of a reliable cloud-based system are small teams (you might call them “full stack” teams) of people who are fully responsible for their piece of the system. They pay attention to how their service is performing, they fix it when it fails, and they continuously improve it to better serve its consumers.<br />
<br />
Full stack teams that maintain end-to-end responsibility for a software service do not fit the model we used to have of the “right” way to develop software. These are not project teams that write code according to spec, turn it over to testing, and disband once it’s tossed over the wall to operations. They are engineering teams that solve problems and make frequent changes to the code they are responsible for. Code bases created and maintained by full stack teams are much more resilient than the large and calcified code bases created by the project model precisely because people pay attention to (and change!) the internal workings of “their” code on an on-going basis.<br />
<br />
<h3>
Limited Surface Area</h3>
Clearly, many small teams making independent changes to a large code base can generate a lot of Clausewitzian friction. But since friction occurs when the surfaces of two objects come in contact with each other, strictly limiting the surface area of the code exposed by each team can dramatically reduce friction. In cloud-based systems, services are designed to be as self-contained as possible and interactions with other services are strictly limited to hardened interfaces. Teams are expected to limit changes to the surface area (interfaces) of their code and proactively test any changes that might make it through that surface to other services.<br />
<br />
Modern software development includes automated testing strategies and automated deployment pipelines that take the friction out of the deployment process, making it practical and safe to independently deploy small services. Containers are used to standardize the surface area that services expose to their environment, reducing the friction that comes from unpredictable surroundings. Finally, when small changes are made to a live system, the impact of each change is monitored and measured. Changes are typically deployed to a small percentage of users (limiting the deployment surface area), and if any problems are detected small changes can be rolled back quickly. We know that the best way to change a complex system is to probe and adapt, and we know that software systems are inherently complex. This explains why the small rapid deployments common in cloud-based systems turn out to be much safer and more robust than the large releases that we used to think were the “right” way to deliver software. <br />
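The canary pattern described above (deploy to a small percentage of users, watch the metrics, roll back quickly if anything looks wrong) might be sketched like this; the class name, the bucketing scheme, and the error threshold are illustrative assumptions, not any particular vendor's API.

```python
import hashlib

class CanaryRollout:
    """Percentage-based canary: a stable hash of the user id routes
    a fixed slice of users to the new version of a service."""
    def __init__(self, percent):
        self.percent = percent
        self.rolled_back = False

    def serves_new_version(self, user_id):
        if self.rolled_back:
            return False
        # Hashing gives every user a stable bucket from 0 to 99,
        # so the same user always sees the same version.
        digest = hashlib.sha256(user_id.encode()).hexdigest()
        return int(digest, 16) % 100 < self.percent

    def record_metrics(self, error_rate, threshold=0.05):
        # If the canary's error rate exceeds the threshold,
        # roll the small change back immediately.
        if error_rate > threshold:
            self.rolled_back = True
```

The key property is that a bad change is detected while it affects only the canary slice, and rolling it back is a single cheap operation rather than an emergency redeployment.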
<br />
<h3>
Shared Learning</h3>
Do you ever wonder how the sophisticated testing and deployment tools used at companies like Netflix actually work? Would you like to know how Netflix stores and analyzes data or how it monitors the performance of its platform? Just head over to the <a href="https://netflix.github.io/" target="_blank">Netflix Open Source Center</a> on GitHub; it’s all there for you to see – and use if you’d like. Want to analyze a lot of data? You will undoubtedly consider Hadoop, originally developed at Yahoo! based on Google research papers, open sourced through Apache, and now at the core of many open source tools that abstract its interface and extend its capability.<br />
<br />
The world-wide software engineering community has developed a culture of sharing intellectual property, in stark contrast to the more common practice of keeping innovative ideas and novel tools proprietary. The rapid growth of large, reliable, secure software systems can be directly linked to the fact that software engineers routinely contribute to and build upon the work of their world-wide colleagues. Because of this, methods and tools for building highly reliable complex software systems have advanced extraordinarily quickly and are widely available.<br />
<br />
<h2>
<span style="font-size: x-large;">Friction in the Process</span></h2>
Between 2004 and 2010, the <a href="http://www.computer.org/cms/Computer.org/ComputingNow/homepage/2012/0712/rW_CO_WhytheFBI.pdf" target="_blank">FBI tried twice to develop an electronic case management system</a>, and it failed both times, squandering hundreds of millions of dollars. The UK’s National Health Service lost similar amounts of money on a <a href="http://www.theguardian.com/society/2014/may/10/nhs-online-choose-and-book-system-scrapped" target="_blank">patient booking system</a> that was eventually abandoned, and multiple billions of pounds on a <a href="http://www.theguardian.com/society/2013/sep/18/nhs-records-system-10bn" target="_blank">patient record system</a> that never worked. In 2012 Sweden decided to <a href="http://blog.crisp.se/2014/02/21/henrikkniberg/pust-lardomar" target="_blank">scrap and rewrite PUST</a>, a police automation system that actually worked quite well, but not well enough for those who chose to have it rewritten the “right” way. The <a href="http://sverigesradio.se/sida/artikel.aspx?programid=2054&artikel=5789616" target="_blank">rewrite never worked and was eventually abandoned</a>, an expensive fiasco that left the police without any system at all.<br />
<br />
I could go on and on – just about every country has its story about an expensive government-funded computer system that cost extraordinary amounts of money and never actually worked. The reason? Broadly speaking, these fiascoes are <a href="http://www.fiercegovernmentit.com/story/qa-jack-israel-fbi-sentinel-and-federal-it-development-shortcomings/2012-07-01" target="_blank">caused by the process most governments use to procure software systems</a> – a high friction process with a very high rate of failure.<br />
<br />
One country that does not have an IT fiasco story is Estonia, probably the <a href="http://www.economist.com/blogs/economist-explains/2013/07/economist-explains-21?fsrc=scn/fb/te/bl/ed/howdidestoniabecomealeaderintechnology" target="_blank">most automated country in the world</a>. A few years ago British MP Francis Maude visited Estonia to find out <a href="https://e-estonia.com/the-story/how-we-got-there/" target="_blank">how they managed to implement such sophisticated automation</a> on a small budget. He discovered that Estonia automated its government <b><i>because </i></b>it had such a small budget, and properly automated government services are much less expensive than their manual counterparts.<br />
<br />
Estonia’s process is simple: small internal teams work directly with consumers, understand their journey, and remove the friction. Working software is delivered in small increments to a small number of consumers, adjustments are made to make it work better, and once things work well the new capability is rolled out more broadly. Then another capability is added in the same manner, and thus the system grows steadily in small steps over time. (Incidentally, when this process is used, it is almost impossible to spend a lot of money only to find out the system doesn’t work.)<br />
<br />
The UK government formed a consortium with Estonia and three other countries (called the <a href="https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/386290/D5Charter_signed.pdf" target="_blank">Digital 5</a>) to “provide a focused forum to share best practice [and] identify how to improve the participants’ digital services.” Maude started up the UK’s Government Digital Services, where small internal teams focus on making the process of obtaining government information and services as frictionless as possible. If you want to see how the UK Government Digital Services actually works, check out its <a href="https://www.gov.uk/design-principles" target="_blank">Design Principles</a> which summarize a new mental model for creating digital services, and the <a href="https://www.gov.uk/service-manual/governance/index.html" target="_blank">Governance</a> approach, which outlines an effective, low friction software development process.<br />
<br />
The <a href="https://blog.newrelic.com/wp-content/uploads/80893.pdf" target="_blank">HealthCare.gov fiasco</a> in the US in 2013 <a href="http://www.fastcompany.com/3046756/obama-and-his-geeks" target="_blank">led to the creation of US Digital Services</a>, which is working in <a href="https://www.whitehouse.gov/blog/2015/01/16/us-uk-digital-government-partnership" target="_blank">partnership with UK Digital Services</a> to rethink government software development and delivery strategies. The <a href="https://playbook.cio.gov/" target="_blank">US Digital Services Playbook</a> is a great place for any organization to find advice on implementing a low friction development process.<br />
<blockquote class="tr_bq">
<i><span style="color: #660000;">DIGITAL SERVICE PLAYS:</span></i><br />
<ol>
<li><i><span style="color: #660000;">Understand what people need</span></i></li>
<li><i><span style="color: #660000;">Address the whole experience, from start to finish</span></i></li>
<li><i><span style="color: #660000;">Make it simple and intuitive</span></i></li>
<li><i><span style="color: #660000;">Build the service using agile and iterative practices</span></i></li>
<li><i><span style="color: #660000;">Structure budgets and contracts to support delivery</span></i></li>
<li><i><span style="color: #660000;">Assign one leader and hold that person accountable</span></i></li>
<li><i><span style="color: #660000;">Bring in experienced teams</span></i></li>
<li><i><span style="color: #660000;">Choose a modern technology stack</span></i></li>
<li><i><span style="color: #660000;">Deploy in a flexible hosting environment</span></i></li>
<li><i><span style="color: #660000;">Automate testing and deployments</span></i></li>
<li><i><span style="color: #660000;">Manage security and privacy through reusable processes</span></i></li>
<li><i><span style="color: #660000;">Use data to drive decisions</span></i></li>
<li><i><span style="color: #660000;">Default to open</span></i></li>
</ol>
<br />
<div style="text-align: right;">
<i><span style="color: #660000;"><i><span class="Apple-tab-span" style="white-space: pre;"> </span><a href="https://playbook.cio.gov/" target="_blank">US Digital Services Playbook</a></i></span></i></div>
<i><span style="color: #660000;">
</span></i></blockquote>
<h3>
The New Mental Model</h3>
The UK government changed – seemingly overnight – from high friction processes orchestrated by procurement departments to small internal teams governed by simple metrics. Instead of delivering “requirements” that someone else thinks up, teams are required to track four key performance indicators and figure out how to move these metrics in the right direction over time.<br />
<blockquote class="tr_bq">
<span style="color: #660000;"><i>UK Digital Service’s <a href="https://www.gov.uk/service-manual/measurement/other-kpis.html" target="_blank">four core KPIs</a>:</i></span><br />
<ol>
<li><span style="color: #660000;"><i>Cost per transaction</i></span></li>
<li><span style="color: #660000;"><i>User satisfaction</i></span></li>
<li><span style="color: #660000;"><i>Completion rate</i></span></li>
<li><span style="color: #660000;"><i>Digital take-up</i></span></li>
</ol>
<br />
<div style="text-align: right;">
<span style="color: #660000;"><i><i>See Gov.UK’s <a href="https://www.gov.uk/performance" target="_blank">Performance Dashboard</a>.</i></i></span></div>
<span style="color: #660000;"><i>
</i></span></blockquote>
This is an entirely new mental model about how to develop effective software – one that removes all of the intermediaries between an engineering team and its consumers. It is a model that makes no attempt to define requirements, make estimates, or limit changes; instead it assumes that digital services are best developed through experimentation and require on-going improvement.<br />
<br />
This is the mental model used by those who developed the first PUST system in Sweden, the one that was successful and appreciated by the police officers who used it. But unfortunately, conventional wisdom said it was not developed the “right” way, so the working system was shut down and rebuilt using the old mental model. And thus Sweden snatched failure from the jaws of success, proving once again that when it comes to developing interactive services, the old mental model simply Does. Not. Work.<br />
<br />
<h3>
Unexpected Points of Friction</h3>
It turns out that when governments move from the old mental model to the new mental model, many of the things that were considered “good” or “essential” in the past turn out to be “questionable” or “to be avoided” going forward. It’s a bit jarring to look at the list of good ideas that should be abandoned, but when you consider the friction that these ideas generate, it’s easier to see why forward-looking governments have eliminated them.<br />
<blockquote class="tr_bq">
<b>1.<span class="Apple-tab-span" style="white-space: pre;"> </span>Requirements generate friction.</b> The concept that requirements are specified by [someone] and implemented by “the team” has to be abandoned. Rather, a team of engineers should explore hypotheses, testing and modifying ideas until they are proven or abandoned. Engineering teams should be expected to figure out how to make a positive impact on business metrics within valid constraints.</blockquote>
<blockquote class="tr_bq">
<b>2.<span class="Apple-tab-span" style="white-space: pre;"> </span>Handovers generate friction. </b>The engineering team should have direct contact with at least a representative sample of the people whose journey they are automating. Just about any intermediary is problematic, whether the go-between is a procurement officer, business analyst, or product owner.</blockquote>
<blockquote class="tr_bq">
<b>3. <span class="Apple-tab-span" style="white-space: pre;"> </span>Organizational boundaries generate friction.</b> There is a reason why the UK and US use internal teams to develop Digital Services. Going through a procurement office creates insurmountable friction – especially when procurement is governed by laws passed in the days of the old mental model. The IT departments of enterprises often generate similar friction, especially when they are cost centers.</blockquote>
<blockquote class="tr_bq">
<b>4.<span class="Apple-tab-span" style="white-space: pre;"> </span>Estimates generate friction. </b>Very little useful purpose is served by estimates at the task level. Teams should have a good idea of their capacity by measuring the <b><i>rate </i></b>at which they complete their current work or the <i><b>time </b></i>it takes work to move through their workflow. Teams should be asked "What can be completed in this time-frame?" rather than "How long will this take?" The UK Digital Services funds service development incrementally, with a general time limit for each phase. If a service does not fall within the general time boundaries, it is usually broken down into smaller services. </blockquote>
<blockquote class="tr_bq">
<b>5. <span class="Apple-tab-span" style="white-space: pre;"> </span>Multitasking generates friction.</b> Teams should do one thing at a time and get it done, because task switching burns up a lot of cognitive overhead. Moreover, partially done work that has been put aside gums up the workflow and slows things down. </blockquote>
<blockquote class="tr_bq">
<b>6. Backlogs generate friction.</b> A long "to do" list takes time to compile and time to prioritize, while everything on the list grows old and whoever put it there grows impatient. Don't prioritize - decide! Either the capacity exists to do the work, or it doesn't. Teams need only three lists: Now, Next, and Never. There is no try.</blockquote>
<h3>
If Governments can do it, so can Enterprises</h3>
If governments can figure out how to design award-winning services [GOV.UK won the <a href="http://www.theguardian.com/artanddesign/2013/apr/16/government-website-design-of-year" target="_blank">Design Museum Design of the Year Award in 2013</a>] while moving quickly and saving money, surely enterprises can do the same. But first there is a lot of inertia to overcome. Once upon a time, governments assumed that obtaining software systems through a procurement process was essential, because it would be impossible to hire the people needed to design and develop these systems internally. They were wrong. They assumed that having teams scattered about in various government agencies would lead to a bunch of unconnected one-off systems. They were wrong. They were afraid that without detailed requirements and people accountable for estimates, there would be no governance. They were wrong. Once they abandoned these assumptions and switched to the low friction approach pioneered by Estonia, governments got better designs, more satisfied consumers, lower cost, and far more predictable results.<br />
<br />
Your organization can reap the same benefits, but first you will have to check your assumptions at the door and question some comforting things like requirements, estimates, IT departments, contracts, backlogs – you get the idea. Read the <a href="https://playbook.cio.gov/" target="_blank">US Digital Services Playbook</a>. Could you run those 13 plays in your organization? If not, you need to uncover the assumptions that are keeping you in the grasp of the old mental model.<br />
<br />Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comtag:blogger.com,1999:blog-2229468562774492653.post-89400327350925245982015-07-16T15:48:00.001-05:002017-01-14T15:14:03.698-06:00Pitfalls of Agile Transformations<h2 style="margin-bottom: 12pt; text-align: left;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuo8_vvmnJJjQjfqCd-hhBs5JK47jUu75gVusyOnNFfxlZnx813-S2QGACjGeRT4Rf7NETFCBAmMWXrlELnHp2buUqLm0HEjOOATag2-p6fyEf7I6yoWIhyTaHxzwfwbhz2SylUld-2291/s1600/Pitfall.png" imageanchor="1" style="clear: left; display: inline !important; float: left; margin-bottom: 1em; margin-right: 1em; text-align: center;"> <img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuo8_vvmnJJjQjfqCd-hhBs5JK47jUu75gVusyOnNFfxlZnx813-S2QGACjGeRT4Rf7NETFCBAmMWXrlELnHp2buUqLm0HEjOOATag2-p6fyEf7I6yoWIhyTaHxzwfwbhz2SylUld-2291/s1600/Pitfall.png" /></a> </h2>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">“We are a conservative company, so we are just starting our agile
transformation,” the manager told me. “But we expect big things from it: faster delivery, easier recruiting, happier
customers.” <o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">“Interesting objectives,” I thought to myself. “Something I
might have heard ten years ago.” It struck me that the reason an organization
opts for late adoption is to learn from those who go first – from the companies
that bushwhacked through the agile swamp a decade ago, or the organizations
that followed a few years later. I wondered how much of what we have learned in
the last decade will inform this budding agile transformation. I sensed that
the answer was “not enough.”<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Once you get past the sales pitches and confirmation biases,
it doesn’t take much research to discover that agile and Scrum don’t have such
a great track record. In the First Round Review article <a href="http://firstround.com/review/im-sorry-but-agile-wont-fix-your-products/">I'm
Sorry, But Agile Won't Fix Your Products</a><span class="MsoHyperlink">,</span>
Adam Pisoni, co-founder and former CTO of Yammer, contends that “While SCRUM
did manage to rein in impulsive managers, it ended up being used more to exert
tighter control over engineers’ work.” In <a href="http://blog.toolshed.com/2015/05/the-failure-of-agile.html">The Failure
of Agile</a><span class="MsoHyperlink">,</span><span class="MsoHyperlink"> </span>Andy Hunt, an original
signatory of the Agile Manifesto, writes “Agile methods themselves have not
been agile. Now there‘s an irony for you.” Both of these pieces complain that
agile does not provide real empowerment – one of several persistent problems we
have observed in many organizations as they adopt agile practices. <o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Every organization undertaking an agile transformation
imagines that the problems with other agile implementations will not plague THEIR
transformation. If they hire the right consultants and use the best practices,
they assume they will be fine. This kind of wishful thinking only lengthens the
list of mediocre agile transformations. It would be more useful to understand the
most predictable problems with agile implementations and actively help your
organization avoid them.<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">With this in mind, I offer three questions you might ask to expose
some of the typical ways in which agile disappoints, along with the best
current approaches for avoiding these common agile pitfalls.<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<h3 style="page-break-after: avoid;">
<b><span style="font-size: 12.0pt; line-height: 107%; mso-bidi-font-size: 11.0pt;"><span style="font-family: "arial" , "helvetica" , sans-serif;">Question 1: Should you use Scrum or Continuous Delivery?</span></span></b></h3>
<div class="MsoNormal" style="page-break-after: avoid;">
<b><span style="font-size: 12.0pt; line-height: 107%; mso-bidi-font-size: 11.0pt;"><span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></span></b></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">This may come as a surprise, but quite frankly, Scrum says
nothing about how to develop software, nothing about how to deliver defect-free
code and nothing about techniques for faster production releases. Other agile
methodologies – especially the long lost Extreme Programming – have more to say
on these topics, but most agile transformations reserve little time for improving
the actual work involved in generating top notch software. Yet without a solid
foundation in the technology that produces great systems, agile is pretty
hollow. <o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">The technical heart of agile is embodied in the practices articulated by Jez Humble and Dave Farley in <a href="http://www.amazon.com/exec/obidos/ASIN/0321601912/poppendieckco-20">Continuous Delivery</a>: acceptance test-driven development; automated builds,
automated testing, automated database migration, and automated deployment; everyone
checks their code into the mainline at least daily (there are no branches!); the
mainline is ALWAYS production ready and is deployed very frequently (daily is
slow); release is by switch rather than by deployment. If you aren’t heading toward
these or similar technical practices and you think you are doing an agile
transformation, think again. Agile without a strong technology base is usually
a mistake.<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Start your agile transformation by acknowledging that
software development is a deeply technical endeavor leading to highly complex
systems. These systems behave like all complex systems – if you smash them with
a big change, all bets are off – you cannot predict the results. The only way
to have predictable, stable code bases is to modify them with small probes, observe
the results, modify the code and probe again. [Incidentally, a small probe is
not two weeks of work; it’s more like two hours of work.] If deploying small
probes to live systems is not at the core of your agile transformation
strategy, you are missing today’s most reliable tools for delivering stable
systems with predictable results.<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Yes, this means writing a lot more code. It means tests as
code, infrastructure as code, deployment as code. It means no one writes
production code until there is an acceptance test for it, written in an
executable language. It means teams can pretend they are working in a cloud
because the infrastructure they need is always available and can be provisioned as needed. It means that whole teams (which include everyone from product
to operations) retain responsibility for their code even after it goes live.
And it means that the most common way teams decide what to do next is to
examine feedback from the effects of their work in actual use. <o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">The technology enabling Continuous Delivery should be at the
core of any modern agile transformation because it has proven to be the safest
way for an organization to gain and maintain control of complex software
systems. If your agile transition team does not understand this technology,
then you are probably trying to switch to agile without adequate technical
leadership. This is not a good strategy. <o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Admittedly, Continuous Delivery is technically challenging,
but no more so than the many other challenges that technical teams deal with
every day. In fact, we have found that almost without exception, software engineers
love to work in a Continuous Delivery environment because of the challenge, the
discipline, the clarity, and the immediate feedback. One financial services
company told us that in the three years since their (large) IT department
switched to Continuous Delivery, they have had zero turnover, except for
emigration. Their transformation resulted in the most desirable jobs in the area.<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<h3>
<b><span style="font-size: 12.0pt; line-height: 107%; mso-bidi-font-size: 11.0pt;"><span style="font-family: "arial" , "helvetica" , sans-serif;">Question 2: Do you hire Developers or Engineers?</span></span></b></h3>
<div class="MsoNormal">
<b><span style="font-size: 12.0pt; line-height: 107%; mso-bidi-font-size: 11.0pt;"><span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></span></b></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">What title do you use for people who solve problems with
software? Years upon years ago, I was called a programmer and that was a high
status job. But once waterfall processes placed analysts between programmers
and their customers, the programmers were no longer expected to analyze
customer problems and solve them. The title “programmer” was downgraded to a
second-class job that mostly involved coding what someone else had written in a specification.
Over time a new term – developers – came into use and referred to a more
holistic job. But then, agile processes placed a product owner between
developers and customers, so developers were no longer expected to analyze customer
problems and solve them. Instead, they were given a prioritized list of relatively
small stories to estimate, code, and (hopefully) test. <o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">If you visit Silicon Valley these days you will find that software
developers have been replaced by software engineers. We can only hope that those
smart people who have this title will be presented with complete problems and expected
to engineer a solution. They will not be given specs, because whoever wrote the
spec designed the solution. They will not be given stories, because whoever
wrote the stories designed the solution. They will be given real problems –
customer problems, business problems, technical problems – and asked to
engineer a solution. They will be expected to implement the solution within
valid constraints and take responsibility for its success. Silicon Valley
companies understand that this is the kind of job that attracts the best
engineers.<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">If you want more effective recruiting in today’s very tight
talent market, don’t look for software developers or mention your agile
transformation. Look for software engineers and reliability engineers and make it clear that you expect
them to engineer effective solutions to meaningful problems. Then make sure
that your agile transformation makes this challenging work the responsibility
of your engineers, because most agile methodologies place it
elsewhere.<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<h3 style="page-break-after: avoid;">
<b><span style="font-size: 12.0pt; line-height: 107%; mso-bidi-font-size: 11.0pt;"><span style="font-family: "arial" , "helvetica" , sans-serif;">Question 3: How will you handle dependencies?</span></span></b></h3>
<div class="MsoNormal" style="page-break-after: avoid;">
<b><span style="font-size: 12.0pt; line-height: 107%; mso-bidi-font-size: 11.0pt;"><span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></span></b></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">I was astonished when I heard that after Amazon completed
its switch to services, the company no longer used central databases. How could
this possibly work? I thought it was self-evident that a single system of record is fundamental to the success of an enterprise – so how could Amazon possibly survive without
a central database? Either the information about abandoning central databases was wrong or Amazon
was doing something that defied all conventional wisdom.<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">It turns out that the second was correct – Amazon had
discovered something so obvious that it had escaped us for decades: A central
database is one humongous dependency generator. Ouch! Take a look at Sam Newman’s book <a href="http://www.amazon.com/exec/obidos/ASIN/1491950358/poppendieckco-20">Building
Microservices</a> – where the case is made that dependencies are among the
greatest evils in software development and central databases are among the most
pernicious creators of dependencies in the software world. It’s eye-opening. <o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">These days we see a lot of companies building microservices
– Netflix and realestate.com.au and Gilt and many more. Why? Because when they
experience extremely high volume, the code that handles this volume needs
constant attention and tuning. The only way to make that happen at scale is to
adopt a structure which allows individual teams to deploy their code – live to
production – independently of other teams. A microservice is exactly that – code
owned by one (small) team that designs, monitors, maintains, and deploys the
service – independent of other teams and other code. <o:p></o:p></span></div>
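A microservice of this kind can be sketched in a few lines. The sketch below is a toy illustration, not any particular company's implementation: it uses only Python's standard library, and the "catalog" service and its data are hypothetical. The point it shows is the key property described above – one small team's code, owning its own data, deployable on its own.

```python
# A minimal sketch of a single-team microservice (hypothetical "catalog"
# service), using only Python's standard library.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# The service owns its data outright -- no shared central database.
_catalog = {"sku-1": {"name": "solar panel", "in_stock": True}}

class CatalogHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Look up the item named by the request path, e.g. GET /sku-1
        item = _catalog.get(self.path.strip("/"))
        self.send_response(200 if item else 404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(item or {"error": "not found"}).encode())

    def log_message(self, *args):
        pass  # silence per-request logging in this demo

def serve():
    """Run the service on an ephemeral localhost port; return (server, port)."""
    server = HTTPServer(("127.0.0.1", 0), CatalogHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]
```

Because the service is reachable only through its HTTP interface, the owning team can change its internals – or its data store – without coordinating with anyone else.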
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">If this sounds a lot like something you’ve heard of before,
that’s because independent module deployment has been the dream of software development just about forever. A couple of decades
ago, object-oriented programming promised this nirvana, but it never quite
delivered. Now microservices are making the same promise, and there are
instances of them working pretty well. Of course, microservices are rather new
and the jury is still out. (See Martin Fowler’s summary of <a href="http://martinfowler.com/articles/microservices.html">Microservices</a>.) But
we know that for very high volume systems, independent deployment appears to be
mandatory and microservices seem to be the architecture of choice. Clearly
microservices are a viable way – but not the only way – to handle dependencies.<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">No matter what kind of system you have, dependencies must be
dealt with or else they will eventually haunt you. The Google code base started
out as a monolith which rapidly developed many dependencies, but fortunately, Google's engineers understood the danger. So they developed a dependency matrix to keep
track of code interactions, and whenever code was pushed to the test framework,
the new code and all of its dependencies were tested together – immediately. If
the test found problems, the code was reverted and everyone involved was
notified. New code was system-tested thousands of times a day, which required a
massive environment with a lot of automation. But it worked infinitely better
than manually testing large changes because it identified the precise cause of
potential problems before they happened. As expensive as it seems, it turns out
that testing each small change with its complete stack of dependent code is
better, cheaper, safer and faster than testing big batch releases the way we used to in the past.<o:p></o:p></span></div>
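The mechanism described above can be sketched as follows. This is a toy reconstruction, not Google's actual system: the module names and the hand-maintained dependents map are hypothetical, and a real implementation would derive the matrix from the build graph. The essential idea survives, though – given one changed module, compute every transitive dependent and run that whole set's tests before accepting the change.

```python
# A sketch of dependency-matrix test selection. Module names and the
# dependents map are hypothetical.
from collections import deque

# module -> modules that directly depend on it
DEPENDENTS = {
    "auth": ["api", "billing"],
    "api": ["web"],
    "billing": ["web"],
    "web": [],
}

def affected_modules(changed):
    """Return the changed module plus every transitive dependent --
    the set whose tests must pass before the change is accepted."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        mod = queue.popleft()
        for dep in DEPENDENTS.get(mod, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen
```

A change to a leaf module like `web` triggers only its own tests, while a change to a widely used module like `auth` triggers the tests of everything built on top of it – which is exactly why the precise cause of a failure is known immediately.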
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">“But how do we get from our legacy systems to that ideal
state?” we are often asked. Well, that is precisely the question your agile
transformation should answer. </span><span style="font-family: "arial" , "helvetica" , sans-serif;">There are plenty of places to look for ideas, because this is a path many companies have taken. To get started, Martin Fowler's <a href="http://www.martinfowler.com/bliki/StranglerApplication.html">Strangler Application</a> provides a general pattern for migrating away from legacy code, and several case studies can be found <a href="http://paulhammant.com/2013/07/14/legacy-application-strangulation-case-studies/">here</a>. However, there are no canned answers for dealing with legacy code; the problems are quite specific to each situation. You need good engineers to take up the challenge supported by leadership that appreciates the importance of the issue. </span><span style="font-family: "arial" , "helvetica" , sans-serif;">But the bottom line is that if an agile transformation does not provide a path from
smashing your system with big releases to probing it with tiny bits of code,
you have more homework to do before you get started. </span></div>
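As one concrete illustration of the Strangler Application pattern mentioned above, a routing facade can send migrated endpoints to new code while everything else still reaches the legacy system. This is a minimal sketch under assumed names: the endpoints are hypothetical and plain functions stand in for real services.

```python
# A sketch of the Strangler Application pattern: a facade routes migrated
# endpoints to new code, everything else to the legacy system.
def legacy_app(path):
    # Stand-in for the monolith that still handles most traffic.
    return "legacy handled %s" % path

def new_invoice_service(path):
    # Stand-in for the first piece of functionality carved out.
    return "new service handled %s" % path

# Grow this table one endpoint at a time as functionality is strangled
# out of the monolith; retire the legacy route when nothing maps to it.
MIGRATED = {"/invoices": new_invoice_service}

def route(path):
    return MIGRATED.get(path, legacy_app)(path)
```

The migration table is the whole trick: each entry added to it is a small, independently deployable step away from the legacy system, with an easy rollback (delete the entry) if the new code misbehaves.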
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">We have learned a lot about how to deal with dependencies
over the last few years. We can do it with an architecture that isolates
dependencies – perhaps microservices – or by automatically testing the complete
system of dependent code after every small change. We know we should NOT deal
with dependencies by consuming the last third of a release cycle with system
testing (and fixing) the way we used to in the waterfall days. And we know it
does not make sense to automate tests just to make this back-end testing go
faster – a mistake we have seen frequently that you want to avoid. Test
automation should be aimed at defect prevention, not defect discovery. Preventing
defects <i><u>as the code is written</u></i>
pays for itself. Many times over. Every time. <o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<h3 style="page-break-after: avoid;">
<b><span style="font-size: 12.0pt; line-height: 107%; mso-bidi-font-size: 11.0pt;"><span style="font-family: "arial" , "helvetica" , sans-serif;">Ask the Right Questions</span></span></b></h3>
<div class="MsoNormal" style="page-break-after: avoid;">
<b><span style="font-size: 12.0pt; line-height: 107%; mso-bidi-font-size: 11.0pt;"><span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></span></b></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">If you are one of those conservative organizations that is
just getting around to an agile transformation, be sure you ask the right
questions before you take the leap. Remember that typical agile practices are
just table stakes. You need to know how to play the complex systems game, a
deeply technical game played by very smart engineers. Don’t insult their
intelligence if you want to engage them. <o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Understand that dependencies cause most defects and fragile
code bases, and they also lead to tangled organizational structures. Really. If
you’re skeptical, check out <a href="https://en.wikipedia.org/wiki/Conway%27s_law">Conway’s Law</a>. Get your
technical and architectural act together, as well as your strategy for dealing
with dependencies, before you begin. This may prompt you to consider an
organizational change as part of the transformation. <o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">When you are ready to start, be sure to articulate the specific
business goals the agile transition will help achieve and how you will measure
the agile transition’s contribution to these goals. Then challenge your smart
people to figure out how to move those metrics – and your transition will be
off to a good start.<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: "arial" , "helvetica" , sans-serif;">As an industry, we know how to do this. Your colleagues have
done it. You may as well avoid the pitfalls they have discovered. Start by asking a few questions.<o:p></o:p></span></div>
<div class="MsoNormal" style="page-break-after: avoid;">
<br /></div>
<br />
<div class="MsoNormal">
<br /></div>
Mary Poppendieck<br />
Lean Software Development: The Backstory (June 5, 2015)<h2>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUuY65hqZO5eDaGlBDQ5kn5ruaQNdNSVP6_uOIPzJCinMOKfO5S5zdOasMVB-k68AO5uZ9aTpj8SbYDiSmh2tbT38hpl55frL8BwGDlj1QsEPyvNncgqxEVhT7H-1RhYpflidqlJGEx2Bv/s1600/Speedy.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUuY65hqZO5eDaGlBDQ5kn5ruaQNdNSVP6_uOIPzJCinMOKfO5S5zdOasMVB-k68AO5uZ9aTpj8SbYDiSmh2tbT38hpl55frL8BwGDlj1QsEPyvNncgqxEVhT7H-1RhYpflidqlJGEx2Bv/s200/Speedy.png" width="191" /></a></h2>
We were in a conference room near the Waterfront in Cape Town. “I just lost a crown from one of my teeth,” my husband Tom declared just before I was scheduled to open the conference. Someone at our table responded, “You’re lucky, Cape Town has some of the best dentists in the world.” It didn’t feel very lucky; Cape Town was the first stop on a ten-week trip to Africa, Europe, and Australia.<br />
<br />
The situation was eerily familiar. A year earlier a chip had cracked off my tooth as I ate a pizza in Lima, the first stop of a ten-week trip to South America. I ate gingerly during the rest of the trip, worried that the tooth would crack further. Luckily I made it back home with no pain and little additional damage. Once there, it took three days to get a dentist appointment. The dentist made an impression of the gap in my tooth and fashioned a temporary crown. “This will have to last for a week or two,” she said. “If it falls out, just stick it back in and be more careful what you eat.” Luckily the temporary crown held, and ten days later a permanent crown arrived from the lab. Two weeks after we arrived home, my tooth was fixed.<br />
<br />
We were scheduled to be in Cape Town for only two days. How was Tom going to get a crown replaced in two days? A small committee formed. Someone did a phone search; apparently the Waterfront was a good place to find dentists. A call was made. “You can go right now – the dental office is nearby. Do you want someone to walk you over?” As Tom headed out the door with an escort, I got ready for my presentation. Half way through the talk, I saw Tom return and signal that all was well.<br />
<br />
“I lost a part of my tooth, not just the crown,” Tom told me after the talk. “I’m supposed to return at 3:30 this afternoon; I should have a new crown by the end of the day.” The dentist had a mini-lab in his office. Instead of making a temporary crown, he used a camera to take images of the broken tooth and adjacent teeth. The results were combined into a 3D model of the crown to which the dentist made a few adjustments. Then he selected a ceramic blank that matched the color of Tom’s teeth and put it in a milling machine. With the push of a button, instructions to make the crown were loaded into the machine. Cutters whirled and water squirted to keep the ceramic cool. Ten minutes later the crown was ready to cement in place. Ninety minutes after he arrived that afternoon and eight hours after the incident, Tom walked out of the dental office with a new permanent crown. It cost approximately the same amount as my crown had cost a year earlier.<br />
<br />
<h3>
<i><span style="font-size: large;">Lean is about Flow Efficiency</span></i></h3>
The book<i> This is Lean</i> (Modig and Ahlström, 2013) describes “lean” as a relentless focus on efficiency – but not the kind of efficiency that cuts staff and money, nor the kind of efficiency that strives to keep every resource busy all of the time. In fact, a focus on resource efficiency will almost always destroy overall efficiency, the authors contend, because fully utilized machines (and people) create huge traffic jams, which end up creating a lot of extra work. Instead, Modig and Ahlström demonstrate that lean is about flow efficiency – that is, the efficiency with which a unit of work (a flow unit) moves through the system.<br />
<br />
Consider our dental experience. It took two weeks for me to get a new crown, but in truth, only an hour and a half of that time was needed to actually fix the tooth; the rest of the time was mostly spent waiting. My flow efficiency was 1.5 ÷ 336 hours (two weeks) = 0.45%. On the other hand, Tom’s tooth was replaced in eight hours – 42 times faster – giving him a flow efficiency of 1.5 ÷ 8 = 18.75%.<br />
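The arithmetic is simple enough to state as a one-line formula: flow efficiency is value-adding time divided by total lead time, in the same units. A quick check of the two cases above:

```python
def flow_efficiency(value_add_hours, lead_time_hours):
    """Fraction of elapsed time spent actually working on the flow unit."""
    return value_add_hours / lead_time_hours

mary = flow_efficiency(1.5, 14 * 24)  # 1.5 hours of work in two weeks of lead time
tom = flow_efficiency(1.5, 8)         # the same 1.5 hours of work in eight hours
```

The work content is identical in both cases; only the waiting differs, which is why the ratio – not the effort – is what lean attends to.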
<br />
In my case, the dental system was focused on the efficiency of the lab’s milling machine – no doubt an expensive piece of equipment. But add up all of the extra costs: a cast of the crown for the lab, a temporary crown for me, two separate hour-long sessions with the dentist, plus all of the associated logistics – scheduling, shipping, tracking, etc. In Tom’s case, the dental system was focused on the speed with which it could fix his tooth – which was good for us, because a long wait for a crown was not an option. True, the milling machine in the dentist’s office sits idle much of each day. (The dentist said he has to replace two crowns a day to make it economically feasible.) But when you add up the waste of temporary crowns, the piles of casts waiting for a milling machine, and the significant cost of recovering from a mistake – an idle milling machine makes a lot of sense.<br />
<br />
What does flow efficiency really mean? Assume you have a camera and efficiency means keeping the camera busy – always taking a picture of some value-adding action. Where do you aim your camera? In the case of resource efficiency, the camera is aimed at the resource – the milling machine – and keeping it busy is of the utmost importance. In the case of flow efficiency, the camera is on the flow unit – Tom – and work on replacing his crown is what counts. The fundamental mental shift that lean requires is this: flow efficiency trumps resource efficiency almost all of the time.<br />
<br />
<h2>
<span style="color: #0b5394; font-size: large;">Lean Product Development: </span><span style="color: #0b5394; font-size: large;">The Predecessor</span></h2>
During the 1980’s Japanese cars were capturing market share at a rate that alarmed US automakers. In Boston, both MIT and Harvard Business School responded by launching extensive studies of the automotive industry. In 1990 the MIT research effort resulted in the now classic book <i>The Machine that Changed the World: the Story of Lean Production</i> (Womack et al., 1990), which gave us the term “lean.” A year later, Harvard Business School published <i>Product Development Performance</i> (Clark and Fujimoto, 1991), and the popular book <i>Developing Products in Half the Time</i> (Smith and Reinertsen, 1991) was released. These two 1991 books are foundational references on what came to be called “lean product development,” although the term “lean” would not be associated with product development for another decade.<br />
<br />
Clark and Fujimoto documented the fact that US and European volume automotive producers took three times as many engineering hours and 50% more time to develop a car compared to Japanese automakers, yet the Japanese cars had substantially higher quality and cost less to manufacture. Clearly the Japanese product development process produced better cars faster and at lower cost than typical western development practices of the time. Clark and Fujimoto noted that the distinguishing features of Japanese product development paralleled features found in Japanese automotive production. For example, Japanese product development focused on flow efficiency, reducing information inventory, and learning based on early and frequent feedback from downstream processes. By contrast, product development in western countries focused on resource efficiency, completing each phase of development before starting the next, and following the original plan with as little variation as possible.<br />
<br />
In 1991 the University of Michigan began its Japan Technology Management Program. Over the next several years, faculty and associate members included Jeffrey Liker, Allen Ward, Durward Sobek, John Shook, and Mike Rother. This group has published numerous books and articles on lean thinking, lean manufacturing, and lean product development, including <i>The Toyota Product Development System </i>(Morgan and Liker, 2006), and<i> Lean Product and Process Development </i>(Ward, 2007). The second book summarizes the essence of lean product development this way:<br />
<ol>
<li>Understand that knowledge creation is the essential work of product development.</li>
<li>Charter a team of responsible experts led by an entrepreneurial system designer.</li>
<li>Manage product development using the principles of cadence, flow, and pull.</li>
</ol>
It is important to recognize that even though lean product development is based on the same principles as lean production, the practices surrounding development are, quite frankly, not the same as those considered useful in production. In fact, transferring lean practices from manufacturing to development has led to some disastrous results. For example, lean production emphasizes reducing variation – exactly the wrong thing to do in product development. The western practice of following a plan and measuring variance from a plan is often justified by the slogan “Do it right the first time.” Unfortunately, this approach does not allow for learning; it confines designs to those conceived when the least amount of knowledge is available. A fundamental practice in lean product development is to create variation (not avoid it) in order to explore the impact of multiple approaches. (This is called set-based engineering.) <br />
<br />
The critical thing to keep in mind is that knowledge creation is the essential work of product development. While lean production practices support learning about and improving the manufacturing process, their goal is to minimize variation in the product. This is not appropriate for product development, where variation is an essential element of the learning cycles that are the foundation of good product engineering. Thus instead of copying lean manufacturing practices, lean product development practices must evolve from a deep understanding of fundamental lean principles adapted to a development environment.<br />
<br />
<h3>
<i>Lean Software Development: A Subset of Lean Product Development</i></h3>
In 1975, computers were large, expensive, and rare. Software for these large machines was developed in the IT departments of large companies and dealt largely with the logistics of running the company – payroll, order processing, inventory management, etc. But as mainframes morphed into minicomputers, personal computers, and microprocessors, it became practical to enhance products and services with software. Then the internet began to invade the world, and it eventually became the delivery mechanism for a large fraction of the software being developed today. As software moved from supporting business processes to enabling smart products and becoming the essence of services, software engineers moved from IT departments to line organizations where they joined product teams.<br />
<br />
Today, most software development is not a stand-alone process, but rather a part of developing products or services. Thus lean software development might be considered a subset of lean product development; certainly the principles that underpin lean product development are the same principles that form the basis of lean software development.<br />
<br />
<h2>
<span style="color: #0b5394; font-size: large;">Agile and Lean Software Development: 2000 - 2010</span></h2>
It’s hard to believe these days, but in the mid-1990s, developing software was a slow and painful process found in the IT departments of large corporations. As the role of software expanded and software engineers moved into line organizations, reaction against the old methods grew. In 1999, Kent Beck proposed a radically new approach to software development in the book <i>Extreme Programming Explained</i> (Beck, 1999). In 2001 the Agile Manifesto (Beck et al., 2001) gave this new approach a name – “Agile.”<br />
<br />
In 2003, the book <i>Lean Software Development </i>(Poppendieck, 2003) merged lean manufacturing principles with agile practices and the latest product development thinking, particularly from the book <i>Managing the Design Factory</i> (Reinertsen, 1997). Lean software development was presented as a set of principles that form a theoretical framework for developing and evolving agile practices:<br />
<ol>
<li>Eliminate waste</li>
<li>Amplify learning</li>
<li>Decide as late as possible</li>
<li>Deliver as fast as possible</li>
<li>Empower the team</li>
<li>Build quality in</li>
<li>See the whole</li>
</ol>
Although the principles of lean software development are consistent with lean manufacturing and (especially) lean product development, the specific practices that emerged were tailored to a software environment and aimed at the flaws in the prevailing software development methodologies. One of the biggest flaws at the time was the practice of moving software sequentially through the typical stages of design, development, test, and deployment – with handovers of large inventories of information accumulating at each stage. This practice left testing and integration at the end of the development chain, so defects went undetected for weeks or months before they were discovered. Typical sequential processes reserved a third of a release cycle for testing, integration, and defect removal. The idea that it was possible to “build quality in” was not considered a practical concept for software.<br />
<br />
To counter sequential processes and the long integration and defect removal phase, agile software development practices focused on fast feedback cycles in these areas:<br />
<ol>
<li><b>Test-driven development:</b> Start by writing tests (think of them as executable specifications) and then write the code to pass the tests. Put the tests into a test harness for ongoing code verification.<br />
</li>
<li><b>Continuous integration:</b> Integrate small increments of code changes into the code base frequently – multiple times a day – and run the test harness to verify that the changes have not introduced errors.<br />
</li>
<li><b>Iterations:</b> Develop working software in iterations of two to four weeks; review the software at the end of each iteration and make appropriate adjustments.<br />
</li>
<li><b>Cross-functional teams:</b> Development teams should include customer proxies and testers as well as developers to minimize handovers.</li>
</ol>
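The first of these practices – writing the executable specification before the code – can be shown in miniature. The discount function below is a hypothetical example, assumed purely for illustration; the pattern is what matters: the test exists first, fails until the code satisfies it, and then stays in the harness to be rerun on every integration.

```python
# A miniature of test-driven development: the executable specification
# is written first, then the code is written to pass it. The "discount"
# business rule is a hypothetical example.
def spec_discount():
    # Executable specification, written before the implementation.
    assert discount(total=100, loyal=True) == 90
    assert discount(total=100, loyal=False) == 100

def discount(total, loyal):
    """Implementation written to satisfy the spec: loyal customers get 10% off."""
    return total * 0.9 if loyal else total

spec_discount()  # kept in the harness and rerun at every integration
```

Continuous integration is then just the discipline of running every such spec against the merged code base several times a day, so a change that breaks any spec is caught within hours rather than at release time.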
During its first decade, agile development moved from a radical idea to a mainstream practice. This was aided by the widespread adoption of Scrum, an agile methodology which institutionalized the third and fourth practices listed above, but unfortunately omitted the first two practices.<br />
<br />
<h3>
<i>The Difference between Lean and Agile Software Development</i></h3>
When it replaced sequential development practices typical at the time, agile software development improved the software development process most of the time – in IT departments as well as product development organizations. However, the expected organizational benefits of agile often failed to materialize because agile focused on optimizing software development, which frequently was not the system constraint. Lean software development differed from agile in that it worked to optimize flow efficiency across the entire value stream “from concept to cash.” (Note the subtitle of the book <i>Implementing Lean Software Development: From Concept to Cash </i>(Poppendieck, 2006)). This end-to-end view was consistent with the work of Taiichi Ohno, who said:<br />
<blockquote class="tr_bq">
<i>“All we are doing is looking at the time line, from the moment the customer gives us an order to the point when we collect the cash. And we are reducing that time line by removing the non-value-added wastes.”</i> (Ohno, 1988. p ix)</blockquote>
Lean software development came to focus on these areas:<br />
<ol>
<li><b style="font-weight: bold;">Build the right thing:</b><b> </b>Understand and deliver real value to real customers. </li>
<li><b style="font-weight: bold;">Build it fast:<i> </i></b>Dramatically reduce the lead time from customer need to delivered solution. </li>
<li><b style="font-weight: bold;">Build the thing right: </b>Guarantee quality and speed with automated testing, integration and deployment. </li>
<li><span style="font-weight: bold;"><b>Learn through feedback:</b> </span>Evolve the product design based on early and frequent end-to-end feedback.</li>
</ol>
Let’s take a look at each principle in more detail:<br />
<br />
<h4>
<span class="Apple-tab-span" style="white-space: pre;">1. </span>Understand and deliver real value to real customers.</h4>
A software development team working with a single customer proxy has one view of the customer interest, and often that view is not informed by technical experience or feedback from downstream processes (such as operations). A product team focused on solving real customer problems will continually integrate the knowledge of diverse team members, both upstream and downstream, to make sure the customer perspective is truly understood and effectively addressed. Clark and Fujimoto call this “integrated problem solving” and consider it an essential element of lean product development.<br />
<br />
<h4>
<span class="Apple-tab-span" style="white-space: pre;">2. </span>Dramatically reduce the lead time from customer need to delivered solution.</h4>
A focus on flow efficiency is the secret ingredient of lean software development. How long does it take for a team to deploy into production a single small change that solves a customer problem? Typically it can take weeks or months – even when the actual work involved consumes only an hour. Why? Because subtle dependencies among various areas of the code make it probable that a small change will break other areas of the code; therefore it is necessary to deploy large batches of code as a package after extensive (usually manual) testing. In many ways the decade of 2000-2010 was dedicated to finding ways to break dependencies, automate the provisioning and testing processes, and thus allow rapid independent deployment of small batches of code.<br />
<br />
<h4>
<span class="Apple-tab-span" style="white-space: pre;">3. </span>Guarantee quality and speed with automated testing, integration and deployment.</h4>
It was exciting to watch the expansion of test-driven development and continuous integration during the decade of 2000-2010. First these two critical practices were applied at the team level – developers wrote unit tests (which were actually technical specifications) and integrated them immediately into their branch of the code. Test-driven development expanded to writing executable product specifications in an incremental manner, which moved testers to the front of the process. This proved more difficult than automated unit testing, and precipitated a shift toward testing modules and their interactions rather than end-to-end testing. Once the product behavior could be tested automatically, code could be integrated into the overall system much more frequently during the development process – preferably daily – so software engineers could get rapid feedback on their work.<br />
<br />
Next the operations people got involved and automated the provisioning of environments for development, testing, and deployment. Finally teams (which now included operations) could automate the entire specification, development, test, and deployment processes – creating an automated deployment pipeline. There was initial fear that more rapid deployment would cause more frequent failure, but exactly the opposite happened. Automated testing and frequent deployment of small changes meant that risk was limited. When errors did occur, detection and recovery were much faster and easier, and the team became a lot better at it. Far from increasing risk, it is now known that deploying code frequently in small batches is the best way to reduce risk and increase the stability of large complex code bases.<br />
<br />
<h4>
<span class="Apple-tab-span" style="white-space: pre;">4. </span>Evolve the product design based on early and frequent end-to-end feedback.</h4>
To cap these remarkable advancements, once product teams could deploy multiple times per day they began to close the loop with customers. Through canary releases, A/B testing, and other techniques, product teams learned from real customers which product ideas worked and how to fine tune their offerings for better business results. <br />
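One common way to implement such a split is deterministic hash-based bucketing (a generic sketch, not any particular company's implementation): each user always lands in the same bucket for a given experiment, so behavior in the canary group can be compared against the control group over time.

```python
import hashlib

def ab_bucket(user_id: str, experiment: str, canary_percent: int = 10) -> str:
    """Deterministically assign a user to 'canary' or 'control'.

    Hashing (experiment, user) gives a stable, roughly uniform value in 0..99;
    users below the threshold see the new version.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "canary" if bucket < canary_percent else "control"

# The same user gets the same answer every time:
assert ab_bucket("alice", "new-checkout") == ab_bucket("alice", "new-checkout")
```

Raising `canary_percent` gradually from 1 to 100 turns the same mechanism into a staged canary release.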
<br />
When these four principles guided software development in product organizations, significant business-wide benefits were achieved. However, IT departments found it difficult to adopt the principles because they required changes that lay beyond the span of control of most IT organizations.<br />
<br />
<h2>
<span style="color: #0b5394; font-size: large;">Lean Software Development: 2010 - 2015</span></h2>
2010 saw the publication of two significant books about lean software development. David Anderson’s book <i>Kanban</i> (Anderson, 2010) presented a powerful visual method for managing and limiting work-in-process (WIP). Just at the time when two-week iterations began to feel slow, Kanban gave teams a way to increase flow efficiency while providing situational awareness across the value stream. Jez Humble and Dave Farley’s book <i>Continuous Delivery</i> (Humble and Farley, 2010) walked readers through the steps necessary to achieve automated testing, integration and deployment, making daily deployment practical for many organizations. A year later, Eric Ries’s book <i>The Lean Startup</i> (Ries, 2011) showed how to use the rapid feedback loop created by continuous delivery to run experiments with real customers and confirm the validity of product ideas before incurring the expense of implementation.<br />
<br />
Over the next few years, the ideas in these books became mainstream and the limitations of agile software development (software-only perspective and iteration-based delivery) were gradually expanded to include a wider part of the value stream and a more rapid flow. A grassroots movement called DevOps worked to make automated provision-code-build-test-deployment pipelines practical. Cloud computing arrived, providing easy and automated provisioning of environments. Cloud elements (virtual machines, containers), services (storage, analysis, etc.) and architectures (microservices) made it possible for small services and applications to be easily and rapidly deployed. Improved testing techniques (simulations, contract assertions) have made error-free deployments the norm.<br />
<br />
<h3>
<i>The State of Lean Software Development in 2015</i></h3>
Today’s successful internet companies have learned how to optimize software development over the entire value stream. They create full stack teams that are expected to understand the consumer problem, deal effectively with tough engineering issues, try multiple solutions until the data shows which one works best, and maintain responsibility for improving the solution over time. Large companies with legacy systems have begun to take notice, but they struggle with moving from where they are to the world of thriving internet companies.<br />
<br />
Lean principles are a big help for organizations that want to move from old development techniques to modern software approaches. For example, (Calçado, 2015) shows how classic lean tools – Value Stream Mapping and problem solving with Five Whys – were used to increase flow efficiency at Soundcloud, leading over time to a microservices architecture. In fact, focusing on flow efficiency is an excellent way for an organization to discover the most effective path to a modern technology stack and development approach.<br />
<br />
For traditional software development, flow efficiency is typically lower than 10%; agile practices usually bring it up to 30 or 40%. But in thriving internet companies, flow efficiency approaches 70% and is often quite a bit higher. Low flow efficiencies are caused by friction – in the form of batching, queueing, handovers, and delayed discovery of defects, as well as misunderstanding of consumer problems and changes in those problems during long resolution times. Improving flow efficiency involves identifying and removing the biggest sources of friction from the development process.<br />
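Flow efficiency itself is a simple ratio: value-adding time divided by total elapsed lead time. A small Python sketch with illustrative numbers (the hour counts below are invented to show the scale of the problem):

```python
def flow_efficiency(value_adding_hours: float, total_lead_hours: float) -> float:
    """Fraction of elapsed lead time spent actually working on the item."""
    if total_lead_hours <= 0:
        raise ValueError("lead time must be positive")
    return value_adding_hours / total_lead_hours

# One hour of real work that takes three weeks (120 working hours) to deliver:
assert round(flow_efficiency(1, 120), 3) == 0.008   # under 1% -- mostly waiting
# The same hour of work delivered the next day:
assert flow_efficiency(1, 8) == 0.125
```

Measured this way, most of the lead time in a low-efficiency process is queueing and waiting, which is why removing friction matters more than working faster.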
<br />
Modern software development practices – the ones used by successful internet companies – address the friction in software development in a very particular way. The companies start by looking for the root causes of friction, which usually turn out to be 1) misunderstanding of the customer problem, 2) dependencies in the code base and 3) information and time lost during handovers and multitasking. Therefore they focus on three areas: 1) understanding the consumer journey, 2) architecture and automation to expose and reduce dependencies, and 3) team structures and responsibilities. Today (2015), lean development in software usually focuses on these three areas as the primary way to increase efficiency, assure quality, and improve responsiveness in software-intensive systems.<br />
<br />
<h4>
1.<span class="Apple-tab-span" style="white-space: pre;"> </span>Understand the Customer Journey.</h4>
Software-intensive products create a two-way path between companies and their consumers. A wealth of data exists about how products are used, how consumers react to a product’s capabilities, opportunities to improve the product, and so on. Gathering this data and analyzing it has become an essential capability for companies far beyond the internet world: car manufacturers, mining equipment companies, retail stores and many others gather and analyze “Big Data” to gain insights into consumer behavior. The ability of companies to understand their consumers through data has changed the way products are developed. (Porter, 2015) No longer do product managers (or representatives from “the business”) develop a roadmap and give a prioritized list of desired features to an engineering team. Instead, data scientists work with product teams to identify themes to be explored. Then the product teams identify consumer problems surrounding the theme and experiment with a range of solutions. Using rapid deployment and feedback capabilities, the product team continually enhances the product, measuring its success by business improvements, not feature completion.<br />
<br />
<h4>
2.<span class="Apple-tab-span" style="white-space: pre;"> </span>Architecture and Automation.</h4>
Many internet companies, including Amazon, Netflix, eBay, realestate.com.au, Forward, Twitter, PayPal, Gilt, Bluemix, Soundcloud, The Guardian, and even the UK Government Digital Service have evolved from monolithic architectures to microservices. They found that certain areas of their offerings need constant updating to deal with a large influx of customers or rapid changes in the marketplace. To meet this need, relatively small services are assigned to small teams which then split their services off from the main code base in such a way that each service can be deployed independently. A service team is responsible for changing and deploying the service as often as necessary (usually very frequently), while ensuring that the changes do not break any upstream or downstream services. This assurance is provided by sophisticated automated testing techniques as well as automated incremental deployment.<br />
<br />
Other internet companies, including Google and Facebook, have maintained existing architectures but developed sophisticated deployment pipelines that automatically send each small code change through a series of automated tests with automatic error handling. The deployment pipeline culminates in safe deployments which occur at very frequent intervals; the more frequent the deployment, the easier it is to isolate problems and determine their cause. In addition, these automation tools often contain dependency maps so that feedback on failures can be sent directly to the responsible engineers and offending code can be automatically reverted (taken out of the pipeline in a safe manner).<br />
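The pipeline-with-automatic-revert idea can be sketched abstractly in Python (the stage names and the dict-based change record are invented for illustration; real pipelines run actual builds, test suites, and canary health checks): each small change passes through ordered gates, and the first failure stops the change and reverts it rather than letting it reach production.

```python
def run_pipeline(change: dict, stages: list) -> tuple:
    """Push one small change through ordered automated stages.

    Each stage is a predicate; the first failure ends the run and the
    change is reverted, with the failing stage reported for feedback.
    """
    for stage in stages:
        if not stage(change):
            return ("reverted", stage.__name__)
    return ("deployed", None)

# Hypothetical stage checks standing in for real build/test/canary gates:
def unit_tests(change):    return change.get("units_pass", False)
def integration(change):   return change.get("integration_pass", False)
def canary_health(change): return change.get("canary_ok", False)

stages = [unit_tests, integration, canary_health]

assert run_pipeline(
    {"units_pass": True, "integration_pass": True, "canary_ok": True},
    stages) == ("deployed", None)
assert run_pipeline(
    {"units_pass": True, "integration_pass": False},
    stages) == ("reverted", "integration")
```

Reporting which stage failed is the hook for the dependency maps mentioned above: the pipeline can route the failure straight to the engineers responsible for the offending code.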
<br />
These architectural structures and automation tools are a key element in a development approach that uses Big Data combined with extremely rapid feedback to improve the consumer journey and solve consumer problems. They are most commonly found in internet companies, but are being used in many others, including organizations that develop embedded software. (See case study, below.)<br />
<br />
<h4>
3.<span class="Apple-tab-span" style="white-space: pre;"> </span>Team Structures and Responsibilities.</h4>
When consumer empathy, data analytics and very rapid feedback are combined, there is one more point of friction that can easily reduce flow efficiency. If an organization has not delegated responsibility for product decisions to the team involved in the rapid feedback loop, the benefits of this approach are lost. In order for such feedback loops to work, teams with a full stack of capabilities must be given responsibility to make decisions and implement immediate changes based on the data they collect. Typically such teams include people with product, design, data, technology, quality, and operations backgrounds. They are responsible for improving a set of business metrics rather than delivering a set of features. An example of this would be the UK Government Digital Service (GDS), where teams are responsible for delivering improvements in four key areas: cost per transaction, user satisfaction, transaction completion rate, and digital take-up.<br />
<br />
It is interesting to note that UK law makes it difficult to base contracts on such metrics, so GDS staffs internal teams with designers and software engineers and makes them responsible for the metrics. Following this logic to its conclusion, the typical approach of IT departments – contracting with their business colleagues to deliver a pre-specified set of features – is incompatible with full stack teams responsible for business metrics. In fact, it is rare to find separate IT departments in companies founded after the mid-1990s (which includes virtually all internet companies). Instead, these newer companies place their software engineers in line organizations, reducing the friction of handovers between organizations.<br />
<br />
In older organizations, IT departments often find it difficult to adopt modern software development approaches because they have inherited monolithic code bases intertwined with deep dependencies that introduce devious errors and thwart independent deployment of small changes. One major source of friction is the corporate database, once considered essential as the single source of truth about the business, but now under attack as a massive dependency generator. Another source of friction is outsourced applications, where even small changes are difficult and knowledge of how to make them no longer resides in the company. But perhaps the biggest source of friction in IT departments is the distance between their technical people and the company’s customers. Because most IT departments view their colleagues in line businesses as their customers, the technical people in IT lack a direct line of sight to the real customers of the company. Therefore, insightful trade-offs and innovative solutions struggle to emerge.<br />
<br />
<h2>
<span style="color: #0b5394; font-size: large;">
The Future of Lean Software Development</span></h2>
The world-wide software engineering community has developed a culture of sharing innovative ideas, in stark contrast to the more common practice of keeping intellectual property and internally developed tools proprietary. The rapid growth of large, reliable, secure software systems can be directly linked to the fact that software engineers routinely contribute to and build upon the work of their world-wide colleagues through open source projects and repositories like GitHub. This reflects the longstanding practices of the academic world but is strikingly unique in the commercial world. Because of this intense industry-wide knowledge sharing, methods and tools for building highly reliable complex software systems have advanced extraordinarily quickly and are widely available.<br />
<br />
As long as the software community continues to leverage its knowledge-sharing culture it will continue to grow rapidly, because sophisticated solutions to seemingly intractable problems eventually emerge when many minds are focused on the problem. The companies that will benefit the most from these advances are the ones that not only track new techniques as they are being developed, but also contribute their own ideas to the knowledge pool.<br />
<br />
As microservice architectures and automated deployment pipelines become common, more companies will adopt these practices, some earlier and some later, depending on their competitive situation. The most successful software companies will continue to focus like a laser on delighting customers, improving the flow of value, and reducing risks. They will develop (and release as open source) an increasingly sophisticated set of tools that make software development easier, faster, and more robust. Thus a decade from now there will be significant improvements in the way software is developed and deployed. The Lean principles of understanding value, increasing flow efficiency, eliminating errors, and learning through feedback will continue to drive the evolution, but the term “lean” will disappear as it becomes “the way things are done.”<br />
<div>
<br /></div>
<br />
<h2 style="text-align: center;">
<span style="color: #660000; font-size: large;">— Case Study —</span></h2>
<h3>
<i>Hewlett Packard LaserJet Firmware</i></h3>
The HP LaserJet firmware department had been the bottleneck of the LaserJet product line for a couple of decades, but by 2008 the situation had turned desperate. Software was increasingly important for differentiating the printer line, but the firmware department simply could not keep up with the demand for more features. Department leaders tried to spend their way out of the problem, but more than doubling the number of engineers did little to help. So they decided to engineer a solution to the problem by reengineering the development process.<br />
<br />
The starting point was to quantify exactly where all the engineers’ time was going. Fully half of the time went to updating existing LaserJet printers or porting code between different branches that supported different versions of the product. A quarter of the time went to manual builds and manual testing, yet despite this investment, developers had to wait for days or weeks after they made a change to find out if it worked. Another twenty percent of the time went to planning how to use the five percent of time that was left to do any new work. The reengineered process would have to radically reduce the effort needed to maintain existing firmware, while seriously streamlining the build and test process. The planning process could also use some rethinking.<br />
<br />
It’s not unusual to see a technical group use the fact that they inherited a messy legacy code base as an excuse to avoid change. Not in this case. As impossible as it seemed, a new architecture was proposed and implemented that allowed all printers – past, present and even future – to operate off of the same code branch, determining printer-specific capabilities dynamically instead of having them embedded in the firmware. Of course this required a massive change, but the department tackled one monthly goal after another and gradually implemented the new architecture. But changing the architecture would not solve the problem if the build and test process remained slow and cumbersome, so the engineers methodically implemented techniques to streamline that process. In the end, a full regression test – which used to take six weeks – was routinely run overnight. Yes, this involved a large amount of hardware, simulation and emulation, and yes it was expensive. But it paid for itself many times over.<br />
<br />
During the recession of 2008 the firmware department was required to return to its previous staffing levels. Despite a 50% headcount reduction, there was a 70% reduction in cost per printer program once the new architecture and automated provisioning system were in place in 2011. At that point there was a single code branch and twenty percent of engineering time was spent maintaining the branch and supporting existing products. Thirty percent of engineering time was spent on the continuous delivery infrastructure, including build and test automation. Wasted planning time was reclaimed by delaying speculative decisions and making choices based on short feedback loops. And there was something to plan for, because over forty percent of the engineering time was available for innovation.<br />
<br />
This multi-year transition was neither easy nor cheap, but it absolutely was worth the effort. If you would like more detail, see (Gruver et al, 2013).<br />
<br />
A more recent case study of how the software company Paddy Power moved to continuous delivery can be found in (Chen, 2015). In this case study the benefits of continuous delivery are listed: improved customer satisfaction, accelerated time to market, building the right product, improved product quality, reliable releases, and improved productivity and efficiency. There is really no downside to continuous delivery. Of course it is a challenging engineering problem that can require significant architectural modifications to existing code bases as well as sophisticated pipeline automation. But technically, continuous delivery is no more difficult than other problems software engineers struggle with every day. The real stumbling block is the change in organizational structure and mindset required to achieve serious improvements in flow efficiency.<br />
<h2>
</h2>
<h2 style="text-align: center;">
<span style="color: #660000; font-size: large;">— End Case Study —</span></h2>
<div>
<span style="color: #660000; font-size: x-small;"><br /></span></div>
<h2>
<span style="color: #073763; font-size: large;">Credit</span></h2>
This essay is a preprint of the author’s original manuscript of a chapter to be published in Netland and Powell (eds.) (2016), <i>Routledge Companion to Lean Management</i>.<br />
<h2>
<br />
</h2>
<h2>
<span style="color: #073763; font-size: large;">References</span></h2>
Anderson, David. <i>Kanban</i>, Blue Hole Press, 2010<br />
<br />
Beck, Kent. <i>Extreme Programming Explained</i>, Addison-Wesley, 2000<br />
<br />
Beck, Kent et al.<i> Manifesto for Agile Software Development</i>, <a href="http://agilemanifesto.org/">http://agilemanifesto.org/</a>, 2001<br />
<br />
Calçado, Phil. How we ended up with microservices. <a href="http://philcalcado.com/2015/09/08/how_we_ended_up_with_microservices.html">http://philcalcado.com/2015/09/08/how_we_ended_up_with_microservices.html</a>, 2015<br />
<br />
Chen, Lianping. "Continuous Delivery: Huge Benefits but Challenges Too" <i>IEEE Software</i> 32 (2). 50-54. 2015<br />
<br />
Clark, Kim B. and Takahiro Fujimoto. <i>Product Development Performance</i>, Harvard Business School Press, 1991<br />
<br />
Gruver, Gary, Mike Young, and Pat Fulghum. <i>A Practical Approach to Large-Scale Agile Development</i>, Pearson Education, 2013<br />
<div>
<br /></div>
Humble, Jez and David Farley. <i>Continuous Delivery</i>, Addison-Wesley Professional, 2010<br />
<br />
Modig, Niklas, and Par Ahlstrom. <i>This is Lean</i>, Stockholm: Rheologica Publishing, 2012<br />
<br />
Morgan, James M. and Jeffrey K. Liker. <i>The Toyota Product Development System</i>, Productivity Press, 2006<br />
<br />
<div>
Ohno, Taiichi. <i>Toyota Production System</i>, Productivity, Inc., 1988 (English translation; first published in Japanese in 1978)<br />
<br />
Poppendieck, Mary and Tom. <i>Lean Software Development</i>, Addison Wesley, 2003</div>
<div>
<br /></div>
Poppendieck, Mary and Tom. <i>Implementing Lean Software Development</i>, Addison Wesley, 2006<br />
<br />
Porter, Michael E. and James E. Heppelmann. How Smart, Connected Products are Transforming Companies, <i>Harvard Business Review</i> 93 (10), 97-112, 2015<br />
<br />
Reinertsen, Donald G. <i>Managing the Design Factory</i>, The Free Press, 1997<br />
<br />
Ries, Eric. <i>The Lean Startup</i>, Crown Business, 2011<br />
<div>
<br /></div>
Smith, Preston G. and Donald G. Reinertsen. <i>Developing Products in Half the Time</i>, Van Nostrand Reinhold, 1991<br />
<div>
<br /></div>
Ward, Allen. <i>Lean Product and Process Development</i>, Lean Enterprise Institute, 2007<br />
<div>
<br /></div>
Womack, James P., Daniel T. Jones, and Daniel Roos. <i>The Machine That Changed the World; the Story of Lean Production</i>, Rawson & Associates, 1990<br />
<div>
<br /></div>
<br />Mary Poppendieck<br />
<br />
<h2>
<span style="color: #134f5c; font-size: large;">The Three Rules of the DevOps Game</span></h2>
<i>February 9, 2015</i><br />
<br />
“Of course we do agile development,” she told me. “That’s just table stakes. What we need to do now is learn how to play the DevOps game. We need to know how to construct a deployment pipeline, how to keep test automation from turning into a big ball of mud, whether micro-services are just another fad, what containers are all about. We need to know if outsourcing our infrastructure is a good long-term strategy and what happens to DevOps if we move to the cloud.”<br />
<br />
<h2>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPYPIEGhC04A63wu2ubebERJ3AvKtkfSsXvyVW5csOxiZ0rr6AxRepxS7W07GRutSYbztzxPJuCG6CuP82UF2vilanfVy4QqjaxbnZprqImbrGKpcAVPaW_R648ymq00JZaubtEve4EkGN/s1600/stack.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPYPIEGhC04A63wu2ubebERJ3AvKtkfSsXvyVW5csOxiZ0rr6AxRepxS7W07GRutSYbztzxPJuCG6CuP82UF2vilanfVy4QqjaxbnZprqImbrGKpcAVPaW_R648ymq00JZaubtEve4EkGN/s1600/stack.png" width="126" /></a><span style="color: #134f5c; font-size: large;">Ask the right questions</span></h2>
Imagine something we will call the IT stack. At one end of the stack is hardware and at the other end customers get useful products and services. The game is to move things through the stack in a manner that is responsive, reliable, and sustainable. The first order of business is to understand what responsive, reliable, and sustainable mean in your world. Then you need to be the best in your field at providing products and services that strike the right balance between responsiveness, reliability and sustainability.<br />
<br />
<h3>
<span style="font-size: large;">1. What does it mean to be Responsive? </span></h3>
In many industries, responsive has come to mean devising and delivering features through the entire IT stack in a matter of minutes or hours. From hosted services to bank trading desks, the ability to change software on demand has become an expected practice. In these environments, a deployment pipeline is essential. Teams have members from every part of the IT stack. Automation moves features from idea, to code, to tested feature, to integrated capability, to deployed service very quickly. <br />
<br />
Companies that live in this fast-moving world invest in tools to manage, test, and deploy code, tools to maintain infrastructure, and tools to monitor production environments. In this world, automation is essential for rapid delivery, comprehensive testing, and automated recovery when (not if, but when) things go wrong. <br />
<br />
On the other end of the spectrum are industries where responsiveness is a distant second to safety: avionics, medical devices, chemical plant control systems. Even here, software is expected to evolve, just more slowly. Consider Saab’s Gripen, a small reconnaissance and fighter jet with a purchase and operational cost many times lower than any comparable fighter. Over the past decade, the core avionics systems of the Gripen have been updated at approximately the same rate as major releases of the Android operating system. Moreover, Gripen customers can swap out tactical modules and put in new ones at any time, with no impact on the flight systems. This “smartphone architecture” extends the useful life of the Gripen fighter by creating subsystems that use well-proven technology and are able to change independently over time. In the slow-moving aircraft world, the Gripen is a remarkably responsive system. <br />
<br />
<h3>
<span style="font-size: large;">2. What does it mean to be Reliable? </span></h3>
There are two kinds of people in the world – optimists and pessimists – the risk takers and the risk averse – those who chase gains and those who fear loss. Researcher Tory Higgins calls the two world views “promotion-focus” and “prevention-focus”. If we look at the IT stack, one end tends to be populated with promotion-focus people who enjoy creating an endless flow of new capabilities. [Look! It works!] As you move toward the other end of the stack, you find an increasing number of prevention-focused people who worry about safety and pay a lot of attention to the ways things could go wrong. They are sure that anything which CAN go wrong eventually WILL go wrong. <br />
<br />
These cautious testers and operations people create friction, which slows things down. The slower pace tends to frustrate promotion-focused developers. To resolve this tension, a simple but challenging question must be answered: What is the appropriate trade-off between responsiveness and safety FOR OUR CUSTOMERS AT THIS TIME? Depending on the answer, the scale may tip toward a promotion-focused mindset or a prevention-focused mindset, but it is never appropriate to completely dismiss either mindset. <br />
<br />
Consider Jack, whose team members were so frustrated with the slow pace of obtaining infrastructure that they decided to deploy their latest update in the cloud. Of course they used an automated test harness, and they appreciated how fast their tests ran in the cloud. Once all of the tests passed, the team deployed a cloud-based solution to a tough tax calculation problem. One evening a couple nights later, Jack had just put his children to bed when the call came: “A lot of customers are complaining that the system is down.” He got on his laptop and rebooted the system, praying that no one had lost data in the process. Around midnight another call came: “The complaints are coming in again. Maybe you had better check on things regularly until we can look at it in the morning.” It was a sleepless night – something Jack was not familiar with. These were the kinds of problems that operations used to handle, but since operations had been bypassed, it fell to the development team to monitor the site and keep the service working. This was a new and unpleasant experience. First thing in the morning, the team members asked an operations expert to join them. They needed help discovering and dealing with all of the ways that their “tested, integrated, working” cloud-based service could fail in actual use. <br />
<br />
The cause of the problem turned out to be a bit of code that expected the environment to behave in a particular way, and in certain situations the cloud environment behaved differently. The team decided to use containers to ensure a stable environment. They also set up a monitoring system so they could see how the system was operating and get early warnings of unusual behavior. They discovered that their code had more dependencies on outside systems than they knew about, and they hoped that monitoring would alert them to the next problem before it impacted customers. The team learned that all of this extra work brought its own friction, so they asked operations to give them a permanent team member to advise them and help them deploy safely – whether to internal infrastructure or to the cloud. <br />
<br />
Of course no one was in mortal danger when Jack’s system locked up – because it wasn’t guiding an aircraft or pacing a heartbeat. So it was fine for his team to learn the hard way that a good dose of prevention-focus is useful for any system, even one running in the cloud. But you do not want to put naive teams in a position where they can generate catastrophic results. <br />
<br />
It is essential to understand the risk of any system in terms of: 1) probability of failure, 2) ability to detect failure, 3) resilience in recovering from failure, 4) level of risk that can be tolerated, and 5) remediation required to keep the risk acceptable. Note that you do not want this understanding to come solely from people with a prevention-focused mindset (eg. auditors) nor solely from people with a promotion-focused mindset. Your best bet is to assemble a mixed team that can strike the right balance – for your world – between responsiveness and reliability. <br />
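One hedged way to make such a risk assessment concrete (the formula below is an illustrative simplification, not an audit-grade model): treat expected downtime per deployment as the probability of failure times the time to detect and recover, and compare it against the downtime that can be tolerated.

```python
def risk_acceptable(p_failure: float,
                    detect_hours: float,
                    recover_hours: float,
                    tolerated_downtime_hours: float) -> bool:
    """Toy risk model: expected downtime per deployment vs. tolerance.

    Expected downtime ~= probability of failure x (detect + recover time).
    All numbers below are invented for illustration.
    """
    expected_downtime = p_failure * (detect_hours + recover_hours)
    return expected_downtime <= tolerated_downtime_hours

# Frequent small deployments: failures are rarer, caught and reversed quickly.
assert risk_acceptable(0.02, detect_hours=0.1, recover_hours=0.2,
                       tolerated_downtime_hours=0.5)
# Big-batch release: more likely to fail, slow to detect and unwind.
assert not risk_acceptable(0.4, detect_hours=8, recover_hours=24,
                           tolerated_downtime_hours=0.5)
```

Even a crude model like this makes the trade-off discussable by a mixed team: promotion-focused members argue about the probability, prevention-focused members about detection and recovery, and together they set the tolerance.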
<br />
<h3>
<span style="font-size: large;">3. What does it mean to be Sustainable?</span></h3>
We know that technology does not stand still; in fact, most technology grows obsolete relatively quickly. We know that the reason our systems have software is so that they can evolve and remain relevant as technology changes. But what does it take to create a system in which evolution is easy, inexpensive and safe? A software-intensive system that readily accepts change has two core characteristics – it is understandable and it is testable.<br />
<br />
<h4>
a. What does it mean to be understandable?</h4>
If a system is going to be safely changed, then members of a modest-sized team[1] must be able to wrap their minds around the way the system works. In order to understand the implications of a change, this team should have a clear understanding of the details of how the system works, what dependencies exist, and how each dependency will be impacted by the change.<br />
<br />
An understandable system is bounded. Within the boundaries, clarity and simplicity are essential because the bounded system must never outgrow the team’s capacity to understand it, even as the team members change over time. The boundaries must be hardened and communication through the boundaries must be limited and free of hidden dependencies. <br />
<br />
Finally, the need for understanding is fractal. As bounded sub-systems are wired together, the resulting system must also be understandable. As we create small, independently deployable micro-services, we must remember that these small services will eventually get wired together into a system, and a lot of micro-things with multiple dependencies can rapidly add up to a complex, unintelligible system. If a system – at any level – is too complex to be understood by a modest-sized team, it cannot be safely modified or replaced; it is not sustainable. <br />
<br />
<h4>
b. What does it mean to be testable?</h4>
A testable system, sub-system, or service is one that is testable both within its boundaries and at each interface with outside systems. For example, consider service A, which runs numbers through a complex algorithm and returns a result. The team responsible for this service develops a test harness along with their code that ensures the service returns the correct answer given expected inputs. It also creates a contract which clearly defines acceptable inputs, the rate at which it can accept inputs, and the format and meaning of the results it returns. The team documents this by writing contract tests which are made available to any team that wishes to invoke the service. Assume that service B would like to use service A. Then the team responsible for service B must place the contract tests from service A in its automated test suite and run the tests any time a change is made. If the contract tests for service A are comprehensive and the testing of service B always includes the latest version of these tests, then the dependency between the services is relatively safe. <br />
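The arrangement described above can be sketched in a few lines. This is a minimal illustration, not anyone's actual implementation: the averaging algorithm, the function names, and the test harness are all invented for the example.

```python
# A sketch of consumer-run contract tests. Service A publishes its
# contract as executable tests; any consumer (such as service B) runs
# those tests in its own automated suite every time it makes a change.

def service_a(numbers):
    """Hypothetical stand-in for service A's complex algorithm."""
    if not numbers or not all(isinstance(n, (int, float)) for n in numbers):
        raise ValueError("contract: input must be a non-empty numeric list")
    return sum(numbers) / len(numbers)

def contract_tests(call):
    """Contract tests published by service A's team."""
    # An expected input must yield the defined result.
    assert call([2, 4, 6]) == 4
    # An out-of-contract input must be rejected, not silently accepted.
    try:
        call(["not", "numbers"])
    except ValueError:
        pass
    else:
        raise AssertionError("contract: bad input must raise ValueError")

# Service B runs the published tests against the service it depends on.
contract_tests(service_a)
```

If the contract tests pass, service B knows its assumptions about service A still hold; if service A changes in a way that breaks the contract, the failure shows up in service B's test suite rather than in production.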
<br />
Of course it’s not that simple. What if service A wants to change its interface? Then it is expected to maintain two interfaces, an old version and a new version, until service B gets around to upgrading to the new interface. And every service invoking service A is expected to keep track of which version it is certified to use. <br />
<br />
Then again, service A might want to call another service – let’s say service X – and so service A must pass all of the contract tests for service X every time it makes a change. And since service X might branch off a new version, service A has to deal with multi-versioning on both its input and its output boundaries. <br />
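The multi-version bookkeeping from the last two paragraphs might look roughly like this. All names and version numbers here are illustrative, not drawn from any real system.

```python
# A sketch of the multi-version discipline: service A keeps both the old
# and the new interface alive until every consumer has migrated, and each
# consumer records which version it is certified to use.

def average_v1(numbers):
    """Old interface: returns a bare number."""
    return sum(numbers) / len(numbers)

def average_v2(numbers):
    """New interface: returns a structured result with metadata."""
    return {"result": sum(numbers) / len(numbers), "count": len(numbers)}

# Each consuming service records the interface version it is certified for.
CERTIFIED = {"service_b": "v1", "service_c": "v2"}

def call_average(consumer, numbers):
    """Route a consumer's request to the interface version it expects."""
    version = CERTIFIED[consumer]
    return average_v1(numbers) if version == "v1" else average_v2(numbers)
```

Only when every entry in the certification table reads "v2" can the old interface safely be retired; until then, both versions must be maintained and tested.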
<br />
If you have trouble wrapping your head around the last three paragraphs, you probably appreciate why it is extremely difficult to keep an overall system with multiple services in an understandable, testable state at all times. Complexity tends to explode as a system grows, so the battle to keep systems understandable and testable must be fought constantly over the lifetime of any product or service.<br />
<br />
<h2>
<span style="color: #134f5c; font-size: large;">A Reference Architecture</span></h2>
Over the last couple of decades, the most responsive, reliable, renewable systems seem to have platform-application architectures. (The smartphone is the most ubiquitous example.) Platforms such as Linux, Android, and Gripen avionics focus on simplicity, low dependency, reliability, and slow evolution. They become the base for swappable applications which are required to operate in isolation, with minimum dependencies. Applications are small (members of a modest sized team can get their heads around a phone app), self-sufficient (apps generally contain their own data or retrieve it through a hardened interface), and easy to change (but every change has to be certified). If an app becomes unwieldy or obsolete, it is often easiest to discard it and create a new one. While this may appear to be a bit wasteful, it is the ability of a platform-app architecture to easily throw out old apps and safely add new ones that keeps the overall ecosystem responsive, fault tolerant, and capable of evolving over time. <br />
<br />
So these are the three rules of the DevOps game: Be responsive. Be reliable. Be sure your work is sustainable.<br />
<br clear="all" />
<hr align="left" size="1" width="33%" />
<br />
<div id="ftn1">
<div class="MsoFootnoteText">
<a href="#_ftnref1" name="_ftn1" title=""><span class="MsoFootnoteReference">[1]</span></a> What is a modest sized team? We have found that in hardware-software environments, a team the size of a military platoon (three squads) is often a good size for major sub-systems. Robin Dunbar found in his research that a hunting group (30-40 people) brings the diversity of skills necessary to achieve a major objective. See the essays “<a href="http://www.leanessays.com/2011/02/before-there-was-management.html">Before there was Management</a>” and “<a href="http://www.leanessays.com/2014/02/the-scaling-dilemma.html">The Scaling Dilemma</a>.”</div>
</div>
<br />
<br />Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comtag:blogger.com,1999:blog-2229468562774492653.post-2406368570106698772014-02-18T13:29:00.001-06:002014-03-16T08:45:13.764-05:00The Scaling Dilemma<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtWCNn-Wk1U7FWlqFW7zimxtf6HdZrWoCBc4WrxWKhSgYXIgSYH8wMGVRAAZFwSEd0nwzFvDHaZgytv6WvLericS1nN0xR10cV0uzd555F-BfpIevmQ8Wcmf5SPx0vRpU0lfsDQLWgdn35/s1600/Picture1.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtWCNn-Wk1U7FWlqFW7zimxtf6HdZrWoCBc4WrxWKhSgYXIgSYH8wMGVRAAZFwSEd0nwzFvDHaZgytv6WvLericS1nN0xR10cV0uzd555F-BfpIevmQ8Wcmf5SPx0vRpU0lfsDQLWgdn35/s1600/Picture1.png" height="141" width="200" /></a>“One of the most scalable organizations in human history was the Roman army. Its defining unit: The squad – eight guys. The number of guys that could fit in a tent,” says Chris Fry, who knows a bit about scaling. He led software development at Salesforce.com during its years of hyper growth, and is now SVP of Engineering at Twitter. Fry found that the way to build a scalable organization is to focus on the basic building blocks – small, stable, multidisciplinary teams that are expected to independently tackle problems, make decisions, and get things done. Fry advises, “When it comes to building a deeply efficient engineering organization, there are several things you can do to move the needle:<br />
<ul>
<li>Build strong teams first. Assign them problems later.</li>
<li>Keep teams together.</li>
<li>Go modular. Remove dependencies.</li>
<li>Establish a short, regular ship cycle.”[1]</li>
</ul>
Amazon.com works on the same principle. Its basic unit is the two-pizza team – a team small enough to be fed with two pizzas. When Amazon needed to scale, two-pizza teams were chartered to build Amazon Web Services, one service per team. A team includes everyone needed to design, deliver, and support the service – from specifications to operations. If a service is too large for a two-pizza team, Amazon prefers to split the service into smaller pieces rather than combine teams to deal with the larger service, because this preserves the dynamic interactions of small teams.<br />
<br />
<h2>
<span style="color: #073763;"><span style="font-size: large;">What Could Go Wrong?</span></span></h2>
If it works for Salesforce, Amazon and Twitter, surely it will work for you… So you form a lot of small, independent, multidisciplinary teams and you are careful to give them all clear goals. What could go wrong with that?<br />
<br />
<h4>
August 27, Scene 1</h4>
“Hi Owen, how’s it going?”<br />
<br />
“Just great! We’re going to make our target. We got the last piece done overnight. We’re working on integration testing right now. We’ll have it ready to release tomorrow. Lucky for me, because this month I really need that bonus.”<br />
<br />
<h4>
August 27, Scene 2</h4>
“Hey George, I hear you had a shutdown last night.”<br />
<br />
“Yeah, we got things up fast, but it still counts against our shutdown limit. It’s the last one we can afford this month, if we want our bonuses. So there won’t be any more.”<br />
<br />
“Do you know what caused it?”<br />
<br />
“The usual. A new release from development. A naive piece of code, the kind of thing you can’t test for. Anyway, it’s fixed.”<br />
<br />
“And you’ve made sure they won’t make that mistake again?”<br />
<br />
“Nah, they don’t want to listen to us. We’re just not going to put up any more releases until September. We’re not going to miss our target.”<br />
<br />
“But I hear they have another release almost ready to go.”<br />
<br />
“Over my dead body.”<br />
<br />
<h2>
<span style="color: #073763;"><span style="font-size: large;">The Dilemma</span></span></h2>
It is a beautiful thing when the building block squads of an organization gel into high performance teams and can be counted on to meet challenging goals. But the tricky part is, once you create these strong independent teams, how do you get them to work together? How do teams maintain their autonomous character while working in concert with an increasingly large network of other teams? How do you make sure that each team has a clear goal, but none of the goals are in conflict?<br />
<br />
In a lean environment, the leader’s role is to set up strong teams, to be sure, but it is also to devise a system – let’s call it a <i>goal system</i> – which assigns goals to teams. This is no easy task. Teams need clear, meaningful goals; they have to be the right goals; teams must have the capacity, capability and autonomy to achieve their goals; and most important, the goals of various teams cannot conflict with each other.<br />
<br />
One thing we know for certain is that local goals create local optimization. So it’s clear that the start of a goal system is a system-level, unifying goal. Something that conveys the purpose of the work, the <i>why</i>. Some way to confirm that progress is being made – at the team level – toward achieving the overall purpose. Of course, this is a lot easier said than done.<br />
<br />
<h2>
<span style="font-size: large;">The Goal System</span></h2>
Many companies use projects to set up a goal system. A project manager lets the team know what the project goal is and what everyone needs to do to reach the goal. If there are several competing projects, a Project Management Office (PMO) is added to manage the project portfolio and distribute corporate goals among projects. However, there are problems with the project approach. You don’t often see stable teams in a project company, because people are usually assigned at project start and reassigned at the end. Worse, people are often assigned to multiple projects with competing demands on team members’ time. Finally, since most projects are conceived of as a relatively large batch of work, project teams tend to be quite a bit larger than a squad of eight to fourteen people. So the basic building blocks of scale – small, intact, multidisciplinary teams – are rarely found in a project environment.<br />
<br />
One of the things Scrum has contributed to the practice of software development is the idea that small autonomous teams perform much better than large project teams or single-discipline teams that work in sequence. So Scrum provides the building blocks of scale, but unfortunately, it does not contain a scalable system for choosing team goals, making sure they contribute to organizational goals and are in sync with the goals of other teams. So we need to look elsewhere for ways to set up a goal system.<br />
<br />
<h2>
<span style="color: #073763;"><span style="font-size: large;">The Theory of Constraints</span></span></h2>
People with a lean mindset might look to the Theory of Constraints (TOC) for guidance on choosing and communicating team goals because it has a good track record for directing the efforts of multiple teams toward a single goal, at least in manufacturing.[2] TOC starts with the assumption that in any system there will always be a constraint that gets in the way of achieving the system goal, and the way to keep teams working toward the overall system goal is to be sure that everyone is focused on getting more work through the system constraint. Let’s see how TOC might be applied to developing a software system.<br />
<br />
<h3>
When the Constraint is Technical</h3>
The first step – after clarifying the overall system purpose and goal – is to find the biggest constraint to achieving that goal. For purposes of discussion, let’s choose one of the most typical technical constraints encountered in delivering a software system: the integration of various components of the system without the introduction of defects or unintended consequences. In fact, project organizations typically allocate a third or more of the project time to release overhead – including integration, testing, fixing defects, and deployment – with the largest portion going to finding and fixing problems discovered during integration. When integration is the system constraint, TOC tells us that the most important focus for development teams should be removing this constraint. <br />
<br />
Agile approaches to software development recommend the frequent delivery of <i>working software</i> to customers. When this recommendation is followed literally (software is released to end customers frequently), the integration constraint is regularly exposed and has to be confronted. One of the earliest agile approaches, Extreme Programming (XP), includes technical practices such as Test Driven Development and Continuous Integration that help make frequent releases practical. Continuous Delivery[3], which expands on these practices, has gained widespread favor as the agile approach which explicitly focuses on the integration constraint. In Continuous Delivery we find actionable advice on how to tackle the integration problem with techniques that scale across large networks of teams.<br />
<br />
The objective of Continuous Delivery is to dramatically increase the number of times integration occurs while decreasing the amount of time it takes to negligible levels. This has the same effect that just-in-time flow does in manufacturing: the impact of defects is reduced to near zero because they are discovered immediately, before they can propagate or hide. The problem is, a much wider swath of an organization needs to get involved in Continuous Delivery than is typically found on a development team. The system architecture has to be divisible, the marketing department has to figure out how to deal with frequent deliveries, and the development and operations departments have to work closely together. <br />
<br />
The Theory of Constraints can help here. If the constraint is integration, TOC recommends that we measure the rate at which work moves through the constraint – in this case the rate at which completely integrated and tested software is released to production – and make improvement of this rate the goal of every team involved in the system. It turns out that a throughput measurement on the system constraint is a great metric for team goals because it is easy to measure, provides immediate feedback, and is structured to result in improved system-wide performance. If everyone working on a system is trying to stabilize and improve the rate at which tested, integrated software is successfully released to end customers, then teams across the system will naturally have to work together and will find that their goals are compatible.<br />
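As a rough illustration of such a throughput metric, the release rate can be computed directly from the timestamps of successful production releases. The dates below are invented for the example:

```python
# A sketch of measuring throughput at the integration constraint:
# the rate at which tested, integrated software reaches production.
# Release dates are invented for illustration.
from datetime import date

releases = [date(2014, 1, 3), date(2014, 1, 17), date(2014, 1, 24),
            date(2014, 2, 7), date(2014, 2, 14)]

# Average rate over the observed window: intervals / elapsed days.
span_days = (releases[-1] - releases[0]).days
throughput = (len(releases) - 1) / span_days   # releases per day
print(f"{throughput * 7:.2f} releases per week")
```

A number like this is easy to gather, gives immediate feedback after every release, and trends upward only when the whole pipeline – architecture, testing, deployment, operations – improves together.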
<br />
<h3>
When the Constraint is Knowledge</h3>
More often than not, however, technical issues are not the biggest constraint in system development and the fundamental problem is not an integration problem, it is a design problem or a fitness-for-use problem. Far too often we end up with a system that doesn’t work well or is difficult to use. In this case, the biggest constraint in developing software systems is the way in which we decide what to build and parse that decision amongst the teams doing the work.<br />
<br />
Project organizations spend a good deal of time deciding what to do and turning these decisions into goals or requirements. However, this activity is front-loaded into the beginning of the project, while verification that the project requirements are correct waits until the project is complete – too late to make changes. Most project organizations consider deciding what to build to be an execution problem rather than a constraint. They would say: <i>It’s unfortunate that project goals and requirements are sometimes wrong, but the way to fix this is to work harder on getting them right before the next project. </i><br />
<br />
Organizations with a lean mindset would frame the problem differently. They are likely to identify the biggest barrier to making good decisions as incomplete knowledge, and thus the biggest constraint of the system would be the rate at which knowledge is generated. So a lean organization would focus on the feedback loop between customers and development teams. They would decrease the length of the feedback loop, increase the speed of the feedback loop, and remove barriers to the free flow of information inside the feedback loop. These organizations would say:<i> If we can test our assumptions and designs more quickly we will learn faster and make better decisions.</i>[4]<br />
<br />
<h2>
<span style="color: #073763;"><span style="font-size: large;">No More Politics</span></span></h2>
One of the signs that teams have conflicting goals is negative politics. A good way to eliminate political wrangling is to get teams to work together toward a unifying goal and to show team members the impact of their actions on that goal – in real time.<br />
<br />
<h4>
August 27, Alternate Scene </h4>
“Hi George, I noticed a little glitch in the last night’s graph so I thought I’d check to see if anything went wrong.”<br />
<br />
“Oh hi, Owen. Glad you stopped by. As a matter of fact, we did have a shutdown last night, but we were lucky and caught it right away and got things back up fast enough we didn’t lose any customers.”<br />
<br />
“So what caused it, do you think?”<br />
<br />
“It was a database lockout. It seems to happen a lot about six to eight hours after a new release. When we find it we do a workaround that lasts until the next release.”<br />
<br />
“Whoa! That means it’s something in the code that isn’t getting caught in testing.”<br />
<br />
“Well, <i>yeah!</i> There’re a lot of problems in code that only come out during production.”<br />
<br />
“Um, well, we worked late last night to get that feature ready that Chris asked for. Do you think we can do another release tomorrow?”<br />
<br />
“Well Owen, let’s take a look at the graph. You see last night there was a downward spike here, where we closed down the site and no one could use the app. But we got it up so fast that almost everyone was still around and we could just reconnect them. It was the middle of the night here, so we mostly had browsers in Europe and Asia. We don’t have many customers there – yet. But what if we hadn’t caught it so fast? In five minutes we would have lost maybe half of the people using the app. What if it had happened during the daytime here? That spike would have been an order of magnitude bigger, and we would have lost a lot more people. It’s one of our busiest seasons, right before school starts. Do you really think we should take such a risk?”<br />
<br />
“But George, we have to do it sometime. Sooner is better than later.”<br />
<br />
“Not the way I see it, Owen. The fewer releases we put out, the fewer customers we’re likely to disrupt.”<br />
<br />
“Okay, I see your point, but we’ve got to fix that problem so we can release frequently, because new features are what drives that graph up higher.”<br />
<br />
“Not if the system crashes, they aren’t.”<br />
<br />
“Whatever. We still have to fix the problem.”<br />
<br />
“Yeah, so how do you propose <i>we </i>do that?”<br />
<br />
“How about a side-by-side release with a trigger that knocks out the new system if it gets flaky?”<br />
<br />
“Easy for you to say, Owen. You don’t have to make it happen.”<br />
<br />
“But if I could make it happen, would you let me?”<br />
<br />
“Sure, why not? But it won’t be easy.”<br />
<br />
“How about I get with the team and tell them that in order to have a release, we have to write some failure detection and recovery code, and babysit the next release around the clock to be sure it works.”<br />
<br />
“Can’t hurt to give it a try.”<br />
<br />
<h2>
<span style="color: #073763;"><span style="font-size: large;">The Unifying Goal</span></span></h2>
A network of strong teams is the first step to scale. The second step is to set up a system that distributes goals to teams in a way that avoids goal conflict. One good way to do this is to find a goal that is the final arbiter of ‘good,’ make it visible to all teams in real time, and hold teams responsible for it. <br />
<br />
The purpose of throughput accounting in the Theory of Constraints is to create just such a unifying goal. Simply stated, throughput accounting provides a measurement of the rate at which an organization achieves its purpose. This rate is effectively the same as the rate of throughput at the system constraint, so teams working to improve throughput at the constraint or to improve overall system impact are working toward the same goal. Either way, a single unifying goal allows individual teams to act autonomously, confident that they are not working at cross purposes with other teams.<br />
<br />
What if you cannot find a unifying goal that represents the system constraint, or if a team’s work has no apparent impact on that goal? Over time it would be better to move to a decoupled architecture so that individual teams can have an impact on the overall system goal. But in the meantime, each team should monitor the impact of its work on its immediate customers. The question is not: <i>Did the team complete its work?</i> The proper question is: <i>Did the team give its immediate customers what was needed, when it was needed, in a state that allowed the downstream teams to perform their work well?</i><br />
<br />
Finally, remember that scaling is a two-way street. If you think that scaling gives you problems as a leader, imagine the problems it brings to other people in the organization. It’s fun to work at a small company where everyone knows what’s important and works together to make the company successful. But as the company grows, people at all levels are in danger of losing their line of sight from what they are doing to the company’s success. This limits their ability to act with initiative to bring about that success, and thus undermines their engagement. You can’t scale unless you keep all of the bright minds in the company engaged and working together to help the company grow.<br />
______________________________ <br />
<h4>
Footnotes</h4>
[1] From <i>First Round Review: <a href="http://firstround.com/article/Twitter-Engineering-SVP-Chris-Fry-on-the-Power-of-Stable-Teams">Unlocking the Power of Stable Teams with Twitter's SVP of Engineering</a>.</i><br />
<br />
[2] I was reminded of this as I read <a href="https://leanpub.com/tame-the-flow">Tame the Flow</a> by Steve Tendon and Wolfram Müller, which provides a nice summary of how the Theory of Constraints in general, and throughput accounting in particular, can generate much better team goals than either the work-bin WIP limits of Kanban or the cost/profit focus of traditional cost accounting.<br />
<br />
[3] See <a href="http://www.amazon.com/exec/obidos/ASIN/0321601912/poppendieckco-20">Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation</a> by Jez Humble and Dave Farley (2010), and <a href="http://www.amazon.com/Lean-Enterprise-Adopting-Continuous-Delivery/dp/1449368425/ref=sr_1_3?ie=UTF8&qid=1392226283&sr=8-3&keywords=jez+humble">Lean Enterprise: Adopting Continuous Delivery, DevOps, and Lean Startup at Scale</a> by Jez Humble , Barry O’Reilly, and Joanne Molesky (2014).<br />
<br />
[4] It’s interesting to note that Continuous Delivery is an excellent way to address the knowledge constraint as well as the technical constraint of software development.Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comtag:blogger.com,1999:blog-2229468562774492653.post-79321256444858885982013-09-14T11:07:00.001-05:002016-01-22T11:53:21.944-06:00Artist Island<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3ep3lM85edRbxWu7OOfNVWowQas8xPcRLTpEQ0LRocZZ8haYdYqmCNoU9-Dz4NZb09GLEzOaCwKqaU65dNgzVRGClhbhJrUFdtmjOPqpPoybmFBYpvj8R12HCNV_xI6YyTzl1Gu33W-qp/s1600/jewelboat.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="81" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3ep3lM85edRbxWu7OOfNVWowQas8xPcRLTpEQ0LRocZZ8haYdYqmCNoU9-Dz4NZb09GLEzOaCwKqaU65dNgzVRGClhbhJrUFdtmjOPqpPoybmFBYpvj8R12HCNV_xI6YyTzl1Gu33W-qp/s200/jewelboat.png" width="200" /></a></div>
Once there was an island called Artist Island. The people on this island made a very good living by gathering stones from Stone Island, forming them into stunning jewels and selling the jewelry to people in the Land of Festivals. At first, people from Artist Island visited the Land of Festivals to see what kind of jewelry would be popular and then scouted out the right kinds of stones on Stone Island. But that was not a very efficient use of the time of these excellent artisans, so after a while the people from Stone Island were asked to bring their stones to Artist Island. Later, to further improve efficiency, the Stone People were asked to deliver the finished jewelry to the Land of Festivals. Since the Stone People were now the ones in contact with the Festival People, they became responsible for ordering the jewelry as well.<br />
<br />
The people of Artist Island continued their search for efficiency. It took a long time to shape a pile of stones into a pile of jewels, and by the time a pile was done, the Stone People complained that the designs were out of date. So the artisans learned about Lean and decided to reduce work-in-progress. They worked on stone piles for no longer than two weeks, and brought smaller piles of jewels to the boat dock for shipment to the Land of Festivals. However, the transport boats were still quite large and took some time to fill up, so it still took a long time for jewelry to get to the Festival People, and the designs were still out of date.<br />
<br />
Then the Stone People came up with an idea. If they invested in smaller boats, they could deliver jewelry to the Land of Festivals in much smaller batches, shortly after it was made. Gradually they switched from large boats to small boats, and as they did, they started using the same small boats to deliver stones to Artist Island. Now a stone could be dug out of a mine, taken to Artist Island, formed into a jewel and delivered to the Land of Festivals in about a month. The jewelry designs were much more up-to-date, and sales improved.<br />
<br />
But there was still a lot of unsold jewelry in warehouses in the Land of Festivals, because sometimes the jewels had flaws and sometimes the artisans didn't understand what the Stone People were asking them to make. Quite often the jewelry was boring because the Stone People weren't aware of the many marvelous kinds of jewels that were possible, so they didn't order many interesting designs.<br />
<br />
<span style="font-size: large;"><b>The Earthquake</b></span><br />
About this time an earthquake shifted the rocks holding water in the lake, and the water level receded. Artist Island became a peninsula connected to the mainland. Some enterprising artisans decided to walk to the mainland and talk to the Festival People. They discovered that the Festival People didn't really like their jewelry designs very much, and so the artisans went back home and produced some new designs. After a few trips, they learned more and more about the festivals, their timing, their themes, and the kind of jewels that would be best for each one. They began to produce jewelry specially designed for each festival, and sales soared.<br />
<br />
Of course, the artisans weren't using their time quite so efficiently anymore, because some of the best craftspeople spent part of their valuable time walking over to the mainland and talking to the Festival People. But then again, contact with customers energized the artisans, and they brought their enthusiasm back to Artist Island. The new designs were wildly popular, so none of them ended up stored in a warehouse. Thus the people of Artist Island were able to sell more jewelry and charge higher prices than before.<br />
<br />
The artisans found that the Festival People were looking for unique jewels, so they asked the Stone People to look for new kinds of stones. But the new stones that the Stone People brought were not suitable for forming into jewels, and the artisans began to wish they had kept a few of their old boats so they could go and look for new stones themselves. One day some of the artisans went exploring their new peninsula and discovered a shallow sandbar connecting Artist Island to Stone Island. So they were able to wade over to Stone Island to help the Stone People look for better stones.<br />
<br />
Because they knew what they were looking for, the artisans soon discovered new types of stones that could be easily formed into jewels. Actually, the new stones were much bigger than the previous ones -- they barely fit in the small boats -- so the Stone People had ignored them. The large stones gave the Artist Island people a novel idea: perhaps they could make cups and bowls from the new stones. Several different kinds of stones were brought to Artist Island and artisans eagerly tried their hand at making household items. This turned out to be easier than the intricate work of making jewelry, so even apprentice artisans were able to shape the new stones. The Festival People loved the new dishes and bought many pieces in addition to the jewels they always enjoyed.<br />
<br />
Of course, the artisans weren't using their time quite so efficiently now that some of the best craftspeople were working with the Stone People as well as the Festival People. But then again, dishes involved less intricate work so more items could be produced and it was much easier to avoid flaws. Furthermore, the Festival People were eager to buy every single item the artisans could produce. No longer was finished work gathering dust in warehouses. Thus the artisans were able to make and sell many more products than before. Finally, since the household items were considered a necessity, business remained good even during tough economic times.<br />
<br />
<span style="font-size: large;"><b>The New Landscape</b></span><br />
Let’s take a tour of Artist Island a few years after the waters receded and turned it into a peninsula. The former island has three different kinds of artisans. First of all there are the enterprising artisans who learned to empathize with customers and look for new stones to shape in novel ways to solve customer problems. They are deeply engaged in their work and have created tight feedback loops so they can continue to develop products that customers love and bring innovative new stones to the market.<br />
<br />
This group of artisans has revised the definition of efficiency.[1] They don’t worry too much about resource efficiency -- that is, keeping every artisan busy. They focus on flow efficiency -- that is, keeping each stone moving from Stone Island all the way to customers in the Land of Festivals with as little delay as possible. They have found that with greater flow efficiency, they get higher quality, more rapid customer feedback, and thus they are more likely to make products that delight customers. The enterprising artisans are doing very well.<br />
<br />
But not everyone on Artist Island noticed that the waters receded, or if they noticed, they were not eager to abandon their comfortable routines. The traditional artisans believe in resource efficiency -- that is, making the most efficient use of their valuable time. So they continue to receive boatloads of stones from the Stone People, form them into the kind of jewels they are asked to make, and send the jewelry to be sold in the Land of Festivals. It’s not their problem if the Stone People order the wrong jewels, or if the boats are so large that their work is out of date by the time it reaches the Land of Festivals, or if half of their jewelry ends up in warehouses, unused by the Festival People. Their job is to deliver what they are asked for in a timely manner, and to continually reduce their costs.<br />
<br />
Of course, the work is not very challenging and it’s difficult to get enthusiastic about making piles of jewelry that no one is likely to use. Because of this, many traditional artisans are leaving to join the enterprising artisans, attracted by the opportunity to think for themselves, the challenge of improving their artistic skills, and the satisfaction of seeing their work appreciated by the Festival People.<br />
<br />
There is a third area of Artist Island -- one that wasn't mentioned earlier -- an area that has been around ever since stones began to be used for practical items. Here we find the parts-makers, artisans who form stones into parts for automobiles and airplanes and medical devices and control systems and things like that. They make up almost half of the population of Artist Island. The parts-makers work with vehicle and device designers to make sure their parts fit and operate properly. Recently they have learned a few things from the enterprising artisans, such as focusing on flow efficiency (moving stones through their work area without delay), understanding the needs of their customers (both the device designers as well as device customers), and constantly looking for new stones that can better meet these needs.<br />
<br />
<span style="font-size: large;"><b>The Future</b></span><br />
As the years go by, the enterprising artisans will move to the mainland to work side-by-side with the Festival People. The parts-makers will also migrate to the mainland and join forces with the device designers. All that will be left on the former island are the cost centers housing traditional artisans, but even these will gradually shrink and eventually disappear. Because in the end, while the talent of the artisans will remain essential, Artist Island will not matter anymore.<br />
<br />
<span style="font-size: large;"><b>Navigating the New Landscape</b></span><br />
We have spent a lot of time over the past decade working to make life better on Artist Island (a.k.a. Software Development Land). We promoted Lean principles such as small batches and steady flow and quality at the source. But over the past few years we have watched the waters recede and marveled as the island turned into a peninsula. The best and brightest of the artisans have abandoned the boat system and learned to talk directly with customers, work as partners with other disciplines, and seek out new approaches to solving problems. <br />
<br />
We have written three books about Artist Island, but we couldn't write a fourth, because the island has largely disappeared. In its place is a new landscape, one in which integrated product teams are expected to ask the right questions, solve the right problems, and deliver solutions that customers love. So we wrote our fourth book -- <a href="http://www.amazon.com/exec/obidos/ASIN/0321896904/poppendieckco-20"><b>The Lean Mindset: Ask the Right Questions</b></a> -- about thriving in the new landscape, a land without islands, a land that doesn't have quite enough artisans, a land that’s full of endless possibilities. <br />
_____________________________<br />
Footnote:<br />
[1] Thanks to Niklas Modig and Pär Åhlström, for introducing us to these two viewpoints on efficiency. See their excellent book -- <a href="http://www.amazon.com/exec/obidos/ASIN/919803930X/poppendieckco-20"><b>This Is Lean: Resolving the Efficiency Paradox</b></a>.Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comtag:blogger.com,1999:blog-2229468562774492653.post-42914089472777391602011-08-19T09:56:00.007-05:002017-09-06T11:03:17.964-05:00Don’t Separate Design from Implementation<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlWqdfnan8aAxa3m_ZsBnKPcpR3Y8oT3xtUh7LmyMYmSIpVz050l5qTU4TecpPannp6fJ1ZCaKi6jVRuIW57B1g-YF0yqVuXz1yhfLt8v11vih55HpsfTDUZG6_iu_jM_n0_nJnMGODqIV/s1600/List.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlWqdfnan8aAxa3m_ZsBnKPcpR3Y8oT3xtUh7LmyMYmSIpVz050l5qTU4TecpPannp6fJ1ZCaKi6jVRuIW57B1g-YF0yqVuXz1yhfLt8v11vih55HpsfTDUZG6_iu_jM_n0_nJnMGODqIV/s200/List.jpg" width="131" /></a></div>
I was a programmer for about fifteen years. Then I managed a factory IT department for a few years, and managed vendors delivering software for yet more years. In all of those years (with one exception), software was delivered on time and customers were happy. Yet I never used a list of detailed requirements, let alone a backlog of stories, to figure out what should be done – not for myself, not for my department, not even for vendors.<br />
<br />
In fact, I couldn’t imagine how one could look at a piece of paper – words – and decipher what to program. I felt that if the work to be done could be adequately written down in a detailed enough manner that code could be written from it, well, it pretty much had to be pseudocode. And if someone was going to write pseudocode, why not just write the code? It would be equally difficult, less error-prone, and much more efficient.<br />
<br />
<span style="font-size: large;"><b>Software Without Stories</b></span><br />
So if I didn’t use detailed requirements – how did I know what to code? Actually, everything had requirements, it’s just that they were high level goals and constraints, not low level directives. For example, when I was developing process control systems, the requirements were clear: the system had to control whatever process equipment the guys two floors up were designing, the product made by the process had to be consistently high quality, the operator had to find the control system convenient to use, and the plant engineer had to be able to maintain it. In addition, there was a deadline to meet and it would be career-threatening to be late. Of course there was a rough budget based on history, but when a control system was going to be used for some decades, one was never penny wise and pound foolish. With these high level goals and constraints, a small team of us proceeded to design, develop, install, and start up a sophisticated control system, with guidance from senior engineers who had been doing this kind of work for decades.<br />
<br />
One day, after I had some experience myself, an engineering manager from upstairs came to ask me for help. He had decided to have an outside firm develop and install a process monitoring system for a plant. There was a sophisticated software system involved – the kind I could have written, except that it was too large a job for the limited number of engineers who were experienced programmers. He had chosen to contract with the outside firm on a time-and-materials basis even though his boss thought time-and-materials was a mistake. The engineering manager didn’t believe that it was possible to pre-specify the details of what was needed, but if a working system wasn’t delivered on time and on budget, he would be in deep trouble. So he gave me this job: “Keep me out of trouble by making sure that the system is delivered on time and on budget, and make sure that it does what Harold Stressman wants it to do.”<br />
<br />
Harold was a very senior plant product engineer who wanted to capture real time process information in a database. He already had quality results in a database, and he wanted to do statistical analysis to determine which process settings gave the best results. Harold didn’t really care how the system would work, he just wanted the data. My job was to keep the engineering manager out of trouble by making sure that the firm delivered the system Harold envisioned within strict cost and schedule constraints.<br />
<br />
The engineering manager suggested that I visit the vendor every few weeks to monitor their work. So every month for eighteen months I flew to Salt Lake City with a small group of people. Sometimes Harold came, sometimes the engineers responsible for the sensors joined us, sometimes the plant programmers were there. We did not deliver “requirements;” we were there to review the vendor’s design and implementation. Every visit I spent the first evening poring over the current listings to be sure I believed that the code would do what the vendor claimed it would do. During the next day and a half we covered two topics: 1) What could the system actually do today (and was this a reasonable step toward getting the data Harold needed)? and 2) Exactly how did the vendor plan to get the system done on time (and was the plan believable)?<br />
<br />
This story has a happy ending: I kept the engineering manager out of trouble, the system paid for half of its cost in the first month, and Harold was so pleased with the system that he convinced the plant manager to hire me as IT manager.<br />
<br />
At the plant, just about everything we did was aimed at improving plant capacity, quality, or throughput, and since we were keepers of those numbers, we could see the impact of changes immediately. The programmers in my department lived in the same small town as their customers in the warehouse and on the manufacturing floor. They played softball together at night, met in town stores and at church, had kids in the same scout troop. Believe me, we didn’t need a customer proxy to design a system. If we ever got even a small detail of any system wrong, the programmers heard about it overnight and fixed it the next day.<br />
<br />
<span style="font-size: large;"><b>Bad Amateur Design</b></span><br />
The theme running through all of my experience is that the long list of things we have come to call requirements – and the large backlog of things we have come to call stories – are actually the design of the system. Even a list of features and functions is design. And in my experience, design is the responsibility of the technical team developing the system. For example, even though I was perfectly capable of designing and developing Harold’s process monitoring system myself, I never presumed to tell the vendor’s team what features and functions the system should have. Designing the system was their job; my job was to review their designs to be sure they would solve Harold’s problem and be delivered on time.<br />
<br />
If detailed requirements are actually design, if features and functions are design, if stories are design, then perhaps we should re-think who is responsible for this design. In most software development processes I have encountered, a business analyst or product owner has been assigned the job of writing the requirements or stories or use cases which constitute the design of the system. Quite frankly, people in these roles often lack the training and experience to do good system design, to propose alternative designs and weigh their trade-offs, to examine implementation details and modify the design as the system is being developed. All too often, detailed requirements lists and backlogs of stories are actually bad system design done by amateurs.<br />
<br />
I suggest we might get better results if we skip writing lists of requirements and building backlogs of stories. Instead, expect the experienced designers, architects, and engineers on the development team to design the system against a set of high-level goals and constraints – with input from and review by business analysts and product managers, as well as users, maintainers, and other stakeholders.<br />
<br />
A couple of my “old school” colleagues agree with me on this point. Fred Brooks, author of the software engineering classic “The Mythical Man Month” wrote in his recent book “The Design of Design” [1]:<br />
<blockquote>
“One of the most striking 20th century developments in the design disciplines is the progressive divorce of the designer from both the implementer and the user. … [As a result] instances of disastrous, costly, or embarrassing miscommunication abound.”</blockquote>
Tom Gilb, author of the very popular books “Principles of Software Engineering Management” and “Competitive Engineering” recently wrote [2]:<br />
<blockquote>
“The worst scenario I can imagine is when we allow real customers, users, and our own salespeople to dictate ‘functions and features’ to the developers, carefully disguised as ‘customer requirements’. Maybe conveyed by our product owners. If you go slightly below the surface of these false ‘requirements’ (‘means’, not ‘ends’), you will immediately find that they are not really requirements. They are really bad amateur design for the ‘real’ requirements…. </blockquote>
<blockquote>
"Let developers engineer technical solutions to meet the quantified requirements. This gets the right job (design) done by the right people (developers) towards the right requirements (higher level views of the qualities of the application).”</blockquote>
Separating design from implementation amounts to outsourcing the responsibility for the suitability of the resulting system to people outside the development team. The team members are then in a position of simply doing what they are told to do, rather than being full partners collaborating to create great solutions to problems that they care about.<br />
<br />
_________________________________<br />
Footnotes: <br />
[1] <a href="http://www.amazon.com/exec/obidos/ASIN/0201362988/poppendieckco-20">“The Design of Design”</a> by Fred Brooks, pp 176-77. Pearson Education, 2010<br />
[2] <a href="http://www.agilerecord.com/agilerecord_03.pdf">"Value-Driven Development Principles and Values;"</a> by Tom Gilb, July 2010 Issue 3, Page 18, Agile Record 2010 (www.AgileRecord.com)Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comtag:blogger.com,1999:blog-2229468562774492653.post-71157042997921672562011-07-15T23:36:00.009-05:002011-08-19T16:54:54.889-05:00How Cadence Predicts Process<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCBRvumqE_hKOIpy2C4PTni4k2v3W6IV_AxPzPs1d6cVTr1OiXhuulWDzaw8SrAduhcoUDvKvWjQtgrRxUrYn90FGFLkYa0soHgbhPdyjoC4VNoB1SZMLYJhW-sXxKuuNauFak2Iyf-CdE/s1600/jumprope.JPG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="236" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCBRvumqE_hKOIpy2C4PTni4k2v3W6IV_AxPzPs1d6cVTr1OiXhuulWDzaw8SrAduhcoUDvKvWjQtgrRxUrYn90FGFLkYa0soHgbhPdyjoC4VNoB1SZMLYJhW-sXxKuuNauFak2Iyf-CdE/s400/jumprope.JPG" width="150" /></a></div>If you want to learn a lot about a software development organization very quickly, there are a few simple questions you might ask. You might find out if the organization focuses on projects or products. You might look into what development process it uses. But perhaps the most revealing question is this: How far apart are the software releases? <br />
<br />
It is rare that new software is developed from scratch; typically existing software is expanded and modified, usually on a regular basis. As a result, most software development shops that we run into are focused on the next release, and very often releases are spaced out at regular intervals. We have discovered that a significant differentiator between development organizations is the length of that interval – the length of the software release cycle. <br />
<br />
Organizations with release cycles of six months to a year (or more) tend to work like this: Before a release cycle begins, time is spent deciding what will be delivered in the next release. Estimates are made. Managers commit. Promises are made to customers. And then the development team is left to make good on all of those promises. As the code complete date gets closer, emergencies arise and changes have to be made, and yet, those initial promises are difficult to ignore. Pressure increases.<br />
<br />
If all goes according to plan, about two-thirds of the way through the release cycle, code will be frozen for system integration testing (SIT) and user acceptance testing (UAT). Then the fun begins, because no one really knows what sort of unintended interactions will be exposed or how serious the consequences of those interactions will be. It goes without saying that there will be defects; the real question is, can all of the critical defects be found and fixed before the promised release date?<br />
<br />
Releases are so time-consuming and risky that organizations tend to extend the length of their release cycle so as not to have to deal with this pain too often. Extending the release cycle invariably increases the pain, but at least the pain occurs less frequently. Counteracting the tendency to extend release cycles is the rapid pace of change in business environments that depend on software, because longer release cycles become a constraint on business flexibility. This clash of cadences results in an intense pressure to cram as many features as possible into each release. As lengthy release cycles progress, pressure mounts to add more features, and yet the development organization is expected to meet the release date at all costs.<br />
<br />
Into this intensely difficult environment a new idea often emerges – why not shorten the release cycle, rather than lengthen it? This seems like an excellent way to break the death spiral, but it isn’t as simple as it seems. The problem, as Kent Beck points out in his talk “<a href="http://youtu.be/KIkUWG5ACFY">Software G Forces: the Effects of Acceleration</a>,” is that shorter release cycles demand different processes, different sales strategies, different behavior on the part of customers, and different governance systems. These kinds of changes are notoriously difficult to implement.<br />
<br />
<span style="font-size: large;"><b>Quick and Dirty Value Stream Map</b></span><br />
I’m standing in front of a large audience. I ask the question: “Who here has a release cycle longer than three months?” Many hands go up. I ask someone whose hand is up, “How long is your release cycle?” She may answer, “Six months.” “Let me guess how much time you reserve for final integration, testing, hardening, and UAT,” I say. “Maybe two months?” If she had said a year, I would have guessed four months. If she had said 18 months, I would have guessed 6 months. And my guess would be very close, every time. It seems quite acceptable to spend two-thirds of a release cycle building buggy software and the last third of the cycle finding and fixing as many of those bugs as possible.<br />
<br />
The next question I ask is: When do you decide what features should go into the release? Invariably when the release cycle is six months or longer, the answer is: “Just before we start the cycle.” Think about a six month release cycle: For the half year prior to the start of the cycle, demand for new or changed features has been accumulating – presumably at a steady pace. So the average time a feature waits before it is even considered for development is three months – half of the six month cycle time. Thus – on the average – a feature spends three months waiting before the cycle begins, plus six months in development and test before it is released to customers; nine months in all.<br />
<br />
Finally I ask, “About how many features might you develop during a six month release cycle?” Answers to this vary widely from one domain to another, but let’s say I am told that about 25 features are developed in a six month release, which averages out to about one feature per week. <br />
<br />
This leaves us with a quick and dirty vision of the value stream: a feature takes a week to develop, yet even in the best case it takes nine months (38 weeks) to make it through the system. So the process efficiency is 1÷38, or about 2.6%. A lot of this low efficiency can be attributed to batching up 25 features in a single release. A lot more can be attributed to the fact that only four of the nine months are actually spent developing software – the rest of the time is spent waiting for a release cycle to start or waiting for integration testing to finish.<br />
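The quick and dirty arithmetic above can be restated in a few lines. This sketch simply re-derives the article's numbers (three months of queueing, a six-month cycle, roughly one feature-week of real work per feature); the variable names are mine:

```python
# Back-of-the-envelope value stream arithmetic from the example above.
wait_months = 3    # average time a feature queues before the cycle starts
cycle_months = 6   # development plus integration test inside the cycle

lead_time_weeks = (wait_months + cycle_months) * 52 / 12  # ~39 (rounded to 38 in the text)
touch_time_weeks = 1   # ~25 features per 6-month cycle ≈ one feature-week each

efficiency = touch_time_weeks / lead_time_weeks
print(f"lead time ≈ {lead_time_weeks:.0f} weeks, process efficiency ≈ {efficiency:.1%}")
```

Running this prints a process efficiency of about 2.6% – in other words, a feature spends roughly 97% of its lead time waiting rather than being worked on.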
<br />
<span style="font-size: large;"><b>Why not Quarterly Releases?</b></span><br />
With such dismal process efficiency, let’s revisit the brilliant idea of shortening release cycles. The first problem we encounter is that at a six month cadence, integration testing generally takes about two months; if releases are annual, integration testing probably takes three or four months. This makes quarterly releases quite a challenge. <br />
<br />
For starters, the bulk of the integration testing is going to have to be automated. However, most people rapidly discover that their code base is very difficult to test automatically, because it wasn’t designed or written to be tested automatically. If this sounds like your situation, I recommend that you read Gojko Adzic’s book “<a href="http://www.amazon.com/exec/obidos/ASIN/1617290084/poppendieckco-20">Specification by Example</a>.” You will learn to think of automated tests as executable specifications that become living documentation. You will not be surprised to discover that automating integration tests is technically challenging, but the detailed case studies of successful teams will give you guidance on both the benefits and the pitfalls of creating a well architected integration test harness. <br />
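To give a feel for what an executable specification looks like, here is a minimal sketch in the style Adzic describes: the test names state a business rule in domain language, so the suite doubles as living documentation. The `Order` class and the discount rule are invented for illustration; they are not taken from the book.

```python
# A hypothetical business rule, expressed as an executable specification.
class Order:
    def __init__(self, total):
        self.total = total

    def discount(self):
        # Rule under specification: orders of 100 or more get 10% off.
        return self.total * 0.10 if self.total >= 100 else 0.0


# Each test reads as a sentence from the specification.
def test_orders_of_100_or_more_get_ten_percent_discount():
    assert Order(100).discount() == 10.0

def test_smaller_orders_get_no_discount():
    assert Order(99).discount() == 0.0
```

Run under a test runner such as pytest, these checks execute on every build, so the specification can never silently drift out of date the way a written requirements document does.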
<br />
Once you have the beginnings of an automated integration test harness in place, you may as well start using it frequently, because its real value is to expose problems as soon as possible. But you will find that code needs to be “done” in order to be tested in this harness, otherwise you will get a lot of false negatives. Thus all teams contributing to the release would do well to work in 2-4 week iterations and bring their code to a state that can be checked by the integration test harness at the end of every iteration. Once you can reasonably begin early, frequent integration testing, you will greatly reduce final integration time, making quarterly releases practical. <br />
<br />
Be careful, however, not to move to quarterly releases without thinking through all of the implications. As Kent Beck noted in his <a href="http://youtu.be/KIkUWG5ACFY">Software G Forces</a> talk, sales and support models at many companies are based on annual maintenance releases. If you move from an annual to a quarterly release, your support model will have to change for two reasons: 1) customers will not want to purchase a new release every quarter, and 2) you will not be able to support every single release over a long period of time. You might consider quarterly private releases with a single annual public release, or you might want to move to a subscription model for software support. In either case, you would be wise not to guarantee long term support for more than one release per year, or support will rapidly become very expensive.<br />
<br />
<span style="font-size: large;"><b>From Quarterly to Monthly Releases</b></span><br />
Organizations that have adjusted their processes and business models to deal with a quarterly release cycle begin to see the advantages of shorter release cycles. They see more stability, more predictability, less pressure, and they can be more responsive to their customers. The question then becomes, why not increase the pace and release monthly? They quickly discover that an additional level of process and business change will be necessary to achieve the faster cycle time because four weeks – twenty days – from design to deployment is not a whole lot of time. <br />
<br />
At this cadence, as Kent Beck notes, there isn’t time for a lot of information to move back and forth between different departments; you need a development team that includes analysts and testers, developers and build specialists. This cross-functional team synchronizes via short daily meetings and visualization techniques such as cards and charts on the wall – because there simply isn’t time for paper-based communication. The team adopts processes to ensure that the code base always remains defect-free, because there isn’t time to insert defects and then remove them later. Both TDD (Test Driven Development) and SBE (Specification by Example) become essential disciplines. <br />
<br />
From a business standpoint, monthly releases tend to work best with software-as-a-service (SaaS). First of all, pushing monthly releases to users for them to install creates huge support headaches and takes far too much time. Secondly, it is easy to instrument a service to see how useful any new feature might be, giving the development team immediate and valuable feedback.<br />
<br />
<span style="font-size: large;"><b>Weekly / Daily Releases</b></span><br />
There are many organizations that consider monthly releases a glacial pace, so they adopt weekly or even daily releases. At a weekly or daily cadence, iterations become largely irrelevant, as does estimating and commitment. Instead, a flow approach is used; features flow from design to done without pause, and at the end of the day or week, everything that is ready to be deployed is pushed to production. This rapid deployment is supported by a great deal of automation and requires a great deal of discipline, and it is usually limited to internal or SaaS environments. <br />
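The end-of-day release gate in a flow approach can be sketched in a few lines. This is a toy illustration, not anyone's actual pipeline: the feature records and field names are invented, and in practice the "deploy" step would trigger the automated deployment machinery.

```python
# Toy end-of-day release gate for a flow-based process: anything that is
# done and passing its tests ships today; everything else stays in flow.
def ready_for_release(feature):
    return feature["done"] and feature["tests_pass"]

def end_of_day_release(features):
    # In a real shop this list would be handed to the deployment pipeline.
    return [f["name"] for f in features if ready_for_release(f)]

work_in_flow = [
    {"name": "search-filter", "done": True,  "tests_pass": True},
    {"name": "new-billing",   "done": True,  "tests_pass": False},
    {"name": "dark-mode",     "done": False, "tests_pass": True},
]
print(end_of_day_release(work_in_flow))  # → ['search-filter']
```

Note what is absent: no iteration boundary, no estimate, no commitment – the only question the gate asks is whether a feature is done and proven working right now.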
<br />
There are a lot of companies doing daily releases; for example, one of our customers with a very large web-based business has been doing daily releases for five years. The developers at this company don’t really relate to the concept of iterations. They work on something, push it to systems test, and if it passes it is deployed at the end of the day. Features that are not complete are hidden from view until a keystone is put in place to expose the feature, but code is deployed daily, as it is written. Occasionally a roll-back is necessary, but this is becoming increasingly rare as the test suites improve. Managers at the company cannot imagine working at a slower cadence; they believe that daily deployment increases predictability, stability, and responsiveness – all at the same time.<br />
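The “keystone” technique described above is what many teams today implement as a feature flag: incomplete code ships to production every day but stays dark until a single switch is flipped. A minimal sketch (the flag and function names are invented for illustration):

```python
# Incomplete feature code is deployed daily but hidden behind a flag;
# flipping the flag is the "keystone" that exposes the finished feature.
FEATURE_FLAGS = {
    "new_checkout": False,  # code is live in production, feature is dark
}

def feature_enabled(name):
    return FEATURE_FLAGS.get(name, False)

def render_checkout():
    if feature_enabled("new_checkout"):
        return "new checkout page"   # new code path, invisible until launch
    return "classic checkout page"   # current behavior, unchanged

print(render_checkout())  # → classic checkout page
```

Because the new code path is exercised by the test suite but never shown to users, deployment and release become independent decisions – which is what makes daily deployment safe.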
<br />
<span style="font-size: large;"><b>Continuous Delivery</b></span><br />
Jez Humble and David Farley wrote “<a href="http://www.amazon.com/exec/obidos/ASIN/0321601912/poppendieckco-20">Continuous Delivery</a>” to share with the software development community techniques they have developed to push code to the production environment as soon as it is developed and tested. But continuous delivery is not just a matter of automation. As noted above, sales models, pricing, organizational structure and the governance system all merit thoughtful consideration. <br />
<br />
Every step of your software delivery process should operate at the same cadence. For example, with continuous delivery, portfolio management becomes a thing of the past; instead people make frequent decisions about what should be done next. Continuous design is necessary to keep pace with the downstream development, validation and verification flow. And finally, measurements of the “success” of software development have to be based on delivered value and improved business performance, because there is nothing else left to measure.Mary Poppendieckhttp://www.blogger.com/profile/01193243920681352112noreply@blogger.comtag:blogger.com,1999:blog-2229468562774492653.post-17677213246620404202011-02-07T20:53:00.027-06:002011-08-19T10:21:36.624-05:00Before There Was Management<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjG6GatcFsiyvyam_FAevedl4MNi1Qi20tAtC_cfBmFKgm1ar8hv9Wl4z9rOvCQysmDY6UblkyQ0qNkse8VB95kB7yFiUU0e_Mj7GfRWWinYdWh8DyFWN4-8Z2TSZs9kIfOzK1mQUw40ycl/s1600/village.gif" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="120" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjG6GatcFsiyvyam_FAevedl4MNi1Qi20tAtC_cfBmFKgm1ar8hv9Wl4z9rOvCQysmDY6UblkyQ0qNkse8VB95kB7yFiUU0e_Mj7GfRWWinYdWh8DyFWN4-8Z2TSZs9kIfOzK1mQUw40ycl/s200/village.gif" width="200" /></a></div>Management is a rather recent invention in the history of human evolution – it’s been around for maybe 100 or 150 years, about two or three times longer than software. But people have been living together for thousands of years, and it could be argued that over those thousands of years, we did pretty well without managers. People are social beings, hardwired through centuries of evolution to protect their family and community, and to provide for the next generation. 
For tens of thousands of years, people have lived together in small hamlets or clans that were relatively self-sufficient, where everyone knew – and was probably related to – everyone else. These hamlets inevitably had leaders to provide general direction, but day-to-day activities were governed by a set of well understood mutual obligations. As long as the hamlets stayed small enough, this was just about all the governance that was needed; and most hamlets stayed small enough to thrive without bureaucracy until the Industrial Revolution.<br />
<br />
<span style="font-size: large;"><b>The Magic Number One Hundred and Fifty</b></span><br />
Early in his career, British anthropologist Robin Dunbar found himself studying the sizes of monkey colonies, and he noticed that different species of monkeys preferred different size colonies. Interestingly, the size of a monkey colony seemed to be related to the size of the monkeys’ brains; the smaller the brain, the smaller the colony. Dunbar theorized that brain size limits the number of social contacts that a primate could maintain at one time. Thinking about how humans seemed to have evolved from primates, Dunbar wondered if, since the human brain was larger than the monkey brain, humans would tend to live in larger groups. He calculated the maximum group size that humans would be likely to live in based on the relative size of the human brain, and arrived at a number just short of 150. Dunbar theorized that humans might have a limit on their social channel capacity <i>(the number of individuals with whom a stable inter-personal relationship can be maintained) </i>of about 150.[1]<br />
<br />
To test his theory, Dunbar and other researchers started looking at the size of social groups of people. They found that a community size of 150 has been a very common maximum limit in human societies around the world going back in time as far as they can investigate. And <a href="http://en.wikipedia.org/wiki/Dunbar%27s_number">Dunbar’s Number</a> (150) isn’t found only in ancient times. The Hutterites, a religious group that formed self-sufficient agricultural communities in Europe and North America, have kept colonies under 150 people for centuries. Beyond religious communities, Dunbar found that during the eighteenth century, the average number of people in villages in every English county except Kent was around 160. (In Kent it was 100.) Even today, academic communities that are focused on a particular narrow discipline tend to be between 100 and 200 – when the community gets larger, it tends to split into sub-disciplines.[2]<br />
<br />
Something akin to Dunbar’s number can be found in the world of technology also. When Steve Jobs ran the Macintosh department at Apple, his magic number was 100. He figured he could not remember more than 100 names, so the department was limited to 100 people at one time. A team that never exceeded 100 people designed and developed both the hardware and software that became the legendary Apple Macintosh.[3] Another example: in a 2004 blog post <a href="http://www.lifewithalacrity.com/2004/03/the_dunbar_numb.html">The Dunbar Number as a Limit to Group Sizes</a>, Christopher Allen noted that on-line communities tend to have 40 to 60 active members at any one time. You can see two peaks in Allen’s <a href="http://www.lifewithalacrity.com/GroupSatisfaction.jpg">chart of group satisfaction as a function of group size</a> – one peak for a team size of 5 to 8, and an equally high peak when team size is around 50.[4]<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUpXE49epuO4zYLhwlpi0yX52cpAdK08QYfwx0Me66dxqdNRQwoL6OMIloBYqOocSPl_aXa6fRZycQw7f5a9MtAdnpctC0X6_RgqoekbLMRyz8AyTY7GObpOGTFQmPPpKY_SBkpTPVjh5i/s1600/Double+peak.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="273" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUpXE49epuO4zYLhwlpi0yX52cpAdK08QYfwx0Me66dxqdNRQwoL6OMIloBYqOocSPl_aXa6fRZycQw7f5a9MtAdnpctC0X6_RgqoekbLMRyz8AyTY7GObpOGTFQmPPpKY_SBkpTPVjh5i/s400/Double+peak.jpg" width="400" /></a></div>Steve Jobs’ limit of 100 people was probably a derivative of the Dunbar Number, but Allen’s peak at 50 is something different. According to Dunbar, “If you look at the pattern of relationships within… our social world, a number of circles of intimacy can be detected. The innermost group consists of about three to five people. … Above this is a slightly larger grouping that typically consists of about ten additional people. And above this is a slightly bigger circle of around thirty more…”[5] In case you’ve stopped counting, the circles of intimacy are 5, 15, 50, 150 – each circle about three times the size of the one inside it. The number 50, which Allen found in many on-line communities, is the number of people Dunbar found in many hunting groups in ancient times – and three of these groups of 50 would typically make up a clan.<br />
<br />
<span style="font-size: large;"><b>Does this Work in Companies?</b></span><br />
One hundred and fifty is certainly a magic number for W.L. Gore & Associates. Gore is a privately held business that specializes in developing and manufacturing innovative products based on PTFE, the fluoropolymer in Gore-Tex fabrics. Gore has revenues exceeding 2.5 billion US dollars, employs over 8000 people, and has been profitable for over half a century. It has held a permanent spot on the U.S. "100 Best Companies to Work For" list since its inception in 1984, and is a fixture on similar lists in many countries in Europe. This amazing track record might be related to the fact that Gore doesn’t have managers. There are plenty of leaders at Gore, but leaders aren’t assigned the role; they earn it by attracting followers.<br />
<br />
You’ve got to wonder how such a large company can turn in such consistent performance for such a long period of time without using traditional management structures. The answer seems to have something to do with the fact that Gore is organized into small businesses units that are limited to about 150 people. “We found again and again that things get clumsy at a hundred and fifty,” according to founder Bill Gore. So when the company builds a new plant, it puts 150 spaces in the parking lot, and when people start parking on the grass, they know it’s time to build a new plant.<br />
<br />
Since associates at Gore do not have managers, they need different mechanisms to coordinate work, and interestingly, one of the key mechanisms is peer pressure. Here is a quote from Jim Buckley, a long-time associate at a Gore plant: “The pressure that comes to bear if we are not efficient as a plant, if we are not creating good enough earnings for the company, the peer pressure is unbelievable. …This is what you get when you have small teams, where everybody knows everybody. Peer pressure is much more effective than a concept of a boss. Many, many times more powerful.”[6]<br />
<br />
Like many companies that depend on employees to work together and make good decisions, Gore is very careful to hire people who will fit well in its culture. Leaders create environments where people have the tools necessary for success and the information needed to make good decisions. Work groups are relatively stable so people get to know the capabilities and expectations of their colleagues. But in the end, the groups are organized around trust and mutual obligation – a throwback to the small communities in which humans have thrived for most of their history.<br />
<br />
Google’s management culture has quite a few similarities with Gore’s. Google was designed to work more or less like a university – where people are encouraged to decide on their own (with guidance) what they want to investigate. Google is extremely careful about hiring people who will fit in its culture, and it creates environments where people can pursue their passion without too much management interference. For a deep dive into Google’s culture, see this video: <a href="http://www.youtube.com/watch?v=juRkmecQD-8">Eric Schmidt at the Management Lab Summit</a><br />
<br />
<span style="font-size: large;"><b>Peer Cultures</b></span><br />
Before there were managers, peer cultures created the glue that held societies together. In clans and hamlets around the world throughout the centuries, the self-interest of the social group was tightly coupled with the self-interest of individuals and family units; and thus obligations based on family ties and reciprocity were essential in creating efficient communities.<br />
<br />
There are many, many examples of peer cultures today, from volunteer organizations to open source software development to discussion forums and social networks on the web. In these communities, people are members by their own choice; they want to contribute to a worthy cause, get better at a personal skill, and feel good about their contribution. In a peer culture, leaders provide a vision, a way for people to contribute easily, and just enough guidance to be sure the vision is achieved. <br />
<br />
Arguably, peer cultures work a lot better than management at getting many things done, because they create a social network and web of obligations that underlie intrinsic motivation. So perhaps we’d be better off taking a page out of the Gore or the Google or the Open Source playbook and leverage thousands of years of human evolution. We are naturally social beings and have a built-in need to protect our social unit and ensure that it thrives. <br />
<br />
<span style="font-size: large;"><b>Example: Hardware/Software Products</b></span><br />
“We have found through experience that the ideal team size is somewhere between 30 and 70,” the executive told us. At first we were surprised. Aren’t teams supposed to be limited to about 7 people? Don’t teams start breaking up when they’re much larger? Clearly the executive was talking about a different kind of team than we generally run into in agile software development. But his company was one of the most successful businesses we have encountered recently, so we figured there had to be something important in his observation. <br />
<br />
We spent a morning with a senior project manager at the company – the guy who coordinated 60 people in the development of a spectacular product in record time. The resulting product was far ahead of its time and gave the company a significant competitive advantage. He explained how he coordinated the work: “Every 2 or 3 months we produced a working prototype, each one more sophisticated than the last one. As we were nearing the end of development, a new (faster, better, cheaper) chip hit the market. The team decided to delay the next prototype by two months so they could incorporate the new chip. Obviously we didn’t keep to the original schedule, but in this business, you have to be ready to seize the opportunities that present themselves.” <br />
<br />
It’s not that this company had no small teams inside the larger teams; of course they did. It’s just that the coordination was done at the large team level, and the members of the smaller teams communicated on a regular basis with everyone on the larger team. All team members were keenly aware of the need to meet the prototype deadlines and they didn’t need much structure or encouragement to understand and meet the needs of their colleagues.<br />
<br />
<span style="font-size: large;"><b>Another Example: Construction</b></span><br />
The <a href="http://www.leanconstruction.org/">Lean Construction Institute</a> has developed a similar approach to effectively organizing construction work. The first thing they do is to break down very large projects into multiple smaller ones so that a reasonable number of contractors can work together. (Remember Dunbar’s Number.) For example, they might completely separate a parking structure and landscaping from the main building; in a large building, the exterior would probably be a separate project from the interior. Each sub-project is further divided into phases of a few months; for example, foundation, structure, interior systems, etc. Before a phase starts, a meeting of all involved contractors is held and all of the things that need to be done to complete that phase are posted on cards on a wall by the contractors. The cards are organized into a timeline that takes dependencies into account, and all of the contractors agree that the wall represents a reasonable simulation of the work that needs to be done. This is not really a plan so much as an agreement among the contractors doing the work about what SHOULD be done to complete the phase.<br />
<br />
Each week all of the “Last Planners” (crew chiefs, superintendents, etc.) get together and look at what they SHOULD do, and also what they CAN do, given the situation at the building site. Then they commit to each other what they WILL complete in the next week. The contractors make face-to-face commitments to peers that they know personally. This mutual commitment just plain gets things done faster and more reliably than almost any other organizing technique, including every classic scheduling approach in the book.<br />
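The SHOULD/CAN/WILL progression amounts to successively filtering the phase wall: everything the phase still needs, narrowed to what the site currently permits, narrowed again to what the crews actually commit to. A sketch of that filtering in Python – the task names, fields, and capacity rule are hypothetical illustrations, not anything the Lean Construction Institute prescribes:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    depends_on: list = field(default_factory=list)

def should_do(phase_tasks, completed):
    # SHOULD: everything on the phase wall that is not yet done
    return [t for t in phase_tasks if t.name not in completed]

def can_do(should, completed, blocked_on_site):
    # CAN: dependencies are met and no site condition blocks the task
    return [t for t in should
            if all(d in completed for d in t.depends_on)
            and t.name not in blocked_on_site]

def will_do(can, crew_capacity):
    # WILL: what the contractors commit to each other for next week,
    # limited by the crews they can actually field
    return can[:crew_capacity]

phase = [Task("excavate"), Task("pour footings", ["excavate"]),
         Task("erect forms", ["excavate"]), Task("cure", ["pour footings"])]
done = {"excavate"}
week = will_do(can_do(should_do(phase, done), done, set()), crew_capacity=2)
print([t.name for t in week])  # ['pour footings', 'erect forms']
```

The filtering itself is trivial; the power of the technique lies in the last step being a face-to-face commitment among peers rather than a schedule imposed from above.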
<br />
<span style="font-size: large;"><b>The Magic Number Seven</b></span><br />
George Miller published “<a href="http://www.musanim.com/miller1956/">The Magical Number Seven, Plus or Minus Two</a>” in The Psychological Review in 1956. Miller wasn’t talking about team size in this article; he was discussing the capacity of people to distinguish between alternatives. For example, most people can remember a string of 7 numbers, and they can divide colors or musical tones into about 7 categories. Ask people to distinguish between more than 7 categories, and they start making mistakes. “There seems to be some limitation built into us either by learning or by the design of our nervous systems, a limit that keeps our channel capacities in this general range [of seven],” Miller wrote.<br />
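Miller framed this limit in information-theoretic terms: distinguishing reliably among about seven alternatives corresponds to roughly 2.8 bits of channel capacity, close to the capacities he measured for tones and colors. The arithmetic is a one-liner:

```python
import math

# Distinguishing among k equally likely alternatives transmits
# log2(k) bits of information per judgment.
def bits(alternatives):
    return math.log2(alternatives)

print(round(bits(7), 2))  # about 2.81 bits, near Miller's observed capacity
print(round(bits(2), 2))  # 1 bit: a simple yes/no judgment
```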
<br />
This channel capacity seems to affect our direct interaction with other people – we can keep up a conversation with 7 or so people, but when a group gets larger, it is difficult to maintain a single dialog, and small groups tend to start separate discussions. So for face-to-face groups that must maintain a single conversation, the magic number of 7 +/-2 is a good size limit. And historically, most agile software development teams have been about this size. <br />
<br />
<span style="font-size: large;"><b>Moving Beyond Seven</b></span><br />
The problem is, 7 people are not enough to accomplish many jobs. Take the job of putting a new software-intensive product on the market, for example. The product is almost never about the software – the product is a medical device or a car or a mobile phone or maybe it’s a financial application. Invariably software is a subsystem of a larger overall system, which means that invariably the software development team is a sub-team of a larger overall system team.<br />
<br />
In the book <a href="http://www.amazon.com/exec/obidos/ASIN/0321480961/poppendieckco-20">Scaling Lean & Agile Development</a>, Craig Larman and Bas Vodde make a strong case for <a href="http://featureteamprimer.org/feature_team_primer.pdf">feature teams</a> – cross-functional teams that deliver end-to-end customer features. They recommend against component teams, groups formed around a single component or layer of the system. I agree with their advice, but it seems to me that software is invariably a component of whatever system we are building. We might be creating software to automate a process or software to control a product, software to deliver information or software to provide entertainment. But our customers don’t care about the software; they care about how the product or process works, how relevant the information is or how entertaining the game might be. And if software is a component of a system, then software teams are component teams. What we might want to consider is that real feature teams – teams chartered to achieve a business goal – will almost certainly include more than software development.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgd4Uptz4mG8zhGT5h8MRZ5SQOuV9_FuGQMub57yWYaN6uJjkJLznBRxiuKpXG58phGMqoaNoSiO8sEFNYF3bpqu1jG6lYowuOx4JwRLmVt7XwNUd6Sqfxr1Uay1N2wsJuUXPO6YNqlYexj/s1600/Goal+Teams.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="147" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgd4Uptz4mG8zhGT5h8MRZ5SQOuV9_FuGQMub57yWYaN6uJjkJLznBRxiuKpXG58phGMqoaNoSiO8sEFNYF3bpqu1jG6lYowuOx4JwRLmVt7XwNUd6Sqfxr1Uay1N2wsJuUXPO6YNqlYexj/s400/Goal+Teams.gif" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"></div><br />
Agile development started out as a practice for small software teams, but these days we often see teams of 40 or 50 developers applying agile practices to a single business problem. In almost every case, we notice that the developers are organized into several small teams that work quite separately – and in almost every case, therefore, the biggest problem seems to be coordination across the small teams. There are many mechanisms: use a divisible system architecture so teams can be truly independent; draw from a common list of tasks, which makes teams highly interdependent; send small team representatives to weekly coordinating meetings; and so on. But rarely do we see the most powerful coordination mechanism of all for groups this size: create a sense of mutual obligation through peer commitments.<br />
<br />
<span style="font-size: large;"><b>Mutual Obligation</b></span><br />
You can call mutual obligation peer pressure if you like, but whatever name you use, when individuals on a large team make a commitment to people they know well, the commitment will almost certainly be honored. Mutual obligation is a much more powerful motivating force than being told to do something by an authority figure. And the interesting thing is, the power of mutual obligation is not confined to small teams. It works very well in teams of 50, and can be effective with teams up to 150. The time to split teams is not necessarily when they reach 10; team sizes up to 100 or 150 can be very effective – if you can create a sense of mutual obligation among the team members.<br />
<br />
There are, of course, a few things that need to be in place before mutual commitment can happen. First of all, team members must know each other – well. So this won’t work if you constantly reform teams. In addition to knowing each other’s names, teammates must understand the capabilities of their colleagues on the team, have the capacity to make reliable commitments, and be able to trust that their teammates will meet their commitments. This process of creating mutual obligations actually works best if there is no manager brokering commitments, because then the commitments are made to the manager, not to teammates. Instead, a leader’s role is to lay out the overall objectives, clarify the constraints, and create the environment in which reliable commitments are exchanged.<br />
<br />
For example, the project manager of the hardware/software product (above) laid out a series of increasingly sophisticated prototypes scheduled about three months apart. Having committed to these milestones, the sub-teams organized their work so as to have something appropriate ready at each prototype deadline. When an opportunity arose to dramatically improve the product by incorporating a new chip, the whole team was in a position to rapidly re-think what needed to be done and commit to the new goal.<br />
<br />
In the case of lean construction (above), a large team of contractor representatives works out the details of a “schedule” every few months. Each week, the same team gets together and re-thinks how that “schedule” will have to be adapted to fit current reality. At that same weekly meeting, team members commit to each other what they will actually accomplish in the next week, which gives their colleagues a week to plan work crews, material arrival, and so on for the following week.<br />
<br />
It certainly is a good idea to have small sub-teams whose members work closely together on focused technical problems, coordinating their work with brief daily meetings to touch base and make sure they are on track to meet their commitments. But the manner in which these sub-teams arrive at those commitments is open for re-thinking. It may be better to leverage thousands of years of human evolution and create an environment whereby people know each other and make mutual commitments to meet the critical goals of the larger community. After all, that’s the way most things got accomplished before there was management.<br />
<br />
_________________________________<br />
Footnotes: <br />
[1] Technically, Dunbar calculated the relative sizes of the neocortex – the outer surface of the brain responsible for conscious thinking. For a humorous parody of Dunbar's theory, see "<a href="http://www.cracked.com/article_14990_what-monkeysphere.html">What is the Monkeysphere?</a>" by David Wong. <br />
<br />
[2] Information in this paragraph is from: <a href="http://www.amazon.com/exec/obidos/ASIN/0674057163/poppendieckco-20">How Many Friends Does One Person Need?</a> by Robin Dunbar.<br />
<br />
[3] See <a href="http://www.cultofmac.com/john-sculley-on-steve-jobs-the-full-interview-transcript/63295">John Sculley On Steve Jobs</a>.<br />
<br />
[4] This figure from “<a href="http://www.lifewithalacrity.com/2004/03/the_dunbar_numb.html">The Dunbar Number as a Limit to Group Sizes</a>” is anecdotal. <br />
<br />
[5] From <a href="http://www.amazon.com/exec/obidos/ASIN/0674057163/poppendieckco-20">How Many Friends Does One Person Need?</a> by Robin Dunbar. Interestingly, while Dunbar finds 15 an approximate limit of the second circle of intimacy, Allen finds a group of 15 problematic.<br />
<br />
[6] The Dunbar Number was popularized by Malcolm Gladwell in <a href="http://www.amazon.com/exec/obidos/ASIN/0316346624/poppendieckco-20">The Tipping Point</a>. Much information and both quotes in this section are from Chapter 5 of that book. See <a href="http://nextreformation.com/wp-admin/general/tipping.htm">http://nextreformation.com/wp-admin/general/tipping.htm</a> for an extended excerpt.<br />
<br />
<span style="font-size: large;"><b>A Tale of Two Terminals</b></span> (January 15, 2011)<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCa4sR02OsLHlZjU3ycCeUXpz5PRDZyMqoekerzWWRdGmyZVG8iIpKxgKZaDVRzSMBJoxelQZiD9sneHF166lq0sgeZmU8yaMboCv5GHMuz-_vUX8I4akOrZbDKUkBbmsoju10knivOCbP/s1600/Terminal3.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="133" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCa4sR02OsLHlZjU3ycCeUXpz5PRDZyMqoekerzWWRdGmyZVG8iIpKxgKZaDVRzSMBJoxelQZiD9sneHF166lq0sgeZmU8yaMboCv5GHMuz-_vUX8I4akOrZbDKUkBbmsoju10knivOCbP/s200/Terminal3.jpg" width="200" /></a></div>By any measure, Terminal 3 at Beijing Capital Airport is impressive. Built in less than four years and officially opened barely four months before the Olympics, the massive terminal has received numerous awards for both its stunning design and its comfortable atmosphere. And it escaped the start-up affliction of many new airport terminals when it commenced full operations on March 26, 2008 without any notable problems. <br />
<br />
The next day, halfway around the world, Heathrow Terminal 5 opened for business. At one-third the size of Beijing Terminal 3, the new London terminal had taken twice as long to build and cost twice as much. Proud executives at British Airways and BAA (British Airports Authority) exuded confidence in a flawless opening, but that was not to be. Instead, hundreds of flights were canceled in the first few days of operation, and about 28,000 bags went missing the first weekend. The chaotic opening of Heathrow Terminal 5 was such an embarrassment that it triggered a government investigation.<br />
<br />
The smooth opening of Beijing Terminal 3 was not an anomaly – Terminal 2 at Shanghai Pudong International Airport opened the same day, also without newsworthy incident. Given the timing just before the Beijing Olympic games, it was clear that China was keenly interested in projecting an image of competence to the traveling public. But of course, the UK was equally interested in showcasing its proficiency, and the British executives clearly expected that the opening of Heathrow Terminal 5 would go smoothly. So the question to ponder is this: How did the Chinese airports manage two uneventful terminal openings? Did they do something different, or were the problems in London just bad luck?<br />
<br />
It’s not like testing was overlooked at Heathrow Terminal 5. In fact, a <a href="http://newsweaver.ie/qualtech/e_article000776781.cfm?x=b11,0,w">simulation of the terminal’s systems</a> was developed and all of the technical systems were tested exhaustively, even before they were built. A special testing committee was formed and <a href="http://news.bbc.co.uk/2/hi/uk_news/england/london/7008412.stm">thousands of people</a> were recruited to be mock passengers, culminating in a test with <a href="http://news.bbc.co.uk/2/hi/uk_news/7285674.stm">2000 volunteer passengers</a> a few weeks before the terminal opened. On the other hand, the planned testing regime was curtailed because the terminal construction was not completed as early as planned; in fact, hard-hats were required in the baggage handling area until shortly before opening day. In addition, a decision was made to move 70% of the flights targeted for Terminal 5 on the very first day of operations, because it was difficult to imagine how to move in smaller increments.[1] <br />
<br />
Those of us in the software industry have heard this story before: the time runs out for system testing, but a big-bang cut-over to a mission critical new system proceeds anyway, because the planned date just can’t be delayed. The result is predictable: wishful thinking gives way to various degrees of disaster. <br />
<br />
That kind of disaster wasn’t going to happen in Beijing. I was in Beijing a month before the Olympics, and every single person I met – from tour guide to restaurant worker – seemed to feel personally responsible for projecting a favorable image as the eyes of the world focused on their city. I imagine that for every worker in Terminal 3, a smooth startup was a matter of national pride. But the terminal didn’t open smoothly just because everyone wanted things to go well. It opened smoothly because the airport authorities understood how to orchestrate such a large, complex undertaking that involved hundreds of people. After all, they had just finished building the airport at amazing speed.[2]<br />
<br />
The opening ceremony of the Beijing Olympics was also a large, complex undertaking that involved hundreds of people. It’s easy to imagine the many rehearsals that took place to make sure that everyone knew their part. When it comes to opening a new terminal, the idea of rehearsals doesn’t usually occur to the authorities, but at Beijing Capital Airport, rehearsals started in early February. First a couple of thousand mock passengers took part in a rehearsal, then five thousand, and finally, on February 23rd, <a href="http://news.xinhuanet.com/english/2008-02/23/content_7654965.htm">8000 mock passengers</a> checked in luggage for 146 flights. This was the average daily load expected a week later, when six minor airlines moved into Terminal 3. During the month of March, Terminal 3 operated on a trial basis, ironing out any problems that arose. Meanwhile, staff from the large airlines about to move to the terminal rehearsed their jobs in the new terminal day after day, so that when the big moving day arrived, everyone knew what to do. On March 26, all the practice paid off when the Terminal was opened with very few problems.<br />
<br />
This certainly wasn’t the approach taken at Heathrow Terminal 5. It’s pretty clear that the opening chaos occurred because people did not know what to do: they didn’t know where to park, couldn’t get through security, didn’t know how to sign on to the new PDAs to get their work assignments, didn’t know where to get help, and didn’t know how to stop all the luggage from coming at them until their problems got sorted out. Even the worst of the technical problems was actually a people problem: the baggage handling software had been put in a ‘safe’ mode for testing, and apparently no one was responsible for removing the patch which cut off communication to other terminals in the airport. It took three days to realize that this very human error was the main cause of the software problems![3]<br />
<br />
In testimony to the British House of Commons, union Shop Steward Iggy Vaid testified:[4]<br />
<blockquote>We raised [worker concerns] with our senior management team especially in British Airways. … [Their response was to] involve what we call process engineers who came in and decided what type of process needed to be installed. They only wanted the union to implement that process and it was decided by somebody else, not the people who really worked it. The fact is that they paid lip service to, ignored or did not implement any suggestion we made.<br />
<br />
… as early as January there was a meeting with the senior management team at which we highlighted our concerns about how the baggage system and everything else would fail, that the process introduced would not work and so on. We highlighted all these concerns, but there was no time to change the whole plan. <br />
<br />
... [Workers] had two days of familiarization in a van or were shown slides; they were shown where their lockers were and so on, but there was no training for hands-on work.…. </blockquote>The opening of a new airport terminal is an exercise in dealing with complexity. At Heathrow Terminal 5, new technical systems and new work arrangements had to come together virtually overnight – and changing the date once it has been set would have been difficult and expensive. Hundreds of people were involved, and every glitch in the work system had a tendency to cascade into ever larger problems. <br />
<br />
If this sounds familiar, it’s because this scenario has been played out several times in the lives of many of us in software development. Over time, we have learned a lot about handling unforgiving, complex systems, particularly systems that include people interacting with new technology. But every time we encounter a messy transition like the one at Heathrow Terminal 5, we wonder whether our hard-learned lessons for dealing with complexity couldn’t be spread a bit wider.<br />
<br />
<span style="font-size: large;"><b>Socio-technical Systems</b></span><br />
Not very far from Heathrow, the <a href="http://en.wikipedia.org/wiki/Tavistock_Institute">Tavistock Institute</a> of London has spent some decades researching work designs that deal effectively with turbulence and complexity. In the 1950’s and 60’s, renowned scientists such as <a href="http://en.wikipedia.org/wiki/Eric_Trist">Eric Trist</a> and <a href="http://en.wikipedia.org/wiki/Frederick_Edmund_Emery">Fred Emery</a> documented novel working arrangements that were particularly effective in the coal mines and factories of Great Britain. They found that especially effective work systems were designed (and continually improved) by semi-autonomous work teams of between 10 and 100 people that accepted responsibility for meaningful (end-to-end) tasks. The teams used their knowledge of the work and of high-level objectives to design a system to accomplish the job in a manner that optimized the overall results. Moreover, these teams were much better at managing uncertainty and rapidly adapting to various problems as they were encountered. The researchers found that the most effective work design occurs when the social aspects of the work are balanced with its technical aspects, so they called these balanced work systems <a href="http://en.wikipedia.org/wiki/Sociotechnical_systems">Socio-technical systems</a>.<br />
<br />
In 1981, Eric Trist published “<a href="http://www.sociotech.net/wiki/images/9/94/Evolution_of_socio_technical_systems.pdf">The Evolution of Socio-technical Systems</a>,” an engaging history of his work. He attributes the “old paradigm” of work design to Max Weber (bureaucracy) and Frederick Taylor (work fragmentation). He proposed that a “new paradigm” would be far more effective for organizations in turbulent, competitive, or rapidly changing situations:[5]<br />
<table border="1" cellspacing="0"><tbody>
<tr> <th width="50%"><span style="font-size: large;">Old Paradigm</span></th> <th width="50%"><span style="font-size: large;">New Paradigm</span></th> </tr>
<tr> <td>The technological imperative</td> <td>Joint optimization [of social & technical systems]</td> </tr>
<tr> <td>Man as an extension of the machine</td> <td>Man as complementary to the machine</td> </tr>
<tr> <td>Man as an expendable spare part</td> <td>Man as a resource to be developed</td> </tr>
<tr> <td>Maximum task breakdown, simple narrow skills</td> <td>Optimum task grouping, multiple broad skills</td> </tr>
<tr> <td>External controls (supervisors, specialist staffs, procedures)</td> <td>Internal controls (self-regulating subsystems)</td> </tr>
<tr> <td>Tall organization chart, autocratic style</td> <td>Flat organization chart, participative style</td> </tr>
<tr> <td>Competition, gamesmanship</td> <td>Collaboration, collegiality</td> </tr>
<tr> <td>Organization’s purposes only</td> <td>Members’ and society’s purposes also</td> </tr>
<tr> <td>Alienation</td> <td>Commitment</td> </tr>
<tr> <td>Low risk taking</td> <td>Innovation</td> </tr>
</tbody></table><br />
In the 1980’s the socio-technical paradigm gained increased popularity when team work practices from Japan were widely copied in Europe and America. In the 1990’s socio-technical ideas merged with general systems theory, and the term “socio-technical systems” fell into disuse. But the ideas lived on. These days, it is generally accepted that the most effective way to deal with complex or fast-changing situations is by structuring work around semi-autonomous teams that have the leadership and training to respond effectively to any situation the groups are likely to encounter.<br />
<br />
The clearest examples we have of semi-autonomous work teams are emergency response teams – firefighters, paramedics, emergency room staff. Their job is to respond to challenging, complex, rapidly changing situations, frequently in dangerous surroundings and often with lives at stake. Emergency response teams prepare for these difficult situations by rehearsing their roles, so everyone knows what to do. During a real emergency, that training coupled with the experience of internal leaders enables the teams to respond dynamically and creatively to the emergency as events unfold.<br />
<br />
<span style="font-size: large;"><b>Design Social Systems Along with Technical Systems</b></span><br />
Developing a software system that automates a work system is fraught with just about as much danger as moving to a new airport terminal. There are many things we can do to mitigate that risk: <br />
<blockquote>1. Cutover to any new system should be in small increments. Impossible? Don’t give up on increments too quickly – and don’t leave this to “customers” to decide! The technical risk of a big-bang cutover is immense. And it’s almost always easier to divide the system in some way to facilitate incremental deployment than it is to deal with the virtually guaranteed chaos of a big-bang cutover.<br />
<br />
2. Simplify before you automate. Never automate a work process until the work teams have devised as simple a work process as they possibly can. Automating the right thing is at least as important as automating it right.<br />
<br />
3. Do not freeze work design into code! Leave as much work design as possible for work teams to determine and modify. If that is not possible, make sure that the people who will live with the new system are involved in the design of their work. <br />
<br />
4. Rehearse! Don’t just test the technical part; include the people who will use the new system in end-to-end rehearsals. Be prepared to adapt the technical system to the social system and to refine the social system as well. Be sure everyone knows what to do; be sure that the new work design makes sense. Leave time to adjust and adapt. Don’t cut this part short.<br />
<br />
5. Organize to manage complexity. Structure work around work teams that can adapt to changing situations, especially if the environment is complex, could change rapidly, or is mission critical. At minimum, have emergency response teams on hand when the new system goes live.</blockquote>Much of the software we write ends up having an impact on the lives of other people; in other words, our work creates changes in social systems. We would do well to consider those social systems as we develop the technical systems. If we want to create systems that are truly successful, the technical and social aspects of our systems must be designed together and kept in balance. <br />
___________________<br />
Footnotes:<br />
[1] “<a href="http://www.publications.parliament.uk/pa/cm200708/cmselect/cmtran/543/543.pdf">The opening of Heathrow Terminal 5</a>” report to the House of Commons Transportation Committee pages 13-14.<br />
<br />
[2] Contrast this with the absence of the BAA management team that oversaw the on-time, on-budget construction of Heathrow Terminal 5; they were replaced after a 2006 takeover of BAA by the Spanish company Ferrovial. <br />
<br />
[3] See “<a href="http://www.publications.parliament.uk/pa/cm200708/cmselect/cmtran/543/543.pdf">The opening of Heathrow Terminal 5</a>” report to the House of Commons Transportation Committee.<br />
<br />
[4] “<a href="http://www.publications.parliament.uk/pa/cm200708/cmselect/cmtran/543/543.pdf">The opening of Heathrow Terminal 5</a>” report to the House of Commons Transportation Committee pages 22-25.<br />
<br />
[5] From "<a href="http://www.sociotech.net/wiki/images/9/94/Evolution_of_socio_technical_systems.pdf">The Evolution of Socio-technical Systems</a>" by Eric Trist, 1981, p. 42.<br />
<br />
Mary Poppendieck<br />
<br />
<span style="font-size: large;"><b>The Product Owner Problem</b></span> (December 23, 2010)<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuee-jPrWUIu7HjeeLEDiOkPVapCTO13H8iafpATP03GhTICMl4pVOsHrzVh6KJ21INCDaapPTS5SbHBk9JLeJjrr1zjYJBZFOAs8JPGOq86Ge9hNoXcYOJdYceKt6VsPOxni2UTjcCVj6/s1600/Product+Owner.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuee-jPrWUIu7HjeeLEDiOkPVapCTO13H8iafpATP03GhTICMl4pVOsHrzVh6KJ21INCDaapPTS5SbHBk9JLeJjrr1zjYJBZFOAs8JPGOq86Ge9hNoXcYOJdYceKt6VsPOxni2UTjcCVj6/s1600/Product+Owner.jpg" /></a></div>“We’re really struggling with the Product Owner concept, and many of our Scrum teams just don’t feel very productive,” they told us. “We’d like you to take a look at this and make some recommendations.” The company had several vertical markets, with a Scrum team of about ten people assigned to each market. Each market had a Product Manager, a traditional role found in most product companies. The company was clear about the role of a Product Manager; after all, there are university courses and professional organizations for this role. The Product Managers had a customer-facing job that included business responsibility for determining product direction and capability.<br />
<br />
However, there was serious confusion about the Scrum role of Product Owner and its fit with the classic role of Product Manager. In addition to business responsibility, the Scrum Product Owner has the team-facing responsibility of managing the detailed product requirements.[1] In this company, the Product Managers found it impossible to handle both the customer-facing and team-facing jobs at the same time. So most teams had added an additional person to assist the Product Manager by preparing stories for the team, and called this person the Product Owner. The job of these Product Owners resembled the classic role of business analyst or, in some cases, user interaction designer.<br />
<br />
Unfortunately, these Product Owners had little technical background in analysis or design, yet they were expected to prepare detailed stories for the development team. Critical tradeoffs between business and technical issues often fell to these Scrum Product Owners, yet they had neither the firsthand customer knowledge nor the in-depth technical knowledge to make such decisions wisely. They had become a choke point in the information flow between the Product Manager and the development team.<br />
<br />
We asked the obvious question: How are things organized in the markets that seem to be working well? It turns out that in the two highly successful vertical markets, there was no Product Owner preparing and prioritizing stories for the development team. Instead, the Product Manager had regular high-level conversations with the development team about the general capabilities that would be desirable over the next two or three months. They discussed the feasibility of adding features and the results that could be expected. A real-time application was created to show live web analytics of several key metrics that the Product Manager correlated to increased revenue. Then the team developed the capabilities most likely to drive the metrics in the right direction, observed the results, and modified their development plans accordingly.<br />
<br />
This is a pattern we have seen frequently: Product Managers who lack the time, training, or temperament to handle both the customer-facing and the team-facing responsibilities of software development have two options. They can appoint Scrum Product Owners for each development team, or they can provide high-level guidance to a development team capable of designing the product and setting its own priorities. We observe that the second option generally works better, because an intermediary Product Owner brings a single perspective and limited time to the complex job of designing a product.<br />
<br />
In 1988, Tom Gilb wrote the book <i>Principles of Software Engineering Management</i>, which is now in its 20th printing. One of the earliest advocates of evolutionary development, he has recently reiterated the elements of good software engineering in an article in <i>Agile Record</i>[2], from which I quote liberally:<br />
<blockquote><b>Principle 1. Control projects by quantified critical-few, results.</b><br />
1 Page total! (not stories, functions, features, use cases, objects, ..) <br />
</blockquote><blockquote><b>Principle 2. Make sure those results are business results, not technical.</b><br />
Align your project with your financial sponsor’s interests! </blockquote><blockquote><b>Principle 3. Give developers freedom, to find out how to deliver those results.</b><br />
The worst scenario I can imagine is when we allow real customers, users, and our own salespeople to dictate ‘functions and features’ to the developers, carefully disguised as ‘customer requirements’. Maybe conveyed by our Product Owners. If you go slightly below the surface, of these false ‘requirements’ (‘means’, not ‘ends’), you will immediately find that they are not really requirements. They are really bad amateur design, for the ‘real’ requirements – implied but not well defined. </blockquote><blockquote><b>Principle 4. Estimate the impacts of your designs, on your quantified goals.</b><br />
….We have to design and architect with regard to many stakeholders, many quality and performance objectives, many constraints, many conflicting priorities. We have to do so in an ongoing evolutionary sea of changes with regard to all requirements, all stakeholders, all priorities, and all potential architectures…. a designer [should be able] to estimate the many impacts of a suggested design on our requirements. </blockquote><blockquote><b>Principle 5. Select designs with the best value impacts for their costs, do them first.</b><br />
Assuming we find the assertion above, that we should estimate and measure the potential, and real, impacts of designs and architecture on our requirements, to be common sense. Then I would like to argue that our basic method of deciding ‘which designs to adopt’, should be based on which ones have the best value for money. </blockquote><blockquote><b>Principle 6. Decompose the workflow, into weekly (or 2% of budget) time boxes.</b><br />
….I would argue that we need to do more than chunk by ‘product owner prioritized requirements’. We need to chunk the value flow itself – not just by story/function/use cases. This value chunking is similar to the previous principle of prioritizing the designs of best value/cost. </blockquote><blockquote><b>Principle 7. Change designs, based on quantified value and cost experience of implementation.</b> </blockquote><blockquote><b> Principle 8. Change requirements, based on quantified value and cost experience, new inputs.</b> </blockquote><blockquote><b> Principle 9. Involve the stakeholders, every week, in setting quantified value goals.</b><br />
….In real projects, of moderate size, there are 20 to 40 interesting stakeholder roles worth considering…. But it can never be a simple matter of analyzing all stakeholders and their needs, and priorities of those needs up front. The fact of actual value delivery on a continuous basis will change needs and priorities. The external environment of stakeholders (politics, competitors, science, economics) will constantly change their priorities, and indeed even change the fact of who the stakeholders are. So we need to keep some kind of line open to the real world, on a continuous basis. We need to try to sense new prioritized requirements as they emerge, in front of earlier winners. It is not enough to think of requirements as simple functions and use cases. The most critical and pervasive requirements are overall system quality requirements, and it is the numeric levels of the ‘ilities’ that are critical to adjust, so they are in balance with all other considerations. </blockquote><blockquote><b>Principle 10. Involve the stakeholders, every week, in actually using value increments.</b><br />
….I believe that should be the aim of each increment. Not ‘delivering working code to customers’. This means you need to recognize exactly which stakeholder type is projected to receive exactly which value improvement, and plan to have them, or a useful subset of them, on hand to get the increment, and evaluate the value delivered.</blockquote>The Scrum Product Owner might be a role, but it should not be a job title. Product Owners wear many hats: Product Manager, Systems Engineer, User Interaction Designer, Software Architect, Business Analyst, Quality Assurance Expert, even Technical Writer. We would do well to use these well-known job titles, rather than invent a new, ambiguous title that tends to create a choke point and often removes from the development team its most important role – that of product design.<br />
<br />
Discovery of the right thing to build is the most important step in creating a good product. Get that wrong and you have achieved 100% waste. Delegating decisions about what to build to a single Product Owner is outsourcing the most important work of the development team to a person who is unlikely to have the skills or knowledge to make really good decisions. The flaw in many Product Owner implementations is the idea that the Product Owner prepares detailed stories for the team to implement. This does not allow team members to be partners and collaborators in designing the product.<br />
<br />
The entire team needs to be part of the design decision process. Team members should know enough about the problems and opportunities being addressed to contribute their unique perspective to the product design. Only when decisions cannot be made by the development team should they be escalated to a product leader. The main team-facing responsibility of the product leader is to ensure that the people doing the detailed design have a clear understanding of the overall product direction.<br />
<br />
The concept of a single focus of accountability is at the center of this issue. Too often, accountability is implemented as a prioritized list of product details (stories) rather than as communication of intended results (business-relevant metrics). As a result, the expertise, creative input, and passion of team members are sacrificed to the false goal of a single point of responsibility.<br />
___________________<br />
Footnotes:<br />
[1] “The Product Owner is responsible for the Product Backlog, its contents, its availability, and its prioritization.” “The Product Backlog represents everything necessary to develop and launch a successful product. It is a list of all features, functions, technologies, enhancements, and bug fixes that constitute the changes that will be made to the product for future releases.” From Scrum: Developed and sustained by Ken Schwaber and Jeff Sutherland; <a href="http://www.scrum.org/scrumguides/">http://www.scrum.org/scrumguides/</a>.<br />
<br />
[2] “Value-Driven Development Principles and Values – Agility is the Tool, not the Master.” <i>Agile Record</i>, Issue 3, July 2010, pp 18-25. Available at <a href="http://www.agilerecord.com/">www.agilerecord.com</a>. Used with permission.<br />
<br />
<span style="font-size: xx-small;">Screen Beans Art, © A Bit Better Corporation</span><br />
<br />
Mary Poppendieck