Sunday, July 14, 2019

Grown-Up Lean


Lean was introduced to software a couple of decades ago. 
How are they getting along?

This working paper was submitted as a chapter in The International Handbook of Lean Organization, Cambridge University Press, Forthcoming.

The Nature of Software

“Do not go where the path may lead,
go instead where there is no path and leave a trail”
--    Ralph Waldo Emerson

It’s May 27, 1997. The Internet has been open to the public for six years. Linux is six years old. Amazon is three. Google doesn’t exist. The dotcom bubble hasn’t happened.

In Würzburg, Germany, Eric Raymond presents an essay called "The Cathedral and the Bazaar"[1] at the Linux Kongress. He describes “some surprising theories about software engineering”:
I discuss these theories in terms of two fundamentally different development styles, the "cathedral" model of most of the commercial world versus the "bazaar" model of the Linux world. I show that these models derive from opposing assumptions about the nature of the software-debugging task. I then make a sustained argument from the Linux experience for the proposition that “Given enough eyeballs, all bugs are shallow”, suggest productive analogies with other self-correcting systems of selfish agents, and conclude with some exploration of the implications of this insight for the future of software.
The implications were clear:
Perhaps in the end the open-source culture will triumph not because cooperation is morally right…. but simply because the commercial world cannot win an evolutionary arms race with open-source communities that can put orders of magnitude more skilled time into a problem.
The democratization of programming arrived with the public Internet in 1991, and within a decade it became clear that the old model for developing software was obsolete. No longer was it practical for experts to write requirements and send them to a support group where programmers wrote code, testers wrote corresponding tests, and the two groups then reconciled their two versions of the requirements; finally, after weeks, months, or even years, a big batch of new code was released to consumers (a.k.a. ‘users’). This ‘process’ never really worked, but the commercial world had not yet found a replacement.

However, the open source world figured out a better way to develop software. Eric Raymond was right – it was not about writing the code, it was about ‘the software-debugging task’. It’s easy to write bug-free code in isolation; most bugs are caused by the way one piece of correct software interacts with another piece of correct software. As a code base grows large, potential interactions grow exponentially, and it quickly becomes impossible to test every interaction, or even predict which interactions might cause defects. In the open source world, staffed completely by volunteers, there was no attempt to test for every potential problem before making a change to a live code base – no one would volunteer to do the work. On the contrary, volunteers were motivated by seeing their contribution working right away. So small changes were submitted, reviewed, and integrated into the live code base as quickly as possible. If a bug surfaced, there were plenty of eyeballs to see the problem, limit the damage, find the cause, and fix it. Plus, the offending code was probably the latest submission, so the person whose code triggered the problem was usually identified and would be embarrassed. Open source was (and is) known to be a brutal but effective training ground for software engineers.

Example: Amazon

One of the earliest commercial companies to figure out the nature of ‘software-debugging’ was Amazon. As the company outgrew its traditional cathedral-style software architecture in the early 2000s, the leadership team felt that the growing pains could be addressed with better communication between teams. But CEO Jeff Bezos disagreed. He believed that the only way to grow seriously large was to have many independent (selfish) agents making local decisions – essentially a bazaar-style organizational architecture. Bezos declared that teams should be small enough to be fed with two pizzas, and these teams should operate independently. It took some years to evolve to a software architecture that supported such teams, but eventually small, independent services owned by two-pizza teams made up the core of Amazon’s infrastructure. Customer-focused metrics were used to guide a team’s performance, and teams were expected to work both autonomously and asynchronously to improve customer outcomes. Initially this created havoc in operations, which was responsible for any problems that surfaced once code ‘went live’. But the infrastructure VP invented ways for engineering teams to self-provision hardware and self-deploy software, which made it possible for teams to retain responsibility for any problems their services encountered once they went ‘live’, not just during development.

Once software engineers realized they might be awakened in the middle of the night if their code created a problem, they became very good at keeping bugs out of their code. Three strategies emerged:
  1. Teams hardened their service interfaces, effectively isolating their service from unintended interactions with the rest of the system. These interfaces, called APIs (Application Programming Interfaces), were contracts between the service and its consumers or suppliers. No interactions or data exchanges were allowed except through APIs, which reduced the number of possible interactions to a manageable number and provided testing surfaces for every interaction. (A minimal sketch of this idea appears after this list.)
  2. If you give software engineers manual work, their first instinct is to automate it. So, when a small team which included software engineers became responsible for testing a service and its interfaces, you can bet the job was quickly automated.
  3. Teams released software early and often. They did this because they could release at any time, so why not now? After all, just as in open source, seeing the results of your work is motivating.
There you have it: ownership, isolation, automation, and fast feedback turn out to be among the best strategies we have for keeping software working correctly.
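
To make the isolation idea concrete, here is a minimal sketch – in Python, with entirely hypothetical names, nothing like Amazon’s actual services – of a service whose internal state can only be reached through a narrow API, along with a contract-style test exercised at that boundary:

    # Sketch of a service hidden behind a narrow API (hypothetical names).
    # Consumers interact only through the public methods; internal state is
    # never shared directly, so unintended interactions are cut off.

    class InventoryService:
        def __init__(self):
            self._stock = {}                      # internal detail, not exposed

        # --- the API: the only supported ways in or out ---
        def add_stock(self, sku: str, quantity: int) -> None:
            if quantity <= 0:
                raise ValueError("quantity must be positive")
            self._stock[sku] = self._stock.get(sku, 0) + quantity

        def reserve(self, sku: str, quantity: int) -> bool:
            """Reserve items if available; return True on success."""
            available = self._stock.get(sku, 0)
            if quantity <= available:
                self._stock[sku] = available - quantity
                return True
            return False

    # A contract-style test at the boundary – the 'testing surface'.
    # (Run with a test runner such as pytest.)
    def test_reserve_never_oversells():
        service = InventoryService()
        service.add_stock("widget", 3)
        assert service.reserve("widget", 2) is True
        assert service.reserve("widget", 2) is False  # only one left

The particulars don’t matter; the point is that every interaction crosses a small, explicit surface that can be tested.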

Once Amazon figured out how to make this all work (which took years), it leveraged the knowledge by selling its internal services under the brand AWS (Amazon Web Services). In 2018 AWS was a $25 billion / year business, growing at a very fast clip. Much of this growth comes from large enterprises that discover they cannot win an arms race against an architecture and strategy that manages complex systems orders of magnitude more efficiently than their own.

Example: Google

Another company that learned the nature of ‘software debugging’ early in its life was Google. From the beginning, Google hired ‘software engineers’, because they were looking for people who could figure out how to “organize the world’s information and make it universally accessible and useful”[2] and solve the technical problems that came with such an aggressive mission. The earliest technical problems centered on how to store all that data, and then how to search it. Fast.

In 1988, Berkeley scientists David A. Patterson, Garth Gibson, and Randy H. Katz presented the paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)"[3] at the ACM SIGMOD Conference. They stunned the computer-savvy world by suggesting that a redundant array of inexpensive disks promised “improvements of an order of magnitude in performance, reliability, power consumption, and scalability” over single large expensive disks. (In other words, a bazaar-style hardware architecture was vastly superior to a cathedral-style architecture.) Berkeley is a close neighbor of Stanford, where Google was born. In hindsight, it is not surprising that Google started its life using a redundant array of inexpensive hardware to store the data it gathered while crawling the Internet. In 2003 and 2004 Google engineers released three groundbreaking papers: "Web Search for a Planet: The Google Cluster Architecture",[4] "The Google File System",[5] and "MapReduce: Simplified Data Processing on Large Clusters".[6] These papers explained their approach to managing a vast array of inexpensive hardware: decomposing massive amounts of data into clusters, storing it redundantly, searching it in place, and returning results almost instantly. At the heart of this approach to infrastructure are the core strategies of isolation, redundancy, fault detection, and automation. This was (and is) a truly impressive engineering accomplishment.
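
To give a rough flavor of the decompose-and-aggregate idea behind MapReduce – this is a toy, single-machine sketch in Python, not Google’s implementation, which runs across thousands of machines – consider counting words: the input is split into chunks, each chunk is processed independently (map), and the partial results are combined (reduce):

    # Toy illustration of the map/reduce idea: split the work into pieces,
    # process each piece independently, then combine the partial results.
    from collections import defaultdict

    def map_phase(chunk: str):
        """Emit (word, 1) pairs for one chunk of input."""
        return [(word.lower(), 1) for word in chunk.split()]

    def reduce_phase(pairs):
        """Combine all the counts for each word."""
        totals = defaultdict(int)
        for word, count in pairs:
            totals[word] += count
        return dict(totals)

    chunks = ["the cathedral and the bazaar", "the bazaar model"]
    all_pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    print(reduce_phase(all_pairs))
    # {'the': 3, 'cathedral': 1, 'and': 1, 'bazaar': 2, 'model': 1}

Because the map step needs no shared state, the chunks can be spread across as many cheap machines as are available – which is precisely what makes the approach scale.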

Applying its core infrastructure principles to complex software systems, Google’s approach to maintaining the quality of its rapidly growing code base used these strategies:
  1. Ownership: Software engineers were responsible for the quality of their code. Test engineers were available to help engineering teams create and use tools to help them debug their software.
  2. Isolation: Google developed an understanding of the boundaries of its systems by creating a dependency map. Testing of a section of code could then be confined to that section and its dependencies.
  3. Redundancy: Engineers created two machine-readable versions of system behavior by writing automated tests (test code should be considered a specification), and then writing the code to pass the tests. This is like double-entry bookkeeping, a practice that uses redundancy to ensure accuracy.
  4. Fault Detection: Tests are put into a test harness that is run automatically whenever code is checked into the code repository to ensure that the new code works and has not broken any tests that used to pass. In addition, real time behavior monitoring is used to detect and respond to anomalous behaviors whether due to software issues, hardware, network, load or some other issue.
  5. Feedback/Learning: From the beginning, Google adopted the open source model of releasing ‘early and often’, using the ‘Beta’ label to signal that things would change frequently. They treated consumers as co-developers, enticing them to explore the site daily to check out new features.
  6. Automation: Google developed a host of automated tools to deploy changes to a limited audience, run A/B tests, monitor the health of its systems, find sources of defects, etc.
Of course, the reality is more complicated than this simplified description, but you get the idea. See How Google Tests Software,[7] by James Whittaker, Jason Arbon, and Jeff Carollo for more information.
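
A drastically simplified sketch of the redundancy and fault-detection ideas, in Python with made-up names: the tests are the machine-readable specification, and a check-in is accepted only if the entire suite still passes, so the shared code stays releasable.

    # Toy check-in gate: a change joins the shared code only if every test passes.
    def run_suite(tests, code):
        """Return the names of the tests that fail against the given code."""
        return [name for name, test in tests.items() if not test(code)]

    def check_in(change, trunk, tests):
        candidate = {**trunk, **change}      # trial version; trunk untouched
        failures = run_suite(tests, candidate)
        if failures:
            return trunk, failures           # reject: trunk stays releasable
        return candidate, []                 # accept: change joins the trunk

    # The tests act as the second 'set of books' – a specification of behavior.
    tests = {"tax is ten percent": lambda code: code["tax"](100) == 10}
    trunk = {"tax": lambda amount: amount * 0.10}

    trunk, failures = check_in({"tax": lambda amount: amount * 0.25}, trunk, tests)
    print(failures)   # ['tax is ten percent'] – the bad change never lands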

The Lean Approach to Software

It’s hard to count the many times that someone told me “software development is not like manufacturing.”  I agree; I’ve developed software and I’ve worked in manufacturing and I assure you, they are very different. Attempts to apply lean production tools and practices to a development process have a dismal track record. Copying practices from one context to another is always problematic, but using operational practices in a development environment is particularly awkward and not recommended.

Software development does have one thing in common with manufacturing; they are both seriously complex systems. (Anyone who thinks manufacturing is 'simple' has never been there.) One reason lean works in manufacturing is because it is an effective way to manage complexity. If you go far enough up the chain to lean’s first principles, you will find that they apply to software complexity also, but they don’t apply in the same way.

I have observed that lean organizations consistently exhibit certain characteristics, which I would consider first principles:
  1. Customer Focus 
  2. Rapid Flow
  3. Systematic Learning
  4. Built-in Quality
  5. Respect for People
  6. Long-Term / Whole System Perspective
If you match these principles to the software engineering approaches of Amazon and Google, you will find that they are quite ‘lean’, even if the companies do not use that term. They start with customers. They release early and often, resulting in rapid feedback. They combine this feedback with data-driven approaches to adapt their offerings. They leverage redundancy and automation to make sure their code – and data centers – remain stable, secure, and resilient. They have a culture of respect for engineers, and of long-term thinking.

Let’s look at how these principles might be applied differently in the same company. If you think of lean as a learning system, then the principle of systematic learning is a good place to start. At AWS (Amazon Web Services), the most important thing to learn is WHAT to build. They search for answers to questions such as: What matters to customers? What causes friction in the customer experience? What can we do to make customers feel awesome? What current – and future – technologies can we use to lower their costs? Based on the answers to these questions, Amazon introduced a service called Lambda in 2014 that responds to events quickly and inexpensively. Lambda replaced the need for customers to pay for servers sitting around listening for events to occur – reducing the cost (and Amazon’s revenue) for event-driven systems by a factor of 5 to 10 (!). Customers said WOW! and a whole new category of cloud services was born.

At an Amazon fulfillment center, systematic learning focuses on HOW to improve the process of packing, shipping, and handling returns. The questions to ask might be: How long does it take from order to shipping, and can we shorten this time? How can we package items faster, cheaper, with fewer materials and less stress on people? How accurately can we predict delivery dates, and how well do we keep our delivery promises? How can we improve our delivery predictability? Can we make delivery easier for shippers? Is there a better way to handle returns that would reduce friction for customers and sellers?

Built-in quality at AWS is very different from built-in quality at a fulfillment center because the underlying causes of error are not the same; they are not even similar. A Poka-yoke (mistake-proofing) system in a warehouse might be a scale that weighs each package and checks that it matches the weight of what is supposed to be shipped. A Poka-yoke system at AWS might be the use of Specification by Example to create a way to automatically validate the software’s behavior.
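
As a rough sketch of what Specification by Example can look like in software – a hypothetical shipping rule, written as a parametrized test in Python – concrete examples agreed on with the business become an executable check that runs automatically on every change:

    # Hypothetical rule: orders of 50.00 or more ship free, otherwise 4.99.
    # The examples below are the specification; they are re-checked on every change.
    import pytest

    def shipping_cost(order_total: float) -> float:
        return 0.0 if order_total >= 50.0 else 4.99

    @pytest.mark.parametrize("order_total, expected_cost", [
        (49.99, 4.99),   # just below the threshold still pays shipping
        (50.00, 0.00),   # exactly at the threshold ships free
        (120.00, 0.00),  # well above the threshold ships free
    ])
    def test_shipping_cost_examples(order_total, expected_cost):
        assert shipping_cost(order_total) == expected_cost

If the code ever drifts away from the agreed examples, the mistake is caught immediately – the software equivalent of the scale that refuses to pass a mis-packed box.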

When are ‘requirements’ not required?

One basic principle of lean is that learning through systematic problem-solving is everyone’s job, all the time. Managers are mentors who help people and teams learn how to learn.[8] If you look at the way software used to be developed – somebody came up with a list of ‘requirements’ which were ‘implemented’ by programmers – it’s easy to see that this was not a lean process because the requirements, or ‘scope’, were fixed at the outset; learning was not allowed.

Early attempts to apply lean tools in software development processes often used the mantra ‘Get it Right the First Time’ to insist on a complete, accurate, unchangeable description of ‘Right’ before starting a project. This approach failed to ask the basic question: are we making the right thing? It ignored the fact that for most software projects, ‘requirements’ represented little more than a guess at what needed to be done to achieve the purpose of the project.[9] After all, those ‘requirements’ were often written by someone who had little technical background, limited understanding of the problem domain, and no responsibility for achieving the purpose of the project. In addition, this ‘scope’ was fixed at a time when the least possible information was available. Since no learning was expected, software engineers were required to do a lot of work they suspected was unnecessary, while being asked to make trade-offs that prioritized short term feature delivery over clean, high quality, robust code. This lack of respect for the time and expertise of software engineers discouraged engagement and made retention difficult.

A production view of software development is fundamentally flawed. When you apply lean to a development process, you are looking for ways to learn as much as possible about the customer problem and potential technical solutions, so you finalize product content as late as possible. For software-intensive products and services, the modern approach is to continuously deliver small changes in capability in order to set up short, direct feedback loops between the engineering team and its customers. As a bonus, this is an excellent technique for managing complexity and assuring the quality, resilience, and adaptability of a product over time.

The Roots of Lean Product Development

During the 1980s, when it became apparent that Japanese automotive companies were making higher quality, lower cost cars than US automotive companies, Boston rivals MIT and Harvard Business School started programs to investigate the situation. MIT established the International Motor Vehicle Program, which produced the 1990 best-seller The Machine that Changed the World: The Story of Lean Production[10] by James P. Womack, Daniel T. Jones, and Daniel Roos. Womack and Jones went on to establish lean training and consulting organizations in the US and Europe.

Across the Charles River, Harvard Business School was also looking into the automotive industry, and in 1991 it published Product Development Performance: Strategy, Organization, and Management in the World Auto Industry by Kim B. Clark and Takahiro Fujimoto. This book did not become a best-seller, but it did provide a summary of how lean principles work differently in automotive product development.[11] For example, the book equates short production throughput time to short development lead time. Work-in-process inventory in production is comparable to information inventory between development steps. While pull systems in production are triggered by downstream demand, in development they are triggered by downstream market introduction dates. Flexibility to changes in volume and product mix in production corresponds to flexibility to changes in design, schedule, and cost targets in development. Continuous improvement in production equates to frequent, incremental innovations in development.

The most important findings in the book were:
  1. The development processes of high performing companies focused on fast time-to-market, excellent product quality, and high engineering productivity (i.e. the number of hours and level of resources needed to move a vehicle from concept to market.)
  2. High performing development programs were led by a strong product manager who started as the product concept champion and then led the development effort (as ‘chief engineer’), continually reinforcing the concept vision with the engineering teams as they designed the vehicle.
  3. High performing development processes were organized by forming integrated product teams – relatively small cross-functional teams with members from product planning, product engineering, and process engineering. These teams engaged in continual problem-solving cycles focused on specific vehicle capabilities. They enabled a high degree of information exchange between upstream and downstream processes throughout the development cycle, which contributed to shorter lead times, higher product quality, and greater engineering productivity – in short, higher development performance.
So, there you have it, a good summary of three important characteristics of ‘lean’ product development, written about 30 years ago, before ‘lean’ became a commonly used term. It turns out that the third characteristic, ‘integrated product teams’, is especially important. Today, most software-intensive products and services are developed by such teams, although they probably have a different name: cross-functional teams or multi-discipline teams or full stack teams. These teams create a rapid flow of high-quality prototypes or deliverables which generate feedback to improve the design – an approach that has proven to be far more productive than sequential development.

The second characteristic has also proven to be important. Most modern software-intensive products and services – including open source projects, startup company products, AWS services, and SpaceX rockets – have a strong (entrepreneurial) leader who champions the product vision.

Example: SpaceX

On September 14, 2017, SpaceX posted a video on YouTube called "How Not to Land an Orbital Rocket Booster";[12] it shows crash after crash during attempted landings of rocket boosters. As you might guess, the video ends with success – the first successful landing, and later the first successful landing on a drone ship. But think about it: Why would a company showcase so many failures?

SpaceX was founded in 2002 with the goal of making access to space affordable by designing and launching low cost orbital rockets. The company has kept engineering cost low through a rapid design process that values learning by doing. As SpaceX Launch Director John Muratore explained:[13] “Because we can design-build-test at low cost, we can afford to learn through experience rather than consuming schedule attempting to anticipate all possible system interactions.”

SpaceX has also kept launch costs low through a program of recovery and reuse of rocket boosters and other parts. The first thing the company had to learn was how to land rocket boosters under their own power so that they could be reused. It took a lot of trial and error before the first booster landed successfully, but learning through experience (including crash landings) was much faster and far less expensive than the anticipate-everything-in-advance approach. So, SpaceX is rightfully proud of its engineering approach: it works. It’s faster, better, and cheaper to learn by doing instead of learning before engineering starts. Learning is what engineering is all about.

Muratore says “SpaceX operates on the Philosophy of Responsibility. No engineering process in existence can replace this for getting things done right, efficiently.”[14] What is the Philosophy of Responsibility? It means that engineers are responsible for the design and engineering of a component, and for ensuring that their component operates properly and does its job as part of the overall system. So, let’s say a rocket booster crashes into the ocean rather than landing on a drone ship. Engineers know that they have 24 hours to report on what caused the failure and outline a plan to keep that thing from ever happening again. Thus, every launch is heavily instrumented with video and data-transmitting devices; these are not for advertising, they exist to provide detailed feedback to the engineering team so they can improve the design.

When SpaceX was learning how to land rocket boosters, it scheduled a test launch every couple of months. Each responsible engineer knew that the launch would happen, and their part had better be ready. The launch date Would. Not. Move. The next launch date effectively pulled the work of the integrated product teams, each focused on getting its component ready in time.

Through the principle of responsibility and the practice of frequent integration tests, SpaceX has developed launch capability at a cost that is an order of magnitude lower than that of the companies that developed rockets under government contract. In addition, SpaceX’s cost to launch a kilo of payload is about an order of magnitude lower than the current cost for other large rockets. It should be no surprise that SpaceX’s low engineering and launch costs are threatening the existence of its competitors.

SpaceX is a good example of the essence of lean product development – small, responsible teams learn through a series of rapid experiments. Perfect launches are not the goal – at least not at first. Crashes are to be expected – but make sure the damage is limited and be prepared to determine the cause and never let it happen again. The goal is not perfect launches, it’s learning. As any good musician knows, practice time is the time to push the limits and make mistakes. If you never make any mistakes, you never learn.

Making the Shift to Digital

If your organization was not born digital, it may be considering a shift toward digital in order to leverage technologies such as artificial intelligence, augmented reality, ubiquitous Internet, and more. If digital startups are entering your market or competitors are making the shift to digital, you may have no choice; the ability to compete in a digital world is becoming necessary for survival. If this sounds familiar, check out Mark Schwartz’s book War and Peace and IT,[15] which summarizes the mindset shift necessary for companies to make the shift to digital. He discusses the ‘contractor model’ of IT – where the IT department is viewed as a contractor receiving specifications from ‘the business’ – and shows why this arm’s length relationship is obsolete in the digital age. In chapter 3 (Agility and Leanness) he introduces DevOps, a set of technical practices based on cross-functional teams and heavy automation that effectively does away with the tradeoff between speed and control – you can have both.

In the Harvard Business Review article “Building the AI-Powered Organization,”[16] Tim Fountaine, Brian McCarthy, and Tamim Saleh contend that a successful move to digital involves aligning a company’s culture, structure, and ways of working. Three fundamental shifts are required:
  1. From siloed work to interdisciplinary collaboration.
  2. From experience-based, leader-driven decision-making to data-driven decision-making at the front line.
  3. From rigid and risk-averse to agile, experimental, and adaptable.
Both of these references, and many more, confirm Clark and Fujimoto’s description of a high-performance development organization:
  1. Integrated product teams include product design, product engineering, and process engineering
  2. Product leaders create a product vision that enables teams to make detailed decisions
  3. The product is developed through rapid problem-solving cycles by multi-disciplinary teams
Digital natives like SpaceX, AWS, and Google have always worked this way. You might call this lean; you might call it digital, but in any case, it is the way good software engineering is done these days.

The Nature of Lean

“Friction is the concept which distinguishes real war from war on paper.” 
--  Carl von Clausewitz

The operational focus of lean is to eliminate ‘waste’ – all the extra work that does not add value. For software engineering, we prefer to use the word ‘friction’ (instead of ‘waste’) to describe the stuff that annoys people and slows down processes, but it’s essentially the same thing. We spend our time trying to reduce friction in the consumer experience, friction within our products, and friction in our processes. But if you prefer the word ‘waste’, feel free to substitute it for ‘friction’.

Last winter we had an ice storm here in Minnesota, and it was impossible to walk down our driveway to get the mail. It was impossible to drive our car into our garage. We were surrounded by a moat of glare ice until we spread sand on the driveway to add some friction. It’s easy to understand that there are times when friction is necessary; it is also easy to realize that in general, the less friction the better.

During its early years, Amazon focused intently on removing friction from the customer experience and from the experience of third-party sellers. Amazon’s most enduring innovations as a young company came from imagining ways to give customers and merchants ‘superpowers’, making their experience with Amazon as friction-free as possible.[17]

This is what good product design is all about: walk in the shoes of customers, learn to see the friction in their journey, and find ways to reduce that friction. Lean product development adds one more dimension: focus on reducing customer friction as rapidly and smoothly as possible. In order to do this, the development process needs to be low friction also. That means looking inside the development workflow to find and reduce any friction that slows things down, reduces quality, or incurs unnecessary cost. In this section we’ll look at the four biggest sources of friction when creating software-intensive products and services.

Friction #1: Inefficient Flow

 A huge source of friction for many people is rush hour traffic. They know how long it would take them to get to and from work if the roads were empty, and a commute time that’s much longer is annoying. To get a measure of how efficient a commute is, divide the ideal commute time (with empty roads) by the actual commute time. If the commute has no delays, it is 100% efficient. A 50% efficiency means the commute takes twice as long as it needs to. 10% efficiency means it takes 10 times longer to get home than it would without rush hour traffic. A look at Google Maps during the 10% efficient commute would show a lot of red. That’s friction.

How long does it take a market opportunity to commute through your development process, from concept to cash? How fast could it travel if there were no backlogs, no loopbacks, no red spots on the process map? To measure the flow efficiency of your process, divide the fastest possible commute time by the typical commute time; that tells you how much of the time you are actually working on a problem as it moves through your process.  The flow efficiency of a typical software development process is around 10%. The flow efficiency of a lean development process should be over 50%.
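
The arithmetic is the same for a feature as for a commute. A minimal sketch, with made-up numbers:

    # Flow efficiency = hands-on (value-adding) time / total elapsed time.
    def flow_efficiency(active_days: float, elapsed_days: float) -> float:
        return active_days / elapsed_days

    # A feature needing 4 days of actual work that spends 40 days in the
    # process (queues, approvals, handoffs) has 10% flow efficiency.
    print(f"{flow_efficiency(4, 40):.0%}")    # 10%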

Most companies measure how efficiently they use their resources rather than how efficiently they chase a market opportunity. But that’s like a city measuring the efficiency of its road system by counting how many cars it can fit on its roads rather than how fast the traffic moves. What’s more important – full roads or faster commutes? What’s more important – busy engineers or the capacity to rapidly seize a market opportunity?

In the book This is Lean,[18] Niklas Modig and Pär Åhlström make the case that the essence of Lean is a bias for flow efficiency over resource efficiency. When a company competes in the digital world this makes a lot of sense, because technology changes so fast and opportunities are so fleeting that time to market is critical. But a couple of decades ago, the typical software development process emphasized resource efficiency (keep people and equipment fully utilized to minimize cost) because time-to-market did not seem particularly important.

Then in the early 2000’s, agile and lean ideas began making inroads into the way software was designed, created, and maintained. Extreme Programming[19] contained the roots of technical disciplines such as continuous integration and automated testing. Scrum[20] emphasized iterations. Kanban[21] improved flow management by limiting work-in-process. Twenty years is a long time in a rapidly moving field such as software, and in those two decades Extreme Programming has faded from sight even as its practices became widely accepted and expanded. Kanban charts continue to be used by many teams to visualize and manage their workflow. But Scrum has failed to evolve fast enough. Shortcomings such as unlimited backlogs, relatively long iterations, product owners as proxies, and silence on technical disciplines earn Scrum a ‘not recommended’ rating for lean practitioners today.

Reduce Friction: Continuous Delivery / DevOps

Dramatic advances in software engineering workflow can be traced to the 2010 book Continuous Delivery[22] by Jez Humble and David Farley. This is arguably one of the most influential books in changing the workflow paradigm from a focus on resource efficiency to a focus on flow efficiency. It laid out in detail the technologies and processes that would enable large enterprises to safely change their production code very frequently, even continuously.

At the time (2010), Amazon was deploying changes to production an average of every 11 seconds. Google was deploying changes multiple times a day. Most digital startups were using similar rapid processes and not experiencing much difficulty. The cloud was gaining traction. And yet, most enterprises thought rapid delivery was an anomaly – certainly serious enterprises that valued stability would not engage in such dangerous practices. But Continuous Delivery debunked the myth that speed was the same as sloppy and introduced the concept that high speed and high discipline went hand in hand.

The basic idea of continuous delivery is to create a workflow that has no interruptions from the time a development team chooses to work on a feature until it is ready to be ‘deployed’ (‘go live’); in many cases it actually goes live immediately and automatically. Of course, this means the operations people who used to receive software releases ‘over the wall’ must be continually involved – they must be part of the team. The combined team is called a DevOps team, and the term DevOps has become almost synonymous with continuous delivery.[23]

A second important practice of continuous delivery is this: batches of code are no longer accumulated (as they used to be) on branches that must be merged later, because merging batches of code invariably exposes interaction problems that must be found and fixed. Collecting a batch of software prior to testing – even a two-week batch – makes finding the cause of defects much too hard.

Instead, the practice of trunk-based development[24] has replaced branches. All code resides on the main branch (the trunk), where continuous integration with the entire code base is possible. Code under development is checked into a repository very frequently, triggering an automated test harness to run; if the tests don’t pass, the new code is rolled back or reverted, leaving the trunk in an error-free state. If the tests pass, the new code moves down a continuous integration / continuous deployment (CI/CD) pipeline which applies increasing layers of integration and more sophisticated automated tests (for example, security tests).

The objective is to be sure that the trunk is always ready for deployment, and if it is not, a virtual ‘Andon cord’ is pulled and the highest priority of the team is to return it to a deployable state. Depending on the context, the code may be deployed as soon as it reaches the end of the pipeline (typical of an online environment), or deployment may be delayed for domain reasons. A compromise practice is to deploy code as soon as it reaches the end of the pipeline, but with new features turned off. This provides a final robustness test, and at a convenient later time individual features can be turned on (and off) with a software switch. As an added benefit, the switches enable A/B testing as well as targeted ‘canary’ releases that limit the impact of any problems and allow rapid rollback if necessary.
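
A feature switch can be as simple as a configuration value consulted at run time. Here is a minimal sketch in Python with hypothetical flag names; a real system would also handle flag storage, audience targeting, and the eventual removal of stale flags:

    # Code is deployed with the new path dark; features are later turned on
    # (or off) without another deployment.
    feature_flags = {
        "new_checkout_flow": False,      # deployed, not yet visible to users
        "recommendations_canary": True,  # enabled for a small slice of traffic
    }

    def legacy_checkout(cart):
        return f"legacy flow: {len(cart)} items"

    def new_checkout(cart):
        return f"new flow: {len(cart)} items"

    def checkout(cart):
        if feature_flags.get("new_checkout_flow", False):
            return new_checkout(cart)    # the freshly deployed path
        return legacy_checkout(cart)     # the proven path, still the default

    print(checkout(["book", "lamp"]))    # legacy flow until the flag is flipped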

Developing a robust CI/CD pipeline and managing controlled roll-outs is challenging, high discipline work. It involves a lot of automation and is usually accompanied by a change in system architecture, organizational structure, and incentives (more on that later). It is the technical enabler of lean in software engineering, and today, a decade after the Continuous Delivery book was published, it is the way modern software is built.

Reduce Friction: Limit Work to Capacity

Integrating and releasing software in big batches is not the only practice that slows down software engineering workflow. Another significant source of friction is the failure to limit work to capacity. Just about every software engineering organization on the planet has more requests for work than it can accommodate, so this is a universal problem.

If an organization wants to limit work to capacity, the first question it should ask is: On average, how many ‘things’ get released to production in a unit of time (quarter, month, week, day)? Most software engineering organizations can answer this question or easily find the data. And most organizations find that for small or medium sized efforts, their output rate is more or less the same over time. However, even when they know their output rate, many organizations fail to limit the amount of work they accept to the amount of work they finish. Instead they accept work and then put it in a ‘backlog’ that is subject to endless prioritization. It’s clear that they can’t do all the work; they just don’t want to say “no”.

See Friction: Backlogs

The obvious way to limit work to capacity is to use a pull system that accepts work items at the same rate as they are completed. For example, if an average of two items are deployed every day, then no more than two items per day should be accepted. Work items should not be put on a backlog for later prioritization, they should be accepted or rejected as quickly as practical. Teams do not need a long list of work to be done, and requestors do not need to be left wondering whether (and when) their problem will be resolved.  A small, limited buffer may be necessary to absorb variation in input flow, and some capacity may be held open for urgent work, but that small amount of friction is like putting sand on ice.

Backlogs, on the other hand, tend to generate a huge amount of friction at many points in the development process, unnecessary friction that dramatically slows down every item that goes through the process. A better approach is to respond to work requests immediately with one of two responses: “Yes, we can do what you requested, and you can expect delivery by [insert a valid promise date].” or “Unfortunately, we do not have the capacity to do what you are requesting.” That’s it. Learn to say “no.” Immediately. Customers appreciate it. Teams love it. Everything gets done a lot faster. It works.
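
A toy sketch of such an intake rule, in Python with made-up numbers: accept a request only if there is a free slot, quote a date based on the measured completion rate, and otherwise decline immediately.

    # Toy intake policy: answer every request at once, accept only up to capacity.
    from datetime import date, timedelta
    from math import ceil

    class Intake:
        def __init__(self, items_per_day: float, slots: int):
            self.items_per_day = items_per_day   # measured completion rate
            self.slots = slots                   # small buffer, no endless backlog
            self.in_progress = 0

        def request(self, today: date) -> str:
            if self.in_progress >= self.slots:
                return "Unfortunately, we do not have the capacity to do this."
            self.in_progress += 1
            days = ceil(self.in_progress / self.items_per_day)
            promise = today + timedelta(days=days)
            return f"Yes – expect delivery by {promise.isoformat()}."

    intake = Intake(items_per_day=2, slots=4)
    for _ in range(5):
        print(intake.request(date(2019, 7, 15)))   # the fifth request is declined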

Friction #2: Dependencies

Arguably the biggest friction-generator in software systems is dependencies – one part of the code depends on another part of the code, or most likely many other parts of the code. These dependencies create a complex web of interactions that quickly become impossible to trace. Bugs show up as insidious unintended consequences of these interactions after the software is deployed. Experience has taught us that finding all the unintended consequences is impossible, no matter how much testing is done. It is well understood that software systems are inherently complex,[25] and that tightly coupled complex systems will eventually fail.[26]

The key to solving this problem lies in the words ‘tightly coupled.’ Loosely coupled systems can be very robust. Consider the Internet. No one worries about my web site accidentally changing the balance of my bank account, even though I can display them both on my computer at the same time. Or consider a smart phone. My weather app cannot accidentally add things to my shopping list.

The key is to eliminate dependencies rather than cater to them. For decades, enterprises have attempted to coordinate their disparate applications through an enterprise database, but that database became a dependency generator. Changes to the data format in one application – even small changes – meant changing every application that used the same data. Then each of these applications had to be tested both separately and together in a newly built system. If an error was found the process had to be repeated, often many times. Since we’re talking about slow and expensive manual testing, this could go on for a long time.

Reduce Friction: Federated Architecture 

Theoretically, we know that a federated architecture will address this problem, but practically speaking, enterprises did not do an effective job of adopting federated architectures until around the year 2000. That’s when newly minted internet companies tried to grow systems many times larger than any enterprise could manage. Without a new paradigm for system architecture, scaling was extraordinarily difficult, so many failed. It wasn’t until Google began publishing papers on scalable infrastructure and AWS started selling it that practical ways to break the crippling dependencies of our enterprise systems began to emerge.

The strategy for breaking dependencies, in a nutshell, is to take the bazaar approach to both system architecture and development teams. Small, independent teams own a small service – called a microservice these days. Services communicate with other services through hard boundaries, with API (Application Programming Interface) contracts prescribing their interaction. Integration testing is done at service boundaries, dramatically limiting the number of interactions that need testing, while simultaneously clarifying responsibility for correct performance.

Do not think of a microservice architecture as a flat layer of tiny services. Consolidator services aggregate smaller services into higher-level services, usually resulting in layers of consolidated services. Related services that are likely to change together are often grouped together (this is called a ‘bounded context’). All teams are expected to understand their role in the larger context and work toward shared goals. Very rapid releases through a CI/CD pipeline provide continuous feedback to the teams on their progress, pulling any needed adjustments from each team.

Reduce Friction: Sync and Stabilize

Hardware systems that rely heavily on software also use bounded contexts, but they are usually defined by the hardware components. Consider SpaceX’s Falcon Heavy sitting on the launch pad about to send satellites into space. At the top is the payload, next is the second stage, and at the bottom are three first stage rocket boosters. Each first stage booster has nine Merlin engines, a fuel storage/dispensing system, and four landing legs. A landing leg is a component; it has a team made up of both hardware and software engineers, led by a responsible engineer. Their job is to make sure the landing legs work properly and do their job as part of the overall system. Let’s say they are working on an improved design that will hook the rocket into place after landing on a drone ship. They know when the next launch is scheduled, and they know the date will not move. This launch date ‘pulls’ their work as well as their coordination with the drone ship landing pad team. They perform static tests of the hooking system, which go well, but the true test happens when the rocket attempts to land on the drone ship after the launch. Each launch is a test to synchronize all the components and stabilize their combined performance.

This ‘Sync and Stabilize’[27] approach has long been used in the development of software-intensive hardware systems, and it works. Here is an email I received recently that shows its benefits:[28]
You may recall that you spent a day with us last June. In one of the sessions I described to you a big challenge we had: which was to achieve success in a critical warehouse project involving dozens of software development teams (30+).  The project had extremely tight timescales, complex integration requirements, additional software to be developed and a large number of unknowns.  The good news was that a test facility was being prepared for us to use, including the relevant automation /robotics.  But how should we use it? 
In the session you recommended we use a 'Synch and Stabilize' demo approach, which you described to those assembled. This email is to let you know that we did indeed do what you suggested - and it has proved revolutionary for us.  Pretty much the next day we started organizing our first planning session, which involved 50-60 teams.  Our first demo was in September.  We have now run 6 demos and we have been making excellent progress.  There is no going back!  As you would have predicted, the approach has yielded many benefits e.g. alignment, communication, sense of commitment, teams helping other teams plus teams inspiring other teams etc.

Friction #3: Cost Centers

In the 1960’s, IT was largely an in-house back-office function focused on process automation and cost reduction. Today, digital technology plays a significant strategic and revenue role in most companies and is deeply integrated with business functions. Digital natives (companies born in the last two decades) typically do not have IT departments, but in industries that were born before the Internet, IT departments are still commonly found. And where they exist, IT departments are usually cost centers; that is, performance is measured by cost containment and/or reduction. Since a key cost driver of IT departments is salaries, good performance usually means doing more work with fewer (or less expensive) people. Thus, IT incentives have historically been stacked in favor of resource efficiency: keep everyone fully utilized and outsource work to lower salaried regions. The fact that these are two of the best ways to decrease flow efficiency carried little weight.

Back in the mid 1980’s, before ‘lean’ came into our lexicon, Just-in-Time (JIT) was gaining traction in manufacturing companies. JIT always drove inventories down sharply, giving companies a much faster response time when demand changed. However, accounting systems count inventory as an asset, so any significant reduction in inventory had a negative impact on the balance sheet. Balance sheet metrics made their way into senior management metrics, so successful JIT efforts tended to make senior managers look bad. Often senior management metrics made their way down into the metrics of manufacturing organizations, and when they did, efforts to reduce inventory were half-hearted at best. A generation of accountants had to retire before serious inventory reduction was widely accepted as a good thing.[29]

Returning to the present, being a cost center means that IT performance is judged – from an accounting perspective – solely on cost management. Frequently these accounting metrics make their way into the performance metrics of senior managers, while contributions to business performance tend to be deemphasized or absent. As the metrics of senior managers make their way down through the organization, a culture of cost control develops, with scant attention paid to improving overall business performance. Help in delivering business results is appreciated, of course, but is rarely rewarded, and rarer still is the cost center that voluntarily accepts responsibility for business results.

In addition, cost center projects are normally capitalized until they are “done” (they reach “final operating capability”) and are turned over to production and maintenance.[30] But when an organization adopts modern software practices such as continuous delivery (or DevOps), the concept of final operating capability – not to mention maintenance – disappears. This creates a big dilemma because it's no longer clear when, or even if, software development should be capitalized. Moving expenditures from capitalized to expensed not only changes whose budget the money comes from; it can have tax consequences as well. And what happens when all that capitalized software (which, by the way, is an asset) vanishes? Just as in the days when JIT was young, continuous delivery has introduced a paradigm shift that messes up the balance sheet.

But the balance sheet problem is not the only issue; depreciation of capitalized software can wreak havoc as well. In manufacturing, the depreciation of a piece of process equipment is charged against the unit cost of products made on that equipment. The more products that are made on the equipment, the less cost each product has to bear. So, there is strong incentive to keep machines running, flooding the plant with inventory that is not currently needed. In a similar manner, the depreciation of software makes it almost impossible to ignore its sunk cost, which often drives sub-optimal usage, maintenance and replacement decisions.

Capitalization of development creates a hidden bias toward large projects over incremental or continuous delivery, making it difficult to look favorably upon lean development practices. Hopefully we don't have to wait for another generation of accountants to retire before delivering software rapidly is considered a good thing.

See Friction: Life in a Cost Center

Cost Centers have another problem: they can be demoralizing. You aren’t on the A team that creates awesome customer journeys and brings in revenue, you’re on the B team that writes code and consumes resources. No matter how well the business performs, you’ll never get credit. Your budget is unlikely to increase when times are good, but when times are tight, it will be the first to be cut. Should you have a good idea, it had better not cost anything, because you can’t spend money to make money. If you think that a bigger monitor would make you more efficient, good luck making your case. Yet if your colleagues in trading suggest larger monitors will help them generate more revenue, the big screens will show up in a flash.[31] It’s no wonder that IT departments have found it challenging to attract and retain good people, especially in the face of a world-wide shortage of software engineers.

Reduce Friction: Cost of Delay

There are two sides to an investment – how much it costs and how much benefit (cost reduction or added revenue) it will generate. The fact that these are guesses about the future doesn’t stop them from being used to make decisions. So, we may as well assume that the cost and benefit projections used to justify an investment are correct and use them to calculate a third number: the cost of delay.[32]  How much is the cost being increased and how much of the benefit is being lost for each day of delay? What would be the cost savings and added benefits if a valuable feature were delivered early?

If accounting is going to drive decisions, then why not calculate the time value of money along with cost and benefit calculations, and use the result to invest in flow efficiency? A development team – even one in a cost center – should be able to spend the cost of a day’s delay in order to deliver the benefit a day earlier. Unfortunately, it’s a rare event when a development team gets to spend even one day’s cost of delay.
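
A minimal sketch of the arithmetic, with made-up numbers:

    # Cost of delay: the value that evaporates for each day a feature ships late.
    def cost_of_delay_per_day(annual_benefit: float) -> float:
        return annual_benefit / 365.0

    annual_benefit = 1_825_000.0    # benefit projection used to justify the work
    team_cost_per_day = 3_000.0     # fully loaded daily cost of the team

    delay_cost = cost_of_delay_per_day(annual_benefit)    # 5,000 per day
    print(f"Each day of delay costs {delay_cost:,.0f}; "
          f"spending {team_cost_per_day:,.0f} to ship one day earlier "
          f"nets {delay_cost - team_cost_per_day:,.0f}.")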

Friction #4: Proxies

We make a mistake when we put proxies between an engineering team and its customers, and yet we do it all the time. When colleagues in “The Business” request new features from IT, they do not ask for improved business outcomes, they request capabilities that may or may not produce the desirable business results. These proxies for business outcomes detach engineering team members from the purpose of their work.

Consider this: Jeff Dean, co-inventor of Google’s amazing data storage and search capabilities (mentioned earlier), left DEC Research labs in 1999 to join a startup called Google. "Ultimately, it was this frustration of being one level removed from real users using my work that led me to want to go to a startup," Dean says.[33] Good software engineers share that desire to have an immediate connection with customers, but such connections are rarely found in IT departments, or in the contracting organizations they mimic.

There are a lot of proxies in our development processes. “The Business” is a proxy. A product owner is described as a proxy in Scrum Guides. Projects typically start after the deliverables have been specified and are considered successful if cost, schedule, and scope targets are met; thus, project metrics are proxies for the outcomes envisioned by those who funded the project. Project teams are often not told about the desired project outcomes, generally have no way to influence those outcomes, and are almost never responsible for them. In most projects, team members never hear about the actual outcomes after the project is ‘delivered’.

Proxies create friction in many ways in a development process. Multiple handovers slow things down and lose a lot of information. The engineering team lacks firsthand experience with the problem to be solved, so amateurs end up designing technical solutions to technical problems. Their guesses at solutions are turned into requirements with no attempt to validate them. Feedback loops – should they exist at all – are far too long.

And that is perhaps the worst thing about proxies. We know that proxy metrics drive teams to excel at what is measured – feature delivery, for example – rather than what is desired – business outcomes. We also know that proxy metrics are almost never validated against the desired business outcomes, and that the majority of the features and functions in a bespoke software system are neither needed nor likely to be used.[34] We know that building the wrong thing is the biggest waste in software engineering, and that the best way to build the right thing is to validate the business impact of each feature as we deploy it. And we have all the tools in our toolbox to create the rapid feedback loops necessary for such validation. So why would anyone use proxy metrics rather than business outcomes to measure development performance?

Reduce Friction: Full Stack Teams

In 2002, John Rossman was hired by Amazon to lead the launch of third-party services. He began by using Amazon’s standard approach – write a press release set in the future which describes the experience of future customers. His press release read: “A seller, in the middle of the night without talking to anyone, can register, list an item, fulfill an order, and delight a customer as though Amazon the retailer had done it.”[35]  That’s it. No requirements or other proxies, just a powerful statement that succinctly defined the responsibility, constraints, and expected outcomes of the team that would work on this service. It also defined the composition of the team that would bring the service to life: everyone necessary to start up a new line of business.

The most successful technology companies today establish relatively small, full stack teams and challenge them with interesting problems. These teams have:
  1. A clear description of the team’s mission (responsibility), constraints, and expected outcomes.
  2. A leader (responsible engineer, product manager) who guides the team toward good decisions.
  3. An immediate connection with their consumers, minimum dependencies on other teams, freedom to act autonomously and asynchronously within constraints, and full responsibility for outcomes.
Full stack teams develop a product, component, or service, while maintaining a clear understanding of their role and responsibility within the larger system. They are supported by experts or leaders in competency areas that are particularly important in their industry or market.

Case Study: ING Netherlands

In 2015 the employees at ING Netherlands headquarters – over 3,000 people from marketing, product management, channel management, and IT development – were told that their jobs had disappeared.  Their old departments would no longer exist; small squads would replace them, each with end-to-end responsibility for making an impact on a focused area of the business. Existing employees would fill the new jobs, but they needed to apply for the positions.[36]

It was a bold move for the Netherlands bank. The leaders were giving up their traditional hierarchy, detailed planning and “input steering” (giving directions). Instead they would trust empowered teams, informal networks, and “output steering” (responding to feedback) to move the bank forward. The bank was not in trouble; it did not really need to go through such a dramatic change. What prompted this bet-your-company experiment?

The change had been years in the making. After initial experiments in 2010, the IT organization put aside waterfall development in favor of agile teams. As successful as this change was, it did not make much difference to the bank, so Continuous Delivery and DevOps teams were added to increase feedback and stability. But still, there was not enough impact on business results. Although there were ample opportunities for business involvement on the agile teams and input into development priorities, the businesses were not organized to take full advantage of the agile IT organization. Eventually, according to Ron van Kemenade (CIO of ING Netherlands from 2010 until he became CIO of ING Bank in 2013):[37]
The business took it upon itself to reorganize in ways that broke down silos and fostered the necessary end-to-end ownership and accountability. Making this transition … proved highly challenging for our business colleagues, especially culturally. But I tip my hat to them. They had the guts to do it.
The leadership team at ING Netherlands had examined its business model and come to an interesting conclusion: their bank was no longer a financial services company; it was a technology company in the financial services business. The days of segmenting customers by channel were over. The days of push marketing were over. Thinking forward, they understood that winning companies would use technology to provide simple, attractive customer journeys across multiple channels. This was true for companies in the media business, the search business, most retail businesses, and it was certainly true for companies in the financial services business. Moreover, expectations for engaging customer interactions were not being set by banks – they were being set by media and search and retail companies. Banks had to meet these expectations just to stay in the online game.

ING Netherlands’ leadership team decided to look to other technology companies, rather than banks, for inspiration. For example, on a trip to the Google I/O developer conference Ron van Kemenade was impressed by the amazing number of enthusiastic, engaged engineers at Google. He realized that such enthusiasm could not surface in his company, because the culture did not value good engineering.

The leaders at ING Netherlands decided to investigate how top technology companies attract talented people and come up with engaging products. Through concentrated visits to some of the most attractive technology companies, they saw a common theme – these companies did not have traditional enterprise IT departments even though they were much bigger than any bank. Nor did they have much of a hierarchical structure. Instead, they were organized in teams – or squads – that had a common purpose, worked closely with customers, and decided for themselves how they would accomplish their purpose.

ING Netherlands decided that if it was going to be a successful technology company and attract talented engineers, it had to be organized like a technology company. Studying the best technology companies convinced them that they needed to change – and the change had to include the whole company, not just IT. The bank had already modularized its architecture, streamlined and automated provisioning and deployment, moved to frequent deployments, and formed agile teams. But this was done within the IT department rather than across the organization, and the results were not exceptional. Now it was time to create a digital company across all functions.

They chose to adopt an organizational structure in which small teams – ING calls them squads – accept end-to-end responsibility for a consumer-focused mission. Squads are expected to make their own decisions based on a shared purpose, the insight of their members, and rapid feedback from their work. Squads are grouped into tribes of perhaps 150 people that share a value stream (e.g. mortgages), and within each tribe, chapter leads provide competency leadership. Along with the new organizational structure, ING’s leadership team worked to create a culture that values technical excellence, experimentation, and customer-centricity.

We visited ING Netherlands in 2017. We found a small group of very experienced lean ‘sensei’ who had worked at the bank for many years. They showed us a strategy deployment room that had the bank’s long-term strategy at the top, with about a dozen focus areas just below. Each focus area was connected to one or more strategic initiatives, and each strategic initiative had a few problems below it, typically with an A3 document attached. Teams were given these problems to solve, and they could walk into the room at any time to see their problem’s context in the company’s overall strategy.

We talked to members of several teams and found them uniformly delighted at their new way of working. We heard repeatedly: “You should have seen us a year ago – it’s so much better now!” One team showed us how they decided what to work on. They had a list of customer frustrations gleaned from various sources, including artificial intelligence tools scanning social media. When they were ready to attack a new problem, they picked the top frustration on the list. Their leader, a designer, helped the team design candidates for a better experience, test them, and implement the best ones.

Case Study: Spark New Zealand

About the same time, managers from Spark New Zealand also visited ING. Spark’s managing director, Simon Moutter, says that ING’s success helped guide his company through a similar change:[38]
I was impressed when we visited ING; I thought ING’s model was structured, performance driven, and very applicable in our context – “agile for grown-ups,” if you like. It was less about beanbags and foosball tables and more about real delivery action, and that gave me confidence that there was an outcome that – if we could deliver it – would make a big and enduring difference.
Spark New Zealand subsequently made the switch to integrated teams with impressive results:
Spark is seen as a positive company, an innovative company, …. Over the past two years or so, we’ve been winning a range of business awards, a number of which we weren’t even getting nominated for before. We also have a degree of execution excellence now that has been noticed by investors. We say it, we do it. 
The success is showing up in the “hard” numbers; our mobile market share is up eight percentage points, to 40 percent, since 2013—a huge turnaround. 

20/20 Vision

“The future is here – it’s just not evenly distributed.” 
--  William Gibson

You do not have to look far to see lean principles being applied – or being pursued – in the design and engineering of software-intensive products. The term ‘lean’ may not be used, but the shift to digital has become a strategic necessity for a large number of companies. If you look closely at successful digital companies, they look rather ‘lean’. They obsess over customers. They create an engaging engineering culture.  Full stack teams deliver early and often and learn from feedback. Infrastructure and products are stable, secure, and resilient. This is lean. This is the future. It’s just not (yet) evenly distributed.

__________

Footnotes

[1] The text of Eric Raymond’s presentation can be found here: https://firstmonday.org/article/view/578/499. A later version of The Cathedral and the Bazaar was published by O’Reilly Media in 1999.
[3] “A Case for Redundant Arrays of Inexpensive Disks (RAID)” by David A. Patterson, Garth Gibson, and Randy H. Katz, ACM SIGMOD Conference, June 1988
[4] “Web Search for a Planet: The Google Cluster Architecture” by Luiz André Barroso, Jeffrey Dean, and Urs Hölzle, Published by IEEE MICRO, March-April 2003. Available at https://static.googleusercontent.com/media/research.google.com/en//archive/googlecluster-ieee.pdf
[5] “The Google File System” by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, presented at ACM SOSP’03, October 19–22, 2003. Available at https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
[6] “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat, presented at ACM/USENIX Symposium on Operating System Design and Implementation (OSDI), 2004. Available at https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
[7] How Google Tests Software by James A. Whittaker, Jason Arbon, and Jeff Carollo, Addison-Wesley Professional, March, 2012
[8] See Managing to Learn: Using the A3 Management Process to Solve Problems, Gain Agreement, Mentor and Lead by John Shook and Jim Womack, Lean Enterprise Institute, June, 2008
[9] See “Online Experimentation at Microsoft” by Ron Kohavi, Thomas Crook, and Roger Longbotham, presented at the ACM Knowledge Discovery & Data Mining (KDD) Conference, 2009. Available at https://exp-platform.com/Documents/ExPThinkWeek2009Public.pdf
[10] The Machine That Changed the World; the Story of Lean Production by James P. Womack, Daniel T. Jones, and Daniel Roos. Rawson & Associates, 1990
[11] Product Development Performance: Strategy, Organization, and Management in the World Auto Industry by Kim B. Clark and Takahiro Fujimoto, Harvard Business School Press, 1990. See pg. 172.
[12] How not to land an orbital rocket booster https://www.youtube.com/watch?v=bvim4rsNHkQ published by SpaceX, Sept 14, 2017
[13] John Muratore, SpaceX launch Director, American Institute of Aeronautics and Astronautics (AIAA) 2012 Complex Aerospace Systems Exchange, Available at http://store.xitricity.skydreams.ws/s3j95uj8a.pdf
[14] Ibid
[15] War and Peace and IT: Business Leadership, Technology, and Success in the Digital Age by Mark Schwartz, IT Revolution Press, May, 2019
[16] “Building the AI-Powered Organization: Technology isn’t the biggest challenge. Culture is.” by Tim Fountaine, Brian McCarthy, and Tamim Saleh, Harvard Business Review, July-August, 2019
[17] Think Like Amazon: 50 1/2 Ideas to Become a Digital Leader by John Rossman, McGraw-Hill Education, April, 2019. See Idea 26: Innovate by Reducing Friction.
[18] This is Lean: Resolving the Efficiency Paradox by Niklas Modig and Pär Åhlström, Rheologica Publishing, November, 2012
[19] Extreme Programming Explained, by Kent Beck. Addison-Wesley, 2000 
[20] Agile Software Development with SCRUM by Ken Schwaber and Mike Beedle, Pearson, October 2001
[21] Kanban: Successful Evolutionary Change for Your Technology Business by David J. Anderson, Blue Hole Press, April, 2010
[22] Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation by Jez Humble and David Farley, Addison-Wesley Professional, 2010
[23] The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations by Gene Kim, Jez Humble, Patrick Debois, and John Willis, IT Revolution Press, October, 2016
[24] See https://trunkbaseddevelopment.com/
[25] “No Silver Bullet – Essence and Accident in Software Engineering” by Frederick Brooks, IFIP Tenth World Computing Conference, Amsterdam, NL, 1986
[26] Normal Accidents: Living With High-risk Technologies by Charles Perrow, Basic Books, 1985
[27] Michael Cusumano popularized the term Sync and Stabilize in his book Microsoft Secrets: How the World's Most Powerful Software Company Creates Technology, Shapes Markets and Manages People, Free Press, December 1998.
[28] Used with permission.
[29] The 1962 book “The Structure of Scientific Revolutions” by Thomas Kuhn discussed how significant paradigm shifts in science do not take hold until a generation of scientists brought up with the old paradigm finally retire.
[30] From “What is Digital Intelligence” by Sunil Mithas and F. Warren McFarlan, IEEE Computing Edge, November 2017. Pg.9.
[31] Thanks to Nick Larsen. Does Your Employer See Software Development as a Cost Center or a Profit Center? https://stackoverflow.blog/2017/02/27/employer-see-software-development-cost-center-profit-center/
[32] The idea of “Cost of Delay” was introduced by Preston G. Smith and Donald G. Reinertsen in their book Developing Products in Half the Time, Van Nostrand Reinhold, 1991, Second edition, Wiley, 1997.
[33] “If Xerox Parc Invented the PC, Google Invented the Internet” by Cade Metz, Wired, August, 2012. Available at https://www.wired.com/2012/08/google-as-xerox-parc/
[34] Standish Group Study Reported at XP2002 by Jim Johnson, Chairman
[35] Think Like Amazon: 50 1/2 Ideas to Become a Digital Leader by John Rossman, McGraw-Hill Education, April, 2019. See Idea 45: The Future Press Release.
[36] From: ING’s agile transformation, an interview with Peter Jacobs, CIO of ING Netherlands, and Bart Schlatmann, former COO of ING Netherlands, in McKinsey Quarterly, January 2017. See also: Software Circus Cloudnative Conference keynote by Peter Jacobs. (Peter Jacobs replaced Ron van Kemenade as CIO of ING Netherlands in 2013.)
[37] From: Building a Cutting-Edge Banking IT Function, An Interview with Ron van Kemenade, CIO ING Bank, by Boston Consulting Group. See also talks by Ron van Kemenade: Nothing Beats Engineering Talent…The AGILE Transformation at ING and The End of Traditional IT.
[38] “All in: From recovery to agility at Spark New Zealand” From interviews with Spark New Zealand’s Simon Moutter, Jolie Hodson, and Joe McCollum by McKinsey’s David Pralong, Jason Inacio, and Tom Fleming, McKinsey Quarterly, June 2019


Friday, June 7, 2019

Lean Lunch Lines


Budapest
“This doesn’t look good,” Tom said, pointing out six signs high on the wall, side by side. Three said “Pork and Beef.” Three said “Fish and Vegan.” Below each sign a serving station was being set up. Lunch break started in half an hour. There were over 2000 people to feed, and the pouring rain meant any outdoor food options were unlikely to attract much traffic.

We weren’t the only ones who noticed. Lines began to form at the food stations; the queues for meat grew especially long. We joined one of the three long lines and moved slowly to the front. At the food station we had several choices to make – so it took a while to be served.

I did the math. Each meat station was serving about 4 people per minute, so the three meat stations might serve roughly 750 people in an hour – maybe 1000 if the service got faster. From the short lines at the fish/vegan stations, I inferred that they might account for 15-20% of the demand, leaving over 1500 people to be served by the three lines offering meat. The 90-minute lunch break was probably not going to be long enough.
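
For anyone who wants to check the arithmetic, here is a minimal back-of-the-envelope sketch in Python; the attendance, meat share, and service rate are the rough estimates from the paragraph above, not measured figures.

# Rough capacity check for the Budapest lunch break (all numbers are estimates).
attendees = 2000
meat_share = 0.80            # roughly 80% of the demand went to the meat lines
meat_stations = 3
rate_per_station = 4         # people served per minute, as observed
lunch_break = 90             # minutes

meat_demand = attendees * meat_share                         # about 1600 people
capacity = meat_stations * rate_per_station * lunch_break    # 3 * 4 * 90 = 1080 servings
time_to_clear = meat_demand / (meat_stations * rate_per_station)

print(f"Demand for meat: {meat_demand:.0f} people")
print(f"Capacity during the {lunch_break}-minute break: {capacity} servings")
print(f"Time to serve everyone: {time_to_clear:.0f} minutes")   # about 133 minutes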

Sure enough, just before the sessions were set to resume, a tweet announced that the afternoon talks would be delayed by a half hour.

At a conference whose attendees pride themselves on agility and thus should understand queuing theory, this was a big disappointment. In Lean we have a mantra: “Go and See.” It means go to the place where delays are happening and see for yourself what is going on.  I can’t help thinking that if an observant person with authority to change things had taken a close look at the lines on the first day, it might have been possible to improve the process and keep the conference on schedule. For example, they might have switched two of the three fish/vegan lines to meat, or perhaps they could have served all types of food from all six food stations, or shortened the time it took to serve a meal.

There was a second day to the conference, and I assumed that lessons had been learned and queues would be shorter. As lunch time approached, there were four more signs posted high on a wall at the far end of the room, perpendicular to the first six signs. Two said “Pork and Beef.” Two said “Fish and Vegan.” So now there were five long lines for the 80% or so of the attendees who preferred meat, and five short lines for the others. An observant caterer would not have added more fish/vegan lines; with a total of ten lines, no more than two should have been devoted to the meatless meals preferred by perhaps 20% of the attendees. Of course, that ratio was not precise; for example, although I preferred meat, I opted for the very short line serving fish, and I’m sure I was not alone.

The real test of lunch line flow is this: How long do attendees have to wait in line to obtain the food of their choice? At a conference where organizers understand queuing theory, attendees should not have to wait in a food queue for more than 10 minutes – or maybe up to 15 minutes at peak times. Asking people to stand in line for longer periods shows a lack of respect for their time.

Zürich
The following week we attended DevOps Days in Zürich. There were roughly 400 attendees, and as lunch approached I noticed two stations being stocked with food. I wondered how many people a station might serve per minute. If it was four per minute (as in Budapest) and there were only two stations serving lunch, I speculated that it could take 50 minutes to serve everyone – and that’s a long time for anyone to stand in line.

But I was wrong. Julia Faust, who helped plan the lunch, explained to me: “DevOps is about FLOW, so we want lunch lines that flow. We have four food stations – not two – and we have the same food at each station so anyone can get in any line. We limited the number of food offerings to be sure service is fast; we expect each station to serve about ten people per minute. We think that the four food stations can feed up to 40 people per minute, so 400 people could be fed in ten to fifteen minutes. We have also placed appetizers around on the tables so people can eat something before getting in line. We are hoping that the lines will form gradually and remain very short.”
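
As a quick check of those numbers, here is a small Python sketch; the per-station service rates are the estimates quoted above, not measurements.

# Minutes needed to serve 400 attendees under two different lunch-line setups.
attendees = 400

def minutes_to_serve(stations, people_per_minute_each):
    # Assumes every station offers the same food, so any attendee can join any line.
    return attendees / (stations * people_per_minute_each)

print(minutes_to_serve(stations=2, people_per_minute_each=4))    # 50.0 - my initial worry
print(minutes_to_serve(stations=4, people_per_minute_each=10))   # 10.0 - the planned setup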

Sure enough, there were almost no lines at the food stations, and everyone was served within fifteen minutes. Perhaps no one noticed that there was more time for networking and lunch-time gatherings, but to me it was clear that the organizers of this conference understood queuing theory and respected the time of attendees.

Lyon
We went to the MiXiT conference in Lyon the following week, where about 1000 attendees were expecting lunch. I was delighted to see that people were able to help themselves to food rather than have it served. I never could understand why most European conferences I attend find it necessary to have someone serve food and pour coffee. After all, just about every European hotel I stay at seems to have a breakfast buffet, so there's nothing inherently difficult about self-service meals.

There were two long food tables, one on each side of the lunchroom. Again, I did the math. To feed 1000 people in 15 minutes, the output rate (the inverse of the takt time) for each table would have to be about 32 people per minute. With a line on each side of each table, each of the four lines would need to serve about 16 people per minute. But my calculation was wrong – there would be only one line per table, not two, so a 15-minute lunch service would require each line to serve 32 people per minute. Clearly this was not going to happen.
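
Here is the same arithmetic as a short Python sketch, using the 1000-attendee, 15-minute target from the paragraph above; all of the figures are rough estimates.

# Required serving rate per line to feed 1000 people in 15 minutes.
attendees = 1000
target_minutes = 15
tables = 2

total_rate = attendees / target_minutes   # about 67 people per minute overall
per_table = total_rate / tables           # about 33 people per minute per table

for lines_per_table in (2, 1):
    per_line = per_table / lines_per_table
    print(f"{lines_per_table} line(s) per table: {per_line:.1f} people per minute per line")
# Exact values are 16.7 and 33.3 - the "about 16" and "about 32" people per minute
# estimated above. One line per table means over 30 per minute: not going to happen.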

“Why not pull the tables away from the wall a bit further and allow people to get food from both sides?” I asked the gentleman in charge of lunch.

“That’s not possible,” he replied. “We need access to one side of each table for replenishment.”

“But,” I said, “Then the lines would move twice as fast.”

“They are fast enough,” he said. “Yesterday the lines were 35 minutes long; they don't need to be faster.”

I saw tables stacked high with boxes and bags of food, with a long line of people moving past each table, picking up three or four individually packaged items. On the other side, a few people watched and occasionally replenished one or the other stacks of food. They could have easily interrupted a line to add a depleted item – this happens all the time at breakfast buffets. There was no reason (other than habit) to limit each table to one line. A conference that respects its attendees should optimize lunch lines for the convenience of attendees, and find other ways to optimize the convenience of the people serving food.

A Footnote on Diversity
Every software conference I attend broadcasts a policy encouraging diversity. I welcome that because I am different from most attendees – I am 75 years old (and proud of it). But somehow, my kind of diversity has not been considered at most conferences. Consider the first conference we stopped at this spring – Agile Lean Ireland. There were virtually no chairs except in rooms where talks were held; everyone was expected to stand during coffee and lunch breaks. So Tom and I ate – usually alone – in a conference room.

Lunch was served at multiple locations strung out down a long hallway. The station furthest from the conference rooms opened first, to encourage people to move to the most remote station. This might have been a good idea, except for one thing – I found myself swept up in a swarm of attendees racing down the hallway to get served first. After a short time, I just stopped – I had gone far enough. I turned to the servers at a nearby lunch station (which was not yet opened) and said “Give me food. Now. I can’t go any further.”

The gentleman I spoke to was about to refuse, but his wiser companion indicated he should go ahead. As he served lunch for Tom and me, I smiled gratefully at the woman who had broken the rules. Then she said to the nearby people hoping to get some food, “This location is not yet open. You have to keep moving to the end of the hall.” Sigh.

Long walks, long lines, no chairs, and toilets up or down stairs are all indications that a conference does not really welcome older, less agile attendees. Of the four conferences that Tom and I attended in April and May, DevOps Days in Zürich was the only one which had none of these limitations, and thus made us feel the most welcome.