
September 23, 2017

The Only Country in the World

Software systems that interact with people speak volumes about the people who designed them. In particular, software systems used by travelers often send a clear message: “This is the only country in the world. If you are an outsider, you are not welcome.”

Let’s start with the US. If you want to buy gas at the pump – as almost anyone who travels by car needs to do occasionally – and you don’t have a US zip code, you are out of luck. The gas pumps will require a five-digit zip code that matches your home zip code, which, of course, you don’t have. US software systems for purchasing gas are very clear: If you don’t live in the US, you can’t buy gas here.

Not that it’s easy for me to buy gas in Europe, because there I need a chip-and-pin card. But credit card companies in the US have settled on chip-and-signature cards, effectively preventing me from purchasing gas at a pump in Europe. My European friends have no sympathy, since they can’t purchase gas in the US either.

Of course, the problems with my chip-and-signature card do not lie in the gas pump software, but in the choice made by US credit card issuers to use signature as the authentication method. We all know that signature authentication is a joke which leads to a far less secure credit card, but in addition, it prevents me from using the pin authentication systems that are common outside the US. My credit card company has issued me a chip-and-signature card that they claim is a “travel card” – which would be true if the US were the only country in the world. But should I happen to travel to another country, not only are gas pumps off limits to my chip-and-signature card, but I can’t purchase train or bus tickets – or anything sold at a kiosk.

There are other countries where software systems used by travelers are limited to residents. For example, in the Netherlands, train tickets are typically purchased through a bank account which – you guessed it – must be at a Netherlands bank. Earlier this year, I was unable to purchase NS train tickets online with a credit card; I had to get a colleague in the Netherlands to purchase online tickets and email them to me. I didn't want to chance getting tickets once I arrived, since I understand there are very few NS ticket kiosks usable by outsiders.

In the UK, there are very nice train discount schemes; for example, two people traveling together can get serious discounts. The catch is that they must first purchase a discount card with pictures of the two travelers, which can easily be obtained online, but must be mailed to a UK address. True, it is possible to obtain a discount card at a train station, but not at Heathrow – and where do you suppose most travelers arrive? Unlucky travelers without a UK address must pay full price for (expensive) Heathrow Express tickets, and then stand in line at Paddington with the proper paper applications and photographs to get a discount card.

Attention UK software designers: did it occur to you that some people don’t have a UK address? How hard would it be to charge a bit more for shipping to addresses outside the UK?

You would not think that Sweden would belong to the club of countries with software systems designed as if it were the only country in the world. But when we arrived at Arlanda airport on a Friday night and tried to buy the special weekend two-for-one ticket on the Arlanda Express, it was not on the kiosk menu. (Yep, I was using a chip-and-pin card – my debit card!) I searched and searched and finally saw the message stuck to the kiosk below the screen: a recent change had been made, and the special discount ticket could now be purchased only through the Arlanda Express app or online, not at the kiosk.

Reading between the lines, this is clearly an attempt to limit the best Arlanda Express ticket pricing to Swedish residents. "Not so!" the software designers probably argued. "Anyone can load the app and buy a ticket." How am I supposed to load an app, validate my payment method, and buy a ticket before the train leaves – all without internet access? I complained to the train conductor, who said he thought the scheme was terrible – he has listened to complaints from countless deeply annoyed visitors to Sweden – would I please complain directly to customer service? As I was composing my complaint email on the train, I had to listen to messages about how hard Arlanda Express was working to make our experience wonderful. Yes, but only if you happen to live in Sweden.

We took a taxi to our Stockholm hotel from the train station and tried to pay the driver in cash, only to learn that our Swedish money was out of date and no longer legal tender. So I asked at the hotel desk how to change the old notes into new ones. The person at reception was very helpful – she told me that I could mail the cash in with an online form and the money would be deposited in my bank account, even if it were a “foreign” account. I was skeptical. And sure enough, when we looked at the form (which was in Swedish) we found that only bank accounts with IBAN numbers would work. Those of us from countries without IBAN numbers are apparently too foreign to merit a convenient way to get our money back, even though we are the most likely people to have the old currency.

Clearly there are far too many software systems in the travel industry that are built as if the local country were the only country in the world. This is a plea to all the software teams designing systems that might be used by travelers from another country – or might be used by your customers when they travel to another country: have you built your system as if your country were the only country in the world? Why not try a few use cases for travelers from, and to, other countries? We exist, you know, and we’re getting tired of the arrogance embedded in software.

August 19, 2015

Friction

One third of the fuel that goes into a car is spent overcoming friction. By comparison, an electric car loses half as much energy – one sixth – to friction. Who knew electric cars had such an advantage?

Friction is the force that resists motion when the surface of one object comes into contact with the surface of another. You can imagine parts moving against one another in cars, but do you ever wonder what happens when your products and services come in contact with customers? Might this create some friction? Could there be competing offerings that create considerably less friction? If so, you can be sure your customers will find the low friction offering more attractive than yours.

Friction in the Customer Journey

Think of friction as the cognitive overhead that a system places on those who use it. Let’s consider the friction involved in taking a taxi from an airport to a hotel. When I arrive at most airports, I get in a taxi queue, heeding the conspicuous warnings not to ride with unauthorized drivers. When I reach the front of the queue, I take the next taxi in line, and I assume that the cost is the same no matter which taxi I take. But this is not true in Stockholm, where taxis can charge any rate they wish simply by posting it in the window. Nor is it true in many other locations, so I have learned to research the taxi systems in every city I visit. That’s cognitive load. I also bring enough local currency to pay for a taxi ride to my hotel and check on whether a tip is expected. More cognitive load.

Uber set out to remove the friction from taking a taxi by reimagining the entire experience, from hailing to routing to paying; from drivers and cars to insurance and regulations. By removing as many points of friction as possible for riders, Uber has become wildly popular in a very short time. In January 2015, four years after launch, Uber reported that its revenue in its home city of San Francisco had grown to three times the size of the entire taxi market in that city. Uber has recently opened a robotics center in Pittsburgh and joins Google in working to create a practical driverless car. Its intent is to bring the cost and convenience of ride services to a point where owning a car becomes the expensive option.

Full Stack Startups

Uber is among the largest of a new crop of startups – investor Chris Dixon calls them full stack startups – that bypass incumbents and reinvent the entire customer experience from start to finish with the aim of making it as frictionless as possible. Full stack startups focus on creating a world that works the way it should work, given today’s technology, rather than optimizing the way it does work, given yesterday’s mental models. Because these companies are creating a new end-to-end experience, they rarely leverage existing capabilities aimed at serving their market; they develop a “full stack” of new capabilities.
"The challenge with the full stack approach is you need to get good at many different things: software, hardware, design, consumer marketing, supply chain management, sales, partnerships, regulation, etc. The good news is that if you can pull this off, it is very hard for competitors to replicate so many interlocking pieces." – Chris Dixon
Large companies have the same full stack of capabilities as startups, but these capabilities lie in different departments and there is friction at every department boundary. Moreover, incumbents are deeply invested in the way things work today, so large incumbent companies are usually incapable of truly reinventing a customer journey. As hard as they try to be innovative, incumbents tend to be blind to the friction embedded in the customer journey that they provide today.
Consider banks. They have huge, complex back-end systems of record that are expensive to maintain and keep secure. But customers expect mobile access to their bank accounts, so banks have added “front-end teams” to build portals (mobile apps) to access the back-end systems. Typically banks end up with what Gartner calls “Bimodal IT”: one part of IT handles the back-end systems using traditional processes, while a different group uses different processes to deliver web and mobile apps. As a result, the front-end teams are not able to reimagine the customer journey; they are locked into the practices and revenue models embedded in the back-end systems. So in the end, banks have done little to change the customer journey, the fee structure, or anything else fundamental to banking.
For example, in the US it is nearly impossible for me to transfer money to my granddaughter’s bank account without physically mailing a check or paying an exorbitant wire transfer fee. Not only that, but I cannot use my chip-and-pin card at many places in Europe because US banks don’t let me enter a pin with my card (they still depend on signatures!), while unstaffed European kiosks always require a pin. Anyone who banks in Europe would find US banking practices archaic. I find that they generate a lot of friction and I expect that they cost a lot of money.

Creative Friction

Why do banks adopt Bimodal IT? According to Gartner, “There is an inherent tension between doing IT right and doing IT fast.” I respectfully disagree; there is nothing inherently wrong about being fast. In fact, when software development is done right, speed, quality and low cost are fully compatible. Hundreds of enterprises, including Amazon and Google (whose systems manage billions of dollars of revenue every month), have demonstrated that the safest approach to software development is automated, it is adaptive, and it is fast.

It is true that there is tension between different disciplines: front end and back end; dev and ops; product and technology. But the best way to leverage these tensions is not to separate the parties, but to put them together on the same team with a common goal. You will never have a great product, or a great process, without making tradeoffs – that is the nature of difficult engineering problems. If your teams lack multiple perspectives on a problem, they will be unable to make consistently good tradeoff decisions, and their results will be mediocre.

Friction in the Code

The Prussian general and military theorist Carl von Clausewitz (1780-1831) thought of friction as the thing which tempers the good intentions of generals with the reality of the battlefield. He was thinking of the friction caused by boggy terrain that horses cannot cross, soldiers exhausted by heat and heavy burdens, fog that obscures enemy positions, supplies that don’t keep pace with military movements. He noted that battalions are made up of many individuals moving at different rates with different amounts of confusion and fear, each one affecting the others around him in unpredictable ways. It is impossible for the thousands of individual agents on the battlefield to behave exactly according to a theoretical plan, Clausewitz wrote. Unless generals have actually experienced war, he said, they will not be able to account for the accumulated friction created by all of these agents interacting with each other and their environment.

Anyone who has ever looked closely at a large code base would be forgiven for thinking that Clausewitz was writing about software systems. Over time, any code base acquires lots of moving parts and increasing amounts of friction develops between these parts, until eventually the situation becomes hopeless and the system is either replaced or abandoned. Unless, of course, the messy parts are systematically cleaned up and friction is kept in check. But who is allowed to take time for this sort of refactoring if the decision-makers have never written any code, never been surprised by hidden dependencies, never been bitten by the unintended consequences of seemingly innocuous changes?

Failure

Not long ago the New York Stock Exchange was shut down for half a day due to “computer problems.” It’s not uncommon for airline reservation systems to suffer from “computer problems” so severe that planes are grounded. But we don’t expect to hear about “computer problems” at Twitter or Dropbox or Netflix or similar systems – maybe they had problems a few years ago, but they seem to be reasonably reliable these days. The truth is, cloud-based systems fail all the time, because they are built on unreliable hardware running over unreliable communication links. So they are designed to fail, to detect failure, and to recover quickly, without interrupting or corrupting the services they provide. They appear to be reliable because their robust failure detection and recovery mechanisms isolate users from the unreliable infrastructure.

The first hint of this approach was Google’s early strategy for building a server farm. They used cheap off-the-shelf components that would fail at a known rate, and then they automated failure detection and recovery. They replicated server contents so nothing was lost during a failure, and they automated the monitoring, detection, and recovery process. Amazon built its cloud with the same philosophy – they knew that at the scale they intended to pursue, everything would fail sooner rather than later, so automated failure detection and recovery had to be designed into the system.

Designing failure recovery into a system requires a special kind of software architecture and approach to development. To compensate for unreliable communication channels, messaging is usually asynchronous and on a best-efforts basis. Because servers are expected to fail, interfaces are idempotent so you get the same results on a retry as you get the first time. Since distributed data may not always match, software is written to deal with the ambiguities and produce eventual consistency.
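To make the idea of an idempotent interface concrete, here is a minimal sketch in Python. The payment example, class, and method names are invented for illustration, not taken from any particular system: the point is simply that a retry carrying the same idempotency key returns the stored result instead of repeating the side effect.

```python
import uuid

# A service that remembers every request it has already processed,
# keyed by a client-supplied idempotency key.
class PaymentService:
    def __init__(self):
        self._processed = {}  # idempotency key -> stored result

    def charge(self, idempotency_key, account, amount):
        # If this key has been seen before, return the original result
        # unchanged -- no second charge, no matter how many retries arrive.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]
        result = {"account": account, "amount": amount, "status": "charged"}
        self._processed[idempotency_key] = result
        return result

service = PaymentService()
key = str(uuid.uuid4())           # the client generates the key once
first = service.charge(key, "acct-42", 100)
retry = service.charge(key, "acct-42", 100)  # client retried after a timeout
assert first == retry             # the retry is harmless
```

In a real distributed system the `_processed` store would itself be replicated, but the contract is the same: calling the interface twice with the same key must be indistinguishable from calling it once.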
Fault tolerance is not a new concept. Back in the days before solid state components, computer hardware was expected to fail, so vast amounts of time and energy were dedicated to failure detection and recovery. My first job was programming the Number 2 ESS (Electronic Switching System) being built in Naperville, IL by Bell Labs about the time I got out of college. This system, built out of discrete components prior to the days of integrated circuits, had a design goal of a maximum downtime of two hours in forty years. The hardware was completely duplicated and easily half of the software was dedicated to detecting faults, switching out the bad hardware, and identifying the defective discrete component so it could be replaced. This allowed a system built on unreliable electronic components to match the reliability of the electro-mechanical switching systems that were commonly in use at the time.

Situational Awareness

Successful cloud-based systems have a LOT of moving parts – that pretty much comes as a byproduct of success. With all of these parts moving around, designing for failure hardly seems like an adequate explanation for the robustness of these systems. And it isn’t. At the heart of a reliable cloud-based system are small teams (you might call them “full stack” teams) of people who are fully responsible for their piece of the system. They pay attention to how their service is performing, they fix it when it fails, and they continuously improve it to better serve its consumers.

Full stack teams that maintain end-to-end responsibility for a software service do not fit the model we used to have of the “right” way to develop software. These are not project teams that write code according to spec, turn it over to testing, and disband once it’s tossed over the wall to operations. They are engineering teams that solve problems and make frequent changes to the code they are responsible for. Code bases created and maintained by full stack teams are much more resilient than the large and calcified code bases created by the project model precisely because people pay attention to (and change!) the internal workings of “their” code on an on-going basis.

Limited Surface Area

Clearly, many small teams making independent changes to a large code base can generate a lot of Clausewitzian friction. But since friction occurs when the surfaces of two objects come in contact with each other, strictly limiting the surface area of the code exposed by each team can dramatically reduce friction. In cloud-based systems, services are designed to be as self-contained as possible and interactions with other services are strictly limited to hardened interfaces. Teams are expected to limit changes to the surface area (interfaces) of their code and proactively test any changes that might make it through that surface to other services.

Modern software development includes automated testing strategies and automated deployment pipelines that take the friction out of the deployment process, making it practical and safe to independently deploy small services. Containers are used to standardize the surface area that services expose to their environment, reducing the friction that comes from unpredictable surroundings. Finally, when small changes are made to a live system, the impact of each change is monitored and measured. Changes are typically deployed to a small percentage of users (limiting the deployment surface area), and if any problems are detected small changes can be rolled back quickly. We know that the best way to change a complex system is to probe and adapt, and we know that software systems are inherently complex. This explains why the small rapid deployments common in cloud-based systems turn out to be much safer and more robust than the large releases that we used to think were the “right” way to deliver software.
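The canary pattern described above can be sketched in a few lines. This is a hypothetical illustration – the function names, routing scheme, and thresholds are invented, not any particular platform’s API – showing how a small percentage of users is routed to a new version and how a spike in its error rate triggers rollback.

```python
import zlib

def choose_version(user_id, canary_percent=5):
    """Deterministically route about canary_percent of users to the new version."""
    # zlib.crc32 gives a stable hash, so a given user always sees the same version.
    return "canary" if zlib.crc32(user_id.encode()) % 100 < canary_percent else "stable"

def should_roll_back(canary_errors, canary_requests, baseline_error_rate, tolerance=2.0):
    """Roll back if the canary's error rate exceeds the baseline by the tolerance factor."""
    if canary_requests == 0:
        return False  # not enough data yet
    return (canary_errors / canary_requests) > baseline_error_rate * tolerance

# Roughly 5% of users see the new code; everyone else stays on stable.
assignments = [choose_version(f"user-{i}") for i in range(1000)]
print(assignments.count("canary"))

# A spike in canary errors (12% vs a 1% baseline) triggers rollback.
print(should_roll_back(12, 100, 0.01))   # True
print(should_roll_back(1, 100, 0.01))    # False
```

Because only the canary slice is exposed, a bad change affects a small deployment surface area and can be rolled back before most users ever see it.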

Shared Learning

Do you ever wonder how the sophisticated testing and deployment tools used at companies like Netflix actually work? Would you like to know how Netflix stores and analyzes data or how it monitors the performance of its platform? Just head over to the Netflix Open Source Center on GitHub; it’s all there for you to see – and use if you’d like. Want to analyze a lot of data? You will undoubtedly consider Hadoop, originally developed at Yahoo! based on Google research papers, open sourced through Apache, and now at the core of many open source tools that abstract its interface and extend its capability.

The world-wide software engineering community has developed a culture of sharing intellectual property, in stark contrast to the more common practice of keeping innovative ideas and novel tools proprietary. The rapid growth of large, reliable, secure software systems can be directly linked to the fact that software engineers routinely contribute to and build upon the work of their world-wide colleagues. Because of this, methods and tools for building highly reliable complex software systems have advanced extraordinarily quickly and are widely available.

Friction in the Process

Between 2004 and 2010, the FBI tried twice to develop an electronic case management system, and it failed both times, squandering hundreds of millions of dollars. The UK’s National Health Service lost similar amounts of money on a patient booking system that was eventually abandoned, and multiple billions of pounds on a patient record system that never worked. In 2012 Sweden decided to scrap and rewrite PUST, a police automation system that actually worked quite well, but not well enough for those who chose to have it rewritten the “right” way. The rewrite never worked and was eventually abandoned, an expensive fiasco that left the police without any system at all.

I could go on and on – just about every country has its story about an expensive government-funded computer system that cost extraordinary amounts of money and never actually worked. The reason? Broadly speaking, these fiascoes are caused by the process most governments use to procure software systems – a high friction process with a very high rate of failure.

One country that does not have an IT fiasco story is Estonia, probably the most automated country in the world. A few years ago British MP Francis Maude visited Estonia to find out how they managed to implement such sophisticated automation on a small budget. He discovered that Estonia automated its government because it had such a small budget, and properly automated government services are much less expensive than their manual counterparts.

Estonia’s process is simple: small internal teams work directly with consumers, understand their journey, and remove the friction. Working software is delivered in small increments to a small number of consumers, adjustments are made to make it work better, and once things work well the new capability is rolled out more broadly. Then another capability is added in the same manner, and thus the system grows steadily in small steps over time. (Incidentally, when this process is used, it is almost impossible to spend a lot of money only to find out the system doesn’t work.)

The UK government formed a consortium with Estonia and three other countries (called the Digital 5) to “provide a focused forum to share best practice [and] identify how to improve the participants’ digital services.” Maude started up the UK’s Government Digital Services, where small internal teams focus on making the process of obtaining government information and services as frictionless as possible. If you want to see how the UK Government Digital Services actually works, check out its Design Principles which summarize a new mental model for creating digital services, and the Governance approach, which outlines an effective, low friction software development process.

The HealthCare.gov fiasco in the US in 2013 led to the creation of US Digital Services, which is working in partnership with UK Digital Services to rethink government software development and delivery strategies. The US Digital Services Playbook is a great place for any organization to find advice on implementing a low friction development process.
DIGITAL SERVICE PLAYS:
  1. Understand what people need
  2. Address the whole experience, from start to finish
  3. Make it simple and intuitive
  4. Build the service using agile and iterative practices
  5. Structure budgets and contracts to support delivery
  6. Assign one leader and hold that person accountable
  7. Bring in experienced teams
  8. Choose a modern technology stack
  9. Deploy in a flexible hosting environment
  10. Automate testing and deployments
  11. Manage security and privacy through reusable processes
  12. Use data to drive decisions
  13. Default to open

The New Mental Model

The UK government changed – seemingly overnight – from high friction processes orchestrated by procurement departments to small internal teams governed by simple metrics. Instead of delivering “requirements” that someone else thinks up, teams are required to track four key performance indicators and figure out how to move these metrics in the right direction over time.
UK Digital Service’s four core KPIs:
  1. Cost per transaction
  2. User satisfaction
  3. Completion rate
  4. Digital take-up

See Gov.UK’s Performance Dashboard.
This is an entirely new mental model about how to develop effective software – one that removes all of the intermediaries between an engineering team and its consumers. It is a model that makes no attempt to define requirements, make estimates, or limit changes; instead it assumes that digital services are best developed through experimentation and require on-going improvement.

This is the mental model used by those who developed the first PUST system in Sweden, the one that was successful and appreciated by the police officers who used it. But unfortunately, conventional wisdom said it was not developed the “right” way, so the working system was shut down and rebuilt using the old mental model. And thus Sweden snatched failure from the jaws of success, proving once again that when it comes to developing interactive services, the old mental model simply Does. Not. Work.

Unexpected Points of Friction

It turns out that when governments move from the old mental model to the new mental model, many of the things that were considered “good” or “essential” in the past turn out to be “questionable” or “to be avoided” going forward. It’s a bit jarring to look at the list of good ideas that should be abandoned, but when you consider the friction that these ideas generate, it’s easier to see why forward-looking governments have eliminated them.
1. Requirements generate friction. The concept that requirements are specified by [someone] and implemented by “the team” has to be abandoned. Rather a team of engineers should explore hypotheses, testing and modifying ideas until they are proven or abandoned. Engineering teams should be expected to figure out how to make a positive impact on business metrics within valid constraints.
2. Handovers generate friction. The engineering team should have direct contact with at least a representative sample of the people whose journey they are automating. Just about any intermediary is problematic, whether the go-between is a procurement officer, business analyst, or product owner.
3. Organizational boundaries generate friction. There is a reason why the UK and US use internal teams to develop Digital Services. Going through a procurement office creates insurmountable friction – especially when procurement is governed by laws passed in the days of the old mental model. The IT departments of enterprises often generate similar friction, especially when they are cost centers.
4. Estimates generate friction. Very little useful purpose is served by estimates at the task level. Teams should have a good idea of their capacity by measuring the rate at which they complete their current work or the time it takes work to move through their workflow. Teams should be asked "What can be completed in this time-frame?" rather than "How long will this take?" The UK Digital Services funds service development incrementally, with a general time limit for each phase. If a service does not fall within the general time boundaries, it is usually broken down into smaller services.
5. Multitasking generates friction. Teams should do one thing at a time and get it done, because task switching burns up a lot of cognitive overhead. Moreover, partially done work that has been put aside gums up the workflow and slows things down.
6. Backlogs generate friction. A long "to do" list takes time to compile and time to prioritize, while everything on the list grows old and whoever put it there grows impatient. Don't prioritize - decide! Either the capacity exists to do the work, or it doesn't. Teams need only three lists: Now, Next, and Never. There is no try.

If Governments can do it, so can Enterprises

If governments can figure out how to design award-winning services [GOV.UK won the Design Museum Design of the Year Award in 2013] while moving quickly and saving money, surely enterprises can do the same. But first there is a lot of inertia to overcome. Once upon a time, governments assumed that obtaining software systems through a procurement process was essential, because it would be impossible to hire the people needed to design and develop these systems internally. They were wrong. They assumed that having teams scattered about in various government agencies would lead to a bunch of unconnected one-off systems. They were wrong. They were afraid that without detailed requirements and people accountable for estimates, there would be no governance. They were wrong. Once they abandoned these assumptions and switched to the low friction approach pioneered by Estonia, governments got better designs, more satisfied consumers, lower cost, and far more predictable results.

Your organization can reap the same benefits, but first you will have to check your assumptions at the door and question some comforting things like requirements, estimates, IT departments, contracts, backlogs – you get the idea. Read the US Digital Services Playbook.  Could you run those 13 plays in your organization? If not, you need to uncover the assumptions that are keeping you in the grasp of the old mental model.

August 19, 2011

Don’t Separate Design from Implementation

I was a programmer for about fifteen years. Then I managed a factory IT department for a few years, and managed vendors delivering software for yet more years.  In all of those years (with one exception), software was delivered on time and customers were happy. Yet I never used a list of detailed requirements, let alone a backlog of stories, to figure out what should be done – not for myself, not for my department, not even for vendors.

In fact, I couldn’t imagine how one could look at a piece of paper – words – and decipher what to program. I felt that if the work to be done could be adequately written down in a detailed enough manner that code could be written from it, well, it pretty much had to be pseudocode. And if someone was going to write pseudocode, why not just write the code? It would be equally difficult, less error-prone, and much more efficient.

Software Without Stories
So if I didn’t use detailed requirements – how did I know what to code? Actually, everything had requirements, it’s just that they were high level goals and constraints, not low level directives. For example, when I was developing process control systems, the requirements were clear: the system had to control whatever process equipment the guys two floors up were designing, the product made by the process had to be consistently high quality, the operator had to find the control system convenient to use, and the plant engineer had to be able to maintain it. In addition, there was a deadline to meet and it would be career-threatening to be late. Of course there was a rough budget based on history, but when a control system was going to be used for some decades, one was never penny wise and pound foolish. With these high level goals and constraints, a small team of us proceeded to design, develop, install, and start up a sophisticated control system, with guidance from senior engineers who had been doing this kind of work for decades.

One day, after I had some experience myself, an engineering manager from upstairs came to ask me for help. He had decided to have an outside firm develop and install a process monitoring system for a plant. There was a sophisticated software system involved – the kind I could have written, except that it was too large a job for the limited number of engineers who were experienced programmers. He had chosen to contract with the outside firm on a time-and-materials basis even though his boss thought time-and-materials was a mistake. The engineering manager didn’t believe that it was possible to pre-specify the details of what was needed, but if a working system wasn’t delivered on time and on budget, he would be in deep trouble. So he gave me this job: “Keep me out of trouble by making sure that the system is delivered on time and on budget, and make sure that it does what Harold Stressman wants it to do.”

Harold was a very senior plant product engineer who wanted to capture real time process information in a database. He already had quality results in a database, and he wanted to do statistical analysis to determine which process settings gave the best results. Harold didn’t really care how the system would work, he just wanted the data. My job was to keep the engineering manager out of trouble by making sure that the firm delivered the system Harold envisioned within strict cost and schedule constraints.

The engineering manager suggested that I visit the vendor every few weeks to monitor their work. So every month for eighteen months I flew to Salt Lake City with a small group of people.  Sometimes Harold came, sometimes the engineers responsible for the sensors joined us, sometimes the plant programmers were there. We did not deliver “requirements;” we were there to review the vendor’s design and implementation. Every visit I spent the first evening poring over the current listings to be sure I believed that the code would do what the vendor claimed it would do. During the next day and a half we covered two topics: 1) What could the system actually do today (and was this a reasonable step toward getting the data Harold needed)? and 2) Exactly how did the vendor plan to get the system done on time (and was the plan believable)?

This story has a happy ending: I kept the engineering manager out of trouble, the system paid for half of its cost in the first month, and Harold was so pleased with the system that he convinced the plant manager to hire me as IT manager.

At the plant, just about everything we did was aimed at improving plant capacity, quality, or throughput, and since we were keepers of those numbers, we could see the impact of changes immediately. The programmers in my department lived in the same small town as their customers in the warehouse and on the manufacturing floor. They played softball together at night, met in town stores and at church, had kids in the same scout troop. Believe me, we didn’t need a customer proxy to design a system. If we ever got even a small detail of any system wrong, the programmers heard about it overnight and fixed it the next day.

Bad Amateur Design
The theme running through all of my experience is that the long list of things we have come to call requirements – and the large backlog of things we have come to call stories – are actually the design of the system. Even a list of features and functions is design. And in my experience, design is the responsibility of the technical team developing the system. For example, even though I was perfectly capable of designing and developing Harold’s process monitoring system myself, I never presumed to tell the vendor’s team what features and functions the system should have. Designing the system was their job; my job was to review their designs to be sure they would solve Harold’s problem and be delivered on time.

If detailed requirements are actually design, if features and functions are design, if stories are design, then perhaps we should re-think who is responsible for this design. In most software development processes I have encountered, a business analyst or product owner has been assigned the job of writing the requirements or stories or use cases which constitute the design of the system. Quite frankly, people in these roles often lack the training and experience to do good system design, to propose alternative designs and weigh their trade-offs, to examine implementation details and modify the design as the system is being developed. All too often, detailed requirements lists and backlogs of stories are actually bad system design done by amateurs.

I suggest we might get better results if we skip writing lists of requirements and building backlogs of stories. Instead, expect the experienced designers, architects, and engineers on the development team to design the system against a set of high-level goals and constraints – with input from and review by business analysts and product managers, as well as users, maintainers, and other stakeholders.

A couple of my “old school” colleagues agree with me on this point. Fred Brooks, author of the software engineering classic “The Mythical Man Month” wrote in his recent book “The Design of Design” [1]:
“One of the most striking 20th century developments in the design disciplines is the progressive divorce of the designer from both the implementer and the user. … [As a result] instances of disastrous, costly, or embarrassing miscommunication abound.”
Tom Gilb, author of the very popular books “Principles of Software Engineering Management” and “Competitive Engineering” recently wrote [2]:
“The worst scenario I can imagine is when we allow real customers, users, and our own salespeople to dictate ‘functions and features’ to the developers, carefully disguised as ‘customer requirements’. Maybe conveyed by our product owners. If you go slightly below the surface of these false ‘requirements’ (‘means’, not ‘ends’), you will immediately find that they are not really requirements. They are really bad amateur design for the ‘real’ requirements…. 
"Let developers engineer technical solutions to meet the quantified requirements. This gets the right job (design) done by the right people (developers) towards the right requirements (higher level views of the qualities of the application).”
Separating design from implementation amounts to outsourcing the responsibility for the suitability of the resulting system to people outside the development team. The team members are then in a position of simply doing what they are told to do, rather than being full partners collaborating to create great solutions to problems that they care about.

_________________________________
Footnotes:
[1] “The Design of Design” by Fred Brooks, pp 176-77. Pearson Education, 2010
[2] "Value-Driven Development Principles and Values;" by Tom Gilb, July 2010 Issue 3, Page 18, Agile Record 2010 (www.AgileRecord.com)

December 23, 2010

The Product Owner Problem

“We’re really struggling with the Product Owner concept, and many of our Scrum teams just don’t feel very productive,” they told us. “We’d like you to take a look at this and make some recommendations.” The company had several vertical markets, with a Scrum team of about ten people assigned to each market. Each market had a Product Manager, a traditional role found in most product companies. The company was clear about the role of a Product Manager; after all, there are university courses and professional organizations for this role. The Product Managers had a customer-facing job that included business responsibility for determining product direction and capability.

However, there was serious confusion about the Scrum role of Product Owner and its fit with the classic role of Product Manager. In addition to business responsibility, the Scrum Product Owner has the team-facing responsibility of managing the detailed product requirements.[1] In this company, the Product Managers found it impossible to handle both the customer-facing and team-facing jobs at the same time. So most teams had added an additional person to assist the Product Manager by preparing stories for the team, and called this person the Product Owner. The job of these Product Owners resembled the classic role of business analyst or, in some cases, user interaction designer.

Unfortunately, these Product Owners had little technical background in analysis or design, and yet they were expected to prepare detailed stories for the development team. Critical tradeoffs between business and technical issues often fell to these Scrum Product Owners, yet they had neither the first hand customer knowledge nor the in-depth technical knowledge to make such decisions wisely. They had become a choke point in the information flow between the Product Manager and the development team.

We asked the obvious question: How are things organized in the markets where things seem to be working well? It turns out that in the two highly successful vertical markets, there was no Product Owner preparing and prioritizing stories for the development team. Instead, the Product Manager had regular high level conversations with the development team about the general capabilities that would be desirable over the next two or three months. They discussed the feasibility of adding features and the results that could be expected. A real time application was created to show live web analytics of several key metrics that the Product Manager correlated to increased revenue. Then the team developed the capabilities most likely to drive the metrics in the right direction, observed the results, and modified their development plans accordingly.

This is a pattern we have seen frequently: Product Managers who lack the time, training, or temperament to handle both the customer-facing and the team-facing responsibilities of software development have two options. They can appoint Scrum Product Owners for each development team, or they can provide high-level guidance to a development team capable of designing the product and setting its own priorities. We observe that the second option generally works better, because an intermediary Product Owner brings a single perspective and limited time to the complex job of designing a product.

In 1988, Tom Gilb wrote the book Principles of Software Engineering Management, which is now in its 20th printing. One of the earliest advocates of evolutionary development, he has recently reiterated the elements of good software engineering in an article in Agile Record[2], from which I quote liberally:
Principle 1. Control projects by quantified critical-few, results.
1 Page total! (not stories, functions, features, use cases, objects, ..)
Principle 2. Make sure those results are business results, not technical.
Align your project with your financial sponsor’s interests!
Principle 3. Give developers freedom, to find out how to deliver those results.
The worst scenario I can imagine is when we allow real customers, users, and our own salespeople to dictate ‘functions and features’ to the developers, carefully disguised as ‘customer requirements’. Maybe conveyed by our Product Owners. If you go slightly below the surface, of these false ‘requirements’ (‘means’, not ‘ends’), you will immediately find that they are not really requirements. They are really bad amateur design, for the ‘real’ requirements – implied but not well defined.
Principle 4. Estimate the impacts of your designs, on your quantified goals.
….We have to design and architect with regard to many stakeholders, many quality and performance objectives, many constraints, many conflicting priorities. We have to do so in an ongoing evolutionary sea of changes with regard to all requirements, all stakeholders, all priorities, and all potential architectures…. a designer [should be able] to estimate the many impacts of a suggested design on our requirements.
Principle 5. Select designs with the best value impacts for their costs, do them first.
Assuming we find the assertion above, that we should estimate and measure the potential, and real, impacts of designs and architecture on our requirements, to be common sense. Then I would like to argue that our basic method of deciding ‘which designs to adopt’, should be based on which ones have the best value for money.
Principle 6. Decompose the workflow, into weekly (or 2% of budget) time boxes.
….I would argue that we need to do more than chunk by ‘product owner prioritized requirements’. We need to chunk the value flow itself – not just by story/function/use cases. This value chunking is similar to the previous principle of prioritizing the designs of best value/cost.
Principle 7. Change designs, based on quantified value and cost experience of implementation.
Principle 8. Change requirements, based on quantified value and cost experience, new inputs.
Principle 9. Involve the stakeholders, every week, in setting quantified value goals.
….In real projects, of moderate size, there are 20 to 40 interesting stakeholder roles worth considering…. But it can never be a simple matter of analyzing all stakeholders and their needs, and priorities of those needs up front. The fact of actual value delivery on a continuous basis will change needs and priorities. The external environment of stakeholders (politics, competitors, science, economics) will constantly change their priorities, and indeed even change the fact of who the stakeholders are. So we need to keep some kind of line open to the real world, on a continuous basis. We need to try to sense new prioritized requirements as they emerge, in front of earlier winners. It is not enough to think of requirements as simple functions and use cases. The most critical and pervasive requirements are overall system quality requirements, and it is the numeric levels of the ‘ilities’ that are critical to adjust, so they are in balance with all other considerations.
Principle 10. Involve the stakeholders, every week, in actually using value increments.
….I believe that should be the aim of each increment. Not ‘delivering working code to customers’. This means you need to recognize exactly which stakeholder type is projected to receive exactly which value improvement, and plan to have them, or a useful subset of them, on hand to get the increment, and evaluate the value delivered.
The Scrum Product Owner might be a role, but it should not be a job title. Product Owners wear many hats: Product Manager, Systems Engineer, User Interaction Designer, Software Architect, Business Analyst, Quality Assurance Expert, even Technical Writer. We would do well to use these well-known job titles, rather than invent a new, ambiguous title that tends to create a choke point and often removes from the development team its most important role – that of product design.

Discovery of the right thing to build is the most important step in creating a good product. Get that wrong and you have achieved 100% waste. Delegating decisions about what to build to a single Product Owner is outsourcing the most important work of the development team to a person who is unlikely to have the skills or knowledge to make really good decisions. The flaw in many Product Owner implementations is the idea that the Product Owner prepares detailed stories for the team to implement. This does not allow team members to be partners and collaborators in designing the product.

The entire team needs to be part of the design decision process. Team members need enough knowledge of the problems and opportunities being addressed to contribute their unique perspectives to the product design. Only when decisions cannot be made by the development team should they be escalated to a product leader. The main team-facing responsibility of the product leader is to ensure that the people doing the detailed design have a clear understanding of the overall product direction.

The concept of single focus of accountability is at the center of this issue. Too often, accountability is implemented as a prioritized list of product details (stories) rather than as communication of intended results (business relevant metrics). As a result, the expertise, creative input, and passion of team members is sacrificed to the false goal of a single point of responsibility.
___________________
Footnotes:
[1] “The Product Owner is responsible for the Product Backlog, its contents, its availability, and its prioritization.” “The Product Backlog represents everything necessary to develop and launch a successful product. It is a list of all features, functions, technologies, enhancements, and bug fixes that constitute the changes that will be made to the product for future releases.” From Scrum: Developed and sustained by Ken Schwaber and Jeff Sutherland; http://www.scrum.org/scrumguides/.

[2] “Value-Driven Development Principles and Values – Agility is the Tool, not the Master.” Agile Record, Issue 3, July 2010, pp 18-25. Available at www.agilerecord.com. Used with permission.


August 1, 2003

Concurrent Development

When sheet metal is formed into a car body, a massive machine called a stamping machine presses the metal into shape. The stamping machine has a huge metal tool called a die which makes contact with the sheet metal and presses it into the shape of a fender or a door or a hood. Designing and cutting these dies to the proper shape accounts for half of the capital investment of a new car development program, and drives the critical path. If a mistake ruins a die, the entire development program suffers a huge set-back. So if there is one thing that automakers want to do right, it is the die design and cutting.

The problem is that as the car development progresses, engineers keep making changes to the car, and these find their way into the die design. No matter how hard the engineers try to freeze the design, they are not able to do so. In Detroit in the 1980’s the cost of changes to the design was 30 – 50% of the total die cost, while in Japan it was 10 – 20% of the total die cost. These numbers seem to indicate that the Japanese companies must have been much better at preventing change after the die specs were released to the tool and die shop. But such was not the case.

The US strategy for making a die was to wait until the design specs were frozen, and then send the final design to the tool and die maker, which triggered the process of ordering the block of steel and cutting it. Any changes went through an arduous change approval process. It took about two years from ordering the steel to the time the die would be used in production. In Japan, however, the tool and die makers order up the steel blocks and start rough cutting at the same time the car design is starting. This is called concurrent development. How can it possibly work?

The die engineers in Japan are expected to know a lot about what a die for a front door panel will involve, and they are in constant communication with the body engineer. They anticipate the final solution and they are also skilled in techniques to make minor changes late in development, such as leaving more material where changes are likely. Most of the time die engineers are able to accommodate the engineering design as it evolves. In the rare case of a mistake, a new die can be cut much faster because the whole process is streamlined.

Japanese automakers do not freeze design points until late in the development process, allowing most changes to occur while the window for change is still open. When compared to the early design freeze practices in the US in the 1980’s, Japanese die makers spent perhaps a third as much money on changes and produced better die designs. Japanese dies tended to require fewer stamping cycles per part, creating significant production savings.

The significant difference in time-to-market and increasing market success of Japanese automakers prompted US automotive companies to adopt concurrent development practices in the 1990’s, and today the product development performance gap has narrowed significantly.

Concurrent Software Development
Programming is a lot like die cutting. The stakes are often high and mistakes can be costly, so sequential development, that is, establishing requirements before development begins, is commonly thought of as a way to protect against serious errors. The problem with sequential development is that it forces designers to take a depth-first rather than a breadth-first approach to design. Depth-first design forces low-level, dependent decisions to be made before the consequences of the high-level decisions have been experienced. The most costly mistakes are made by forgetting to consider something important at the beginning. The easiest way to make such a big mistake is to drill down to detail too fast. Once you start down the detailed path, you can’t back up, and you aren’t likely to realize that you should. When big mistakes can be made, it is best to survey the landscape and delay the detailed decisions.

Concurrent development of software usually takes the form of iterative development. It is the preferred approach when the stakes are high and the understanding of the problem is evolving. Concurrent development allows you to take a breadth-first approach and discover those big, costly problems before it’s too late. Moving from sequential development to concurrent development means starting programming the highest value features as soon as a high level conceptual design is determined, even while detailed requirements are being investigated. This may sound counterintuitive, but think of it as an exploratory approach which permits you to learn by trying a variety of options before you lock in on a direction that constrains implementation of less important features.

In addition to providing insurance against costly mistakes, concurrent development is the best way to deal with changing requirements, because not only are the big decisions deferred while you consider all the options, but the little decisions are deferred as well. When change is inevitable, concurrent development reduces delivery time and overall cost, while improving the performance of the final product.

If this sounds like magic – or hacking – it would be if nothing else changed. Just starting programming earlier, without the associated expertise and collaboration found in Japanese die cutting, is unlikely to lead to improved results. There are some critical skills that must be in place in order for concurrent development to work.

Under sequential development, US automakers considered die engineers to be quite remote from the automotive engineers; so too, programmers in a sequential development process often have little contact with the customers and users who have requirements and the analysts who collect requirements. Concurrent development in die cutting required US automakers to make two critical changes: the die engineer needed the expertise to anticipate what the emerging design would need in the cut steel, and had to collaborate closely with the body engineer.

Similarly, concurrent software development requires developers with enough expertise in the domain to anticipate where the emerging design is likely to lead, and close collaboration with the customers and analysts who are designing how the system will solve the business problem at hand.

The Last Responsible Moment
Concurrent software development means starting development when only partial requirements are known and working in short iterations that provide the feedback that causes the system to emerge. Concurrent development makes it possible to delay commitment until the Last Responsible Moment, that is, the moment at which failing to make a decision eliminates an important alternative. If commitments are delayed beyond the Last Responsible Moment, then decisions are made by default, which is generally not a good approach to making decisions.

Procrastinating is not the same as making decisions at the Last Responsible Moment; in fact, delaying decisions is hard work. Here are some tactics for making decisions at the Last Responsible Moment:

  • Share partially complete design information. The notion that a design must be complete before it is released is the biggest enemy of concurrent development. Requiring complete information before releasing a design increases the length of the feedback loop in the design process and causes irreversible decisions to be made far sooner than necessary. Good design is a discovery process, done through short, repeated exploratory cycles.
  • Organize for direct, worker-to-worker collaboration.  Early release of incomplete information means that the design will be refined as development proceeds. This requires that upstream people who understand the details of what the system must do to provide value must communicate directly with downstream people who understand the details of how the code works.
  • Develop a sense of how to absorb changes.  In ‘Delaying Commitment,’ IEEE Software (1988), Harold Thimbleby observes that the difference between amateurs and experts is that experts know how to delay commitments and how to conceal their errors for as long as possible. Experts repair their errors before they cause problems. Amateurs try to get everything right the first time and so overload their problem solving capacity that they end up committing early to wrong decisions. Thimbleby recommends some tactics for delaying commitment in software development, which could be summarized as an endorsement of object-oriented design and component-based development:
  • Use Modules.  Information hiding, or more generally behavior hiding, is the foundation of object-oriented approaches. Delay commitment to the internal design of the module until the requirements of the clients on the interfaces stabilize. 
  • Use Interfaces. Separate interfaces from implementations. Clients should not depend on implementation decisions.
  • Use Parameters.  Make magic numbers – constants that have meaning – into parameters. Make magic capabilities like databases and third party middleware into parameters. By passing capabilities into modules wrapped in simple interfaces, your dependence on specific implementations is eliminated and testing becomes much easier.
  • Use Abstractions.  Abstraction and commitment are inverse processes. Defer commitment to specific representations as long as the abstract will serve immediate design needs.
  • Avoid Sequential Programming.  Use declarative programming rather than procedural programming, trading off performance for flexibility. Define algorithms in a way that does not depend on a particular order of execution.
  • Beware of custom tool building.  Investment in frameworks and other tooling frequently requires committing too early to implementation details that end up adding needless complexity and seldom pay back. Frameworks should be extracted from a collection of successful implementations, not built on speculation.
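Thimbleby’s tactics can be sketched in a few lines of Python. This is an illustrative example only – the report/store scenario and all names are invented, not from the text – but it shows the interface and parameter tactics together: the client module commits only to a small interface and receives its “magic capability” as a parameter, so commitment to a specific database can be deferred and a test double substituted.

```python
from typing import Protocol

class Store(Protocol):
    """Interface only: clients depend on this, not on a concrete database."""
    def get(self, key: str) -> str: ...

class ReportModule:
    # The "magic capability" (some data store) arrives as a parameter
    # wrapped in a simple interface, and the magic number (page size)
    # is a parameter too, so both commitments are deferred.
    def __init__(self, store: Store, page_size: int = 50):
        self.store = store
        self.page_size = page_size

    def title_for(self, key: str) -> str:
        return self.store.get(key).upper()

class FakeStore:
    """Test double: no real database needed to exercise ReportModule."""
    def __init__(self, data: dict):
        self.data = data
    def get(self, key: str) -> str:
        return self.data[key]

report = ReportModule(FakeStore({"q3": "quarterly results"}))
print(report.title_for("q3"))  # QUARTERLY RESULTS
```

Swapping in a real store later changes only the object passed to `ReportModule`, not `ReportModule` itself – which is the point of delaying the commitment.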
Additional tactics for delaying commitment include:
  • Avoid Repetition. This is variously known as the Don’t Repeat Yourself (DRY) or Once And Only Once (OAOO) principle. If every capability is expressed in only one place in the code, there will be only one place to change when that capability needs to evolve and there will be no inconsistencies.
  • Separate Concerns.  Each module should have a single well defined responsibility. This means that a class will have only one reason to change. 
  • Encapsulate Variation. What is likely to change should be inside, the interfaces should be stable. Changes should not cascade to other modules. This strategy, of course, depends on a deep understanding of the domain to know which aspects will be stable and which variable. By application of appropriate patterns, it should be possible to extend the encapsulated behavior without modifying the code itself. 
  • Defer Implementation of Future Capabilities. Implement only the simplest code that will satisfy immediate needs rather than putting in capabilities you ‘know’ you will need in the future. You will know better in the future what you really need, and simple code will be easier to extend then if necessary.
  • Avoid extra features. If you defer adding features you ‘know’ you will need, then you certainly want to avoid adding extra features ‘just-in-case’ they are needed. Extra features add an extra burden of code to be tested and maintained, and understood by programmers and users alike. Extra features add complexity, not flexibility.
Much has been written on these delaying tactics, so they will not be covered in detail in this book.
  • Develop a sense of what is critically important in the domain.  Forgetting some critical feature of the system until too late is the fear that drives sequential development. If security, response time, or fail-safe operation is critically important in the domain, these issues need to be considered from the start; if they are ignored until too late, it will indeed be costly. However, the assumption that sequential development is the best way to discover these critical features is flawed. In practice, early commitments are more likely to overlook such critical elements than late commitments, because early commitments rapidly narrow the field of view.
  • Develop a sense of when decisions must be made.  You do not want to make decisions by default, or you have not delayed them. Certain architectural concepts such as usability design, layering and component packaging are best made early, so as to facilitate emergence in the rest of the design. A bias toward late commitment must not degenerate into a bias toward no commitment. You need to develop a keen sense of timing and a mechanism to cause decisions to be made when their time has come.
  • Develop a quick response capability. The slower you respond, the earlier you have to make decisions. Dell, for instance, can assemble computers in less than a week, so they can decide what to make less than a week before shipping. Most other computer manufacturers take a lot longer to assemble computers, so they have to decide what to make much sooner. If you can change your software quickly, you can wait to make a change until customers know what they want.
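As a sketch of the Encapsulate Variation and Separate Concerns tactics above (a made-up shipping example, not from the text): the stable part is a small interface, the parts likely to change live behind it, and adding a new rule extends the code without modifying the client.

```python
from abc import ABC, abstractmethod

class ShippingRule(ABC):
    """Stable interface; carrier-specific variation is encapsulated behind it."""
    @abstractmethod
    def cost(self, weight_kg: float) -> float: ...

class FlatRate(ShippingRule):
    def cost(self, weight_kg: float) -> float:
        return 5.0

class PerKilo(ShippingRule):
    def cost(self, weight_kg: float) -> float:
        return 1.5 * weight_kg

def quote(rule: ShippingRule, weight_kg: float) -> float:
    # Client code: never changes when a new rule class is added,
    # so changes do not cascade past the interface.
    return round(rule.cost(weight_kg), 2)

print(quote(FlatRate(), 3.0))  # 5.0
print(quote(PerKilo(), 3.0))   # 4.5
```

Each class has a single reason to change, and a new carrier’s pricing becomes one new subclass rather than edits scattered through the quoting code.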
Cost Escalation
Software is different from most products in that software systems are expected to be upgraded on a regular basis. On average, more than half of the development work on a software system occurs after it is first sold or placed into production. In addition to internal changes, software systems are subject to a changing environment – a new operating system, a change in the underlying database, a change in the client used by the GUI, a new application using the same database, etc. Most software is expected to change regularly over its lifetime, and in fact once upgrades are stopped, software is often nearing the end of its useful life. This presents us with a new category of waste: waste caused by software that is difficult to change.

In 1987 Barry Boehm wrote, “Finding and fixing a software problem after delivery costs 100 times more than finding and fixing the problem in early design phases”. This observation became the rationale behind thorough up-front requirements analysis and design, even though Boehm himself encouraged incremental development over “single-shot, full product development.” In 2001, Boehm noted that for small systems the escalation factor can be more like 5:1 than 100:1; and even on large systems, good architectural practices can significantly reduce the cost of change by confining features that are likely to change to small, well-encapsulated areas.

There used to be a similar, but more dramatic, cost escalation factor for product development. It was once estimated that a change after production began could cost 1000 times more than if the change had been made in the original design. The belief that the cost of change escalates as development proceeds contributed greatly to standardizing the sequential development process in the US. No one seemed to recognize that the sequential process itself could be the cause of the high escalation ratio. However, as concurrent development replaced sequential development in the US in the 1990’s, the cost escalation discussion was forever altered. The discussion was no longer about how much a change might cost later in development; it centered on how to reduce the need for change through concurrent engineering.

Not all change is equal. There are a few basic architectural decisions that you need to get right at the beginning of development, because they fix the constraints of the system for its life. Examples are the choice of language, architectural layering decisions, or the choice to interact with an existing database also used by other applications. These kinds of decisions might have the 100:1 cost escalation ratio. Because these decisions are so crucial, you should focus on minimizing the number of these high-stakes constraints. You also want to take a breadth-first approach to these high-stakes decisions.

The bulk of the change in a system does not have to have a high cost escalation factor; it is the sequential approach that causes the cost of most changes to escalate exponentially as you move through development. Sequential development emphasizes getting all the decisions made as early as possible, so the cost of all changes is the same – very high. Concurrent design defers decisions as late as possible. This has four effects:
  • Reduces the number of high-stakes constraints.
  • Gives a breadth-first approach to high-stakes decisions, making it more likely that they will be made correctly.
  • Defers the bulk of the decisions, significantly reducing the need for change.
  • Dramatically decreases the cost escalation factor for most changes.
A single cost escalation factor or curve is misleading. Instead of a chart showing a single trend for all changes, a more appropriate graph has at least two cost escalation curves, as shown in Figure 3-1. The agile development objective is to move as many changes as possible from the top curve to the bottom curve.

Figure 3-1. Two Cost Escalation Curves
Returning for a moment to the Toyota die cutting example, the die engineer sees the conceptual design of the car and knows roughly what size of door panel will be necessary. With that information, a big enough steel block can be ordered. If the concept of the car changes from a small, sporty car to a mid-size family car, the block of steel may be too small, and that would be a costly mistake. But the die engineer knows that once the overall concept is approved, it won’t change, so the steel can be safely ordered long before the details of the door emerge. Concurrent design is a robust design process because the die adapts to whatever design emerges.

Lean software development delays freezing all design decisions as long as possible, because it is easier to change a decision that hasn’t been made. Lean software development emphasizes developing a robust, change-tolerant design, one that accepts the inevitability of change and structures the system so that it can be readily adapted to the most likely kinds of changes.

The main reason why software changes throughout its lifecycle is that the business process in which it is used evolves over time. Some domains evolve faster than others, and some domains may be essentially stable. It is not possible to build in flexibility to accommodate arbitrary changes cheaply. The idea is to build tolerance for change into the system along domain dimensions that are likely to change. Observing where the changes occur during iterative development gives a good indication of where the system is likely to need flexibility in the future. If changes of certain types are frequent during development, you can expect that these types of changes will not end when the product is released. The secret is to know enough about the domain to maintain flexibility, yet avoid making things any more complex than they must be.
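
The idea of building tolerance for change along a likely-to-change domain dimension can be sketched in a few lines. This is a minimal, hypothetical Python illustration – the `PricingPolicy` and `checkout` names are invented, not from the text – showing how the rest of the system is insulated from a rule that is expected to evolve:

```python
# Sketch: confine a likely-to-change domain rule (how orders are priced)
# behind a stable interface, so callers need not change when it does.
from abc import ABC, abstractmethod

class PricingPolicy(ABC):
    """The domain dimension we expect to change over the system's life."""
    @abstractmethod
    def price(self, base: float) -> float: ...

class FlatPricing(PricingPolicy):
    def price(self, base: float) -> float:
        return base

class DiscountPricing(PricingPolicy):
    def __init__(self, rate: float):
        self.rate = rate
    def price(self, base: float) -> float:
        return base * (1 - self.rate)

def checkout(base: float, policy: PricingPolicy) -> float:
    # The caller is insulated from pricing changes; swapping in a new
    # policy requires no change here.
    return round(policy.price(base), 2)
```

Adding a new pricing rule later means writing one new class, not touching every caller – the change is confined to the encapsulated area, as the text suggests.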

If a system is developed by allowing the design to emerge through iterations, the design will be robust, adapting more readily to the types of changes that occur during development. More importantly, the ability to adapt will be built into the system, so that as more changes occur after its release, they can be readily incorporated. On the other hand, if systems are built with a focus on getting everything right at the beginning in order to reduce the cost of later changes, their design is likely to be brittle and not accept changes readily. Worse, the chance of making a major mistake in the key structural decisions is increased with a depth-first rather than a breadth-first approach.

March 18, 2002

Lean Design

For over a decade, a manufacturing metaphor has been used to bring about improvements in software development practices.  But even the originators of the metaphor recognize that it’s time for a change.  From SEI’s COTS-Based Systems (CBS) Initiative we hear:[1]
“Indeed, to many people software engineering and software process are one and the same thing. An entire industry has emerged to support the adoption of CMM or ISO-9000 models, and process improvement incentives have played a dominant role in defining roles and behavior within software development organizations. The resulting roles and behaviors constitute what we refer to as the process regime.
“The process regime was born of the software crisis at a time when even large software systems were built one line of code at a time. With some logic it established roles and behaviors rooted in a manufacturing metaphor, where software processes are analogous to manufacturing processes, programmers are analogous to assembly-line workers, and the ultimate product is lines of code. When viewed in terms of software manufacturing, improvements in software engineering practice are equated with process improvement, which itself is centered on improving programmer productivity and reducing product defects. Indeed, the manufacturing metaphor is so strong that the term software factory is still used to denote the ideal software development organization.

“The process regime might have proven adequate to meet the software crisis, or at least mitigate its worst effects, but for one thing: the unexpected emergence of the microprocessor and its first (but not last!) offspring, the personal computer (PC). The PC generated overwhelming new demand for software far beyond the  capacity of the conventional software factory to produce.

“The response to the growing gap between supply and demand spawned an impressive range of research efforts to find a technological “silver bullet.” The US government funded several large-scale software research efforts totaling hundreds of millions of dollars with the objective of building software systems “better, faster and cheaper.” While the focused genius of software researchers chipped away at the productivity gap, the chaotic genius of the free market found its own way to meet this demand—through commercial software components.

“The evidence of a burgeoning market in software components is irrefutable and overwhelming. Today it is inconceivable to contemplate building enterprise systems without a substantial amount of the functionality of the system provided by commercial software components such as operating systems, databases, message brokers, Web browsers and servers, spreadsheets, decision aids, transaction monitors, report writers, and system managers.

“As many organizations are discovering, the traditional software factory is ill equipped to build systems that are dominated by commercial software components. The stock and trade of the software factory—control over production variables to achieve predictability and then gradual improvement in quality and productivity—is no longer possible. The software engineer who deals with component-based systems no longer has complete control over how a system is partitioned, what the interfaces are between these partitions, or how threads of control are passed or shared among these partitions. Traditional software development processes espoused by the process regime and software factory that assume control over these variables are no longer valid. The process regime has been overthrown, but by what?

“Control has passed from the process regime to the market regime. The market regime consists of component producers and consumers, each behaving, in the aggregate, according to the laws of the marketplace.

“The organizations that have the most difficulty adapting to the component revolution are those that have failed to recognize the shift from the process to the market regime and the loss of control that is attendant in this shift. Or, having recognized the shift, they are at a loss for how to accommodate it.”
So it’s official: the manufacturing metaphor for software development improvement needs to be replaced. But with what?  Let’s look to Lean Thinking for a suggestion.

How Programmers Work
A fundamental principle of Lean Thinking is that the starting point for improvement is to understand, in detail, how people actually do their work.   If we look closely at how software developers spend their time, we see that they do these things in sequence:  {analyze–code–build–test}.   First they figure out how they are going to address a particular problem, then they write code, then they do a build and run the code to see if it indeed solves the problem, and finally, they repeat the cycle.  Many times.  This is how programmers work.

An interesting thing about software development is that this cycle:  {analyze–code–build–test}, occurs both in the large and in the small.  Every large section of software will pass through this cycle (many times), but so will every small section of code.  A developer may go through these steps several times a day, or even, many times per hour.  Generally there is no particular effort, nor any good reason, to get the code exactly right the first time.  Try it, test it, fix it is a far more efficient approach to programming than perfection in the first draft.  Just as writers go through several drafts to create a finished piece of work, so do software developers.
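
The try-it, test-it, fix-it cycle in the small can be sketched directly. This is a hedged illustration with an invented example (`word_count` and its sample text are hypothetical): a first draft, a quick test that exposes its flaw, and the fix that a developer would make on the next pass through the cycle:

```python
# A minimal illustration of the {analyze–code–build–test} micro-cycle:
# write a first draft, run a quick test, and let the failure drive the fix.

def word_count_draft(text: str) -> int:
    # First draft: split on single spaces only.
    return len(text.split(" "))

def word_count(text: str) -> int:
    # Second draft, after testing revealed that repeated spaces and
    # leading/trailing whitespace break the first version.
    return len(text.split())

# The quick check a developer might run many times an hour:
sample = "  try  it, test it,  fix it  "
assert word_count(sample) == 6
```

The first draft was not wasted effort; running it and watching it fail is what revealed the real requirement – which is the point of the cycle.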

The Wrong Metaphor
Because the software development cycle occurs both in the large and in the small, there have been attempts to divide the software development cycle and give each piece of the cycle to a different person.  So for instance, someone does the analysis, another person does the design, someone else writes code, a clerk does an occasional build, and QC people run tests.  This ‘assembly line’ approach to software development comes from the manufacturing metaphor, and basically, it just doesn’t work.

The reason a manufacturing metaphor does not work for development is that development is not sequential; it is a cycle of discovery. The {analyze–code–build–test} cycle is meant to be repeated, not to happen only once.  Further, as ideas and information move through this cycle, two things must be assured.  First, information must not be lost through handoffs, and second, feedback from the cycle must be as short as possible.

The manufacturing metaphor violates both of these requirements.  First of all, handing off information in a written format will convey at best half of the information known to those who write the documents.  The tacit knowledge buried in the minds of the writers simply does not make it into written reports.  To make matters worse, writing down information to pass along to the next step in the cycle introduces enormous waste and dramatically delays feedback from one cycle to the next. 

This second point is critically important.  The cycle time for feedback from the test phase of the cycle should be in minutes or hours; a day or two at the outside.  Dividing the development cycle among different functions with written communication between them stretches the cycle out to the point of making feedback difficult, if not impossible.

Some may argue with the premise that software development is best done by using a discovery cycle.  They feel that developers should be able to write code and “Get it Right the First Time”.  This might make sense in manufacturing, where people make the same thing repeatedly.  Software development, however, is a creative activity.  You would never want a developer to be writing the same code over and over again.  That’s what computers are for.

The Difference Between Designing and Making
Glenn Ballard of the Lean Construction Institute (LCI) sheds some light on this topic in a paper called “Positive vs Negative Iteration in Design”.  He draws a clear distinction between the two activities of designing and making.  He points out, “This is the ancient distinction between thinking and acting, planning and doing.  One operates in the world of thought; the other in the material world.”  Ballard summarizes the difference between designing and making in this manner:


The important thing to notice is that the goals of ‘designing’ and ‘making’ are quite different.  Designing an artifact involves understanding and interpreting the purpose of the artifact.  Making an artifact involves conforming to the requirements expressed in the design, on the assumption that the design accurately realizes the purpose.

A striking difference between designing and making is the fact that variability of outcomes is desirable during design, but not while making.  In fact, design is a process of finding and evaluating multiple solutions to a problem, and if there were no variability, the design process would not be adding much value.  As a corollary, Ballard suggests that iteration creates value in design, while it creates waste (rework) in making.  To put it another way, the slogan “Do it Right the First Time” applies to making something after the design is complete, but it should not be applied to the design process.

In the {analyze–code–build–test} cycle, notice that both analyzing and coding are design work.  There are many ways to create a line of code; individual developers are making decisions every minute they are writing code.  There is no recipe to tell them exactly how to do things; they are writing the recipe for the computer to follow.  It is not until we get to the ‘build’ stage of the cycle that we find ‘making’ activity.  And indeed, all of the rules of ‘making’ apply to a software build:  no one should break the build, and every build with the same inputs had better produce the same outputs.
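
That last rule – same inputs, same outputs – can be checked mechanically. A minimal sketch, with the build step simulated as a pure transformation and a content hash standing in for artifact comparison (all names invented for illustration):

```python
# Sketch of the 'making' rule for builds: identical inputs must yield
# identical outputs. A real build would compile sources; here it is
# simulated as a deterministic transformation, and a SHA-256 hash is
# used as the artifact's fingerprint.
import hashlib

def build(source: str) -> bytes:
    # Stand-in for a real build step.
    return source.strip().encode("utf-8")

def fingerprint(artifact: bytes) -> str:
    return hashlib.sha256(artifact).hexdigest()

# Two builds from the same source had better agree.
a = fingerprint(build("print('hello')\n"))
b = fingerprint(build("print('hello')\n"))
assert a == b
```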

Cycles
This brings us to the last step of the software development cycle:  test.  Is testing ‘designing’ or ‘making’, or yet a third element?  In fact, designing tests is a creative activity, often part of the design.  Further, the results of tests are continually fed back into the design to improve it.  So in a very real sense, the test step is ‘designing’, not ‘making’.  Moreover, the ‘test’ step is what causes the cycle to loop back and repeat; it is what makes development work into a cycle in the first place.  In making, it is not desirable to test and rework; in development, however, repeating the cycle is the essence of doing work.  Development is basically an experimental activity.

There are other well-known cycles that bear mentioning here, and all of them end with a step which causes the cycle to repeat.  Some examples are:
  1. The Scientific Method:  {Observe – Create a Theory – Predict from the Theory – Test the Predictions}  Graduate students know this well.
  2. The Development Approach:  {Discover – Assemble – Assess} This bears a striking (and not accidental) resemblance to the software development cycle.
  3. The Deming Cycle:  {Plan – Do – Check} – Act.  This is a three-step cycle {Plan – Do – Check}, followed by – Act once the cycle yields results.   A more complete definition of the Deming Cycle is: {Identify Root Causes of Problems – Develop and Try a Solution – Measure the Results}  Repeat until a solution is proven, then – Standardize the Solution.  Deming taught that all manufacturing processes should be continually improved using this cycle.
In Search of Another Metaphor
If software developers spend their days in a continual {design–code–build–test} cycle, we might gain insight if we find other workers who use a similar cycle.  In this quest we might eliminate workers in manufacturing, who are not involved in designing the product they produce.  On the other hand, in Lean Manufacturing, workers are continually involved in redesigning their work processes.  Despite this, it seems that software developers more closely resemble product designers than product makers, because a large portion of software development time involves designing the final product, both in the large and in the small.  But unlike product developers, software developers not only design, but also produce and test their product.

We might compare software developers to the skilled workers in construction, who often do a lot of on-site design before they actually produce work.  An electrician, for instance, must understand the use of the room to locate outlets, and must take framing, HVAC and plumbing into account when routing wires.   Software developers might also be thought of as artists and craftsmen, who routinely extend the design process right into the making process.

Learning Lessons from Metaphors
New Product Development, Skilled Construction Workers, Artists and Craftsmen – as we attempt to learn from these metaphors we must also take care not to go too far, as happened with the manufacturing metaphor.  The careful use of a metaphor involves abstracting to a common base between disciplines, and then applying the abstraction to the new discipline (software development) in a manner appropriate to the way work actually occurs in that discipline.   

Three useful abstractions come immediately to mind as we apply design and development metaphors to software development:

Abstraction 1:  Emphasize ‘Designing’ Values, not ‘Making’ Values
Code should not be expected to “Conform to Requirements” or be “Right the First Time”.  These are ‘making’ values.  Instead, software should be expected to be “Fit for Use” and “Realize the Purpose” of those who will be using it.   Disparaging software changes as ‘rework’ exemplifies the misuse of a ‘making’ value.  Since software development is mostly about designing, not making, the correct value for software development is precisely the opposite.  Iterations are good, not bad.  They lead to a better design.

Ballard states that:  “Designing can be likened to a good conversation, from which everyone leaves with a better understanding than anyone brought with them…  Design development makes successively better approaches on the whole design, like grinding a gem, until it gets to the desired point….”

Abstraction 2:  Compress the {Design–Code–Build–Test} Cycle Time
Once we recognize that the {design–code–build–test} cycle is the fundamental element of work in software development, the principles of lean thinking suggest that compressing this cycle will generate the best results.  Compressing cycle time makes feedback immediate, and thus allows a system to respond rapidly to both defects and change.

Based on this hypothesis, we may predict that the effectiveness of Extreme Programming comes from its dramatic compression of the {design–code–build–test} cycle.  Pair programming shortens design time because design reviews occur continuously, just as design occurs continuously.  Writing test code before production code radically reduces the time from coding to testing, since tests are run immediately after code is written.   The short feedback loop of the {design–code–build–test} cycle in all agile practices is a key reason why they produce working code very quickly.
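
A minimal sketch of the test-before-code practice described above, with an invented `slugify` example: the test exists before the production code, so the moment the code is written, running the test gives immediate feedback.

```python
# Test-first sketch: the test is written before the code it exercises,
# compressing the code-to-test portion of the cycle to near zero.
# The slugify function and its cases are invented for illustration.

def test_slugify():
    assert slugify("Lean Design") == "lean-design"
    assert slugify("  Two  Cost   Curves ") == "two-cost-curves"

# Production code written to make the pre-existing test pass:
def slugify(title: str) -> str:
    return "-".join(title.lower().split())

test_slugify()  # run immediately – feedback in seconds, not days
```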

Abstraction 3:  Use Lean Design Practices to Reduce Waste
Not all design iteration is good; iterations must add value and lead to convergence.  Many times a design will pass from one function to another, each adding comments and changes, causing more comments and changes, causing another round of comments and changes, in a never-ending cycle.  This kind of iteration does not produce value, and is thus ‘negative iteration’ or ‘waste’.

Ballard suggests the following ‘Lean Design’ techniques to reduce negative iteration, or in other words, obtain design convergence:
  1. Design Structure Matrix.   Steven Eppinger’s article “Innovation at the Speed of Information” in the January 2001 issue of Harvard Business Review suggests that design management should focus on information flows, not task completions, to achieve the most effective results. The Design Structure Matrix is a tool that answers the question: “What information do I need from other tasks before I can complete this one?”
  2. Cross-Functional Teams.  Cross-functional teams that collaborate and solve problems are today’s standard approach for rapid and robust design, with all interested parties contributing to decisions.  One thing to remember is to ‘let the team manage the team’.
  3. Concurrent Design / Shared Incomplete Information.    Sequential processing results in part from the assumption that only complete information should be shared.  Sharing incomplete information allows concurrent design to take place.  This both shortens the feedback loop and allows others to start earlier on their tasks.
  4. Reduced Batch Sizes.   Releasing small batches of work allows downstream work to begin earlier and provides for more level staffing.  It is also the best mechanism for finding and fixing problems early, while they are small, rather than after they have multiplied across a large batch.
  5. Pull Scheduling.  Ballard notes:  “The Lean Construction Institute recommends producing such a work sequence by having the team responsible for the work being planned to work backwards from a desired goal; i.e., by creating a 'pull schedule'. Doing so avoids incorporation of customary but unnecessary work, and yields tasks defined in terms of what releases work and thus contributes to project completion.”
  6. Design Redundancy. When it is necessary to make a design decision in order to proceed, but the task sequencing cannot be structured to avoid future changes, then the best strategy may be to choose a design to handle a range of options, rather than wait for precise quantification.  For example, when I was a young process control engineer, I used to specify all process control computers with maximum memory and disk space, on the theory that you could never have enough.  In construction, when structural loads are not known precisely, the most flexible approach is often to design for maximum load.
  7. Deferred Commitment / Least Commitment.  Ballard writes:  “Deferred commitment is a strategy for avoiding premature decisions and for generating greater value in design. It can reduce negative iteration by simply not initiating the iterative loop. A related but more extreme strategy is that of least commitment; i.e., to systematically defer decisions until the last responsible moment; i.e., until the point at which failing to make the decision eliminates an alternative. Knowledge of the lead times required for realizing design alternatives is necessary in order to determine the last responsible moment.”
  8. Shared Range of Acceptable Solutions  / Set-Based Design.   The most rapid approach to arriving at a solution to a design problem is for all parties to share the range of acceptable solutions and look for an overlap.  This is also called set-based design, and is widely credited for speeding up development at Toyota, decreasing the need for communication, and increasing the quality of the final products.
These eight Lean Construction techniques, particularly set-based design, are being tested in construction and are expected to result in dramatic improvements in design time (~50%) and construction time (~30%).  In addition, work can be leveled throughout the project, and better, more objective decisions are expected.
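
The first technique above, the Design Structure Matrix, can be approximated in a few lines: represent each task’s information needs as a map, sort it into a workable order, and treat a cycle as the signature of negative iteration. This is a hypothetical sketch using Python’s standard-library `graphlib`, not Eppinger’s actual tool, and the task names are invented:

```python
# Sketch of a Design Structure Matrix as an information-dependency map:
# each design task lists the tasks it needs information from.
from graphlib import TopologicalSorter, CycleError

needs = {
    "requirements": [],
    "interface":    ["requirements"],
    "database":     ["requirements"],
    "reports":      ["interface", "database"],
}

# A topological sort answers "what must I have before this task?" by
# producing an order in which information flows forward only.
order = list(TopologicalSorter(needs).static_order())

def has_negative_iteration(matrix: dict) -> bool:
    # Two tasks that each need information from the other form a cycle –
    # a sign that the design will churn unless the loop is restructured.
    try:
        list(TopologicalSorter(matrix).static_order())
        return False
    except CycleError:
        return True
```

Focusing on the information flows, as Eppinger suggests, means scheduling from `order` rather than from a task list, and restructuring wherever `has_negative_iteration` fires.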

The following two additional Lean Design techniques are particularly applicable to software development:
  1. Frequent Synchronization.  It is widely recognized in software development that daily (or more frequent) builds with automated testing are the best way to build a robust system rapidly.
  2. The Simplest ‘Spanning Application’ Possible.  This is a software development technique particularly good for testing component ensembles and legacy system upgrades. The idea is not to implement module-by-module, but to implement a single thread across the entire system, so as to test the interactions of all parts of a system along a narrow path.
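
A minimal sketch of a spanning application: every layer is stubbed, and a single request is driven through all of them, exercising the interactions along one narrow path. All layer names and data here are invented for illustration:

```python
# Spanning-application sketch: one thin thread through every layer,
# rather than any single layer built out in full.

def parse(raw: str) -> dict:          # input layer
    return {"id": int(raw)}

def lookup(req: dict) -> dict:        # data layer (stubbed store)
    store = {1: "widget", 2: "gadget"}
    return {"id": req["id"], "name": store[req["id"]]}

def render(rec: dict) -> str:         # output layer
    return f"#{rec['id']}: {rec['name']}"

def span(raw: str) -> str:
    # One thread of control across the entire system; each layer can
    # later be deepened independently once the interactions are proven.
    return render(lookup(parse(raw)))
```
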
____________________

[1] From a draft version of Chapter 1 of Building Systems from Commercial Components (Addison-Wesley, 2001) by Kurt Wallnau, Scott Hissam, and Robert Seacord; downloaded from the SEI COTS-Based Systems Initiative website.


Screen Beans Art, © A Bit Better Corporation