Lean Essays: Friction

One third of the fuel that goes into a car is spent overcoming friction. By comparison, an electric car loses half as much energy - one sixth - to friction. Who knew electric cars had such an advantage?

Friction is the force that resists motion when the surface of one object comes into contact with the surface of another. You can imagine parts moving against one another in cars, but do you ever wonder what happens when your products and services come in contact with customers? Might this create some friction? Could there be competing offerings that create considerably less friction? If so, you can be sure your customers will find the low friction offering more attractive than yours.

Friction in the Customer Journey

Think of friction as the cognitive overhead that a system places on those who use it. Let’s consider the friction involved in taking a taxi from an airport to a hotel. When I arrive at most airports, I get in a taxi queue, heeding the conspicuous warnings not to ride with unauthorized drivers. When I reach the front of the queue, I take the next taxi in line, and I assume that the cost is the same no matter which taxi I take. But this is not true in Stockholm, where taxis can charge any rate they wish simply by posting it in the window. Nor is it true in many other locations, so I have learned to research the taxi systems in every city I visit. That’s cognitive load. I also bring enough local currency to pay for a taxi ride to my hotel and check on whether a tip is expected. More cognitive load.

Uber set out to remove the friction from taking a taxi by reimagining the entire experience, from hailing to routing to paying; from drivers and cars to insurance and regulations. By removing as many points of friction as possible for riders, Uber has become wildly popular in a very short time. In January 2015, four years after launch, Uber reported that its revenue its home city of San Francisco had grown to three times the size of the entire taxi market in that city. Uber has recently opened a robotics center in Pittsburgh and joins Google in working to create a practical driverless car. Its intent is to bring the cost and convenience of ride services to a point where owning a car becomes the expensive option.

Full Stack Startups

Uber is among the largest of a new crop of startups – investor Chris Dixon calls them full stack startups – that bypass incumbents and reinvent the entire customer experience from start to finish with the aim of making it as frictionless as possible. Full stack startups focus on creating a world that works the way it should work, given today’s technology, rather than optimizing the way it does work, given yesterday’s mental models. Because these companies are creating a new end-to-end experience, they rarely leverage existing capabilities aimed at serving their market; they develop a “full stack” of new capabilities.

"The challenge with the full stack approach is you need to get good at many different things: software, hardware, design, consumer marketing, supply chain management, sales, partnerships, regulation, etc. The good news is that if you can pull this off, it is very hard for competitors to replicate so many interlocking pieces." Chris Dixon

Large companies have the same full stack of capabilities as startups, but these capabilities lie in different departments and there is friction at every department boundary. Moreover, incumbents are deeply invested in the way things work today, so large incumbent companies are usually incapable of truly reinventing a customer journey. As hard as they try to be innovative, incumbents tend to be blind to the friction embedded in the customer journey that they provide today.

Consider banks. They have huge, complex back end systems-of-record that are expensive to maintain and keep secure. But customers expect mobile access to their bank accounts, so banks have added “front end teams” to build portals (mobile apps) to access the back end systems. Typically banks end up with what Gartner calls “Bimodal IT.” One part of IT handles the backend systems using traditional processes, while a different group uses different processes to deliver web and mobile apps. As a result, the front end teams are not able to reimagine the customer journey; they are locked into the practices and revenue models embedded in the back end systems. So in the end, banks have done little to change the customer journey, the fee structure, or anything else fundamental to banking.

For example, in the US it is nearly impossible for me to transfer money to my granddaughter’s bank account without physically mailing a check or paying an exorbitant wire transfer fee. Not only that, but I cannot use my chip-and-pin card at many places in Europe because US banks don’t let me enter a pin with my card (they still depend on signatures!), while unstaffed European kiosks always require a pin. Anyone who banks in Europe would find US banking practices archaic. I find that they generate a lot of friction and I expect that they cost a lot of money.

Creative Friction

Why do banks adopt Bimodal IT? According to Gartner, “There is an inherent tension between doing IT right and doing IT fast.” I respectfully disagree; there is nothing inherently wrong about being fast. In fact, when software development is done right, speed, quality and low cost are fully compatible. Hundreds of enterprises, including Amazon and Google (whose systems manage billions of dollars of revenue every month), have demonstrated that the safest approach to software development is automated, it is adaptive, and it is fast.

It is true that there is tension between different disciplines: front end and back end; dev and ops; product and technology. But the best way to leverage these tensions is not to separate the parties, but to put them together on the same team with a common goal. You will never have a great product, or a great process, without making tradeoffs – that is the nature of difficult engineering problems. If your teams lack multiple perspectives on a problem, they will be unable to make consistently good tradeoff decisions, and their results will be mediocre.

Friction in the Code

The Prussian general and military theorist Carl von Clausewitz (1780-1831) thought of friction as the thing which tempers the good intentions of generals with the reality of the battlefield. He was thinking of the friction caused by boggy terrain that horses cannot cross, soldiers exhausted by heat and heavy burdens, fog that obscures enemy positions, supplies that don’t keep pace with military movements. He noted that battalions are made up of many individuals moving at a different rates with different amounts of confusion and fear, each one affecting the others around him in unpredictable ways. It is impossible for the thousands of individual agents on the battlefield to behave exactly according to a theoretical plan, Clausewitz wrote. Unless generals have actually experienced war, he said, they will not be able to account for the accumulated friction created by all of these agents interacting with each other and their environment.

Anyone who has ever looked closely at a large code base would be forgiven for thinking that Clausewitz was writing about software systems. Over time, any code base acquires lots of moving parts and increasing amounts of friction develops between these parts, until eventually the situation becomes hopeless and the system is either replaced or abandoned. Unless, of course, the messy parts are systematically cleaned up and friction is kept in check. But who is allowed to take time for this sort of refactoring if the decision-makers have never written any code, never been surprised by hidden dependencies, never been bitten by the unintended consequences of seemingly innocuous changes?

Failure

Not long ago the New York Stock Exchange was shut down for half a day due to “computer problems.” It’s not uncommon for an airline reservation systems suffer from “computer problems” so severe that planes are grounded. But we don’t expect to hear about “computer problems” at Twitter or Dropbox or Netflix or similar systems – maybe they had problems a few years ago, but they seem to be reasonably reliable these days. The truth is, cloud-based systems fail all the time, because they are built on unreliable hardware running over unreliable communication links. So they are designed to fail, to detect failure, and to recover quickly, without interrupting or corrupting the services they provide. They appear to be reliable because their robust failure detection and recovery mechanisms isolate users from the unreliable infrastructure.

The first hint of this approach was Google’s early strategy for building a server farm. They used cheap off-the-shelf components that would fail at a known rate, and then they automated failure detection and recovery. They replicated server contents so nothing was lost during a failure, and they automated the monitoring, detection, and recovery process. Amazon built its cloud with the same philosophy – they knew that at the scale they intended to pursue, everything would fail sooner rather than later, so automated failure detection and recovery had to be designed into the system.

Designing failure recovery into a system requires a special kind of software architecture and approach to development. To compensate for unreliable communication channels, messaging is usually asynchronous and on a best-efforts basis. Because servers are expected to fail, interfaces are idempotent so you get the same results on a retry as you get the first time. Since distributed data may not always match, software is written to deal with the ambiguities and produce eventual consistency.

Fault tolerance is not a new concept. Back in the days before solid state components, computer hardware was expected to fail, so vast amounts of time and energy were dedicated to failure detection and recovery. My first job was programming the Number 2 ESS (Electronic Switching System) being built in Naperville, IL by Bell Labs about the time I got out of college. This system, built out of discrete components prior to the days of integrated circuits, had a design goal of a maximum downtime of two hours in forty years. The hardware was completely duplicated and easily half of the software was dedicated to detecting faults, switching out the bad hardware, and identifying the defective discrete component so it could be replaced. This allowed a system built on unreliable electronic components to match the reliability of the electro-mechanical switching systems that were commonly in use at the time.

Situational Awareness

Successful cloud-based systems have a LOT of moving parts – that pretty much comes as a byproduct of success. With all of these parts moving around, designing for failure hardly seems like an adequate explanation for the robustness of these systems. And it isn’t. At the heart of a reliable cloud-based system are small teams (you might call them “full stack” teams) of people who are fully responsible for their piece of the system. They pay attention to how their service is performing, they fix it when it fails, and they continuously improve it to better serve its consumers.

Full stack teams that maintain end-to-end responsibility for a software service do not fit the model we used to have of the “right” way to develop software. These are not project teams that write code according to spec, turn it over to testing, and disband once it’s tossed over the wall to operations. They are engineering teams that solve problems and make frequent changes to the code they are responsible for. Code bases created and maintained by full stack teams are much more resilient than the large and calcified code bases created by the project model precisely because people pay attention to (and change!) the internal workings of “their” code on an on-going basis.

Limited Surface Area

Clearly, many small teams making independent changes to a large code base can generate a lot of Clausewitzian friction. But since friction occurs when the surfaces of two objects come in contact with each other, strictly limiting the surface area of the code exposed by each team can dramatically reduce friction. In cloud-based systems, services are designed to be as self-contained as possible and interactions with other services are strictly limited to hardened interfaces. Teams are expected to limit changes to the surface area (interfaces) of their code and proactively test any changes that might make it through that surface to other services.

Modern software development includes automated testing strategies and automated deployment pipelines that take the friction out of the deployment process, making it practical and safe to independently deploy small services. Containers are used to standardize the surface area that services expose to their environment, reducing the friction that comes from unpredictable surroundings. Finally, when small changes are made to a live system, the impact of each change is monitored and measured. Changes are typically deployed to a small percentage of users (limiting the deployment surface area), and if any problems are detected small changes can be rolled back quickly. We know that the best way to change a complex system is to probe and adapt, and we know that software systems are inherently complex. This explains why the small rapid deployments common in cloud-based systems turn out to be much safer and more robust than the large releases that we used to think were the “right” way to deliver software.

Shared Learning

Do you ever wonder how the sophisticated testing and deployment tools used at companies like Netflix actually work? Would you like to know how Netflix stores and analyzes data or how it monitors the performance of its platform? Just head over to the Netflix Open Source Center on GitHub; it’s all there for you to see – and use if you’d like. Want to analyze a lot of data? You will undoubtedly consider Hadoop, originally developed at Yahoo! based on Google research papers, open sourced through Apache, and now at the core of many open source tools that abstract its interface and extend its capability.

The world-wide software engineering community has developed a culture of sharing intellectual property, in stark contrast to the more common practice of keeping innovative ideas and novel tools proprietary. The rapid growth of large, reliable, secure software systems can be directly linked to the fact that software engineers routinely contribute to and build upon the work of their world-wide colleagues. Because of this, methods and tools for building highly reliable complex software systems have advanced extraordinarily quickly and are widely available.

Friction in the Process

Between 2004 and 2010, the FBI tried twice to develop an electronic case management system, and it failed both times, squandering hundreds of millions of dollars. UK’s National Health system lost similar amounts of money on a patient booking system that was eventually abandoned, and multiple billions of pounds on a patient record system that never worked. In 2012 Sweden decided to scrap and rewrite PUST, a police automation system that actually worked quite well, but not well enough for those who chose to have it rewritten the “right” way. The rewrite never worked and was eventually abandoned, an expensive fiasco that left the police without any system at all.

I could go on and on – just about every country has its story about an expensive government-funded computer system that cost extraordinary amounts of money and never actually worked. The reason? Broadly speaking, these fiascoes are caused by the process most governments use to procure software systems – a high friction process with a very high rate of failure.

One country that does not have an IT fiasco story is Estonia, probably the most automated country in the world. A few years ago British MP Francis Maude visited Estonia to find out how they managed to implement such sophisticated automation on a small budget. He discovered that Estonia automated its government because it had such a small budget, and properly automated government services are much less expensive than their manual counterparts.

Estonia’s process is simple: small internal teams work directly with consumers, understand their journey, and remove the friction. Working software is delivered in small increments to a small number of consumers, adjustments are made to make it work better, and once things work well the new capability is rolled out more broadly. Then another capability is added in the same manner, and thus the system grows steadily in small steps over time. (Incidentally, when this process is used, it is almost impossible to spend a lot of money only to find out the system doesn’t work.)

The UK government formed a consortium with Estonia and three other countries (called the Digital 5) to “provide a focused forum to share best practice [and] identify how to improve the participants’ digital services.” Maude started up the UK’s Government Digital Services, where small internal teams focus on making the process of obtaining government information and services as frictionless as possible. If you want to see how the UK Government Digital Services actually works, check out its Design Principles which summarize a new mental model for creating digital services, and the Governance approach, which outlines an effective, low friction software development process.

The HealthCare.gov fiasco in the US in 2013 led to the creation of US Digital Services, which is working in partnership with UK Digital Services to rethink government software development and delivery strategies. The US Digital Services Playbook is a great place for any organization to find advice on implementing a low friction development process.

DIGITAL SERVICE PLAYS:

Understand what people need

Address the whole experience, from start to finish

Make it simple and intuitive

Build the service using agile and iterative practices

Structure budgets and contracts to support delivery

Assign one leader and hold that person accountable

Bring in experienced teams

Choose a modern technology stack

Deploy in a flexible hosting environment

Automate testing and deployments

Manage security and privacy through reusable processes

Use data to drive decisions

Default to open

US Digital Services Playbook

The New Mental Model

The UK government changed – seemingly overnight – from high friction processes orchestrated by procurement departments to small internal teams governed by simple metrics. Instead of delivering “requirements” that someone else thinks up, teams are required to track four key performance indicators and figure out how to move these metrics in the right direction over time.

UK Digital Service’s four core KPIs:

Cost per transaction

User satisfaction

Completion rate

Digital take-up

See Gov.UK’s Performance Dashboard.

This is an entirely new mental model about how to develop effective software – one that removes all of the intermediaries between an engineering team and its consumers. It is a model that makes no attempt to define requirements, make estimates, or limit changes; instead it assumes that digital services are best developed through experimentation and require on-going improvement.

This is the mental model used by those who developed the first PUST system in Sweden, the one that was successful and appreciated by the police officers who used it. But unfortunately, conventional wisdom said it was not developed the “right” way, so the working system was shut down and rebuilt using the old mental model. And thus Sweden snatched failure from the jaws of success, proving once again that when it comes to developing interactive services, the old mental model simply Does. Not. Work.

Unexpected Points of Friction

It turns out that when governments move from the old mental model to the new mental model, many of the things that were considered “good” or “essential” in the past turn out to be “questionable” or “to be avoided” going forward. It’s a bit jarring to look at the list of good ideas that should be abandoned, but when you consider the friction that these ideas generate, it’s easier to see why forward-looking governments have eliminated them.

1. Requirements generate friction. The concept that requirements are specified by [someone] and implemented by “the team” has to be abandoned. Rather a team of engineers should explore hypotheses, testing and modifying ideas until they are proven or abandoned. Engineering teams should be expected to figure out how to make a positive impact on business metrics within valid constraints.

2. Handovers generate friction. The engineering team should have direct contact with at least a representative sample of the people whose journey they are automating. Just about any intermediary is problematic, whether the go-between is a procurement officer, business analyst, or product owner.

3. Organizational boundaries generate friction. There is a reason why the UK and US use internal teams to develop Digital Services. Going through a procurement office creates insurmountable friction – especially when procurement is governed by laws passed in the days of the old mental model. The IT departments of enterprises often generate similar friction, especially when they are cost centers.

4. Estimates generate friction. Very little useful purpose is served by estimates at the task level. Teams should have a good idea of their capacity by measuring the rate at which they complete their current work or the time it takes work to move through their workflow. Teams should be asked "What can be completed in this time-frame?" rather than "How long will this take?" The UK Digital Services funds service development incrementally, with a general time limit for each phase. If a service does not fall within the general time boundaries, it is usually broken down into smaller services.

5. Multitasking generates friction. Teams should do one thing at a time and get it done, because task switching burns up a lot of cognitive overhead. Moreover, partially done work that has been put aside gums up the workflow and slows things down.

6. Backlogs generate friction. A long "to do" list takes time to compile and time to prioritize, while everything on the list grows old and whoever put it there grows impatient. Don't prioritize - decide! Either the capacity exists to do the work, or it doesn't. Teams need only three lists: Now, Next, and Never. There is no try.

If Governments can do it, so can Enterprises

If governments can figure out how to design award-winning services [GOV.UK won the Design Museum Design of the Year Award in 2013] while moving quickly and saving money, surely enterprises can do the same. But first there is a lot of inertia to overcome. Once upon a time, governments assumed that obtaining software systems through a procurement process was essential, because it would be impossible to hire the people needed to design and develop these systems internally. They were wrong. They assumed that having teams scattered about in various government agencies would lead to a bunch of unconnected one-of systems. They were wrong. They were afraid that without detailed requirements and people accountable for estimates, there would be no governance. They were wrong. Once they abandoned these assumptions and switched to the low friction approach pioneered by Estonia, governments got better designs, more satisfied consumers, lower cost, and far more predictable results.

Your organization can reap the same benefits, but first you will have to check your assumptions at the door and question some comforting things like requirements, estimates, IT departments, contracts, backlogs – you get the idea. Read the US Digital Services Playbook. Could you run those 13 plays in your organization? If not, you need to uncover the assumptions that are keeping you in the grasp of the old mental model.

August 19, 2015

Friction