
June 11, 2022

When Demand Exceeds Capacity

We are getting solar panels on our roof. Someday. We signed a contract last November and the original estimate for installation was May. Since no one does outside work in the winter here in Minnesota, this was about two months after the earliest possible installation date. But May has come and gone, and we still do not have an installation date. Why? The electrical design for our 40-year-old house is complex, and the young solar company has limited electrical design capability. So does our mature electrical utility company. Between the two of them, our project has been held up for a long time.

Solar systems have become widely popular since the electrical grid meltdown in Texas and the sharp rise in energy prices. So it is not a surprise that the nascent solar industry is dealing with far more demand than the young solar companies and staid utility companies are equipped to handle. It is difficult to find enough talented people to do the complex system design, project management, and installation of solar systems. Thus, customers like us experience long delays that cost months of potential savings, while the solar companies suffer months of delayed revenue.

Our solar company has an eager sales force and a handful of project managers. Once contracts are signed, salespeople hand them off to project managers so that the sales force can focus on pursuing more sales. Since the rate of signed contracts greatly exceeds the rate of completed installations, the number of customers assigned to each project manager is constantly increasing. It is impossible for a project manager to manage dozens of complex solar projects at the same time, but that is what they are expected to do.

Clearly, our solar company is unwilling to reduce its sales rate to match its project completion rate and unable to increase the completion rate to match the rate at which contracts are signed. Thus, the length of time a project takes continually increases, putting customers and project managers in a lose-lose situation. In fact, the solar industry is hardly the first industry in the world to experience far more demand than it has the capacity to supply; many other industries have navigated these waters. Typically, companies that do not learn how to manage excess demand end up failing because they generate annoyed customers faster than they generate revenue. Companies that became giants in rapidly growing industries – for example, Airbnb, AWS, Google, Netflix, and SpaceX – learned quickly to increase their delivery capacity to match the rate at which they generated sales. Successful solar companies will have to learn the same thing.

How do companies rapidly increase delivery capacity? Hiring more people might help, but it is not nearly enough. Young companies in a growth industry must figure out how to complete more jobs faster while doing less work or they will not be able to keep up with the rapid increase in demand. There are three essential strategies that successful industry giants have used to increase efficiency which other companies would do well to understand.

1. Focus on Customer Outcomes

The first strategy for increasing efficiency is to pay attention to the metrics that drive behavior. Startups tend to measure success by counting the number of sales, while more efficient companies tend to measure the number of satisfied customers. When success is measured by the number of satisfied customers rather than booked revenue, several good things happen. The company learns not to accept work at a faster rate than it can complete work, because doing so does little more than stretch out the time it takes for its people to be successful while lengthening the queue of increasingly impatient customers. Work teams also learn to focus on completing installations as quickly as possible, because more installations will be seen as better performance. Finally, teams are unlikely to develop products that customers are not interested in – a very common source of wasted effort in startup companies.

It is extraordinarily difficult for an ambitious young company to turn away people who are beating a path to its door. Everyone’s instinct is to hire more salespeople and let the work pour in. But let’s face it: the most important thing a young company needs to learn is how to COMPLETE work quickly, not how to ACCEPT work quickly. If it opens the doors wide and lets everyone in, there will be so many customers milling about that workers will be distracted trying to deal with the chaos rather than focused on getting work done. The most important thing for that young company to do at this point is to stop spending so much time on managing excess work and start concentrating on how to get jobs done as fast as they arrive.

Every growth industry has gone through this learning curve. Once upon a time, mail order catalogs (Sears, for example) accepted orders as fast as they could and shipped them in about two weeks. Because of the delay, they had to deal with people changing their minds, so a complex order change system was necessary. Then L.L.Bean had a novel idea – why not ship products within 24 hours of receiving the order? It worked – customers were delighted, and no order change system was needed. Today almost all fulfillment systems work on the same principle – focus on building the capacity to ship product at the same rate orders arrive. Once shipping capacity matches demand, the company can ship immediately after an order arrives, delighting customers and reducing congestion while giving customers little time to change their minds.

2. Manage FLOW, not tasks

The second strategy for increasing efficiency is counterintuitive; it requires a shift in the company’s mental model of how work gets done. Consider our solar company. As a small startup, it thinks of every project as a separate entity and assigns a project manager to sequence the steps in an installation so all necessary work gets done. Unfortunately, this approach is not readily scalable, and even if it were, there are not enough experienced project managers available to hire. Industry leaders do things differently; they manage packages of items as they move through their system, not individual projects or tasks. 

What does it mean to manage groups of items?  Consider the shipping industry.  Until the 1950’s, ships were loaded with individual items packed carefully to keep the ship stable. Each item had to be traced through every port as it moved off the ship and onto another ship or a train or a truck. Then the shipping container was invented, and it was no longer necessary to manage individual items moving through the supply chain. Instead, shippers managed containers as they moved from origin to destination.  

Some years ago, we bought a sofa that was heavily discounted because of a missing headrest. A few months later the store called to say it was filling a container of furniture from their supplier in Norway and they could add our missing headrest for a small fee. It would have been too complex (and expensive) to order and track the headrest individually; but it was simply assigned to the container in which many items moved as a package from Norway to Minnesota. The cost of shipping was negligible. It's easy to see how containers created a reduction in complexity that dramatically reduced the cost of shipping goods and led to substantial changes in the global economy. 

How do you group and manage workflow in a growing solar company? First, you think of a contract as a number of steps that need to be completed. Certainly, some of the steps can be done in parallel – for example, solar panel layout can be done in parallel with electrical design. Other steps have prerequisites that must be completed first. You organize work so that each step has its own responsible team, its own queue, and its own completion rate. As each new contract is signed, it is coded with the various steps it must move through, including any required sequencing. Then you release the contract to the starting step (or steps). When a step is done, work should automatically move to the next step(s) until all work is complete. Your job is to manage the FLOW of work through the system rather than the progress of individual items through their various steps.
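
To make this concrete, here is a minimal sketch (in Python) of what step-based flow might look like. The step names, prerequisites, and details are invented for illustration – a real system would derive them from the signed contract:

    from collections import deque

    # Hypothetical solar-project steps and their prerequisites.
    STEP_PREREQS = {
        "site_survey": [],
        "panel_layout": ["site_survey"],        # parallel with electrical design
        "electrical_design": ["site_survey"],
        "permit": ["panel_layout", "electrical_design"],
        "install": ["permit"],
    }

    # Each step has its own responsible team and its own FIFO queue.
    queues = {step: deque() for step in STEP_PREREQS}

    def ready_steps(contract):
        """Steps still to do whose prerequisites are all complete."""
        return [s for s in contract["remaining"]
                if all(p in contract["done"] for p in STEP_PREREQS[s])]

    def release(contract):
        """Send a newly signed contract to its starting step(s)."""
        for step in ready_steps(contract):
            queues[step].append(contract)

    def complete(step, contract):
        """Called by a step's team when it finishes a job it popped from
        its queue; the work then routes itself onward automatically."""
        contract["done"].add(step)
        contract["remaining"].remove(step)
        for nxt in ready_steps(contract):
            if contract not in queues[nxt]:
                queues[nxt].append(contract)

    contract = {"id": 42, "done": set(), "remaining": set(STEP_PREREQS)}
    release(contract)                      # lands in the site_survey queue
    complete("site_survey", contract)      # layout and design queue up in parallel

Notice that no project manager appears anywhere in the sketch: when a team finishes a step, the work moves by itself to whatever steps become ready.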

How do you manage the flow of work through a system? Start by managing the amount of time a job spends waiting for attention in the queue of each step. Queueing theory is clear: if you limit the length of a queue and send work through in FIFO (first-in-first-out) order, then the time a job waits in the queue is the number of items in the queue divided by the average completion rate. For example, if you limit the items in a work queue to twenty and the team completes two items per day, work will wait in the queue – if it is FIFO – for no more than ten days. (Note: If a FIFO queue is not acceptable, then the queue is too long! Make it shorter.)
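
The arithmetic is worth sanity-checking in a couple of lines. The figures below are just the assumed numbers from the example above:

    # FIFO wait time = items in queue / average completion rate
    queue_limit = 20        # items allowed to wait in this step's queue
    completion_rate = 2.0   # items the team completes per day

    max_wait_days = queue_limit / completion_rate
    print(max_wait_days)    # 10.0 -- a capped FIFO queue puts a hard bound on waiting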

Once a job reaches the end of a queue, it should flow quickly through the work step. The best approach is for a responsible team to begin a job, focus on it until it is done, then go on to the next job. Focusing on one thing at a time improves the quality of the work, eliminates wasted time due to multitasking, and gets everything done faster. (See Single-Responsibility Teams, below.)

If a step in the workflow is completed inside of your company, you have a lot of control over the rate at which work moves through the step (queue time + work time). Your goal is to create short, predictable completion times for every internal step. If a step occurs outside your company (the utility company, for example, or the city permitting process) you do not have control of the time that step takes, but you can measure it – you should track average time and variation. Note that it is important to separate internal steps from external steps – otherwise variations in the external completion rate will make it impossible to control internal rates of completion.
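
Keeping internal and external measurements separate can be as simple as maintaining two sets of duration histories. A sketch, with invented step names and durations:

    import statistics

    # Days each recent job spent in a step; internal and external steps
    # are tracked separately so outside variation never pollutes the
    # internal numbers. All figures are made up.
    durations = {
        ("internal", "electrical_design"): [3, 4, 3, 5],
        ("external", "utility_approval"):  [20, 35, 15, 60],
    }

    for (kind, step), days in durations.items():
        print(kind, step,
              "avg:", round(statistics.mean(days), 1),
              "stdev:", round(statistics.stdev(days), 1))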

It is also important to ensure that downstream steps have capacity in their queues to handle work as it is completed upstream. If this sounds complicated, an analogy might be helpful: Consider an order coming into Amazon. Once upon a time, orders were handled by a central database which directed each step in the processing of orders as they moved through the company. But in the early 2000s Amazon outgrew this process; it could not buy database computers large enough to keep up with surges in demand. Amazon needed a new approach, so it abandoned the central database and thought of each order as moving through a series of independent services. When an order is placed, it is coded with an initial chain of necessary services (Do we need to do a credit check? Does the product require special shipping? ...) and sent on its way through the various processing steps. A service can divert the order or add additional steps (Oops, product is out of stock! ...) and downstream processes are expected to have the capacity to accept work from upstream processes. (If the product can be picked from a warehouse, it should be possible to package and ship it immediately.)
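
Here is a toy version of that chain-of-services idea – nothing like Amazon's real systems, just the shape of the pattern. Note how a service can divert an order by inserting a step ahead of the remaining chain:

    from collections import deque

    def credit_check(order):
        pass                                         # approved; order continues

    def pick_item(order):
        if order.get("out_of_stock"):
            order["chain"].appendleft("backorder")   # divert: add a step mid-flight

    def backorder(order):
        pass                                         # wait for stock, then continue

    def ship(order):
        order["shipped"] = True

    SERVICES = {"credit_check": credit_check, "pick_item": pick_item,
                "backorder": backorder, "ship": ship}

    def process(order):
        # Downstream services are expected to have capacity to accept work.
        while order["chain"]:
            step = order["chain"].popleft()
            SERVICES[step](order)

    order = {"id": 7, "out_of_stock": True,
             "chain": deque(["credit_check", "pick_item", "ship"])}
    process(order)
    print(order["shipped"])    # True -- the backorder step ran before shipping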

Armed with a set of steps each contract needs and the time each step will take, you, like Amazon, should be able to predict the expected installation date very early – ideally when the contract is signed. You should also have a handle on variation due to outside queues and dependencies such as parts availability and supply chain delays. You can send the contract into your workflow, track how well it is coming along and flag exceptions as soon as they occur. 
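
With a step graph and an estimated duration for each step, the predicted date is simply the longest path through the graph. A sketch, reusing the invented steps from the earlier example (durations in days are made up; external steps just get bigger numbers and wider error bars):

    from functools import lru_cache

    PREREQS = {"site_survey": [], "panel_layout": ["site_survey"],
               "electrical_design": ["site_survey"],
               "permit": ["panel_layout", "electrical_design"],
               "install": ["permit"]}
    DAYS = {"site_survey": 2, "panel_layout": 3, "electrical_design": 5,
            "permit": 30, "install": 2}

    @lru_cache(maxsize=None)
    def finish(step):
        """Earliest day a step can finish: its duration after all prereqs finish."""
        return DAYS[step] + max((finish(p) for p in PREREQS[step]), default=0)

    print(finish("install"))   # 39 days after signing, with these toy numbers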

In addition, you can spot bottlenecks that need to be improved and generally manage the capacity of your organization. You can be alerted when you need to limit sales to avoid overcommitment: Do not accept work faster than the rate at which work moves through your bottleneck process. Focus on the bottleneck until it is improved, at which point a new bottleneck will appear. Find the next bottleneck and limit sales of projects that use it to the rate the bottleneck can handle. Repeat...
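
Spotting the bottleneck and capping intake at its pace takes only a few lines, given per-step throughput figures (invented here):

    # Jobs per week each step can complete, measured from recent history.
    throughput = {"design": 12, "permit": 5, "install": 8}

    bottleneck = min(throughput, key=throughput.get)
    max_intake = throughput[bottleneck]

    print(bottleneck, max_intake)   # 'permit', 5 -> sign at most 5 contracts/week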

By far the biggest benefit of managing the flow of groups of items is this: You will get a lot more done with a lot less effort. How can this be? When you no longer have half-done work clogging up your system, demanding attention, and distracting workers, everyone can spend much more of their time actually getting useful work done.   

3. Form Single-Responsibility Teams

The third big change you should make is to charter single-responsibility teams, organized and led by single-responsibility leaders. What is a single-responsibility team?  It is a team that has one and only one thing to do, a team that can focus all its efforts on accomplishing one thing at a time and doing that one thing well. Be careful to minimize dependencies among these teams; each team should be able to begin and complete its work independently; teams should not need to coordinate with or get permission from managers or other teams. Finally, work should not appear in the queue of a single-responsibility team until everything is in place for that work to be done, otherwise the team will have to finish someone else’s work before they can focus on their own. 

How is it possible to organize complex work into pieces that can be accomplished by single-responsibility teams? While there is no such thing as the “best” organizational structure for companies in diverse domains, it is easy to find examples that demonstrate the concept. Let’s start with AWS (Amazon Web Services). Over the past decade AWS has displaced much of the traditional enterprise software market with an array of independent services, each managed by a single-responsibility team and strung together through hardened interfaces. No one has ever seen an AWS roadmap; instead, teams identify (or are given) a customer problem and they start by writing a “press release” describing how their solution will solve the problem. Then they proceed to develop the solution, testing it with customers as often as possible. AWS has released multiple new enterprise-level services every year since 2012, each aimed at solving a specific customer problem. These services have gradually lured companies away from the complex, integrated packages of traditional vendors, often prompting AWS customers to organize their own work around autonomous, focused teams.

Now let’s look at SpaceX, which is organized very differently from AWS. SpaceX is fundamentally an engineering organization; its teams are organized around the components of the booster rockets that launch things into space. There is a team for each stage of the booster (e.g., payload, stage 1, stage 2, engine cluster…) with sub-teams for each sub-component of the stage (Merlin engine, landing legs…).

You might wonder: How do teams know what to do and when to do it? How do they coordinate with other teams? How do they know they have done a good job?  SpaceX operates on the principle of responsibility, which says that each team / sub-team is expected to understand the role its component must play in the next launch and have it ready by the scheduled launch date (perhaps a couple of months away). The team, led by a responsible engineer, has one job: Ensure that the component works as well as possible during the launch and does its job as part of the overall system. 

How does SpaceX manage FLOW? For a large, multi-team development effort involving both hardware and software, managing flow is usually accomplished by scheduling a series of frequent integration events. The purpose of each event is to take a well-defined step forward in the overall development effort to learn what works, locate unintended integration effects, and discover ways to improve the system. A series of integration events sets the cadence for all teams involved in the development effort. In the case of SpaceX a test launch is scheduled every few months throughout a development cycle; importantly, launch dates Do. Not. Change. Each launch is an integration event that moves all teams forward together, creating a steady flow of real progress toward a working system. 

Of necessity, based on the structure of a booster, SpaceX teams have more dependencies than AWS teams. They also have a clear responsibility to understand and account for those dependencies. Since SpaceX teams work to meet fixed launch deadlines, their systems sometimes fail during a launch. Failures are part of the process, but teams must carefully document launch behavior to be sure that the cause of any failure is identified and eliminated. Believe it or not, this approach to development is much faster and less costly (and usually safer!) than trying to think through every failure mode and eliminate any possibility of failure before attempting a test launch. 

Does SpaceX focus on customer outcomes? The company undertook the difficult challenge of developing a reusable booster because it was clear that sending things into space was too expensive for many potential customers. Reusing boosters was an essential step in dramatically reducing the cost of a launch, significantly increasing the number of customers with access to space. This need to reduce launch costs has driven most major booster development decisions at SpaceX.

SpaceX’s reusable booster has reduced the cost of launching a kilo of payload into space by a factor of 7, while spending an order of magnitude less money on development than previous booster rockets.

Summary

A lot of companies have more work than they can possibly handle – and that is not a good thing. It creates long delays, excess complexity, and eventually, dissatisfied customers. The secret is to limit demand to capacity and then increase capacity to meet additional demand by creating a steady flow of work while teams focus on a single responsibility. You also need to stop doing work that does not contribute to great customer outcomes: Stop multitasking, stop prioritizing, get rid of backlogs, stop shepherding individual tasks. If companies as large and successful as AWS and SpaceX can organize work around customer outcomes, steady flow, and focused, single-responsibility teams – then so can you. 

January 19, 2019

An Interview

Recently I was asked to complete an interview via e-mail. I found the questions quite interesting - so I decided to post them here.

Caution: The answers are brief and lack context. Some of them are probably controversial, and the interview format didn't provide space for going below the surface. Send me an e-mail (mary@poppendieck.com) if you'd like to explore any of these topics further.


When did you first start applying Lean to your software development work? Where did you get the inspiration from?

I think it’s important to set the record straight – most early software engineering was done in a manner we now call ‘Lean.’ My first job as a programmer was working on the Number 2 Electronic Switching System when it was under development at Bell Telephone Labs. Not long after that, I was assisting a physicist doing research into high-energy particle tracing. The computer I worked on was a minicomputer that he scrounged up from a company that had gone bankrupt. With a buggy FORTRAN compiler and a lot of assembly language, we controlled a film scanner that digitized thousands of frames of bubble chamber film, projected the results into three-dimensional space, and identified unique events for further study.

My next job was designing automated vehicle controls in an advanced engineering department of General Motors. From there I moved to an engineering department in 3M where we developed control systems for the big machines that make tape. In every case, we used good engineering practices to solve challenging technical problems.

In a very real sense, I believe that lean ideas are simply good engineering practices, and since I began writing code in good engineering departments, I have always used lean ideas when developing software.

From the organizations you've worked with, what have been some of the most common challenges associated with Lean transformations?

Far and away the most common problem occurs when companies head into a transformation for the sake of the transformation, instead of clearly and crisply identifying the business outcomes that are expected as a result of the transformation. You don’t do agile to do agile. You don’t do lean to do lean. You don’t do digital to do digital. You do these things to create a more engaging work environment, earn enough money to support that environment, and build products or services that truly delight customers.

So an organization that sets out on a transformation should be looking at these questions:
  1. Is the transformation unlocking the potential of everyone who works here? How do we know?
  2. Are we creating products and services that customers love and will pay for? How do we know?
  3. Are we creating the reputation and revenue necessary to sustain our business over the long run?

There's lots of talk now around scaled Agile frameworks such as SAFe, Nexus, LeSS, etc., with mixed results. How do you approach the challenge of scaling this way of working?

Every large agile framework that I know of is an excuse to avoid the difficult and challenging work of sorting out the organization’s system architecture so that small agile teams can work independently. You do not create smart, innovative teams by adding more process, you create them by breaking dependencies.

What we have learned from the Internet and from the Cloud is very simple – really serious scale can only happen when small teams independently leverage local intelligence and creativity. Companies that think scaled agile processes will help them scale will discover that these processes are not the right path to truly serious scale.

One of the common complaints from developers on Agile teams is that they don't feel connected to customers, and there is sometimes a feeling of working on outputs rather than customer outcomes. How might this be changed?

This is the essential problem of organizations that consider agile a process rather than a way to empower teams to do their best work. The best way to fix the problem is to create a direct line of sight from each team to its consumers.

When the Apple iPhone was being developed, small engineering teams worked in short cycles that were aimed at a demo of a new feature. Even though the demo group was limited due to security, it was representative of future consumers. Each team was completely focused on making the next demo more pleasing and comfortable for their audience than the last one. These quick feedback loops over two and a half years led directly to a device that pretty much everyone loved. [1]

At our meetup last year, you spoke about resisting proxies, and one of those proxies is the Product Owner. What alternative approaches have you seen work for Lean or Agile teams, as opposed to having a Product Owner?

Why do software engineers need someone to come up with ideas for them? Ken Kocienda was a software engineer who ‘signed up’ to be responsible for developing the iPhone’s keypad. In the book Creative Selection [1], he describes how he developed the design, algorithms, and heuristics that created a seamless experience when typing on the iPhone keyboard, even though it was too small for most people’s fingers.

Similarly, at SpaceX, every component has a ‘responsible engineer’ who figures out how to make that component do its proper job as part of the launch system. John Muratore, SpaceX Launch Director, says “SpaceX operates on a philosophy of Responsibility – no engineering process in existence can replace this for getting things done right, efficiently.” [2]

The Chief Engineer approach is common in engineering departments from Toyota to GE Healthcare. It works very well. There is nothing about software that would exempt it from the excellent results you get when you give engineers the responsibility of understanding and solving challenging problems.

What is the most common thing you've seen recently which is slowing down organizations' concept-to-cash loop?

Friction. For example, the dependencies generated by the big back end of a banking system are a huge source of friction for product teams. The first thing organizations need to do is to learn how to recognize friction and stop thinking of it as necessary. When Amazon moved to microservices (from 2001 to 2006) the company had to abandon the idea that transactions are managed by a central database – which was an extremely novel idea at the time.

Over time, Amazon learned how to recognize friction and reduce it.  Today, Amazon Web Services (AWS) launches a new enterprise-level service perhaps once a month and about two new features per day. Even more remarkable, AWS has worked at a similar pace for over a decade. If you look closely, an Amazon service is owned by a small team led by someone who has 'signed up' to be responsible for delivering and supporting a service that addresses a distinct customer need at a cost that is both extremely attractive and provides enough revenue to sustain the service over time.
_____
Footnotes:

1. See Creative Selection by Ken Kocienda.

2. John Muratore, System Engineering: A Traditional Discipline in a Non-traditional Organization, Talk at 2012 Complex Aerospace Systems Exchange Event.

January 22, 2018

Official Intelligence


Every morning I pick up a small black remote, push a button and quietly say, “Alexa, turn on Mary’s Desk.” In the distance, I hear “Ok” and my desk lights come on. I never imagined that I would use voice control. I don’t like shouting at devices and don’t like announcing what I’m doing to those around me. My experience with voice in cars and smartphones has been mediocre. An Echo sat in our house for three years before I began talking to it.

Our home has been highly automated since the 1980’s, but my low-voltage desk lamps are not compatible with our automation system. Last fall we connected our system to Alexa and bought an Alexa-compatible power strip for the lamps. Voila! We could control everything by voice – in fact, voice was the only way to control my desk lamps remotely. So I had to talk to a device. After I got used to it, I tried Alexa’s shopping list and found it convenient. Then I discovered that Alexa’s timers are well suited to cooking, especially with full hands. Soon I got an Echo Spot so I could see the timers as they counted down – I was hooked.

A Killer App for IoT

When devices are scattered throughout a physical space, controlling them by voice is a killer app for the Internet of Things [1] – but only if a simple control standard is embedded in every device. For the past couple of years, Amazon has been making it very easy for just about any Wi-Fi enabled device to be controlled through an Alexa skill. Better yet, taking a page out of the old Intel playbook, Amazon sells inexpensive kits and supplies testing support, so designers can easily embed microphones and Alexa intelligence inside their devices. True, these devices compete directly with Amazon Echos, but as a platform company, Amazon understands that this is a good thing.

Amazon was once considered the weakest of the voice assistant competitors, which include Google, Apple, Samsung, Microsoft, and Facebook. Most voice assistants are, well, assistants. They reserve movie tickets, arrange transportation, find answers to questions. Amazon’s Echo has a different focus: always-on listening and hands-free control of an exploding array of internet devices. This turned out to be a good choice. Google, playing fast follower, quickly introduced Google Home, while Microsoft’s Cortana is being integrated with Alexa. With its first mover advantage, Alexa has captured a commanding share of the voice control market, which could eventually become a fourth ‘pillar’ of Amazon’s success.[2]

I have to wonder: Is it an accident that Amazon discovered the most attractive use of voice while its biggest competitors were heading in a different direction? Or is there something about Amazon that gives it an edge, something that we might learn from? How does such a massively large company foster the kind of innovation that leads to completely new markets?

Day 1

In his letter to shareholders last spring, Jeff Bezos explained his longstanding mantra “It’s still Day 1” by describing Day 2: “Day 2 is stasis. Followed by irrelevance. Followed by excruciating, painful decline. Followed by death. And that is why it is always Day 1.” Then Bezos laid out his principles for keeping the vitality of Day 1 alive:

  1. True Customer Obsession
  2. A Skeptical View of Proxies
  3. Eager Adoption of External Trends
  4. High-Velocity Decision Making

True Customer Obsession

We’ve heard so much about Amazon’s customer obsession that it can get boring. After all, doesn’t every company focus on customers?

Actually, no. Most executives lose a lot more sleep over profits, or shareholders, or competitors than they do worrying about customers. Imagine you worked at an airline, for example, and you had an idea of how to make customers really happy: “Let’s eliminate the baggage fees!”  Your manager frowns at you and says: “Have you any idea how much money those baggage fees bring into this airline every month?”  And that would be the last customer-obsessed suggestion you would make.

But consider the Amazon team that came up with Lambda. Some customers report up to an order of magnitude reduction in cost when they switch to Lambda. Yet the Lambda team did not have to answer the sobering question: “Do you know how much revenue Lambda might cannibalize?” Everyone understood that lower prices are good for customers, so they are good for Amazon.

How does customer obsession get all the way from a statement in a shareholder letter to the actions of front line employees? Amazon does this by creating a direct line of sight between small teams and the customers they are supposed to be obsessed with, then making the teams responsible for improving the lives of those customers in some way.

Too Big to Communicate

Around 2001, Amazon’s growth was outstripping the capability of its internal systems to keep up. The leadership team came to a pretty standard conclusion – better communication was needed. Jeff Bezos was wise enough to realize that if communication was the problem, the solution had to be less communication, not more. He wanted the company to grow much larger, and if communication was impeding growth at this early stage, they had better figure out how to operate with a lot less of it.

How did the Internet grow so large? Through a lot of independent agents following their own agendas. How does Open Source software grow? The same way. Bezos decided that Amazon should transition to the independent agent model by organizing into small, independent teams. “If you can arrange to do big things with a multitude of small teams – that takes a lot of effort to organize, but if you can figure that out – the communication on those small teams will be very natural and easy,”[3] Bezos observed.

What, exactly, does Bezos mean by a team? At Amazon: [3,4]

  1. Teams are groups of 6-12 people with a leader who acts something like a team CEO. The leader often recruits the rest of the team, and members usually stay with a team for two or more years.
  2. Teams are ‘separable’ [separated organizationally] and ‘single-threaded’ [work on a single thing].  
  3. Teams are responsible for a measurable set of external outcomes, usually focused on customers. 
  4. Teams decide internally both what they will work on and how they will do the work.
  5. Dependencies between teams are kept to an absolute minimum.

Once Amazon decided to structure a company composed of small, autonomous teams responsible for small, independent services, it had to figure out how to build an extensible infrastructure with these teams. That took a lot of time and experimentation, but in the end, it worked. In fact, it worked so well that Amazon decided to sell the infrastructure rather than keep it proprietary. And thus, we have Amazon Web Services (AWS), also known as ‘the Cloud’. As more and more companies move to the cloud they would be wise to understand that before it was a system architecture, the Cloud was an organizational architecture designed to streamline communication.

The Cathedral and the Bazaar

You are probably saying to yourself about now, “Cloud architectures are fine for a digital world, but how can they possibly work for a large company?” When I first heard about AWS, I asked the same question. To echo Eric Raymond’s “The Cathedral and the Bazaar,” [5]
I used to believe there was a certain critical complexity above which a centralized, a priori approach to running a company was required. I thought that successful large companies were built like cathedrals, carefully crafted by individual wizards or small bands of magicians who orchestrated successful strategies. 
Amazon’s style of organization – assemble hundreds upon hundreds of autonomous teams that decide for themselves what they are going to work on – seemed to resemble a great babbling bazaar of differing agendas and approaches out of which a coherent and stable business could seemingly emerge only by a succession of miracles. 
The fact that this bazaar style seems to work, and work well, came as a distinct shock. As I learned more about AWS, I worked hard at trying to understand why the Amazon world not only didn’t fly apart in confusion but seemed to go from strength to strength at a speed barely imaginable to cathedral-builders.

Knowledge Workers

In 1999, Peter Drucker published the paper "Knowledge-Worker Productivity: The Biggest Challenge."  The productivity gains of the 20th century, he noted, applied to manual labor. In the 21st century, the challenge will be increasing knowledge worker productivity, and this requires an approach opposite to the one we have been using for manual labor productivity. He wrote: [6]
“Knowledge-worker productivity requires that the knowledge worker is both seen and treated as an ‘asset’ rather than a ‘cost.’ It requires that knowledge workers want to work for the organization in preference to all other opportunities.”  
“Knowledge workers [unlike manual workers] own the means of production. That knowledge between their ears is a totally portable and enormous capital asset.”  
“Economic theory and most business practice sees manual workers as a cost. To be productive, knowledge workers must be considered a capital asset. Costs need to be controlled and reduced. Assets need to be made to grow.” 
Drucker pointed out that there are many jobs which we might consider manual labor that involve a lot of knowledge work. One good example would be retail clerks, the subject of Zeynep Ton’s book "The Good Jobs Strategy." Ton illustrates how several retail chains have generated strong growth and higher than normal profits by paying well, training extensively, and expecting their intelligent employees to create a great experience for customers.

Knowledge workers are everywhere in our companies, and they represent a huge opportunity for improved performance, if only we learn how to see them as assets and grow their potential. 

Volunteers

In his 1999 book "Management Challenges for the 21st Century," Peter Drucker pointed out that knowledge workers must be managed as if they were volunteers, because in fact, they are volunteers. During my last three years at 3M, I used the then-classic 3M approach of ‘bootlegging’ the efforts of dozens of scientists and engineers, and led a volunteer team developing a product called ‘Light Fiber’ and the process needed to manufacture it. I learned a lot about what it takes to lead a large team of volunteer knowledge workers, and it boils down to this: Understand what energizes every person on the team and arrange for each person to do as much of what energizes them as possible. This works particularly well because people tend to be energized by what they are good at, and focusing people’s work on what they are good at creates a win-win situation.

The Light Fiber team met every Wednesday morning before regular work hours (when everyone was free). I supplied breakfast and a couple dozen people showed up to coordinate their efforts – every week, for three years! The meeting was essentially a forum where everyone got to brag about their accomplishments and make promises to each other about what they would do in the future. Even though there were no work assignments, only promises, the team accomplished amazing things.

Promises

I was reminded of this experience by the book "Thinking in Promises" by Mark Burgess. He defines a promise as a public declaration of intention by an agent. Agents can only make promises for themselves – they cannot make promises for (impose intentions on) other agents. Agents communicate about what is necessary to achieve shared goals, and then make promises to each other about their intention to contribute to the shared goal. Trust develops when agents are observed routinely keeping promises. Agents can make promises contingent on the trusted behavior of other agents, but they need a fallback plan, since the best of intentions can go awry. This sounded very familiar to me – it is a good description of what happened every week at our Light Fiber meeting. And I agree with Burgess that a system built on promises can be very reliable and robust.

Think of the Bazaar approach as a marketplace where knowledge workers can find the best places to utilize their strengths. The currency of the Bazaar is the promises made to colleagues and the trust built up by promises that are kept. Companies that function as Bazaars have discovered a secret: “Peer pressure is much more powerful than a concept of a boss. Many, many times more powerful.”[7]  When people make promises to colleagues and customers, they feel a personal commitment to keep the promise. When managers impose obligations on their teams, there are many points of failure.

A Skeptical View of Proxies

Jeff Bezos’ second principle for vitality is to take a skeptical view of proxies. What are proxies? Bezos cites process as a typical example – he doesn’t think “I followed the process” is a good excuse for poor results.

The most vexing proxies in the development world are the project metrics of cost, schedule, and scope. Teams that can focus directly on the desired outcome usually perform a lot better than teams constrained by these proxies. For IT departments, ‘The Business’ is a proxy. For many businesses, profits are a proxy for delighted customers. [8] Be skeptical.

Even if someone does their homework and proves that delivering specific proxy results will surely deliver the desired end results, a direct line of sight to the desired outcomes is much better. Why? 1) Things change, but proxies prevent the team from changing accordingly. 2) Proxies tend to mask the intent or purpose of the work, diminishing engagement. 3) Proxies interfere with feedback and therefore slow things down.

High-Velocity Decision Making

This brings us to another of Bezos’ principles for maintaining company vitality: high-velocity decision making. Fast decisions are local decisions, because when a decision must be made immediately, there is no time to push it up the chain of command. In military organizations, where high-velocity decision making is a matter of life and death, front line units make local decisions based on situational awareness and their understanding of command intent. Wise commanders get very good at communicating their intent and the desired end state, so that the rapid decisions made on the front lines will be good decisions. This well-tested approach is probably the best model we have for making fast decisions in rapidly changing environments.

There are three things that get in the way of high-velocity decisions at the local (team) level:

  1. Proxies rather than a clear understanding of the desired end state.
  2. Permission required from management or other teams.
  3. Punishment if the decision is wrong.

We’ve already discussed proxies.

Permission

The State of DevOps report in 2017 found that the best performing teams are those that operate without the need to obtain permission from outside the team.  Obviously, wise managers should try to step back and let teams make their own decisions. But there is a bigger issue here. All too often, there are significant dependencies between teams, requiring multiple teams to coordinate their actions across a large set of interconnections. This was once the reason for long delays between software releases, but we now know that breaking dependencies is a far better strategy than catering to them.

Dependencies can be subtle, and are usually based on the system architecture. In the late 20th century, companies spent years pursuing the holy grail of integration, only to discover that integrated systems create a legacy of intertwined processes. This interconnectedness makes it almost impossible for teams to make changes without getting permission from a lot of other teams. So much for high-velocity decision making.

Punishment

Some years ago, 3M’s visionary CEO, William McKnight, made clear what he thought about punishing mistakes: [9]
“As our business grows, it becomes increasingly necessary to delegate responsibility and to encourage men and women to exercise their initiative. This requires considerable tolerance. Those men and women, to whom we delegate authority and responsibility, if they are good people, are going to want to do their jobs in their own way.” 
“Mistakes will be made. But if a person is essentially right, the mistakes he or she makes are not as serious in the long run as the mistakes management will make if it undertakes to tell those in authority exactly how they must do their jobs.” 
 “Management that is destructively critical when mistakes are made kills initiative. And it’s essential that we have many people with initiative if we are to continue to grow.”
       

Eager Adoption of External Trends

To wrap up our discussion of Day 1 principles, let’s consider why Bezos thinks it’s important to adopt external trends. He says that if you pay attention to how external trends are likely to affect you and your customers, you will have a tail wind that can push you toward some interesting opportunities.

Here is an example: Today’s big trends are artificial intelligence (AI) and machine learning – voice recognition is a trendy use of artificial intelligence. Verbal interaction has not been a large part of Amazon’s retail and web services businesses, but that did not keep the company from making very strong investments in voice technologies. By late 2017, the AWS re:Invent conference centered on voice recognition and machine learning embedded in devices at the edge of the cloud. You could feel the tail wind.

Let’s assume you want to take Bezos’ advice and embrace today’s big trend – artificial intelligence. You might start by asking: Can artificial intelligence be used to help understand customers better? [We know of a company using Watson to filter social media comments in order to find key customer frustrations.] Could it be used to improve the development process? [Think automation on steroids.]  What could machine learning do to increase the reliability of deployed systems? [All the data needed to discover the causes of crashes is probably in logs somewhere.] If you are not asking yourself these questions, you’re asking for a headwind.

Official Intelligence

When (not if) you find uses for artificial intelligence, it’s time to hit the pause button. If you find jobs that could be replaced by smart machines, you should ask yourself: Why? Why aren’t the intelligent people who currently do those jobs being challenged to think, to find innovative solutions to problems, to be obsessed with customers? Your challenge, should you accept it, is not to embrace artificial intelligence, it is to uncover the official intelligence that is going to waste in your organization. Don’t worry about how AI might reduce the cost of development, focus on how it might be used to leverage the knowledge and creativity of all those intelligent people in your organization.

As Peter Drucker pointed out, we know how to make manual labor more productive – in fact, artificial intelligence can be quite helpful there. But the real advances of the 21st Century will come when we figure out how to make sure that everyone in our organization is officially considered an intelligent person. Officially intelligent people don’t ask permission, they make promises. Officially intelligent people don’t need proxies, they are challenged with the end game. Officially intelligent people have jobs that are augmented by artificial intelligence not replaced by it.

Unleashing the potential of all the bright, creative people in our organizations is the central challenge of the digital age. 

_________________________
Footnotes:
  1. See Alexa, the Killer App
  2. The first three pillars are Marketplace, Prime, and AWS. See Alexa Could Be Amazon's "Fourth Pillar" and Why Amazon is the new Microsoft
  3. See Leadership advice: How Amazon maintains focus while competing in so many industries at once. The video is worth watching.
  4. See Amazon’s “two-pizza teams”: The ultimate divisional organization
  5. I take great liberties paraphrasing Eric Raymond’s “The Cathedral and the Bazaar” 
  6. Knowledge-Worker Productivity: The Biggest Challenge by Peter F. Drucker, California Management Review vol. 41,no. 2 winter 1999. Italics from the original.
  7. In The Tipping Point, Malcolm Gladwell wrote about Gore and Associates, a well-known Bazaar company. Gladwell attributes this quote to Jim Buckley of Gore and Associates.
  8. I wrote more about proxies in this post: The Cost Center Trap
  9. McKnight Principles



January 14, 2017

The End of Enterprise IT


In 2015, the employees at ING Netherlands headquarters – over 3,000 people from marketing, product management, channel management, and IT development – were told that their jobs had disappeared.  Their old departments would no longer exist; small squads would replace them, each with end-to-end responsibility for making an impact on a focused area of the business. Existing employees would fill the new jobs, but they needed to apply for the positions.[1]

It was a bold move for the Netherlands bank. The leaders were giving up their traditional hierarchy, detailed planning and “input steering” (giving directions). Instead they would trust empowered teams, informal networks, and “output steering” (responding to feedback) to move the bank forward. The bank was not in trouble; it did not really need to go through such a dramatic change. What prompted this bet-your-company experiment?

The change had been years in the making. After initial experiments in 2010, the IT organization put aside waterfall development in favor of agile teams. As successful as this change was, it did not make much difference to the bank, so Continuous Delivery and DevOps teams were added to increase feedback and stability. But still, there was not enough impact on business results. Although there were ample opportunities for business involvement on the agile teams and input into development priorities, the businesses were not organized to take full advantage of the agile IT organization. Eventually, according to Ron van Kemenade (CIO of ING Netherlands from 2010 until he became CIO of ING Bank in 2013):[2]
The business took it upon itself to reorganize in ways that broke down silos and fostered the necessary end-to-end ownership and accountability. Making this transition … proved highly challenging for our business colleagues, especially culturally. But I tip my hat to them. They had the guts to do it.
The leadership team at ING Netherlands had examined its business model and come to an interesting conclusion: their bank was no longer a financial services company, it was a technology company in the financial services business. The days of segmenting customers by channel were over. The days of push marketing were over. Thinking forward, they understood that winning companies would use technology to provide simple, attractive customer journeys across multiple channels. This was true for companies in the media business, the search business, most retail businesses, and it was certainly true for companies in the financial services business. Moreover, expectations for engaging customer interactions were not being set by banks – they were being set by media and search and retail companies. Banks had to meet these expectations just to stay in the online game.

ING Netherlands’ leadership team decided to look to other technology companies, rather than banks, for inspiration. For example, on a trip to the Google I/O developers conference Ron van Kemenade was impressed by the amazing number of enthusiastic, engaged engineers at Google. He realized that such enthusiasm could not surface in his company, because the culture did not value good engineering.

Let’s be clear: engineering is about using technology to solve tough problems – problems like: How can we process a mortgage with a minimum of hassle for customers? How can we reduce the cost of currency exchange and still make a profit? How might we leverage Europe’s movement to open APIs for our customers’ advantage? These are the kinds of questions that are best answered by a small team of crack engineers working closely with people who deeply understand the customer journey. But at ING Netherlands, technology improvements were being worked out by people in the commercial business who would then tell the engineers what to develop. Not only is this a poor way to attract top engineers, it is the wrong way to create innovative solutions inspired by the latest technology.

The leaders at ING Netherlands decided to investigate how top technology companies attract talented people and come up with engaging products. Through concentrated visits to some of the most attractive technology companies, they saw a common theme – these companies did not have traditional enterprise IT departments even though they were much bigger than any bank. Nor did they have much of a hierarchical structure. Instead, they were organized in teams – or squads – that had a common purpose, worked closely with customers, and decided for themselves how they would accomplish their purpose.

ING Netherlands decided that if it was going to be a successful technology company and attract talented engineers, it had to be organized like a technology company. Studying the best technology companies convinced them that they needed to change – and the change had to include the whole company, not just IT. The bank had already modularized its architecture, streamlined and automated provisioning and deployment, moved to frequent deployments, and formed agile teams. But this was done within the IT department rather than across the organization, and the results were not exceptional. Now it was time to create a digital company across all functions.

They chose to adopt an organizational structure in which small teams – ING calls them squads – accept end-to-end responsibility for a consumer-focused mission. Squads are expected to make their own decisions based on a shared purpose, the insight of their members, and rapid feedback from their work. Squads are grouped into tribes of perhaps 150 people that share a value stream (e.g. mortgages), and within each tribe, chapter leads provide functional leadership. Along with the new organizational structure, ING’s leadership team worked to create a culture that values technical excellence, experimentation, and customer-centricity.

So how well did this major organizational change work? Certainly, it was not without problems. Some people did not want to work in the new environment, and there was not necessarily a role for everyone. So there were layoffs. But the people who stayed were intrigued by the new way of working and quickly became acclimated to their new jobs.

Another problem involved answering the question “What makes a ‘good’ engineer?” For this the bank adopted the Dreyfus model of skill acquisition (novice, advanced beginner, competent, proficient, and expert). It set up an internal ‘academy’ of classes – usually taught by senior engineers – to help everyone develop the skills needed for a future in a technology company.

Perhaps the biggest issue is one that anyone with a background in organizational development would expect – creating alignment across the many autonomous teams has been a formidable challenge. The bank needs to make major changes and develop breakthrough innovations; but these require coordinated action across multiple, supposedly autonomous, teams. Even the top technology companies ING bank studied have not really solved this problem. Ron van Kemenade summarized the problem this way:[2]
We had assumed that alignment would occur naturally because teams would view things from an enterprise-wide perspective rather than solely through the lens of their own team. But we’ve learned that this only happens in a mature organization, which we’re still in the process of becoming.
Centrally driven program management is now used to arbitrate priority conflicts and create alignment, while standardization of back end systems (e.g. data centers) and support functions helps maintain the operational excellence and regulatory compliance necessary at a large bank.

Despite the challenges, ING Netherlands views its new organizational structure as a significant success with sizable benefits to the company. The strategy adopted by ING Netherlands – an organizational structure composed of small, integrated teams, along with an emphasis on simple customer journeys, automated processes, and highly skilled engineers – is expected to spread to other parts of ING Bank.

The moral of this story is simple: agile transformations are not about transforming IT, they are about transforming organizations. If you are going through an agile transformation in your IT department, you are thinking too narrowly. Digitization must be an organization-wide experience.
_________
Footnotes:

[1] From: ING’s agile transformation, an interview with Peter Jacobs, CIO of ING Netherlands, and Bart Schlatmann, former COO of ING Netherlands, in McKinsey Quarterly, January 2017. See also: Software Circus Cloudnative Conference keynote by Peter Jacobs. (Peter Jacobs replaced Ron van Kemenade as CIO of ING Netherlands in 2013.)

[2] From: Building a Cutting-Edge Banking IT Function, An Interview with Ron van Kemenade, CIO ING Bank, by Boston Consulting Group. See also talks by Ron van Kemenade: Nothing Beats Engineering Talent…The AGILE Transformation at ING and The End of Traditional IT.


June 16, 2016

Integration Does. Not. Scale.

In times past, there was a difference between the front office of a business – designed to make a good impression – and the back office – a utilitarian place where most of the routine work got done. The first (and for a long time the predominant) use of computers in business centered around automating back office processes, so of course, IT was relegated to the back office. 

As businesses grew, various back office functions developed their own computer systems – one for purchasing, one for payroll, one for manufacturing, and so on. The manufacturing system in vogue when I was in a factory was called MRP – Material Requirements Planning. As time went on, MRP systems were expanded to the supply chain, and then to the rest of the business, where they acquired the name ERP – Enterprise Resource Planning.

Over time it became obvious that the disparate systems for each function were handling the same data in different ways, making it difficult to coordinate across functions. So IT departments worked to create a single data repository, which quite often resided in the ERP system. The ERP suite of tools expanded to include most back office processes, including customer relationship management, order processing, human resources, and financial management. 

The good news was that now all the enterprise data could be found in the single database managed by the ERP system. The bad news was that the ERP system became complex and slow. Even worse, enterprise processes had to either conform to “best practices” supported by the ERP suite or the ERP system had to be customized to support unique processes. In either case, these changes took a long time.

ERP Systems Meet Digital Organizations  

As enterprise IT focused on implementing ERP suites and developing an authoritative system of record, the Internet became a platform for a whole new category of software, spawning new business models that did not fit into the traditional processes managed by ERP systems. Here are a few examples:

  1. Many software offerings that used to be sold as products are now being sold “as a service”. However, ERP systems were designed to manage the manufacture and distribution of physical products; they don’t generally manage subscription services.
  2. Some companies (Google for example) give away their services and sell advertising. Other companies (such as eBay and Airbnb) create platforms that unite consumers with suppliers, often disrupting traditional industries. In a platform business, the most critical processes focus on driving network effects by facilitating interactions between buyers and sellers. Although ERP systems can manage both suppliers and customers, they usually do not focus on the interactions between them.
  3. The Internet of Things (IoT) brings real time data into many processes, changing the way they are best executed. For example, predictive maintenance of heavy equipment can be scheduled based on sensor data, resulting in better outcomes for customers and thus for the enterprise. ERP suites are intended to support standard practices; they struggle to support processes that change dynamically in response to digital input.
  4. Capitalizing on the availability of data generated by products, companies are moving to selling business outcomes rather than individual products (GE is an example). When you are selling engine thrust or lighting costs, rather than engines or lightbulbs, processes need to be focused on the customer context. ERP systems generally focus on internal processes.
  5. ERP systems are supposed to provide a single, integrated record of important enterprise data, but that data rarely includes dynamic product performance data, information about consumer characteristics and preferences, or other information that has come to be called “Big Data”. This kind of information is becoming an extremely valuable resource, but there isn’t room in ERP databases to store and manage the massive amount of interesting data that is available.

In summary, digitization is bringing the back office much closer to the front office, providing the data for dynamic decision-making, and substituting short feedback loops and data-driven interactions for “best practices.” Since enterprise ERP suites were not built for speed or rapidly changing processes, they are increasingly being supplemented with other systems that manage critical enterprise processes.


Postmodern ERP

In the last few years, in the wake of the success of Salesforce.com, many cloud-based software services have become available. Some target the entire enterprise (NetSuite for example), but many are focused on particular areas (e.g. human resources) or particular industries (e.g. construction). These services are finding an eager audience – even in companies that have existing ERP systems. Today, about 30% of the spend for IT systems is coming from business units outside of IT [1]. If they cannot get the software they need from their IT departments, business leaders are likely to purchase cloud-based services instead.

The cloud reduces dependence on a company’s IT department, so it has become quite easy for various areas of the enterprise to independently adopt “best-of-breed” solutions specifically targeted at their needs, rather than use a single ERP suite across the enterprise. These best-of-breed systems are usually selected by line business leaders and hosted in the cloud. They tend to be faster to implement and more responsive to changing business situations than the enterprise ERP suite – partly because they are decoupled from the rest of the enterprise. Gartner calls the movement from a single ERP suite to a collection of ERP modules from multiple vendors “Postmodern ERP”[2]. 

Gartner warns that a multi-vendor ERP approach can lead to significant integration problems, and recommends that multiple vendors should not be used until the integration issues are sorted out. Of course, business leaders want to know why integration is important. IT departments typically respond that the ERP’s central database is the enterprise system-of-record; other ERP modules – financial reporting, for example – depend on this database for critical data. Without an integrated database, how will the rest of the enterprise be able to operate? How will the accounting department produce its required financial reports? 


Integration Does. Not. Scale.

But hold on. There are plenty of very large companies that work remarkably well – and produce financial reports on time – without an integrated system-of-record. In fact, internet-scale companies have discovered that integration does not scale. Back in the year 2000, Amazon.com had a traditional architecture – a big front end and a big back end – which got slower and slower as volume grew. In the early 2000s Amazon abandoned its integrated back-end database in favor of independent services that manage their own data and communicate with each other exclusively through clearly defined interfaces. 

If we have learned one thing from internet-scale players, it’s that true scale is not about integration, it is about federation. Amazon runs a massive order fulfillment business on a platform built out of small, independently deployable, horizontally scalable services. Each service is owned by a responsible team that decides what data the service will maintain and how that data will be exposed to other services. Netflix operates with the same architecture, as do many other internet-scale companies. In fact, adopting federated services is a proven approach for organizations that wish to scale beyond their current limitations. 

Let’s revisit the enterprise where business units prefer to run best-of-breed ERP modules to handle the specific needs of their business. This enterprise has two choices: 

  1. Integrate the various ERP modules and store their data in a single ERP database.
  2. Coordinate independently maintained enterprise data through API contracts. 

The problem with the first option is that integration creates dependencies across the enterprise. Each time a data definition in the central database is added or changed, every software module that uses the database must be updated to match the new schema. This makes the integrated database a massive dependency generator; the result is a monolithic code base where changes are slow and painful. 

Enterprises that want to move fast will select the second option. They will move to a federated architecture in which each module owns and maintains its own data, with data moving between modules via well-defined, stable interfaces. As radical as this approach may seem, internet-scale businesses have been living with services and local data stores for quite a while now, and they have found that managing interface contracts is no more difficult than managing a single, integrated database.
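
To make this concrete, here is a minimal sketch (in Python, with hypothetical names like OrderService) of a federated module that owns its data outright and exposes it only through a narrow, explicit contract – an illustration of the pattern, not anyone’s production code:

    # A federated module owns its data store and exposes it only
    # through a narrow, explicit contract. All names are hypothetical.

    class OrderService:
        """Owns the order data; no other module touches its store directly."""

        def __init__(self):
            self._orders = {}   # private store (a real service would use
                                # its own database, invisible to outsiders)

        # The public contract: the ONLY way other modules see this data.
        def place_order(self, order_id: str, customer: str, total: float) -> None:
            self._orders[order_id] = {"customer": customer, "total": total}

        def get_order(self, order_id: str) -> dict:
            # Return a copy so consumers never hold internal state.
            return dict(self._orders[order_id])

    # A finance module asks the contract for what it needs; it has no
    # schema-level dependency on how OrderService stores anything.
    orders = OrderService()
    orders.place_order("o-17", "Acme Co", 249.00)
    print(orders.get_order("o-17")["total"])   # 249.0

The point is the boundary: a change to how OrderService stores its data never ripples into the finance module, because the contract is all the finance module ever sees.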


What Scales

Assume that every team responsible for a process can choose its own best-of-breed software module and is responsible for maintaining its own data in appropriately secure data stores. Then maintaining an authoritative source of data becomes an API problem, not a database problem. When the system-of-record for each process is contained within its own module, new modules can be added to handle software-as-a-service, two-sided platforms, data from IoT sensors, customer outcomes, or whatever new business models may evolve. These modules will exchange a limited amount of data through well-defined APIs with the credit, order fulfillment, human resources, and financial modules. Internally, the new modules will collect, store, and act upon as much unstructured data and real-time information as may be useful. More importantly, these modules can be updated at any time, independent of other modules in the system. In addition, they can be replicated horizontally as scale demands. 

It is the API contract, not the central database, that assures each part of the company looks at the same data in the same way. Make no mistake, these API contracts are extremely important and must be carefully vetted by each data provider with all of its consumers. API contracts take the place of database schemas, and data providers must ensure that their data meets the standards of a valid system-of-record. However, changes to an API contract are handled differently than most database schema changes. Each change creates a new version of the API; both old and new versions remain valid while other software modules are gradually updated to use the new version. A wise API versioning strategy eliminates the tight coupling that makes database changes so slow and cumbersome. The reason federation scales – while a central database approach does not – is that with a well-defined API strategy, individual modules are not dependent on other modules, so each module can be deployed independently and (usually) scaled horizontally. 
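
The versioning idea is easy to sketch. In this hypothetical Python fragment, a provider keeps serving the old contract while consumers migrate to the new one at their own pace – no lockstep deployment required:

    # Side-by-side API versioning (hypothetical names). The provider
    # serves v1 and v2 at the same time while consumers migrate.

    def _fetch(customer_id: str) -> dict:
        # Stand-in for the module's private data store.
        return {"name": "Ada Lovelace"}

    def get_customer_v1(customer_id: str) -> dict:
        """Original contract: a single 'name' field."""
        record = _fetch(customer_id)
        return {"id": customer_id, "name": record["name"]}

    def get_customer_v2(customer_id: str) -> dict:
        """New contract: the name is split into given and family names."""
        record = _fetch(customer_id)
        given, _, family = record["name"].partition(" ")
        return {"id": customer_id, "given_name": given, "family_name": family}

    # Both versions stay routable until every consumer has moved to v2.
    ROUTES = {"/v1/customers": get_customer_v1,
              "/v2/customers": get_customer_v2}

    print(ROUTES["/v1/customers"]("c-1"))   # old consumers keep working
    print(ROUTES["/v2/customers"]("c-1"))   # new consumers get the new shape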

When you think of Enterprise ERP as a federation of independent modules communicating via APIs (rather than a database), the problems with multi-vendor ERP systems fade because the system-of-record is no longer a massive dependency-generator that requires lockstep deployments. With a federated approach, business leaders can move fast and experiment with different systems as they become available, and still synchronize critical enterprise data with the rest of the company. In addition, similar processes in different parts of the enterprise can use different applications to meet their unique needs without the significant tailoring expense encountered when a single ERP suite is imposed on the entire enterprise.


What about Standardization?

Won’t separate ERP modules lead to different processes in different parts of the enterprise? Yes, certainly. But the question is – under what circumstances are standard processes important? In the days of manual back office processes, there was a lot of labor-intensive work: drafting, accounting, phone calls, people moving paperwork from one desk to another. Standardization in this kind of operating environment made sense and could lead to significant efficiencies. But in a digitized world, the important thing is not uniformity; it is rapid and continuous improvement in each business area. Different processes for different problems in different contexts can be a very good thing.

Jeff Bezos agrees; he believes that the only path to serious scale is to have a lot of independent agents making their own decisions about the best way to do things. This belief was a key factor in the birth of Amazon Web Services, a $10 billion business that keeps on growing. Amazon began its journey away from a big back end by creating small, cross-functional teams with end-to-end responsibility for a service. These teams designed their own processes to fit their particular environment. Amazon then developed a software architecture and data center infrastructure that allowed these teams to operate and deploy independently. The rest is history.


In Conclusion

It is time for enterprise processes to become federated instead of integrated. This is not a new path – embedded software has used a similar architecture for decades. Today, almost every successful internet-scale business has adopted some type of federated approach, because it is the only way to scale beyond the limitations of an integrated enterprise. 

As digitization brings back-office teams closer to consumers and providers, they must join with their front-office colleagues and form teams that are fully capable of designing and improving a process or a line of business. These “full stack” teams should be responsible for managing their own practices, technology and data, meeting industry standards for their particular areas. They should communicate with other areas of the enterprise on demand through well-defined interfaces. 

The good news is that you can gradually migrate to a federation from almost any starting point, including an enterprise-wide ERP system. Even better, as IT moves from enforcing compliance with the company’s ERP system to brokering interface contracts and ensuring data security, it becomes a business enabler rather than a bottleneck. And best of all, responsible full stack teams that solve their own problems will create attractive jobs for talented engineers and give business units control over their own digital destiny. 

January 22, 2016

Five World-Changing Software Innovations

On the 15th anniversary of the Agile Manifesto, let's look at what else was happening while we were focused on spreading the Manifesto's ideals. There have been some impressive advances in software technology since Y2K:
  1. The Cloud
  2. Big Data
  3. Antifragile Systems
  4. Content Platforms
  5. Mobile Apps

The Cloud

In 2003 Nicholas Carr’s controversial article “IT Doesn’t Matter” was published in Harvard Business Review. He claimed that “the core functions of IT – data storage, data processing, and data transport” had become commodities, just like electricity, and they no longer provided differentiation. It’s amazing how right – and how wrong – that article turned out to be. At the time, perhaps 70% of an IT budget was allocated to infrastructure, and that infrastructure rarely offered a competitive advantage. On the other hand, since there was nowhere to purchase IT infrastructure as if it were electricity, there was a huge competitive advantage awaiting the company that figured out how to package and sell such infrastructure. 

At the time, IT infrastructure was a big problem – especially for rapidly growing companies like Amazon.com. Amazon had started out with the standard enterprise architecture: a big front end coupled to a big back end. But the company was growing much faster than this architecture could support. CEO Jeff Bezos believed that the only way to scale to the level he had in mind was to create small autonomous teams. Thus by 2003, Amazon had restructured its digital organization into small (two-pizza) teams, each with end-to-end responsibility for a service. Individual teams were responsible for their own data, code, infrastructure, reliability, and customer satisfaction.

Amazon’s infrastructure was not set up to deal with the constant demands of multiple small teams, so things got chaotic for the operations department. This led Chris Pinkham, head of Amazon’s global infrastructure, to propose developing a capability that would let teams manage their own infrastructure – a capability that might eventually be sold to outside companies. As the proposal was being considered, Pinkham decided to return to South Africa where he had gone to school, so in 2004 Amazon gave him the funding to hire a team in South Africa and work on his idea. By 2006 the team’s product, Elastic Compute Cloud (EC2), was ready for release. It formed the kernel of what would become Amazon Web Services (AWS), which has since grown into a multi-billion-dollar business.

Amazon has consistently added software services on top of the hardware infrastructure – services like databases, analytics, access control, content delivery, containers, data streaming, and many others. It’s sort of like an IT department in a box, where almost everything you might need is readily available. Of course Amazon isn’t the only cloud company – it has several competitors.

So back to Carr’s article – Does IT matter?  Clearly the portion of a company’s IT that could be provided by AWS or similar cloud services does not provide differentiation, so from a competitive perspective, it doesn’t matter. If a company can’t provide infrastructure that matches the capability, cost, accessibility, reliability, and scalability of the cloud, then it may as well outsource its infrastructure to the cloud.

Outsourcing used to be considered a good cost reduction strategy, but often there was no clear distinction between undifferentiated context (that didn’t matter) and core competencies (that did). So companies frequently outsourced the wrong things – critical capabilities that nurtured innovation and provided competitive advantage. Today it is easier to tell the difference between core and context: if a cloud service provides it then anybody can buy it, so it’s probably context; what’s left is all that's available to provide differentiation. In fact, one reason why “outsourcing” as we once knew it has fallen into disfavor is that today, much of the outsourcing is handled by cloud providers. 

The idea that infrastructure is context and the rest is core helps explain why internet companies do not have IT departments. For the last two decades, technology startups have chosen to divide their businesses along core and infrastructure lines rather than along technology lines. They put differentiating capabilities in the line business units rather than relegating them to cost centers, which generally works a lot better. In fact, many IT organizations might work better if they were split into two sections, one (infrastructure) treated as a commodity and the rest moved into (or changed into) a line organization. 


Big Data

In 2001 Doug Cutting released Lucene, a text indexing and search program, under the Apache software license. Cutting and Mike Cafarella then wrote a web crawler called Nutch to collect interesting data for Lucene to index. But now they had a problem – the web crawler could index only 100 million pages before it filled up the terabyte of storage they could easily fit on one machine. At the time, managing large amounts of data across multiple machines was not a solved problem; most large enterprises stored their critical data in a single database running on a very large computer. 

But the web was growing exponentially, and when companies like Google and Yahoo set out to collect all of the information available on the web, currently available computers and databases were not even close to big enough to store and analyze all of that data. So they had to solve the problem of using multiple machines for data storage and analysis. 

One of the bigger problems with using multiple machines is the increased probability that one of the machines will fail. Early in its history, Google decided to accept the fact that at its scale, hardware failure was inevitable, so it should be managed rather than avoided. This was accomplished by software which monitored each computer and disk drive in a data center, detected failure, kicked the failed component out of the system, and replaced it with a new component. This process required keeping multiple copies of all data, so when hardware failed the data it held was available in another location. Since recovering from a big failure carried more risk than recovering from a small failure, the data centers were stocked with inexpensive PC components that would experience many small failures. The software needed to detect and quickly recover from these “normal” hardware failures was perfected as the company grew. 
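
The detect-evict-replace loop can be sketched in a few lines. This is a toy illustration in Python (hypothetical names throughout), not Google's actual data center software:

    import random

    # Toy version of the detect / evict / replace loop. A node "fails"
    # at random; the monitor kicks it out and promotes a spare.

    cluster = {f"node-{i}" for i in range(8)}
    spares = [f"spare-{i}" for i in range(4)]

    def healthy(node: str) -> bool:
        # Stand-in for a real heartbeat; here a node fails 5% of the time.
        return random.random() > 0.05

    def monitor_pass():
        for node in list(cluster):
            if not healthy(node):
                cluster.discard(node)              # evict the failed component
                if spares:
                    replacement = spares.pop()     # swap in a new one
                    cluster.add(replacement)
                    print(f"{node} failed; replaced by {replacement}")

    for _ in range(10):   # in reality, this loop never stops
        monitor_pass()

The data replication described above is what makes the eviction safe: when a node disappears, copies of its data already exist elsewhere.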

In 2003 Google employees published two seminal papers describing how the company dealt with the massive amounts of data it collected and managed. Web Search for a Planet: The Google Cluster Architecture by Luiz André Barroso, Jeffrey Dean, and Urs Hölzle described how Google managed its data centers with their inexpensive components. The Google File System by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung described how the data was managed by dividing it into small chunks and maintaining multiple copies (typically three) of each chunk across the hardware. I remember that my reaction to these papers was “So that’s how they do it!” And I admired Google for sharing these sophisticated technical insights. 

Cutting and Cafarella had approximately the same reaction. Using the Google File System as a model, they spent 2004 working on a distributed file system for Nutch. The system abstracted a cluster of storage into a single file system running on commodity hardware, used relaxed consistency, and hid the complexity of load balancing and failure recovery from users. 
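
The core idea – split files into chunks and keep several copies of each chunk on different machines – fits in a few lines. Here is a toy Python sketch of the placement scheme, not the actual GFS or Hadoop code (real systems use chunks of 64 MB or more; this uses 10 bytes so the output is visible):

    import itertools

    # Toy sketch of GFS-style chunking: split data into fixed-size
    # chunks and place three replicas of each on different machines.

    CHUNK_SIZE = 10
    REPLICAS = 3
    servers = [f"server-{i}" for i in range(5)]

    def store(data: bytes) -> dict:
        placement = {}
        rotation = itertools.cycle(servers)
        for n, start in enumerate(range(0, len(data), CHUNK_SIZE)):
            chunk = data[start:start + CHUNK_SIZE]
            # Each chunk lands on three distinct servers, so any single
            # machine can fail without losing data.
            placement[n] = (chunk, [next(rotation) for _ in range(REPLICAS)])
        return placement

    for n, (chunk, where) in store(b"the quick brown fox jumps over it").items():
        print(n, chunk, where)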

In the fall of 2004, the next piece of the puzzle – analyzing massive amounts of stored data – was addressed by another Google paper: MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat. Cutting and Cafarella spent 2005 rewriting Nutch and adding MapReduce, which they released as Apache Hadoop in 2006. At the same time, Yahoo decided it needed to develop something like MapReduce, and settled on hiring Cutting and building Apache Hadoop into software that could handle its massive scale. Over the next couple of years, Yahoo devoted a lot of effort to converting Apache Hadoop – open source software – from a system that could handle a few servers to a system capable of dealing with web-scale databases. In the process, their data scientists and business people discovered that Hadoop was as useful for business analysis as it was for web search. 
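
The programming model behind MapReduce is simple to sketch: a map function emits key/value pairs, the framework groups them by key, and a reduce function folds each group. Here is a toy, single-machine word count in Python – the real framework distributes exactly these phases across a cluster:

    from collections import defaultdict

    # Toy, single-machine sketch of the MapReduce programming model.
    # Only map_fn and reduce_fn resemble what a user actually writes.

    def map_fn(line: str):
        for word in line.split():
            yield (word.lower(), 1)           # emit (key, value) pairs

    def reduce_fn(word: str, counts: list):
        return (word, sum(counts))            # fold all values for one key

    def map_reduce(lines):
        groups = defaultdict(list)
        for line in lines:                    # map phase
            for key, value in map_fn(line):
                groups[key].append(value)     # shuffle: group by key
        return [reduce_fn(k, v) for k, v in groups.items()]  # reduce phase

    print(map_reduce(["the cat sat", "the cat ran"]))
    # [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)]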

By 2008, most web scale companies in Silicon Valley – Twitter, Facebook, LinkedIn, etc. – were using Apache Hadoop and contributing their improvements. Then startups like Cloudera were founded to help enterprises use Hadoop to analyze their data. What made Hadoop so attractive? Until that time, useful data had to be structured in a relational database and stored on one computer. Space was limited, so you only kept the current value of any data element. Hadoop could take unlimited quantities of unstructured data stored on multiple servers and make it available for data scientists and software programs to analyze. It was like moving from a small village to a megalopolis – Hadoop opened up a vast array of possibilities that are just beginning to be explored.

In 2011 Yahoo found that its Hadoop engineers were being courted by the emerging Big Data companies, so it spun off Hortonworks to give the Hadoop engineering team their own Big Data startup to grow. By 2012, Apache Hadoop (still open source) had so many data processing appendages built on top of the core software that MapReduce was split off from the underlying distributed file system. The cluster resource management that used to be in MapReduce was replaced by YARN (Yet Another Resource Negotiator). This gave Apache Hadoop another growth spurt, as MapReduce joined a growing number of analytical capabilities that run on top of YARN. Apache Spark is one of those analytical layers which supports data analysis tools that are more sophisticated and easier to use than MapReduce. Machine learning and analytics on data streams are just two of the many capabilities that Spark offers – and there are certainly more Hadoop tools to come. The potential of Big Data is just beginning to be tapped. 
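
For comparison, the same word count in Spark's Python API is a few chained calls – one reason Spark is considered easier to use than raw MapReduce. A sketch that assumes a local Spark installation and an input file named logs.txt:

    from pyspark.sql import SparkSession

    # Word count with Spark's Python API. Assumes a local Spark
    # installation and an input file named logs.txt.
    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    counts = (spark.sparkContext.textFile("logs.txt")
              .flatMap(lambda line: line.split())   # map: emit words
              .map(lambda word: (word, 1))          # pair each with a count
              .reduceByKey(lambda a, b: a + b))     # reduce: sum per word

    print(counts.take(10))
    spark.stop()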

In the early 1990s Tim Berners-Lee worked to ensure that CERN made his underlying code for HTML, HTTP and URLs available on a royalty-free basis, and because of that we have the world wide web. Ever since, software engineers have understood that the most influential technical advances come from sharing ideas across organizations, allowing the best minds in the industry to come together and solve tough technical problems. Big Data is as capable as it is because Google and Yahoo and many other companies were willing to share their technical breakthroughs rather than keep them proprietary. In the software industry we understand that we do far better as individual companies when the industry as a whole experiences major technical advances. 


Antifragile Systems

It used to be considered unavoidable that as software systems grew in age and complexity, they became increasingly fragile. Every new release was accompanied by fear of unintended consequences, which triggered extensive testing and longer periods between releases. However, the “failure is not an option” approach is not viable at internet scale – because things will go wrong in any very large system. Ignoring the possibility of failure – and focusing on trying to prevent it – simply makes the system fragile. When the inevitable failure occurs, a fragile system is likely to break down catastrophically.[1]  

Rather than prevent failure, it is much more important to identify and contain failure, then recover with a minimum of inconvenience for consumers. Every large internet company has figured this out. Amazon, Google, Etsy, Facebook, Netflix and many others have written or spoken about their approach to failure. Each of these companies has devoted a lot of effort to creating robust systems that can deal gracefully with unexpected and unpredictable situations.

Perhaps the most striking among these is Netflix, which has a good number of reliability engineers despite the fact that it has no data centers. Netflix’s approach was described in 2013 by Ariel Tseitlin in the article The Antifragile Organization: Embracing Failure to Improve Resilience and Maximize Availability.  The main way Netflix increases the resilience of its systems is by regularly inducing failure with a “Simian Army” of monkeys: Chaos Monkey does some damage twice an hour, Latency Monkey simulates instances that are sick but still working, Conformity Monkey shuts down instances that don’t adhere to best practices, Security Monkey looks for security holes, Janitor Monkey cleans up clutter, Chaos Gorilla simulates failure of an AWS availability zone and Chaos Kong might take a whole Amazon region off line. I was not surprised to hear that during a recent failure of an Amazon region, Netflix customers experienced very little disruption.
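
The principle behind the Simian Army is easy to sketch: during working hours, pick a random instance and kill it, then watch whether the system heals itself. A toy Python illustration of the idea, not Netflix's actual tooling:

    import random
    from datetime import datetime

    # Toy illustration of the Chaos Monkey idea: terminate one random
    # instance during working hours and verify the system recovers.

    instances = [f"web-{i}" for i in range(6)]

    def chaos_monkey_pass():
        now = datetime.now()
        # Only break things when engineers are around to watch and learn.
        if now.weekday() < 5 and 9 <= now.hour < 17 and instances:
            victim = random.choice(instances)
            instances.remove(victim)
            print(f"chaos-monkey: terminated {victim}")

    chaos_monkey_pass()
    print(f"{len(instances)} instances left; autoscaling should replace the loss")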

A Simian Army isn’t the only way to induce failure. Facebook’s motto “Move Fast and Break Things” is another approach to stressing a system. In 2015, Ben Maurer of Facebook published Fail at Scale – a good summary of how internet companies keep very large systems reliable despite failure induced by constant change, traffic surges, and hardware failures. 

Maurer notes that the primary goal for very large systems is not to prevent failure – this is both impossible and dangerous. The objective is to find the pathologies that amplify failure and keep them from occurring. Facebook has identified three failure-amplifying pathologies: 

1. Rapidly deployed configuration changes
Human error is amplified by rapid changes, but rather than decrease the number of deployments, companies with antifragile systems move small changes through a release pipeline. Here changes are checked for known errors and run in a limited environment. The system quickly reverts to a known good configuration if (when) problems are found. Because the changes are small and gradually introduced into the overall system under constant surveillance, catastrophic failures are unlikely. In fact, the pipeline increases the robustness of the system over time.
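
A pipeline stage like this is mostly a comparison and a decision: expose the small change to a slice of traffic, compare error rates against the current configuration, and revert automatically if things degrade. A hypothetical Python sketch, where error_rate stands in for real production monitoring:

    # Hypothetical sketch of a canary step in a release pipeline.
    # error_rate(config) stands in for real production monitoring.

    def deploy_canary(new_config, current_config, error_rate, threshold=0.01):
        baseline = error_rate(current_config)
        canary = error_rate(new_config)     # measured on a small traffic slice
        if canary > baseline + threshold:
            print("canary degraded - reverting to known good configuration")
            return current_config           # automatic rollback
        print("canary healthy - continuing gradual rollout")
        return new_config

    active = deploy_canary(
        new_config={"cache_ttl": 30},
        current_config={"cache_ttl": 60},
        error_rate=lambda cfg: 0.002 if cfg["cache_ttl"] == 60 else 0.004,
    )
    print("active configuration:", active)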

2. Hard dependencies on core services
Core services fail just like anything else, so code has to be written with that in mind. Generally, hardened APIs that embody best practices are used to invoke these services. Core services and their APIs are gradually improved by intentionally injecting failure into a core service to expose weaknesses, which are then corrected as failure modes are identified.
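
In practice, a “hardened” call path bakes in a timeout, bounded retries, and a fallback, so a failing core service degrades the caller instead of crashing it. A generic Python sketch with hypothetical names, not Facebook's code:

    import time

    # Generic sketch of a hardened call to a core service: timeout,
    # bounded retries with backoff, and a graceful fallback response.

    def call_core_service(request, send, retries=2, backoff=0.1):
        for attempt in range(retries + 1):
            try:
                return send(request, timeout=0.2)     # never wait forever
            except TimeoutError:
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
        return {"status": "degraded", "data": None}   # fallback response

    def flaky_send(request, timeout):
        raise TimeoutError    # a stand-in core service that is down

    print(call_core_service({"user": 42}, flaky_send))
    # {'status': 'degraded', 'data': None}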

3. Increased latency and resource exhaustion
Best practices for avoiding the well-known problem of resource exhaustion include managing server queues wisely and having clients track outstanding requests. It’s not that these strategies are unknown, it’s that they must become common practice for all software engineers in the organization. 
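
Both practices fit in a short sketch: the server sheds load once its queue is full instead of letting latency grow without bound, and the client caps how many requests it has in flight. A hypothetical Python illustration:

    from collections import deque

    # The server sheds load once its queue is full; the client caps
    # how many requests it has outstanding. Hypothetical names.

    class BoundedServer:
        def __init__(self, max_queue=100):
            self.queue, self.max_queue = deque(), max_queue

        def submit(self, request) -> bool:
            if len(self.queue) >= self.max_queue:
                return False       # shed load instead of growing latency
            self.queue.append(request)
            return True

    class TrackingClient:
        def __init__(self, server, max_outstanding=10):
            self.server, self.max_outstanding = server, max_outstanding
            self.outstanding = 0

        def request(self, payload) -> bool:
            if self.outstanding >= self.max_outstanding:
                return False       # back off; don't pile onto a slow server
            if self.server.submit(payload):
                self.outstanding += 1
                return True
            return False

        def on_response(self):
            self.outstanding -= 1  # a completed reply frees a slot

    client = TrackingClient(BoundedServer())
    sent = [client.request(i) for i in range(11)]
    print(sent.count(True))   # 10: the 11th is refused until replies return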

Well-designed dashboards, effective incident response, and after-action reviews that implement countermeasures to prevent recurrence round out Facebook's toolkit for keeping its very large systems reliable.

We now know that fault tolerant systems are not only more robust, but also less risky than systems which we attempt to make failure-free. Therefore, common practice for assuring the reliability of large-scale software systems is moving toward software-managed release pipelines which orchestrate frequent small releases, in conjunction with failure induction and incident analysis to produce hardened infrastructure.


Content Platforms

Video is not new; television has been around for a long time, film for even longer. As revolutionary as film and TV have been, they push content to a mass audience; they do not inspire engagement. An early attempt at visual engagement was the PicturePhone of the 1970s – a textbook example of a technical success and a commercial disaster. They got the PicturePhone use case wrong – not many people really wanted to be seen during a phone call. Videoconferencing did not fare much better – because few people understood that video is not about improving communication, it’s about sharing experience. 

In 2005, amidst a perfect storm of increasing bandwidth, decreasing cost of storage, and emerging video standards, three entrepreneurs – Chad Hurley, Steve Chen, and Jawed Karim – tried out an interesting use case for video: a dating site. But they couldn’t get anyone to submit “dating videos,” so they accepted any video clips people wanted to upload. They were surprised at the videos they got: interesting experiences, impressive skills, how-to lessons – not what they expected, but at least it was something. The YouTube founders quickly added a search capability. This time they got the use case right and the rest is history. Video is the printing press of experience, and YouTube became the distributor of experience. Today, if you want to learn the latest unicycle tricks or how to get the back seat out of your car, you can find it on YouTube. 

YouTube was not the first successful content platform. Blogs date back to the late 1990s, when they began as diaries on personal web sites shared with friends and family. Then media companies began posting breaking news on their web sites to get their stories out before their competitors. Blogger, one of the earliest blog platforms, was launched just before Y2K and acquired by Google in 2003 – the same year WordPress was launched. As blogging popularity grew over the next few years, the use case shifted from diaries and news articles to ideas and opinions – and blogs increasingly resembled magazine articles. Those short diary entries meant for friends were more like scrapbooks; they came to be called tumblelogs or microblogs. And – no surprise – separate platforms for these microblogs emerged: Tumblr in 2006 and Twitter in 2007.

One reason why blogs drifted away from diaries and scrapbooks is that alternative platforms emerged aimed at a very similar use case – which came to be called social networking. MySpace was launched in 2003 and became wildly popular over the next few years, only to be overtaken by Facebook, which was launched in 2004. 

Many other public content platforms have come (and gone) over the last decade; after all, a successful platform can usually be turned into a significant revenue stream. But the lessons learned by the founders of those early content platforms remain best practices for two-sided platforms today:

  1. Get the use case right on both sides of the platform. Very few founders got both use cases exactly right to begin with, but the successful ones learned fast and adapted quickly. 
  2. Attract a critical mass to both sides of the platform. Attracting enough traffic to generate network effects requires a dead simple contributor experience and an addictive consumer experience, plus a receptive audience for the initial release.
  3. Take responsibility for content even if you don’t own it. In 2007 YouTube developed ContentID to identify copyrighted audio clips embedded in videos and make it easy for contributors to comply with attribution and licensing requirements. 
  4. Be prepared for and deal effectively with stress. Some of the best antifragile patterns came from platform providers coping with extreme stress such as the massive traffic spikes at Twitter during natural disasters or hectic political events.

In short, successful platforms require insight, flexibility, discipline, and a lot of luck. Of course, this is the formula for most innovation. But don't forget: no matter how good your process is, you still need the luck part. 


Mobile Apps

It’s hard to imagine what life was like without mobile apps, but they did not exist a mere eight years ago. In 2008 both Apple and Google released content platforms that allowed developers to get apps directly into the hands of smart phone owners with very little investment and few intermediaries. By 2014 (give or take a year, depending on whose data you look at) mobile apps had surpassed desktops as the path people take to the internet. It is impossible to ignore the importance of the platforms that make mobile apps possible, or the importance of the paradigm shift those apps have brought about in software engineering. 

Mobile apps tend to be small and focused on doing one thing well – after all, a consumer has to quickly understand what the app does. By and large, mobile apps do not communicate with each other, and when they do it is through a disciplined exchange mediated by the platform. Their relatively small size and isolation make it natural for each individual app to be owned by a single, relatively small team that accepts the responsibility for its success. As we saw earlier, Amazon moved to small autonomous teams a long time ago, but it took a significant architectural shift for those teams to be effective. Mobile apps provide a critical architectural shift that makes small independent teams practical, even in monolithic organizations. And they provide an ecosystem that allows small startups to compete effectively with those organizations.  

The nature of mobile apps changes the software development paradigm in other ways as well. As one bank manager told me, “We did our first mobile app as a project, so we thought that when the app was released, it was done. But every time there was an operating system update, we had to update the app. That was a surprise! There are so many phones to test and new features coming out that our apps are in a constant state of development. There is no such thing as maintenance – or maybe it's all maintenance.”

The small teams, constant updates, and direct access to the deployed app have created a new dynamic in the IT world: software engineers have an immediate connection with the results of their work. App teams can track usage, observe failures and track metrics – then make changes accordingly. More than any other technology, mobile platforms have fostered the growth of small, independent product teams – with end-to-end responsibility – that use short feedback loops to constantly improve their offering. 

Let’s return to luck. If you have a large innovation effort, it probably has a 20% chance of success at best. If you have five small, separate innovation efforts, each with a 20% chance of success, you have a much better chance that one of them will succeed – as long as they are truly autonomous and are not tied to an inflexible back end or flawed use case. Mobile apps create an environment where it can be both practical and advisable to break products into small, independent experiments, each owned by its own “full stack” team.[2] The more of these teams you have pursuing interesting ideas, the more likely it is that some of those ideas will become the innovative offerings that propel your company into the future. 


What about “Agile”?

You might notice that “Agile” is not on my list of innovations. And yet, agile values are found in every major software innovation since the Agile Manifesto was articulated in 2001. Agile development does not cause innovation; it is meant to create the conditions necessary for innovation: flexibility and discipline, customer understanding and rapid feedback, small teams with end-to-end responsibility. Agile processes do not manufacture insight and they do not create luck. That is what people do.  
____________________________
Footnotes:
1.    “the problem with artificially suppressed volatility is not just that the system tends to become extremely fragile; it is that, at the same time, it exhibits no visible risks… Such environments eventually experience massive blowups… catching everyone off guard and undoing years of stability or, in almost all cases, ending up far worse than they were in their initial volatile state. Indeed, the longer it takes for the blowup to occur, the worse the resulting harm…” Antifragile, Nassim Taleb, p. 106

2.   A full stack team contains all the people necessary to make things happen in not only the full technology stack, but also in the full stack of business capabilities necessary for the team to be successful.