Friday, September 30, 2016

The Two Sides of Teams

Collective wisdom outweighs individual insights

Most of us believe that collective wisdom outweighs individual insights – or do we?

Perhaps the biggest shortcoming of agile development practices is the way in which teams decide what to do. What product should be built? What features are most important? What consumer experiences will work best? These are the most important questions for the success of any product, and yet for the longest time, answering these questions have not been considered the responsibility of the development team or the DevOps team.

Historically, someone with the role of business analyst, project manager, or product manager made the critical decisions about what to build. Or maybe some third party wrote a specification. While the technical team might question or push back on product decisions, too often the ideas and priorities were expected to come from outside. For example, the Scrum Product Owner role is often implemented in a way that favors individual insight over collective wisdom when it comes to critical product ideas and priorities.

Until recently, there hasn’t been a practical process for tapping into the collective wisdom of everyone on the development team when making key product decisions. But now there is: it’s called the “Design Sprint.”[1]  Combining a design thinking approach with the timeboxing of an agile sprint, this is a process that captures the collective wisdom of a diverse group of people. During the five-day process, the group not only makes critical product decisions, it creates prototypes and validates hypotheses with real customers as part of the process.

Design sprints were developed by Google Ventures to help the companies in its portfolio uncover a variety of product ideas and quickly sort out the good ideas out from the mediocre ones. Design sprints have been used at hundreds of companies with amazing success. While the Lean Startup approach starts by building a Minimum Viable Product (MVP) to test ideas, design sprints are a way to avoid building the MVP until you are sure you are starting with a good idea. They help you sort through a lot more ideas before starting to code.

Where do all those good ideas come from? Design sprints do not depend on individuals or roles to generate ideas; the ideas are generated and validated by a diverse team tackling a tough problem. The insights of engineering and operations and support are combined with those of product and business and marketing to create true collective wisdom.

There are a couple of roles in design sprint; one is a “decider.” The decider generally avoids making any decisions unless called upon by teams that do not have enough information to make the decision themselves, yet need to make a choice in order to proceed. In a small company, this might be the CEO; in a larger company it is more likely a product manager. But let’s be clear – the decider is a leader who articulates a vision and strategy, but she does not usually come up with ideas, set priorities or select features. That is what teams do.

Another recommended role is someone Google calls a “sprintmaster” – a facilitator who plans, leads, and follows up on a five-day design sprint. This person is almost always a designer, because the facilitator’s job is to help teams use design thinking and design techniques to answer key product questions. For example, on the second day of the sprint, everyone develops their own ideas through a series of individual sketches; on the third day, teams review the sketches jointly and create a storyboard for a prototype – or maybe a few prototypes. On the fourth day, the prototypes are created, usually with design tools. On the fifth day, the prototypes are tested with real consumers as the team observes. When most of the people on a team have no design experience, it helps to have a designer lead them through the design process.

Really good teams generate a lot of ideas. These ideas are quickly validated with real consumers and perhaps 10 or 20% of the ideas survive. This low survival rate is a good thing; investigating a lot of ideas dramatically increases the chances that one of them will be a winner. The trick is to have a very fast way for teams to generate, validate, and select the ideas that are worth pursuing – and the design sprint provides one good option.

Of course, success requires a lot more than a diverse team and a good process.

Deliberation Makes a Group Dumber

Most of us would be surprised by the idea that deliberation makes a group dumber. But that is the conclusion reached by respected authors Cass Sunstein and Reid Hastie in their sobering book Wiser: Getting Beyond Groupthink to Make Groups Smarter. The two set out to study the cognitive biases of teams, and found that groupthink plays a bigger role in group decision-making than most of us realize.

There is no advantage in diversity on a team if those who are in the minority – those who are different or soft-spoken or are working in their second language – do not feel comfortable about sharing their unique perspective. Yet Sunstein and Hastie note that in most groups, deliberation is very likely to suppress insights that diverge from the first ideas expressed (anchoring bias) or the majority viewpoint (conformity bias).

Brainstorming has come under criticism – for good reason – as a technique that favors talkative and confident team members over thoughtful members and those with undeveloped hunches. Brainwriting[2] is an alternative to brainstorming that gives individuals time to think individually about the problem at hand and come up with ideas based on their unique background. Brainwriting is used during on the second day of a design sprint, when individuals sketch their solution to the chosen problem. This gives everyone the time and space to develop their ideas, as well as a way to have these ideas anonymously presented to and discussed by the group.

After a brainwriting exercise, a group will have generated maybe 40% more ideas than brainstorming. Typically, a technique such as dot voting is used to prioritize the many ideas and select the best ones to pursue. Unfortunately, this is another technique that favors groupthink. Voting is likely to weed out hunches and fragile ideas before they have time to develop, so outlier ideas that come from those who think differently tend to be lost in a voting process.

The lean approach to product development is pretty much the opposite of voting. Instead of narrowing options early, the lean strategy is to pursue multiple ideas that span the design space, gradually eliminating the ones that do not work. In a lean world, teams would not prioritize and select the most popular ideas after brainwriting – selection at this stage would be premature. Instead, teams would identify several very different ideas to explore, making sure to include outliers.

It is important to ensure that the ideas which survive the selection process span a wide range of possibilities – otherwise much of the benefit of brainwriting is lost. One way to do this is to select ideas that have a champion eager to pursue the idea and one or two people interested in joining her. If small sub-teams are encouraged to explore the possibilities of outlier ideas, the group is more likely to benefit from its diversity. By giving those with minority opinions not only the opportunity to present their ideas but also the time and space to give their ideas a try, a much wider variety of ideas will be seriously considered.

Consider this example: Matthew Ogle joined Spotify’s New York office in early 2015. For years he had been working on the problem of helping people discover appealing music, most recently in his own startup. He joined a Spotify team developing a discovery page, but he thought the process involved too much work – he thought discovery should be automatic. This was a radical idea at Spotify – so luckily, Ogle’s team did not vote on whether it ought to be pursued, because it would probably have died.

Instead, Ogle joined Edward Newett, an engineer and expert at deep learning who was experimenting with the idea of a discovery playlist, to explore the possibility. When Ogle realized that algorithms could generate a playlist that was uncannily well matched to his tastes, he knew they were on to a good idea. The next step was to find a way to check out these magic playlists with more people.

They tried an unusual approach – they generated playlists matched to Spotify employees’ tastes and sent them out with an email asking for feedback. Almost everyone loved their playlist, and it became clear that this idea was a winner. Through a lot of quick experiments, the idea was improved, and soon playlists were delivered to a few customers under the name “Discover Weekly.” As it scaled up, Discover Weekly proved to be wildly popular and has become a dramatic success.

The Two Sides of Teams

There are two sides to teams. There is the side that needs to make its own decisions and the side that can turn decision-making into groupthink. There is the side that wants to leverage diversity and the side that tends to ignore the input from team members who are different. The point is this: if you believe in collective wisdom, be sure to collect all of the wisdom that is available. If you look closely and honestly at your current processes and team dynamics, you might be surprised at how much wisdom is locked in the minds of individuals who don’t feel comfortable participating in the give and take of a dynamic team.

____________________________
Footnotes:

[1] See: Sprint: How to Solve Big Problems and Test New Ideas in Just Five Days  by Jake Knapp, John Zeratsky, and Braden Kowitz. For a quick “how to” summary, see: https://developers.google.com/design-sprint/downloads/DesignSprintMethods.pdf

[2] Brainstorming Doesn't Work; Try This Technique Instead

Thursday, June 16, 2016

Integration Does. Not. Scale.

In times past, there was a difference between the front office of a business – designed to make a good impression – and the back office – a utilitarian place where most of the routine work got done. The first (and for a long time the predominant) use of computers in business centered around automating back office processes, so of course, IT was relegated to the back office. 

As businesses grew, various back office functions developed their own computer systems – one for purchasing, one for payroll, one for manufacturing, and so on. The manufacturing system in vogue when I was in a factory was called MRP – Material Requirements Planning. As time went on, MRP systems were expanded to the supply chain, and then to the rest of the business, where they acquired the name ERP – Enterprise Resource Planning.

Over time it became obvious that the disparate systems for each function were handling the same data in different ways, making it difficult to coordinate across functions. So IT departments worked to create a single data repository, which quite often resided in the ERP system. The ERP suite of tools expanded to include most back office processes, including customer relationship management, order processing, human resources, and financial management. 

The good news was that now all the enterprise data could be found in the single database managed by the ERP system. The bad news was that the ERP system became complex and slow. Even worse, enterprise processes had to either conform to “best practices” supported by the ERP suite or the ERP system had to be customized to support unique processes. In either case, these changes took a long time.

ERP Systems Meet Digital Organizations  

As enterprise IT focused on implementing ERP suites and developing an authoritative system of record, the Internet became a platform for a whole new category of software, spawning new business models that did not fit into the traditional processes managed by ERP systems. Here are a few examples:

  1. Many software offerings that used to be sold as products are now being sold “as a service”. However, ERP systems were designed to manage the manufacture and distribution of physical products; they don’t generally manage subscription services.
  2. Some companies (Google for example) give away their services and sell advertising. Other companies (such as EBay and Airbnb) create platforms that unite consumers with suppliers, often disrupting traditional industries. In a platform business, the most critical processes focus on driving network effects by facilitating interactions between buyers and sellers. Although ERP systems can manage both suppliers and customers, they usually do not focus on the interactions between them.
  3. The Internet of Things (IoT) brings real time data into many processes, changing the way they are best executed. For example, predictive maintenance of heavy equipment can be scheduled based on sensor data, resulting in better outcomes for customers and thus for the enterprise. ERP suites are intended to support standard practices; they struggle to support processes that change dynamically in response to digital input.
  4. Capitalizing on the availability of data generated by products, companies are moving to selling business outcomes rather than individual products (GE is an example). When you are selling engine thrust or lighting costs, rather than engines or lightbulbs, processes need to be focused on the customer context. ERP systems generally focus on internal processes.
  5. ERP systems are supposed to provide a single, integrated record of important enterprise data, but that data rarely includes dynamic product performance data, information about consumer characteristics and preferences, or other information that has come to be called “Big Data”. This kind of information is becoming an extremely valuable resource, but there isn’t room in ERP databases to store and manage the massive amount of interesting data that is available.
In summary, digitization is bringing the back office much closer to the front office, providing the data for dynamic decision-making, and substituting short feedback loops and data-driven interactions for “best practices.” Since enterprise ERP suites were not built for speed or rapidly changing processes, they are increasingly being supplemented with other systems that manage critical enterprise processes.


Postmodern ERP

In the last few years, in the wake of the success of Salesforce.com, many cloud-based software services have become available. Some target the entire enterprise (NetSuite for example), but many are focused on particular areas (e.g. human resources) or particular industries (e.g. construction). These services are finding an eager audience – even in companies that have existing ERP systems. Today, about 30% of the spend for IT systems is coming from business units outside of IT [1]. If they cannot get the software they need from their IT departments, business leaders are likely to purchase cloud-based services instead.

The cloud reduces dependence on a company’s IT department, so it has become quite easy for various areas of the enterprise to independently adopt “best-of-breed” solutions specifically targeted at their needs, rather than use a single ERP suite across the enterprise. These best-of-breed systems are usually selected by line business leaders and hosted in the cloud. They tend to be faster to implement and more responsive to changing business situations than the enterprise ERP suite – partly because they are decoupled from the rest of the enterprise. Gartner calls the movement from a single ERP suite to a collection of ERP modules from multiple vendors “Postmodern ERP”[2]. 

Gartner warns that a multi-vendor ERP approach can lead to significant integration problems, and recommends that multiple vendors should not be used until the integration issues are sorted out. Of course, business leaders want to know why integration is important. IT departments typically respond that the ERP’s central database is the enterprise system-of-record; other ERP modules – financial reporting, for example – depend on this database for critical data. Without an integrated database, how will the rest of the enterprise be able to operate? How will the accounting department produce its required financial reports? 


Integration Does. Not. Scale

But hold on. There are plenty of very large companies that work remarkably well – and produce financial reports on time – without an integrated system-of-record. In fact, internet-scale companies have discovered that integration does not scale. If we go back to the year 2000, we find that Amazon.com had a traditional architecture – a big front end and a big back end – which got slower and slower as volume grew.  Eventually Amazon abandoned its integrated backend database in the early 2000’s, in favor of independent services that manage their own data and communicate with each other exclusively through clearly defined interfaces. 

If we have learned one thing from internet-scale players, it’s that true scale is not about integration, it is about federation. Amazon runs a massive order fulfillment business on a platform built out of small, independently deployable, horizontally scalable services. Each service is owned by a responsible team that decides what data the service will maintain and how that data will be exposed to other services. Netflix operates with the same architecture, as do many other internet-scale companies. In fact, adopting federated services is a proven approach for organizations that wish to scale to beyond their current limitations. 

Let’s revisit the enterprise where business units prefer to run best-of-breed ERP modules to handle the specific needs of their business. This enterprise has two choices: 

  1. Integrate the various ERP modules and store their data in a single ERP database.
  2. Coordinate independently-maintained enterprise data through API contracts. 

The problem with the first option is that integration creates dependencies across the enterprise. Each time a data definition in the central database is added or changed, every software module that uses the database must be updated to match the new schema. This makes the integrated database a massive dependency generator; the result is a monolithic code base where changes are slow and painful. 

Enterprises that want to move fast will select the second option. They will move to a federated architecture in which each module owns and maintains its own data, with data moving between modules via very well defined and stable interfaces. As radical as this approach may seem, internet-scale businesses have been living with services and local data stores for quite a while now, and they have found that managing interface contracts is no more difficult than managing a single, integrated database.


What Scales

Assume that every team responsible for a process can choose its own best-of-breed software module and is responsible for maintaining its own data in appropriately secure data stores. Then maintaining an authoritative source of data becomes an API problem, not a database problem. When the system-of-record for each process is contained within its own modules, new modules can be added for handling software-as-a-service, two-sided platforms, data from IoT sensors, customer outcomes or other new business model that may evolve. These modules will exchange a limited amount of data through well-defined API’s with the credit, order fulfillment, human resources, and financial modules. Internally, the new modules will collect, store, and act upon as much unstructured data and real time information as may be useful. More importantly, these modules can be updated at any time, independent of other modules in the system. In addition, they can be replicated horizontally as scale demands. 

It is the API contract, not the central database, that assures each part of the company looks at the same data in the same way. Make no mistake, these API contracts are extremely important and must be carefully vetted by each data provider with all of its consumers. API contracts take the place of database schema, and data providers must ensure that their data meets the standards of a valid system-of-record. However, changes to an API contract are handled differently than most database schema changes. Each change creates a new version of the API; both old and new versions remain valid while other software modules are gradually updated to use the new version. A wise API versioning strategy eliminates the tight coupling that makes database changes so slow and cumbersome. The reason why federation scales – while a central database approach does not scale – is because with a well-defined API’s strategy, individual modules are not dependent on other modules, so each module can be deployed independently and (usually) scaled horizontally. 

When you think of Enterprise ERP as a federation of independent modules communicating via API’s (rather than a database), the problems with multi-vendor ERP systems fade because the system-of-record is no longer a massive dependency-generator that requires lockstep deployments. With a federated approach, business leaders can move fast and experiment with different systems as they become available, and still synchronize critical enterprise data with the rest of the company. In addition, similar processes in different parts of the enterprise can use different applications to meet their unique needs without the significant tailoring expense encountered when a single ERP suite is imposed on the entire enterprise.


What about Standardization?

Won’t separate ERP modules lead to different processes in different parts of the enterprise? Yes, certainly. But the question is – under what circumstances are standard processes important? In the days of manual back office processes, there was lot of labor-intensive work: drafting, accounting, phone calls, people moving paperwork from one desk to another. Standardization in this kind of operating environment made sense and could lead to significant efficiencies. But in a digitized world, the important thing is not uniformity; it is rapid and continuous improvement in each business area. Different processes for different problems in different contexts can be a very good thing.

Jeff Bezos agrees; he believes that the only path to serious scale is to have a lot of independent agents making their own decisions about the best way to do things. This belief was a key factor in the birth of Amazon Web Services, a $10 billion business that keeps on growing. Amazon began its journey away from a big back end by creating small, cross-functional teams with end-to-end responsibility for a service. These teams designed their own processes to fit their particular environment. Amazon then developed a software architecture and data center infrastructure that allowed these teams to operate and deploy independently. The rest is history.


In Conclusion

It is time for enterprise processes become federated instead of integrated. This is not a new path – embedded software has used a similar architecture for decades. Today, almost every successful internet-scale business has adopted some type of federated approach because it is the only way to scale beyond the limitations of the enterprise. 

As digitization brings back-office teams closer to consumers and providers, they must join with their front-office colleagues and form teams that are fully capable of designing and improving a process or a line of business. These “full stack” teams should be responsible for managing their own practices, technology and data, meeting industry standards for their particular areas. They should communicate with other areas of the enterprise on demand through well-defined interfaces. 

The good news is that you can gradually migrate to a federation from almost any starting point, including an enterprise-wide ERP system. Even better, as IT moves from enforcing compliance with the company’s ERP system to brokering interface contracts and ensuring data security, it becomes a business enabler rather than a bottleneck. And best of all, responsible full stack teams that solve their own problems will create attractive jobs for talented engineers and give business units control over their own digital destiny. 

Wednesday, February 10, 2016

The New Technology Stack



Over the last two decades, the software technology stack has undergone a rapid evolution, as this diagram from Docker.io lays out.


The evolution continues. Today’s world of smart phones is giving way to tomorrow’s world of smart devices with sensors and actuators and not much more. The app layer will only get thinner.

If you think this trend will not affect your organization, think again. Tony Scott, CIO of the US federal government, advised CIO’s throughout the country to move to the cloud as fast as possible. Why? Because the large cloud providers can provide more secure, less expensive, and more reliable infrastructure than most organizations can provide for themselves. Major industries, from banking to health care, are discovering the benefits of moving to the cloud. Thin apps and assembled services running on off-premises hardware will soon become the norm for most organizations, probably even yours.

What does the cloud have to do with software development? Quite a bit, it turns out. In the cloud:

1. The development team is responsible for product design.
Assembling services is a dynamic process, not a one-time affair.
The thin app is often the only differentiator in the stack.

2. The development team is responsible for its own infrastructure.
When infrastructure is code, one team does it all:
               design/code/test/deploy/monitor/maintain.
Keeping things running is a new challenge for many software engineers.

3. Apps must be immune to infrastructure and service failure.
Stateless designs replace object-oriented designs.
Distributed, immutable data sets replace databases.
Things get done through producer/consumer chains.

So here’s the point: Practices designed for the problems of 1995 are not going to work for the problems of 2020.  We need to frame today’s and tomorrow’s problems in a way that helps us to identify and tackle them effectively; we need to use fundamental principles to help us ask the right questions. [1]

What are the right questions?  Consider this guidance from Taiichi Ohno, the father of Lean:
All we are doing is looking at the time line, from the moment the customer gives us an order to the point when we collect the cash. And we are reducing the time line by reducing the non-value adding wastes.
In the product development world, our timeline starts with a consumer problem instead of a customer order:
We look at the time line from the moment our consumers experience a problem until that problem is resolved. And we reduce the time line by reducing the non-value adding friction.
The technology stack of 1995 generated different kinds of friction than you will find in a modern technology stack. When banks moved to mobile apps a few years ago, they discovered that app development requires an agile approach because the underlying platforms change all the time. While the old technology stack resisted agile practices, the cloud demands them. There is no place for large projects or long release cycles in the new technology stack; agile development is simply table stakes - you need it to play the cloud game.

The new technology stack produces its own friction, a different kind of friction than was typically found in the old stack. This friction is particularly strong in organizations moving from the old to the new technology stack because the transition brings a lot of change to software development. Unfortunately, that change is not always well supported by the organization or welcomed by the software engineers.
Friction Generator #1: Since the new technology stack virtually requires small deployments, the development team can - and should - become deeply involved in designing differentiated products using tight feedback loops. In short, the development team becomes a product team. But frequently this product team does not have the right people (designers, for example), the authority, or the process to make dynamic product decisions. Too often development teams are told what to develop, rather than being asked to move business measures in the right direction. A lot of friction can occur if the organizational structure does not support the concept of fully responsible product teams.
Friction Generator #2: The development team must engineer solutions to quality, reliability and resilience issues that arise after deployment. This requires a different mindset than was common with the old technology stack, when the development team sent their code to the ops department, whose job it was to keep the system running. In the cloud, a team procures and releases to its own infrastructure, and there is no one else to deal with the inevitable problems that occur. Product teams must have the capability, the charter, and the mindset to accept 24/7 responsibility for their deployed code.
Friction Generator #3: The new technology stack is designed to be fault tolerant, not failure proof. This means that any service or app must be able to fail and get restarted at any time, and not produce problems due to these  interruptions. But writing "restartable" code [idempotent modules with immutable data sets] is new to most software engineers and is rarely taught in schools. Software engineers skilled at writing code for the new technology stack are in short supply and demand is intense. Good leadership, training, and support are required to help interested software engineers transition to the new languages and paradigms needed to thrive in the cloud.
Friction Generator #4: The old technology stack and associated batch processes encouraged extensive outsourcing, leaving many IT departments without software engineers or even data centers. Today, as software drives differentiation, many firms are attempting to bring software technology back in-house. But they often lack the management experience, organizational structure and personnel policies necessary to attract and retain the skilled software and reliability engineers they need for the new technology stack. 
Today, almost every business has to face the fact that their most serious competition is likely to come from companies living in the new technology stack, unencumbered by the old way of doing things. Governments and non-profits must realize that the people they serve have their expectations set by experiences with the cloud. If your organization is living in the old paradigm, it’s time to move on; big back end systems are rapidly becoming the COBOL of the 21st century.

To assess the current situation, take a look at the value stream – the stream of activities that deliver value to customers – and identify areas of friction. In the modern technology stack, friction generators tend to be either deeply technical or highly organizational in nature, as you can see from the discussion above. Unfortunately, these are not usually the problems that companies tackle when they move to modern software development. Why? Quite often the organizational structure is so entrenched that changing it is not considered. Or perhaps the people leading the transition do not understand the underlying technology and the problems presented by the new stack. In either case, the underlying problem becomes an elephant in the room that everyone ignores, while easier challenges - like adopting agile processes - are taken up.

It is important to confront the deep-seated friction generators that people would rather ignore. Start by talking about the elephant, and then actively imagine what your world would be like without that elephant. Once you have a clear vision of the future, you can work out how to move constantly toward that vision by eliminating the most pernicious friction generators, one step at a time. This approach has helped teams and organizations around the world make steady progress in the right direction, and eventually the steady progress adds up to amazing accomplishments.

Identifying, addressing, and overcoming challenging problems is one of the most engaging activities there is. People thrive when their day-to-day work involves getting good at conquering meaningful challenges. Companies do much better when they wake up the sleeping giant in each employee by encouraging them to reduce the friction that gets in the way of delivering value to customers.

If your company is not the highly successful leader-in-its-field that you hoped it would be (and no company ever is), then waiting around for things to change is not likely to make the situation better. Round up your colleagues and assess the situation. Find the elephant in the room and imagine what things would be like if it were gone. And then – since you are smart engineers – you need to engineer a way to get that elephant out of the room. Quit waiting for someone else to do this for you. You’re on.
______________________
Footnote:
1. One proven set of principles for tackling tough technology problems are the Lean principlesFocus on Customers, Energize Workers, Reduce Friction, Enhance Learning, Increase Flow, Build Quality In, Keep Getting Better.

Friday, January 22, 2016

Five World-Changing Software Innovations

On the 15th anniversary of the Agile Manifesto, let's look at what else was happening while we were focused on spreading the Manifestos ideals. There have been some impressive advances in software technology since Y2K:
          1.   The Cloud
          2.   Big Data
          3.   Antifragile Systems
          4.   Content Platforms
          5.   Mobile Apps 

The Cloud

In 2003 Nicholas Carr’s controversial article “IT Doesn’t Matter” was published in Harvard Business Review. He claimed that “the core functions of IT– data storage, data processing, and data transport” had become commodities, just like electricity, and they no longer provided differentiation. It’s amazing how right – and how wrong – that article turned out to be. At the time, perhaps 70% of an IT budget was allocated to infrastructure, and that infrastructure rarely offered a competitive advantage. On the other hand, since there was nowhere to purchase IT infrastructure as if it were electricity, there was a huge competitive advantage awaiting the company that figured out how package and sell such infrastructure. 

At the time, IT infrastructure was a big problem – especially for rapidly growing companies like Amazon.com. Amazon had started out with the standard enterprise architecture: a big front end coupled to a big back end. But the company was growing much faster than this architecture could support. CEO Jeff Bezos believed that the only way to scale to the level he had in mind was to create small autonomous teams. Thus by 2003, Amazon had restructured its digital organization into small (two-pizza) teams, each with end-to-end responsibility for a service. Individual teams were responsible for their own data, code, infrastructure, reliability, and customer satisfaction.

Amazon’s infrastructure was not set up to deal with the constant demands of multiple small teams, so things got chaotic for the operations department. This led Chris Pinkham, head of Amazon’s global infrastructure, to propose developing a capability that would let teams manage their own infrastructure – a capability that might eventually be sold to outside companies. As the proposal was being considered, Pinkham decided to return to South Africa where he had gone to school, so in 2004 Amazon gave him the funding to hire a team in South Africa and work on his idea. By 2006 the team’s product, Elastic Compute Cloud (EC2), was ready for release. It formed the kernel of what would become Amazon Web Services (AWS), which has since grown into a multi-billion-dollar business.

Amazon has consistently added software services on top of the hardware infrastructure – services like databases, analytics, access control, content delivery, containers, data streaming, and many others. It’s sort of like an IT department in a box, where almost everything you might need is readily available. Of course Amazon isn’t the only cloud company – it has several competitors.

So back to Carr’s article – Does IT matter?  Clearly the portion of a company’s IT that could be provided by AWS or similar cloud services does not provide differentiation, so from a competitive perspective, it doesn’t matter. If a company can’t provide infrastructure that matches the capability, cost, accessibility, reliability, and scalability of the cloud, then it may as well outsource its infrastructure to the cloud.

Outsourcing used to be considered a good cost reduction strategy, but often there was no clear distinction between undifferentiated context (that didn’t matter) and core competencies (that did). So companies frequently outsourced the wrong things – critical capabilities that nurtured innovation and provided competitive advantage. Today it is easier to tell the difference between core and context: if a cloud service provides it then anybody can buy it, so it’s probably context; what’s left is all that's available to provide differentiation. In fact, one reason why “outsourcing” as we once knew it has fallen into disfavor is that today, much of the outsourcing is handled by cloud providers. 

The idea that infrastructure is context and the rest is core helps explain why internet companies do not have IT departments. For the last two decades, technology startups have chosen to divide their businesses along core and infrastructure lines rather than along technology lines. They put differentiating capabilities in the line business units rather than relegating them to cost centers, which generally works a lot better. In fact, many IT organizations might work better if they were split into two sections, one (infrastructure) treated as a commodity and the rest moved into (or changed into) a line organization. 


Big Data

In 2001 Doug Cutting released Lucene, a text indexing and search program, under the Apache software license. Cutting and Mike Cafarella then wrote a web crawler called Nutch to collect interesting data for Lucerne to index. But now they had a problem – the web crawler could index 100 million pages before it filled up the terabyte of data they could easily fit on one machine. At the time, managing large amounts of data across multiple machines was not a solved problem; most large enterprises stored their critical data in a single database running on a very large computer. 

But the web was growing exponentially, and when companies like Google and Yahoo set out to collect all of the information available on the web, currently available computers and databases were not even close to big enough to store and analyze all of that data. So they had to solve the problem of using multiple machines for data storage and analysis. 

One of the bigger problems with using multiple machines is the increased probability that one of machines will fail. Early in its history, Google decided to accept the fact that at its scale, hardware failure was inevitable, so it should be managed rather than avoided. This was accomplished by software which monitored each computer and disk drive in a data center, detected failure, kicked the failed component out of the system, and replaced it with a new component. This process required keeping multiple copies of all data, so when hardware failed the data it held was available in another location. Since recovering from a big failure carried more risk than recovering from a small failure, the data centers were stocked with inexpensive PC components that would experience many small failures. The software needed to detect and quickly recover from these “normal” hardware failures was perfected as the company grew. 

In 2003 Google employees published two seminal papers describing how the company dealt with the massive amounts of data it collected and managed. Web Search for a Planet: The Google Cluster Architecture by Luiz André Barroso, Jeffrey Dean, and Urs Hölzle described how Google managed it’s data centers with their inexpensive components. The Google File System by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung described how the data was managed by dividing it into small chunks and maintaining multiple copies (typically three) of each chunk across the hardware. I remember that my reaction to these papers was “So that’s how they do it!” And I admired Google for sharing these sophisticated technical insights. 

Cutting and Cafarella had approximately the same reaction. Using the Google File System as a model, they spent 2004 working on a distributed file system for Nutch. The system abstracted a cluster of storage into a single file system running on commodity hardware, used relaxed consistency, and hid the complexity of load balancing and failure recovery from users. 

In fall, 2004, the next piece of the puzzle – analyzing massive amounts of stored data – was addressed by another Google paper: MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat. Cutting and Cafarella spent 2005 rewriting Nutch and adding MapReduce, which they released as Apache Hadoop in 2006. At the same time, Yahoo decided it needed to develop something like MapReduce, and settled on hiring Cutting and building Apache Hadoop into software that could handle its massive scale. Over the next couple of years, Yahoo devoted a lot of effort to converting Apache Hadoop – open source software – from a system that could handle a few servers to a system capable of dealing with web-scale databases. In the process, their data scientists and business people discovered that Hadoop was as useful for business analysis as it was for web search. 

By 2008, most web scale companies in Silicon Valley – Twitter, Facebook, LinkedIn, etc. – were using Apache Hadoop and contributing their improvements. Then startups like Cloudera were founded to help enterprises use Hadoop to analyze their data. What made Hadoop so attractive? Until that time, useful data had to be structured in a relational database and stored on one computer. Space was limited, so you only kept the current value of any data element. Hadoop could take unlimited quantities of unstructured data stored on multiple servers and make it available for data scientists and software programs to analyze. It was like moving from a small village to a megalopolis – Hadoop opened up a vast array of possibilities that are just beginning to be explored.

In 2011 Yahoo found that its Hadoop engineers were being courted by the emerging Big Data companies, so it spun off Hortonworks to give the Hadoop engineering team their own Big Data startup to grow. By 2012, Apache Hadoop (still open source) had so many data processing appendages built on top of the core software that MapReduce was split off from the underlying distributed file system. The cluster resource management that used to be in MapReduce was replaced by YARN (Yet Another Resource Negotiator). This gave Apache Hadoop another growth spurt, as MapReduce joined a growing number of analytical capabilities that run on top of YARN. Apache Spark is one of those analytical layers which supports data analysis tools that are more sophisticated and easier to use than MapReduce. Machine learning and analytics on data streams are just two of the many capabilities that Spark offers – and there are certainly more Hadoop tools to come. The potential of Big Data is just beginning to be tapped. 

In the early 1990’s Tim Burners Lee worked to ensure that CERN made his underlying code for HTML, HTTP and URL’s available on a royalty free basis, and because of that we have the world wide web. Ever since, software engineers have understood that the most influential technical advances come from sharing ideas across organizations, allowing the best minds in the industry to come together and solve tough technical problems. Big Data is as capable as it is because Google and Yahoo and many others companies were willing to share their technical breakthroughs rather than keep them proprietary. In the software industry we understand that we do far better as individual companies when the industry as a whole experiences major technical advances. 


Antifragile Systems

It used to be considered unavoidable that as software systems grew in age and complexity, they became increasingly fragile. Every new release was accompanied by fear of unintended consequences, which triggered extensive testing and longer periods between releases. However, the “failure is not an option” approach is not viable at internet scale – because things will go wrong in any very large system. Ignoring the possibility of failure – and focusing on trying to prevent it – simply makes the system fragile. When the inevitable failure occurs, a fragile system is likely to break down catastrophically.[1]  

Rather than prevent failure, it is much more important to identify and contain failure, then recover with a minimum of inconvenience for consumers. Every large internet company has figured this out. Amazon, Google, Esty, Facebook, Netflix and many others have written or spoken about their approach to failure. Each of these companies has devoted a lot of effort to creating robust systems that can deal gracefully with unexpected and unpredictable situations.

Perhaps the most striking among these is Netflix, which has a good number of reliability engineers despite the fact that it has no data centers. Netflix’s approach was described in 2013 by Ariel Tseitlin in the article The Antifragile Organization: Embracing Failure to Improve Resilience and Maximize Availability.  The main way Netflix increases the resilience of its systems is by regularly inducing failure with a “Simian Army” of monkeys: Chaos Monkey does some damage twice an hour, Latency Monkey simulates instances that are sick but still working, Conformity Monkey shuts down instances that don’t adhere to best practices, Security Monkey looks for security holes, Janitor Monkey cleans up clutter, Chaos Gorilla simulates failure of an AWS availability zone and Chaos Kong might take a whole Amazon region off line. I was not surprised to hear that during a recent failure of an Amazon region, Netflix customers experienced very little disruption.

A Simian Army isn’t the only way to induce failure. Facebook’s motto “Move Fast and Break Things” is another approach to stressing a system. In 2015, Ben Maurer of Facebook published Fail at Scale – a good summary of how internet companies keep very large systems reliable despite failure induced by constant change, traffic surges, and hardware failures. 

Maurer notes that the primary goal for very large systems is not to prevent failure – this is both impossible and dangerous. The objective is to find the pathologies that amplify failure and keep them from occurring. Facebook has identified three failure-amplifying pathologies: 

1. Rapidly deployed configuration changes
Human error is amplified by rapid changes, but rather than decrease the number of deployments, companies with antifragile systems move small changes through a release pipeline. Here changes are checked for known errors and run in a limited environment. The system quickly reverts to a known good configuration if (when) problems are found. Because the changes are small and gradually introduced into the overall system under constant surveillance, catastrophic failures are unlikely. In fact, the pipeline increases the robustness of the system over time.

2. Hard dependencies on core services
Core services fail just like anything else, so code has to be written with that in mind. Generally hardened API’s that include best practices are used to invoke these services. Core services and their API’s are gradually improved by intentionally injecting failure into a core service to expose weaknesses that are then corrected as failure modes are identified.

3. Increased latency and resource exhaustion
Best practices for avoiding the well-known problem of resource exhaustion include managing server queues wisely and having clients track outstanding requests. It’s not that these strategies are unknown, it’s that they must become common practice for all software engineers in the organization. 

Well-designed dashboards, effective incident response, and after-action reviews that implement countermeasures to prevent re-occurrence round out Facebook's toolkit for keeping its very large systems reliable.

We now know that fault tolerant systems are not only more robust, but also less risky than systems which we attempt to make failure-free. Therefore, common practice for assuring the reliability of large-scale software systems is moving toward software-managed release pipelines which orchestrate frequent small releases, in conjunction with failure induction and incident analysis to produce hardened infrastructure.


Content Platforms

Video is not new; television has been around for a long time, film for even longer. As revolutionary as film and TV have been, they push content to a mass audience; they do not inspire engagement. An early attempt at visual engagement was the PicturePhone of the 1970’s – a textbook example of a technical success and a commercial disaster. They got the PicturePhone use case wrong – not many people really wanted to be seen during a phone call. Videoconferencing did not fare much better – because few people understood that video is not about improving communication, it’s about sharing experience. 

In 2005, amidst a perfect storm of increasing bandwidth, decreasing cost of storage, and emerging video standards, three entrepreneurs – Chad Hurley, Steve Chen, and Jawed Karim – tried out an interesting use case for video: a dating site. But they couldn’t get anyone to submit “dating videos,” so they accepted any videos clips people wanted to upload. They were surprised at the videos they got: interesting experiences, impressive skills, how-to lessons – not what they expected, but at least it was something. The YouTube founders quickly added a search capability. This time they got the use case right and the rest is history. Video is the printing press of experience, and YouTube became the distributor of experience. Today, if you want to learn the latest unicycle tricks or how to get the back seat out of your car, you can find it on YouTube. 

YouTube was not the first successful content platform. Blogs date back to the late 1990’s where they began as diaries on personal web sites shared with friends and family. Then media companies began posting breaking news on their web sites to get their stories out before their competitors. Blogger, one of the earliest blog platforms, was launched just before Y2K and acquired by Google in 2003 – the same year WordPress was launched. As blogging popularity grew over the next few years, the use case shifted from diaries and news articles to ideas and opinions – and blogs increasingly resembled magazine articles. Those short diary entries meant for friends were more like scrapbooks; they came to be called tumbleblogs or microblogs. And – no surprise – separate platforms for these microblogs emerged: Tumblr in 2006 and Twitter in 2007.

One reason why blogs drifted away from diaries and scrapbooks is that alternative platforms emerged aimed at a very similar use case – which came to be called social networking. MySpace was launched in 2003 and became wildly popular over the next few years, only to be overtaken by Facebook, which was launched in 2004. 

Many other public content platforms have come (and gone) over the last decade; after all, a successful platform can usually be turned into a significant revenue stream. But the lessons learned by the founders of those early content platforms remain best practices for two-sided platforms today:

  1. Get the use case right on both sides of the platform. Very few founders got both use cases exactly right to begin with, but the successful ones learned fast and adapted quickly. 
  2. Attract a critical mass to both sides of the platform. Attracting enough traffic to generate network effects requires a dead simple contributor experience and an addictive consumer experience, plus a receptive audience for the initial release.
  3. Take responsibility for content even if you don’t own it. In 2007 YouTube developed ContentID to identify copyrighted audio clips embedded in videos and make it easy for contributors to comply with attribution and licensing requirements. 
  4. Be prepared for and deal effectively with stress. Some of the best antifragile patterns came from platform providers coping with extreme stress such as the massive traffic spikes at Twitter during natural disasters or hectic political events.

In short, successful platforms require insight, flexibility, discipline, and a lot of luck. Of course, this is the formula for most innovation. But don't forget  no matter how good your process is, you still need the luck part. 


Mobile Apps

It’s hard to imagine what life was like without mobile apps, but they did not exist a mere eight years ago. In 2008 both Apple and Google released content platforms that allowed developers to get apps directly into the hands of smart phone owners with very little investment and few intermediaries. By 2014 (give or take a year, depending on whose data you look at) mobile apps had surpassed desktops as the path people take to the internet. It is impossible to ignore the importance of the platforms that make mobile apps possible, or the importance of the paradigm shift those apps have brought about in software engineering. 

Mobile apps tend to be small and focused on doing one thing well – after all, a consumer has to quickly understand what the app does. By and large, mobile apps do not communicate with each other, and when they do it is through a disciplined exchange mediated by the platform. Their relatively small size and isolation make it natural for each individual app to be owned by a single, relatively small team that accepts the responsibility for its success. As we saw earlier, Amazon moved to small autonomous teams a long time ago, but it took a significant architectural shift for those teams to be effective. Mobile apps provide a critical architectural shift that makes small independent teams practical, even in monolithic organizations. And they provide an ecosystem that allows small startups to compete effectively with those organizations.  

The nature of mobile apps changes the software development paradigm in other ways as well. As one bank manager told me, “We did our first mobile app as a project, so we thought that when the app was released, it was done. But every time there was an operating system update, we had to update the app. That was a surprise! There are so many phones to test and new features coming out that our apps are in a constant state of development. There is no such thing as maintenance – or maybe it's all maintenance.”

The small teams, constant updates, and direct access to the deployed app have created a new dynamic in the IT world: software engineers have an immediate connection with the results of their work. App teams can track usage, observe failures and track metrics – then make changes accordingly. More than any other technology, mobile platforms have fostered the growth of small, independent product teams – with end-to-end responsibility  that use short feedback loops to constantly improve their offering. 

Let’s return to luck. If you have a large innovation effort, it probably has a 20% chance of success at best. If you have five small, separate innovation efforts, each with 20% chance of success, you have a much better chance that one of them will succeed – as long as they are truly autonomous and are not tied to an inflexible back end or flawed use case. Mobile apps create an environment where it can be both practical and advisable to break products into small, independent experiments, each owned by its own “full stack” team.[2] The more of these teams you have pursuing interesting ideas, the more likely you are that some of the ideas will become the innovative offerings that propel your company into the future. 


What about “Agile”?

You might notice that “Agile” is not on my list of innovations. And yet, agile values are found in every major software innovation since the Agile Manifesto was articulated in 2001. Agile development does not cause innovation; it is meant to create the conditions necessary for innovation: flexibility and discipline, customer understanding and rapid feedback, small teams with end-to-end responsibility. Agile processes do not manufacture insight and they do not create luck. That is what people do.  
____________________________
Footnotes:
1.    “the problem with artificially suppressed volatility is not just that the system tends to become extremely fragile; it is that, at the same time, it exhibits no visible risks… Such environments eventually experience massive blowups… catching everyone off guard and undoing years of stability or, in almost all cases, ending up far worse than they were in their initial volatile state. Indeed, the longer it takes for the blowup to occur, the worse the resulting harm…”  Antifragile, Nassim Taleb p 106

2.   A full stack team contains all the people necessary to make things happen in not only the full technology stack, but also in the full stack of business capabilities necessary for the team to be successful.