
January 22, 2016

Five World-Changing Software Innovations

On the 15th anniversary of the Agile Manifesto, let's look at what else was happening while we were focused on spreading the Manifesto's ideals. There have been some impressive advances in software technology since Y2K:
          1.   The Cloud
          2.   Big Data
          3.   Antifragile Systems
          4.   Content Platforms
          5.   Mobile Apps 

The Cloud

In 2003 Nicholas Carr’s controversial article “IT Doesn’t Matter” was published in Harvard Business Review. He claimed that “the core functions of IT – data storage, data processing, and data transport” had become commodities, just like electricity, and they no longer provided differentiation. It’s amazing how right – and how wrong – that article turned out to be. At the time, perhaps 70% of an IT budget was allocated to infrastructure, and that infrastructure rarely offered a competitive advantage. On the other hand, since there was nowhere to purchase IT infrastructure as if it were electricity, there was a huge competitive advantage awaiting the company that figured out how to package and sell such infrastructure. 

At the time, IT infrastructure was a big problem – especially for rapidly growing companies like Amazon.com. Amazon had started out with the standard enterprise architecture: a big front end coupled to a big back end. But the company was growing much faster than this architecture could support. CEO Jeff Bezos believed that the only way to scale to the level he had in mind was to create small autonomous teams. Thus by 2003, Amazon had restructured its digital organization into small (two-pizza) teams, each with end-to-end responsibility for a service. Individual teams were responsible for their own data, code, infrastructure, reliability, and customer satisfaction.

Amazon’s infrastructure was not set up to deal with the constant demands of multiple small teams, so things got chaotic for the operations department. This led Chris Pinkham, head of Amazon’s global infrastructure, to propose developing a capability that would let teams manage their own infrastructure – a capability that might eventually be sold to outside companies. As the proposal was being considered, Pinkham decided to return to South Africa where he had gone to school, so in 2004 Amazon gave him the funding to hire a team in South Africa and work on his idea. By 2006 the team’s product, Elastic Compute Cloud (EC2), was ready for release. It formed the kernel of what would become Amazon Web Services (AWS), which has since grown into a multi-billion-dollar business.

Amazon has consistently added software services on top of the hardware infrastructure – services like databases, analytics, access control, content delivery, containers, data streaming, and many others. It’s sort of like an IT department in a box, where almost everything you might need is readily available. Of course Amazon isn’t the only cloud company – it has several competitors.
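
To make the "infrastructure as a service you can call" idea concrete, here is a minimal sketch of a team provisioning its own compute programmatically. It assumes the boto3 Python SDK and configured AWS credentials, and the image, key pair, and tag values are placeholders; it is an illustration, not a recommended setup.

    # Minimal sketch of self-service infrastructure using the boto3 SDK.
    # The AMI id, key pair, and tag values below are placeholders.
    import boto3

    ec2 = boto3.resource("ec2", region_name="us-east-1")

    instances = ec2.create_instances(
        ImageId="ami-0123456789abcdef0",      # placeholder machine image
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
        KeyName="team-key",                   # placeholder key pair
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "team", "Value": "checkout"}],
        }],
    )
    print("launched", instances[0].id)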

So back to Carr’s article – Does IT matter?  Clearly the portion of a company’s IT that could be provided by AWS or similar cloud services does not provide differentiation, so from a competitive perspective, it doesn’t matter. If a company can’t provide infrastructure that matches the capability, cost, accessibility, reliability, and scalability of the cloud, then it may as well outsource its infrastructure to the cloud.

Outsourcing used to be considered a good cost reduction strategy, but often there was no clear distinction between undifferentiated context (that didn’t matter) and core competencies (that did). So companies frequently outsourced the wrong things – critical capabilities that nurtured innovation and provided competitive advantage. Today it is easier to tell the difference between core and context: if a cloud service provides it then anybody can buy it, so it’s probably context; what’s left is all that's available to provide differentiation. In fact, one reason why “outsourcing” as we once knew it has fallen into disfavor is that today, much of the outsourcing is handled by cloud providers. 

The idea that infrastructure is context and the rest is core helps explain why internet companies do not have IT departments. For the last two decades, technology startups have chosen to divide their businesses along core and infrastructure lines rather than along technology lines. They put differentiating capabilities in the line business units rather than relegating them to cost centers, which generally works a lot better. In fact, many IT organizations might work better if they were split into two sections, one (infrastructure) treated as a commodity and the rest moved into (or changed into) a line organization. 


Big Data

In 2001 Doug Cutting released Lucene, a text indexing and search program, under the Apache software license. Cutting and Mike Cafarella then wrote a web crawler called Nutch to collect interesting data for Lucene to index. But now they had a problem – the web crawler could index 100 million pages before it filled up the terabyte of storage they could easily fit on one machine. At the time, managing large amounts of data across multiple machines was not a solved problem; most large enterprises stored their critical data in a single database running on a very large computer. 

But the web was growing exponentially, and when companies like Google and Yahoo set out to collect all of the information available on the web, currently available computers and databases were not even close to big enough to store and analyze all of that data. So they had to solve the problem of using multiple machines for data storage and analysis. 

One of the bigger problems with using multiple machines is the increased probability that one of the machines will fail. Early in its history, Google decided to accept the fact that at its scale, hardware failure was inevitable, so it should be managed rather than avoided. This was accomplished by software which monitored each computer and disk drive in a data center, detected failure, kicked the failed component out of the system, and replaced it with a new component. This process required keeping multiple copies of all data, so when hardware failed the data it held was available in another location. Since recovering from a big failure carried more risk than recovering from a small failure, the data centers were stocked with inexpensive PC components that would experience many small failures. The software needed to detect and quickly recover from these “normal” hardware failures was perfected as the company grew. 

In 2003 Google employees published two seminal papers describing how the company dealt with the massive amounts of data it collected and managed. Web Search for a Planet: The Google Cluster Architecture by Luiz André Barroso, Jeffrey Dean, and Urs Hölzle described how Google managed its data centers built from inexpensive components. The Google File System by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung described how the data was managed by dividing it into small chunks and maintaining multiple copies (typically three) of each chunk across the hardware. I remember that my reaction to these papers was “So that’s how they do it!” And I admired Google for sharing these sophisticated technical insights. 

Cutting and Cafarella had approximately the same reaction. Using the Google File System as a model, they spent 2004 working on a distributed file system for Nutch. The system abstracted a cluster of storage into a single file system running on commodity hardware, used relaxed consistency, and hid the complexity of load balancing and failure recovery from users. 
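
To illustrate the replication idea, here is a toy sketch of my own (a simplification for illustration, not actual GFS or Hadoop code): a file is split into fixed-size chunks, each chunk is written to three different nodes, and when a node dies its chunks are re-replicated from the surviving copies.

    # Toy sketch of chunk replication in the spirit of the Google File System
    # (an illustrative simplification, not real GFS or Hadoop code).
    import random

    CHUNK_SIZE = 64        # bytes here; the real systems used 64 MB chunks
    REPLICAS = 3

    nodes = {f"node{i}": {} for i in range(6)}   # node name -> {chunk id: bytes}
    chunk_map = {}                               # chunk id -> nodes holding it

    def store_file(name, data):
        # Split the file into chunks and write each chunk to three nodes.
        for i in range(0, len(data), CHUNK_SIZE):
            chunk_id = f"{name}#{i // CHUNK_SIZE}"
            targets = random.sample(list(nodes), REPLICAS)
            for n in targets:
                nodes[n][chunk_id] = data[i:i + CHUNK_SIZE]
            chunk_map[chunk_id] = targets

    def handle_node_failure(dead):
        # Kick the failed node out and restore each lost chunk from a survivor.
        lost_chunks = nodes.pop(dead)
        for chunk_id in lost_chunks:
            survivors = [n for n in chunk_map[chunk_id] if n != dead]
            new_home = random.choice([n for n in nodes if n not in survivors])
            nodes[new_home][chunk_id] = nodes[survivors[0]][chunk_id]
            chunk_map[chunk_id] = survivors + [new_home]

    store_file("crawl.dat", b"x" * 300)
    handle_node_failure("node2")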

In the fall of 2004, the next piece of the puzzle – analyzing massive amounts of stored data – was addressed by another Google paper: MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat. Cutting and Cafarella spent 2005 rewriting Nutch and adding MapReduce, which they released as Apache Hadoop in 2006. At the same time, Yahoo decided it needed to develop something like MapReduce, and settled on hiring Cutting and building Apache Hadoop into software that could handle its massive scale. Over the next couple of years, Yahoo devoted a lot of effort to converting Apache Hadoop – open source software – from a system that could handle a few servers to a system capable of dealing with web-scale databases. In the process, their data scientists and business people discovered that Hadoop was as useful for business analysis as it was for web search. 
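
The programming model itself is simple enough to sketch. The following single-machine illustration (my own toy version, not Hadoop's actual API) counts words: a map function emits key/value pairs, the framework groups the pairs by key, and a reduce function combines each group.

    # Single-machine illustration of the MapReduce programming model
    # (not Hadoop's API): map emits (key, value) pairs, the framework
    # groups them by key, and reduce combines each group of values.
    from collections import defaultdict

    def map_fn(document):
        for word in document.split():
            yield word.lower(), 1

    def reduce_fn(word, counts):
        return word, sum(counts)

    def map_reduce(documents):
        groups = defaultdict(list)
        for doc in documents:                     # map phase
            for key, value in map_fn(doc):
                groups[key].append(value)         # shuffle: group by key
        return [reduce_fn(k, v) for k, v in groups.items()]   # reduce phase

    print(map_reduce(["the cat sat", "the dog sat"]))
    # [('the', 2), ('cat', 1), ('sat', 2), ('dog', 1)]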

By 2008, most web scale companies in Silicon Valley – Twitter, Facebook, LinkedIn, etc. – were using Apache Hadoop and contributing their improvements. Then startups like Cloudera were founded to help enterprises use Hadoop to analyze their data. What made Hadoop so attractive? Until that time, useful data had to be structured in a relational database and stored on one computer. Space was limited, so you only kept the current value of any data element. Hadoop could take unlimited quantities of unstructured data stored on multiple servers and make it available for data scientists and software programs to analyze. It was like moving from a small village to a megalopolis – Hadoop opened up a vast array of possibilities that are just beginning to be explored.

In 2011 Yahoo found that its Hadoop engineers were being courted by the emerging Big Data companies, so it spun off Hortonworks to give the Hadoop engineering team their own Big Data startup to grow. By 2012, Apache Hadoop (still open source) had so many data processing appendages built on top of the core software that MapReduce was split off from the underlying distributed file system. The cluster resource management that used to be in MapReduce was replaced by YARN (Yet Another Resource Negotiator). This gave Apache Hadoop another growth spurt, as MapReduce joined a growing number of analytical capabilities that run on top of YARN. Apache Spark is one of those analytical layers which supports data analysis tools that are more sophisticated and easier to use than MapReduce. Machine learning and analytics on data streams are just two of the many capabilities that Spark offers – and there are certainly more Hadoop tools to come. The potential of Big Data is just beginning to be tapped. 
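
To give a sense of why Spark's tools feel easier to use than raw MapReduce, the same word count becomes a short chain of transformations. This is a minimal sketch assuming PySpark is installed and a local session; the input path is a placeholder.

    # Minimal PySpark sketch (assumes the pyspark package is installed);
    # the same word count expressed as chained RDD transformations.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

    counts = (spark.sparkContext.textFile("crawl.dat")     # placeholder input
              .flatMap(lambda line: line.split())
              .map(lambda word: (word.lower(), 1))
              .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))
    spark.stop()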

In the early 1990’s Tim Berners-Lee worked to ensure that CERN made his underlying code for HTML, HTTP, and URLs available on a royalty-free basis, and because of that we have the World Wide Web. Ever since, software engineers have understood that the most influential technical advances come from sharing ideas across organizations, allowing the best minds in the industry to come together and solve tough technical problems. Big Data is as capable as it is because Google and Yahoo and many other companies were willing to share their technical breakthroughs rather than keep them proprietary. In the software industry we understand that we do far better as individual companies when the industry as a whole experiences major technical advances. 


Antifragile Systems

It used to be considered unavoidable that as software systems grew in age and complexity, they became increasingly fragile. Every new release was accompanied by fear of unintended consequences, which triggered extensive testing and longer periods between releases. However, the “failure is not an option” approach is not viable at internet scale – because things will go wrong in any very large system. Ignoring the possibility of failure – and focusing on trying to prevent it – simply makes the system fragile. When the inevitable failure occurs, a fragile system is likely to break down catastrophically.[1]  

Rather than prevent failure, it is much more important to identify and contain failure, then recover with a minimum of inconvenience for consumers. Every large internet company has figured this out. Amazon, Google, Etsy, Facebook, Netflix and many others have written or spoken about their approach to failure. Each of these companies has devoted a lot of effort to creating robust systems that can deal gracefully with unexpected and unpredictable situations.

Perhaps the most striking among these is Netflix, which has a good number of reliability engineers despite the fact that it has no data centers. Netflix’s approach was described in 2013 by Ariel Tseitlin in the article The Antifragile Organization: Embracing Failure to Improve Resilience and Maximize Availability.  The main way Netflix increases the resilience of its systems is by regularly inducing failure with a “Simian Army” of monkeys: Chaos Monkey does some damage twice an hour, Latency Monkey simulates instances that are sick but still working, Conformity Monkey shuts down instances that don’t adhere to best practices, Security Monkey looks for security holes, Janitor Monkey cleans up clutter, Chaos Gorilla simulates failure of an AWS availability zone and Chaos Kong might take a whole Amazon region off line. I was not surprised to hear that during a recent failure of an Amazon region, Netflix customers experienced very little disruption.
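
A conceptual sketch of the Chaos Monkey idea follows (my own simplification, not Netflix's implementation, which is open source and far more careful): on a schedule, pick a random instance from each service group and terminate it, so every team is forced to build services that tolerate the loss.

    # Conceptual sketch of chaos-monkey-style failure injection (an
    # illustration, not Netflix's code). The two helper functions are
    # hypothetical stand-ins for real platform calls.
    import random
    import time

    def list_instances(group):
        # hypothetical: ask the platform for running instances in a group
        return [f"{group}-{i}" for i in range(4)]

    def terminate_instance(instance_id):
        # hypothetical: ask the cloud provider to kill the instance
        print("terminating", instance_id)

    def chaos_monkey(groups, interval_seconds=1800, probability=0.5):
        while True:
            for group in groups:
                if random.random() < probability:
                    terminate_instance(random.choice(list_instances(group)))
            time.sleep(interval_seconds)    # roughly twice an hour

    # chaos_monkey(["recommendations", "playback"])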

A Simian Army isn’t the only way to induce failure. Facebook’s motto “Move Fast and Break Things” is another approach to stressing a system. In 2015, Ben Maurer of Facebook published Fail at Scale – a good summary of how internet companies keep very large systems reliable despite failure induced by constant change, traffic surges, and hardware failures. 

Maurer notes that the primary goal for very large systems is not to prevent failure – this is both impossible and dangerous. The objective is to find the pathologies that amplify failure and keep them from occurring. Facebook has identified three failure-amplifying pathologies: 

1. Rapidly deployed configuration changes
Human error is amplified by rapid changes, but rather than decrease the number of deployments, companies with antifragile systems move small changes through a release pipeline. Here changes are checked for known errors and run in a limited environment. The system quickly reverts to a known good configuration if (when) problems are found. Because the changes are small and gradually introduced into the overall system under constant surveillance, catastrophic failures are unlikely. In fact, the pipeline increases the robustness of the system over time.

2. Hard dependencies on core services
Core services fail just like anything else, so code has to be written with that in mind. Generally, hardened APIs that include best practices are used to invoke these services. Core services and their APIs are gradually improved by intentionally injecting failure into a core service to expose weaknesses that are then corrected as failure modes are identified.

3. Increased latency and resource exhaustion
Best practices for avoiding the well-known problem of resource exhaustion include managing server queues wisely and having clients track outstanding requests. It’s not that these strategies are unknown, it’s that they must become common practice for all software engineers in the organization. 

Well-designed dashboards, effective incident response, and after-action reviews that implement countermeasures to prevent re-occurrence round out Facebook's toolkit for keeping its very large systems reliable.

We now know that fault tolerant systems are not only more robust, but also less risky than systems which we attempt to make failure-free. Therefore, common practice for assuring the reliability of large-scale software systems is moving toward software-managed release pipelines which orchestrate frequent small releases, in conjunction with failure induction and incident analysis to produce hardened infrastructure.
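
A rough sketch of such a pipeline follows (a conceptual outline with hypothetical deploy, health-check, and rollback hooks, not any particular company's tooling): each small change moves through progressively larger environments and is reverted automatically the moment its health checks fail.

    # Conceptual sketch of a small-change release pipeline with automatic
    # rollback; deploy_to, healthy, and rollback are hypothetical hooks.
    def release(change, stages=("canary", "one_region", "everywhere")):
        for stage in stages:
            deploy_to(stage, change)
            if not healthy(stage, window_seconds=300):
                rollback(stage, change)     # revert to the known good configuration
                return f"reverted at {stage}"
        return "fully deployed"

    def deploy_to(stage, change):
        print(f"deploying {change} to {stage}")

    def healthy(stage, window_seconds):
        # hypothetical: compare error rates and latency against the baseline
        return True

    def rollback(stage, change):
        print(f"rolling back {change} from {stage}")

    print(release("config-change-1234"))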


Content Platforms

Video is not new; television has been around for a long time, film for even longer. As revolutionary as film and TV have been, they push content to a mass audience; they do not inspire engagement. An early attempt at visual engagement was the PicturePhone of the 1970’s – a textbook example of a technical success and a commercial disaster. They got the PicturePhone use case wrong – not many people really wanted to be seen during a phone call. Videoconferencing did not fare much better – because few people understood that video is not about improving communication, it’s about sharing experience. 

In 2005, amidst a perfect storm of increasing bandwidth, decreasing cost of storage, and emerging video standards, three entrepreneurs – Chad Hurley, Steve Chen, and Jawed Karim – tried out an interesting use case for video: a dating site. But they couldn’t get anyone to submit “dating videos,” so they accepted any video clips people wanted to upload. They were surprised at the videos they got: interesting experiences, impressive skills, how-to lessons – not what they expected, but at least it was something. The YouTube founders quickly added a search capability. This time they got the use case right and the rest is history. Video is the printing press of experience, and YouTube became the distributor of experience. Today, if you want to learn the latest unicycle tricks or how to get the back seat out of your car, you can find it on YouTube. 

YouTube was not the first successful content platform. Blogs date back to the late 1990’s, when they began as diaries on personal web sites shared with friends and family. Then media companies began posting breaking news on their web sites to get their stories out before their competitors. Blogger, one of the earliest blog platforms, was launched just before Y2K and acquired by Google in 2003 – the same year WordPress was launched. As blogging popularity grew over the next few years, the use case shifted from diaries and news articles to ideas and opinions – and blogs increasingly resembled magazine articles. Those short diary entries meant for friends were more like scrapbooks; they came to be called tumblelogs or microblogs. And – no surprise – separate platforms for these microblogs emerged: Tumblr in 2006 and Twitter in 2007.

One reason why blogs drifted away from diaries and scrapbooks is that alternative platforms emerged aimed at a very similar use case – which came to be called social networking. MySpace was launched in 2003 and became wildly popular over the next few years, only to be overtaken by Facebook, which was launched in 2004. 

Many other public content platforms have come (and gone) over the last decade; after all, a successful platform can usually be turned into a significant revenue stream. But the lessons learned by the founders of those early content platforms remain best practices for two-sided platforms today:

  1. Get the use case right on both sides of the platform. Very few founders got both use cases exactly right to begin with, but the successful ones learned fast and adapted quickly. 
  2. Attract a critical mass to both sides of the platform. Attracting enough traffic to generate network effects requires a dead simple contributor experience and an addictive consumer experience, plus a receptive audience for the initial release.
  3. Take responsibility for content even if you don’t own it. In 2007 YouTube developed ContentID to identify copyrighted audio clips embedded in videos and make it easy for contributors to comply with attribution and licensing requirements. 
  4. Be prepared for and deal effectively with stress. Some of the best antifragile patterns came from platform providers coping with extreme stress such as the massive traffic spikes at Twitter during natural disasters or hectic political events.

In short, successful platforms require insight, flexibility, discipline, and a lot of luck. Of course, this is the formula for most innovation. But don't forget: no matter how good your process is, you still need the luck part. 


Mobile Apps

It’s hard to imagine what life was like without mobile apps, but they did not exist a mere eight years ago. In 2008 both Apple and Google released content platforms that allowed developers to get apps directly into the hands of smart phone owners with very little investment and few intermediaries. By 2014 (give or take a year, depending on whose data you look at) mobile apps had surpassed desktops as the path people take to the internet. It is impossible to ignore the importance of the platforms that make mobile apps possible, or the importance of the paradigm shift those apps have brought about in software engineering. 

Mobile apps tend to be small and focused on doing one thing well – after all, a consumer has to quickly understand what the app does. By and large, mobile apps do not communicate with each other, and when they do it is through a disciplined exchange mediated by the platform. Their relatively small size and isolation make it natural for each individual app to be owned by a single, relatively small team that accepts the responsibility for its success. As we saw earlier, Amazon moved to small autonomous teams a long time ago, but it took a significant architectural shift for those teams to be effective. Mobile apps provide a critical architectural shift that makes small independent teams practical, even in monolithic organizations. And they provide an ecosystem that allows small startups to compete effectively with those organizations.  

The nature of mobile apps changes the software development paradigm in other ways as well. As one bank manager told me, “We did our first mobile app as a project, so we thought that when the app was released, it was done. But every time there was an operating system update, we had to update the app. That was a surprise! There are so many phones to test and new features coming out that our apps are in a constant state of development. There is no such thing as maintenance – or maybe it's all maintenance.”

The small teams, constant updates, and direct access to the deployed app have created a new dynamic in the IT world: software engineers have an immediate connection with the results of their work. App teams can track usage, observe failures and track metrics – then make changes accordingly. More than any other technology, mobile platforms have fostered the growth of small, independent product teams – with end-to-end responsibility – that use short feedback loops to constantly improve their offering. 

Let’s return to luck. If you have a large innovation effort, it probably has a 20% chance of success at best. If you have five small, separate innovation efforts, each with 20% chance of success, you have a much better chance that one of them will succeed – as long as they are truly autonomous and are not tied to an inflexible back end or flawed use case. Mobile apps create an environment where it can be both practical and advisable to break products into small, independent experiments, each owned by its own “full stack” team.[2] The more of these teams you have pursuing interesting ideas, the more likely you are that some of the ideas will become the innovative offerings that propel your company into the future. 
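
The arithmetic behind that claim is worth making explicit: with five independent efforts, each with a 20% chance of success, the chance that at least one succeeds is 1 − 0.8^5, or about 67%.

    # Chance that at least one of five independent 20%-likely efforts succeeds.
    p_at_least_one = 1 - (1 - 0.20) ** 5
    print(round(p_at_least_one, 2))    # 0.67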


What about “Agile”?

You might notice that “Agile” is not on my list of innovations. And yet, agile values are found in every major software innovation since the Agile Manifesto was articulated in 2001. Agile development does not cause innovation; it is meant to create the conditions necessary for innovation: flexibility and discipline, customer understanding and rapid feedback, small teams with end-to-end responsibility. Agile processes do not manufacture insight and they do not create luck. That is what people do.  
____________________________
Footnotes:
1.    “the problem with artificially suppressed volatility is not just that the system tends to become extremely fragile; it is that, at the same time, it exhibits no visible risks… Such environments eventually experience massive blowups… catching everyone off guard and undoing years of stability or, in almost all cases, ending up far worse than they were in their initial volatile state. Indeed, the longer it takes for the blowup to occur, the worse the resulting harm…”  Antifragile, Nassim Taleb p 106

2.   A full stack team contains all the people necessary to make things happen in not only the full technology stack, but also in the full stack of business capabilities necessary for the team to be successful.

August 1, 2003

Concurrent Development

When sheet metal is formed into a car body, a massive machine called a stamping machine presses the metal into shape. The stamping machine has a huge metal tool called a die which makes contact with the sheet metal and presses it into the shape of a fender or a door or a hood. Designing and cutting these dies to the proper shape accounts for half of the capital investment of a new car development program, and drives the critical path. If a mistake ruins a die, the entire development program suffers a huge set-back. So if there is one thing that automakers want to do right, it is the die design and cutting.

The problem is, as the car development progresses, engineers keep making changes to the car, and these find their way to the die design. No matter how hard the engineers try to freeze the design, they are not able to do so. In Detroit in the 1980’s the cost of changes to the design was 30 – 50% of the total die cost, while in Japan it was 10 – 20% of the total die cost. These numbers seem to indicate the Japanese companies must have been much better at preventing change after the die specs were released to the tool and die shop. But such was not the case.

The US strategy for making a die was to wait until the design specs were frozen, and then send the final design to the tool and die maker, which triggered the process of ordering the block of steel and cutting it. Any changes went through an arduous change approval process. It took about two years from ordering the steel to the time that die would be used in production. In Japan, however, the tool and die makers order up the steel blocks and start rough cutting at the same time the car design is starting. This is called concurrent development. How can it possibly work?

The die engineers in Japan are expected to know a lot about what a die for a front door panel will involve, and they are in constant communication with the body engineer. They anticipate the final solution and they are also skilled in techniques to make minor changes late in development, such as leaving more material where changes are likely. Most of the time die engineers are able to accommodate the engineering design as it evolves. In the rare case of a mistake, a new die can be cut much faster because the whole process is streamlined.

Japanese automakers do not freeze design points until late in the development process, allowing most changes to occur while the window for change is still open. When compared to the early design freeze practices in the US in the 1980’s, Japanese die makers spent perhaps a third as much money on changes, and produced better die designs. Japanese dies tended to require fewer stamping cycles per part, creating significant production savings.

The significant difference in time-to-market and increasing market success of Japanese automakers prompted US automotive companies to adopt concurrent development practices in the 1990’s, and today the product development performance gap has narrowed significantly.

Concurrent Software Development
Programming is a lot like die cutting. The stakes are often high and mistakes can be costly, so sequential development, that is, establishing requirements before development begins, is commonly thought of as a way to protect against serious errors. The problem with sequential development is that it forces designers to take a depth-first rather than a breadth-first approach to design. Depth-first forces making low-level, dependent decisions before experiencing the consequences of the high-level decisions. The most costly mistakes are made by forgetting to consider something important at the beginning. The easiest way to make such a big mistake is to drill down to detail too fast. Once you start down the detailed path, you can’t back up, and aren’t likely to realize that you should. When big mistakes can be made, it is best to survey the landscape and delay the detailed decisions.

Concurrent development of software usually takes the form of iterative development. It is the preferred approach when the stakes are high and the understanding of the problem is evolving. Concurrent development allows you to take a breadth-first approach and discover those big, costly problems before it’s too late. Moving from sequential development to concurrent development means starting to program the highest-value features as soon as a high-level conceptual design is determined, even while detailed requirements are being investigated. This may sound counterintuitive, but think of it as an exploratory approach which permits you to learn by trying a variety of options before you lock in on a direction that constrains implementation of less important features.

In addition to providing insurance against costly mistakes, concurrent development is the best way to deal with changing requirements, because not only are the big decisions deferred while you consider all the options, but the little decisions are deferred as well. When change is inevitable, concurrent development reduces delivery time and overall cost, while improving the performance of the final product.

If this sounds like magic – or hacking – it would be if nothing else changed. Just starting programming earlier, without the associated expertise and collaboration found in Japanese die cutting, is unlikely to lead to improved results. There are some critical skills that must be in place in order for concurrent development to work.

Under sequential development, US automakers considered die engineers to be quite remote from the automotive engineers, and so too, programmers in a sequential development process often have little contact with the customers and users who have requirements and the analysts who collect requirements. Concurrent development in die cutting required US automakers to make two critical changes – the die engineer needed the expertise to anticipate what the emerging design would need in the cut steel, and had to collaborate closely with the body engineer.

Similarly, concurrent software development requires developers with enough expertise in the domain to anticipate where the emerging design is likely to lead, and close collaboration with the customers and analysts who are designing how the system will solve the business problem at hand.

The Last Responsible Moment
Concurrent software development means starting development when only partial requirements are known and developing in short iterations which provide the feedback that causes the system to emerge. Concurrent development makes it possible to delay commitment until the Last Responsible Moment, that is, the moment at which failing to make a decision eliminates an important alternative. If commitments are delayed beyond the Last Responsible Moment, then decisions are made by default, which is generally not a good approach to making decisions.

Procrastinating is not the same as making decisions at the Last Responsible Moment; in fact, delaying decisions is hard work. Here are some tactics for making decisions at the Last Responsible Moment:

  • Share partially complete design information. The notion that a design must be complete before it is released is the biggest enemy of concurrent development. Requiring complete information before releasing a design increases the length of the feedback loop in the design process and causes irreversible decisions to be made far sooner than necessary. Good design is a discovery process, done through short, repeated exploratory cycles.
  • Organize for direct, worker-to-worker collaboration.  Early release of incomplete information means that the design will be refined as development proceeds. This requires that upstream people who understand the details of what the system must do to provide value must communicate directly with downstream people who understand the details of how the code works.
  • Develop a sense of how to absorb changes.  In ‘Delaying Commitment,’ IEEE Software (1988), Harold Thimbleby observes that the difference between amateurs and experts is that experts know how to delay commitments and how to conceal their errors for as long as possible. Experts repair their errors before they cause problems. Amateurs try to get everything right the first time and so overload their problem solving capacity that they end up committing early to wrong decisions. Thimbleby recommends some tactics for delaying commitment in software development, which could be summarized as an endorsement of object-oriented design and component-based development:
  • Use Modules.  Information hiding, or more generally behavior hiding, is the foundation of object-oriented approaches. Delay commitment to the internal design of the module until the requirements of the clients on the interfaces stabilize. 
  • Use Interfaces. Separate interfaces from implementations. Clients should not depend on implementation decisions.
  • Use Parameters. Make magic numbers – constants that have meaning – into parameters. Make magic capabilities like databases and third party middleware into parameters. By passing capabilities into modules wrapped in simple interfaces, your dependence on specific implementations is eliminated and testing becomes much easier. (A small sketch illustrating this tactic appears after this list.)
  • Use Abstractions.  Abstraction and commitment are inverse processes. Defer commitment to specific representations as long as the abstraction will serve immediate design needs.
  • Avoid Sequential Programming.  Use declarative programming rather than procedural programming, trading off performance for flexibility. Define algorithms in a way that does not depend on a particular order of execution.
  • Beware of custom tool building.  Investment in frameworks and other tooling frequently requires committing too early to implementation details that end up adding needless complexity and seldom pay back. Frameworks should be extracted from a collection of successful implementations, not built on speculation.
Additional tactics for delaying commitment include:
  • Avoid Repetition. This is variously known as the Don’t Repeat Yourself (DRY) or Once And Only Once (OAOO) principle. If every capability is expressed in only one place in the code, there will be only one place to change when that capability needs to evolve and there will be no inconsistencies.
  • Separate Concerns.  Each module should have a single well defined responsibility. This means that a class will have only one reason to change. 
  • Encapsulate Variation. What is likely to change should be inside, the interfaces should be stable. Changes should not cascade to other modules. This strategy, of course, depends on a deep understanding of the domain to know which aspects will be stable and which variable. By application of appropriate patterns, it should be possible to extend the encapsulated behavior without modifying the code itself. 
  • Defer Implementation of Future Capabilities. Implement only the simplest code that will satisfy immediate needs rather than putting in capabilities you ‘know’ you will need in the future. You will know better in the future what you really need, and simple code will be easier to extend then if necessary.
  • Avoid extra features. If you defer adding features you ‘know’ you will need, then you certainly want to avoid adding extra features ‘just-in-case’ they are needed. Extra features add an extra burden of code to be tested and maintained, and understood by programmers and users alike. Extra features add complexity, not flexibility.
Much has been written on these delaying tactics, so they will not be covered in detail in this book.
  • Develop a sense of what is critically important in the domain.  Forgetting some critical feature of the system until too late is the fear which drives sequential development. If security, or response time, or fail safe operation are critically important in the domain, these issues need to be considered from the start; if they are ignored until too late, it will indeed be costly. However, the assumption that sequential development is the best way to discover these critical features is flawed. In practice, early commitments are more likely to overlook such critical elements than late commitments, because early commitments rapidly narrow the field of view.
  • Develop a sense of when decisions must be made.  You do not want to make decisions by default, or you have not delayed them. Certain architectural concepts such as usability design, layering and component packaging are best made early, so as to facilitate emergence in the rest of the design. A bias toward late commitment must not degenerate into a bias toward no commitment. You need to develop a keen sense of timing and a mechanism to cause decisions to be made when their time has come.
  • Develop a quick response capability. The slower you respond, the earlier you have to make decisions. Dell, for instance, can assemble computers in less than a week, so they can decide what to make less than a week before shipping. Most other computer manufacturers take a lot longer to assemble computers, so they have to decide what to make much sooner. If you can change your software quickly, you can wait to make a change until customers know what they want.
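
As a concrete illustration of the Use Parameters and Use Interfaces tactics above (my own invented example, not from the original text): the report code below depends only on a small interface, and both the concrete storage implementation and the "magic number" are passed in as parameters, so those decisions remain easy to change.

    # Illustrative sketch of the Use Parameters / Use Interfaces tactics
    # (an invented example): the report code depends on a small interface,
    # and the storage implementation and tax rate are passed in.
    from typing import Protocol

    class OrderStore(Protocol):                   # the stable interface
        def orders_for(self, customer_id: str) -> list[float]: ...

    class InMemoryStore:                          # one interchangeable implementation
        def __init__(self, data):
            self.data = data
        def orders_for(self, customer_id):
            return self.data.get(customer_id, [])

    def total_spent(store: OrderStore, customer_id: str, tax_rate: float = 0.07) -> float:
        # tax_rate is a parameter, not a magic number buried in the code
        return sum(store.orders_for(customer_id)) * (1 + tax_rate)

    store = InMemoryStore({"c1": [10.0, 20.0]})
    print(total_spent(store, "c1"))               # approximately 32.1
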
Cost Escalation
Software is different from most products in that software systems are expected to be upgraded on a regular basis. On the average, more than half of the development work that occurs on a software system occurs after it is first sold or placed into production. In addition to internal changes, software systems are subject to a changing environment – a new operating system, a change in the underlying database, a change in the client used by the GUI, a new application using the same database, etc. Most software is expected to change regularly over its lifetime, and in fact once upgrades are stopped, software is often nearing the end of its useful life. This presents us with a new category of waste, that is, waste caused by software that is difficult to change.

In 1987 Barry Boehm wrote, “Finding and fixing a software problem after delivery costs 100 times more than finding and fixing the problem in early design phases”. This observation became the rationale behind thorough up-front requirements analysis and design, even though Boehm himself encouraged incremental development over “single-shot, full product development.” In 2001, Boehm noted that for small systems the escalation factor can be more like 5:1 than 100:1; and even on large systems, good architectural practices can significantly reduce the cost of change by confining features that are likely to change to small, well-encapsulated areas.

There used to be a similar, but more dramatic, cost escalation factor for product development. It was once estimated that a change after production began could cost 1000 times more than if the change had been made in the original design. The belief that the cost of change escalates as development proceeds contributed greatly to standardizing the sequential development process in the US. No one seemed to recognize that the sequential process could actually be the cause of the high escalation ratio. However, as concurrent development replaced sequential development in the US in the 1990’s, the cost escalation discussion was forever altered. The discussion was no longer how much a change might cost later in development; the discussion centered on how to reduce the need for change through concurrent engineering.

Not all change is equal. There are a few basic architectural decisions that you need to get right at the beginning of development, because they fix the constraints of the system for its life. Examples of these may be choice of language, architectural layering decisions, or the choice to interact with an existing database also used by other applications. These kinds of decisions might have the 100:1 cost escalation ratio. Because these decisions are so crucial, you should focus on minimizing the number of these high stakes constraints. You also want to take a breadth-first approach to these high stakes decisions.

The bulk of the change in a system does not have to have a high cost escalation factor; it is the sequential approach that causes the cost of most changes to escalate exponentially as you move through development. Sequential development emphasizes getting all the decisions made as early as possible, so the cost of all changes is the same – very high. Concurrent design defers decisions as late as possible. This has four effects:
  • Reduces the number of high-stake constraints.
  • Gives a breadth-first approach to high-stakes decisions, making it more likely that they will be made correctly.
  • Defers the bulk of the decisions, significantly reducing the need for change.
  • Dramatically decreases the cost escalation factor for most changes.
A single cost escalation factor or curve is misleading. Instead of a chart showing a single trend for all changes, a more appropriate graph has at least two cost escalation curves, as shown in Figure 3-1. The agile development objective is to move as many changes as possible from the top curve to the bottom curve.

Figure 3-1. Two Cost Escalation Curves
Returning for a moment to the Toyota die cutting example, the die engineer sees the conceptual design of the car and knows roughly what size of door panel will be necessary. With that information, a big enough steel block can be ordered. If the concept of the car changes from a small, sporty car to a mid-size family car, the block of steel may be too small, and that would be a costly mistake. But the die engineer knows that once the overall concept is approved, it won’t change, so the steel can be safely ordered, long before the details of the door emerge. Concurrent design is a robust design process because the die adapts to whatever design emerges.

Lean software development delays freezing all design decisions as long as possible, because it is easier to change a decision that hasn’t been made. Lean software development emphasizes developing a robust, change-tolerant design, one that accepts the inevitability of change and structures the system so that it can be readily adapted to the most likely kinds of changes.

The main reason why software changes throughout its lifecycle is that the business process in which it is used evolves over time. Some domains evolve faster than others, and some domains may be essentially stable. It is not possible to build in flexibility to accommodate arbitrary changes cheaply. The idea is to build tolerance for change into the system along domain dimensions that are likely to change. Observing where the changes occur during iterative development gives a good indication of where the system is likely to need flexibility in the future. If changes of certain types are frequent during development, you can expect that these types of changes will not end when the product is released. The secret is to know enough about the domain to maintain flexibility, yet avoid making things any more complex than they must be.

If a system is developed by allowing the design to emerge through iterations, the design will be robust, adapting more readily to types of changes that occur during development. More importantly, the ability to adapt will be built into the system, so that as more changes occur after its release, they can be readily incorporated. On the other hand, if systems are built with a focus on getting everything right at the beginning in order to reduce the cost of later changes, their design is likely to be brittle and not accept changes readily. Worse, the chance of making a major mistake in the key structural decisions is increased with a depth-first, rather than a breadth-first approach.

August 8, 2002

XP in a Safety-Critical Environment

Recently I chanced to meet a gentleman on a plane who audits the software used in medical and pharmaceutical instruments. During our long and interesting conversation, he cited several instances where defects in software had resulted in deaths. One that comes to mind is a machine which delivered a lethal dose of radiation [1]. We discussed how such deaths could be prevented, and he was adamant – it is a well-known fact that when developing safety-critical software, all requirements must be documented up front and all code must be traced to the requirements. I asked how one could be sure that the requirements themselves would not be the cause of a problem. He paused and admitted that indeed, the integrity of the requirements is a critical issue, but one which is difficult to regulate. The best hope is that if a shop is disciplined in other areas, it will not make mistakes in documenting requirements.

One of the people this auditor might be checking up on is Ron Morsicato. Ron is a veteran developer who writes software for computers that control how devices respond to people. The device might be a weapon or a medical instrument, but often if Ron’s software goes astray, it can kill people. Last year Ron started using Extreme Programming (XP) for a pharmaceutical instrument, and found it quite compatible with a highly regulated and safety-critical environment. In fact, a representative of a worldwide pharmaceutical customer audited his process. This seasoned auditor concluded that Ron’s development team had implemented practices sufficiently good to be on a par with the expected good practices in the field. This was a strong affirmation of the practices used by the only XP team in the company.

However, Ron’s team did not pass the audit. The auditor was disturbed that the team had been allowed to unilaterally implement XP. He noted that the organization did not have policies concerning which processes must be used, and no process, even one which was quite acceptable, should be independently implemented by a development team.

The message that Ron’s team heard was that they had done an excellent job using XP when audited against a pharmaceutical standard. What their management heard was that the XP process had failed the audit. This company probably won’t be using XP again, which is too bad, because Ron thinks it is an important step forward in designing better safety-critical systems.

The Yin and Yang of Safety-Critical Software
Ron points out that there are two key issues with safety-critical systems. First, you have to understand all the situations in which a hazardous condition might occur. The way to discover all of the safety issues in a system is to get a lot of knowledgeable people in a room and have them imagine scenarios that could lead to a breach of safety. In weapons development programs, there is a Systems Safety Working Group that provides a useful forum for this process. Once a dangerous scenario is identified, it’s relatively easy to build into the system a control that will keep it from happening. The hard part is thinking of everything that could go wrong in the first place. Software rarely causes problems that were anticipated, but the literature is loaded with accounts of accidents whose root causes stem from a completely unexpected set of circumstances. Causes of accidents include not only the physical design of the object, but also its operational practices [2]. Therefore, the most important aspect of software safety is making sure that all operational possibilities are considered.

The second issue with safety is to be sure that once dangerous scenarios are identified and controls are designed to keep them from happening, future changes to the system take this prior knowledge into account. The lethal examples my friend on the airplane cited were cases in which a new programming team was suspected of making a change without realizing that the change defeated a safety control. The point is, once a hazard has been identified, it probably will be contained initially, but it may be forgotten in the future. For this reason, it is felt that all changes must be traced to the initial design and requirements.

Ron has noticed an inherent conflict in these two goals. He is convinced that the best way to identify all possible hazard scenarios is to continually refactor the design and re-evaluate the safety issues. Yet the traditional way to avoid forgetting about previously identified failure modes is to freeze the design and trace all code back to the original requirements.

Ron notes that up until now there were two approaches: the ‘ad hoc’ approach and the ‘freeze up front’ approach. The ‘ad hoc’ approach might identify more hazards, but it will not ensure that they will continue to be addressed through the product lifecycle. The ‘freeze up front’ approach ensures that identified failure modes have controls, but it is not good at finding all the failure modes. Theoretically, a good safety program employs both approaches, but when a new hazard is identified there is a strong impetus to pigeonhole a fix into the current design so as not to disturb the audit trails spawned by policing a static design. XP is a third option – one that is much better at finding all the failure modes, yet can contain the discipline to protect existing controls.

Requirements Traceability
My encounter on the plane told me that those who inspected Ron’s XP process would be looking for traceability of code to requirements. Since his XP processes fared well under review, I wondered how he satisfied inspectors that his code was traceable to requirements. Did he trace code to requirements after it was written?

“Just because you’re doing XP doesn’t mean you abandon good software engineering practices,” Ron says. “It means that you don’t have to pretend that you know everything there is to know about the system in the beginning.” In fact, XP is quite explicit about not writing code until there is a user story calling for it to be written. And Ron points out that the user stories are the requirements.

The important thing about requirements, according to Ron, is that they must reflect the customer’s perspective of how the device will be used. In a DOD contract, requirements stem from a document aptly named the Operational Requirements Document, or ORD. In a medical device development, requirements would be customer scenarios about how the instrument will be used. Sometimes initial requirements are broken down into more detail, but that process results in derived requirements, which are actually the start of the design. When developing safety-critical systems, it is necessary to develop a good understanding of how the device will be used, so derived requirements are not the place to start. The ORD or customer scenarios, along with any derived requirements that have crept in, should be broken down into story cards.

In order to use XP in a safety environment, the customer representatives working on story cards should 1) be aware of the ORD and/or needs of system users and able to distinguish between originating and derived requirements, 2) have a firm understanding of system safety engineering, preferably as a member of the System Safety Working Group, and 3) have the ear and confidence of whatever change authority exists. Using XP practices with this kind of customer team puts in place the framework for a process that maintains a system’s fitness for use during its development, continually reassesses the risk inherent in the system, and facilitates the adaptation of risk reduction measures.

Refactoring
Ron finds that refactoring a design is extremely valuable for discovering failure scenarios in embedded software. It is especially important because you never know how the device will work at the beginning of software development. Ron notes that many new weapons systems will be built with bleeding edge technology, and any new pharmaceutical instrument will be subject to the whims of the marketplace. So things change, and there is no way to get a complete picture of all the failure modes of a device at the beginning of the project. There is a subtler but equally important advantage to refactoring. The quality of a safety control will be improved because of the opportunities to simplify its design and deal with the inevitable design flaws that will be discovered.

“It’s all about feedback. You try something, see how it works, refactor it, improve it.” In fact, the positive assessment of the auditor notwithstanding, if there were one thing Ron’s team would do differently next time, it would be more refactoring. Often they knew they should refactor, but forged ahead without doing it. “We did a root-cause analysis of our bugs, and concluded that when we think refactoring might be useful, we should go ahead and do it.”

It is dangerous to think that all the safety issues will be exposed during an initial design, according to Ron Morsicato. It is far better to review safety issues on a regular basis, taking into account what has been learned as development proceeds. Ron is convinced that when the team regularly thinks through failure scenarios, invariably new ones will be discovered as time goes on.

Refactoring activities must be made visible to the business side of the planning game, for it is from there that the impetus to reevaluate the new design from a systems safety aspect needs to occur. Ron believes that if the system safety designers feel that impetus and take on an “XP attitude,” then the benefits of both the “ad hoc” and “freeze” approaches can be realized. The customer-developer team will achieve a safer design by keeping the system as simple as possible, helping them to achieve a heightened focus on the safety of its users.

Testing
The most important XP discipline is unit testing, according to Ron Morsicato. He noted that too many managers ignore the discipline of thorough testing during development, which tends to create a ‘hacker’s’ environment. When presented with a ton of untested code, developers face an impossible task. Random fixes are often applied, the overall design gets lost, and the code base becomes increasingly messy.

Ron feels that no code should be submitted to a developing system unless it is completely unit tested, so the systems debuggers need only look at the interfaces for causes of defects. Instead of emphasizing sequential steps in development and thorough documentation, emphasizing rigorous in-line testing will result in better code. When coupled with effective use of the planning game and regular refactoring, on-going testing is the best way to develop safe software.

The XP testing discipline provides a further benefit for safety-critical systems. By assuring that all safety controls have tests that run every time the system is changed, it is easier to be sure that safety controls cannot be broken as the software undergoes inevitable future changes.
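
As a small, hypothetical illustration of that point (not Ron's actual code): once a safety control is expressed as code, a test that runs on every change makes it much harder for a future team to defeat the control without noticing.

    # Hypothetical illustration of a unit-tested safety control (not the
    # instrument's real code): the interlock rejects any dose outside the
    # configured limit, and the test runs on every change to the system.
    MAX_DOSE_MGY = 200                  # illustrative limit, in milligray

    def approve_dose(requested_mgy):
        if requested_mgy <= 0 or requested_mgy > MAX_DOSE_MGY:
            raise ValueError(f"dose {requested_mgy} mGy is outside the safe range")
        return requested_mgy

    def test_overdose_is_rejected():
        try:
            approve_dose(MAX_DOSE_MGY * 10)
        except ValueError:
            return                      # the control held
        raise AssertionError("safety control failed: overdose was approved")

    test_overdose_is_rejected()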

Us vs. Them
I asked Ron what single thing was the most important trait of a good manager. He replied without hesitation, “Managers who give you the big picture of what you are supposed to achieve, rather than telling you what to do, are far and away the best managers. Developers really do not like managers telling them how to do their job, but they don’t appreciate a hacking environment either.” One thing Ron has observed in his experiences is that the increasing pressure on developers to conform to a specific process has created an “us” vs. “them” mentality. The word “process” has become tainted among developers; it means something imposed by people who have lost touch with the realities of code development. Developers accuse ‘them’ of imposing processes because they sound good in theory, and are a foolproof way of passing an auditor’s comparison of the practice to a particular standard. Developers find themselves overloaded with work that they feel they don’t have to do in order to produce good code. The unfortunate consequence of this is that anything said by the “process camp” tends to be disregarded by the “developer camp.” This leads to an unwillingness to adopt a good practice just because the process people support it.

According to Ron, XP is a process that doesn’t feel like a process. It’s presented as a set of practices that directly address the problems developers continually run into from their own perspective. When engaged in any of the XP practices, a developer has a sense of incrementally contributing to the quality of the product. This reinforces developers’ commitment to quality and strengthens their confidence that they are doing the right thing. If I were getting some medical treatment from a device that looked like it could kill me if someone made the wrong move, I’d certainly hope that the engineers who developed the gizmo had that confidence and commitment.

Software developers will deliver high quality code if they clearly understand what quality means to their customer, if they can constantly test their code against that understanding, and if they regularly refactor. Keeping up with change, whether the emerging insights into the system design, the inevitable improvements in device technology, or the evolving customer values, is critical to their success. With good management, they will look upon themselves as members of a safety team immersed in a culture of safety.

The Audit
Ron’s team implemented XP practices with confidence and dedication, met deadlines that would have been impossible with any other approach, and delivered a solid product, while adhering to practices that met pharmaceutical standards. And yet even though these practices were praised by a veteran auditor, the organization failed the audit at the policy level. What went wrong?

The theory of punctuated equilibrium holds that biological species are not likely to change over a long period of time because mutations are usually swamped by the genes of the existing population. If a mutation occurs in an isolated spot away from the main population, it has a greater chance of surviving. This is like saying that it is easier for a strange new fish to grow large in a small pond. Similarly, disruptive technologies [3] (new species of technologies) do not prosper in companies selling similar older technologies, nor are they initially aimed at the markets served by the older technologies. Disruptive technologies are strange little fish, so they only grow big in a small pond.

Ron’s project was being run under a military policy, even though it was a commercial project. If the company policy had segmented off a commercial area for software development and explicitly allowed the team to develop its own process in that segment, then the auditor would have been satisfied. He would not have seen a strange little fish swimming around in a big pond, looking different from all the other fish. Instead he would have seen a new little fish swimming in its own, ‘official’ small pond. There XP practices could have thrived and grown mature, at which time they might have invaded the larger pond of traditional practices.

But it was not to be. The project was canceled, a victim not of the audit but of the economy and a distant corporate merger. Today the thing managers remember about the project is that XP did not pass the audit. The little fish did not survive in the big pond.
__________________
Footnotes:

[1] What he was probably referring to was the Therac-25 series of accidents, where indeed they had suspect software practices, including after-the-fact requirements traceability.

[2] For a comprehensive account of accidents in software based systems, see Safeware: System Safety and Computers, Nancy G. Leveson, Addison-Wesley, 1995

[3] See The Innovator’s Dilemma, by Clayton M. Christensen, Harper-Business edition, 2000.
