Thursday, August 8, 2002

XP in a Safety-Critical Environment

Recently I chanced to meet a gentleman on a plane who audits the software used in medical and pharmaceutical instruments. During our long and interesting conversation, he cited several instances where defects in software had resulted in deaths. One that comes to mind is a machine which delivered a lethal dose of radiation [1]. We discussed how such deaths could be prevented, and he was adamant: it is a well-known fact that when developing safety-critical software, all requirements must be documented up front and all code must be traced to the requirements. I asked how one could be sure that the requirements themselves would not be the cause of a problem. He paused and admitted that indeed, the integrity of the requirements is a critical issue, but one which is difficult to regulate. The best hope is that if a shop is disciplined in other areas, it will not make mistakes in documenting requirements.

One of the people this auditor might be checking up on is Ron Morsicato. Ron is a veteran developer who writes software for computers that control how devices respond to people. The device might be a weapon or a medical instrument, but often, if Ron's software goes astray, it can kill people. Last year Ron started using Extreme Programming (XP) for a pharmaceutical instrument, and found it quite compatible with a highly regulated and safety-critical environment. In fact, a representative of a worldwide pharmaceutical customer audited his process. This seasoned auditor concluded that Ron's development team had implemented practices on a par with the expected good practices in the field. This was a strong affirmation of the practices used by the only XP team in the company.

However, Ron’s team did not pass the audit. The auditor was disturbed that the team had been allowed to unilaterally implement XP. He noted that the organization had no policies governing which processes must be used, and that no process, even a perfectly acceptable one, should be implemented independently by a development team.

The message that Ron’s team heard was that they had done an excellent job using XP when audited against a pharmaceutical standard. What their management heard was that the XP process had failed the audit. This company probably won’t be using XP again, which is too bad, because Ron thinks it is an important step forward in designing better safety-critical systems.

The Yin and Yang of Safety-Critical Software
Ron points out that there are two key issues with safety-critical systems. First, you have to understand all the situations in which a hazardous condition might occur. The way to discover all of the safety issues in a system is to get a lot of knowledgeable people in a room and have them imagine scenarios that could lead to a breach of safety. In weapons development programs, there is a Systems Safety Working Group that provides a useful forum for this process. Once a dangerous scenario is identified, it’s relatively easy to build into the system a control that will keep it from happening. The hard part is thinking of everything that could go wrong in the first place. Software rarely causes problems that were anticipated, but the literature is loaded with accounts of accidents whose root causes stem from a completely unexpected set of circumstances. Causes of accidents include not only the physical design of the object, but also its operational practices [2]. Therefore, the most important aspect of software safety is making sure that all operational possibilities are considered.

The second issue with safety is to be sure that once dangerous scenarios are identified and controls are designed to keep them from happening, future changes to the system take this prior knowledge into account. The lethal examples my friend on the airplane cited were cases in which a new programming team was suspected of making a change without realizing that the change defeated a safety control. The point is, once a hazard has been identified, it probably will be contained initially, but it may be forgotten in the future. For this reason, the conventional wisdom holds that all changes must be traced to the initial design and requirements.

Ron has noticed an inherent conflict in these two goals. He is convinced that the best way to identify all possible hazard scenarios is to continually refactor the design and re-evaluate the safety issues. Yet the traditional way to avoid forgetting about previously identified failure modes is to freeze the design and trace all code back to the original requirements.

Ron notes that up until now there were two approaches: the ‘ad hoc’ approach and the ‘freeze up front’ approach. The ‘ad hoc’ approach might identify more hazards, but it will not ensure that they continue to be addressed throughout the product lifecycle. The ‘freeze up front’ approach ensures that identified failure modes have controls, but it is not good at finding all the failure modes. Theoretically, a good safety program employs both approaches, but when a new hazard is identified there is a strong impetus to pigeonhole a fix into the current design so as not to disturb the audit trails spawned by policing a static design. XP is a third option, one that is much better at finding all the failure modes, yet retains the discipline to protect existing controls.

Requirements Traceability
From my encounter on the plane, I knew that those who inspected Ron’s XP process would be looking for traceability of code to requirements. Since his XP processes fared well under review, I wondered how he satisfied the inspectors that his code was traceable to requirements. Did he trace code to requirements after it was written?

“Just because you’re doing XP doesn’t mean you abandon good software engineering practices,” Ron says. “It means that you don’t have to pretend that you know everything there is to know about the system in the beginning.” In fact, XP is quite explicit about not writing code until there is a user story calling for it to be written. And Ron points out that the user stories are the requirements.

The important thing about requirements, according to Ron, is that they must reflect the customer’s perspective of how the device will be used. In a DOD contract, requirements stem from a document aptly named the Operational Requirements Document, or ORD. In medical device development, requirements are customer scenarios describing how the instrument will be used. Sometimes initial requirements are broken down into more detail, but that process results in derived requirements, which are actually the start of the design. When developing safety-critical systems, it is necessary to develop a good understanding of how the device will be used, so derived requirements are not the place to start. The ORD or customer scenarios, along with any derived requirements that have crept in, should be broken down into story cards.

In order to use XP in a safety-critical environment, the customer representatives working on story cards should: 1) be aware of the ORD and/or the needs of system users, and be able to distinguish between originating and derived requirements; 2) have a firm understanding of system safety engineering, preferably as members of the System Safety Working Group; and 3) have the ear and confidence of whatever change authority exists. Using XP practices with this kind of customer team puts in place the framework for a process that maintains a system’s fitness for use during its development, continually reassesses the risk inherent in the system, and facilitates the adaptation of risk reduction measures.
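To make that traceability concrete, here is a minimal sketch of how an XP team might tie a unit test directly to the story card it verifies. The story card number, dose limit, and function names are my own invention for illustration, not details from Ron’s project:

    /* Hypothetical illustration of requirements traceability in an XP shop.
     * Story card S-12 (invented for this sketch): "The instrument must refuse
     * any requested dose greater than the configured maximum."
     */
    #include <assert.h>
    #include <stdbool.h>

    #define MAX_DOSE_MG 50.0   /* illustrative limit, not a real device value */

    /* Production code under test: the safety control called for by story S-12. */
    static bool dose_is_allowed(double requested_mg)
    {
        return requested_mg > 0.0 && requested_mg <= MAX_DOSE_MG;
    }

    /* Unit test named after the story card it traces to. */
    static void test_story_S12_rejects_excessive_dose(void)
    {
        assert(!dose_is_allowed(MAX_DOSE_MG + 0.1));   /* over the limit: refused  */
        assert(!dose_is_allowed(-1.0));                /* nonsense input: refused  */
        assert(dose_is_allowed(MAX_DOSE_MG));          /* at the limit: allowed    */
    }

    int main(void)
    {
        test_story_S12_rejects_excessive_dose();
        return 0;                                      /* all assertions passed */
    }

Run on every build, a test like this gives an auditor a living audit trail: the story card names the requirement, the test names the story card, and the code cannot change without the test noticing.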

Refactoring
Ron finds that refactoring a design is extremely valuable for discovering failure scenarios in embedded software. It is especially important because you never know how the device will work at the beginning of software development. Ron notes that many new weapons systems will be built with bleeding edge technology, and any new pharmaceutical instrument will be subject to the whims of the marketplace. So things change, and there is no way to get a complete picture of all the failure modes of a device at the beginning of the project. There is a subtler but equally important advantage to refactoring. The quality of a safety control will be improved because of the opportunities to simplify its design and deal with the inevitable design flaws that will be discovered.
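To picture the kind of simplification Ron means, consider this hypothetical before-and-after sketch; the interlock, its conditions, and the names are my own invention rather than anything from Ron’s system. A safety check that has drifted into two slightly different copies is refactored into a single authoritative control that every code path must consult:

    /* Hypothetical refactoring sketch: consolidating a duplicated safety check.
     *
     * Before the refactoring, two code paths each tested the interlock themselves:
     *   manual mode:  if (shield_closed && !service_mode) fire_beam();
     *   auto mode:    if (shield_closed)                  fire_beam();   <- drifted copy
     *
     * After the refactoring, every code path asks one well-named function.
     */
    #include <assert.h>
    #include <stdbool.h>

    typedef struct {
        bool shield_closed;   /* radiation shield reported in place */
        bool service_mode;    /* technician override engaged        */
    } interlock_state_t;

    /* The single authoritative safety control after refactoring. */
    bool beam_may_fire(const interlock_state_t *s)
    {
        return s->shield_closed && !s->service_mode;
    }

    int main(void)
    {
        interlock_state_t ok  = { .shield_closed = true, .service_mode = false };
        interlock_state_t bad = { .shield_closed = true, .service_mode = true  };
        assert(beam_may_fire(&ok));
        assert(!beam_may_fire(&bad));
        return 0;
    }

With one function standing guard, a future change cannot quietly weaken one copy of the check while leaving the other intact, and the simpler design is easier to reason about in a safety review.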

“It’s all about feedback. You try something, see how it works, refactor it, improve it.” In fact, the auditor’s positive assessment notwithstanding, if there were one thing Ron’s team would do differently next time, it would be to do more refactoring. Often they knew they should refactor, but forged ahead without doing it. “We did a root-cause analysis of our bugs, and concluded that when we think refactoring might be useful, we should go ahead and do it.”

It is dangerous to think that all the safety issues will be exposed during an initial design, according to Ron Morsicato. It is far better to review safety issues on a regular basis, taking into account what has been learned as development proceeds. Ron is convinced that when the team regularly thinks through failure scenarios, invariably new ones will be discovered as time goes on.

Refactoring activities must be made visible to the business side of the planning game, for it is from there that the impetus to re-evaluate the new design from a system safety perspective needs to come. Ron believes that if the system safety designers feel that impetus and take on an “XP attitude,” then the benefits of both the “ad hoc” and “freeze” approaches can be realized. By keeping the system as simple as possible, the customer-developer team will achieve a safer design and maintain a heightened focus on the safety of its users.

Testing
The most important XP discipline is unit testing, according to Ron Morsicato. He noted that too many managers ignore the discipline of thorough testing during development, which tends to create a ‘hacker’s’ environment. Faced with a ton of untested code, developers have an impossible task: random fixes are applied, the overall design gets lost, and the code base becomes increasingly messy.

Ron feels that no code should be submitted to a developing system unless it is completely unit tested, so that those debugging the system need only look at the interfaces for causes of defects. Emphasizing rigorous in-line testing, rather than sequential development steps and thorough documentation, results in better code. When coupled with effective use of the planning game and regular refactoring, ongoing testing is the best way to develop safe software.

The XP testing discipline provides a further benefit for safety-critical systems. Ensuring that every safety control has tests that run each time the system is changed makes it much harder for the inevitable future changes to break a control without anyone noticing.
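As an illustration of the idea, here is a minimal sketch of such a safety regression suite. The hazards, limits, and function names are invented for this example rather than taken from any real device; the point is only that each identified hazard keeps a permanent guard test that runs on every build:

    /* Hypothetical sketch: every previously identified hazard keeps a
     * regression test forever, so a later change cannot silently defeat
     * the control that was put in place for it.
     */
    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_DOSE_MG 50.0   /* illustrative limit */

    static bool dose_is_allowed(double mg) { return mg > 0.0 && mg <= MAX_DOSE_MG; }

    static bool beam_may_fire(bool shield_closed, bool service_mode)
    {
        return shield_closed && !service_mode;
    }

    /* One test per entry in the (hypothetical) hazard log. */
    static void hazard_001_overdose(void)         { assert(!dose_is_allowed(MAX_DOSE_MG * 10)); }
    static void hazard_002_open_shield(void)      { assert(!beam_may_fire(false, false)); }
    static void hazard_003_service_override(void) { assert(!beam_may_fire(true, true)); }

    int main(void)
    {
        hazard_001_overdose();
        hazard_002_open_shield();
        hazard_003_service_override();
        puts("safety regression suite: all controls still hold");
        return 0;
    }

A suite like this, run on every integration, plays the role that frozen requirements traceability plays in the traditional approach: it remembers the hazards that have already been found, even as the design continues to change.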

Us vs. Them
I asked Ron what single thing was the most important trait of a good manager. He replied without hesitation, “Managers who give you the big picture of what you are supposed to achieve, rather than telling you what to do, are far and away the best managers. Developers really do not like managers telling them how to do their job, but they don’t appreciate a hacking environment either.” One thing Ron has observed in his experience is that the increasing pressure on developers to conform to a specific process has created an “us” vs. “them” mentality. The word “process” has become tainted among developers; it means something imposed by people who have lost touch with the realities of code development. Developers accuse “them” of imposing processes that sound good in theory and offer a foolproof way of passing an auditor’s comparison of the practice to a particular standard. Developers find themselves overloaded with work that they feel is unnecessary for producing good code. The unfortunate consequence is that anything said by the “process camp” tends to be disregarded by the “developer camp.” This leads to an unwillingness to adopt a good practice just because the process people support it.

According to Ron, XP is a process that doesn’t feel like a process. It’s presented as a set of practices that directly address the problems developers continually run into from their own perspective. When engaged in any of the XP practices, a developer has a sense of incrementally contributing to the quality of the product. This reinforces developers’ commitment to quality and strengthens their confidence that they are doing the right thing. If I were getting some medical treatment from a device that looked like it could kill me if someone made the wrong move, I’d certainly hope that the engineers who developed the gizmo had that confidence and commitment.

Software developers will deliver high quality code if they clearly understand what quality means to their customer, if they can constantly test their code against that understanding, and if they regularly refactor. Keeping up with change, whether it comes from emerging insights into the system design, inevitable improvements in device technology, or evolving customer values, is critical to their success. With good management, they will look upon themselves as members of a safety team immersed in a culture of safety.

The Audit
Ron’s team implemented XP practices with confidence and dedication, met deadlines that would have been impossible with any other approach, and delivered a solid product, while adhering to practices that met pharmaceutical standards. And yet even though these practices were praised by a veteran auditor, the organization failed the audit at the policy level. What went wrong?

The theory of punctuated equilibrium holds that biological species are not likely to change over a long period of time because mutations are usually swamped by the genes of the existing population. If a mutation occurs in an isolated spot away from the main population, it has a greater chance of surviving. This is like saying that it is easier for a strange new fish to grow large in a small pond. Similarly, disruptive technologies [3] (new species of technologies) do not prosper in companies selling similar older technologies, nor are they initially aimed at the markets served by the older technologies. Disruptive technologies are strange little fish, so they only grow big in a small pond.

Ron’s project was being run under a military policy, even though it was a commercial project. If the company policy had segmented off a commercial area for software development and explicitly allowed the team to develop its own process in that segment, then the auditor would have been satisfied. He would not have seen a strange little fish swimming around in a big pond, looking different from all the other fish. Instead he would have seen a new little fish swimming in its own, ‘official’ small pond. There XP practices could have thrived and grown mature, at which time they might have invaded the larger pond of traditional practices.

But it was not to be. The project was canceled, a victim not of the audit but of the economy and a distant corporate merger. Today the thing managers remember about the project is that XP did not pass the audit. The little fish did not survive in the big pond.
__________________
Footnotes:

[1] He was probably referring to the Therac-25 series of accidents, where indeed they had suspect software practices, including after-the-fact requirements traceability.

[2] For a comprehensive account of accidents in software-based systems, see Safeware: System Safety and Computers, by Nancy G. Leveson, Addison-Wesley, 1995.

[3] See The Innovator’s Dilemma, by Clayton M. Christensen, Harper-Business edition, 2000.
