Thursday, April 4, 2019

What If Your Team Wrote the Code for the 737 MCAS System?

The 737 has been around for a half century, and over that time airplanes have evolved from manual controls to fly-by-wire systems. As each new generation of 737 appeared, the control system became more automated, but there was a concerted effort to maintain the “feel” of the previous system so pilots did not have to adapt to dramatically different mental models of how to control a plane. When electronic signals replaced manual controls, an “Elevator Feel Shift System” was added to simulate the resistance pilots felt when using manual controls and provide feedback through the feel of the control stick (yoke). A stall warning mechanism was also added – it was designed to catch the pilot’s attention if a stall seemed imminent, alerting the pilot to push forward on the yoke and thus lower the nose a bit and regain airspeed.

Enter a new version of the 737 (the 737 MAX) – rushed to market to counter a serious competitive threat. To make the plane more fuel efficient, new (larger) engines were added. Since the landing gear could not reasonably be extended to allow for the larger engines, they were positioned a bit further forward and higher on the wing – causing instability problems under certain flight conditions. Boeing addressed this instability with the MCAS system – a modification of the (already certified and proven) Elevator Feel Shift System that would automatically lower the nose when an imminent stall is detected, rather than alerting the pilot. Of course, airplanes have been running on autopilot for years, so a little bit of automatic correction while in manual mode is not a radical concept. The critical safety requirement here is not redundancy, because the pilot is expected to override an autopilot system if warranted. The critical safety requirement is that if an autopilot system goes haywire, the pilots are alerted in time to use a very obvious and practiced process to override it.

Two 737 MAX airplanes have crashed, and the MCAS system has been implicated as a potential cause of each one. Based on preliminary reports, it appears that the MCAS system – operating on a single faulty sensor and persistently reversing the pilots’ corrective inputs – may eventually be found to have contributed to both disasters.

Hindsight is always much clearer than foresight, and we know that predicting all possible behaviors of complex systems is impossible. And yet, I wonder if the people who wrote the code for the MCAS system were involved in the systems engineering that led to its design. More to the point, as driverless vehicles and sophisticated automated equipment become increasingly practical, what is the role of software engineering in ensuring that these systems are safe?

Back in the day, I wrote software which controlled large roll-goods manufacturing processes. I worked in an engineering department where no one entertained the idea of separating design from implementation. WE were the engineers responsible for understanding, designing, and installing control systems. A suggestion that someone else might specify the engineering details of our systems would not have been tolerated. One thing we knew for sure – we were responsible for designing safe systems, and we were not going to delegate that responsibility to anyone else. Another thing we knew for sure was that anything that could go wrong would eventually go wrong – so every element of our systems had to be designed to fail safely; every input to our system was suspect; and no output could be guaranteed to reach its destination. And because my seasoned engineering colleagues were suspicious of automation, they added manual (and very visible) emergency stop systems that could easily and quickly override my automated controls.
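To make that concrete, here is a rough sketch of the shape those controllers took. The class, the names, and the numbers are invented for illustration; the point is simply that every input is validated and the visible emergency stop always trumps the automation.

```python
# Illustrative sketch only: the class, the setpoint, and the control law are
# invented. The point is the structure: suspect inputs fail safe, and the
# manual emergency stop always wins over the automation.

class RollGoodsController:
    MAX_LINE_SPEED = 500.0           # hypothetical valid sensor range

    def __init__(self) -> None:
        self.emergency_stopped = False

    def emergency_stop(self) -> None:
        """Wired to a large, visible button; latches until deliberately reset."""
        self.emergency_stopped = True

    def drive_output(self, speed_reading: float, setpoint: float = 300.0) -> float:
        # The manual e-stop overrides everything the automation wants to do.
        if self.emergency_stopped:
            return 0.0
        # Every input is suspect: an out-of-range reading means fail safe.
        if not 0.0 <= speed_reading <= self.MAX_LINE_SPEED:
            return 0.0
        # A toy proportional correction stands in for the real control law.
        return 0.1 * (setpoint - speed_reading)
```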

But then something funny happened to software. Managers (often lacking coding experience or an engineering background) decided that it would be more efficient if one group of people focused on designing software systems while another group of people actually wrote the code. I have never understood how this could possibly work, and quite frankly, I have never seen it succeed in a seriously complex environment. But there you have it – for a couple of decades, common software practice has been to separate design from implementation, distancing software engineers from the design of the systems they are supposed to implement.

Returning to the 737 MAX MCAS system, while it’s not useful to speculate about how the MCAS software was actually designed, it is useful to imagine how your team – your organization – would approach a similar problem. Suppose your team was tasked with modifying the code of a well-proven, existing system to add a modest change to overcome a design problem in a new generation of the product. How would it work in your environment? Would you receive a spec that said: When a stall is detected (something the software already does), send an adjustment signal to bring the nose down? And would you write the code as specified, or would you ask some questions – such as “What if the stall signal is wrong, and there really isn’t a stall?” Or “Under what conditions do we NOT send an adjustment signal?” Or “When and how can the system be disabled?”
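To make the difference tangible, here is how that spec might look in code, written two ways: exactly as specified, and after asking those questions. Everything in this sketch is hypothetical; the thresholds, the input names, and the idea that the stall signal arrives as an angle-of-attack reading are my own assumptions, not Boeing’s design or the real MCAS interface.

```python
# Hypothetical sketch only: the names, thresholds, and the assumption that the
# stall signal arrives as an angle-of-attack reading are invented for
# illustration. This is not Boeing's design or the real MCAS interface.

from dataclasses import dataclass

STALL_AOA_DEG = 14.0            # invented stall threshold
NOSE_DOWN_TRIM_DEG = -2.5       # invented correction magnitude


@dataclass
class FlightInputs:
    aoa_deg: float              # angle-of-attack reading from a single sensor
    pilot_disabled: bool        # has the crew switched the system off?
    flaps_extended: bool        # one example of an inhibiting condition


def spec_as_written(inputs: FlightInputs) -> float:
    """The literal spec: when a stall is detected, command nose down."""
    if inputs.aoa_deg > STALL_AOA_DEG:
        return NOSE_DOWN_TRIM_DEG
    return 0.0


def spec_after_asking_questions(inputs: FlightInputs) -> float:
    """The same function after the questions above have been asked."""
    # "When and how can the system be disabled?"
    if inputs.pilot_disabled:
        return 0.0
    # "Under what conditions do we NOT send an adjustment signal?"
    if inputs.flaps_extended:
        return 0.0
    # "What if the stall signal is wrong?" A physically implausible reading
    # should never trigger a correction (a real design would cross-check
    # a second sensor as well).
    if not 0.0 <= inputs.aoa_deg <= 30.0:
        return 0.0
    if inputs.aoa_deg > STALL_AOA_DEG:
        return NOSE_DOWN_TRIM_DEG
    return 0.0
```

The difference between the two versions is not much code; it is the questions that produce it.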

If you use the title “Software Engineer,” the right answer should be obvious, because one of the primary responsibilities of engineers is the safety of people using the systems they design. Any professional engineer knows that they are responsible for understanding how their part of a system interacts with the overall system and being alert to anything that might compromise safety. So if you call yourself an engineer, you should be asking questions about the safety of the system before you write the code.

It doesn’t matter that the Elevator Feel Shift System has been working well for many years – the fact is that this system has always depended on the reading from a single sensor, and that sensor can – and WILL – malfunction. In the earlier versions of the Elevator Feel Shift System, a single sensor was not critical, because the system provided a warning to pilots, who then took corrective action if needed; pilots can detect and ignore a false signal. But when there is no pilot in the loop and the software is supposed to automatically correct upon sensing a stall, it would be a good idea to make sure that the stall is real before a correction is made. At the very least, there should be an easy, intuitive, and PERMANENT override of the system if it malfunctions. And yes, this override should leave the plane in a manageable state. If this were your team, would you dig deep enough to discover that the stall signal depended on a single sensor?  Would you discuss whether there were limits to the extent of its response or conditions under which the system should not respond?
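If a team did dig that deep, the fix is not exotic. Below is a sketch of what “make sure the stall is real, and honor a permanent override” might look like; the sensor names, the agreement tolerance, and the latched override are invented for illustration only.

```python
# Illustrative sketch only: the two-sensor cross-check, the tolerance, and the
# latched override are assumptions made for this example, not a real design.

class StallDetector:
    STALL_AOA_DEG = 14.0             # invented stall threshold
    AGREEMENT_TOLERANCE_DEG = 5.0    # how far two readings may disagree

    def __init__(self) -> None:
        self.permanently_disabled = False   # once set by the crew, it stays set

    def pilot_override(self) -> None:
        """An obvious, practiced action that disables the system for good."""
        self.permanently_disabled = True

    def should_command_nose_down(self, aoa_left_deg: float, aoa_right_deg: float) -> bool:
        if self.permanently_disabled:
            return False
        # If the two sensors disagree, neither reading can be trusted:
        # fail safe by doing nothing and (in a real system) alerting the crew.
        if abs(aoa_left_deg - aoa_right_deg) > self.AGREEMENT_TOLERANCE_DEG:
            return False
        # Only act when both sensors agree that a stall is imminent.
        return min(aoa_left_deg, aoa_right_deg) > self.STALL_AOA_DEG
```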

Possibly a more serious problem with the MCAS system is that it apparently resets after five seconds of normal operation, and thus can push the nose down repeatedly. Would your team have considered such a scenario? Would you have thought through the conditions under which a nose-down command would be dangerous, and how the system could be disabled? It might not be your job to train pilots on how to use a system, but it is the job of engineers to build systems that are easy and intuitive to override when things go wrong.
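Here is one way a team might have bounded that behavior: treat a pilot’s counter-input as a standing veto and cap the total correction, instead of quietly re-arming after five seconds. Again, the numbers and structure are invented purely to illustrate the point.

```python
# Illustrative sketch only: the cap, the veto, and the names are invented to
# show one way repeated nose-down commands could have been bounded.

class TrimCommander:
    MAX_TOTAL_NOSE_DOWN_DEG = 2.5    # never command more than this in total

    def __init__(self) -> None:
        self.total_commanded_deg = 0.0
        self.pilot_countered = False

    def note_pilot_counter_input(self) -> None:
        # A pilot trimming against the system is a standing veto,
        # not a five-second pause.
        self.pilot_countered = True

    def nose_down_command(self, requested_deg: float) -> float:
        """Return how much nose-down trim is actually granted."""
        if self.pilot_countered:
            return 0.0
        remaining = self.MAX_TOTAL_NOSE_DOWN_DEG - self.total_commanded_deg
        granted = max(0.0, min(requested_deg, remaining))
        self.total_commanded_deg += granted
        return granted
```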

The demand for control software is going to increase significantly as we move into an era of driverless vehicles and increasingly automated equipment. These systems must be safe – and there are many opinions on how to make sure they are safe. I would propose that although good processes are important for safety, good engineering is the fundamental discipline that makes systems increasingly safe. Processes that separate design from implementation get in the way of good engineering and are not appropriate for complex technical systems.

Software engineers need to understand what civil engineers learn as undergraduates – safety is not someone else’s job; it is the responsibility of every engineer involved in the design and implementation of a system whose failure might cause harm. If your team is not ready to accept this responsibility, then call yourselves developers or programmers or technicians – but not engineers.