Thursday, April 4, 2019

What If Your Team Wrote the Code for the 737 MCAS System?

The 737 has been around for half a century, and over that time airplanes have evolved from manual controls to fly-by-wire systems. As each new generation of 737 appeared, the control system became more automated, but there was a concerted effort to maintain the “feel” of the previous system so pilots did not have to adapt to dramatically different mental models of how to control a plane. When electronic signals replaced manual controls, an “Elevator Feel Shift System” was added to simulate the resistance pilots felt when using manual controls, providing feedback through the feel of the control stick (yoke). A stall warning mechanism was also added – it was designed to catch the pilot’s attention if a stall seemed imminent, alerting the pilot to push forward on the yoke, lowering the nose and increasing airspeed.

Enter a new version of the 737 (the 737 MAX) – rushed to market to counter a serious competitive threat. To make the plane more fuel efficient, new (larger) engines were added. Since the landing gear could not reasonably be extended to accommodate the larger engines, they were positioned a bit further forward and higher on the wing – causing instability under certain flight conditions. Boeing addressed this instability with the MCAS system – a modification of the (already certified and proven) Elevator Feel Shift System that would automatically lower the nose when an imminent stall was detected, rather than alerting the pilot. Of course, airplanes have been flying on autopilot for years, so a little automatic correction while in manual mode is not a radical concept. The critical safety requirement here is not redundancy, because the pilot is expected to override an autopilot system if warranted. The critical safety requirement is that if an autopilot system goes haywire, the pilots are alerted in time to use a very obvious and practiced process to override it.

Two 737 MAX airplanes have crashed, and the MCAS system has been implicated as a potential cause of each one. Based on preliminary reports, it appears that a combination of a (single) faulty sensor and a persistent reversal of the pilot override may eventually be found to have contributed to the disasters.

Hindsight is always much clearer than foresight, and we know that predicting all possible behaviors of complex systems is impossible. And yet, I wonder if the people who wrote the code for the MCAS system were involved in the systems engineering that led to its design. More to the point, as driverless vehicles and sophisticated automated equipment become increasingly practical, what is the role of software engineering in assuring that these systems are safe?

Back in the day, I wrote software which controlled large roll-goods manufacturing processes. I worked in an engineering department where no one entertained the idea of separating design from implementation. WE were the engineers responsible for understanding, designing, and installing control systems. A suggestion that someone else might specify the engineering details of our systems would not have been tolerated. One thing we knew for sure – we were responsible for designing safe systems, and we were not going to delegate that responsibility to anyone else. Another thing we knew for sure was that anything that could go wrong would eventually go wrong – so every element of our systems had to be designed to fail safely; every input to our system was suspect; and no output could be guaranteed to reach its destination. And because my seasoned engineering colleagues were suspicious of automation, they added manual (and very visible) emergency stop systems that could easily and quickly override my automated controls.

But then something funny happened to software. Managers (often lacking coding experience or an engineering background) decided that it would be more efficient if one group of people focused on designing software systems while another group of people actually wrote the code. I have never understood how this could possibly work, and quite frankly, I have never seen it succeed in a seriously complex environment. But there you have it – for a couple of decades, common software practice has been to separate design from implementation, distancing software engineers from the design of the systems they are supposed to implement.

Returning to the 737 MAX MCAS system, while it’s not useful to speculate about how the MCAS software was designed, it is useful to imagine how your team – your organization – would approach a similar problem. Suppose your team was tasked with modifying the code of a well-proven, existing system to add a modest change to overcome a design problem in a new generation of the product. How would it work in your environment? Would you receive a spec that said: When a stall is detected (something the software already does), send an adjustment signal to bring the nose down? And would you write the code as specified, or would you ask some questions – such as “What if the stall signal is wrong, and there really isn’t a stall?” Or “Under what conditions do we NOT send an adjustment signal?” Or “When and how can the system be disabled?”

If you use the title “Software Engineer,” the right answer should be obvious, because one of the primary responsibilities of engineers is the safety of people using the systems they design. Any professional engineer knows that they are responsible for understanding how their part of a system interacts with the overall system and being alert to anything that might compromise safety. So if you call yourself an engineer, you should be asking questions about the safety of the system before you write the code.

It doesn’t matter that the Elevator Feel Shift System has been detecting stalls for many years – the fact is that this system has always depended on the reading from a single sensor, and that sensor can – and WILL – malfunction. In the earlier version of the Elevator Feel Shift System, a single sensor was not critical, because the system provided a warning to pilots, who then took corrective action if needed; pilots can detect and ignore a false signal. But when there is no pilot in the loop and the software is supposed to automatically correct upon sensing a stall, it would be a good idea to make sure that the stall is real before a correction is made. At the very least, there should be an easy, intuitive, and PERMANENT override of the system if it malfunctions. And yes, this override should leave the plane in a manageable state. If this were your team, would you dig deep enough to discover that the stall signal depended on a single sensor?  Would you discuss whether there were limits to the extent of its response or conditions under which the system should not respond?
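As a thought experiment – this is an illustrative sketch, not actual flight software – the cross-check and override logic described above might look something like this, assuming two hypothetical angle-of-attack sensors and invented threshold values:

```python
# Illustrative sketch only. The sensor names, thresholds, and units
# are invented for discussion; real flight software is far more involved.

STALL_AOA_DEG = 14.0        # hypothetical angle-of-attack stall threshold
MAX_DISAGREEMENT_DEG = 5.0  # hypothetical cross-check tolerance

class StallCorrector:
    def __init__(self):
        self.pilot_override = False  # once set, stays set

    def override(self):
        """Pilot override latches permanently (until a maintenance reset)."""
        self.pilot_override = True

    def should_correct(self, aoa_left_deg, aoa_right_deg):
        # A latched pilot override always wins.
        if self.pilot_override:
            return False
        # If the two sensors disagree, the stall signal cannot be trusted:
        # fail safe by doing nothing (and, in a real system, alerting the crew).
        if abs(aoa_left_deg - aoa_right_deg) > MAX_DISAGREEMENT_DEG:
            return False
        # Act only when both sensors independently indicate a stall.
        return min(aoa_left_deg, aoa_right_deg) > STALL_AOA_DEG
```

The point is not the specific numbers – it is that a single sensor never gets to trigger an automatic correction on its own, and that a pilot who has overridden the system stays in control.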

Possibly a more serious problem with the MCAS system is that it apparently resets after five seconds of normal operation, and thus can push the nose down repeatedly. Would your team have considered such a scenario? Would you have thought through the conditions under which a nose-down command would be dangerous, and how the system could be disabled? It might not be your job to train pilots on how to use a system, but it is the job of engineers to build systems that are easy and intuitive to override when things go wrong.
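One simple guard against that failure mode is to cap the total authority the system can ever accumulate, no matter how many times it reactivates. Again, this is an illustrative sketch with invented units and limits, not a description of how MCAS actually works:

```python
# Illustrative sketch only. The cap and increment values are invented.

MAX_TOTAL_TRIM_UNITS = 2.5  # hypothetical cap on cumulative nose-down authority
INCREMENT_UNITS = 0.6       # hypothetical adjustment per activation

class TrimLimiter:
    def __init__(self):
        self.applied_units = 0.0

    def next_adjustment(self):
        """Return the trim to apply this activation, never exceeding the cap.

        Without a cumulative cap, a system that resets after a few seconds
        of normal operation can push the nose down again and again,
        accumulating far more authority than any single activation suggests.
        """
        remaining = MAX_TOTAL_TRIM_UNITS - self.applied_units
        adjustment = min(INCREMENT_UNITS, max(remaining, 0.0))
        self.applied_units += adjustment
        return adjustment
```

After a handful of activations the limiter simply stops contributing, so repeated triggering cannot silently trim the airplane into an unrecoverable state.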

The demand for control software is going to increase significantly as we move into an era of driverless vehicles and increasingly automated equipment. These systems must be safe – and there are many opinions on how to make sure they are safe. I would propose that although good processes are important for safety, good engineering is the fundamental discipline that makes systems increasingly safe. Processes that separate design from implementation get in the way of good engineering and are not appropriate for complex technical systems.

Software engineers need to understand what civil engineers learn as undergraduates – safety is not someone else’s job; it is the responsibility of every engineer involved in the design and implementation of a system whose failure might cause harm. If your team is not ready to accept this responsibility, then call yourselves developers or programmers or technicians – but not engineers.

Saturday, January 19, 2019

An Interview

Recently I was asked to complete an interview via e-mail. I found the questions quite interesting - so I decided to post them here.

Caution: The answers are brief and lack context. Some of them are probably controversial, and the interview format didn't provide space for going below the surface. Send me an e-mail if you'd like to explore any of these topics further.

When did you first start applying Lean to your software development work? Where did you get the inspiration from?

I think it’s important to set the record straight – most early software engineering was done in a manner we now call ‘Lean.’ My first job as a programmer was working on the Number 2 Electronic Switching System when it was under development at Bell Telephone Labs. Not long after that, I was helping a physicist do research into high-energy particle tracing. The computer I worked on was a minicomputer he scrounged up from a company that had gone bankrupt. With a buggy FORTRAN compiler and a lot of assembly language, we controlled a film scanner that digitized thousands of frames of bubble chamber film, projected the results into three-dimensional space, and identified unique events for further study.

My next job was designing automated vehicle controls in an advanced engineering department of General Motors. From there I moved to an engineering department in 3M where we developed control systems for the big machines that make tape. In every case, we used good engineering practices to solve challenging technical problems.

In a very real sense, I believe that lean ideas are simply good engineering practices, and since I began writing code in good engineering departments, I have always used lean ideas when developing software.

From the organizations you've worked with, what have been some of the most common challenges associated with Lean transformations?

Far and away the most common problem occurs when companies head into a transformation for the sake of the transformation, instead of clearly and crisply identifying the business outcomes that are expected as a result of the transformation. You don’t do agile to do agile. You don’t do lean to do lean. You don’t do digital to do digital. You do these things to create a more engaging work environment, earn enough money to support that environment, and build products or services that truly delight customers.

So an organization that sets out on a transformation should be looking at these questions:
  1. Is the transformation unlocking the potential of everyone who works here? How do we know?
  2. Are we creating products and services that customers love and will pay for? How do we know?
  3. Are we creating the reputation and revenue necessary to sustain our business over the long run?

There's lots of talk now around scaled Agile frameworks such as SAFe, Nexus, LeSS, etc., with mixed results. How do you approach the challenge of scaling this way of working?

Every large agile framework that I know of is an excuse to avoid the difficult and challenging work of sorting out the organization’s system architecture so that small agile teams can work independently. You do not create smart, innovative teams by adding more process, you create them by breaking dependencies.

What we have learned from the Internet and from the Cloud is very simple – really serious scale can only happen when small teams independently leverage local intelligence and creativity. Companies that think scaled agile processes will help them scale will discover that these processes are not the right path to truly serious scale.

One of the common complaints from developers on Agile teams is that they don't feel connected to customers, and there is sometimes a feeling of working on outputs rather than customer outcomes. How might this be changed?

This is the essential problem of organizations that consider agile a process rather than a way to empower teams to do their best work. The best way to fix the problem is to create a direct line of sight from each team to its consumers.

When the Apple iPhone was being developed, small engineering teams worked in short cycles that were aimed at a demo of a new feature. Even though the demo group was limited due to security, it was representative of future consumers. Each team was completely focused on making the next demo more pleasing and comfortable for their audience than the last one. These quick feedback loops over two and a half years led directly to a device that pretty much everyone loved. [1]

At our meetup last year, you spoke about resisting proxies, and one of those proxies is the Product Owner. What alternative approaches have you seen work for Lean or Agile teams, as opposed to having a Product Owner?

Why do software engineers need someone to come up with ideas for them? Ken Kocienda was a software engineer who ‘signed up’ to be responsible for developing the iPhone’s keypad. In the book Creative Selection [1], he describes how he developed the design, algorithms, and heuristics that created a seamless experience when typing on the iPhone keyboard, even though it was too small for most people’s fingers.

Similarly, at SpaceX, every component has a ‘responsible engineer’ who figures out how to make that component do its proper job as part of the launch system. John Muratore, SpaceX Launch Director, says “SpaceX operates on a philosophy of Responsibility – no engineering process in existence can replace this for getting things done right, efficiently.” [2]

The Chief Engineer approach is common in engineering departments from Toyota to GE Healthcare. It works very well. There is nothing about software that would exempt it from the excellent results you get when you give engineers the responsibility of understanding and solving challenging problems.

What is the most common thing you've seen recently which is slowing down organizations' concept-to-cash loop?

Friction. For example, the dependencies generated by the big back end of a banking system are a huge source of friction for product teams. The first thing organizations need to do is to learn how to recognize friction and stop thinking of it as necessary. When Amazon moved to microservices (from 2001 to 2006) the company had to abandon the idea that transactions are managed by a central database – which was an extremely novel idea at the time.

Over time, Amazon learned how to recognize friction and reduce it. Today, Amazon Web Services (AWS) launches a new enterprise-level service perhaps once a month and releases about two new features per day. Even more remarkable, AWS has worked at a similar pace for over a decade. If you look closely, an Amazon service is owned by a small team led by someone who has 'signed up' to be responsible for delivering and supporting a service that addresses a distinct customer need at a price that is extremely attractive yet provides enough revenue to sustain the service over time.

1. See Creative Selection by Ken Kocienda.

2. John Muratore, System Engineering: A Traditional Discipline in a Non-traditional Organization, Talk at 2012 Complex Aerospace Systems Exchange Event.