Friday, June 7, 2019

Lean Lunch Lines

“This doesn’t look good,” Tom said, pointing out six signs high on the wall, side by side. Three said “Pork and Beef.” Three said “Fish and Vegan.” Below each sign a serving station was being set up. Lunch break started in half an hour. There were over 2000 people to feed, and the pouring rain meant any outdoor food options were unlikely to attract much traffic.

We weren’t the only ones who noticed. Lines began to form at the food stations; the queues for meat grew especially long. We joined one of the three long lines and moved slowly to the front. At the food station we had several choices to make – so it took a while to be served.

I did the math. Each meat station was serving about 4 people per minute, so the three meat stations might serve roughly 750 people in an hour – maybe 1000 if the service got faster. From the short lines at the fish/vegan stations, I inferred that they might account for 15-20% of the demand, leaving over 1500 people to be served by the three lines offering meat. The 90-minute lunch break was probably not going to be long enough.
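The back-of-the-envelope estimate above can be sketched in a few lines of Python. This is only a rough illustration: the function name and constants are made up for the example, demand is rounded down to 1500, and service is assumed to be steady with no station ever idle.

```python
def minutes_to_serve(people, stations, rate_per_station):
    """Minutes needed to serve everyone at a steady rate (people per minute per station)."""
    return people / (stations * rate_per_station)

MEAT_STATIONS = 3
RATE_PER_STATION = 4        # people per minute, as observed in line
MEAT_DEMAND = 1500          # "over 1500 people", rounded down (assumption)
LUNCH_BREAK_MINUTES = 90

capacity_per_hour = MEAT_STATIONS * RATE_PER_STATION * 60
minutes_needed = minutes_to_serve(MEAT_DEMAND, MEAT_STATIONS, RATE_PER_STATION)

print(capacity_per_hour)                     # 720
print(minutes_needed)                        # 125.0
print(minutes_needed > LUNCH_BREAK_MINUTES)  # True
```

Even with the demand rounded down, the three meat lines need more than two hours, well beyond the 90-minute break.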

Sure enough, just before the sessions were set to resume, a tweet announced that the afternoon talks would be delayed by a half hour.

At a conference whose attendees pride themselves on agility and thus should understand queuing theory, this was a big disappointment. In Lean we have a mantra: “Go and See.” It means go to the place where delays are happening and see for yourself what is going on.  I can’t help thinking that if an observant person with authority to change things had taken a close look at the lines on the first day, it might have been possible to improve the process and keep the conference on schedule. For example, they might have switched two of the three fish/vegan lines to meat, or perhaps they could have served all types of food from all six food stations, or shortened the time it took to serve a meal.

The conference had a second day, and I assumed that lessons had been learned and queues would be shorter. As lunch time approached, there were four more signs posted high on a wall at the far end of the room, perpendicular to the first six signs. Two said “Pork and Beef.” Two said “Fish and Vegan.” So now there were five long lines for the 80% or so of the attendees who preferred meat, and five short lines for the others. An observant caterer would not have added more fish/vegan lines; with a total of ten lines, no more than two should have been devoted to the meatless meals preferred by perhaps 20% of the attendees. Of course, that ratio was not precise; for example, although I preferred meat, I opted for the very short line serving fish, and I’m sure I was not alone.

The real test of lunch line flow is this: How long do attendees have to wait in line to obtain the food of their choice? At a conference where organizers understand queuing theory, attendees should not have to wait in a food queue for more than 10 minutes – or maybe up to 15 minutes at peak times. Asking people to stand in line for longer periods shows a lack of respect for their time.

The following week we attended DevOps Days in Zürich. There were roughly 400 attendees, and as lunch approached I noticed two stations being stocked with food. I wondered how many people a station might serve per minute. If it was four per minute (as in Budapest) and there were only two stations serving lunch, I speculated that it could take 50 minutes to serve everyone – and that’s a long time for anyone to stand in line.

But I was wrong. Julia Faust, who helped plan the lunch, explained to me: “DevOps is about FLOW, so we want lunch lines that flow. We have four food stations – not two – and we have the same food at each station so anyone can get in any line. We limited the number of food offerings to be sure service is fast; we expect each station to serve about ten people per minute. We think that the four food stations can feed up to 40 people per minute, so 400 people could be fed in ten to fifteen minutes. We have also placed appetizers around on the tables so people can eat something before getting in line. We are hoping that the lines will form gradually and remain very short.”
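The two Zürich scenarios, the pessimistic guess and the organizers' plan, can be compared with the same kind of arithmetic. This is a minimal sketch with a hypothetical helper; it assumes everyone queues at once and the stations run at a steady rate, so it gives the best case for each setup.

```python
def minutes_to_serve(people, stations, rate_per_station):
    """Minutes to serve everyone if all stations work at a steady rate."""
    return people / (stations * rate_per_station)

ATTENDEES = 400

# The guess: two stations at the four-per-minute rate seen the week before.
guess = minutes_to_serve(ATTENDEES, stations=2, rate_per_station=4)

# The plan: four identical stations, about ten people per minute each.
plan = minutes_to_serve(ATTENDEES, stations=4, rate_per_station=10)

print(guess)  # 50.0
print(plan)   # 10.0
```

Doubling the stations and simplifying the menu each contribute: the first halves the time, the second cuts it by a further factor of 2.5.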

Sure enough, there were almost no lines at the food stations, and everyone was served within fifteen minutes. Perhaps no one noticed that there was more time for networking and lunch-time gatherings, but to me it was clear that the organizers of this conference understood queuing theory and respected the time of attendees.

We went to the MiXiT conference in Lyon the following week, where about 1000 attendees were expecting lunch. I was delighted to see that people were able to help themselves to food rather than have it served. I never could understand why most European conferences I attend find it necessary to have someone serve food and pour coffee. After all, just about every European hotel I stay at seems to have a breakfast buffet, so there's nothing inherently difficult about self-service meals.

There were two long food tables, one on each side of the lunchroom. Again, I did the math. To feed 1000 people in 15 minutes, the takt time (expressed here as an output rate) for each table would have to be about 33 people per minute. With a line on each side of each table, the rate would need to be about 17 people per minute for each of the four lines. But my calculation was wrong – there would be only one line per table, not two, so a 15-minute line would require that each line serve 33 people per minute. Clearly this was not going to happen.
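The required pace per line can be worked out directly. The sketch below uses hypothetical helper names and also converts each rate into seconds available per person, which is the more usual way to state a takt time.

```python
def required_rate_per_line(people, minutes, lines):
    """People per minute each line must serve to finish within the target time."""
    return people / minutes / lines

def seconds_per_person(rate_per_minute):
    """Time available to serve one person at the given rate."""
    return 60 / rate_per_minute

ATTENDEES = 1000
TARGET_MINUTES = 15

two_sided = required_rate_per_line(ATTENDEES, TARGET_MINUTES, lines=4)
one_sided = required_rate_per_line(ATTENDEES, TARGET_MINUTES, lines=2)

print(round(two_sided, 1), round(seconds_per_person(two_sided), 1))
print(round(one_sided, 1), round(seconds_per_person(one_sided), 1))
```

With one line per table, each person has under two seconds to pick up three or four items, which is why the single-sided layout cannot meet a 15-minute target.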

“Why not pull the tables away from the wall a bit further and allow people to get food from both sides?” I asked the gentleman in charge of lunch.

“That’s not possible,” he replied. “We need access to one side of each table for replenishment.”

“But,” I said, “Then the lines would move twice as fast.”

“They are fast enough,” he said. “Yesterday the lines were 35 minutes long; they don't need to be faster.”

I saw tables stacked high with boxes and bags of food, with a long line of people moving past each table, picking up three or four individually packaged items. On the other side, a few people watched and occasionally replenished one or another of the stacks of food. They could have easily interrupted a line to add a depleted item – this happens all the time at breakfast buffets. There was no reason (other than habit) to limit each table to one line. A conference that respects its attendees should optimize lunch lines for the convenience of attendees, and find other ways to optimize the convenience of the people serving food.

A Footnote on Diversity
Every software conference I attend broadcasts a policy encouraging diversity. I welcome that because I am different than most attendees – I am 75 years old (and proud of it). But somehow, my kind of diversity has not been considered at most conferences. Consider the first conference we stopped at this spring – Agile Lean Ireland. There were virtually no chairs except in rooms where talks were held; everyone was expected to stand during coffee and lunch breaks. So Tom and I ate – usually alone – in a conference room.

Lunch was served at multiple locations strung out down a long hallway. The station furthest from the conference rooms opened first, to encourage people to move to the most remote station. This might have been a good idea, except for one thing – I found myself swept up in a swarm of attendees racing down the hallway to get served first. After a short time, I just stopped – I had gone far enough. I turned to the servers at a nearby lunch station (which was not yet opened) and said “Give me food. Now. I can’t go any further.”

The gentleman I spoke to was about to refuse, but his wiser companion indicated he should go ahead. As he served lunch for Tom and me, I smiled gratefully at the woman who had broken the rules. Then she said to the nearby people hoping to get some food, “This location is not yet open. You have to keep moving to the end of the hall.” Sigh.

Long walks, long lines, no chairs, and toilets up or down stairs are all indications that a conference does not really welcome older, less agile attendees. Of the four conferences that Tom and I attended in April and May, DevOps Days in Zürich was the only one which had none of these limitations, and thus made us feel the most welcome.

Thursday, April 4, 2019

What If Your Team Wrote the Code for the 737 MCAS System?

The 737 has been around for a half century, and over that time airplanes have evolved from manual controls to fly-by-wire systems.  As each new generation of 737 appeared, the control system became more automated, but there was a concerted effort to maintain the “feel” of the previous system so pilots did not have to adapt to dramatically different mental models of how to control a plane. When electronic signals replaced manual controls an “Elevator Feel Shift System” was added to simulate the resistance pilots felt when using manual controls and provide feedback through the feel of the control stick (yoke). A stall warning mechanism was also added – it was designed to catch the pilot’s attention if a stall seemed imminent, alerting the pilot to push forward on the yoke and thus increase the thrust and lower the nose a bit.

Enter a new version of the 737 (the 737 MAX) – rushed to market to counter a serious competitive threat. To make the plane more energy efficient, new (larger) engines were added. Since the landing gear could not reasonably be extended to allow for the larger engines, they were positioned a bit further forward and higher on the wing – causing instability problems under certain flight conditions. Boeing addressed this instability with the MCAS system – a modification of the (already certified and proven) Elevator Feel Shift System that would automatically lower the nose when an imminent stall is detected, rather than alerting the pilot. Of course, airplanes have been running on auto-pilot for years, so a little bit of automatic correction while in manual mode is not a radical concept. The critical safety requirement here is not redundancy, because the pilot is expected to override an autopilot system if warranted. The critical safety requirement is that if an autopilot system goes haywire, the pilots are alerted in time to use a very obvious and practiced process to override it.

Two 737 MAX airplanes have crashed, and the MCAS system has been implicated as a potential cause of each one. Based on preliminary reports, it appears that the MCAS system, operating with a combination of a (single) faulty sensor and a persistent reversal of the pilot override, may eventually be implicated as contributing to the disasters.

Hindsight is always much clearer than foresight, and we know that predicting all possible behaviors of complex systems is impossible. And yet, I wonder if the people who wrote the code for the MCAS system were involved in the systems engineering that led to its design. More to the point, as driverless vehicles and sophisticated automated equipment become increasingly practical, what is the role of software engineering in assuring that these systems are safe?

Back in the day, I wrote software which controlled large roll-goods manufacturing processes. I worked in an engineering department where no one entertained the idea of separating design from implementation. WE were the engineers responsible for understanding, designing, and installing control systems. A suggestion that someone else might specify the engineering details of our systems would not have been tolerated. One thing we knew for sure – we were responsible for designing safe systems, and we were not going to delegate that responsibility to anyone else. Another thing we knew for sure was that anything that could go wrong would eventually go wrong – so every element of our systems had to be designed to fail safely; every input to our system was suspect; and no output could be guaranteed to reach its destination. And because my seasoned engineering colleagues were suspicious of automation, they added manual (and very visible) emergency stop systems that could easily and quickly override my automated controls.

But then something funny happened to software. Managers (often lacking coding experience or an engineering background) decided that it would be more efficient if one group of people focused on designing software systems while another group of people actually wrote the code. I have never understood how this could possibly work, and quite frankly, I have never seen it succeed in a seriously complex environment. But there you have it – for a couple of decades, common software practice has been to separate design from implementation, distancing software engineers from the design of the systems they are supposed to implement.

Returning to the 737 MAX MCAS System, while it’s not useful to speculate how the MCAS software was designed, it’s useful to imagine how your team – your organization – would approach a similar problem. Suppose your team was tasked with modifying the code of a well-proven, existing system to add a modest change to overcome a design problem in a new generation of the product. How would it work in your environment? Would you receive a spec that said: When a stall is detected (something the software already does), send an adjustment signal to bring the nose down? And would you write the code as specified, or would you ask some questions – such as “What if the stall signal is wrong, and there really isn’t a stall?” Or “Under what conditions do we NOT send an adjustment signal?” Or “When and how can the system be disabled?”

If you use the title “Software Engineer,” the right answer should be obvious, because one of the primary responsibilities of engineers is the safety of people using the systems they design. Any professional engineer knows that they are responsible for understanding how their part of a system interacts with the overall system and being alert to anything that might compromise safety. So if you call yourself an engineer, you should be asking questions about the safety of the system before you write the code.

It doesn’t matter that the Elevator Feel Shift System has been working well for many years – the fact is that this system has always depended on the reading from a single sensor, and that sensor can – and WILL – malfunction. In the earlier versions of the Elevator Feel Shift System, a single sensor was not critical, because the system provided a warning to pilots, who then took corrective action if needed; pilots can detect and ignore a false signal. But when there is no pilot in the loop and the software is supposed to automatically correct upon sensing a stall, it would be a good idea to make sure that the stall is real before a correction is made. At the very least, there should be an easy, intuitive, and PERMANENT override of the system if it malfunctions. And yes, this override should leave the plane in a manageable state. If this were your team, would you dig deep enough to discover that the stall signal depended on a single sensor?  Would you discuss whether there were limits to the extent of its response or conditions under which the system should not respond?
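To make those questions concrete, here is a purely hypothetical sketch of the kind of defensive checks a team might discuss. It is not how MCAS or any real flight-control system works: the function name, the thresholds, and the assumption that a second, independent sensor is available are all invented for illustration.

```python
from typing import Optional

# All values below are invented for illustration, not real aircraft parameters.
STALL_ANGLE_DEG = 14.0        # hypothetical angle at which a stall is suspected
MAX_DISAGREEMENT_DEG = 5.0    # hypothetical tolerance between the two sensors
MAX_CORRECTION_DEG = 2.5      # hypothetical cap on any single correction

def nose_down_correction(sensor_a: Optional[float],
                         sensor_b: Optional[float],
                         override_engaged: bool) -> float:
    """Return a nose-down correction in degrees, or 0.0 when it is not safe to act."""
    if override_engaged:
        return 0.0            # the human has taken over: never fight the override
    if sensor_a is None or sensor_b is None:
        return 0.0            # a missing input is treated as suspect
    if abs(sensor_a - sensor_b) > MAX_DISAGREEMENT_DEG:
        return 0.0            # the sensors disagree, so the stall may not be real
    angle = min(sensor_a, sensor_b)   # act only on the more conservative reading
    if angle <= STALL_ANGLE_DEG:
        return 0.0
    return min(angle - STALL_ANGLE_DEG, MAX_CORRECTION_DEG)

print(nose_down_correction(20.0, 3.0, override_engaged=False))   # 0.0
print(nose_down_correction(16.0, 15.0, override_engaged=False))  # 1.0
```

Every branch that returns 0.0 answers one of the questions an engineer would be expected to ask: what if the input is missing, what if it is wrong, and when should the system not respond at all.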

Possibly a more serious problem with the MCAS system is that it apparently resets after five seconds of normal operation, and thus can push the nose down repeatedly. Would your team have considered such a scenario? Would you have thought through the conditions under which a nose down command would be dangerous, and how the system could be disabled? It might not be your job to train pilots on how to use a system, but it is the job of engineers to build systems that are easy and intuitive to override when things go wrong.
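One way to think about a permanent override, as opposed to one that quietly resets, is a latch: once the operator disables the automation it stays off until someone deliberately re-arms it. The sketch below is an illustrative assumption only, with an invented class and method names and a made-up limit on repeated activations; it does not describe any real aircraft system.

```python
class LatchedOverride:
    """Gate for an automated action that the operator can switch off for good."""

    def __init__(self, max_activations=1):
        self.disabled = False
        self.activations = 0
        self.max_activations = max_activations  # hypothetical cap on repeated commands

    def operator_disable(self):
        """Operator override: latches, and is not cleared by the passage of time."""
        self.disabled = True

    def rearm(self):
        """Deliberate reset, for example by maintenance after the fault is found."""
        self.disabled = False
        self.activations = 0

    def request_activation(self):
        """Return True if the automation may act now; counts each granted request."""
        if self.disabled or self.activations >= self.max_activations:
            return False
        self.activations += 1
        return True

gate = LatchedOverride(max_activations=2)
print(gate.request_activation())  # True
print(gate.request_activation())  # True
print(gate.request_activation())  # False: the cap stops repeated commands
gate.rearm()
gate.operator_disable()
print(gate.request_activation())  # False: stays off until re-armed
```

The design choice worth debating is who is allowed to call the reset and when; in this sketch nothing but an explicit call to rearm ever clears the latch.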

The demand for control software is going to increase significantly as we move into an era of driverless vehicles and increasingly automated equipment. These systems must be safe – and there are many opinions on how to make sure they are safe. I would propose that although good processes are important for safety, good engineering is the fundamental discipline that makes systems increasingly safe. Processes that separate design from implementation get in the way of good engineering and are not appropriate for complex technical systems.

Software engineers need to understand what civil engineers learn as undergraduates – safety is not someone else’s job; it is the responsibility of every engineer involved in the design and implementation of a system whose failure might cause harm. If your team is not ready to accept this responsibility, then call yourselves developers or programmers or technicians – but not engineers.

Saturday, January 19, 2019

An Interview

Recently I was asked to complete an interview via e-mail. I found the questions quite interesting - so I decided to post them here.

Caution: The answers are brief and lack context. Some of them are probably controversial, and the interview format didn't provide space for going below the surface. Send me an e-mail if you'd like to explore any of these topics further.

When did you first start applying Lean to your software development work? Where did you get the inspiration from?

I think it’s important to set the record straight – most early software engineering was done in a manner we now call ‘Lean.’ My first job as a programmer was working on the Number 2 Electronic Switching System when it was under development at Bell Telephone Labs. Not long after that, I was helping a physicist do research into high energy particle tracing. The computer I worked on was a minicomputer that he scrounged up from a company that had gone bankrupt. With a buggy FORTRAN compiler and a lot of assembly language, we controlled a film scanner that digitized thousands of frames of bubble chamber film, projected the results into three dimensional space, and identified unique events for further study.

My next job was designing automated vehicle controls in an advanced engineering department of General Motors. From there I moved to an engineering department in 3M where we developed control systems for the big machines that make tape. In every case, we used good engineering practices to solve challenging technical problems.

In a very real sense, I believe that lean ideas are simply good engineering practices, and since I began writing code in good engineering departments, I have always used lean ideas when developing software.

From the organizations you've worked with, what have been some of the most common challenges associated with Lean transformations?

Far and away the most common problem occurs when companies head into a transformation for the sake of the transformation, instead of clearly and crisply identifying the business outcomes that are expected as a result of the transformation. You don’t do agile to do agile. You don’t do lean to do lean. You don’t do digital to do digital. You do these things to create a more engaging work environment, earn enough money to support that environment, and build products or services that truly delight customers.

So an organization that sets out on a transformation should be looking at these questions:
  1. Is the transformation unlocking the potential of everyone who works here? How do we know?
  2. Are we creating products and services that customers love and will pay for? How do we know?
  3. Are we creating the reputation and revenue necessary to sustain our business over the long run?

There's lots of talk now around scaled Agile frameworks such as SAFe, Nexus, LeSS, etc. with mixed results. How do you approach the challenge of scaling this way of working?

Every large agile framework that I know of is an excuse to avoid the difficult and challenging work of sorting out the organization’s system architecture so that small agile teams can work independently. You do not create smart, innovative teams by adding more process, you create them by breaking dependencies.

What we have learned from the Internet and from the Cloud is very simple – really serious scale can only happen when small teams independently leverage local intelligence and creativity. Companies that think scaled agile processes will help them scale will discover that these processes are not the right path to truly serious scale.

One of the common complaints developers on Agile teams have is that they don't feel connected to customers, and there is sometimes a feeling of working on outputs, rather than customer outcomes. How might this be changed?

This is the essential problem of organizations that consider agile a process rather than a way to empower teams to do their best work. The best way to fix the problem is to create a direct line of sight from each team to its consumers.

When the Apple iPhone was being developed, small engineering teams worked in short cycles that were aimed at a demo of a new feature. Even though the demo group was limited due to security, it was representative of future consumers. Each team was completely focused on making the next demo more pleasing and comfortable for their audience than the last one. These quick feedback loops over two and a half years led directly to a device that pretty much everyone loved. [1]

At our meetup last year, you spoke about resisting proxies, and one of those proxies is the Product Owner. What alternative approaches have you seen work for Lean or Agile teams, as opposed to having a Product Owner?

Why do software engineers need someone to come up with ideas for them? Ken Kocienda was a software engineer who ‘signed up’ to be responsible for developing the iPhone’s keypad. In the book Creative Selection [1], he describes how he developed the design, algorithms, and heuristics that created a seamless experience when typing on the iPhone keyboard, even though it was too small for most people’s fingers.

Similarly, at SpaceX, every component has a ‘responsible engineer’ who figures out how to make that component do its proper job as part of the launch system. John Muratore, SpaceX Launch Director, says “SpaceX operates on a philosophy of Responsibility – no engineering process in existence can replace this for getting things done right, efficiently.” [2]

The Chief Engineer approach is common in engineering departments from Toyota to GE Healthcare. It works very well. There is nothing about software that would exempt it from the excellent results you get when you give engineers the responsibility of understanding and solving challenging problems.

What is the most common thing you've seen recently which is slowing down organizations' concept-to-cash loop?

Friction. For example, the dependencies generated by the big back end of a banking system are a huge source of friction for product teams. The first thing organizations need to do is to learn how to recognize friction and stop thinking of it as necessary. When Amazon moved to microservices (from 2001 to 2006) the company had to abandon the idea that transactions are managed by a central database – which was an extremely novel idea at the time.

Over time, Amazon learned how to recognize friction and reduce it.  Today, Amazon Web Services (AWS) launches a new enterprise-level service perhaps once a month and about two new features per day. Even more remarkable, AWS has worked at a similar pace for over a decade. If you look closely, an Amazon service is owned by a small team led by someone who has 'signed up' to be responsible for delivering and supporting a service that addresses a distinct customer need at a cost that is both extremely attractive and provides enough revenue to sustain the service over time.

1. See Creative Selection by Ken Kocienda.

2. John Muratore, System Engineering: A Traditional Discipline in a Non-traditional Organization, Talk at 2012 Complex Aerospace Systems Exchange Event.