Figure 1 shows the relationship between the mental models of the designers and those of the operators. Designers deal with ideals or averages, not with the actual constructed system. The system may differ from the designer's original specification either through manufacturing and construction variances or through evolution and changes over time. The designer also provides the original operational procedures as well as information for basic operator training based on the original design specification. These procedures may be incomplete, e.g., missing some remote but possible conditions or assuming that certain conditions cannot occur. For example, the procedures and simulator training for the operators at the Three Mile Island nuclear power plant omitted the conditions that actually occurred in the well-known incident because the designers assumed that those conditions were impossible.
In contrast, operators must deal with the actual constructed system and the conditions that occur, whether anticipated or not. They use operational experience and experimentation to continually test their mental models of the system against reality and to adjust the procedures as they deem appropriate. They also must cope with production and other pressures such as the desire for efficiency and "lean operations." These concerns may not have been accounted for in the original design.
Procedures, of course, are updated periodically to reflect changing conditions or knowledge. Between updates, however, operators must strike a balance between:
1. Adapting procedures in the face of unanticipated conditions, which may lead to unsafe outcomes if the operators do not have complete knowledge of the existing conditions in the plant or lack knowledge (as at Three Mile Island) of the implications of the plant design. If, in hindsight, they are wrong, operators will be blamed for not following the procedures.
2. Sticking to procedures rigidly when feedback suggests they should be adapted, which may lead to incidents when the procedures are wrong for the particular existing conditions. If, in hindsight, the procedures turn out to be wrong, the operators will be blamed for rigidly following them.
In general, procedures cannot assure safety. No procedures are perfect for all conditions, including unanticipated ones. Safety comes from operators being skillful in judging when and how the procedures apply. Safety does not come from organizations forcing operators to follow procedures but instead from organizations monitoring and understanding the gap between procedures and practice. Examining the reasons why operators may not be following procedures can lead to better procedures and safer systems.
Designers also must provide the feedback necessary for the operators to correctly update their mental models. At BP's Texas City refinery, there were no sensors above the maximum allowed height of the hydrocarbons in the distillation tower. The operators were blamed for not responding in time although, because of inadequate engineering design, they had no way of knowing what was occurring in the tower.
FOCUSING ON BLAME
Blame is the enemy of safety. "Operator error" is a useless finding in an accident report because it does not provide any information about why that error occurred, which is necessary to avoid a repetition. There are three levels of analysis for an incident or accident:
• What — the events that occurred, for example, a valve failure or an explosion;
• Who and how — the conditions that spurred the events, for example, bad valve design or an operator not noticing something was out of normal bounds; and
• Why — the systemic factors that led to the who and how, for example, production pressures, cost concerns, flaws in the design process, flaws in the reporting process, and so on.
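The three levels above can be treated as a simple completeness check on an investigation record. The following is only an illustrative sketch: the class name `IncidentAnalysis`, its fields, and the `missing_levels` helper are hypothetical names introduced here, not part of any established reporting tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentAnalysis:
    """Hypothetical record holding findings at each level of analysis."""
    what: List[str] = field(default_factory=list)         # events that occurred
    who_and_how: List[str] = field(default_factory=list)  # conditions behind the events
    why: List[str] = field(default_factory=list)          # systemic factors

    def missing_levels(self) -> List[str]:
        """Return the names of the levels that have no findings yet."""
        levels = {
            "what": self.what,
            "who and how": self.who_and_how,
            "why": self.why,
        }
        return [name for name, findings in levels.items() if not findings]


# A report that stops at "operator error" never reaches the "why" level.
report = IncidentAnalysis(
    what=["valve failure"],
    who_and_how=["operator did not notice a reading out of normal bounds"],
)
print(report.missing_levels())
```

A report whose "why" level is empty has described the events and the people involved but has not yet explained them, which is the situation the text describes for findings of "operator error."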
Most accident investigations focus on finding someone or something to blame. The result is little learning and a lot of finger pointing because nobody wants to be the focus of the blame process. Usually the person at the lowest rung of the organizational structure (the operator) ends up shouldering the blame. The factors that explain why the operators acted the way they did are never addressed.
The biggest problem with blame, besides deflecting attention from the most important factors in an accident, is that it creates a culture where people are afraid to report mistakes, hampering accident investigators' ability to get the true story about what happened. One of the reasons commercial aviation is so safe is that blame-free reporting systems have been established that find potential problems before a loss occurs. A safety culture that focuses on blame will never be very effective in preventing accidents.
Hindsight bias permeates almost all accident reports. After an accident, it is easy to see where people went wrong and what they should have done or avoided or to judge them for missing a piece of information that turned out (after the fact) to be critical. It is almost impossible for us to go back and understand how the world appeared to someone who did not already have knowledge of the outcome of the actions or inaction. Hindsight is always twenty-twenty.
For example, in an accident report about a tank overflow of a toxic chemical, the investigators concluded "the available evidence should have been sufficient to give the board operator a clear indication that the tank was indeed filling and required immediate attention." One way to evaluate such statements is to examine exactly what information the operator actually had. In this case, the operator had issued a command to close the control valve, the associated feedback on the control board indicated the control valve was closed, and the flow meter showed no flow. In addition, the high-level alarm was off. This alarm had been out of order for several months, but the operators involved did not know this and the maintenance department had not fixed it. The alarm that would have detected the presence of the toxic chemical in the air also had not sounded. All the evidence the operators actually had at the time indicated conditions were normal. When questioned about this, the investigators said that the operator "could have trended the data on the console and detected the problem." However, that would have required calling up a special tool. The operator had no reason to do that, especially as he was very busy at the time, dealing with a potentially dangerous alarm in another part of the plant that also distracted him. Only in hindsight, when the overflow was known, was it reasonable for the investigators to conclude that the operators should have suspected a problem. At the time, the operators acted appropriately.
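The evaluation described in this example, separating what the operator could actually see from what was only available through extra action, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `Indication` type, its field names, and the two helper functions are hypothetical, and the readings simply restate the account above rather than any real plant data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Indication:
    """Hypothetical model of one piece of evidence on the control board."""
    source: str
    reading: str
    suggests_problem: bool
    visible_by_default: bool  # False if a special tool had to be called up


def evidence_at_the_time(indications: List[Indication]) -> List[Indication]:
    """Keep only what the operator could see without taking extra action."""
    return [i for i in indications if i.visible_by_default]


def problem_apparent(indications: List[Indication]) -> bool:
    """True if any default-visible indication pointed to a problem."""
    return any(i.suggests_problem for i in evidence_at_the_time(indications))


# Indications restated from the tank overflow account (illustrative only).
TANK_OVERFLOW = [
    Indication("control valve feedback", "closed", False, True),
    Indication("flow meter", "no flow", False, True),
    Indication("high-level alarm", "off (out of order, unknown to operators)", False, True),
    Indication("toxic chemical alarm", "had not sounded", False, True),
    Indication("trended console data", "would have revealed the problem", True, False),
]

print(problem_apparent(TANK_OVERFLOW))                 # operator's view at the time
print(any(i.suggests_problem for i in TANK_OVERFLOW))  # investigator's hindsight view
```

The two checks differ only in whether the indication that required a special tool is counted, which is the gap between the operator's view at the time and the investigators' view after the fact.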