When Alarms Stop Warning and Start Failing

Decades of guidance on alarm management exist. So why are control rooms still drowning in noise?

The shift has just begun, and you are settling in at the console. The 200,000-bbl/day crude unit is running fine. The advanced process control has been turned off, so you decide to turn it on. In the next five minutes, 230 alarms actuate. That seems manageable compared with the 290 in the five minutes after that. Then comes a 15-minute respite with only 150–200 alarms every five minutes before the peak hits: 340 alarms in five minutes, 30 minutes into the event and only 45 minutes after the shift began. You are now receiving 30 to 70 alarms per minute, each demanding interpretation, diagnosis and response. Somehow, you get through this with no major incident other than destroying the catalyst on a downstream unit.

No operator can meaningfully respond to dozens of alarms per minute. Eventually, operators stop discriminating between alarms and may miss the few that actually matter or alarms unrelated to the event. More alarms do not improve awareness; they destroy it.
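To put the scenario's numbers in perspective, the short sketch below converts the five-minute counts quoted above into alarms per minute and into the time available per alarm. The counts come from the scenario; the function and variable names are illustrative assumptions, and the respite is shown at both ends of its stated range because exact counts are not given.

```python
# Five-minute alarm counts taken from the scenario in the text.
# The respite is given only as a range (150-200), so both bounds are kept.
WINDOW_MINUTES = 5

scenario_counts = {
    "first five minutes": 230,
    "second five minutes": 290,
    "respite (low end)": 150,
    "respite (high end)": 200,
    "peak": 340,
}


def alarms_per_minute(count, window_minutes=WINDOW_MINUTES):
    """Average alarm rate over the window."""
    return count / window_minutes


def seconds_per_alarm(count, window_minutes=WINDOW_MINUTES):
    """Average time the operator has for each alarm in the window."""
    return window_minutes * 60 / count


for label, count in scenario_counts.items():
    rate = alarms_per_minute(count)
    time_each = seconds_per_alarm(count)
    print(f"{label}: {rate:.0f} alarms/min, {time_each:.2f} s per alarm")
```

At the peak, the operator has less than one second per alarm, and even the "respite" allows no more than two seconds each.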

Alarm management is not a new topic. My first job out of college in 1980 was to help improve the alarm system at Three Mile Island after its 1979 accident. Unit 2, which had the accident, was a victim of alarm creep, with over twice as many alarms as its sister plant, Unit 1. Some alarms had to be placed behind the main panel, out of the operators' sight. Industry attempts to improve plant alarm systems began with EEMUA 191 in 1999 and ISA 18.2 in 2009. More than 25 years have passed since guidance was issued on improving plant alarm systems. Why are alarm systems still so bad?

The short answer is obstinacy, timid operators and catastrophizing. EEMUA 191 and ISA 18.2 center on an alarm philosophy — a set of rules governing system configuration and use — including the requirement that alarms prompt a unique operator action. This seems simple until you ask people to remove alarms for failing to meet that criterion.

Plant: "We can't get rid of that!"

Me: "But there is nothing they could or would do when it actuates."

Plant: "True, but we can't remove it."

This is not rare. On an unrationalized unit, 40% to 50% of alarms either require no operator action or duplicate another alarm.
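The two screening questions (does the alarm prompt an operator action, and is that action unique?) can be sketched as a simple filter. This is a minimal illustration, not a procedure from EEMUA 191 or ISA 18.2: the `Alarm` record, the `screen_alarms` function, the tags and the actions are all hypothetical, and treating "duplicate" as "same operator action as an earlier alarm" is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alarm:
    tag: str
    # None means there is nothing the operator could or would do.
    operator_action: Optional[str]


def screen_alarms(alarms):
    """Split alarms into tags to keep and (tag, reason) pairs to review for removal."""
    first_tag_for_action = {}
    keep, flagged = [], []
    for alarm in alarms:
        action = (alarm.operator_action or "").strip().lower()
        if not action:
            flagged.append((alarm.tag, "no operator action"))
        elif action in first_tag_for_action:
            # Assumption: the same action as an earlier alarm counts as a duplicate.
            flagged.append((alarm.tag, f"duplicates {first_tag_for_action[action]}"))
        else:
            first_tag_for_action[action] = alarm.tag
            keep.append(alarm.tag)
    return keep, flagged


# Hypothetical tags and actions, for illustration only.
alarms = [
    Alarm("LI-101 HI", "reduce feed to accumulator"),
    Alarm("LI-101 HIHI", "reduce feed to accumulator"),
    Alarm("TI-205 HI", None),
    Alarm("PI-310 LO", "start standby pump"),
]
keep, flagged = screen_alarms(alarms)
print("keep:", keep)
print("review for removal:", flagged)
```

A flagged alarm is a candidate for discussion, not an automatic deletion; the point is that the criterion is mechanical enough to apply consistently.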

Many very good operators are reluctant to remove an alarm due to uncertainty or anxiety over what may happen. They think that the alarms will be their savior in an upset rather than the cause of their failure. Any alarm, no matter how worthless or stupid, is kept for fear of an undefined bad situation. "I don't know why I would ever need this, but I feel better having it," said one operator. In all the alarm-related events I have examined, not one was caused by a lack of an alarm ("There was no warning"). However, there are numerous multi-million-dollar incidents attributable to excessive alarm activity.

Much of this resistance is personal. Operators and DCS personnel often identify with the system they've built or run. Rationalizing alarms feels like an attack on their competence or judgment. That conflation — between the person and the system — is precisely the problem. A control system exists to protect the plant, not to reflect its authors.

Catastrophizing undermines a key part of the alarm philosophy: defining the consequence of a parameter deviating beyond the alarm point, often called the consequence of inaction or unsuccessful response. For some reason that I cannot fathom, many plant personnel want to take this to some ultimate, catastrophic outcome where doom, gloom and agony await. When every alarm is tied to a catastrophic consequence, they all become equally urgent, which is the same as none of them being urgent. The result is too many priority-one alarms, an operator who has stopped discriminating between them, and a system that screams constantly while saying very little. The consequence of deviation for high level in an overhead accumulator is liquid carryover to the compressor knock-out. It is NOT destruction of the compressor with fire and loss of life. The compressor will trip on high knock-out level after the operator has received a high-level alarm. A compressor trip makes for a bad shift, but it isn't a disaster.
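The effect of catastrophizing on priorities can be shown with a small sketch. Everything here is assumed for illustration: the three severity labels, the severity-to-priority mapping, and the four alarms. A real site would define its own consequence categories in its alarm philosophy.

```python
from collections import Counter

# Hypothetical mapping from consequence severity to alarm priority (1 = highest).
PRIORITY_BY_SEVERITY = {"minor": 3, "major": 2, "catastrophic": 1}


def priority_one_share(consequences):
    """Fraction of alarms that land in priority one for a tag -> severity mapping."""
    counts = Counter(PRIORITY_BY_SEVERITY[sev] for sev in consequences.values())
    return counts[1] / len(consequences)


# Hypothetical alarms, assessed by their direct consequence.
realistic = {
    "accumulator high level": "major",  # liquid carryover, then a compressor trip
    "knock-out high level": "major",
    "pump seal pressure low": "minor",
    "reactor high temperature": "catastrophic",
}

# The same alarms, with every consequence pushed to its worst imaginable outcome.
catastrophized = {tag: "catastrophic" for tag in realistic}

print(priority_one_share(realistic))       # one of four alarms is priority one
print(priority_one_share(catastrophized))  # every alarm is priority one
```

When every alarm is priority one, the priority field no longer tells the operator anything.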

Both tendencies share a common root: fear. Operators fear the alarm they removed will be the one they needed. Engineers fear being blamed for understating consequences. Alarm rationalization requires making confident decisions under uncertainty. But there has to be some basis for alarm selection, such as operator action or a consequence; otherwise, everything can be justified as an alarm.

Success requires the backing of a heavy hitter: an operations manager or above who understands the risks of a poor alarm system and can instill the discipline to fix it. It also requires trusting personnel who know the work and don't catastrophize. Alarm management is much like weight loss: everyone knows what to do, but not everyone is willing to do it.

About the Author

David Strobhar

David Strobhar founded Beville Operator Performance Specialists in 1984. The company conducts human factors engineering analyses of plant modernization, operator workload, and alarm/display systems for BP, Phillips, Chevron, Shell and others. Strobhar was one of the founders of the Center for Operator Performance, a collaboration of operating companies, DCS suppliers and academia that researches human factors issues in process control. He is the author of "Human Factors in Process Plant Operations" (Momentum Press) and was the rationalization clause co-editor for ISA SP18.2, "Alarm Management for the Process Industries." Strobhar has a degree in human factors engineering, is a registered professional engineer in the state of Ohio and a fellow in the International Society of Automation.
