Welcome to the Operator Training edition of Chemical Processing's Distilled podcast. This podcast and its transcript can be found at chemicalprocessing.com. I'm Traci Purdum, Editor in Chief of Chemical Processing, and joining me is operator training guru, Dave Strobhar, founder and Principal Human Factors Engineer at Beville Engineering. Dave is also the founder of the Center for Operator Performance. Thanks for joining me, Dave.
Dave: Thanks for having me, Traci, and it's a nice way to end out the summer.
Traci: Exactly. I can't believe how fast time flies. I guess the older you get, the faster it moves. You mentioned to me that you feel a major misconception is that people think alarm targets are about workload. Can you explain to me why you say that's a misconception?
Dave: Sure. Well, people will talk about the targets. I'll say the steady-state target is six alarms per hour, and people think, "Oh, well, if I go to eight, the operator's maxed out, and that's a real problem." When in fact, those targets were developed, and even in the document where they came from, they say these should not be taken as hard targets, and these are what companies can use to assess their alarm system. So it's not assessing operator workload; it's just a way to say that, "Hey, if you hit these targets, your alarm system is probably okay." It's not saying it's absolutely okay, but it's just a general rule of thumb, and it relates to operator workload in no particular manner.
Obviously, operators have more to do than just monitoring the distributed control system. What all those other things are, are never specified. So, to say, "I'm going to give you one metric, and if you go over that, the operator's overloaded," is ridiculous, given that you haven't specified, "Well, how much time is being taken up by all this other stuff?" Maybe none. Maybe they have plenty of time. I hear people talk about when their alarm rates are high, "Oh, the operator was overloaded at that particular point in time."
Likewise, one of the other targets where you could actually make some argument about a connection with workload is during alarm floods. There are data in terms of what an operator can handle, as being able to read and respond during a situation where I just have a lot of alarms coming in. The targets are way under what the research data shows a person can actually handle. And as the case with most of the metrics, there's no reflection on priority. In other words, it says, "An alarm flood occurs if I get over 10 alarms in 10 minutes or an alarm per minute." Well, is that a low-priority alarm? Is that an emergency priority alarm?
If I have them broken out into the correct priorities, what you typically see, and exactly what you want and why you put priorities in, is that the operators handle the highest priorities first and then the next level and so on. So, when you say that this number is generating too much workload, since priorities aren't part of that and, well, this is the most of any priority set that they could handle, even there, it's not a workload issue. The guidelines really are just that. It's just a way to assess your system, and if you're close to those, you're probably okay. But it's by no means to be taken as a measure that your operator is overloaded or they have too much to do, and I've certainly run into people that try to make that argument throughout the industry.
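To make the flood definition quoted above concrete, here is a minimal Python sketch that scans an alarm log for any trailing 10-minute window holding more than 10 alarms and, for each such window, breaks the count out by priority. The function name, the tuple layout of the log, and the priority labels are illustrative assumptions, not part of any guideline or vendor tool.

```python
from collections import Counter

FLOOD_WINDOW_S = 600   # 10 minutes, per the definition quoted above
FLOOD_THRESHOLD = 10   # a flood is "over 10 alarms in 10 minutes"

def flood_windows(alarms):
    """Find trailing 10-minute windows that exceed the flood threshold.

    `alarms` is a hypothetical log: a list of (timestamp_seconds, priority)
    tuples sorted by time. Returns a list of (window_end_time, Counter)
    pairs, where the Counter holds the alarm count per priority.
    """
    floods = []
    start = 0
    for end, (t_end, _) in enumerate(alarms):
        # Advance the left edge so the window covers (t_end - 600, t_end].
        while alarms[start][0] <= t_end - FLOOD_WINDOW_S:
            start += 1
        window = alarms[start:end + 1]
        if len(window) > FLOOD_THRESHOLD:
            floods.append((t_end, Counter(p for _, p in window)))
    return floods

# Made-up example: 12 alarms arriving 30 seconds apart.
priorities = ["low"] * 9 + ["high"] * 2 + ["emergency"]
burst = [(i * 30, p) for i, p in enumerate(priorities)]
burst_floods = flood_windows(burst)

# Made-up quiet log: one alarm every 10 minutes, so no flood.
quiet = [(i * 600, "low") for i in range(12)]
quiet_floods = flood_windows(quiet)
```

The per-priority breakdown is the point of the sketch: the same flood count can consist almost entirely of low-priority alarms or include an emergency alarm, and the raw number alone does not distinguish the two.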
Traci: You're talking about correct priorities, and I want to delve a little bit deeper into that in terms of the dangers of alarm fatigue and poorly defined alarms. You touched on it. Can we get a little bit further into that?
Dave: Oh, absolutely. What are your major concerns with the alarm system? Obviously, the operator is going to miss an alarm that has critical consequences associated with it. Well, why can they miss the alarm? Well, one way is what I was just talking about, where they can come in at a rate too high for the operator to process and respond to them. And I've seen data that at one plant, they were getting 60 alarms a minute. Well, you can't read and process information at that rate, so one reason I can miss an alarm is they just come in too fast.
The other reason that you can miss an alarm is essentially what you'd call alarm fatigue, where I have essentially devalued the alarm system. Alarms come in, but the operator doesn't look at them, and this occurs because a lot of cry-wolf alarms come in. If 19 out of 20 alarms really didn't have anything for the operator to do, they quit looking at the alarm system. They devalue it as a source of good information. And so when a truly significant alarm comes in, they don't really process it or respond to it because most of the time, the vast majority of the time, they've learned that alarms aren't giving them good information.
That's related to the other phenomenon, which is standing alarms, alarms that stay on the screen for long periods of time. And you can go into control rooms, and they have multiple screens of standing alarms that are there. The same sort of thing; they just become numb to having alarms on the screen. That's just, "Well, there's always alarms on the screen. Why should I worry about it?" The truly significant alarm can come in and get buried in all those alarms that are standing there. And I know a plant that eventually lost all their steam because the critical alarm that came in just got mixed in with the other 20 alarms that were standing on the alarm summary screen. And the operators, again, didn't really use it as an information source. It had come in, it had been acknowledged, and it's just there.
But the amazing thing, it took eight hours for the consequences to appear, to run out of water for their boilers. So that alarm was sitting for eight hours on that screen, and the operator did not respond to it and take the appropriate action. This isn't a workload issue. This is, as we say, a fatigue issue. I'm just, "The alarms aren't helping me. Yeah, we always have alarms in, so there's nothing unusual about that." As opposed to, and there's not many, but I've seen a few plants where the alarm summary screen is essentially blank, maybe one alarm on it. But when you get down to that, any new alarm that comes in is going to stand out, and that's part of what you would consider a signal detection issue. "How do I pick out the signal from the noise?"
Well, if I've got all these standing alarms, that's a very noisy alarm system, so picking out the signal, the valid alarm is tough for them to do. You have that case where "I miss alarms because they come in too fast." The current metrics help a little in that regard, but not a lot. But again, it's not a workload issue. That's occurring over a short period of time. And then you have the alarm fatigue issue. "I'm always getting alarms, and they don't mean anything. They just sit there on the screen and nothing happens, and so I become used to that." That becomes the new normal for the operators, and then something important comes in and they don't react to it. Those are the typical ways you see an operator miss an alarm. None of those three are related to the workload that the operator has to do.
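The two fatigue-related signals described above, standing alarms and alarms that prompt no action, can both be pulled from an alarm log. The following is a rough Python sketch only: the `AlarmRecord` fields, the `min_age_h` cutoff for calling an alarm "standing", and the sample records are all hypothetical, chosen for illustration rather than taken from any standard.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AlarmRecord:
    tag: str                    # hypothetical alarm identifier
    raised_h: float             # hours since the start of the log
    cleared_h: Optional[float]  # None if the alarm is still active
    action_taken: bool          # did the operator have to do anything?

def standing_alarms(records, now_h, min_age_h):
    """Tags of alarms still active at now_h for at least min_age_h hours."""
    return [r.tag for r in records
            if r.cleared_h is None and now_h - r.raised_h >= min_age_h]

def no_action_fraction(records):
    """Share of alarms that prompted no operator action (the cry-wolf share)."""
    if not records:
        return 0.0
    return sum(not r.action_taken for r in records) / len(records)

# Made-up log for illustration.
records = [
    AlarmRecord("A", raised_h=0.0, cleared_h=0.5, action_taken=True),
    AlarmRecord("B", raised_h=1.0, cleared_h=None, action_taken=False),
    AlarmRecord("C", raised_h=7.5, cleared_h=None, action_taken=False),
    AlarmRecord("D", raised_h=2.0, cleared_h=2.1, action_taken=False),
]

standing = standing_alarms(records, now_h=8.0, min_age_h=4.0)
cry_wolf = no_action_fraction(records)
```

In this made-up log, alarm "B" has been sitting on the summary screen for seven hours, and three of the four alarms required nothing from the operator.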
Traci: It makes absolute sense, and it's feast or famine with these alarms and the cry wolf alarms going out there. But with interconnected systems, with everything that we're doing these days, if alarm targets are not set appropriately within the complex environment, it can lead to the confusion that you're talking about and difficulty in identifying the root cause of the issues. What are some of the ways to mitigate this? How can they solve this? I mean, do they just go in and wipe the board of all their alarms and then start fresh? What do they do?
Dave: Well, unfortunately, wiping out the alarms and starting fresh tends to be almost the way you have to do it in many cases. Again, this is being able to pick out the signal, the valid alarm, from everything else that's going on. And oftentimes, when you get an alarm, what you're getting is a symptom. This spot on this process is getting hot. Well, that's just a symptom. The cause, why is it getting hot, a lot of times, it's very close by and you can understand that. But sometimes, the cause is significantly upstream, where it's because of something that's going on at another unit.
And if you over-alarm in a particular area, "I've got redundant temperature transmitters around that," they all go into alarm. Now the focus is on this symptom, which is not really what you want them to be focusing on. You want them to say, "Okay, why is it hot and where could that be coming from?" Yes, it could be coming from the cooler just upstream. That would be the simplest, and that would be easy, and you do want to catch that. But you need them to take a wider look at the process and try to find out, "Well, where else could this be coming from? Because it may not be coming from my particular unit that I have to take some sort of action on. I either have to go upstream in my unit or I have to go to some other units in order to make these particular changes."
But if all the alarms are flooding in, you do tend to get that. That's this tunnel vision that people have heard about where, "I'm going to focus on all these alarms, because, wow, it's really getting hot there." Rather than say, "Okay, yes, it is getting hot there, but now let me look around. And I may even have some other alarms that are related to that, that are going to tell me, 'Oh, here's why it's hot there,' and either I need to go somewhere else or I need to get somebody else to take some action."
Traci: How do you get that training? How do you get that intuition to know to look around?
Dave: Well, that's interesting because we use that when we're going through alarm rationalizations. I'm listening to operators, and a lot of the time, I know what the general response should be. And when operators look upstream and give me those responses, it's like, okay, this person, this individual, they know what they're doing here because they have understood over time that interconnectedness on these different systems. And I think what needs to happen now, as these plants run more stable and you're getting fewer alarms, is that needs to be part of their training program. In other words, how do these interconnections manifest themselves, and teaching them to step back and look at the big picture.
And in many cases, this would be literally the big picture, which should be your overview display for your entire span of control, and see, "Well, do I have problems going on anywhere else?" Yes, this is where the alarm is. This is not where the cause is located. That's a training issue, and it should be explicit in the training program. And with the use of simulators, it's easy to do those sorts of simulations to teach the operators that you need to look beyond where the problem is. Most people have had this at some point or another in visiting a doctor, where they go to the doctor and say, "My hip hurts." He says, "Well, that's because you have a bad knee." And it's like, "No, my knee feels fine. It's my hip that's hurting." He's like, "Yeah, yeah, but you're walking funny, and you're just picking up the pain there."
Well, we need that same sort of thing with operators. It's like, "Yeah, it's getting hot here or high pressure, but the reason that's happening is somewhere else." And they have to be well-versed enough in those interactions between the different systems to say, "Okay, what else could be driving this pressure or temperature to exceed a particular value that I'm looking for?"
Traci: If you were in charge of alarm targets at a facility, if you walked in there, what would be your first step to ensure that operators could deal with all levels of alarms?
Dave: I would say that the blank alarm summary screen, no standing alarms, that would be one of my first goals. And that isn't even generally listed as one of the major alarm targets that people have, but it captures one thing: if your alarms actually prompt an action, if you actually have to do something about them, then whatever alarm comes in, you take an action and resolve it and the alarm goes away. So if I come in, the only alarms that I should see on a screen should be the alarms on which the operator is taking action. That's probably the easiest one.
Then the second one would be the, how many alarms are you getting during these alarm floods? There became a real focus in the industry to look at that steady state alarm rate, six alarms an hour, and driving to that and not looking at, "Well, what happens during an upset?" And what we have found at many locations is that because they hit five alarms per hour, they're all happy and doing victory laps. But if the plant hiccups, all of a sudden the operators are just slammed with all these alarms, because they didn't look at that situation. They met that target, six alarms per hour. "Oh, hey, we're there. Everything is great."
I actually would put that as, if somebody's capturing it and they want to put it there, I would go, "Okay, fine," but I don't think it's a metric you would use certainly on an ongoing basis. But what happens when things go south? Have we handled that situation so that the operator is not getting 60 alarms per minute that I had seen elsewhere? If I came in and they had a blank alarm summary screen, and during upsets, they were getting, say, maybe 30 alarms an hour, then I don't care what their steady state alarm rate could be. They could be getting 12 or 13 alarms per hour, and I'm going to probably say they're okay, because I know that the alarms are giving them good information.
So if they're getting 13 alarms an hour, well, they're obviously getting good information from that, because the alarm summary screen is blank and they're taking action on it. And I know if the unit were to have some sort of incident that, again, that flood of alarms is below the rate at which individuals can process alarms, and they can probably handle it there. That alarm per hour number, six alarms per hour, that tends to get thrown around a lot. I really have little interest in it other than maybe at the very beginning.
If I know I've got a crappy alarm system, I'd want to see what that is, but as far as putting any emphasis on it, I would say not. Just no standing alarms, no alarm floods, because that's when operators miss alarms. They miss alarms because of the alarm fatigue: "I'm used to seeing them on the screen. They're crying wolf." Or, "I miss them because they're coming in too fast." Address those two and you're probably going to get 90% of the benefits out of your alarm system.
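The gap between the steady-state average and what happens during an upset can be shown with a small Python sketch. The log below is invented for illustration, and the two helper functions are hypothetical names, not an established metric implementation. Note that `peak_per_minute` uses fixed clock-minute buckets, which is a simplification: a burst straddling two minutes would be split between them.

```python
from collections import Counter

def average_per_hour(timestamps_s, duration_h):
    """Mean alarm rate over the whole log, in alarms per hour."""
    return len(timestamps_s) / duration_h

def peak_per_minute(timestamps_s):
    """Largest number of alarms falling in any single clock-minute bucket."""
    buckets = Counter(int(t // 60) for t in timestamps_s)
    return max(buckets.values(), default=0)

# Made-up 24-hour log: one background alarm every 24 minutes,
# plus a 60-alarm burst (one per second) during an upset.
background = [i * 1440 for i in range(60)]
upset = [36060 + i for i in range(60)]
log = sorted(background + upset)

avg = average_per_hour(log, duration_h=24)  # 120 alarms over 24 hours
peak = peak_per_minute(log)                 # alarms in the worst minute
```

Here the daily average works out to five alarms per hour, under the six-per-hour figure, while the worst minute still holds 60 alarms. The average alone hides the upset entirely.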
Traci: Dave, is there anything you want to add that we may not have touched on for the topic that you think is important?
Dave: I think there are some unstated assumptions in these alarm targets that come out. I mentioned one. What else is the operator doing? There's an assumption in there, and it would be nice to know, well, what is that assumption? Is this target a good target or a bad target? There's some unstated assumptions in there. The size of the operator's span of responsibility is another unstated assumption. Let's say I have two different units, and one of them is a very large hydrocracking sort of unit and 300 control loops for the operator, and they're getting seven alarms per hour. Contrast that with a utility area, maybe 15 loops, a 20th of the size of the other one, and they're getting five alarms per hour.
Well, the one you would say, "Well, hey, at five alarms per hour, their alarm system must be good." And the other one, "Oh, well, they got seven alarms per hour. Oh, they need work." No, that's ridiculous. What's happening here is getting caught up in this particular number. And because there's no real assumption about the size of what I'm dealing with, making some assessment of goodness, I would actually have just the opposite. In that big unit, it's seven alarms per hour. It's like, "Oh man, they have control. They have a handle on their alarms."
And the smaller unit, that five alarms per hour, it's under the target guideline. But given how small it is, I'd be like, "Wow, you probably need to clean up that alarm system, because you shouldn't be getting that many alarms for a unit of that particular size." So there's a lot of these unstated assumptions that go into these targets, but because they're unstated, people can't assess relative to their own facilities. And if you go back to the original documents, the EEMUA document that really became the first to document alarms and alarm rates, and then later ISA-18.2, the EEMUA document clearly states about, "Don't become trapped in using these values. These are not hard targets. This is our best guess from people in the industry."
And yet they've become hard targets for many people, and I think they need to get away from that and just say, "Hey, this is just a measure." And you need to then look at some other things and understand that, "Oh, okay, for really a large span of control, seven alarms per hour, that might be really good." And don't get caught up in that, "Oh, I didn't meet that particular metric or that particular target."
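One way to see the span-of-control point is to scale each unit's alarm rate by its size. The Python sketch below does this per 100 control loops, taking the large unit at 300 loops and the small unit at one twentieth of that size, as described above. The per-100-loops normalization and the function name are illustrative assumptions, not a metric from EEMUA or ISA-18.2.

```python
def alarms_per_100_loops(alarms_per_hour, control_loops):
    """Hourly alarm rate scaled to a nominal 100 control loops."""
    if control_loops <= 0:
        raise ValueError("control_loops must be positive")
    return alarms_per_hour * 100 / control_loops

large_loops = 300
small_loops = large_loops / 20   # "a 20th of the size of the other one"

large_unit = alarms_per_100_loops(7, large_loops)
small_unit = alarms_per_100_loops(5, small_loops)

print(f"large unit: {large_unit:.2f} alarms/hour per 100 loops")
print(f"small unit: {small_unit:.2f} alarms/hour per 100 loops")
```

On a size-adjusted basis, the small unit comes out well above the large one even though its raw rate is under the six-per-hour figure, which is the reversal Dave describes.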