Adroitly Manage Alarms

Six underutilized techniques can provide significant benefits

By Peter Andow, Honeywell Process Solutions

Every process plant of any significant size has an alarm system. At most sites, operators routinely work with control systems that have several thousand configured alarms and many plants have far too many alarm activations. A 1994 accident at Milford Haven in the U. K. stemmed in part from a poorly functioning alarm system [1]. It drew attention to the alarms problem. A survey of 13 plants commissioned by the U. K.’s Health and Safety Executive (HSE) showed that alarm systems in many plants performed poorly [2]. In response to the problem, the Engineering Equipment & Materials Users’ Association (EEMUA), London, issued its Publication 191 "Alarm Systems — A Guide to Design, Management and Procurement" in 1999. (The second edition came out in 2007 [3].) That guidance has had a tremendous impact. Various authors, e.g., those of Refs. 5 and 6, have described the range of problems encountered and means for improvement. Considerable progress has been made. A particular focus has been on the alarm rate targets suggested by EEMUA (Table 1).

The Abnormal Situation Management (ASM) Consortium [4] ran a benchmarking study to measure what had been achieved. Based on a sample of 37 consoles from plants operated by ASM Consortium members, the study showed that most of the sites averaged fewer than two alarms per 10 minutes — a considerably lower level than that found on many plants in the past. The study also showed that 13 sites had achieved an average of less than one alarm per 10 minutes. Progress towards the upset metric hasn’t been as satisfactory. Indeed, a summary paper [7] on the ASM study noted:

Table 1 - Alarm Targets: The guidelines in EEMUA Publication No. 191 underscore the need for fewer but more informative alarms.

"However, the EEMUA recommendation for peak alarm rates following a major plant upset (i.e., not more than ten alarms in the first ten minutes) appears to be a challenge, given today's practices and technology. Only two of the 37 consoles came close to achieving the alarm rate guideline for upset conditions. This suggests that to achieve alarm system guidelines for upset conditions, more advanced site practices and alarm-handling technology (e.g., dynamic or mode-based alarming) are required."

Achieving further improvement
Many plants have primarily relied on rationalization of a relatively small number of bad actors (often just the "top 10" or "top 20"); this certainly can yield good performance improvements. But rationalization of bad actors usually doesn't significantly reduce subsequent alarm floods, because a flood typically involves a large number of different alarms and most of them won't have been among the bad actors activated during routine plant operations. For the many sites that have already successfully completed a bad actors project (based on a coherent alarm philosophy) and that require further improvement, ways forward include:

1. Full rationalization of alarms (i.e., not just bad actors);
2. Use of a Master Alarm Database (MAD) to facilitate control and management of changes to the alarm configuration parameters;
3. Improvements to the Human Machine Interface (HMI), such as better graphics design for alarm management and easy access to alarm help;
4. Mode-based alarming;
5. Testing of alarms; and
6. Alarm suppression technology.

Let's now consider these separately, with the understanding that some may not apply everywhere.

1. Full rationalization. A considerable effort is required to rationalize all of the alarms configured for a particular plant area. The number of alarms that can be rationalized by an experienced team is often quoted at around 100 per day.

The benefits of full rationalization are that alarms activated during incidents are better designed and, thus, less likely to contribute to unnecessary alarm floods. One aspect of rationalization that deserves special attention is the use of appropriate dead-bands and debounce timers (applied when alarms are activated or cleared). An ASM Consortium project confirmed that more extensive use of debounce timers can be particularly effective when some alarms otherwise would chatter during alarm floods [8]. The ASM study found that the use of debounce timers and other configuration changes reduced the 10-min. alarm rate by anywhere from 45% to around 90%. However, many plants don’t use this functionality very extensively even when it’s readily available on the distributed control system (DCS) and known to be effective.
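To illustrate why dead-bands and debounce timers cut chattering, here is a minimal Python sketch of a high alarm evaluated on sampled data. It is not taken from any DCS; the function name, the parameters (`deadband`, `on_delay`, `off_delay`) and the sample signal are assumptions made for illustration.

```python
def high_alarm_transitions(samples, limit, deadband=0.0, on_delay=0.0, off_delay=0.0):
    """Evaluate a high alarm over (time_seconds, value) samples in time order.

    Returns a list of (time, "ACTIVE" | "CLEARED") state changes.
    """
    active = False
    pending_since = None  # when the change condition first held continuously
    transitions = []
    for t, value in samples:
        if not active:
            condition = value >= limit            # candidate activation
            delay = on_delay
        else:
            condition = value < limit - deadband  # candidate clear
            delay = off_delay
        if condition:
            if pending_since is None:
                pending_since = t
            if t - pending_since >= delay:        # condition held long enough
                active = not active
                transitions.append((t, "ACTIVE" if active else "CLEARED"))
                pending_since = None
        else:
            pending_since = None                  # condition broken: restart timer
    return transitions


def activations(transitions):
    return sum(1 for _, state in transitions if state == "ACTIVE")


# Hypothetical signal: noise around the limit for 10 s, then a sustained excursion.
chatter = [(t, 101.0 if t % 2 else 99.0) for t in range(10)]
sustained = [(t, 105.0) for t in range(10, 16)]
signal = chatter + sustained

raw = high_alarm_transitions(signal, limit=100.0)
with_deadband = high_alarm_transitions(signal, limit=100.0, deadband=2.0)
debounced = high_alarm_transitions(signal, limit=100.0, on_delay=3.0, off_delay=3.0)

print(activations(raw), activations(with_deadband), activations(debounced))
```

In this synthetic case the unprotected alarm activates five times, while either the dead-band or the debounce timer reduces that to a single activation. The trade-off is that the debounced alarm annunciates a few seconds later, so delay values need to be chosen with the process dynamics in mind.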

Most sites currently don’t have fully rationalized alarm systems — so this opportunity for improvement has considerable potential.
Figure 1 - MAD’s Role: Use of a Master Alarm Database provides an effective way to enforce alarm parameters.

2. Use of a Master Alarm Database. A MAD is the master repository for the alarm configuration parameters (Figure 1). Some vendors (e.g., Honeywell) now include operating envelope data in an expanded database known as the Master Boundary Database. The addition of operating envelope information is significant because well-designed alarms often relate to one or more of the many limits that define the operating envelope for safe and efficient operation.

Use of a MAD facilitates several different functions:

• Enforcing alarm parameters. This activity overwrites alarm parameters on the DCS when they are found to differ from the values stored in the MAD. It prevents the alarm system from running with altered settings after operators (or others) have changed parameter values such as alarm priorities or alarm limits, perhaps on a temporary basis, and not restored the as-designed values afterwards;

• Tracking enforcements. This enables identifying alarm parameters regularly being changed and updating them, if needed, to more appropriate values;

• Logging all changes to the MAD. This clarifies why and when changes were made and who made them;

• Providing online operator help. Information can include the likely cause of an alarm, the consequences of no action, the action itself, and the time available for operator response; and

• Linking alarms to the relevant operating-envelope values. Reasons for the alarm limits become much more transparent, making any changes in alarm values that violate the operating envelope — which could have serious implications — less likely.

The benefits of using a MAD are that the plant continues to operate using the parameters agreed upon during rationalization. If some parameters are changed on a regular basis for justifiable reasons, the alarms can be reviewed again and the MAD updated to include the modified values.
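The enforcement and tracking functions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the MAD and the DCS are modeled as plain dictionaries, and the tag names, parameter names and the `enforce` function are hypothetical rather than any vendor's interface.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Enforcement:
    tag: str
    parameter: str
    found: object      # value found on the DCS
    restored: object   # as-designed value written back from the MAD
    when: datetime


def enforce(mad, dcs, now):
    """Overwrite DCS alarm parameters that differ from the MAD; return a log."""
    log = []
    for tag, designed_params in mad.items():
        live = dcs.setdefault(tag, {})
        for name, designed in designed_params.items():
            found = live.get(name)
            if found != designed:
                live[name] = designed
                log.append(Enforcement(tag, name, found, designed, now))
    return log


# Hypothetical data: an operator has raised one limit and not restored it.
mad = {
    "LI-100.HI": {"limit": 80.0, "priority": "high"},
    "TI-200.HI": {"limit": 350.0, "priority": "low"},
}
dcs = {
    "LI-100.HI": {"limit": 95.0, "priority": "high"},
    "TI-200.HI": {"limit": 350.0, "priority": "low"},
}

log = enforce(mad, dcs, datetime(2024, 1, 1, 6, 0))
second_pass = enforce(mad, dcs, datetime(2024, 1, 1, 7, 0))

# Tracking: count how often each parameter had to be enforced.
enforcement_counts = Counter((e.tag, e.parameter) for e in log + second_pass)
print(enforcement_counts.most_common())
```

Parameters that appear repeatedly in the enforcement counts are candidates for another review, in line with the tracking function listed above.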

The use of a MAD is often regarded as critical to maintaining the alarm system’s integrity. This demands that the MAD be implemented as an integrated and robust part of the alarm system, because its enforcement and operator-help functions must have high availability. It’s not simply an offline tool with DCS access; it needs integrated functionality that can limit the loading on the DCS. Most sites currently don’t utilize a MAD, so this means for improvement again has considerable potential.

3. Improvements to the Human Machine Interface. Human factors issues have often been neglected. However, the whole control center environment affects operator effectiveness. At a more detailed level, the HMI can significantly impact alarm management [9].
Figure 2 - HMI Improvement: Right-clicking on an operating graphic brings up a menu with many useful functions for dealing with an alarm.

The quality of graphics and the style of alarm presentation on graphics vary widely. Some sites primarily rely on the standard DCS vendor “alarm summary” and “groups” — without any detailed graphics. Other sites also utilize very effective alarm presentation and operator help integrated into detailed operating graphics. Figure 2 shows a right-click contextual menu including functionality to acknowledge the alarm and obtain operator help for it from a MAD.

The potential for performance improvement by enhancing the HMI clearly depends on the quality of graphics (from an alarm presentation perspective) currently in use at a particular plant.

There’s also the possibility of utilizing a graphic designed specifically to be effective during alarm floods. The ASM Consortium has recently tested a graphic that might serve to replace the traditional list-based alarm summary, a format that isn’t usually effective when large numbers of alarms occur. Tests showed that the new style has considerable potential for giving operators a better understanding of the true abnormal situation — thereby allowing them to act more effectively.

Because floods are still the most significant alarms problem in many plants, this enhancement clearly has considerable potential for performance improvements.

4. Mode-based alarming. Many plants have multiple operating modes (startup, normal running, shutdown, cold standby, regeneration, etc.). The alarm system is often only appropriate for normal running; for example, many standing alarms derive from plant equipment that’s not in service. During other modes of operation, operators must recognize and mentally filter out the many inappropriate alarms that are activated. This undermines the integrity and value of the alarm system.

A better approach is to identify the various modes and define alarm parameter settings that suit each mode of operation. So, for instance, a standby or shutdown mode can be used to eliminate alarms from out-of-service equipment.

If a MAD is being used, various sets of alarm parameters can be stored in the MAD and written to the DCS when a plant mode change occurs. The enforcement functionality can handle this activity, effectively overwriting old alarm parameters with the ones required for the new mode of operation. This type of enforcement is much more efficient than requiring operators to make large numbers of manual changes to configuration parameters.
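As a rough illustration of mode-based parameter sets, the Python sketch below stores one set of alarm parameters per mode and writes the appropriate set when the mode changes. The mode names, tags, the `enabled` flag and the `apply_mode` function are assumptions for illustration; a real MAD and DCS would expose their own interfaces.

```python
# Hypothetical per-mode parameter sets, as they might be held in a MAD.
MODE_SETS = {
    "normal": {
        "FI-101.LO": {"enabled": True, "limit": 20.0},
        "TI-300.HI": {"enabled": True, "limit": 350.0},
    },
    "shutdown": {
        # Feed pump is out of service, so its low-flow alarm is disabled.
        "FI-101.LO": {"enabled": False, "limit": 20.0},
        # A lower temperature limit is meaningful while the unit is cold.
        "TI-300.HI": {"enabled": True, "limit": 60.0},
    },
}


def apply_mode(dcs, mode_sets, mode):
    """Write the parameter set for `mode` to the DCS; return what changed."""
    if mode not in mode_sets:
        raise ValueError(f"no alarm parameter set defined for mode {mode!r}")
    changed = []
    for tag, params in mode_sets[mode].items():
        live = dcs.setdefault(tag, {})
        for name, value in params.items():
            if live.get(name) != value:
                live[name] = value
                changed.append((tag, name))
    return changed


dcs = {}
apply_mode(dcs, MODE_SETS, "normal")
changed = apply_mode(dcs, MODE_SETS, "shutdown")
print(changed)
```

Rejecting an undefined mode rather than silently doing nothing is a deliberate choice in this sketch: a mode change that leaves the previous parameter set in place is exactly the situation mode-based alarming is meant to avoid.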

Most sites currently don’t use mode-based alarming. So this means for improvement has considerable potential, although this clearly depends on the character of operations of the particular plant unit.

5. Alarm testing. All alarms should have a real meaning and value to operators. It makes no sense if some alarms are out of service (due to one or more failures) and aren’t clearly recognized as being out of service.

Standards such as IEC 61511 require regular proof testing of safety-related alarms. But most DCS alarms aren’t safety-related and most sites don’t routinely test these DCS alarms.

Some DCS alarms occur often enough (or are of relatively low value) that there’s insufficient justification for testing them. But other alarms may have a relatively high value and remain inactive for long periods of time — thus posing concerns about whether they will activate when required. Testing such alarms to prove they are operational clearly has value. EEMUA 191 includes recommendations for testing alarms. It’s generally impractical to test all alarms; so, it’s essential to identify and implement a realistic testing strategy based on the value of each alarm in terms of the potential consequences if it doesn’t activate when it should.

For example, a simple strategy might call for annual testing of all higher-priority alarms that haven’t been activated in the previous year. If the EEMUA recommendations for the proportion of the higher-priority alarms in the alarm system have been followed, this would mean that only around 15% to 20% of all configured alarms would require testing — and a large proportion of those might have been activated during the previous year, significantly reducing the number needing testing.
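The simple strategy just described amounts to a filter over the alarm configuration and the alarm history. A minimal Python sketch follows; the priority names, tags, dates and the `alarms_due_for_test` function are hypothetical and chosen only to show the selection logic.

```python
from datetime import datetime, timedelta


def alarms_due_for_test(alarms, last_activation, now,
                        priorities=("high", "emergency"),
                        period=timedelta(days=365)):
    """Return higher-priority alarms with no activation within `period`.

    alarms: mapping of tag -> priority
    last_activation: mapping of tag -> datetime of the most recent activation
                     (tags that have never activated are simply absent)
    """
    due = []
    for tag, priority in alarms.items():
        if priority not in priorities:
            continue  # lower-priority alarms are outside this strategy
        last = last_activation.get(tag)
        if last is None or now - last > period:
            due.append(tag)
    return sorted(due)


# Hypothetical configuration and history.
alarms = {
    "LI-100.HH": "high",
    "PI-205.HH": "high",
    "TI-300.HI": "low",
    "FI-101.LL": "emergency",
}
last_activation = {
    "LI-100.HH": datetime(2023, 11, 2),   # activated recently, no test needed
    "PI-205.HH": datetime(2021, 5, 17),   # inactive for more than a year
    "TI-300.HI": datetime(2020, 1, 1),    # old, but low priority
}

due = alarms_due_for_test(alarms, last_activation, now=datetime(2024, 1, 1))
print(due)
```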

Given that most plants currently don’t systematically test alarms, there’s certainly some potential for performance improvement.

6. Alarm suppression technology. Use of suppression has often been advocated as a means for improving poorly performing alarm systems — with little or no consideration of other potential methods.

A frequently quoted opportunity for suppression is the large number of alarm activations that occur when a compressor trips. Such events often generate 100 or more alarms in a short time. Apart from the first few alarms, most of these activations are of little or no value to operators and are obvious candidates for suppression. To be effective, suppression of consequential alarms needs to be done quickly, often within a few seconds of the event causing the compressor trip.

In other situations, e.g., when a pump changeover occurs, there’s also value in suppressing a relatively small number of alarms for a period of time. The pump changeover will often have been operator-initiated and timing will depend on pump run-down or run-up dynamics. This clearly differs markedly from the requirements of suppression during a compressor trip.

There’ve been many attempts over the years to suppress or mask unwanted alarms. Many of these attempts have been costly and haven’t been very effective. In some cases the alarms being suppressed simply resulted from poor design; rationalization of those alarms often removes the need for suppression.

A key requirement for suppression schemes is to identify a pattern of plant or control-system conditions that must exist before initiation. Any alarm suppression scheme demands careful design, to avoid potentially dangerous situations where alarms remain suppressed when they should be operational.

One approach often used is to write custom code. While this offers maximum flexibility, it can also cause problems because the code can be difficult to test and maintain. A better approach is to use a standardized (and well-tested) suppression function based on tabulated suppression requirements. Tabulated data typically will need to include several different types of information:

• Required conditions for initiation of suppression, e.g., plant mode, values of process variables, digital states and other alarm conditions. The term “permissives” is sometimes used for such conditions. It’s sensible to employ (where possible) multiple permissives to reduce chances of failed or noisy signals initiating suppression;

• A set of alarms that need to be suppressed — and, in some cases, the order of suppression so that alarms expected to occur early during an event are suppressed first;

• Messages to indicate to the operations team that suppression is active, or means for operators to see which alarms have been suppressed; and

• Necessary conditions for the release of suppression as the plant returns to normal or enters a different operating mode.

Logic used to detect the need for suppression and for release from suppression must be robust and transparent to the operations team.
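The tabulated approach described above can be pictured as one generic function driven by data. The Python sketch below is an assumption-laden illustration, not a vendor suppression function: the `SuppressionRule` fields, the signal names (`K101_trip`, `K101_breaker_open`, `K101_running`) and the alarm tags are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]


@dataclass
class SuppressionRule:
    name: str
    permissives: List[Callable[[State], bool]]  # ALL must hold to initiate
    alarms: List[str]                           # listed in order of suppression
    release: Callable[[State], bool]            # condition that ends suppression
    message: str                                # shown to operators while active
    active: bool = False


def update(rule: SuppressionRule, state: State) -> List[str]:
    """Advance one rule for the current plant state; return suppressed alarms.

    Suppression latches once initiated and ends only when `release` holds.
    """
    if not rule.active and all(p(state) for p in rule.permissives):
        rule.active = True
    elif rule.active and rule.release(state):
        rule.active = False
    return list(rule.alarms) if rule.active else []


rule = SuppressionRule(
    name="K-101 compressor trip",
    permissives=[
        lambda s: s["K101_trip"] is True,
        lambda s: s["K101_breaker_open"] is True,
    ],
    alarms=["FI-110.LO", "PI-112.LO", "TI-115.HI"],
    release=lambda s: s["K101_running"] is True,
    message="K-101 trip: consequential alarms suppressed",
)

running = {"K101_trip": False, "K101_breaker_open": False, "K101_running": True}
noisy = {"K101_trip": True, "K101_breaker_open": False, "K101_running": True}
tripped = {"K101_trip": True, "K101_breaker_open": True, "K101_running": False}
reset = {"K101_trip": False, "K101_breaker_open": True, "K101_running": False}

results = [update(rule, s) for s in (running, noisy, tripped, reset, running)]
print([len(r) for r in results])
```

Requiring both permissives means a single spurious trip signal does not initiate suppression, and the explicit release condition guards against alarms staying suppressed after the compressor is back in service. Keeping the rule as data also makes it straightforward to show operators which alarms are currently suppressed and why.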

Because many plants currently don’t utilize any alarm suppression functionality, it promises considerable improvements — particularly when robust tools for suppression are available from DCS vendors.

Make More Progress
The need for improved alarm management gradually became more apparent after the Milford Haven accident and has received much attention and investment in recent years. Considerable progress has been made — much of it due to wide application of the EEMUA guidance to rationalize the alarm system.

A significant number of plants now average rates of fewer than 10 alarms per hour in routine operation, enabling plant operators in such plants to be much more effective and proactive.

However, significant problems remain. In particular, many plants experience alarm floods far too often. An ASM Consortium study found that not a single one of 37 consoles studied achieved the EEMUA recommendation of fewer than 10 alarms during the first 10 minutes of an upset. During floods, many alarm systems are of little value to operators and are effectively unusable. Clearly, more-effective alarm systems might have avoided some serious accidents or at least reduced their consequences. Unfortunately, alarm rationalization efforts alone don’t provide the performance improvement needed during upsets.
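Both rate measures discussed in this article, the routine average and the count in the first 10 minutes of an upset, can be computed from a list of alarm activation timestamps. The Python sketch below is illustrative only; the function names and the synthetic alarm log are assumptions, and the limit of 10 is taken from the EEMUA upset recommendation quoted earlier.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)
UPSET_LIMIT = 10  # EEMUA upset guideline as quoted in this article


def average_per_10_min(activations, start, end):
    """Average number of alarm activations per 10-minute period in [start, end)."""
    count = sum(1 for t in activations if start <= t < end)
    periods = (end - start) / WINDOW  # dividing two timedeltas gives a float
    return count / periods


def count_after_upset(activations, upset_time):
    """Number of activations in the first 10 minutes after an upset."""
    return sum(1 for t in activations if upset_time <= t < upset_time + WINDOW)


# Hypothetical log: one alarm every 15 minutes for two hours, then a flood
# of 25 alarms at 10-second intervals starting at the upset.
t0 = datetime(2024, 1, 1, 8, 0)
upset = t0 + timedelta(hours=2)
routine = [t0 + timedelta(minutes=15 * i) for i in range(8)]
flood = [upset + timedelta(seconds=10 * i) for i in range(25)]

avg = average_per_10_min(routine, t0, upset)
peak = count_after_upset(routine + flood, upset)
print(f"average: {avg:.2f} per 10 min; after upset: {peak} (limit {UPSET_LIMIT})")
```

A console like this synthetic one would look healthy on the routine average yet exceed the upset guideline by a wide margin, which mirrors the pattern the ASM study reported.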

Bodies such as the U.K.’s Health and Safety Executive recognize the problems that can occur when alarm management is poor, and are providing regulatory drivers for plants to improve performance. More extensive use of the six techniques described here can play a significant part in enhancing alarm management.

Take Advantage of Shutdowns
Alarms failing to activate during accidents can lead to dire consequences. For example, some significant liquid-level alarms didn’t go off during the 2005 accident at BP’s Texas City, Texas, refinery [10]. Such alarms may have been out of service for a long period without staff realizing this. If fully operational alarms had been activated during the startup, operators conceivably might have had sufficient time to avoid or reduce the consequences of the accident.

Don’t wait until a startup to find out if important alarms for the startup are working. Schedule testing during the shutdown period immediately prior to startup — particularly if it’s known that these alarms haven’t been activated for a long period. This testing should be highly selective, focusing on the small number of higher priority alarms that truly are significant during startup operations. This strategy is much more cost-effective than simply relying on routine testing of all configured alarms.


Peter Andow is a principal consultant, advanced solutions, for Honeywell Process Solutions, Bracknell, U. K. E-mail him at peter.andow@honeywell.com.

References
1. “The Explosion and Fires at the Texaco Refinery, Milford Haven, 24 July 1994,” HSE Books, Sudbury, U. K. (1995).
2. Bransby, M. L. and J. Jenkinson, “The Management of Alarm Systems,” HSE Books, Sudbury, U. K. (1998).
3. “Alarm Systems — A Guide to Design, Management and Procurement,” Publ. No. 191, 2nd ed., EEMUA, London, U. K. (2007).
4. Andow, P., “Abnormal Situation Management: A Major U. S. Programme to Improve Management of Abnormal Situations,” IEE Colloquium on “Stemming the Alarm Flood,” London, U. K. (1997).
5. Campbell Brown, D., “Practical Steps Toward Better Management of Alarms,” Proceedings, “Alarm Systems,” IBC, London, U. K. (2000).
6. Nimmo, I., “Rescue Your Plant from Alarm Overload,” Chemical Processing, Jan. 2005, p. 28, http://www.ChemicalProcessing.com/articles/2005/209.html (2005).
7. Reising, D. V. and T. Montgomery, “Achieving Effective Alarm System Performance: Results of ASM Consortium Benchmarking against the EEMUA Guide for Alarm Systems,” Proceedings, 20th Annual CCPS Intl. Conf., Atlanta, Ga., AIChE, New York City (2005).
8. Zapata, R. and P. Andow, “Reducing the Severity of Alarm Floods,” Proceedings, Honeywell Users Group Americas Symposium 2008, Honeywell, Phoenix, Ariz. (2008).
9. Errington, J., Reising, D. V. and K. Harris, “ASM Outperforms Traditional Interface,” Chemical Processing, March 2006, p. 55, http://www.ChemicalProcessing.com/articles/2006/041.html (2006).
10. “Refinery Explosion and Fire, BP Texas City, March 23, 2005,” Report No. 2005-04-I-TX, U. S. Chemical Safety and Hazard Investigation Board, Washington, D. C. (2007).
