Risk assessment

See Risks More Clearly

Avoid a number of common errors that can blur your vision

By Angela Summers, SIS-TECH Solutions, LP

The belief that all loss events are foreseeable, given sufficient analysis, is very alluring. Throughout the life of a manufacturing process, opportunities exist to examine risk, apply more complex methods, and give hazard scenarios and their avoidance more thought. However, the reality is that most people have difficulty thinking outside the box and honestly looking at how the process can misbehave. It’s easy to accept that if nothing has happened before, nothing will happen in the future. The harsh truth is that a hazardous situation still can occur.

Omniscience isn’t possible, but you can see risk more clearly. So, let’s look at some of the challenges to achieving 20/20 vision.

We use hazard assessment methods to identify loss-event pathways and determine what must be done to prevent their occurrence. An inherent weakness of these methods is their vulnerability to lack of competency, incomplete information, and deficiencies in hazard awareness and design. Where there’s limited operational knowledge, there’s an associated limited awareness of how sensitive a process is to deviation. Successful operation of complex processes (no catastrophic incidents) sustains the belief that everything is safe as is. Clifford Nass, a Stanford professor who pioneered research into how humans interact with technology, warned, “denial is the greatest enabler.”

Risk analysis is a tool to ensure that an appropriate standard of care is applied, not to prove whether safeguards are needed [1]. It’s unrealistic to think that hazard and risk analysis identifies everything that could go wrong. An incident analysis [2] by the U.K.’s Health and Safety Executive (HSE) determined that more than 20% of loss events stem from an “organization failing to fully consider potential hazards or causes of component failure.” The vast majority of incidents (81%) resulted from the organization failing to adequately plan and implement procedures for risk control, including the design of the process (25.6%), the provision of operating and maintenance procedures (15.6% and 22.6%, respectively), the management of change (5.7%), a permit-to-work system (4.9%), plant inspections (3.5%), and ensuring competency (1.7%).

The American writer H. L. Mencken wrote, “For every complex problem there is an answer that is clear, simple, and wrong.” Consider the limits of what you know, then add a good-sized measure of bad luck. It’s wise to have a sense of vulnerability even when you’ve done your best to design a safe plant [1, 3, 4]. It’s sensible to implement safeguards that prevent the loss event rather than simply relying on probabilistic analysis. Every process needs a holistic loss-event prevention plan that includes:

Inherent safety —
• Robust vessel and piping design so process deviation is tolerable.
Functional safety —
• A reliable control system that reduces the frequency of abnormal operation.
• An alarm system that notifies the operator when the process is experiencing abnormal operation.
• A shutdown system that sequences the process to a safe state when it reaches an unsafe condition.
• An emergency shutdown system that isolates the process from its supply when loss of containment occurs.
• Other safeguards as necessary to address loss of containment and event escalation.

The onion skin and Swiss cheese models of incidents are ubiquitous. These models typically serve as an analogy for layers of protection. At first glance, each shows the layers as independent of each other, i.e., the failure of one layer doesn’t impact the others. On further study, the graphics portray much more.

The onion skin visualizes the sequence of barriers that control, prevent and mitigate major accidents (Figure 1). Layers of protection are as independent as the layers of an onion. However, as any cook knows, the structural integrity of the onion depends upon keeping the layers attached to the base. The onion layers originate at the base and, without it, they fall apart. The integrity of the base of the layers of protection is determined by the safety management system applied to reduce human error and to sustain the fitness for service of safety equipment.

James Reason’s Swiss cheese model [5] has been adapted to illustrate each barrier as a cheese slice possessing holes that represent deficiencies in barrier performance due to random and systematic faults (Figure 2). Seemingly independent systems can fail due to common systematic mechanisms that degrade or disable multiple similar elements. The graphic emphasizes that barriers aren’t perfect over their life and that an accumulation of deficiencies (an increasing number of holes in each cheese slice) raises the likelihood that holes will line up, thus allowing an event to propagate past the barriers. The holes open and close dynamically as management systems identify and correct faults, so the better managed the barriers, the fewer, smaller and more transient the holes will be.

Many human-factor issues impact every layer. Resources and infrastructure typically are shared; similar equipment, procedures and people are used to design, operate, maintain and test barriers. The more complex or specialized the layer is, the higher the potential for human error. A company’s culture toward manual operation of control loops, bypassing safety instruments or continuing operation with known faults can result in multiple risk sources being turned over to operators across a facility. The layers of protection only can be as strong as the rigor applied in identifying and minimizing human errors and systematic failures.
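To put rough numbers on how a shared weakness undermines seemingly independent barriers, consider the sketch below. It rests on a deliberately simplified model that is an assumption for illustration, not a method taken from the references: a common cause disables every barrier at once with probability p_common; otherwise, each barrier is independently unavailable with probability p_each. The function name and all values are hypothetical.

```python
def all_barriers_fail(p_each, n_barriers, p_common=0.0):
    """Probability that every barrier is unavailable at the same time.

    Simplified model (an assumption for illustration): a shared cause
    disables all barriers with probability p_common; otherwise each
    barrier is independently unavailable with probability p_each.
    """
    for p in (p_each, p_common):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    if n_barriers < 1:
        raise ValueError("at least one barrier is required")
    return p_common + (1.0 - p_common) * p_each ** n_barriers


# Three barriers, each assumed unavailable 10% of the time.
independent = all_barriers_fail(0.1, 3)

# Same barriers, plus a 1% chance that one shared fault defeats them all.
shared = all_barriers_fail(0.1, 3, p_common=0.01)

print(f"fully independent: {independent:.5f}")
print(f"with shared fault: {shared:.5f}")
```

Under these made-up values, a shared fault that occurs only 1% of the time raises the chance of all the holes lining up roughly elevenfold. The exact figure matters less than the pattern: the common mechanism, not the individual barrier, dominates the result.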

In the last decade, use of risk criteria to determine the required safeguarding has become endemic. The wise are mindful of the Andrew Lang quote, “He uses statistics as a drunken man uses lamp-posts... for support rather than illumination.” Statistics seem concrete and defensible but the estimates are very fragile given the vast range of assumptions in most analyses and the lack of actual process data to substantiate performance claims. Any wrong assumption propagates through the analysis, affecting multiple scenarios and, in some cases, the fundamental basis of the entire evaluation.

Designing by risk criteria is attractive because it seems to provide a shield against claims that not enough has been done to reduce the potential for an incident. The perceived protection afforded by risk criteria falls short when, post-incident, the question is asked whether something else could have been done and the answer is “well, we could have…” The value of the analysis is not the math, but what is learned about the safety vulnerabilities of the operating plan for the process and what is done to improve system resilience against these vulnerabilities. The intent of the math is to allow options to be benchmarked against one another based on a similar set of assumptions and to demonstrate that risk has been reduced below the maximum threshold tolerated by the company (Figure 3). The risk of process safety incidents should be made as low as practicable given readily available technology and accepted practices. Keep in mind the vulnerabilities in hazard evaluation methods.

With the emergence of layers of protection analysis as a dominant risk assessment method, selecting values based on the number 10 has become pervasive. Multiples of 10 are easy to understand and anyone can multiply 10 × 10 to get 100. However, a risk analysis needs to reflect what is achievable in the actual operation. Few loss-of-control events ever would propagate to loss of containment if the practices necessary to achieve the claims of 10 were as pervasive as the claims. An HSE study [6] determined that 32% of reported loss-of-containment incidents resulted from process and safety equipment failure due to inadequate design and maintenance.

The control layer must reflect industry best practices for control system design and management to achieve a failure rate of less than 1 in 10 years. As more users track the demands on their safeguards, they are finding that the actual number of alarms, trips and pressure-relief-valve lifts exceeds the frequency claimed in the risk analysis.

The safety layers must be designed and managed according to good engineering practices documented by recognized industrial organizations. A risk reduction of 10 means that 1 in every 10 times the layer is called upon to work, it won’t. Design and manage to achieve 0 failures. Don’t assume a risk reduction of 10 without justification because achieving this level of risk reduction requires planning and discipline.
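The arithmetic behind these claims is simple, which is part of its appeal. The sketch below multiplies an initiating-event frequency by the failure-on-demand probability of each layer, treating a risk reduction of 10 as 1 failure in 10 demands as described above. The function name, the independence assumption and the example values are illustrative assumptions, not figures from any actual study.

```python
from math import prod


def mitigated_frequency(initiating_per_year, risk_reduction_factors):
    """Return the estimated event frequency (per year) after all layers.

    Each layer with a risk reduction factor RRF is assumed to fail on
    demand with probability 1/RRF, and the layers are assumed to be
    independent. Both assumptions are illustrative simplifications.
    """
    if initiating_per_year < 0:
        raise ValueError("frequency must be non-negative")
    if any(rrf < 1 for rrf in risk_reduction_factors):
        raise ValueError("a risk reduction factor below 1 adds risk")
    pfds = [1.0 / rrf for rrf in risk_reduction_factors]
    return initiating_per_year * prod(pfds)


# Hypothetical scenario: the control loop fails once in 10 years and is
# protected by an alarm and a shutdown system, each claimed at 10.
claimed = mitigated_frequency(0.1, [10, 10])

# Same scenario if the alarm layer really achieves only a factor of 2.
actual = mitigated_frequency(0.1, [2, 10])

print(f"claimed: {claimed:.4f} per year")
print(f"actual:  {actual:.4f} per year")
```

In this made-up case, one unjustified claim of 10 understates the event frequency by a factor of five. The calculation itself can't reveal that; only evidence about how the layer actually performs can.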

Implement procedures to evaluate the demand rate on the layers during actual operation and compare the performance of each layer against its safety requirements. Document and investigate demands on protection layers as well as failure records for protective equipment. Define the corrective action to be taken if the challenges are too frequent or the actual layer performance doesn’t achieve the necessary risk reduction. Examine the underlying causes to determine what those involved in plant operation and maintenance think should be done to improve the reliability.
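A minimal sketch of such a comparison appears below. The record fields, layer names and history values are all hypothetical; a real program would draw them from the site's own demand logs and work-order records.

```python
from dataclasses import dataclass


@dataclass
class LayerRecord:
    """Operating history for one protection layer (hypothetical fields)."""
    name: str
    claimed_demands_per_year: float  # demand rate assumed in the risk analysis
    claimed_risk_reduction: float    # e.g., 10 means 1 failure in 10 demands
    years_observed: float
    demands: int                     # times the layer was called upon
    failures_on_demand: int          # times it did not work when called upon


def review(record):
    """Return a list of findings where history contradicts the claims."""
    if record.years_observed <= 0:
        raise ValueError("years_observed must be positive")
    findings = []
    observed_rate = record.demands / record.years_observed
    if observed_rate > record.claimed_demands_per_year:
        findings.append(
            f"{record.name}: demand rate {observed_rate:.2f}/yr exceeds "
            f"claimed {record.claimed_demands_per_year:.2f}/yr"
        )
    if record.demands > 0:
        observed_fraction = record.failures_on_demand / record.demands
        claimed_fraction = 1.0 / record.claimed_risk_reduction
        if observed_fraction > claimed_fraction:
            findings.append(
                f"{record.name}: observed failure fraction "
                f"{observed_fraction:.2f} exceeds claimed {claimed_fraction:.2f}"
            )
    return findings


# Made-up history for illustration only.
history = [
    LayerRecord("high-level alarm", 0.1, 10, 5.0, 4, 1),
    LayerRecord("high-pressure trip", 0.1, 10, 5.0, 0, 0),
]

for rec in history:
    for finding in review(rec):
        print(finding)
```

With only a handful of demands, the observed failure fraction is highly uncertain, so a finding here is a prompt to investigate the underlying causes, not a verdict on the layer.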

The same HSE study [6] reported that 37% of loss-of-containment incidents resulted from incorrect operator action. The root causes of the actions were inadequate operating procedures, deficient process design, inadequate supervision and ineffective management of change.

An event always seems obvious when an individual scenario is being evaluated. In the real world, process deviations propagate through the process and trigger other deviations; the operator sees an array of events happening simultaneously. How does the operator recognize which scenario is occurring and respond with the right action at the right time? The event may be precipitated by other control system failures requiring controller overrides and manual control.

Operators rely on control and safety systems for process information. Many studies list human error as a cause for an event without considering the automation that’s providing the operator with data and status information. Without the control system, the operator can’t act on the process safely; without the operator, control system malfunction can propagate to an unsafe condition. For many events, it’s difficult to separate automation design and human error.

Situational awareness comes from operating experience and process simulation, not from the hazard evaluation of a loss event.

Nassim Taleb [7] uses Aristotle’s black swan to exemplify a rare and unpredictable event and discusses the tendency everyone shares to look for simplistic explanations after its occurrence.

Every year it seems that someone is proposing a new practice destined to become another vogue solution to identifying risks and preventing their occurrence (Figure 4). Every new method claims to be better than the last. Intellectual curiosity and the pursuit of a “correct” answer often drive implementation of more complex methods and calculations.

Making things more complex can give the illusion of accuracy but also can create a situation where team members don’t understand the method, become disengaged from the process, and allow the facilitator (or analyst) to dominate the risk analysis. Some current vogue methods have so many degrees of freedom that a good analyst can get nearly any answer desired. Each method shares the same systematic flaw: the risk judgment only is as good as the data and model certainty, which are highly influenced by the competency, experience and knowledge of team members, and the availability of detailed specifications, safe operating procedures, and operating and maintenance history.

It’s easy to fall into the intellectual trap of believing in an analytical perfection in which you “know” what the risk is. In reality, the chosen analytical method can affect outcome quality but the relationship of the results to the real world has a great deal more to do with the experience of the people participating in the study and the quality of the information available to them than to the methodology itself.

Quantification isn’t a panacea. Manipulating numbers can make loss events seem more theoretical and probabilistic rather than real incidents that hurt actual people. The detachment afforded by a calculation encourages confirmation bias (i.e., selecting and using information in a way that supports a particular belief) unless the methods are backed with real data. Anyone with experience knows that there are significant limitations to what is considered in most quantitative analyses and that there’s a high degree of uncertainty associated with the data. Calculations only are good for estimating things that can be measured easily. It is easy to quantify how a device’s failure can affect the system, because the device has limited functionality. Human failings, on the other hand, can impact the system throughout the lifecycle in obscure ways. Risk calculations often exclude human factors even though these factors typically are the dominant cause of failure.

Certainty in the estimate only comes when real measurements rather than theoretical numbers justify the data. For a risk model to come close to reality, those participating and leading the analysis must understand how the method works and how to apply it to the specific application. Benefit only comes when a method is used in the right way for the right application. The hard part is understanding the assumptions, limitations and proper application of the risk analysis method.

Determining whether a company has acted reasonably to prevent a loss event requires considering a variety of factors. These include the company’s care and skill in producing its product, its awareness of the harmful event prior to the incident, the activity being performed, the specific circumstances that led to the incident, and whether the company did what it could to prevent the incident’s occurrence. Monitoring and reporting actual performance over the life of a process is essential for proving safe operation. Benchmarked values provide an initial basis and rationale for the design but operating history yields the actual frequency of root (or initiating) causes, process deviations (or initiating events), and work orders related to safeguards (or failures on demand) [8]. Data feedback to the risk analysis process is crucial to credible decision-making.

An effective process-safety-management program uses a systematic approach to understand and control the risk of the whole chemical process. The ultimate goal is to prevent the unwanted release of hazardous chemicals, materials or energies that impact people, the environment or the process equipment. Success depends upon the rigor of the systematic approach applied to develop a loss-event prevention plan, prioritize risk reduction opportunities, and support the organizational discipline necessary to fully implement the plan. With good methods, realistic risk criteria and appropriate data-feedback processes, management is well equipped to see clearly where attention and resources are needed to achieve zero losses.

ANGELA SUMMERS, Ph.D., PE, is president of SIS-TECH Solutions, LP, Houston. E-mail her at asummers@sis-tech.com.

This article is based on her presentation at the 10th Global Congress on Process Safety, New Orleans, La. (March 31–Apr. 2, 2014).

1. Murphy, John F., “Beware of the Black Swan,” pp. 330–333, Process Safety Progress, Vol. 31, No. 4 (Dec. 2012).
2. “Loss of Containment Incident Analysis,” p. 5, Health and Safety Laboratory, Sheffield, U.K. (2003).
3. Summers, Angela E., “Safe Automation Through Process Engineering,” pp. 41–47, Chem. Eng. Progress, Vol. 104, No. 12 (Dec. 2008).
4. Summers, Angela E., “Safety Management is a Virtue,” pp. 210–213, Process Safety Progress, Vol. 28, No. 3 (Sept. 2009).
5. Reason, James, “Managing the Risk of Organizational Accidents,” Ashgate Publishing, Farnham, U.K. (1997).
6. “Findings from Voluntary Reporting of Loss of Containment Incidents, 2004/2005,” Health and Safety Executive, Bootle, U.K. (2005).
7. Taleb, Nassim N., “The Black Swan: The Impact of the Highly Improbable,” 2nd ed., Random House, New York City (2010).
8. Summers, Angela E. and Hearn, William H., “Quality Assurance in Safe Automation,” pp. 323–327, Process Safety Progress, Vol. 27, No. 4 (Dec. 2008).