Keep operations safe

Making safety a sure thing

Accidents continue to happen because too many owner/operators still use injuries and fatalities as the predominant metric for safe operation. This focus on direct impact can lead to acceptance of loss of containment events and tolerance for latent weaknesses in process safety management (PSM). Knowledge of gaps in equipment integrity and management systems shouldn’t depend on catastrophic events. Injuries and fatalities should occur so infrequently that impact data are meaningless for trending performance.

Accidents often occur when equipment is improperly designed, installed, operated, tested or maintained. Adequate theory and standards are available to ensure safe operation of process equipment. The problem isn’t bad people or a lack of competency; it’s that the systems governing equipment integrity aren’t rigorous enough to ensure the required reliability.

A plant must use a rigorous quality management system to sustain equipment reliability; otherwise, accidents will occur when enough latent conditions in equipment, procedures and personnel training accumulate. It’s essential to take a proactive approach — not just monitoring for behaviors, errors and failures that are known root causes for process safety incidents but also identifying improvement opportunities to counter this accumulation and minimize risk.

This demands a comprehensive risk reduction strategy, one that relies on a wide variety of safeguards to prevent releases of highly hazardous chemicals. Here, we use the Shewhart Cycle — with its Plan, Do, Check and Act phases — to introduce the various activities involved in achieving safe operation using instrumented safety systems (ISS).

Plan

W.E. Deming believed that 85% of a worker’s effectiveness is determined by the system he works within, only 15% by his own skill¹. Planning ensures that work processes yield equipment that operates consistently in a safe manner, fulfills government and jurisdictional requirements, and meets recognized good engineering practices. The output of planning is a management system of policies, practices and procedures that seeks to identify and control releases of highly hazardous chemicals. Recommended work practices and activities are provided for instrumented protective systems in “Guidelines for Safe and Reliable Instrumented Protective Systems”² by the Center for Chemical Process Safety (CCPS) and for safety instrumented systems (SIS) in ANSI/ISA 84.00.01-2004³.

There is no substitute for knowledge⁴. Even a small amount of knowledge can prevent mistakes leading to process hazards. Unfortunately, many owner/operators are losing process knowledge and history as operators and technical staff retire or simply leave for better jobs. Errors accumulate unless there’s continuous analysis and improvement of safety practices. Counteracting loss of expertise as well as equipment degradation through age and obsolescence requires significant effort.

Written process safety information (PSI) covering the process hazards, technology and equipment provides the foundation for sustaining internal process knowledge. A written design basis should define the PSI for the safety equipment and should be traceable to the process hazards analysis. For SIS, the design basis is the hardware and software safety requirements specification³. It should be maintained under revision control for the equipment life.

Knowledge evolves over time. Real-world failures identify weaknesses in actual system performance. Hazard evaluation procedures⁵ used periodically throughout the equipment life pinpoint and evaluate significant events involving abnormal process operation. Analyze the event risk, qualitatively or quantitatively, to determine the causes and potential frequency of occurrence. Then implement independent protection layers to ensure that failures or errors don’t compromise safe operation. When the residual risk exceeds the owner/operator criteria, establish additional administrative and engineered safeguards to reduce the risk below the criteria.
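The comparison of residual risk against the owner/operator criteria can be sketched as simple arithmetic. The example below assumes a layer-of-protection style calculation, in which the frequency of the hazardous event is the initiating event frequency multiplied by the probability of failure on demand of each independent protection layer. The function names and all numbers are hypothetical and serve only as illustration; they are not taken from any standard or from the practices cited here.

```python
from math import prod


def mitigated_frequency(initiating_per_year, layer_pfds):
    """Hazardous event frequency after crediting each independent layer.

    Assumes the layers are truly independent, so their PFDs multiply.
    """
    return initiating_per_year * prod(layer_pfds)


def additional_risk_reduction(mitigated_per_year, target_per_year):
    """Extra risk reduction factor still needed; 1.0 means the criterion is met."""
    return max(1.0, mitigated_per_year / target_per_year)


# Hypothetical values for illustration only.
initiating = 0.1        # initiating events per year
layers = [0.1, 0.01]    # PFDs of two independent protection layers
target = 1e-5           # owner/operator risk criterion, events per year

freq = mitigated_frequency(initiating, layers)
gap = additional_risk_reduction(freq, target)
print(f"mitigated frequency: {freq:.1e} per year")
print(f"additional risk reduction needed: {gap:.1f}x")
```

A gap greater than 1.0 signals that additional administrative or engineered safeguards are needed. The multiplication is only as good as the independence assumption behind it.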

Train personnel in the process safety information associated with their work activities. Personnel must have the necessary skills and knowledge to follow procedures and properly execute their tasks, so specify minimum levels for the job. When on-the-job training is required, the program should address how the skills and knowledge are developed in a timely and safe manner and how progress is measured².

Finally, planning must consider security and management of change (MOC). Restrict physical and cyber access to the ISS using administrative procedures and physical means². Independence assessments should consider data communication and human interface failures. Written procedures should address how to initiate, document, review and approve changes to ISS other than replacement in kind. Evaluate any change to the process and its equipment through a MOC process to identify and resolve any impact on the ISS requirements.

Do

This phase implements the systems defined in the Plan phase. From a project perspective, detailed engineering is completed, yielding an ISS installation that conforms to the design basis. Detailed engineering includes sufficient information to ensure the ISS is properly specified, constructed, installed, commissioned, operated and maintained. Equipment installed in ISS should be proven to provide the required performance in similar operating environments.

Equipment classification also must consider the core attributes of protection layers, namely independence, functionality, integrity, reliability, auditability, MOC and access security. To counteract the unknown, owner/operators should rely on a defense-in-depth strategy of multiple independent protection layers to lower operational risk. An independent and separate safety instrumented system (SIS) is essential to ensuring safe and reliable operation. Defense-in-depth also seeks to minimize common cause, common mode and systematic errors that cause multiple layers to fail⁷,⁸. Detailed design should provide an ISS equipment list identifying the equipment by a unique designation (e.g., the tag number) and the required inspection and proof test interval.
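As a rough illustration of such an equipment list, the sketch below keeps one record per tag with its inspection and proof test intervals and flags any proof test that is past due. The class name, field names, tags and dates are all hypothetical; a real list would normally live in the site's maintenance management system.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class IssEquipment:
    """One entry in a hypothetical ISS equipment list."""
    tag: str                        # unique designation, e.g. the tag number
    inspection_interval_days: int
    proof_test_interval_days: int
    last_proof_test: date

    def proof_test_due(self) -> date:
        """Date by which the next proof test should be completed."""
        return self.last_proof_test + timedelta(days=self.proof_test_interval_days)


def overdue(equipment, today):
    """Return the tags whose proof test due date has already passed."""
    return sorted(e.tag for e in equipment if e.proof_test_due() < today)


# Hypothetical entries for illustration only.
equipment_list = [
    IssEquipment("PT-101", 180, 365, date(2023, 1, 15)),
    IssEquipment("XV-205", 90, 730, date(2023, 6, 1)),
]

overdue_tags = overdue(equipment_list, today=date(2024, 3, 1))
print("proof tests overdue:", overdue_tags)
```

Keeping the interval next to the tag makes it straightforward to feed schedule status into the metrics discussed in the Check phase.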

Validation activities should include an input-to-output test of each new or modified ISS to demonstrate and document that the equipment is installed according to specification and operates as intended for each operating mode. It’s crucial to satisfactorily complete validation prior to the initiation of any operating mode where a hazardous event could occur.

Periodically conduct proof tests using a written procedure to demonstrate the successful operation of the ISS and to identify and correct deviations from the design basis and equipment specification. Train maintenance personnel on the procedures and make sure they understand equipment pass/fail criteria. Choose the proof test interval based on the relevant regulatory or insurance requirements, equipment history in a similar operating environment, manufacturer’s recommendations and risk reduction requirements.
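To show how a risk reduction requirement can bound the proof test interval, the sketch below uses a widely used simplified approximation for a single device: average PFD is roughly the dangerous undetected failure rate multiplied by the test interval, divided by two. This is an assumption made for illustration. It ignores diagnostics, repair time, imperfect proof tests and common cause failures, and it says nothing about regulatory, insurance or manufacturer limits, which may be more restrictive. The failure rate and target shown are hypothetical.

```python
def pfd_avg_single_device(lambda_du_per_year, test_interval_years):
    """Simplified average PFD for one device: lambda_DU * TI / 2.

    Valid only when lambda_DU * TI is much smaller than 1.
    """
    return lambda_du_per_year * test_interval_years / 2.0


def max_test_interval_years(lambda_du_per_year, pfd_target):
    """Longest interval that keeps the simplified average PFD at or below target."""
    return 2.0 * pfd_target / lambda_du_per_year


# Hypothetical values for illustration only.
lambda_du = 0.02    # dangerous undetected failures per year
pfd_target = 0.01   # required average PFD from the risk reduction requirement

interval = max_test_interval_years(lambda_du, pfd_target)
print(f"maximum proof test interval: {interval:.2f} years")
print(f"average PFD at that interval: {pfd_avg_single_device(lambda_du, interval):.4f}")
```

The result is an upper bound from one input only. The interval actually chosen should be the shortest of those indicated by all the considerations listed above.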

Operating plans should consider the inspection and preventive maintenance requirements necessary to keep the equipment in “as good as new” condition. ISS proof tests should demonstrate that the mechanical integrity program maintains the required equipment performance. (Feed records forward into the Check phase for trending and metrics.) Operating procedures should cover the safe and approved methods for interacting with the safety equipment, such as bypassing, manual initiation and reset. Train and test operations personnel on the procedures as needed to ensure correct actions are taken. Record and periodically assess operator actions in response to abnormal operation.

Check

By what method? Only the method counts⁴. The Check phase applies metrics to assess performance against requirements. Sustainable operation is achieved by focusing on metrics providing real-time indication. Table 1 provides example metrics for the ISS. CCPS has suggested additional metrics⁹.

Table 1. Example metrics for the ISS
Lifecycle step	Example metrics
Hazard analysis	Total number of hazard and risk analyses scheduled during defined interval; number completed; number behind schedule; percentage of hazard and risk analyses behind schedule; for those behind schedule, total number of days behind schedule
Design basis	Total number of safety equipment; number of equipment with as-built documentation; number of equipment with redlined or missing documentation; percentage of equipment with redlined or missing documentation
Mechanical integrity	Total number of inspections scheduled during defined interval; number of inspections on schedule but incomplete; number of inspections completed; number of inspections behind schedule; percent on time for inspection; percent behind schedule for inspection; for completed inspections, number of successful inspections and percentage of successful inspections; total number of equipment tests scheduled during defined interval; number of tests incomplete; number of tests completed; number of tests behind schedule; percent on schedule for test; percent behind schedule for test; for completed tests, number of tests with “as found” within equipment specification, number of tests with “as found” outside equipment specification (i.e., failed dangerously, failed safe or degraded state), percentage of tests within equipment specification and percentage of tests outside equipment specification; total number of safety equipment; total number of failures found by diagnostics; total number of failures found by inspection and testing; total number of failures requiring equipment repair or replacement; total number of failed equipment returned to service within allowable repair time; percentage of equipment returned to service within allowable repair time
Degraded operation	Total number of safety equipment that are out of service (bypassed, disabled, overridden or under test/repair) during process operation for defined interval; total number of hours that safety equipment is out of service; number of safety equipment that are returned to service within allowable repair time; number of safety equipment that are out of service, but covered under MOC
Process performance	Percentage of start-ups involving abnormal or emergency operation; total number of process shutdowns during defined interval; number due to spurious operation of safety equipment; number due to abnormal or emergency operation; total number of safety alarms during defined interval; number of standing or nuisance alarms; number of safety alarms requiring response
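Most of the percentages in Table 1 follow directly from the counts. As a minimal sketch, the example below derives a few of the proof test metrics from raw counts. The function names, dictionary keys and counts are hypothetical and chosen only for illustration.

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`; 0.0 when `whole` is zero."""
    return 100.0 * part / whole if whole else 0.0


def proof_test_metrics(scheduled, completed, behind_schedule, within_spec):
    """Derive example percentages from proof test counts for a defined interval."""
    return {
        "percent_behind_schedule": percent(behind_schedule, scheduled),
        "percent_within_spec": percent(within_spec, completed),
        "percent_outside_spec": percent(completed - within_spec, completed),
    }


# Hypothetical counts for illustration only.
metrics = proof_test_metrics(
    scheduled=40, completed=32, behind_schedule=5, within_spec=28
)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}%")
```

Note that the “as found” percentages are computed against completed tests, not scheduled tests, so a backlog doesn’t mask the failure rate.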

Selecting appropriate metrics to track can seem like an overwhelming task. Sometimes technical personnel want to measure everything just because they can. It’s important to carefully choose metrics so that just the right amount of meaningful data is collected. All systems involving humans and machines suffer some degree of variation in output quality. Good metrics drive personnel to do the right thing by identifying and correcting variation outside what’s considered acceptable. Measuring the wrong things can undermine process safety. It’s unfortunate but true that personnel will behave contrary to reason and the best interest of the company if necessary to “make their numbers.”

In the real world, some owner/operators essentially are following the old adage: “Measure with a caliper. Mark with a scribe. Cut with a chain saw.” Their process hazards analysis is becoming increasingly quantitative with more factors and modifiers, and the verification of risk reduction uses multiple significant digits — yet the mechanical integrity record simply states “failed.”

The real world must come into balance because mechanical integrity data prove the risk reduction strategy. The risk reduction provided by a piece of equipment is the inverse of its probability of failure on demand (PFD), which is the number of times the ISS has failed dangerously divided by the total number of times the ISS has been challenged. Using probabilistic techniques, the PFDs of specific equipment can be calculated and compared to expectations⁷.
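The definition above translates directly into a calculation on mechanical integrity and demand records. The sketch below is a point estimate only, with hypothetical function names and counts. With few recorded demands the estimate carries wide uncertainty, which is one reason probabilistic techniques are used for the comparison.

```python
def observed_pfd(dangerous_failures, demands):
    """Dangerous failures divided by the number of times the ISS was challenged."""
    if demands <= 0:
        raise ValueError("no recorded demands; observed PFD is undefined")
    return dangerous_failures / demands


def risk_reduction(pfd):
    """Risk reduction is the inverse of the PFD."""
    if pfd == 0:
        return float("inf")  # no failures observed; not proof of perfection
    return 1.0 / pfd


# Hypothetical record totals for illustration only.
pfd = observed_pfd(dangerous_failures=2, demands=200)
print(f"observed PFD: {pfd:.3f}")
print(f"observed risk reduction: {risk_reduction(pfd):.0f}")
```

A record that states only “failed” can’t feed this calculation, because it doesn’t distinguish dangerous failures from safe or degraded ones.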

The most important things cannot be measured¹. Consequently, PSM requires that quality be built into the design and management system. Validation and periodic proof testing demonstrate that the quality system is rigorous enough to exceed the required equipment integrity. Maintenance plans should consider how degraded equipment operation will be detected early, so it can be corrected before the equipment fails. Safety equipment must not be run to failure.

The more that’s known about the equipment and what’s affecting its operation, the better the risk can be managed. For safety systems, the most important thing is knowledge that the equipment will operate as required when called upon. The quality of the installed equipment is limited by the rigor, timeliness and repeatability of mechanical integrity activities as well as by wear-out and degradation.

To gain confidence in the equipment, perform periodic inspection and preventive maintenance to maintain it in “as good as new” condition. Proof tests provide an auditable means to demonstrate proper operation. Near-miss and incident investigations should evaluate any identified ISS inadequacy or failure. Track spurious trips and process demands and compare them with expectations from the hazard analysis. The Check phase involves monitoring equipment records and looking for trends indicating design or management gaps that need to be closed.
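Comparing tracked demands with the hazard analysis can start as a simple rate check. The sketch below assumes the hazard analysis recorded an expected demand frequency per year. The function names, the tolerance factor and the numbers are hypothetical.

```python
def rate_per_year(event_count, observation_years):
    """Observed event frequency over the observation period."""
    if observation_years <= 0:
        raise ValueError("observation period must be positive")
    return event_count / observation_years


def exceeds_expectation(observed_per_year, expected_per_year, tolerance=1.0):
    """True when the observed rate is above the expected rate times a tolerance."""
    return observed_per_year > expected_per_year * tolerance


# Hypothetical values for illustration only.
observed_demands = rate_per_year(event_count=3, observation_years=2.0)
expected_demands = 0.1  # demands per year assumed in the hazard analysis

needs_review = exceeds_expectation(observed_demands, expected_demands)
print(f"observed demand rate: {observed_demands:.2f} per year")
print("review hazard analysis assumptions:", needs_review)
```

A flag here doesn’t say what is wrong, only that the assumptions behind the risk reduction strategy no longer match the operating record and deserve investigation.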

Failure tracking is essential to close the safety lifecycle. Repeated failures likely indicate that the installed equipment isn’t capable of meeting performance requirements. Use root cause analysis to determine why metrics are trending in the wrong direction, then implement action plans to improve the management system, equipment, procedures and personnel training. Identify and communicate to personnel special and previously unknown causes of failure — to ensure that lessons learned aren’t hidden in mechanical integrity records. Use MOC processes to resolve performance gaps.

Act

“What is a system? A system is a network of interdependent components that work together to try to accomplish the aim of the system. A system must have an aim. Without an aim, there is no system. The aim of the system must be clear to everyone in the system. The aim must include plans for the future. The aim is a value judgment.”⁴

Even when good people apply adequate theory and standards, there are always lessons to be learned. The Act phase involves the actions taken in response to trends in metrics and continuous improvement opportunities. If an owner/operator’s safety culture shines here, risk will be driven as low as reasonably practicable.

Continuous improvement is incorporated in PSM through a concept often called “grandfathering,” where the owner/operator determines and documents that the existing equipment is designed, maintained, inspected, tested and operated in a safe manner. An assessment of the existing safety system should demonstrate that the design and management practices meet or exceed the intent of current good engineering practices and process requirements. Don’t hide outdated or under-performing equipment under the cloak of grandfathering.

Address identified gaps by developing action plans for closing them, establishing compensating measures until the gaps are closed, and creating an implementation schedule. Periodically assess plans to see if there’s a need to accelerate the schedule or broaden the plan objectives. For example, a planned ISS upgrade may be accelerated when the manufacturer withdraws support for the installed equipment. To be successful, action plans should be communicated to affected personnel so they understand and commit to the plans.

The most important things are unknown and unknowable⁴. So, management must continually work on the system, measure what can be meaningfully measured and move forward with improvement activities. Continuous improvement counteracts the accumulation of latent conditions that present potential safety challenges and weaken protection layers. Improving long-term operational effectiveness often takes time. Operating plans should consider how residual risk will be managed during the transition. Review and update as necessary the ISS operating and mechanical integrity basis to ensure equipment, procedures and personnel training remain in sync with modifications.

An ongoing process

Deming believed that experience by itself teaches nothing and that data without context are meaningless. Information gained from experience must be interpreted against a framework of expected behavior, equipment design and operating performance. But experience isn’t always the best teacher. Without an understanding of the underlying root causes, raw data can be misinterpreted, creating a flawed view of reality. Only data understood within their proper context provide a solid foundation for safe operation. New information identifies the need for new metrics, which point to additional improvement opportunities.

Accidents are prevented when safety issues are approached from a quality perspective. The Plan, Do, Check and Act phases are essential for maintaining safe and reliable operation. Use a management system supported with metrics to establish targets and monitor performance against policies, practices and procedures. Conduct periodic gap analysis to verify that actual performance exceeds expectations established in the hazard analysis and design basis. Close performance gaps with action plans that reduce risk and prevent accidents.

Angela E. Summers, PhD, PE, is president of SIS-TECH Solutions, LP, Houston, Texas. William H. Hearn, PE, is a senior consultant at the firm.

REFERENCES

1. Deming, W. E., “Out of the crisis,” MIT Press, Cambridge, Mass. (1986).
2. “Guidelines for safe and reliable instrumented protective systems,” American Institute of Chemical Engineers, New York (2007).
3. “Functional safety: safety instrumented systems for the process industry sector,” ANSI/ISA 84.00.01-2004, Instrumentation, Systems, and Automation Society, Research Triangle Park, N.C. (2004).
4. Deming, W. E., “The new economics for industry, government, education,” 2nd ed., MIT Press, Cambridge, Mass. (2000).
5. “Guidelines for hazard evaluation procedures, second edition with worked examples,” American Institute of Chemical Engineers, New York (1992).
6. “Layer of protection analysis: a simplified risk assessment approach,” American Institute of Chemical Engineers, New York (2001).
7. “Safety instrumented functions (SIF) - safety integrity level (SIL) evaluation techniques,” ISA TR84.00.02, Instrumentation, Systems, and Automation Society, Research Triangle Park, N.C. (2002).
8. “Guidelines for the implementation of ANSI/ISA 84.00.01-2004 (IEC 61511),” ISA TR84.00.04, Instrumentation, Systems, and Automation Society, Research Triangle Park, N.C. (2005).
9. “Process safety leading and lagging metrics,” proposed metrics for review published on www.aiche.org, Center for Chemical Process Safety, American Institute of Chemical Engineers, New York (Jan 2008).