Safety Instrumented Systems: Proof Test Prudently

The functional safety lifecycle covers a safety instrumented system (SIS) from concept to retirement. While important activities occur in each phase of the lifecycle, operation phase activities stand out because they are performed repetitively and are critical to long-term reliability.

An SIS is a high-reliability system composed of sensors, logic solver(s) and final elements. It includes a number of safety instrumented functions (SIFs), each designed to provide a specified risk reduction. The necessary risk reduction is assigned as a safety integrity level (SIL) that establishes the reliability requirements for the SIF. Achieving the needed reliability requires a clear understanding of the failure rate, failure mode and failure effects of the devices, along with a management program that identifies and corrects failures on a routine basis.

SISs operate in one of three modes: continuous, high demand or low demand. In low-demand systems, proof testing is an effective tool because the SIF components generally are dormant for long periods of time — which provides the opportunity to detect and repair failures and then return the component to service between demands.

In this article, we’ll review failure rate and failure mode basics, discuss proof test frequency and effectiveness, consider the robustness of the maintenance program, identify information to be collected during a proof test, and provide tips for analyzing the data to ensure continued reliability.

Device Failure Basics

You must overcome three hurdles to achieve the SIL target: average probability of failure on demand (PFD_avg, in low-demand mode), hardware fault tolerance (HFT) and systematic capability (SC). The failure rate, failure mode and failure effects of the SIF components influence all three hurdles. Reliability theory is based on the premise that components are replaced at the end of their useful life before wear-out affects failure rate. A common mistake in the operating phase is overestimating the useful life of devices. For example, solenoids have a useful life of 3–5 years and should be routinely replaced during refurbishment. Valves can have a useful life of only 3–10 years, or less if improperly specified, installed in severe service applications or not maintained correctly.

Devices are classified as Type A or B. Type A devices generally are mechanical and usually fail in a more predictable manner. Examples include valves, actuators, solenoids and relays. Type B devices are primarily intelligent and electronic — therefore, they can fail unpredictably. HFT requirements are increased for Type B devices to compensate.

Failures may be random or systematic. Systematic failures stem from design or manufacturing procedures or personnel competency — and can be reduced or eliminated. For certified devices, SC is determined by assessing the ability to control or avoid failures associated with the design and manufacturing process. A certificate will list the SC limits of a device for a specific HFT based on the assessment. Non-certified devices require the reduction of random and systematic failures through proven in use (prior use) data collection and analysis.

Overall device failure rate will include both random and systematic failures. Failure mode is safe, dangerous or no effect. Failure rates are designated by λ, using subscripts to indicate safe (S) or dangerous (D), and detected (D) or undetected (U). For example, a safe/detected failure would be identified as λ_SD. Diagnostics can spot some dangerous failures, λ_DD. The goal of proof testing is to identify dangerous undetected failures, λ_DU, and repair them in a timely manner. Proof test coverage (C_PT), neglecting diagnostics, is the percentage of λ_DU failures that the proof test can identify [1]: C_PT = (λ_DU revealed during test)/(λ_DU total).
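As a rough illustration of the C_PT ratio, the sketch below sums λ_DU over a device's failure modes and divides the portion a given test can reveal by the total. The failure mode names and rates are hypothetical values chosen for the example, not vendor data, and the FailureMode class and proof_test_coverage function are names introduced here.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    lambda_du: float        # dangerous undetected rate, failures per hour
    revealed_by_test: bool  # True if this proof test can find the failure


def proof_test_coverage(modes):
    """Return C_PT as a fraction: revealed lambda_DU / total lambda_DU."""
    total = sum(m.lambda_du for m in modes)
    if total <= 0:
        raise ValueError("total lambda_DU must be positive")
    revealed = sum(m.lambda_du for m in modes if m.revealed_by_test)
    return revealed / total


# Hypothetical shutdown valve; rates are illustrative only.
valve_modes = [
    FailureMode("stuck in position", 1.2e-6, True),
    FailureMode("slow to close", 0.3e-6, True),
    FailureMode("seat leakage", 0.5e-6, False),  # not seen by this test
]

coverage = proof_test_coverage(valve_modes)
print(f"C_PT = {coverage:.0%}")
```

With these assumed numbers the test reveals 1.5e-6 of the 2.0e-6 total, or about 75% coverage. In practice the per-mode rates would come from the vendor safety manual or a failure modes analysis.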

Proof Test And Diagnostics

IEC 61511 [2] defines low demand as a “mode of operation where the SIF is only performed on demand, in order to transfer the process into a specified safe state, and where the frequency of demands is no greater than one per year.” When the demand frequency exceeds twice the proof test interval, a SIF should be treated as high demand and the benefits of proof testing no longer are realized [3]. Demand rate is fixed based on the frequency of failures that could initiate a trip. As organizations seek to lengthen the time between turnarounds where offline proof tests can be performed, SIFs can shift from low-demand to high-demand mode. Extending a turnaround interval thus necessitates combining diagnostics, online proof testing and offline proof testing to maximize SIF reliability.

Automatic diagnostics continuously monitor the health of SIF components while SIF protection is in place. They enable identifying some failures immediately, allowing timely repair or replacement. The partial diagnostic credit (PDC) for automatic self-diagnostics depends on the ratio of the diagnostic and demand rates. For example, a ratio of 100× can provide 99% PDC while a ratio of 10× gives 95% PDC [3]. In low-demand systems, repair capability limits diagnostic benefit. Administrative procedures must set a timeline (typically 24–72 hours) to remove the affected device from service, repair or replace, and return to service. Diagnostics most commonly are available for Type B devices such as transmitters; they may be an additional cost option that must be specified prior to purchase. Actuation of a device during normal operation also provides diagnostic value but isn’t considered a proof test. System design must include isolation and bypass capability to permit making repairs. Diagnostic coverage is set based on a combination of these factors.

Online proof testing provides some diagnostic benefit. However, the test is performed at a lower frequency than diagnostics, and SIF protection is disabled during the test. An example is partial valve stroke testing (PVST), which is a useful tool where the process can tolerate partial valve stroking without initiating a trip. Typically, an online test will identify only a subset of the failures that a full stroke offline test can detect. Proof test coverage is determined based on the percentage of λ_DU failures the PVST can identify. Online proof testing may take place as often as practicable while a unit is in operation. As with diagnostics, system design must provide isolation and bypass capability to permit timely repairs.
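To see why a frequent partial test helps even though it finds only a subset of failures, consider the common simplified approximation PFD_avg ≈ λ_DU × T/2 for a 1oo1 element, split between the failures the partial test covers and those left for the full test. The sketch below is illustrative only: it assumes a constant failure rate, λ_DU × T much less than 1, a full stroke test that finds everything the partial test misses, and negligible repair time. All numbers are hypothetical.

```python
def pfd_avg_with_partial_test(lambda_du, partial_coverage,
                              partial_interval_h, full_interval_h):
    """Simplified 1oo1 PFD_avg with a partial test between full tests.

    lambda_du          dangerous undetected failure rate, per hour
    partial_coverage   fraction of lambda_DU the partial test reveals (0..1)
    partial_interval_h hours between partial tests
    full_interval_h    hours between full (offline) proof tests
    """
    if not 0.0 <= partial_coverage <= 1.0:
        raise ValueError("partial_coverage must be between 0 and 1")
    # Failures the partial test can find are exposed for the short interval.
    covered = partial_coverage * lambda_du * partial_interval_h / 2
    # The rest stay hidden until the next full proof test.
    uncovered = (1 - partial_coverage) * lambda_du * full_interval_h / 2
    return covered + uncovered


# Hypothetical valve: monthly PVST, full stroke test every four years.
LAMBDA_DU = 2.0e-6      # per hour, assumed
with_pvst = pfd_avg_with_partial_test(LAMBDA_DU, 0.6, 730, 35040)
without_pvst = pfd_avg_with_partial_test(LAMBDA_DU, 0.0, 730, 35040)
print(f"PFD_avg with PVST:    {with_pvst:.4f}")
print(f"PFD_avg without PVST: {without_pvst:.4f}")
```

Under these assumptions the PVST lowers PFD_avg from roughly 0.035 to roughly 0.014. A real SIL verification would use a fuller model that accounts for imperfect full tests, repair time and common cause.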

SIF response time and some failures, such as leakage, may only be detected during an offline proof test performed during a turnaround, with repairs then completed before returning the process to operation. An offline proof test typically will identify the highest percentage of λ_DU failures; however, the test rarely is perfect (i.e., C_PT = 100%). In reality, proof test coverage can range from less than 60% to as much as 99% depending on the method [4]. End users should consult vendor safety manuals to determine recommended diagnostic, online and offline proof test methods and associated coverage. Ensure system design and operation procedures are in place to support testing and repair activities. Moreover, it’s imperative to conduct proof tests at the intervals defined in the safety requirements specification (SRS). Proof test intervals that overrun the mandated period by more than 15% will start to impact the integrity of the SIF, so track proof test intervals as a leading indicator.
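Tracking interval overruns can be as simple as comparing the dates of consecutive proof tests with the interval in the SRS. The sketch below is one hypothetical way to do that. It reads the 15% figure above as an overrun of the mandated interval, and the function names and dates are assumptions made for the example.

```python
from datetime import date

OVERRUN_LIMIT = 0.15  # 15% threshold discussed in the text


def interval_overrun(previous_test, this_test, required_interval_days):
    """Fraction by which the actual interval exceeds the required one.

    A negative result means the test was performed early.
    """
    if required_interval_days <= 0:
        raise ValueError("required interval must be positive")
    actual_days = (this_test - previous_test).days
    return (actual_days - required_interval_days) / required_interval_days


def is_overdue(previous_test, this_test, required_interval_days):
    """True if the overrun exceeds the 15% limit."""
    overrun = interval_overrun(previous_test, this_test, required_interval_days)
    return overrun > OVERRUN_LIMIT


# Hypothetical SIF with a 365-day proof test interval in the SRS.
late = is_overdue(date(2022, 1, 10), date(2023, 4, 1), 365)
on_time = is_overdue(date(2022, 1, 10), date(2023, 1, 20), 365)
print("late test flagged:", late)
print("on-time test flagged:", on_time)
```

Counting how many SIFs are flagged each quarter gives a simple leading indicator that can be reviewed alongside other process safety metrics.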

Diagnostic, online and offline proof testing procedures should be well thought out and designed to maximize failure detection. Table 1 shows an example of the content expected in a proof test for a simple one-out-of-one (1oo1) SIF.

Maintenance Capability

A quality proof test is important. However, results can vary based on the site maintenance culture. An incomplete or incorrect proof test can significantly misrepresent the reliability of a SIF [5]. Human and procedural elements of a proof test can introduce random and systematic error. Procedures must be in place to ensure proof testing is performed as scheduled, repairs are completed immediately and effectively, and bypasses are removed after testing. Moreover, it’s essential to verify that the tools used are properly calibrated; power supplies, pneumatic and hydraulic systems are clean and in good repair; and components selected are compatible with the process and environmental conditions of service and are replaced before the end of their useful life. In addition, maintenance technicians should be well trained and periodically assessed per IEC 61511.

An organization must clearly understand its maintenance culture before attempting improvements. The testing and maintenance process can introduce systematic errors that negatively impact the reliability of the entire SIS. A tool such as the Site Safety Index (SSI) [6], www.exida.com/SSI, is useful for performing a self-assessment and identifying opportunities for improvement.

Data Collection And Analysis

Continued reliability depends on timely and effective proof testing, and routine monitoring of system performance (which the 2nd edition of IEC 61511 now requires). Establish data collection and analysis methods to monitor the failures that could lead to demand on SIFs and those that contribute to SIF failure (i.e., lagging indicators). It’s important to capture the “as found” condition before disassembling process equipment for testing and repair.

Set up a database to track all demands and failures associated with process and safety instrumentation and controls and other independent protection layers (IPLs). Collect information from near-miss and incident investigations as well as from diagnostics and proof testing. Each dataset should include device make, model and serial number; date of failure; name of technician identifying the failure; results of proof test; trip time and conditions that may have contributed to the failure.
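One way to keep each dataset consistent is to define a fixed record structure before collecting any data. The sketch below shows a possible layout using the fields listed above plus the “as found” condition. The class name, field names and example values are hypothetical, and a real site would map them to its own maintenance management system.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
from typing import Optional


@dataclass
class FailureRecord:
    make: str
    model: str
    serial_number: str
    failure_date: date
    technician: str
    source: str                   # e.g. "proof test", "diagnostics", "incident"
    as_found: str                 # condition before disassembly or repair
    proof_test_result: str        # e.g. "pass", "fail"
    trip_time_s: Optional[float]  # None if no trip time was measured
    contributing_conditions: list = field(default_factory=list)


# Hypothetical entry; every value here is made up for illustration.
record = FailureRecord(
    make="ExampleCo",
    model="XV-100",
    serial_number="SN-0001",
    failure_date=date(2023, 5, 17),
    technician="A. Technician",
    source="proof test",
    as_found="valve failed to reach closed limit",
    proof_test_result="fail",
    trip_time_s=None,
    contributing_conditions=["process buildup on stem"],
)

row = asdict(record)  # plain dict, ready to write to a database or CSV
print(sorted(row.keys()))
```

Keeping the source and as-found fields mandatory makes it harder to log a failure without the context needed to classify it later.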

Prepare a written procedure to ensure data analysis is completed in a consistent manner. Classify each failure as safe or dangerous, systematic or random, etc. An analysis method such as predictive analytics [7] can be used to calculate site-specific failure rates. Finally, compare the calculated rates to λ values used in SIL verification. If a device is found to be less reliable than expected, take steps to correct the situation by decreasing the proof test interval or replacing the device.
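The comparison against SIL verification values can be sketched with a simple point estimate: observed failures divided by accumulated device operating hours. This is not the predictive analytics method of [7], only a first-pass check, and with few failures the estimate carries wide uncertainty. The function names, device counts and rates below are hypothetical.

```python
HOURS_PER_YEAR = 8760


def observed_failure_rate(failures, device_count, years_in_service):
    """Point estimate of failure rate in failures per hour."""
    operating_hours = device_count * years_in_service * HOURS_PER_YEAR
    if operating_hours <= 0:
        raise ValueError("operating hours must be positive")
    return failures / operating_hours


def less_reliable_than_design(observed_rate, design_rate):
    """True if the site estimate exceeds the rate used in SIL verification."""
    return observed_rate > design_rate


# Hypothetical population: 40 identical devices, 5 years in service,
# 3 dangerous undetected failures found during proof tests.
site_lambda_du = observed_failure_rate(3, 40, 5)
design_lambda_du = 1.0e-6  # per hour, assumed value from SIL verification

print(f"site estimate: {site_lambda_du:.2e} per hour")
print("action needed:", less_reliable_than_design(site_lambda_du, design_lambda_du))
```

Here the estimate works out to about 1.7e-6 per hour, above the assumed design value, which would prompt a closer look at the proof test interval or the device selection.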

Evaluate two factors at the system level:

1. Failures of IPLs that could result in demand on a SIF should be trended and compared to the design basis demand frequency given in the SRS. If actual demand rate exceeds expected demand rate, residual risk exists that needs mitigating.
2. SIF trip time must be tested at SIF acceptance and periodically during the lifespan. The results should be trended to confirm that the SIF response time remains within the process safety time to ensure the SIF responds before an event occurs.
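Both system-level checks reduce to comparing trended field data with values in the SRS. The sketch below is a minimal illustration with hypothetical numbers: it computes an observed demand rate from a count of demands and reports the smallest margin between measured trip times and the process safety time.

```python
def observed_demand_rate(demand_count, observation_years):
    """Demands per year over the observation period."""
    if observation_years <= 0:
        raise ValueError("observation period must be positive")
    return demand_count / observation_years


def exceeds_design_demand(observed_per_year, design_per_year):
    """True if actual demand rate is above the SRS design basis."""
    return observed_per_year > design_per_year


def smallest_trip_margin(trip_times_s, process_safety_time_s):
    """Smallest gap between process safety time and any measured trip time.

    A zero or negative result means at least one trip was too slow.
    """
    if not trip_times_s:
        raise ValueError("no trip times recorded")
    return process_safety_time_s - max(trip_times_s)


# Hypothetical SIF: 3 demands in 4 years against a design basis of 0.5/yr.
rate = observed_demand_rate(3, 4)
print("demand rate above design:", exceeds_design_demand(rate, 0.5))

# Hypothetical trip times from acceptance and later proof tests, in seconds.
trip_times = [2.1, 2.3, 2.8, 3.4]
margin = smallest_trip_margin(trip_times, process_safety_time_s=5.0)
print(f"smallest margin: {margin:.1f} s")
```

In this made-up history the trip time still fits within the process safety time, but it grows with each test, which is the kind of drift that trending is meant to catch before the margin disappears.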

A Valuable Tool

The purpose of an SIS is to reduce risk through instrumentation. Proof testing is an effective means to detect failures that reduce system reliability for low-demand SIFs and thus enable timely repair. An operations team must understand how decisions such as extending proof test intervals (turnaround cycle) affect demand rate and SIS reliability. Diagnostics as well as online and offline proof testing can be useful in detecting device failures so repairs can be implemented and devices returned to service. Finally, it’s necessary to catalog information about failures discovered through testing, to confirm that the SIS is performing consistently with the design basis.

DENISE CHASTAIN-KNIGHT, PE, CFSE, CCPSC, is a senior functional safety engineer for exida, Sellersville, Pa. JIM JENKINS, CFSE, is a senior functional safety engineer at exida.

REFERENCES
1. Chris O’Brien and Lindsey Bredemeyer, “Final Elements & the IEC 61508 and IEC 61511 Functional Safety Standards,” exida, Sellersville, Pa. (2009).
2. “IEC 61511-1 Functional Safety: Safety Instrumented Systems for the Process Industry Sector – Part 1: Framework, Definitions, System, Hardware and Application Programming Requirements,” 2nd ed., Intl. Electrotechnical Comm., Geneva, Switz. (2016).
3. Iwan van Beurden and William M. Goble, “Safety Instrumented System Design, Techniques and Design Verification,” ISA, Research Triangle Park, NC (2018).
4. Chris O’Brien, Loren Stewart and Lindsey Bredemeyer, “Final Elements in Safety Instrumented Systems, IEC 61511 Compliant Systems and IEC 61508 Compliant Products,” exida, Sellersville, Pa. (2018).
5. Julia V. Bukowski and Iwan van Beurden, “Impact of Proof Test Effectiveness on Safety Instrumented System Performance,” presented at Reliability and Maintainability Symp., Fort Worth, Texas (January 2009).
6. Julia V. Bukowski and Denise Chastain-Knight, “Assessing Safety Culture via the Site Safety Index,” presented at 12th Global Congress on Process Safety, Houston (April 2016).
7. William M. Goble, Iwan van Beurden and Curt Miller, “Using Predictive Analytic Failure Rate Models to Validate Field Failure Data Collection Processes,” presented at Instrumentation and Automation Symp. for the Process Industries, Texas A&M Univ., College Station, Texas (January 2015).