Podcast: How Equipment Reliability Impacts Process Safety

Equipment reliability failures at Longford, BP and Buncefield highlight process safety lessons, maintenance KPIs and industry risks.
Sept. 9, 2025
28 min read

Longford Gas Explosion Lessons

Traci: That's awesome. Well, in today's episode, we're going to talk a little bit about the importance of equipment reliability: when equipment that is specified in the design to provide information then fails, we are outside of our assessed risk. There are several incidents where equipment reliability was at fault, and we're going to examine a few of those to hopefully help others avoid the same mistakes. And the first one that I want to talk about is the Longford gas explosion, which launched your career in process safety. There were equipment failures and management of change issues. Can you walk us through that and talk a little bit about the lessons learned from that?

Trish: Yeah, so I think we might've discussed Longford in a previous podcast if you want a bit more detail about it. But essentially, it was a gas processing plant. Condensate, oil and gas were coming in from offshore into a gas processing plant onshore, where it was then split out into natural gas, condensate, oil, etc., and pipelined to various locations for further processing, refining and use. Several years before the incident happened, there'd actually been a change made: there were multiple gas plants operating at this facility—multiple trains—and they had interconnected two of them. That had created a longer-term issue where, depending on how production was going offshore, different quality gas was going into the plants, which actually caused some of the units to drop in their operating temperature. And whilst a little bit of variation in operating temperature is not normally an issue, this management of change activity actually created a step change.

Alongside that, there were other changes that occurred where they actually decided to remove their onsite engineering support and have it located elsewhere. So we then had an operator workforce that didn't have that day-to-day engineering support with them either. And then there was a series of equipment reliability issues—so we've now got a plant that's running at a lower temperature than it should be, and it's doing this for an extended period of time. In fact, it had been doing it for almost 36 hours or something continuously. So it had actually seen a step change in the temperature that it was running at, and this was sort of seen as an operational issue to manage. They then had a series of pumps trip out, and so they lost flow throughout their heat transfer system, which runs through their heat exchangers. And every time they went to restart the pumps, they weren't actually able to tell what was going on in the process because their flow meters or their flow indicators were not working.

So they had not been maintained, and they were therefore not reliable. So we had a situation where, certainly from the panel, the control room operator was unable to tell what was going on in the plant because of faulty equipment. If the equipment was running, then he would know that certain pumps were running because he had pump indicators, but he also had other indications to tell him, because certain pumps ran in certain conditions, so they were interlocked to other things. And so one of the problems here is that when they were then troubleshooting the issue, they couldn't tell where the flow was occurring and not occurring, and this led to a lot more confusion at a time when they needed clarity. Eventually the very low temperatures took hold: not only was the plant operating a bit colder, but because they had lost their heating system, they actually had a massive temperature drop across all of their system, all of their pipework and heat exchangers. They were all carbon steel, and those dropped to about minus 47 degrees Celsius. Now at minus 47 degrees Celsius, carbon steel is brittle. So imagine what happens when you take a really cold glass and you put it in a sink of hot water. What does it do? It just shatters.

And effectively that's what then happened. Because everything had cooled down and it all shrank (things contract when they cool), the metal had shrunk away from some flanges and they had some leaks occurring. To try and stop the leaks, they wanted to gently heat up the system to expand the metal back, because they couldn't do up the studs any tighter; they were torqued adequately. And so during that process of reheating, they unfortunately caused a catastrophic, low-temperature, brittle failure where the metal shattered and they had a full tube sheet rupture on a heat exchanger. That incident—the blast and subsequent fire—killed two workers, and eight others were injured. So it was certainly a very significant incident. It also shut down the supply of natural gas to the city of Melbourne, where I live, for almost three weeks, which caused a major political outcry and was very much an inconvenience for the community. But tragically, as I said, two people were killed in that incident. A lot of it comes down to how we safely operate our facilities if we can't see what they're doing, if we don't have adequate instrumentation that's telling us what the current and correct state of the facility is, and that's what this idea of reliability and equipment integrity is all about. We need to make sure that the devices that are critical to the safe operation of our facility are functioning correctly.

Refinery Blast at BP Texas City, Texas

Traci: And as you've pointed out many times before, we don't make new mistakes. We keep making the same mistakes. And so another incident that we're going to talk about stresses the importance of equipment reliability as well, and it's the BP America Refinery blast in Texas City, Texas. The fire and explosion, which occurred on March 23rd, 2005, resulted in 15 deaths and severe injuries to 180 others. What went wrong here? 

Trish: Yeah, it's amazing to think that was 20 years ago, isn't it? Time certainly moves on. Now, this was a startup issue, so they were coming out of a shutdown period, and they were restarting the raffinate splitter tower as part of the startup of the unit, and one of the challenges that they had was due to some design issues. The operators had determined over many, many startups in many, many years that if they didn't slightly overfill the tower, they had a pump cavitation problem, and it kept interrupting the startup sequence. So it was custom and practice to overfill the tower to overcome this issue, and that had worked, provided the level indicator in the tower was functioning. Now, in this particular instance on this day, the level indicator in the tower was not functioning correctly. It was actually showing at one point during the sequence of events that the level of liquid in the tower was decreasing when, in actual fact, it was increasing at that time.

So they were getting very false signals. Now, you could argue, oh, well, they should have looked at the amount of liquid going in and the amount of liquid going out and realized that those two numbers didn't match, and therefore there was a growing volume. No; we give operators level indicators so that they can understand the level. We need to make sure that, particularly during a very complex event like a startup, they're getting all the right information they need to be able to make good decisions and take the right actions at the right times. We can't expect them to go and calculate or try to infer different conditions from a series of other pieces of data. We actually need to give them the right information. And so they were relying on this level gauge. Now, the other catch with this level gauge was another design issue: because this was a big, tall tower, it was only the bottom few feet of that tower that were ever filled with liquid; the rest of it was all vapor.

Only the bottom was liquid; the rest of it was vapor. But what had actually happened was—I mentioned in that startup they used to overfill—now the level indicator on that tower only covered the bottom 10 or 12 feet of the tower, and this was a very tall tower, so there was only about a third of the tower they could see the level in. Once it got above that high level, they had no indication of where it was other than that it was above high level. Was it half full? Was it two-thirds full? Was it completely packed and about to overflow? And what had happened in that instance was that it was completely packed and about to overflow, and it came out of the top of the tower at the overflow. It went into a blowdown drum, and then it came out through the vent in that drum as well.       

And that's where we got the geyser of hot raffinate that then ignited and, as you said, killed 15 people and injured 180 in that tragedy. And again, it really comes back to our operators—we need to not make it hard on them. There are enough other things going on. We need to give them instrumentation that is accurate, that is calibrated properly, that is functioning properly, and that they can rely on, because we are expecting them to make good decisions based on that instrumentation and the information it gives them. But if we're not maintaining that instrumentation adequately, then how can they trust the information they're getting? They just can't trust it. And if they can't trust it, then they potentially start to second-guess and make other decisions that are perhaps not the ones we want them to be making, or they make decisions on the basis of the information that's in front of them. And in hindsight, we see that sometimes those decisions don't work out well because the information input was wrong in the first place.

Lessons from Buncefield Fire

Traci: Good lessons learned there and certainly something to think about. And again, we're going into another faulty level gauge incident. This last incident that we're going to discuss today is the explosion at the Buncefield oil storage and distribution depot about 20 miles north of London. This occurred, again, 20 years ago, on December 11th, 2005, and it was said to be the biggest fire in Europe since World War II. Can you describe this incident a little bit, and then what we've learned from it?

Trish: Yeah, so this was a really simple process. This was pumping gasoline through a pipeline into a storage tank. A process doesn't actually get much simpler than that—pumping something into a vessel. There's a little bit of a complication around it, though. The pipeline that was pumping into the tank was owned by another company, and that pipeline originated miles and miles and miles away and was operated by other people. So they were operating the pumps that were putting product into a tank at this terminal. And this was actually a normal situation for this terminal. Whilst there were a lot of issues around various equipment reliability aspects in that facility, and it was overloaded, the facility hadn't received any major upgrades, even though its throughput had increased something like fourfold over the years. So all of a sudden there was massive stress on the facility in terms of putting a lot more product through it without actually upgrading its facilities to take that extra product.

Now, what had actually occurred in that instance was that the operators at the terminal had no visibility of how much product the pipeline was pumping, so they couldn't determine that flow rate. They didn't have that indication available to them. So they were, if you can imagine, operating blind on this tank. Now, what they did know was what time the tank started filling, because the pipeline people said, “Oh, we're starting the pipeline now.” They knew roughly what the rate was, and they knew roughly how long it should take to fill that tank, but they had no other instrumentation, so no flow indication available on the pipeline at all. The only way they had to monitor the level in that tank was the tank level gauge. Now, the level monitoring on that tank had a couple of issues associated with it.

The first one was that the tank level gauge had a history of getting stuck. What that means is that if you imagine a little float that's on a string and it goes up and down on the fluid level inside the tank, the little float gets stuck, and so it doesn't float anymore, and the liquid level continues to rise, and so all of a sudden your float is now stuck under the liquid level. So it's not telling you how much is in the tank anymore. And that's basically what happened. It had a history of doing that, and it would sometimes dislodge itself, so that all of a sudden the level would rapidly jump in the tank, and at other times they'd have to send maintenance up to jiggle it, get it loose and get it back to the top of the liquid level again. So in this particular instance, the level gauge stuck, but the operators didn't notice that the tank level wasn't going up.

We have a situation where they didn't have the right instrumentation to tell them what was going on in their facility, and the instrumentation they did rely on was faulty. They also had a number of other tasks and activities going on. In fact, there were many tanks that were not only receiving product from a pipeline but were also loading trucks at the same time. So you couldn't even just say, well, if we're putting this much product in and it's going to take this many hours to fill, then we know when it's going to be full, because you might have a situation where you're taking product out intermittently as well. And so you couldn't even figure it out that way. Now this level gauge got stuck, and it effectively failed. The next safety system they had was an independent high-level switch.

So this is: we get to a certain point, and then there's a cutoff switch that, when the liquid hits that device, shuts the inlet valves. Now, that would've caused a rise in pressure in the pipeline, which would've automatically triggered the pumps to stop at the other terminal where they were pumping from. And that's how the system is designed to work, except in this instance, the independent high-level switch did not activate. There's a whole series of reasons, mainly design but also communication reasons, that it didn't operate. One of them was that the design had a red lever that you operated on that independent switch, and the red lever in the upright position meant that the device was online. You lowered it to the horizontal position to test the device to make sure that it was functioning, because we do need to test our safety-critical equipment. We have to know it functions exactly how we expect it to. So if you can imagine, this little red handle could either be in a vertical position or a horizontal position. When it's in its vertical position, so online, there's a little padlock that sits in there to hold the handle upright. But people assumed the padlock was actually just so that people didn't tamper with the device, so they didn't put it on. Now, gravity tends to pull things down over a period of time, and this little lever ended up in the horizontal position due to gravity. And what then happened was that when the liquid level hit the independent high-level switch, it was offline. It didn't trigger.

So that tank then proceeded to overflow gasoline, and because of the design of some outside parts of the tank, it actually vaporized the gasoline very effectively, creating a massive vapor cloud that then detonated. So this wasn't just a fire; this was actually an explosion. It was a vapor cloud explosion on gasoline, which traditionally we'd kind of thought didn't really happen: we don't really get vapor cloud explosions on gasoline, they happen on gas. But it had happened with gasoline. In this instance, the design of the tank, in conjunction with the atmospheric conditions at the time, had so perfectly vaporized that gasoline that it detonated and created that massive explosion that shook windows all over the place.

It was felt for miles and miles and miles, and as you said, it created the biggest fire in Europe since World War II, burned for days, and had a lot of environmental impact. It was also interesting, from a reliability and asset integrity aspect, that during the fire it was discovered that the tank bunds, which are designed to hold the volume of the tank and some fire water in the event of an incident, did not hold the liquid. They had gaps in them. There were cracks and penetrations that went through the bund wall which were not adequately sealed, or they were sealed with materials that were not fire-safe, so when there was a fire it actually melted the seals. And so one of the things from Buncefield: whilst there were no fatalities, which was just an amazing outcome, and that was purely because of the time of day it happened (at 6 a.m. there weren't many people around), Buncefield's ongoing legacy is the environmental contamination that occurred from the gasoline and the distillate that burned, the firefighting water and the foam from the firefighting. It all flowed into the local environment, down the local streams, into the water tables. So whilst Buncefield is a process safety incident, one of its most significant impacts was actually a wide-scale environmental impact.

KPIs for Equipment Reliability and Maintenance    

Traci: Now we have three examples of catastrophic reliability issues. I want to talk about how to mitigate and manage these and, in the future, try to stop this happening. What key performance indicators best measure equipment reliability and maintenance effectiveness?

Trish: So there are a couple of parts to this. First of all, we need to understand what our safety-critical elements are, whether it is that level gauge or whether it is that secondary containment of the bund. Whatever we determine our safety-critical elements to be, we need to know what they are, and we need to document them. We then need to say, what is the performance standard of this element? What do I expect it to do when I need it? And very clearly document what that performance standard is. Now we can start to talk about some metrics. As part of that performance standard, we're going to have required maintenance that we need to do, and that might be time-based or cycle-based. It could be a range of different things that our maintenance regime is based on, but we are going to have a schedule of how often we need to do a calibration and a maintenance check on that particular element.
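
To make that documentation idea concrete, here is a minimal sketch in Python of how a safety-critical element and its performance standard might be recorded in a register. The class, field names and example values are illustrative assumptions, not taken from any specific standard or from the incidents discussed above.

```python
from dataclasses import dataclass

@dataclass
class SafetyCriticalElement:
    """One entry in a hypothetical safety-critical element register."""
    tag: str                   # equipment identifier, e.g. a hypothetical "PSV-1201"
    description: str           # what the element is and where it sits
    performance_standard: str  # what it must do when called upon
    test_interval_months: int  # time-based maintenance schedule
    acceptance_criteria: str   # how a pass is judged on test

# Illustrative entry only; values are assumptions, not real plant data.
relief_valve = SafetyCriticalElement(
    tag="PSV-1201",
    description="Pressure relief valve protecting a separator vessel",
    performance_standard="Lifts within tolerance of set pressure and reseats once pressure decays",
    test_interval_months=12,
    acceptance_criteria="Bench pop test lifts within tolerance and valve reseats without leakage",
)
```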

So the first metric you need to think about is whether we are performing safety-critical maintenance on schedule. We've got a schedule; are we meeting it? Are we actually doing the testing we need to do when we need to do it? So that's the first metric that we need to look at. The next metric we need to look at, though, is how our maintenance aligns with our performance standards. So, for example, if I have a pressure relief valve that's safety critical and I need to check it, I check that particular pressure relief device on a 12-monthly basis. I go in, I isolate it, I pull it out. The first thing I do with that valve in a maintenance scenario is take it to a test bed where I actually test it. It's what's called a pop test.

I test it and see what pressure it relieved at, and whether it closes again, because the performance standard on a pressure relief valve is that it needs to lift within a certain tolerance of its set pressure and then needs to reseat again after that pressure has decreased. What we sometimes see happen is that our maintainers are great at taking the valve off, they're great at doing the test, and they're great at recalibrating it and putting it back on because they found that it didn't lift at the right pressure. But how are we getting them to then communicate and record that it didn't lift at the right pressure, and the pressure it did lift at? That information is really important. What we've discovered is what we call a failure on test. If your pressure relief valve didn't relieve at the right pressure, give or take its tolerance, then it wasn't working, which means it was installed and not working.

How long had it not been working for? If you had an overpressure scenario where you needed that valve to protect you and prevent an incident, it wasn't going to do it. So when we have those failures on test, we need to track them, we need to investigate them, and we need to understand why. And potentially we need to make changes to our testing regime so that we do know that our valves are working when we test them. It might mean that 12 months is too long in that service and it needs to be nine months. Or it might be that we discover the seal is the wrong seal, it had worn, and it had created an issue in the valve, and so we need to change out the seal material. It could be a range of different reasons why a safety-critical element failed on test, but the important thing is that we have a schedule to test them, we test them, and then we document and make note of the failures on test, investigate them and then resolve those issues.
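
As a rough illustration of the two metrics Trish describes, schedule compliance and failure on test, here is a short Python sketch over a hypothetical list of test records. The record format, grace period and numbers are assumptions made purely for illustration.

```python
from datetime import date

# Hypothetical test records; the format and values are illustrative assumptions.
records = [
    {"tag": "PSV-1201", "due": date(2025, 3, 1), "done": date(2025, 3, 5), "passed": True},
    {"tag": "LT-310",   "due": date(2025, 4, 1), "done": date(2025, 5, 20), "passed": True},
    {"tag": "LSH-311",  "due": date(2025, 4, 1), "done": date(2025, 4, 2), "passed": False},
]

GRACE_DAYS = 14  # assumed allowance for counting a test as "on schedule"

def schedule_compliance(recs):
    """Fraction of safety-critical tests completed within the allowed window."""
    on_time = sum(1 for r in recs if (r["done"] - r["due"]).days <= GRACE_DAYS)
    return on_time / len(recs)

def failure_on_test_rate(recs):
    """Fraction of tests where the element did not meet its performance standard."""
    return sum(1 for r in recs if not r["passed"]) / len(recs)

print(f"Schedule compliance: {schedule_compliance(records):.0%}")
print(f"Failure-on-test rate: {failure_on_test_rate(records):.0%}")
```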

Documentation & Record Keeping

Traci: Now you're talking about documentation and record keeping. How can we be better at that?

Trish: I think this is where there's a lot of opportunity in the digitalization aspects of what we do. When we look at maintenance management systems and how maintenance is scheduled, most organizations, certainly large organizations, are doing this through computerized maintenance management systems. They're large programs or platforms through which we track everything and schedule all of our work, so we've got all this information. I think one of the steps that we're missing is how we make it really easy for the maintainer who's discovered the failure on test to report it. And so I think there is opportunity in this space if we can integrate things like tablets or phones where they can actually just put the information straight in, rather than having them write it down on checklists, which does still occasionally happen. They do their check, then they fill in their form, then that form goes to someone else who reads that form and might then put it in the system, or it might get lost somewhere in the process.
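
A "couple of presses on a screen" could be as thin as a function that needs only the tag and the result and fills in the rest automatically. This is a hypothetical sketch that appends to a local file; a real system would feed the site's computerized maintenance management system instead, and the function and field names here are assumptions.

```python
import json
from datetime import datetime

def report_test_result(tag, passed, measured_value=None, note=""):
    """Capture a test result with minimal input from the maintainer.

    Hypothetical sketch: appends to a local file; in practice this would
    write to the computerized maintenance management system directly.
    """
    record = {
        "tag": tag,
        "passed": passed,
        "measured_value": measured_value,  # e.g. actual lift pressure from a pop test
        "note": note,
        "reported_at": datetime.now().isoformat(timespec="seconds"),
    }
    with open("test_reports.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Two quick inputs from the maintainer; everything else is filled in automatically.
report_test_result("PSV-1201", passed=False, measured_value=23.4, note="Lifted above tolerance")
```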

If we can actually create systems that are simple enough, reporting becomes really quick to do. Because keep in mind we put a lot of work on these maintainers—we expect them to do enormous amounts of things. We need to not overburden them with bureaucracy. We need to slim it down and get them reporting the most critical things. So make it as simple as possible. A quick reporting tool that's a couple of presses on a screen is fantastic.

The other important part is that they also need to understand why they're doing what they're doing. Why is it important to report that something failed on test? So it's around upskilling everybody's knowledge about why the safety-critical elements are safety critical. What is the key driving force? What is it preventing? This is why we have to look after it. I once heard someone say, manage safety-critical elements like your life depends on them, because it does, and we need our maintainers understanding that. Our operators, I think, very much do understand that to an extent. We need our maintainers understanding it as well, perhaps a little bit more, because they're sort of one step removed from the operator. Though maintainers are often involved in the shutdown activities going on, which is also when a lot of different incidents occur, so I'm sure they're very aware of process safety. But making that connection of why this is so important actually comes back to my favorite topic, weak signals and the platypus.

What are the weak signals when they're maintaining something, and how do we actually get our maintainers so that subconsciously their brains notice the weak signals? So that they notice, “Now I did this inspection and it passed, but it was a bit different this time. It did pass, but it wasn't normal. There's a range of one to 10, and it normally sits at five, and 10 is still okay, but it's now at eight. Is that a weak signal on an individual isolated data point? It might not be, but it also might be, and there might be a trend. It might've gone from five to six to seven to eight. Are we actually seeing it move towards going out of range?” So providing information in an easily digestible, quick-input way is absolutely vital to achieving these outcomes.
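
The drift Trish describes, a reading that still passes but has crept from five toward the top of a one-to-ten range, is easy to flag automatically once results are captured digitally. Here is a minimal Python sketch; the thresholds are illustrative assumptions, not industry rules.

```python
def weak_signal(history, limit=10.0, creep_fraction=0.8, min_points=3):
    """Flag a reading that still passes but is trending toward its limit.

    history: successive inspection readings for one element, oldest first.
    The thresholds here are illustrative assumptions only.
    """
    if len(history) < min_points:
        return False
    recent = history[-min_points:]
    rising = all(later > earlier for earlier, later in zip(recent, recent[1:]))
    near_limit = recent[-1] >= creep_fraction * limit
    return rising and near_limit

# Readings of 5, 6, 7 and 8 all pass against a limit of 10, but the trend is a weak signal.
print(weak_signal([5, 6, 7, 8]))  # True
```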

Redundant Systems & Backup Equipment

Traci: Now let's talk a little bit about the role of redundant systems and backup equipment. Would this have helped in any of the three examples we had?

Trish: Well, interestingly, looking at the Buncefield explosion, it was the backup overfill protection that also failed. So redundancy can be good, but we then need to understand what the operational strategy and maintenance strategy is for redundant equipment. You might have a pump and a spare. Well, do you run the pump and the spare is there when you need it, or do you regularly cycle through them and put the same number of hours on each piece of equipment? What happens when you're switching between them? It's not always clear-cut and simple. The mere act of switching between two different pumps or two different conveyors or two different anything might actually introduce other incidents and other hazards into the process. So it's not as simple as saying we'll put all these extra redundant devices and redundant equipment in, and we'll have all of that backup stuff, and that will protect us.

Because what we're actually doing is making the system more complex, and one of the challenges in inherently safer design is to simplify, because when we make something more complex, we give it more opportunity to go wrong. And do we have a common mode failure? If we're talking about completely independent level devices on a tank, you might argue there's not a common mode failure. There's not on the devices themselves, but there potentially is in the maintenance system that takes care of them. So if we're not maintaining our level gauge, are we maintaining our independent level switch? Probably not as effectively. If we're not doing one well, why would we randomly do one thing well and not another? And just adding more things can sometimes give us a sense of comfort that can lead to complacency, and we can end up in a situation that is not as safe. So when it comes to inherently safer design principles, it's not just as simple as adding more layers of protection on; it's not about a barrier count. It's actually about the quality of the barriers we have and how well we look after them.
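
To put a rough number on that common-mode point, here is a back-of-the-envelope Python sketch using a simplified beta-factor approximation, in which some fraction of failures is assumed to hit both "independent" devices at once. The probabilities are illustrative assumptions, not data from Buncefield or any real facility.

```python
def duplicated_pfd(pfd_single, beta):
    """Approximate probability that both redundant devices fail on demand
    when a fraction `beta` of failures is common cause (simplified beta-factor model)."""
    independent_part = ((1 - beta) * pfd_single) ** 2
    common_cause_part = beta * pfd_single
    return independent_part + common_cause_part

pfd = 0.05  # assumed chance a single level device fails on demand (illustrative only)
print(duplicated_pfd(pfd, beta=0.0))  # ~0.0025: the naive, fully independent answer
print(duplicated_pfd(pfd, beta=0.1))  # ~0.0070: a shared, weak maintenance system erodes most of the benefit
```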

Traci: Trish, is there anything you want to add?

Trish: I think just if we can get more people understanding the critical importance of knowing what your safety-critical elements are, knowing what their performance standard is, and then maintaining and monitoring against that performance standard, then we could prevent a lot of events from occurring. That takes effort, it takes money, and that's, I think, one of the biggest challenges that we see. Maintenance budgets get cut, and we are stretched to try and do far more with much less. I want to leave as a final point here that, if we think about it, when our plants are operating reliably, they're operating. When they're operating, we have a chance to make money out of them. When they're breaking down and we're having to repair them, they're not operating, they're not making money, and they're costing us money because we're having to put money into the repairs and the activities around them and the lost production.

I want to get people thinking about the idea that when we get process safety right, we get good plant reliability, and that's how we make money. Process safety does not have to just be a cost on your business, and you don't have to only justify things by the avoidance of the cost of an accident or incident, which we can never really quantify. But process safety gives us reliability, and reliability is key in any business in making money. So process safety can make you money as well as save lives. Now, the saving lives is, to me, the ultimate reason, but making money is a pretty good reason too.

Traci: And as you said earlier, we don't want to make it hard on our operators; we need to give them good instrumentation. And if we can do that and show that it helps the bottom line, I think that's how we send it home to everybody. Thank you for all of the information that you gave us today. The look back on those three incidents—unfortunate events happen all over the world, and we will be here to discuss and learn from them. Subscribe to this free award-winning podcast so you can stay on top of best practices. You can also visit us at chemicalprocessing.com for more tools and resources aimed at helping you run efficient and safe facilities. On behalf of Trish, I'm Traci. And this is Process Safety with Trish and Traci. Thanks again, Trish. Stay safe.

About the Author

Traci Purdum

Editor-in-Chief

Traci Purdum, an award-winning business journalist with extensive experience covering manufacturing and management issues, is a graduate of the Kent State University School of Journalism and Mass Communication, Kent, Ohio, and an alumna of the Wharton Seminar for Business Journalists, Wharton School of Business, University of Pennsylvania, Philadelphia.
