Callide Power Station: 34 Minutes to Disaster

May 13, 2025
Switching operation triggers catastrophic power station explosion, proving that process safety principles apply beyond traditional chemical plants to any high-hazard environment.

In this episode, Trish and Traci discuss the catastrophic failure at Queensland's Callide Power Station C4 on May 25, 2021, which caused power outages for 470,000 people. During a routine switching operation to replace DC battery systems, a voltage drop was misinterpreted, triggering a cascading failure. Both AC and DC systems failed, leaving the turbine without lubrication while it continued spinning backwards at 3,000 RPM.

The incident demonstrates that process safety principles apply beyond traditional chemical plants to any high-hazard environment. Key lessons include proper hazard identification, functioning safety controls, and maintaining culture, leadership, accountability and governance in safety management.

Transcript

Welcome to Process Safety With Trish & Traci, the podcast that aims to share insights from past incidents to help avoid future events. Please subscribe to this free podcast on your favorite platform so you can continue learning with Trish and me in this series. I'm Traci Purdum, editor-in-chief of Chemical Processing, and joining me as always, is Trish Kerin, director of Lead Like Kerin, and essentially the brains of this whole operation. So thanks Trish for once again joining me.

Trish: Always a pleasure, Traci.

Traci: Well, in this episode, we're analyzing the catastrophic failure on your side of the globe at the Callide Power Station's Unit C4 on May 25, 2021, a cascading event that began with a turbine malfunction and culminated in a hydrogen explosion that knocked out power to over 470,000 Queenslanders. Fortunately, the power station was safely evacuated, and no one was injured, but there are a lot of things that we need to unpack here. Can you explain what happened that day?

Setting the Scene: Power Station Primer

Trish:  Yeah. So, first of all, I'll start by setting the scene of where this power station is located and why it's actually important to us. The Callide Power Station is located near a town in Queensland called Biloela, and that's not a very big town. It's got a reasonable little population, and there's a whole lot of different industries there: some abattoirs, some cattle grazing, those sorts of things are all out in that area. It's in outback Queensland, so inland from the coast and almost halfway up the Queensland coast, heading toward that mid-Queensland area. All of the power stations in Queensland are interconnected to ensure security of supply and to make sure they've got a network that runs effectively across the whole state, even though they're operated by different companies.

What happened on the 25th of May occurred during a switching operation that was taking place at Callide on their C4 power station. Now, Callide actually has several power stations located in this area as well, so there's Callide B and Callide C. Callide B is a fairly decent-sized power station. It produces about 700 megawatts, and Callide C produces 844 megawatts, so it's the bigger of the two stations. Callide C was built in about 2001, and it's a coal-fired supercritical boiler. That was the technology of the day, and it's important as we talk through what actually occurred and how this event unfolded. As I said, they were doing some switching operations at the power station.

To make sense of what the switching operations are, we need to understand a little bit about how power stations work. So, if you'll indulge me, I want to explain at a very basic level how a coal-fired power station works so that we understand some of the components, because I'll need to talk about them in a little bit of detail. First of all, as I said, this is a coal-fired power station, so that means it is located near a coal mine. They mine the coal, the coal gets crushed up, and it is then transferred through conveyor systems into a boiler. So the coal is the fuel for the boiler.

The boiler then makes the supercritical steam. The steam comes off the boiler and it flows into a series of turbines, and those turbines combined are basically what we call the rotor system, and that steam turns those turbines and that rotates a shaft, and that shaft rotates and it runs through a generator. And the generator is, if you imagine, I don't know if you were a kid and you ever played with those little dynamo sets where you could wind it up and basically what you're doing is you're turning a little shaft inside a coil of copper and that mechanical movement of the shaft moving inside the coil of copper generates electricity. So that electricity is then put through transformers that then flow into the power grid, and that's how, when you go to your light switch or you go to your power outlet and you plug your device in, you get power coming out. So that's how it's been formed as it's gone through.

There are different ways to make electricity, so you might have a gas-fired power station instead of coal or nuclear as well. All of those sorts of methods of power generation are generating steam to drive a turbine to generate electricity, and that's what was happening at this particular power station. When we think about how the power station works and some of the more intricate details of it, electricity has two different types of forms. There's alternating current and direct current, and so we need to understand the difference between the AC and the DC systems because they did different things at this power station. This power station had circuits that were both AC and DC. The AC system supplied all the electricity to run most of the unit itself.

So, what I mean is when you've got a shaft running through bearings and things like that, you need lubricating oil, and you need seal oil. So, there are hydraulic oil pumps that provide that oil to the bearings and the seals, and cooling systems are needed as well. So, the AC system at that power station provided all of that electricity. The DC system provided electrical protection and monitoring systems and was also a backup to the lubricating oil pumps, because the lubricating oil pumps are really important. You've got a rotating shaft; you've got to make sure you keep those bearings lubricated. So we've got the AC supplying the lubrication system primarily, and then we've got the DC supplying the monitoring systems and a backup to the lubricating pumps.

Now, there are also some devices installed to protect the rest of the system. If the control system detects a fault in anything that's going on, it will automatically disconnect the rotor from the grid so that we're not creating an electrical disruption. Now, the DC system, remember this is the one providing the monitoring systems and the backup to the lubricating oil, is powered by a battery charger, and that battery charger then has batteries as backup. So this is starting to get to be a really complicated system to understand. To recap: the DC system is powered by a battery charger, with batteries as a backup, but the AC system provides the power to that battery charger. So these two systems are deeply interconnected, which makes this a really complicated story to understand, which is why I have to confess, this one took me a long time to get my head around what was actually going on here.
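To make that web of dependencies easier to follow, here is a minimal Python sketch of the supply relationships described above: AC feeds the main lube-oil pumps and the DC battery charger, while DC backs up the lube-oil pumps and powers protection and monitoring. The component names and the simple "any available source" rule are illustrative assumptions, not the plant's actual electrical design.

```python
# Illustrative sketch of the supply dependencies described above.
# Component names are simplified and do not reflect the actual
# Callide C4 electrical schematics.

SUPPLIES = {
    "lube_oil_pumps_primary": ["ac_system"],
    "lube_oil_pumps_backup": ["dc_system"],
    "protection_and_monitoring": ["dc_system"],
    "dc_battery_charger": ["ac_system"],                  # AC powers the charger...
    "dc_system": ["dc_battery_charger", "dc_battery"],    # ...which powers DC, battery as backup
}

def is_available(component, failed, seen=None):
    """A component is available if it has not failed and, where it has
    supplies, at least one of those supplies is available."""
    seen = seen or set()
    if component in failed or component in seen:
        return False
    sources = SUPPLIES.get(component)
    if not sources:        # root supplies such as ac_system or dc_battery
        return True
    return any(is_available(s, failed, seen | {component}) for s in sources)

# Lose the AC system while the new battery is not yet connected,
# roughly the situation during the switchover:
failed = {"ac_system", "dc_battery"}
for load in ["lube_oil_pumps_primary", "lube_oil_pumps_backup", "protection_and_monitoring"]:
    print(load, "available" if is_available(load, failed) else "LOST")
```

With both the AC system and the battery out of the picture, every load in that list reports LOST, which is essentially the corner the operators found themselves in.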

So that's some of the basic context for this particular event; the Callide C4 rotor is basically the piece of equipment that we're talking about. To understand what actually happened that day, we need to take a couple of steps back and look at a project that was being undertaken in February of that year, February 2021, a project to replace the DC battery charger and batteries on C4. So we're changing the battery charger and the batteries. Now remember, the battery charger is powered by the AC system, but the battery charger then provides power to the monitoring systems. From February to May 2021, while the battery charger and battery were isolated, the DC system was being supplied from another DC supply that they had, one they called the station DC supply. So they had that DC supply coming in, and they were still protected in that instance. They still had power going to all the monitoring systems.

Then on May 25th, they started the switching to turn on the new DC charger and battery, and this is the event that started the cascade of issues, which then saw the catastrophic failure of the turbine... The failure basically destroyed the turbines in the rotor system in that instant. Callide C4 was on power at the time of the incident. It was running as normal, and they were doing this switching over on the rotor as it was running, so the power station was online. They had previously switched over some other battery systems and DC chargers on the other rotors earlier in the project, while those were online at the same time, and everything worked fine.

So we've got a learned behavior that it's, "Okay. We can actually do this switching whilst we're online. We don't need to have the power station offline or the turbine offline to do that particular work." It was during the switching sequence that both the AC and the DC systems failed. Now, at that point, when both the AC and the DC systems failed, a couple of things happened. The first thing was that the turbine steam valve automatically closed, so that meant there was no longer steam going to the turbine to force it to spin.

Traci: And that's what you want to happen, or is that not?

Trish:  Exactly.

Traci: Okay.

Trish:  That's what you want to happen. So it should actually shut down. Now, that fault that was detected at that point in time should have also disconnected the unit from the grid, but it didn't. Remember, a little bit ago I said that if a fault is detected in the unit, it will disconnect from the grid; this time, it failed to do so. Now, we've got a turbine that's isolated from the steam, and it starts to spin backwards. What it's doing then, instead of producing and exporting electricity, is importing electricity from the grid and spinning backwards; it's doing what we call motoring at that point in time. Why is that a problem? It's a turbine. It's designed to spin.

Remember, I said the lubrication systems were provided by the AC system and backed up by the DC system? We've lost both of those systems. We now have no lubricating oil going to our bearings or seals on a rotor assembly that is spinning at about 3,000 RPM backwards. So keep that in mind. How did we get into this situation? What happened that actually allowed all of that to fail? I know this is a really complicated story.
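A quick aside on that 3,000 RPM figure: it is the synchronous speed of a two-pole machine on Australia's 50 Hz grid, which is why a unit that stays electrically connected keeps being driven at essentially full speed even with the steam valve shut. The two-pole machine is an assumption typical of large 50 Hz steam turbo-generators, not a figure quoted in the episode.

```python
# Synchronous speed of a generator: n = 120 * f / poles (in RPM).
# The two-pole machine is an assumption typical of large 50 Hz steam
# turbo-generators, not a figure quoted in the episode.
grid_frequency_hz = 50
poles = 2
sync_speed_rpm = 120 * grid_frequency_hz / poles
print(sync_speed_rpm)   # 3000.0 -- matches the 3,000 RPM mentioned above
```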

Traci: I'm following you.

Trish:  Good. I'm trying to keep it as clear as I possibly can in my head.

Traci: Okay.

Trish:  The switching sequence, what they actually did at that point in time, was that they connected the battery charger to the C4 DC system, and then they turned off the station DC supply to C4. So that meant the battery charger was now meant to provide the DC power to C4. The next step was to connect the battery, because the only thing that was on at the time was the battery charger. However, the battery charger did not immediately respond, and the voltage dropped. When the voltage dropped, the system interpreted that as a fault in the AC system and triggered what's called the arc flap protection to disconnect AC from the unit. But it was not an AC fault; the system misinterpreted it. And when the AC power was lost, remember all the way back, what provided power to the DC battery charger? The AC system. So the DC battery charger dropped in voltage, which caused the AC system to trip out, which in turn took the power off the DC battery charger itself.

And then when all of that happened, the backup systems failed to operate as planned as well, because there's an automatic changeover switch that should have immediately restored power to the DC system, but that had been damaged back in January 2021 and hadn't been fixed. So one of our backup systems, which you could probably say was a critical backup system, was inoperable, so it failed to reinstate the DC system. And remember, the DC also provides our backup lubrication oil. The AC backup diesel generator did start, so that's a tick: something happened, it worked. But the DC system was needed to reconfigure the AC supply from the AC backup generator, and the DC system was out. Now, on top of all of this, at that time, all of the control room displays failed and went black.
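Laid end to end, the chain just described looks roughly like the sketch below. The step wording is a simplified paraphrase of the episode, not the plant's actual switching procedure or the official investigation's sequence of events.

```python
# A simplified paraphrase of the switching sequence and cascade described
# above -- not the plant's actual procedure or protection settings.

sequence = [
    ("Connect new battery charger to the C4 DC system", "charger becomes the sole DC source"),
    ("Disconnect the station DC supply from C4", "charger must hold the DC voltage alone"),
    ("Charger fails to hold voltage; DC voltage dips", "protection misreads the dip as an AC fault"),
    ("Protection disconnects the AC supply", "AC-fed loads lost, including the charger itself"),
    ("Automatic changeover switch (damaged since January 2021) fails to act",
     "DC is not restored from the alternate supply"),
    ("Backup diesel generator starts, but DC is needed to reconfigure the AC supply",
     "backup generation cannot be brought in"),
    ("Both AC and DC lost", "no lube-oil pumps, no monitoring, control displays go black"),
]

for step, (action, outcome) in enumerate(sequence, start=1):
    print(f"{step}. {action} -> {outcome}")
```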

Everything Starts to Fail

Traci: A perfect storm is happening.

Trish:  Exactly, because without the AC and the DC systems, there was no supply to the control room screens and all of their control systems. So everything cascaded at this point, and different faults just continued to occur at various times, and this is essentially what then resulted in the event. That's the technical background, and there are a lot of technical causes. So there are design-related issues here.

Now, interestingly, when they went to switch over to the battery charger, the battery charger they had designed and specified was not designed to maintain voltage as the sole supply, so when it dipped in voltage, that should not have been unexpected, because the battery charger was not specified or tested to be able to maintain supply in that instance. And then the arc flap protection system triggered incorrectly as a result of some design issues with that, too. Then, if we think about the risk awareness at the time, the switching took place without an automatic changeover switch functioning. So we went into quite a serious and critical operation, while online, without the fundamental safety controls in place that could have averted an issue should one have occurred. And there was also a temporary change in how the facility was operating that was required to be able to do the cutover as well. So the risk awareness, in terms of understanding what was actually going on there, was concerning.

Now, given this is a power-related incident, it's also interesting to understand that a whole lot of organizational factors were identified in the investigation. There was a very significant independent investigation report done into this incident, and it was quite critical of a number of organizational and, particularly, process safety aspects in the organization. It specifically calls out process safety as being necessary even in a coal-fired power plant, which is probably not somewhere we automatically think about process safety applying, but it actually does. So there were management system issues. The management of change process that they had was not very robust and, in fact, wasn't actually used for this project, as this was all being done without adequate management of change occurring.

Now, interestingly, another event that was basically caused by the lack of a management of change system adequately identifying all the risks was the San Bruno pipeline incident near San Francisco in California. There, again, they were doing a changeover of control systems and were changing over the UPS batteries when they lost control system access and pressure regulation control in their pipeline. So management of change in areas that are not your traditional refinery or chemical plant is still really, really critical. Back at Callide, there was no risk assessment following the impairment of the automatic changeover switch. So we know we've got an impaired system, but we don't assess the risk of that. And the risk assessment system didn't require the activity they were about to undertake, the switching, to be risk assessed at all.

Staffing Issues Lead to Trouble

Trish: The other interesting thing is that there's always more than one thing that goes wrong in these instances. And you talked about how it was a cascading effect of all of these things that happened. Well, let me tell you about what happened with the resourcing at the plant that day. First of all, there was a history of very high management turnover at that facility, and that can create a lot of knowledge gaps around competency and familiarity with systems, but it can also create a lot of unease and cultural disruption in the workforce when people turn over at quite significant rates. But on the actual day of the cutover, the shift supervisor had to leave the site at 10:00 a.m. for personal reasons, and this incident started to occur at about 1:30, 2 p.m.

Now, at about 12:30 that afternoon, the C station operator went to the medical center to undertake their mandatory mask-fitting activity. That's a really important activity, but was it so urgent that it had to take place in the middle of the switching operation? What that meant was the C3 control room operator had to go outside to do the outside checks, leaving the C4 control room operator to manage both C3 and C4 at this time. So we've got a resourcing issue happening here. We've seen this happen before in other incidents as well. If you remember back to the Texas City refinery incident in 2005, the shift supervisor left the site for personal reasons before that event occurred. If we think about the Buncefield tank farm explosion, the supervisor was suffering quite significant fatigue because of excessive rosters. So that's clearly a resourcing issue. And even more recently, at the Washington, D.C., air traffic control tower, there was lower than normal resourcing at the time of that tragic collision between a helicopter and a passenger aircraft.

So, resourcing at really critical times is something that we need to actually focus on and understand a bit more about, too. And then there were operational integrity issues occurring. At this facility, it was acceptable to operate with system impairments. They weren't questioned; it was just accepted. The control room also received inconclusive and contradictory information that made it very difficult to make decisions. When they did get their control systems back up, what they were being shown was not necessarily what was really happening in that turbine hall. So that started to create a lot of issues for them. And it wasn't a matter of just going into the turbine hall to look at what was happening. They had a motoring turbine, so they did what they needed to do at that point in time: evacuate everybody out of the turbine hall, and evacuate everybody except essential personnel out of all buildings, because a motoring turbine with no lubrication is quite a substantial emerging risk in that facility.

And from an overall governance perspective, they had initiated critical risk processes several times. They had tried to implement process safety systems. They'd done several campaigns over many years, and every time they'd lost focus and the campaign was defunded. So they never quite got there. They never quite implemented effective process safety management systems. Competing priorities are a classic issue that I see in organizations all the time; they distract people from managing the critical risks. So, for example, all of their metrics or KPIs were dominated by production. How many kilowatts or megawatts were they producing? Financially, what was their dollar per megawatt? And occupational safety, making sure that they could track those lost-time injuries. They were not effectively tracking anything around the integrity of their equipment. So they were lacking process safety indicators, both lead and lag, at that facility as well. So there were a whole lot of different aspects at play there.

Over a period of time, the turbine continued to motor backwards. As I said, it was spinning at about 3,000 RPM. Now, with no lubrication on any of the bearings or the seals, metal-to-metal contact started to occur. What happens when you've got metal-to-metal contact on a rotating shaft? You're going to generate friction. That's the whole point of having the lubricating oil there. That friction generates heat, and that heat started to deform the shaft, because we've now got an incredibly finely tuned and balanced rotating shaft spinning at high speed without lubrication, generating friction and heat, so the shaft starts to heat up and deform. That deformation results in unbalanced spinning at 3,000 RPM.

Turbine Missile Event

And what then occurred was what they call a turbine missile event. This is basically when bits of your turbine dislodge and start shooting out as missiles. To give you an example, there was one particular piece of shaft that weighed about 4,400 pounds, and it was thrown over 16 feet. A bearing gear weighing over 660 pounds was thrown 65 feet into the air and through the roof of the turbine hall. The shaft actually disintegrated in around nine places. Fortunately, there was nobody in the turbine hall at the time; they had successfully evacuated everybody.
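To get a feel for why a deformed shaft at that speed starts throwing pieces, consider the rotating force from even a small imbalance, F = m × r × ω². The imbalance mass and offset below are round, illustrative numbers, not measurements from the failed rotor.

```python
import math

# Rotating force from an unbalanced mass: F = m * r * omega^2.
# The imbalance mass and eccentricity are illustrative assumptions,
# not measurements from the failed rotor.
rpm = 3000
omega = 2 * math.pi * rpm / 60        # angular speed, about 314 rad/s
imbalance_mass_kg = 10.0              # assume 10 kg of off-centre mass
eccentricity_m = 0.01                 # assume the mass sits 10 mm off the axis

force_kn = imbalance_mass_kg * eccentricity_m * omega**2 / 1000
print(f"{force_kn:.1f} kN")           # about 9.9 kN -- roughly a tonne of rotating load
```

Even these modest assumed numbers put roughly a tonne of rotating load onto the bearings, sweeping around fifty times a second; the real deformation of a multi-tonne rotor is far more severe, which is how pieces end up going through the roof.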

Now, the turbine had remained connected to the grid, so as it disintegrated, that caused arcing to occur. Remember, we're still connected to the electricity system because the rotor system didn't disconnect, and that drew current from the grid at more than twice the unit's rated power. What then happened was that, at a substation downstream, the grid protection disconnected that substation from the entire Queensland power grid. And that's why at about 2:07 that afternoon, all the lights dipped across Queensland and millions of people became aware that there was a significant power disruption occurring somewhere, because the entire grid was disrupted. So we're seeing millions of people then feel that impact. This massive power disruption had tripped out what's called the Calvale substation, which then blacked out a substantial portion of Queensland. The rest of the network did come back up again, but C4 was out of action for several years while they rebuilt and then recommissioned that particular rotor and turbine system. So that's a long story, but really interesting and quite complicated.

Traci: Very complicated. And as I was researching this, one of the questions I'd like to ask you is, what have they done to prevent this failure in the future? And ironically, I learned that they also had another incident just a month ago, almost the same incident. Can we talk about that?

Trish:  Yeah. It was a different type of incident that happened, back in early April, April 4th this year, on their C3 power station. C4 is back up and running; that was recommissioned and is back generating power. But they had an issue on their C3 system, and what happened was quite a different incident, and it resulted in a boiler explosion. So remember how I said we start this story with them getting coal out of the mine, crushing up the coal, and feeding it into a boiler to make steam? Then that steam goes to the rotor and through the generator to produce the electricity. So, back to the coal and the boiler system in this instance. We're talking about those two particular aspects of C3.

In C3, because of the type of coal and the way they inject their coal into their boiler system, they can form what's called clinkers. Now, a clinker is when ash hardens and starts to build up on the internal walls of the boiler, and it's quite common for clinkers to form at coal-fired power stations. And there's a whole series of management tools and practices in place that do things like regularly detach the clinkers from the boiler tubes so they fall to the bottom and can easily be removed. But what happened in this particular instance was that a clinker that had formed detached, and when it fell, it resulted in the release of steam from the ash conveyor water system. That release of steam extinguished the flames in the boiler. So all of a sudden, we've got a boiler without any flame in it, and that immediately caused some negative pressure because the flames went out. That caused a bit of a suck of air into the system, and it prompted the unit to automatically trip.

Now, the flame loss in that instance led to pressure fluctuations. As I said, there was a negative pressure, and then a positive pressure surge occurred, and that positive pressure surge was caused by the re-ignition of the accumulated gas that formed in those moments when the burners were out. So basically, they had a buildup of gas in the boiler, and then it ignited. If you can imagine you've got a gas stovetop and you turn it on, but the igniter doesn't work instantly, and then when the igniter does work, you get a much bigger... That's basically what happened. There was a buildup of fuel, and then it was ignited in one hit, and that actually caused some damage to the boiler and some of its associated equipment.
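That stovetop analogy is exactly why burner management systems enforce a fuel trip and a furnace purge before any relight after a flame-out. The sketch below is a generic, textbook-style illustration of that interlock; the state fields and permit rule are assumptions for illustration, not the logic actually installed on the Callide C3 boiler.

```python
# Generic burner-management interlock: after a loss of flame, fuel must be
# isolated and the furnace purged before re-ignition is permitted.
# Illustrative only -- not the actual Callide C3 boiler logic.

from dataclasses import dataclass

@dataclass
class BoilerState:
    flame_detected: bool
    fuel_valves_closed: bool
    purge_complete: bool      # e.g. a timed air purge at proven airflow

def permit_ignition(state: BoilerState) -> bool:
    """Only allow the igniters to fire once fuel has been isolated and the
    furnace purged, so accumulated fuel cannot ignite in one hit."""
    return state.fuel_valves_closed and state.purge_complete

after_flame_out = BoilerState(flame_detected=False,
                              fuel_valves_closed=True,
                              purge_complete=False)
print(permit_ignition(after_flame_out))   # False -- purge first, then relight
```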

The boiler did not rupture. There was no loss of high-pressure steam or water in that instance, but it has basically taken the C3 boiler out of service for the moment whilst they go through that investigation and repair activity. So there are some potential learnings in there around boiler management, boiler ignition management in particular, and control of the fuel-air mixture within their boilers as well. So sadly, there are obviously still some challenges at that particular site.

How to Ensure Control Logic is Sound

Traci: Now, let's dial back a little bit. As we said, it was a cascading event. Everything was happening seemingly out of their control. How can facilities ensure that their control logic is sound? It sounds like you have to factor in so many things. How can you possibly factor in all of that?

Trish:  So that's where we have to go through and make sure that we have adequate risk assessments to identify all of our hazards at our facilities, because these hazards are not new. For example, we know the hazard of dealing with alternating current and direct current in power stations. We know how turbines work. We know how high-pressure steam works. We understand all of these things because we actually understood them enough to design and build the facility. But what we've got to make sure we do is understand what the hazards are and what consequences could occur, and then make sure we map out from our hazard all the way through to our consequence, and clearly understand our controls that we have in place.

So from that technical perspective, identifying that the automatic changeover switch is actually a critical safety device, because it allows the system to switch over, is critically important. But the facility was running for four months without it functioning, and then went into a complex switching operation that was not the normal procedure. We don't normally switch between these things; normally, they're just in steady-state operation. So this switching was not a normal everyday activity. There was a procedure for it, but it was not an everyday activity. Going into that particular activity without the necessary devices functioning meant they went without the protection those devices would have provided.
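One practical way to catch the "non-routine switching with an impaired changeover switch" situation is a pre-task check that maps each critical activity to its safety-critical controls and holds the work if any are impaired. The sketch below is one way such a check could be expressed: the control names are drawn from the episode's discussion, but the structure and the ready_to_proceed function are illustrative assumptions, not Callide's actual system.

```python
# Illustrative pre-task check: a non-routine operation should not proceed
# while any of its safety-critical controls is impaired, or should at least
# trigger a formal risk assessment first. Not Callide's actual system.

CRITICAL_CONTROLS = {
    "DC switchover to new charger": [
        "automatic DC changeover switch",
        "unit-to-grid fault disconnection",
        "backup lube-oil pump supply",
    ],
}

impaired = {"automatic DC changeover switch"}   # damaged since January 2021

def ready_to_proceed(task: str) -> bool:
    missing = [c for c in CRITICAL_CONTROLS[task] if c in impaired]
    for control in missing:
        print(f"HOLD: '{control}' is impaired -- risk assess before '{task}'")
    return not missing

print(ready_to_proceed("DC switchover to new charger"))   # False
```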

So really, it comes down to understanding all of those aspects of our management systems and our risk assessments, and understanding where they fit. And I talk a lot with people about the idea of culture, leadership, accountability and governance. These are four critical areas that, in any organization, we have to actually understand from a process safety perspective. So there's the design side we've got to get right, but we've also got to get that culture, leadership, accountability and governance right.

In C4, if you look at culture, there was acceptance of work without adequate controls in place or operational. From a leadership perspective, there was inadequate resource planning and high leadership turnover. From an accountability perspective, management of change was not effectively described or implemented. And from a governance perspective, there was distraction from implementing the process safety fundamentals. So we can actually go back and say, "Okay. As sites, we need to make sure we get our PHAs, our hazard assessments; they're critical, they need to be done, and they need to be applied when we do management of change. We need to make sure our critical roles are defined, understood and resourced adequately, particularly when we're doing complex, unplanned or unusual activities. And that management of change is clearly defined and applied with rigor around our facilities, making sure when we make that change, we understand what risk it is driving. And then lastly, making sure that-

34-Minute Mark

Traci: Trish, the alarm's going off.

Trish:  The alarm. So that alarm for all you listeners out there is 34 minutes. That was the length of time it took from the moment the switching operation failed and they lost AC and DC systems to the time the turbine disintegrated, 34 minutes. It all happens rather quickly in real time, doesn't it, Traci?

Traci: It's amazing, isn't it? Going through this, how fast it happened is just incredible. And I guess another question we want to ask is, what are the lessons? Can we learn from this incident? In these 34 minutes we've learned a lot, but what else can we learn?

Trish:  So that last part I was just finishing up when our 34-minute alarm went off was that when we do process safety improvements, we actually have to manage them at a strategic level. We need to consciously make decisions to apply process safety principles to our facilities and our organizations, and they need to be championed at the highest levels in an organization, because they have budget implications and we need people on board, but they also have cultural implications.

We need leaders to be embracing the needs of process safety and speaking about process safety and engaging in process safety. So I think my message for everybody is to really think about, as I said, culture, leadership, accountability and governance, and how you're applying your key critical risk assessment activities to your business to make sure you understand what is going on in your facility, and can adequately and safely manage the activities that you've got going on.

Traci: Trish, is there anything else you want to add?

Trish:  This is an interesting story because, as I said, it was a coal-fired power plant, something that we don't normally talk about in process safety. But I think I've laid out and highlighted some of the complexities of the operation, and shown that the principles of process safety still apply. Whether you're in a refinery or a power plant, or indeed a hydroelectricity plant or a solar generating facility, it doesn't really matter what your particular field is. If you are working in a high-hazard environment, an environment where catastrophic events are possible or have actually occurred, then you need to be managing those risks adequately. You need to understand your hazards, understand your consequences, and make sure you put your controls in place and know that your controls actually work.

Traci: Well, Trish, as always, you help us get the culture, the leadership, the accountability, and the governance right, and we appreciate that. Unfortunate events happen all over the world, and we will be here to discuss and learn from them. Subscribe to this free podcast so you can stay on top of best practices. You can also visit us at chemicalprocessing.com for more tools and resources aimed at helping you run efficient and safe facilities. On behalf of Trish, I'm Traci, and this is Process Safety With Trish & Traci. Thanks again, Trish.

Trish: Stay safe.

 

About the Author

Traci Purdum | Editor-in-Chief

Traci Purdum, an award-winning business journalist with extensive experience covering manufacturing and management issues, is a graduate of the Kent State University School of Journalism and Mass Communication, Kent, Ohio, and an alumnus of the Wharton Seminar for Business Journalists, Wharton School of Business, University of Pennsylvania, Philadelphia.
