Podcast: Lessons Learned from the Columbia Disaster
This 100th episode of "Process Safety With Trish and Traci" examines the 2003 Space Shuttle Columbia disaster through the lens of due diligence. Columbia disintegrated during re-entry after foam debris damaged heat shield tiles during launch. The podcast explores how NASA normalized foam strikes over time, turning "lessons of failure into memories of success." Multiple intervention opportunities were missed due to inadequate resources, poor communication, and cultural barriers.
Transcript
Welcome to Process Safety With Trish and Traci, the award-winning podcast that aims to share insights from past incidents to help avoid future events. Please subscribe to this free podcast on your favorite platform so you can continue learning with Trish and me in this series. I'm Traci Purdum, Editor-in-Chief of Chemical Processing, and joining me as always is Trish Kerin, director of Lead Like Kerin. And Trish, this will be our 100th episode today, so thank you for putting up with me and my sometimes silly questions for this whole time.
Trish: Wow, I can't believe it is our 100th episode. That is amazing. And when I think back to five odd years ago, five, six years ago, when you actually said, "Would you like to do a podcast?" And I went, "Oh, I don't know. All right. Okay. We'll give it a go." 100 episodes later, finalist in an award, winning an award. That's pretty cool, actually.
Traci: It is, and it's a great topic, and I was worried that we weren't going to have a lot to talk about. And here we are 100 episodes in and I think we can get another 100 episodes if you're willing.
Trish: You just can't stop us talking, can you?
Traci: Well, what have you been working on lately?
Trish: Oh, I've had a busy week this week. I'm on the road. I'm over in Perth in Western Australia, and I've had a great time doing a keynote presentation for a conference talking about process safety leadership in the energy sector, which was really great fun to do and so many interesting questions and great conversations around it. So that was wonderful. And doing some work with a company and their board as well around understanding process safety better so they can get better safety outcomes going forward. So, exactly the sort of stuff I love doing, training, education, coaching and speaking.
Traci: And what you're excellent at, so helping the world be a better place for sure. Today, we are going to be talking about due diligence. As with several of our podcast episodes, we will look outside the chemical industry to hopefully learn lessons that can help our listeners avoid risks based on the insight from other catastrophes. We often look toward the aviation field, and today we are going to soar a little bit higher and focus on the Space Shuttle Columbia disaster, in which the orbiter disintegrated as it re-entered the atmosphere, killing all seven astronauts on board. The mission was the 28th flight of the orbiter, the 113th flight of the Space Shuttle fleet, and the 88th after the Challenger disaster. Can you walk us through some of the events of that day in February 2003?
Space Shuttle Columbia Disaster
Trish: Yeah. So, it's interesting. This story really starts when the space shuttles were first developed, obviously, but this particular flight began in January, and the catastrophic part of the incident happened on re-entry in February. The first part of the incident happened on launch, when a piece of insulation foam broke away from the external tank and struck the leading edge of the wing. Now, foam broke away a lot and everybody just got used to it, foam breaking away and hitting the leading edge of the wing or the underside of the wing or some part of the orbiter. But one of the issues this time was that it sufficiently damaged one of the heat shield tiles, and that was what happened on that day in February. As the orbiter was re-entering the Earth's atmosphere, the friction as it hits the atmosphere and runs through it develops enormous heat, which is why there's a heat shield and why the orbiter actually comes in belly and wing first, because that's where the heat shield is.
So, it's taking all of the heat that is being generated as it re-enters the Earth's atmosphere, the thicker air that it has to come into. And because of that tile damage, superheated air was actually able to get under the tiles and inside the wing. Now, this is incredibly hot, and it actually started to melt away the internals of the wing. The first indication mission control had that something was going on was when they started to get alarms on the air pressure in the tires on the landing gear, and that was because the tires had blown completely in the heat.
And they were having a conversation, talking about getting some alarms, everything was okay, everything was good. And then there was a complete loss of communication with the orbiter Columbia. The last words spoken on the radio from Columbia itself were the commander starting to respond, and then nothing. You then hear, on the transcript and the recording, mission control saying, "Columbia, Houston, comm check." So, Houston trying to re-establish communication. Then you hear it again, "Columbia, Houston, comm check." And the next thing you hear is someone saying, "Lock the doors." Because at that point in time, they knew they had a problem.
They don't know what it is still, but they know they have a problem basically. Then they start getting reports from all over the country of this massive fireball in the sky and those sorts of things, as they start to realize what's actually gone on. So, it's an incredibly tragic story. The astronauts on board, seven of them, they had no idea that this was about to happen. They did not know the state that their vessel was in, and so they were unable to take any sort of mitigation action. And it's still arguable, could they have done something different? But they basically went into this, and they were excited about coming home. There's actually some video recording in the minutes before they start the re-entry, where they're joking and laughing and talking about coming home and looking forward to coming home. They've been in space for a few weeks. That's the last footage we have of these people.
No Substitute for Good Engineering Practice
Traci: You had authored a case study on this topic titled “81.7 Seconds.” In that case study, you quote the Columbia investigation report, "Reliance on past success became a substitute for good engineering practice." Let's talk about that.
Trish: Yeah. So, as I said earlier, the foam broke away on almost every launch, and it became normal. So, the first time the foam broke away and hit an orbiter, NASA said, "Oh, that's a problem," and started to investigate it. And thought, "Yeah, this could be a real issue. We could have some big significant challenges here. If we damage the heat shield, we can't do re-entry because we'll have a burn-through event and we'll lose the orbiter." Exactly what actually happened that day. But what ended up happening was they couldn't figure out how to stop the foam breaking away. And when I say foam, it's not soft, spongy foam. It's a hard, crusty foam, a fire-retardant type of material. And when it breaks away, it hits at incredibly high velocity, so it is quite a significant impact on the vessel.
Now, one of the other challenges is that they discovered that it did sometimes hit the carbon tiles and damage them. And so, was that going to damage the heat shield? The biggest issue with the carbon tiles that they had was that they're actually all a unique shape. There's not a single tile that's identical to another tile. It's the most complex jigsaw puzzle because when they first started to travel into space and they created heat shields, the vessels they were using were not reusable. They were disposed of at the end. So, the heat shield on those actually, during the re-entry process, fused to the vessel and then it became a throwaway item. Because this was a reusable vessel, they couldn't use the same sort of heat shields because they needed to be able to send it back up into space to send it into that really cold atmosphere and environment, and then bring it back through that extreme heat again.
And as you can imagine, there's constant expansion and contraction with that temperature change. So, this was a really complex jigsaw puzzle. And so, they couldn't just replace tiles if one was damaged in space, because they couldn't carry every single tile with them. So, they tried some other methods, they tried sort of patching stuff, but they couldn't really get anything to work effectively. And over the years, they'd developed this experience of, well, we keep getting hit by foam, but it's not a problem because we haven't had a burn-through.
And so, they started to believe it wasn't a problem. The likelihood of a burn-through was so remote that it just wasn't going to happen. And when you actually go back and look at the history of foam strikes, there are some other interesting things. Columbia was the orbiter that was hit the most by foam. Now, Columbia was the first orbiter to fly in space, and it was actually slightly different to the others. It wasn't the first one built, but it was the first one with engines that flew in space, and it was very much a test vehicle. So, it had all sorts of extra telemetry in it to understand how this reusable orbiter was actually going to work. And when it was coming back in, that's when mission control also started to get all these other alarms, because this vessel was full of unmaintained telemetry.
It was still connected but not maintained. So, was it giving them the right answers or not? They started to not believe what their systems were telling them, because it might well have been wrong. But they never figured out how to fix the foam strikes, and they had never lost an orbiter to this before. However, they had one incredibly large warning sign from Atlantis. Several years earlier, Atlantis had been struck by a particularly large piece of foam, and it had caused very substantial damage to the tiles near the leading edge of the wing. In fact, they had a camera mounted on Atlantis at the time, and so they asked the crew to inspect the heat shield. And I believe the commander responded, "It looks like it's been shot by buckshot." This was a mess. There were pockmarks everywhere.
The decision was made to bring Atlantis home, and it landed. It got through. It didn't burn up on re-entry. When it landed, they discovered there was some really significant damage to the leading-edge tiles, but because of the exact location, it took longer for the superheated air to get into the wing structure and damage it. So, Atlantis landed, but only just. And when it did land, that created this thought of, we can have significant wing damage and still land. So, all of a sudden, as the report also talks about, we've turned the lessons of failure into memories of success. How good are we? We can land this thing even with wing damage. And so, it was just assumed to be normal. It was a classic case of normalization of deviance, the phrase Diane Vaughan coined to help us understand this, which came out of her in-depth study of the Challenger space shuttle disaster.
And it's still very much applied here, and it applies in everyday life to everything we do. When you go through and the standard just creeps a little bit and nothing goes wrong, and you stay there for a little while and nothing goes wrong, and then the standard creeps a little bit more, nothing goes wrong, you start to normalize that creep, and it's okay because nothing's gone wrong. As humans, we're actually not really great at judging risk. Nothing's gone wrong yet, so we're fine. Unfortunately, just because it hasn't happened yet doesn't mean it's not going to happen.
Traci: And with Atlantis, it was basically dumb luck that it didn't burn up on re-entry, because the position of the tile that was affected didn't bring that heat in. I have a question that I want to circle back on. You said that all of these tiles aren't the same shape. Is that a structural thing on purpose? Is that by design?
Trish: Yes. So, it was actually so they could fit them all in to allow them to be reusable. They couldn't use the traditional type of heat shield, and this design allows for some expansion and some contraction. Every single tile is its own unique shape and is fitted individually. It's an amazing feat of engineering, what they actually did in creating the space shuttle. So, I don't want people to go away from this going, "The space shuttle, that was just an absolute mess." Tragically, they lost two orbiters and 14 astronauts in the program.
But overall, the ingenuity, the creativity, the genius in how they were able to develop the space shuttle under the constrictions that they were put under, it was actually quite a phenomenal piece of engineering. There's a really good book called Into the Black, and it actually talks about the development history and the engineering history of creating the space shuttle orbiters, and it is, it's an amazing story of human exploration and ingenuity.
Missed Opportunities
Traci: Now, throughout this mission, there were several missed opportunities for intervention, and I have several friends who are engineers with NASA. I'm not going to name names, but during this whole time, they would always hold their breath and shake their heads. And so, seemingly, people did understand that there needed to be intervention. But let's talk a little bit about what were some of those missed opportunities there.
Trish: Yeah. So, if we just look at this launch alone, the first one was this: any time a launch is done, there's a launch assessment team. Effectively, they get together and review all of the details of everything that happened to understand it, collect data, analyze it, figure out what was going on. And they saw that an object had struck the left wing leading edge, and they said, "We need to investigate this." So, they started to investigate it. But mission control's view was very much that it wasn't a safety-of-flight issue, because foam strikes the leading edge on almost every launch; it's just a maintenance issue, we have to refit some tiles when it gets home. That was the first missed opportunity they had.
Another interesting one, and they really didn't pick up on this one at all at the time, it was very much a weak signal. When the orbiter is in space, NASA tracks a protective zone around it for any possible objects that might collide with it. This is to make sure that it doesn't get hit by asteroids and space junk and all that sort of stuff. And shortly after Columbia reached orbit, an object appeared in that little protective zone, and it sort of appeared and disappeared for a couple of days before it totally disappeared. It was about the size and the shape of the tile that was lost, but it was never connected that it was the tile, that we hadn't just damaged a tile, we'd lost a tile. And so, the collision avoidance system picked up something, but it was dismissed as potentially just a bit of space junk. They never quite connected that weak signal to something that was really there.
We had issues of the debris assessment team wanting the crew to do an inspection, but that request never went anywhere. Now, there were no planned spacewalks for that mission, and there was no camera on board to inspect visually, so they would've had to do an unscheduled spacewalk. An unscheduled spacewalk is not a little thing. That's a big deal. You are putting a human in a suit outside in space. So, they didn't want to do an unscheduled spacewalk. That would've been potentially too great a risk to the individual.
Sadly, we saw a greater risk in the end. But you can see why: at that point in time, they weren't really sure there was a problem, so they weren't going to put someone outside. And they didn't. It was quite common for the crew to film the external tank separation as it happened, but mission control forgot to ask the crew for that footage. So, there may actually have been some quite detailed footage showing more of the impact than the ground-based cameras, which were looking at an orbiter way up in the sky, trying to see this tiny chunk of foam hit something. So, they missed that too.
The assessment team wanted satellite images to be taken, but they didn't follow the right channels. They didn't request them through the right area; they sort of went to their friends and back-channeled and tried to get some images taken. And so, those images got canceled because they didn't follow the right channels, and an opportunity to perhaps see the damage more clearly, to have known whether there was really an issue, was lost. There was also an assumption by the managers that the images wouldn't be good enough quality to see anything anyway, so why move satellites to take images? Sadly, the NASA managers who said the images wouldn't be good enough probably weren't aware of the camera capabilities of the military satellites. So, there may well have been satellites with the capability to do what they needed.
The assessment team, when they were doing their assessment and trying to understand whether it was an issue, couldn't get the images, couldn't get all the information, and so they stopped work. These were a dedicated group of technical people who were being frustrated at every step in what they needed to do. And in the end, they basically went, "Well, we can't do anything," and walked away from the assessment. So, it was never quite finished. Interestingly, they had the authority to mandate certain things, but they didn't use it. And that, I think, speaks to what the report talks about: very much a culture of fear in the organization. It was very command and control, you do not step out of line, and there would probably have been quite significant consequences for you if you were to exercise your mandate.
So, from a cultural perspective, there were a lot of gaps there. And they did have a modeling program that they used to model the foam strikes, but it wasn't fit for purpose. It hadn't been validated for impacts from such large pieces of foam, and the people using it weren't overly experienced in its use. Now, one of the things that we often talked about with computer programs, and I'm showing my age now, when we first started to talk about computer programs: garbage in, garbage out.
Traci: Right.
Trish: If you don't get the right information going into your model, and if you don't understand what's happening inside that black-box model, what you get out is garbage, because you can't interpret it properly and you won't be able to tell whether the answer is even in the right ballpark. And so, there were all these different gaps that appeared, and it's the classic Swiss cheese: the holes just kept lining up.
The Due Diligence Model
Traci: Now, a way to combat garbage in, garbage out is due diligence, which is what I said we were going to talk about today. Can you explain the due diligence model to us? I know you mentioned there are six elements in that case study I read. Can we talk through them?
Trish: Yeah. So, the due diligence model is actually legally enshrined in most of Australia and in New Zealand as well. It's the first time anywhere in the world that the concept of due diligence has been defined in safety law, and it requires officers of a corporation, basically those in charge, to exercise due diligence in safety. Now, for those of you listening and saying, "Well, I'm not in Australia or New Zealand, so why do I care about this due diligence model?" This due diligence model is actually just a really useful framework for thinking, which is why I talk to people all over the world about it and encourage them to use it, because it focuses you on some particularly important parts of your business. And even if it's not a legal requirement, it's just good business. It's good business to know what's going on in your company and know whether it's in control, and that's what the due diligence model is about.
Now, it does have six elements to it. The first is to acquire knowledge and keep it up to date about safety matters, and that includes process safety, so I'm putting process safety in there too. So, acquire knowledge on process safety matters that affect your business. Second, understand your business, including its process safety hazards and risks. Know your risks, right? Pretty simple. Third, ensure that the business has the right resources and processes in place, and uses them, to minimize or eliminate process safety risk. So, apply your controls, and have the processes in place to apply your controls. Fourth, ensure that the business has the right processes to receive and respond to reports of incidents, hazards or other process safety issues, and respond in a timely manner. This is really important. Incident databases are not just a one-way thing. You don't just put in, you've got to take out.
Then, fifth, ensure that there is a process to comply with all the legal duties you have. So, you've got to comply with the law. And lastly, have a verification process that confirms you've done the previous five things I've talked about. A lawyer in Australia called Michael Tooma summarizes these as: know, understand, resource, monitor, comply and verify. And that's become shorthand for us all in Australia in terms of focusing on what due diligence is. So, know your safety issues, understand your risks and your business, resource your business to manage that risk, monitor it and respond back to reports, comply with the law, and verify that you've done it. It's a really simple framework. It's a way of thinking, a way of approaching and understanding your business. And that's why, even if it is not a legal requirement where you are, just take a look at it. It's really useful.
Traci: Absolutely. And thinking about the Columbia outcome through the due diligence approach, as I'm listening to you, obviously resource and monitor were pretty big problems among those six elements. How would the due diligence approach have changed this outcome, do you think?
Trish: Yeah. So, I now get to play Captain Hindsight. And to be fair to the people involved, they didn't have the benefit of hindsight that we have, having read the report and delved into it. So, take what I'm about to say with a grain of salt, because I'm now projecting some ideas onto this. But if we go through the elements one by one, starting with know: the managers didn't really have knowledge of the capability of the imaging that was available to them. They didn't know about some of the safety resources available to them, because they didn't know the capability that the Defense department had, as an example.
So, you want to be asking questions in your business about how process safety is being managed. Do you actually know the key critical pieces of information about your process safety that you should, because you may need that information one day to be able to intervene in an incident? If we think about understand: the management didn't understand the hazards and risks of the foam strike. They discounted it as not a safety-of-flight issue. It's not a problem. It's never been a problem before. It's not going to be a problem in the future. They didn't understand the risk that they were talking about.
And so, what we need to be doing is asking questions to really understand the risk and understand what our controls are, and then do control verification. How do we know our controls are being used and are effective? Because if we don't know that our controls are effective, then how can we rely on them? It's even worse if we don't know whether they're actually being used. Are they being bypassed? Are they perhaps impossible to implement? Sometimes we engineers, in our ivory towers, write processes we think are most useful and brilliant, we put these things in place, and the poor worker can't actually implement the control. It's not possible. So, make sure we really ask questions to understand those things.
Then we get to resource. This is where the imaging resources were not readily available, or their use wasn't encouraged, and the modeling software just was not fit for purpose for what they were trying to do. The engineers weren't competent in its use. So overall, the organization was not adequately resourced in the management of its process safety at the time. So, you need to really think about whether your teams have the resources they need and whether they're competent in their use. Can they actually use them to make a difference? That is a really important part.
From the monitor perspective: I talked about the collision avoidance system. It detected something, but it wasn't investigated. It was just dismissed as a bit of space junk, a bit of debris somewhere. They didn't collect the additional video from on board, so they lost an important piece of information there. And then lastly, the assessment team didn't mandate those images. When they were told they couldn't have them, they just went, "Well, we can't do anything else." They didn't keep fighting.
Now, that's not on those individuals because I'm sure you can all imagine the pressure, the intensity and the fear that they worked under. This is not a comment on those individuals by any stretch. But in our businesses, how are we reviewing the types of incidents that occur and really understanding the root causes, and really understanding what control measures failed and why? Why did those control measures fail? And how can we ensure that we provide that right information back to the workforce? We fix those control measures and we tell people about it so they know what's going on. They need to understand all these things. We need that communication.
Comply. So, when the launch or the debris assessment team went to first get their images, they didn't comply with the process. They didn't follow the chain of command. Maybe they thought they weren't going to get them if they followed the chain of command, so they thought they'd go to their friends and bypass the process, ask around. But in the end, that didn't work for them because their images were denied on the basis of, well, you didn't follow the right process.
Now, there's hubris in that: you didn't follow the process, so I'm not going to let you do it now. It's not a great way to lead. But again, this is not a reflection on those people. This is a real-life incident that I have the benefit of hindsight in looking at, so I don't mean to be unfair to the people involved. When we're thinking about how we deal with compliance in our business, I often prefer to talk about performance rather than compliance. I'm more of a performance-based regime person than a compliance-based regime person. But fundamentally, even in a performance-based regime, there is basic compliance you have to meet. You need to know your legal requirements. You need to know what the rules are, and you need to make sure that compliance is embedded in the work you do so that you are following the necessary rules.
The rules are there because we've had incidents in the past, and our rules and legislation and our standards are all written in people's blood. Let's respect and honor that and make sure we follow those rules. And then lastly, the verification step. The launch management didn't really delve into why the debris assessment team was so concerned. They just sort of went, oh, they want these images, and they're not asking for them properly. They weren't verifying why the team wanted the images, what their concern was. Why was this group of engineers so concerned about something? Shouldn't that be a red flag, when your technical experts are getting really, really uncomfortable about something? As a leader, that says there's something you are missing if they're really worried and you don't see it.
And so, I'd encourage the leaders out there, participate in audit and inspection processes. You need to review the findings, understand the gaps, understand when the gaps are closed. Are they effectively closed? How are you verifying the effectiveness of actions that have been taken in your organization? And so, that's just an example of how, if we looked at those six elements, that you can see every one of them had a way that it played out within this story.
How to Apply Lessons from the Columbia Disaster to the Chemical Industry
Traci: Now, how can we apply all of this into the chemical industry? Obviously, there are very poignant ways to do so, but are there some nuances that we can learn from this?
Trish: There are. And for leaders out there, it really comes down to one of the key leadership traits that I think is so important and that we don't pay enough attention to, and that is curiosity. You need to be curious. You need to ask questions. You need to hunt down the answers. You need to be willing to admit you don't know and go and find the information. Admitting you don't know something is not a weakness. It's actually a sign of strength. It's knowing your limitations and knowing that you need to learn something else. So, you need to be getting into your businesses and really asking questions and understanding.
If you've got people that are concerned, why? All these people have enough of a day job to occupy themselves. They don't need to create work to stress themselves out. They've got enough of that. So, if they're getting really stressed, there's something underlying it. As a leader, you need to understand it and see if there's a specific action you should be taking or authorizing to deal with the issue. So, encourage curiosity. Curiosity is not a bad thing; it's a positive thing. We need to be curious, we need to ask questions. And if there's one last thing I could leave you with, it's this: go and ask questions. Spend your time talking to people, asking questions, and genuinely listening to the answers.
Traci: Well, Trish, you always help us avoid assumptions, always encourage us to ask questions and to be humble and open to hearing the answers, and we appreciate that. Unfortunate events happen all over the world, and we will be here to discuss and learn from them. Subscribe to this free award-winning podcast so you can stay on top of best practices. You can also visit us at chemicalprocessing.com for more tools and resources aimed at helping you run efficient and safe facilities. On behalf of Trish, I'm Traci, and this is Process Safety With Trish and Traci. Thanks, Trish.
Trish: Stay safe.