What is the probability that a pump failure will ruin your Christmas? If we assume all pumps operate in series, like lights on a Christmas tree, and two of 50 pumps fail in a typical year, the probability of a failure on any given day is 2/365. The probability that all pumps will operate simultaneously is (1-2/365)^{50}, or about 76%. Don’t plan a long trip to the Bahamas.
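The arithmetic is easy to check. Here is a minimal sketch, assuming the pumps fail independently and using the 2/365 daily probability from the example; the function name `prob_all_running` is made up for illustration.

```python
def prob_all_running(n_pumps: int, p_fail: float) -> float:
    """Probability that all n_pumps independent pumps in series are running."""
    return (1.0 - p_fail) ** n_pumps


# Assumption taken from the text: each pump has a 2/365 chance of failing on a given day.
p_daily = 2 / 365

print(f"All 50 pumps running: {prob_all_running(50, p_daily):.0%}")
```

Changing `n_pumps` or `p_fail` shows how quickly a long series chain erodes reliability.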

Now, suppose you install spare pumps that can be switched in easily and reliably. Let’s disregard poor suction connections, unreliable heat tracing, blocked strainers, switchgear problems, etc., and presume the spares are well maintained. A pump now shuts down the plant only if it and its spare are both out of service, which has a probability of (2/365)^{2}, or about 0.00003. The probability that all 50 services keep running is (1-0.00003)^{50}, or about 99.85%, so the chance of a shutdown is effectively zero. Take that vacation! This example illustrates the value of a fundamental understanding of statistics.
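The spare-pump case can be sketched the same way. This assumes a pump and its spare fail independently, each with the 2/365 daily probability from the first example, so a service is lost only when both are down. The function name is illustrative only.

```python
def prob_all_running_with_spares(n_services: int, p_fail: float) -> float:
    """Probability that all services run when each pump has one independent spare."""
    # A service is down only if the pump AND its spare are down at the same time.
    p_pair_down = p_fail ** 2
    return (1.0 - p_pair_down) ** n_services


p_daily = 2 / 365  # assumption carried over from the series example
p_ok = prob_all_running_with_spares(50, p_daily)

print(f"Pump and spare both down: {p_daily ** 2:.5f}")
print(f"All 50 services running:  {p_ok:.2%}")
print(f"Chance of a shutdown:     {1 - p_ok:.2%}")
```

The independence assumption is the weak point: a shared suction line or common switchgear would make the real figure worse.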

Let’s start with the basics: sampling. It’s easy to trip up on the terminology; replicates, variants and specimens all mean the same thing: repeated tests under the same conditions. You want enough replicates per test to compare them among themselves and to identify systemic failures in testing. If there is a high risk of loss of the specimen, use 4–5 as a minimum; 3 is the minimum for cross-comparison. When in uncharted territory, do screening experiments to bolster your familiarity and to improve efficiency of testing. Examining whether all the samples collected belong in the same population is called analysis of variance (ANOVA). Use the F distribution at a 95% confidence level to compare sample averages of variants, and the χ^{2} (chi-square) distribution to check for normality. You can use process data for analysis, but then you can’t check for systemic problems or detect inter-dependence between data points. Bored?
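For the curious, a one-way ANOVA comes down to one ratio: the variance between group averages divided by the variance within groups. The sketch below computes that F statistic by hand. The three groups of replicates are hypothetical numbers chosen only to keep the arithmetic visible, and the function name is made up. Comparing F with the critical value at the 95% level still requires an F table or a statistics package.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of replicate groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Sum of squares between group averages and within each group.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical data: three test conditions, three replicates each.
replicates = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_stat, df_b, df_w = one_way_anova_f(replicates)
print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

A large F relative to the tabulated value suggests the group averages do not all come from the same population.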

Consider the combination. Suppose you are trying to figure out the load on a manual vent system involving six tanks. Let’s assume initially that flows are equal. You know that if more than three flows occur at the same time the vent will reach a choked condition. How many triplets are there? Because the order of the tanks within a triplet doesn’t matter, this is a combination rather than a permutation: _{6}C_{3} = 6!/[(6-3)!3!] = 20. While it’s unlikely that all flows are equal, now you know you have twenty different combinations to model to decide which triplet is the worst case.
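Listing the triplets is a one-liner once you treat them as unordered selections. A minimal sketch using Python’s standard library follows; the tank labels are hypothetical placeholders.

```python
from itertools import combinations
from math import comb

# Hypothetical labels for the six tanks on the vent header.
tanks = ["T1", "T2", "T3", "T4", "T5", "T6"]

# Every unordered group of three tanks that could vent at the same time.
triplets = list(combinations(tanks, 3))

print(comb(len(tanks), 3), "triplets to model")
for triplet in triplets:
    print(triplet)
```

Each tuple in `triplets` is one case to run through the vent model, with the actual flow for each tank substituted in.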

Lastly, let’s look at regression analysis. An equation that expresses a property as a function of temperature and pressure is convenient in process modeling. Regression equations can emulate pump curves or control valves in a simulation. In fitting a curve you have several choices: 1) use a global equation to fit the entire curve, perhaps badly; 2) fit the curve in parts; 3) linearize the curve using 1/X forms or logarithms; or 4) use a generic polynomial fit. The last option generally isn’t recommended unless the curve is boxed in; a good example of a boxed-in curve is one that has a 0–100% abscissa, which is typical for a control valve. Only interpolate polynomials, never extrapolate them. A global equation sometimes is chosen, despite a poor fit, because switching regression equations within a model can be difficult. This was a common problem with old DCS program languages.
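Option 3 is worth a quick illustration. The sketch below linearizes a curve of the assumed form y = exp(a + b/x) by regressing ln(y) against 1/x with ordinary least squares. The functional form, the data points and the function names are all hypothetical, chosen so the fit can be checked against known coefficients.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data generated from y = exp(2 - 300/x), for illustration only.
x_data = [250.0, 300.0, 350.0, 400.0, 450.0]
y_data = [math.exp(2.0 - 300.0 / x) for x in x_data]

# Linearize: ln(y) = a + b * (1/x), then fit a straight line.
a, b = fit_line([1.0 / x for x in x_data], [math.log(y) for y in y_data])


def predict(x):
    """Evaluate the fitted curve; intended for use inside the fitted x range."""
    return math.exp(a + b / x)


print(f"a = {a:.3f}, b = {b:.1f}, y(375) = {predict(375.0):.4f}")
```

With real, noisy data the fit minimizes error in ln(y) rather than in y, so it is worth checking the residuals in the original units.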

How do you know if your regression equation is valid? The simplest tool is the Z-score. Let’s say you have 20 data points you want to use to develop a regression equation. First, examine them for any deviations, assuming your points are independent, free of systemic error and fit a normal distribution. Before moving on to the Z-score, ensure all of your points represent averages; ideally, each point should be based on the same number of replicates. Next, calculate the Z-score: Z = (y - Y-bar)/σ, where Y-bar is the average of all points, Y-bar = Σy/N; N is the number of points; and σ is the standard deviation, σ = [Σ(y - Y-bar)^{2}/(N-1)]^{½} (for a small sample). So, if an observed value (y) is 4, and σ is 0.521 with an average of 2.33, then the Z-score for the observed value is Z = (4-2.33)/0.521 = 3.2. Because 99.7% of data in a normal distribution should have a Z-score within ±3, this point is suspicious; scrutinize it and, if appropriate, toss it out as an outlier. Also investigate borderline values, i.e., those with a Z-score beyond ±2.75. Plot the Zs to look for systemic problems, e.g., more than 4 points in sequence on the same side of the average suggests interdependence. Outlier elimination can be done only once. In the 1980s it was discovered that a famous petroleum oil property regression from the 1940s had been purged of outliers over and over again. It was discarded and a professor was discredited.
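The screening steps are simple enough to script. Here is a minimal sketch using Python’s `statistics` module, whose `stdev` uses the N-1 (small-sample) form. The ±3 and ±2.75 limits and the run length of more than 4 come from the text; the function names are illustrative.

```python
from statistics import mean, stdev


def z_scores(values):
    """Z = (y - Y-bar) / sigma for each point, using the N-1 standard deviation."""
    y_bar = mean(values)
    sigma = stdev(values)
    return [(y - y_bar) / sigma for y in values]


def classify(z, outlier=3.0, borderline=2.75):
    """Label a point by the limits given in the text."""
    if abs(z) > outlier:
        return "outlier"
    if abs(z) > borderline:
        return "borderline"
    return "ok"


def longest_run_same_side(zs):
    """Length of the longest sequence of consecutive Zs on one side of the average."""
    longest = current = 0
    prev_sign = 0
    for z in zs:
        sign = (z > 0) - (z < 0)
        if sign != 0 and sign == prev_sign:
            current += 1
        else:
            current = 1 if sign != 0 else 0
        prev_sign = sign
        longest = max(longest, current)
    return longest


# Worked example from the text: y = 4, average = 2.33, sigma = 0.521.
z_example = (4 - 2.33) / 0.521
print(f"Z = {z_example:.1f} -> {classify(z_example)}")
```

Run `z_scores` once on the full data set, review anything flagged, and check whether `longest_run_same_side` exceeds 4 before trusting the regression.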