[pullquote]Technical journals play a crucial role in the effective communication of experimental results and, thus, the progress of science and engineering. So, publication of properly reviewed and accurate information is paramount. Unfortunately, some journals choose content based on inadequate peer reviewing or without any real evaluation whatsoever. (See “Hoax Highlights Publishing Problem.”) Now, an investigation by a team at Tilburg University, Tilburg, the Netherlands, has found that many of the papers they checked have statistical mistakes.
The team focused on null hypothesis significance testing — a technique to determine whether a result from a given set of statistics is statistically significant. The Dutch investigators developed a program called statcheck, downloadable at http://statcheck.io, to check files in either PDF or HTML format for errors in statistical reporting. The software automatically extracts statistics from a paper and recalculates the p (probability) value — the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true. Many researchers treat p ≤ 0.05 as indicating a “true” finding.
In a recent posting on “Retraction Watch,” a site that tracks retractions of scientific papers, Michèle Nuijten, a PhD student at the university who helped develop the program, summarized the reason for developing it as well as the troubling results the software has uncovered.
“We knew we would never be able to program statcheck in such a way that it would be as accurate as a manual search, but that wasn’t our goal. Our goal was to create an unbiased tool that could be used to give an indication of the error prevalence in a large sample, and a tool that could be used in your own work to flag possible problems.”
“Unfortunately, we recently found that our suspicions [of reporting inconsistencies] were correct: Half of the papers in psychology contain at least one statistical reporting inconsistency, and one in eight papers contain an inconsistency that might have affected the statistical conclusion.”
Fuller details appear in “The Prevalence of Statistical Reporting Errors in Psychology (1985–2013).”
These disturbing findings relate only to papers published in experimental psychology. Articles in research journals focused on chemical engineering and chemistry, one would hope, are less prone to such errors, given the technical background of most authors. However, how can we be sure?
Nuijten notes: “At the moment statcheck can only read results reported in APA (American Psychological Association) style (e.g., t(28) = 3.21, p < .05; or more generally: test statistic (degrees of freedom) =/> …, p =/> …).”
Extending such a program to other article styles would require two “ingredients,” she explains:
1. A standardized reporting style. “The way statcheck works now is that it looks for very specific combinations of letters, numbers and things like parentheses and equal signs, etc. For each specific statistical test that it can read, we had to program separate lines of code.”
2. A way to recalculate numbers from the extracted results. “The statistics that statcheck reads now consist of three parts that are internally consistent. That means that if I have two of the three parts, I can calculate the third to check if the reported result is indeed consistent.”
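The two ingredients can be made concrete with a minimal Python sketch. To be clear, this is not statcheck’s actual code — the regular expression, the rounding tolerance and the `check` function here are illustrative assumptions — but it shows the idea: parse an APA-style t-test result, then recompute the two-tailed p value from the other two parts (the test statistic and the degrees of freedom) and compare it with the reported claim.

```python
import math
import re

# Ingredient 1: a standardized reporting style to parse.
# Hypothetical regex for APA-style t-test results such as "t(28) = 3.21, p < .05".
APA_T = re.compile(
    r"t\((?P<df>\d+)\)\s*=\s*(?P<t>-?\d+\.?\d*),\s*p\s*(?P<rel>[<=>])\s*(?P<p>\.?\d+\.?\d*)"
)

def t_two_tailed_p(t_stat, df):
    """Two-tailed p value for a t statistic, found by numerically
    integrating the Student-t density (standard library only)."""
    t_stat = abs(t_stat)
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    density = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    # Trapezoidal integration of the upper tail; 50 units past the
    # statistic is far enough out that the remaining mass is negligible.
    n, hi = 20000, t_stat + 50.0
    h = (hi - t_stat) / n
    tail = sum(density(t_stat + i * h) for i in range(1, n))
    tail = h * (tail + (density(t_stat) + density(hi)) / 2)
    return 2 * tail

def check(result_string):
    """Ingredient 2: recompute the third part from the other two and
    test whether the reported claim is internally consistent."""
    m = APA_T.search(result_string)
    if not m:
        return None
    df, t_stat = int(m.group("df")), float(m.group("t"))
    reported_p, rel = float(m.group("p")), m.group("rel")
    computed_p = t_two_tailed_p(t_stat, df)
    consistent = {
        "<": computed_p < reported_p,
        ">": computed_p > reported_p,
        "=": abs(computed_p - reported_p) < 0.005,  # rounding tolerance
    }[rel]
    return computed_p, consistent

p, ok = check("t(28) = 3.21, p < .05")
print(f"recomputed p = {p:.4f}, consistent with report: {ok}")
```

Running this on the example from the quote above confirms that the reported claim “p < .05” agrees with the recomputed p value — exactly the kind of automated consistency check the Dutch team describes.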
The Dutch team deserves praise for their initiative in developing a tool that can make spotting statistical mistakes easier. Let’s hope software of this type becomes more widely available and applicable.
However, the very existence of a website like Retraction Watch underscores that articles with flawed research (not just related to statistics) remain a persistent problem — one that occurs more often than most of us would care to believe. The “publish or perish” environment with which many researchers must contend probably will only get worse in the foreseeable future. So, the importance of rooting out inaccurate work surely will increase.
High-quality research journals take this responsibility very seriously. Unfortunately, too many readers assume that all journals are equally diligent. That’s another critical flaw that needs attention.