Bayesian statistics as a therapy for frequentist problems

Part 1: The replication crisis
(and what Statistics has to do with it)

Jorge N. Tendeiro

November 14, 2019


Replication crisis

What do researchers think?

Baker (2016)

Causes for the crisis?

Questionable research practices
(QRPs; John, Loewenstein, & Prelec, 2012; Schimmack, 2015)

  • Omit some DVs.
  • Omit some conditions.
  • Peeking: Sequential testing — Look and decide:
    • \(p > .05\): Collect more.
    • \(p < .05\): Stop.
  • Only report \(p<.05\) results.
  • \(p\)-hacking: E.g.,
    • Exclusion of outliers depending on whether \(p<.05\).
    • \(p = .054 \longrightarrow p = .05\).
  • HARKing (Kerr, 1998): Convert exploratory results into research questions.
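The effect of peeking can be illustrated with a small simulation. This is a minimal sketch, not taken from the slides: it assumes normally distributed data with a known standard deviation of 1, a two-sided z-test of a zero mean, and a hypothetical schedule of looks after every 10 observations up to 100. All function names and settings are illustrative choices. The null hypothesis is true in every simulated study, so every "significant" result is a false positive.

```python
import math
import random
from statistics import NormalDist


def two_sided_p(sample):
    """Two-sided z-test p value for H0: mean = 0, assuming a known sd of 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def peeking_study(rng, start_n=10, step=10, max_n=100, alpha=0.05):
    """Look after every batch: stop at p < alpha, otherwise collect more."""
    sample = [rng.gauss(0, 1) for _ in range(start_n)]
    while True:
        if two_sided_p(sample) < alpha:
            return True  # stopped early and declared "significant"
        if len(sample) >= max_n:
            return False  # ran out of participants without p < alpha
        sample.extend(rng.gauss(0, 1) for _ in range(step))


def fixed_n_study(rng, n=100, alpha=0.05):
    """Collect all the data first, then test exactly once."""
    sample = [rng.gauss(0, 1) for _ in range(n)]
    return two_sided_p(sample) < alpha


def false_positive_rate(study, n_sims=5000, seed=1):
    """Share of simulated null studies that end with p < alpha."""
    rng = random.Random(seed)
    return sum(study(rng) for _ in range(n_sims)) / n_sims


peeking_rate = false_positive_rate(peeking_study)
fixed_rate = false_positive_rate(fixed_n_study)
print(f"false positive rate with peeking: {peeking_rate:.3f}")
print(f"false positive rate with fixed n: {fixed_rate:.3f}")
```

Under these assumptions the fixed-n design should stay close to the nominal 5%, while the look-and-decide design should end up well above it, because every extra look is another chance to cross the threshold.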

Causes for the crisis?

Researcher’s degrees of freedom

  • Researchers have a multitude of decisions to make (experiment design, data collection, analyses performed);
    Wicherts et al. (2016), Steegen, Tuerlinckx, Gelman, & Vanpaemel (2016).

  • It is very possible to manipulate results in favor of one’s interests.

  • This is now known as researcher’s degrees of freedom (Simmons, Nelson, & Simonsohn, 2011).

  • Consequence: Inflated false positive findings (Ioannidis, 2005).
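One way to see how such flexibility inflates false positives is to simulate a study with several dependent variables and report it as a success whenever any of them reaches \(p<.05\). The sketch below is an illustration under stated assumptions, not a reproduction of any cited analysis: the DVs are independent, there is no true group difference, the standard deviation is known to be 1, and the group size of 20 is arbitrary. With \(k\) independent tests the chance of at least one false positive is \(1-(1-\alpha)^k\), and the simulation is compared against that value.

```python
import math
import random
from statistics import NormalDist, mean


def group_difference_p(rng, n_per_group):
    """Two-sided z-test p value for one DV when H0 is true (known sd of 1)."""
    control = [rng.gauss(0, 1) for _ in range(n_per_group)]
    treatment = [rng.gauss(0, 1) for _ in range(n_per_group)]
    z = (mean(treatment) - mean(control)) / math.sqrt(2 / n_per_group)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def any_dv_significant_rate(n_dvs, n_per_group=20, n_sims=4000,
                            alpha=0.05, seed=2):
    """Share of null studies in which at least one DV gives p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        p_values = [group_difference_p(rng, n_per_group) for _ in range(n_dvs)]
        hits += min(p_values) < alpha
    return hits / n_sims


def analytic_rate(n_dvs, alpha=0.05):
    """Chance of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_dvs


for k in (1, 3, 5):
    print(f"{k} DVs: simulated {any_dv_significant_rate(k):.3f}, "
          f"analytic {analytic_rate(k):.3f}")
```

Real DVs are usually correlated, which makes the inflation smaller than in this independent case, but the direction of the effect is the same.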

Causes for the crisis?

Turning exploratory into confirmatory analysis

“(…) [L]et us (…) become intimately familiar with (…) the data. Examine them from every angle. Analyze the sexes separately. Make up new composite indices. If a datum suggests a new hypothesis, try to find further evidence for it elsewhere in the data. If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief. If there are participants you don’t like, or trials, observers, or interviewers who gave you anomalous results, drop them (temporarily). Go on a fishing expedition for something– anything– interesting.”

Bem (2004)

This is not OK unless the exploration is explicitly stated.

Causes for the crisis?

Turning exploratory into confirmatory analysis

Brian Wansink’s description of the efforts of a visiting Ph.D. student:

I gave her a data set of a self-funded, failed study which had null results (…). I said, “This cost us a lot of time and our own money to collect. There’s got to be something here we can salvage because it’s a cool (rich & unique) data set.” I had three ideas for potential Plan B, C, & D directions (since Plan A had failed). I told her what the analyses should be and what the tables should look like. I then asked her if she wanted to do them.

Every day she came back with puzzling new results, and every day we would scratch our heads, ask “Why,” and come up with another way to reanalyze the data with yet another set of plausible hypotheses. Eventually we started discovering solutions that held up regardless of how we pressure-tested them. I outlined the first paper, and she wrote it up (…). This happened with a second paper, and then a third paper (which was one that was based on her own discovery while digging through the data).

Causes for the crisis?

Lack of replications

Until very recently (Makel, Plucker, & Hegarty, 2012).

  • Very low rate of replications in Psychology (estimated ~1%).
  • Until 2012, the majority of replications were actually successful!
  • However, in most cases both the original and replication studies shared authorship…
  • Conflict of interest?…

Famous replication failures

Is it really that bad?…


  • Martinson, Anderson, & de Vries (2005): “Scientists behaving badly”.
  • Fanelli (2009): Meta-analysis shows evidence of scientific misconduct.
  • John et al. (2012): Evidence for QRPs in psychology.
  • Mobley, Linder, Braeuer, Ellis, & Zwelling (2013): Reported evidence of pressure to find significant results.
  • Agnoli, Wicherts, Veldkamp, Albiero, & Cubelli (2017): Evidence of QRPs, now in Italy.
  • Fraser, Parker, Nakagawa, Barnett, & Fidler (2018): In other fields of science.

Interestingly, scientific misconduct has been a long-standing concern (see Babbage, 1830).

Is it really that bad?…

For the sake of balance, not everyone agrees (e.g., Fiedler & Schwarz, 2016).

Use / abuse / misuse of Statistics

\(p\) values


Probability of an effect at least as extreme as the one we observed, given that \(\mathcal{H}_0\) is true.

\[\fbox{$ p\text{-value} = P\left(X_\text{obs} \text{ or more extreme}|\mathcal{H}_0\text{ true}\right) $}\]

The definition is simple enough, right?…
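As a worked instance of the definition, the sketch below computes an exact two-sided \(p\) value for a hypothetical coin-tossing experiment (60 heads in 100 tosses, with \(\mathcal{H}_0\): the coin is fair). The example, the function names, and the choice to measure "more extreme" as the distance from the expected count are assumptions made for illustration. The sum runs over all outcomes at least as far from the expected count as the observed one, with probabilities computed under \(\mathcal{H}_0\).

```python
from math import comb


def binomial_pmf(k, n, prob):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * prob**k * (1 - prob) ** (n - k)


def two_sided_p_value(observed, n, prob_h0=0.5):
    """P(outcome at least as far from n * prob_h0 as observed | H0 true)."""
    expected = n * prob_h0
    observed_distance = abs(observed - expected)
    return sum(
        binomial_pmf(k, n, prob_h0)
        for k in range(n + 1)
        if abs(k - expected) >= observed_distance
    )


p_value = two_sided_p_value(observed=60, n=100)
print(f"p = {p_value:.4f}")
```

Every probability in the sum is conditional on \(\mathcal{H}_0\) being true. Nothing in the computation gives the probability that \(\mathcal{H}_0\) itself is true.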

Test yourself

Consider the following statement (Falk & Greenbaum, 1995; Gigerenzer, Krauss, & Vitouch, 2004; Haller & Krauss, 2002; Oakes, 1986):

Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say, 20 subjects in each sample). Furthermore, suppose you use a simple independent means \(t\)-test and your result is significant (\(t = 2.7\), \(df = 18\), \(p = .01\)). Please mark each of the statements below as “true” or “false.” False means that the statement does not follow logically from the above premises. Also note that several or none of the statements may be correct.

Try it!

Confidence intervals

A better alternative?

  • Confidence intervals (CIs) have often been advocated as the best inferential alternative to NHST.

  • Recall, for example, the Wilkinson Task Force (Wilkinson & Task Force on Statistical Inference, 1999):

“(…) it is hard to imagine a situation in which a dichotomous accept–reject decision is better than reporting an actual \(p\) value or, better still, a confidence interval.”

  • But are CIs really a better alternative?


See, for instance, Hoekstra, Morey, Rouder, & Wagenmakers (2014).

A (say) 95% CI is a numerical interval found through a procedure that, if repeated across a series of hypothetical data sets, yields an interval covering the true parameter 95% of the time.

  • A CI indicates a property of the performance of the procedure used to compute it:
    How is the procedure expected to perform in the long run?

  • A CI for a parameter is constructed around the parameter’s estimate.

  • However, a CI does not (really not!) directly indicate a property of the parameter being estimated!
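The long-run reading of a CI can be checked by simulation. The following is a minimal sketch under assumptions that are not in the slides: normal data, a known standard deviation, a z-based interval for the mean, and arbitrary illustrative values for the true mean, the standard deviation, and the sample size. It repeats the procedure over many hypothetical data sets and counts how often the interval covers the true mean.

```python
import math
import random
from statistics import NormalDist, mean


def z_interval(sample, sigma, level=0.95):
    """CI for the mean of normal data, assuming sigma is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre = mean(sample)
    half_width = z * sigma / math.sqrt(len(sample))
    return centre - half_width, centre + half_width


def coverage(true_mean=100.0, sigma=15.0, n=25, n_sims=10000, seed=3):
    """Share of repeated samples whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        lower, upper = z_interval(sample, sigma)
        hits += lower <= true_mean <= upper
    return hits / n_sims


coverage_rate = coverage()
print(f"share of intervals covering the true mean: {coverage_rate:.3f}")
```

The share should land close to 0.95, and that figure describes the procedure. Any single interval from the loop either contains the true mean or it does not.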

You’re not alone…

Test yourself

From Hoekstra et al. (2014), mimicking the \(p\) value study by Gigerenzer et al. (2004).

Try it!

\(p\) values – Results


But how did students and teachers perceive these statements?

This was in 2004. But things have not improved since…

Goodman (2008)