10 August 2019

Slides at: https://rebrand.ly/Nagoya2019-Part1
GitHub: https://github.com/jorgetendeiro/Nagoya-Workshop-10-Aug-2019




Fraud = scientific misconduct.

  • Falsifying or fabricating data.
  • This is intentional, not accidental.
  • Casts all of science in a bad light.
  • Markedly different from QRPs (next).

Notable examples

Today we don’t talk about fraud explicitly.

We talk about something much harder to identify and eradicate:

Questionable research practices (QRPs).

Questionable research practices


Term coined by John, Loewenstein, & Prelec (2012).
See also Simmons, Nelson, & Simonsohn (2011).

  • Not necessarily fraud.
  • Includes the (ab)use of actually acceptable research practices.
  • Problem with QRPs:
    • Introduce bias (typically, in favor of the researcher’s intentions…).
    • Inflated power at the cost of inflated Type I error probability (\(\gg 5\%\)).
    • Results not replicable.

Example of QRPs

(John et al., 2012; Schimmack, 2015).

  • Omit some DVs.
  • Omit some conditions.
  • Peeking: Sequential testing — Look and decide:
    • \(p > .05\): Collect more.
    • \(p < .05\): Stop.
  • Only report \(p<.05\) results.
  • \(p\)-hacking: E.g.,
    • Exclusion of outliers depending on whether \(p<.05\).
    • \(p = .054 \longrightarrow p = .05\).
  • HARKing (Kerr, 1998): Hypothesizing After the Results are Known, i.e., presenting exploratory findings as if they had been predicted a priori.
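The "peeking" bullet above can be made concrete with a small Monte Carlo sketch (not part of the original slides; sample sizes and the number of looks are illustrative assumptions). Under a true null hypothesis, repeatedly testing and collecting more data whenever \(p > .05\) inflates the Type I error rate well beyond the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_trial(n_start=20, n_step=10, n_max=100, alpha=0.05):
    """One simulated study under the null (both groups N(0, 1)):
    test, and if p > alpha, collect n_step more observations per
    group; stop at p < alpha or at n_max per group (QRP: peeking)."""
    a = list(rng.standard_normal(n_start))
    b = list(rng.standard_normal(n_start))
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < alpha or len(a) >= n_max:
            return p < alpha  # True = (false) positive finding
        a.extend(rng.standard_normal(n_step))
        b.extend(rng.standard_normal(n_step))

n_sims = 2000
false_positives = sum(peeking_trial() for _ in range(n_sims))
# With repeated looks, the realized Type I error rate is typically
# well above the nominal 0.05, even though the null is always true.
print(f"Type I error with peeking: {false_positives / n_sims:.3f}")
```

A fixed-n design tested once would keep the error rate at the nominal 5%; it is the data-dependent stopping rule that produces the inflation.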

Researcher degrees of freedom

  • Researchers have a multitude of decisions to make (experimental design, data collection, analyses performed; Wicherts et al., 2016; Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016).
  • It is very possible to manipulate results in favor of one’s interests.
  • This is now known as researcher degrees of freedom (Simmons et al., 2011).
  • Consequence: Inflated false positive findings (Ioannidis, 2005).
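A minimal simulation of one such degree of freedom (not from the slides; the number of dependent variables and sample size are assumptions for illustration): measure several DVs, all with a true null effect, and report whichever one happens to come out significant. The false positive rate rises far above 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def flexible_study(n=30, n_dvs=5, alpha=0.05):
    """The null is true for every DV; the 'finding' is whichever DV
    happens to reach p < alpha (researcher degrees of freedom)."""
    pvals = [stats.ttest_ind(rng.standard_normal(n),
                             rng.standard_normal(n)).pvalue
             for _ in range(n_dvs)]
    return min(pvals) < alpha

n_sims = 2000
rate = sum(flexible_study() for _ in range(n_sims)) / n_sims
# With 5 independent DVs the chance of at least one p < .05 is
# roughly 1 - 0.95**5, i.e., over 20%, not the nominal 5%.
print(f"False positive rate with 5 DVs to choose from: {rate:.3f}")
```

The same mechanism applies to choices among conditions, covariates, and outlier rules: each added option multiplies the chances of a spurious significant result.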

Fried (2017)

  • The 7 most common depression scales jointly contain 52 symptoms.
  • That makes 7 different sum scores.
  • Yet all are interpreted as ‘level of depression’.

Turning exploratory into confirmatory analysis

From Bem (2004):

“(…) [L]et us (…) become intimately familiar with (…) the data. Examine them from every angle. Analyze the sexes separately. Make up new composite indices. If a datum suggests a new hypothesis, try to find further evidence for it elsewhere in the data. If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief. If there are participants you don’t like, or trials, observers, or interviewers who gave you anomalous results, drop them (temporarily). Go on a fishing expedition for something – anything – interesting.”

This is not OK unless the exploration is explicitly stated.

Daryl Bem is the author of the famous 2011 precognition paper
(data used in Part 2 of today’s workshop).

A now famous example…

Prof. Brian Wansink at Cornell University.

His description of the efforts of a visiting Ph.D. student:

I gave her a data set of a self-funded, failed study which had null results (…). I said, “This cost us a lot of time and our own money to collect. There’s got to be something here we can salvage because it’s a cool (rich & unique) data set.” I had three ideas for potential Plan B, C, & D directions (since Plan A had failed). I told her what the analyses should be and what the tables should look like. I then asked her if she wanted to do them.

Every day she came back with puzzling new results, and every day we would scratch our heads, ask “Why,” and come up with another way to reanalyze the data with yet another set of plausible hypotheses. Eventually we started discovering solutions that held up regardless of how we pressure-tested them. I outlined the first paper, and she wrote it up (…). This happened with a second paper, and then a third paper (which was one that was based on her own discovery while digging through the data).

This isn’t being creative or thinking outside the box.

This is QRPing.

What happened to Wansink?

  • He was severely criticized, his work was scrutinized (e.g., van der Zee, Anaya, & Brown, 2017).
  • Over 100 (!!) errors in a set of four papers…
  • Now has 40 (!!) retracted publications (as of July 2019).
  • After a year-long internal investigation, he was forced to resign.

Is it really that bad?…


  • Martinson, Anderson, & de Vries (2005): “Scientists behaving badly”.
  • Fanelli (2009): Meta-analysis of survey data shows evidence of scientific misconduct.
  • John et al. (2012): Evidence for QRPs in psychology.
  • Mobley, Linder, Braeuer, Ellis, & Zwelling (2013): Reported evidence of pressure to find significant results.
  • Agnoli, Wicherts, Veldkamp, Albiero, & Cubelli (2017): Evidence of QRPs, now in Italy.
  • Fraser, Parker, Nakagawa, Barnett, & Fidler (2018): In other fields of science.

Interestingly, science misconduct has been a longtime concern (see Babbage, 1830).

And for the sake of balance:
There are also some voices against this description of the current state of affairs (e.g., Fiedler & Schwarz, 2016).

Preregistration eliminates QRPs, right?…

Well, maybe not (yet).

Here’s an interesting preprint (from July 2019!) from a Japanese research group (Kyushu University):

Ikeda, A., Xu, H., Fuji, N., Zhu, S., & Yamada, Y. (2019). Questionable research practices following pre-registration [Preprint]. doi: 10.31234/osf.io/b8pw9

But why?…

Why are QRPs so prevalent?

It is strongly related to incentives (Nosek, Spies, & Motyl, 2012; Schönbrodt, 2015).

  • “Publish or perish”:
    Publish a lot, at highly prestigious journals.
    • Journals only publish a fraction of all manuscripts.
    • Journals don’t like publishing null findings…
  • Get tenured.
  • Get research grants.
  • Fame (prizes, press coverage, …).

But, very importantly, it also happens in spite of the researcher’s best intentions.

  • Deficient statistics education (yes, statisticians need to acknowledge this!…).
  • Perpetuating traditions in the field.


Threats to reproducible science

Munafò et al. (2017)

  • Hypothetico-deductive model of the scientific method.
  • In red: Potential threats to this model.

Lack of replications

Until very recently (Makel, Plucker, & Hegarty, 2012).

  • Very low rate of replications in psychology (estimated at ~1% of studies).
  • Until 2012, the majority of published replications were actually successful!!
  • However, in most cases the original and replication studies shared authors…
  • Conflict of interest?…

Famous replication failures

Didn’t we see this coming?

Meehl (1967)

How poorly we build theory (see Gelman):

“It is not unusual that (…) this ad hoc challenging of auxiliary hypotheses is repeated in the course of a series of related experiments, in which the auxiliary hypothesis involved in Experiment 1 (…) becomes the focus of interest in Experiment 2, which in turn utilizes further plausible but easily challenged auxiliary hypotheses, and so forth. In this fashion a zealous and clever investigator can slowly wend his way through (…) a long series of related experiments (…) without ever once refuting or corroborating so much as a single strand of the network.”

Say what?…

Cohen (1962)

Low-powered experiments:

“(…) It was found that the average power (probability of rejecting false null hypotheses) over the 70 research studies was .18 for small effects, .48 for medium effects, and .83 for large effects. These values are deemed to be far too small.”

“(…) it is recommended that investigators use larger sample sizes than they customarily do.”
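Cohen's power figures can be reproduced with a short sketch (not from the slides; the group size of 30 is an assumption for illustration, and the small/medium/large labels use his conventional \(d = 0.2, 0.5, 0.8\)). Power for a two-sided two-sample \(t\)-test follows from the noncentral \(t\) distribution:

```python
import numpy as np
from scipy import stats

def ttest_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample t-test, computed from the
    noncentral t distribution (equal group sizes and variances)."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)       # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value
    # P(|T| > t_crit) under the alternative:
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

# Typical group sizes leave small and medium effects badly underpowered,
# in line with Cohen's (1962) survey of the literature.
for d, label in [(0.2, "small"), (0.5, "medium"), (0.8, "large")]:
    print(f"{label:6s} (d = {d}): power = {ttest_power(d, 30):.2f}")
```

Inverting the same calculation gives Cohen's practical recommendation: to reach, say, 80% power for a medium effect, the required sample size is far larger than what was customary in the studies he surveyed.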

Kahneman (2012)