Replication and Reproducibility in English Corpus Linguistics

Organizers: Martin Schweinberger (Arctic University of Norway) and Joseph Flanagan (University of Helsinki)

One of the hallmarks of academic research is that its results are said to be reproducible. This notion of reproducibility covers two somewhat distinct senses of the term. First, it refers to the ability of an outside researcher to obtain measurements identical to those of the original study by applying the same methods to the original dataset. We’ll refer to this sense of reproducibility as “verification”: are the reported numbers in a study accurate? Second, it refers to the ability of researchers to obtain the “same” results as a reported study, within some reported margin of error, using procedures and data closely matched to that study. We’ll refer to this sense of reproducibility as “robustness”: are the results of a particular study generalizable?

Within the last decade, there has been growing concern in many of the natural and social sciences that the results of much scientific research are neither verifiable nor robust. One study (Nuijten et al., 2016) found that 1 in 8 studies contained data-reporting errors that may have affected their conclusions. More famously, a special issue of Social Psychology edited by Nosek and Lakens (see Nosek & Lakens, 2014) found that the results of 11 out of 14 studies were significantly different from the reported outcome of the original study. These and other findings have increasingly led researchers in many fields to talk about a crisis that threatens to undermine the credibility of academic research.

Our workshop addresses the relevance of the reproducibility crisis for English corpus linguistics. We propose to combine best practices from (English) corpus linguistic methodology, statistics, and software engineering to tackle problems that have been identified in other fields and may also apply to our own. The possible solutions we will propose include:

  • being aware of the methodological and statistical choices a researcher makes that may lead to an over-estimation of the robustness of the results of a particular study;
  • recognizing the respective benefits and limitations of exploratory and confirmatory analyses;
  • emphasizing the value of publishing replication studies and null results;
  • being aware of and following the FAIR principles (Findable, Accessible, Interoperable, and Reusable) in data management;
  • making use of existing infrastructure that enhances the transparency and replicability of research (e.g. using Git or Docker to share code and data);
  • using Jupyter or R Notebooks to document analyses and making them available to the community and reviewers so that their results can be verified;
  • making use of documentation and policy protocols in departments, schools and institutes to ease onboarding procedures and prevent data loss and corruption.

The workshop thus offers researchers relevant information on replication, on avoiding inefficient and irreproducible research practices, and on increasing the quality of research.


10:00 – 10:15 Joe Flanagan

Introduction – outline, context, and aims of the workshop

10:15 – 11:00 Sean Wallis

Grounding linguistic research, from raw data to sound conclusions: how can we be sure of our data, and what is the role of statistics?

11:00 – 11:15 Break

11:15 – 12:00 Martin Schweinberger

Practical session 1 – folder structure, file naming, documentation (bus factor), open repositories

See for the materials!

12:00 – 13:00 Break

13:00 – 13:45 Joe Flanagan

Practical session 2 – RStudio, Rproj, and Markdown

13:45 – 14:00 Break

14:00 – 14:30 Martin Schweinberger

Practical session 3 – integrating R with Git (version control) and renv

See for the materials!

14:30 – 14:45 Closing