Assessing the replication landscape in experimental linguistics

Replications are an integral part of cumulative experimental science. Yet many scientific disciplines replicate little because novel confirmatory findings are valued over direct replications. To provide a systematic assessment of the replication landscape in experimental linguistics, the present study estimated replication rates for over 50,000 articles across 98 journals. We used automatic string matching via the Web of Science combined with in-depth manual inspection of 274 papers. The median rate of mentioning the search string “replicat*” was as low as 1.7%. Subsequent manual analyses of the articles containing the search string revealed that only 4% of them contained a direct replication, i.e., a study that aims to arrive at the same scientific conclusions as an initial study by using exactly the same methodology. Less than half of these direct replications were performed by independent researchers. Thus, our data suggest that only 1 in 1,250 experimental linguistics articles contains an independent direct replication. We conclude that, like its neighboring disciplines, experimental linguistics replicates very little, a state of affairs the field should reflect upon.


Introduction
Understanding the inner workings of human language and its cognitive underpinnings has been increasingly shaped by experimental data. In a field that builds its theories on a rapidly growing body of experimental evidence, it is of critical importance to evaluate and substantiate existing findings in the literature, because the evidence provided by a single study is limited (e.g., Amrhein et al., 2019). Scientists are trained to ensure the reliability and generalizability of scientific findings by conducting direct replication studies, i.e., studies that aim to arrive at the same scientific conclusions as an initial study by collecting new data and completing new analyses but using the same methodology (for a comprehensive overview of different terminological uses, see, e.g., Marsden et al., 2018). Direct replications are informative regardless of their outcome: If a direct replication fails, at least one underlying assumption has to be questioned, thus relevant auxiliary hypotheses must be reconsidered, which in turn might weaken the theory. On the other hand, successful direct replications add important data to the discourse, allowing for more precise estimation of theoretically relevant parameters, and thus help to strengthen the derivation chain between theory and predictions (Meehl, 1990).
Given these arguments, we consider direct replications theoretically informative and a worthwhile endeavor. Conceptual replications, on the other hand, i.e., replication attempts that change multiple critical design properties of the original study, are often held up as more valuable than direct replications because they are assumed to simultaneously address concerns about the reliability of an original claim and to extend the original findings.
Conceptual replications are often considered sufficient for a field to move forward, under the stipulation that repeated successful conceptual replications occur only if the prior research identified a true effect. However, there is increasing evidence that this strong assumption is not empirically supported. Without replicating individual studies, biases caused by questionable research practices (John et al., 2012), small sample sizes (Button et al., 2013), and publication bias (Fanelli, 2012) can lead to a set of studies that appear to form a coherent empirical foundation for an underlying theory even if the underlying empirical claims cannot be replicated: There are now a number of widely studied theories and effects that have been supported by dozens, if not hundreds, of conceptual replications but appear to crumble in light of meta-analyses or systematic direct replication attempts (e.g., Shanks et al., 2015; Wagenmakers et al., 2016). Moreover, conceptual replications can introduce interpretational ambiguity. A failed conceptual replication can never be considered evidence against the original claim, because it is always possible to attribute the failure to the methodological changes that were made (e.g., Pashler & Harris, 2012).
In sum, direct replications are an under-appreciated tool to evaluate and cement the empirical and theoretical foundation of a field and must be considered an important complementary tool to conceptual replications.
The observed lack of replication studies across disciplines threatens the very fabric of cumulative progress in experimental science, because experimental results are often taken for granted without ever being replicated, which leads to a related problem: If we don't try, we won't fail. The recent past has shown that if we try, we fail more often than we would like to: Coordinated efforts to replicate published findings have uncovered alarmingly low rates of successful replications in fields such as psychology (Open Science Collaboration, 2015), economics (Camerer et al., 2016), and social sciences (Camerer et al., 2018), a state of affairs that has been referred to as the "replication crisis" (Fidler & Wilcox, 2018).
The replication crisis is not rooted in a singular cause but pertains to a network of different practices and incentive structures, all of which conjointly lead to an increase in results that are not replicable. Researchers have identified practices that might have contributed to the widespread lack of replicability, including but not limited to sample sizes that are too small (e.g., Button et al., 2013; Vasishth et al., 2018), lack of data and materials sharing (e.g., Nosek et al., 2015), use of anti-conservative statistical methods (e.g., Yarkoni, 2019), large analytical flexibility (e.g., Simmons et al., 2011), and lack of generalizability across diverse contexts and populations (Henrich et al., 2010).
These limitations are present, and maybe even exacerbated, in experimental linguistic research: Access to certain linguistic populations is often limited or too cost-intensive, making it difficult to collect sufficiently large samples. Experimental linguistic research is resource-intensive because of equipment cost and complexity, the elaborateness of data collection procedures, and the computational requirements of data analysis and curation. This often results in studies with small sample sizes and, consequently, low statistical power (e.g., Casillas, 2021; Kirby & Sonderegger, 2018). Statistical analyses in linguistics often ignore important assumptions (e.g., Winter & Grice, 2021) and are characterized by a large number of researcher degrees of freedom (Roettger, 2019). Moreover, claims about human language are often based on a small set of languages, limiting their generalizability (e.g., Levisen, 2019; Majid & Levinson, 2010).
In light of the large overlap in research practices between linguistics and neighboring disciplines in which low replication rates and failed replication attempts have been attested, there are rising concerns about both replication rates and replicability in the field of experimental linguistics (e.g., Marsden et al., 2018; Roettger & Baer-Henney, 2019; Sönning & Werner, 2021).
Despite these well-known problems, there appear to be only very few published direct replications in linguistics. In their detailed assessment of replications in second language (L2) research, Marsden et al. (2018) explored 67 self-labeled L2 replication studies for a wide variety of characteristics.
Their results indicate that only one replication study is published for every 400 articles, which translates into 0.25% of published studies containing a replication. Following Makel et al. (2012), we will refer to the proportion of published articles containing at least one replication as the replication rate. Moreover, the sample of Marsden et al. (2018) did not include a single direct replication study, i.e., a replication that strictly followed the design of the initial study, a worrisome state of affairs that warrants further investigation. To our knowledge, there is no systematic assessment of replication rates across experimental linguistics beyond Marsden et al. (2018). The present paper aims at filling this gap. To gauge the past and current replication landscape in experimental linguistics, track progress over time, and calibrate future policy and training initiatives, it is useful to assess the prevalence of replications across experimental linguistics and explore their contributing factors.
The present study assesses the frequency of articles containing replications as well as the typology of replication studies published in a representative sample of experimental linguistics journals from 1945 to 2020. Given the arguments presented above, we are primarily interested in the prevalence of direct replications in the field. Our study aimed at answering two main questions: “How many published papers in experimental linguistics contain at least one direct replication?” and “Are there factors that affect replication rates, either at the journal level (e.g., journal policies, open access, journal impact factor) or at the study level (e.g., composition of authors, investigated language)?” The study consisted of two analyses: First, we assessed the frequency of articles mentioning the term replication (search string: replicat*) across 98 linguistics journals. Second, we manually categorized the type of replication studies (direct, partial, conceptual) in a subset of these journals. We then related the replication rates to factors such as year of publication and the citation counts of both the initial and the replication study.

How often do journals mention the term replicat*?
The key dependent variable of the first part of this study was the rate of replication mention for journals relevant to the field of experimental linguistics.

Material and methods
The study design was preregistered on 2021-03-08 and can be inspected at https://osf.io/a5xd7/.
In order to determine the rates of replication mention for individual journals, we drew on a method introduced by Makel et al. (2012). First, a sample of 100 journals relevant to the field of experimental linguistics was identified using the search engine Web of Science (https://webofknowledge.com; access date: 2021-03-03). We restricted the search results to journals in the Web of Science category Linguistics that had at least 100 published articles and a high ratio of articles containing the term experiment* in title, abstract, or keywords, in order to ensure that the subset contained journals relevant to experimental linguistics research. Among those, all articles categorized as published in English between 1945 and 2020 were taken into account. (The Web of Science catalog includes articles from 1945 to the present; all fully available years at the date of retrieval were included in the analysis. The first entries for the category Linguistics date back to 1948, and the first hit for the search term replicat* was obtained for 1969.)
The ratio between the overall number of articles and those articles mentioning the term experiment* ranged between 6.1% and 60.3% (with a median of 11.5%) across journals. The full sample of journals can be inspected in Table 2 in the appendix of this article. After journal selection, we obtained the total count of articles containing the search term replicat* in title, abstract, or keywords for each journal. Following the method presented by Makel et al. (2012), the rates of replication mention were calculated by dividing the number of articles containing the term replicat* by the total number of eligible articles for each journal. As we were only interested in experimental linguistic studies, we only considered articles containing the search term experiment* as eligible.
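To make this computation concrete, the following minimal R sketch illustrates it; the query strings are simplified stand-ins for our Web of Science searches (TS denotes a topic search, as used below), and the counts are illustrative rather than actual values.

# Illustrative Web of Science queries per journal (simplified):
#   eligible articles: WC=(Linguistics) AND TS=(experiment*) AND PY=(1945-2020)
#   mention articles:  the same query AND TS=(replicat*)

# Rate of replication mention: articles containing replicat* divided by
# eligible articles (those containing experiment*), expressed in percent.
mention_rate <- function(n_replicat, n_experiment) {
  100 * n_replicat / n_experiment
}

mention_rate(3, 150)  # e.g., 3 mentions among 150 eligible articles: 2%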
Rates of replication mention were then related to three journal properties: journal policies with regard to replication studies, journal impact factor, and whether the journal publishes open access or not. To gain an understanding of the journal policies with regard to replication studies, we examined the journals' submission guidelines, adopting a method suggested by Martin and Clarke (2017). They grouped psychology journals into categories depending on whether or not they (explicitly or implicitly) encouraged replication studies in their Instructions to Authors and Aims and Scope sections on the journal websites. For our analysis, we only distinguished between journals explicitly encouraging replication studies and those that do not. We extracted journal impact factors via Journal Citation Reports (https://jcr.clarivate.com). We assessed via Web of Science whether journals offered open access publication or not, distinguishing between three access categories: journals listed in the Directory of Open Access Journals (DOAJ) ("DOAJ gold"), journals that contained some open access articles ("partial"), and journals with no option to publish open access whatsoever ("no").
We would like to stress that journal-based predictors are not static and obviously change over time. We cannot reliably capture these dynamics. Instead, we snapshotted journal policies and impact factors in the year 2019 and used this information as a (rough) proxy for our preregistered objective of relating them to replication rates. As will be discussed below, the model estimates for these predictors are characterized by large amounts of uncertainty, leaving them rather uninformative.

Results and discussion
Out of the 52,302 articles in our sample, 8,437 mentioned the term experiment* in title, abstract, or keywords and were thus assumed to be articles presenting an experimental investigation. Out of these articles, 382 contained the term replicat*, which results in a mention rate of 4.5% across experimental linguistic articles.
The rate of replication mention varies substantially across journals, ranging from 0% to 12.82%. The median rate of replication mention is 1.7%, a rate comparable to what Makel et al. (2012) reported in their assessment of replications in psychology. Almost half of all journals (n = 42) did not mention the term in any of their articles. Figure 1 illustrates the variation across those journals that exhibited at least one mention of the term.
We statistically estimated the rate of replication mention as a function of the following predictors: centered journal impact factor (continuous, henceforth jif), open access type (no, partial, DOAJ gold), and replication policy (binary: explicitly encouraging or not). We used Bayesian parameter estimation based on a generalized linear regression model with a binomial link function, fitted to the proportion of replication mentions per journal using the R package brms (Bürkner, 2016). We used weakly informative normal priors centered on 0 (sd = 2.5) for the intercept and Cauchy priors centered on 0 (scale = 2.5) for all population-level regression coefficients. These priors are what is referred to as regularizing (Gelman et al., 2008), thus making our model conservative with regard to the predictors under investigation. Four sampling chains with 2,000 iterations each were run, with a warm-up period of 1,000 iterations. A possible concern about our modelling strategy is an inflation of zeroes, given that many journals never mention the search term; a zero-inflated binomial regression can account for such an inflation, so we additionally ran a zero-inflated binomial model. The resulting estimates are highly compatible with those from the simpler binomial model, and both models are available in our repository. For relevant predictor levels and contrasts between predictor levels, we report the posterior probability of the rate of replication mention, summarizing these distributions by the posterior mean and the 95% credible interval. All estimated effects were small and characterized by large uncertainty (rates increase for open access: 1.6% [-1.5, 6.1]; rates decrease for no open access: -1.9% [-4.4, 7]; and rates increase for encouraging replication policies: 0.7% [-1.1, 2.8]). We thus won't discuss these results further.
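For illustration, here is a minimal sketch of such a model in R with brms; the data frame, its column names, and all counts are hypothetical stand-ins for our journal-level data, not the actual dataset.

library(brms)

# Hypothetical journal-level data: replicat* mentions out of eligible
# (experiment*) articles, plus the three journal-level predictors.
journals <- data.frame(
  n_replicat   = c(0, 3, 12, 1, 7, 0),
  n_experiment = c(80, 150, 420, 95, 310, 60),
  jif_c        = c(-0.4, 0.1, 2.8, -0.6, 1.2, -0.8),  # centered impact factor
  oa_type      = factor(c("no", "partial", "doaj_gold", "no", "partial", "no")),
  policy       = factor(c("none", "none", "encouraging", "none", "none", "none"))
)

# Weakly informative, regularizing priors as described in the text.
priors <- c(
  prior(normal(0, 2.5), class = Intercept),
  prior(cauchy(0, 2.5), class = b)
)

# Binomial regression of replication mentions on the journal-level predictors.
fit <- brm(
  n_replicat | trials(n_experiment) ~ jif_c + oa_type + policy,
  data = journals, family = binomial(),
  prior = priors,
  chains = 4, iter = 2000, warmup = 1000
)

summary(fit)  # posterior means and 95% credible intervals

# The zero-inflated robustness check swaps in family = zero_inflated_binomial().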

How many articles containing the term replicat* are actual replications?
The second part of the study had two aims: First, the term replication is commonly used in ambiguous ways, so articles containing the search term were further analyzed to determine whether they indeed reported a replication study or whether they used the term in a different way. Second, we further investigated what types of replication studies are published and whether replications are becoming more frequent over time. Our target estimand is the proportion of experimental articles containing at least one replication.

Material and methods
From the superset of 98 journals obtained above, the 19 journals with the highest proportion of experimental studies were selected for a more detailed analysis, excluding journals for which fewer than 2 hits (TS = replicat*) could be obtained (see https://osf.io/f3yp8/ for a list of article counts per journal). This sampling procedure resulted in 274 possible self-labeled replication studies with publication years ranging between 1989 and 2020. We included the full set of articles in our sample for manual coding.
For each article, we first identified whether it indeed contained a replication study.
We examined the title and abstract, the text before and after occurrences of the search term replicat*, the paragraph before the Methods section, and the first paragraph of the Discussion section (following and adapting the procedure specified by Makel et al., 2016). If the authors explicitly claimed that (one of) their research aim(s) was to replicate the results or methods of an initial study, the article was treated as a replication and was submitted to further analysis according to the preregistered coding scheme, which can be inspected at https://osf.io/ct2xj/.
When extracting the number and types of changes made to the initial study, we assumed that the authors of a replication study did not make any drastic changes without reporting them.
Following Marsden et al. (2018), replication studies were classified according to the number of changes made into three categories: direct replication (0 changes), partial replication (1 change), and conceptual replication (2 or more changes). We noted the nature of each methodological change as one of the following categories: experimental paradigm, sample, materials/experimental set-up, dependent variable, independent variable, and control. These categories were used to determine to which of the three types of replication an article belonged. All changes identified by the manual coding procedure are changes that were reported by the authors of the replication study. Most of these changes were made in order to achieve specific goals: Either the authors aimed at showing that an effect extends to another language, that it is robust across different experimental paradigms or subject groups, or how different kinds of measurements, manipulations, and controls affect the observed results. Accordingly, we did not count slight changes to the stimulus materials, such as the correction of typos, but only changes that the authors expected to affect the results or to improve the study in a significant way. We also recorded the language under investigation. Information on whether the article was published open access as well as citation counts and years of publication for both studies were obtained from Web of Science. An author overlap was attested when at least one author was a (co-)author on both studies. During the coding procedure, we encountered edge cases that we did not anticipate in our preregistration: When several self-labeled replication studies were mentioned in one article, we chose the first-mentioned study for our analysis. If there was one independent replication but also one or more inner-paper replications, i.e., experiments that replicated results of an earlier experiment reported in the same article, we selected the independent replication for analysis. Note that since our target estimand is the rate of published articles that contain at least one replication, this choice does not artificially reduce the replication rate.
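As a minimal illustration of this classification rule (a sketch; the function name is ours, not part of the preregistered coding scheme):

# Classify a replication study by the number of reported methodological
# changes, following Marsden et al. (2018).
classify_replication <- function(n_changes) {
  if (n_changes == 0) {
    "direct"        # 0 changes
  } else if (n_changes == 1) {
    "partial"       # exactly 1 change
  } else {
    "conceptual"    # 2 or more changes
  }
}

classify_replication(0)  # "direct"
classify_replication(3)  # "conceptual"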

Results and discussion
Out of the 274 articles in the subsample, 262 (95.6%) indeed presented experimental linguistics research. The remaining 12 (4.4%) were not experimental in nature but were comments, reviews, or computational studies. Out of the 262 experimental studies, 151 were self-claimed replications according to our criteria. The remaining 111 articles mentioned the term in other contexts or did not specify the concrete aim of replicating an initial study's design or results. Moreover, many papers used the term replicated in a broad sense that roughly translates into "finding a similar result", thus not qualifying as a replication study as defined above. Out of the 151 replication studies, we categorized 86 (57%) as conceptual, 56 (37.1%) as partial, and only 11 (7.3%) as direct replications.
Looking more closely at direct replications, 5 studies were independent, i.e., there was no overlap between the authors of the initial study and those of the replication study. Out of these independent direct replication studies, 3 were self-labeled as successful replications. In other words, our sample included only two failed independent direct replication attempts. These low rates indicate that replication attempts, and especially direct replication attempts, are rather rare in experimental linguistics. One possible reason why (direct) replication rates are not increasing in the field according to our analysis could be that experimental linguistics predominantly replicates experimental findings across languages, making the studies by definition only partial or conceptual replications. However, only 19.9% of replications targeted a different language than the initial study. The majority of replication efforts were conducted within the same language as the initial study. In fact, 67.5% of all replication studies in our sample had one variety of English as the main language of investigation in either the replication or the corresponding initial study.
The median number of years between an initial study and its replication is 7. Initial studies were cited on average 50.1 times before a replication was published, which corresponds to an average yearly citation rate of 7.2 citations. This average citation rate is well above the impact factor of core linguistics journals (median journal impact factor in our superset: 1.1).
Replication studies were cited on average only 21 times, corresponding to a substantially lower average yearly citation rate than that of the initial studies they targeted.

Case study of Journal of Memory and Language
The Journal of Memory and Language (JML) accounts for the largest number of articles in our sample (114 out of 274) and is the journal with the highest impact factor (3.9). We therefore conducted a separate analysis of this journal to assess whether our overall results were affected by this skewed sample. (Originally, this subset analysis was planned because an earlier version of this paper sampled 50 of the 114 articles published in JML; following a reviewer's suggestion, we later submitted the full set of JML articles to manual coding. We keep this analysis to show that our results apply to the whole field and are not mainly influenced by one journal.) We find that 70 (61.9%) of the 113 experimental JML papers contain replication studies. Of these, 35 (50%) are conceptual, 30 (42.9%) are partial, and 5 (7.1%) are direct replication studies, which is in line with the results for the whole sample.
Only 3 of the studies published in JML were independent direct replication studies (one of which was successful). We conclude that we have little reason to believe that the large proportion of JML articles in our sample substantially affected our overall results and are confident that our results apply to the field rather than to one journal.

General discussion
The current study aimed at providing a comprehensive survey of published replications in experimental linguistic research. By analyzing the publication history of over 50,000 articles across 98 journals that publish experimental linguistic research, our study found that 4.5% of experimental linguistics publications used the term replicat* in title, abstract, or keywords. A more thorough analysis of 274 sampled experimental articles containing the term replicat* revealed that only around half of the hits represented actual replication studies, reducing the effective replication rate to 2.5%. This rate is slightly higher than those reported in comparable investigations in psychology (1.6%, Makel et al., 2012), educational science (0.1%, Makel & Plucker, 2014), and economics (0.1%, Mueller-Langer et al., 2019). The higher rate might be due to a methodological choice, however: Due to the large plurality of methods in linguistics, we calculated the replication rate based on only those articles that contained the term experiment* (as opposed to all articles in the sample), which reduces the denominator substantially.
A closer look at the nature of the replication studies revealed that the majority diverged from the initial study in at least one design choice. Only 7.3% were direct replications, i.e., studies that directly repeated an initial study without self-reported changes to the design, and only five of these were replications conducted by an independent team of researchers. Taking the replicat* mention rate and the actual replication rate together, 0.08% of experimental studies in the field of linguistics are independent direct replications. In other words, only 1 in 1,250 experimental linguistics articles contains an independent direct replication.
This clearly indicates that replication attempts, and especially independent direct replication attempts, are still very rare in the experimental linguistics literature.
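For transparency, here is a back-of-envelope reconstruction of these headline numbers in R (our reading of the figures reported above; rounding follows the text):

mention_rate <- 382 / 8437        # ~0.045: experimental articles mentioning replicat*
replication_share <- 151 / 274    # ~0.55: coded hits that were actual replications
mention_rate * replication_share  # ~0.025: the effective replication rate of ~2.5%

indep_direct_share <- 5 / 274     # independent direct replications among coded hits
mention_rate * indep_direct_share # ~0.0008, i.e., ~0.08%, roughly 1 in 1,250 articles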
Before interpreting the results and offering possible ways forward, we need to discuss two important caveats of our study. First, if research articles were not framed as experimental, they were not included in the analysis. Similarly, if experimental articles were not framed as replications, they were not categorized as such. These are clear limitations of our search strategy and might lead to an underestimation of the true replication rate. Assuming the false negative rate is not zero, the reported replication rates might change after correction. To circumvent this methodological problem, a large sample of articles would have to undergo manual coding, which is not feasible for a large-scale assessment. Future research using alternative assessment methods (possibly machine learning techniques) or more in-depth investigations of either subfields (e.g., Marsden et al., 2018) or specific journals might result in different replication rates. However, the existence of replication studies that are not referred to as such might also reflect a more general problem: If studies are not framed as replications by using the term replication, readers' ability to connect research to its intellectual precedents is severely limited.
Second, our assessment of replication types relied on two assumptions: that the authors disclosed changes to the initial study in a transparent way, and that, where changes were disclosed, we were able to extract and interpret them accurately. Neither of these assumptions necessarily holds, so any rates generated here are only a rough proxy of the true replication rate. Nevertheless, given that our findings align well with evidence from other fields as well as with an in-depth analysis of a subfield of linguistics (Marsden et al., 2018), we are confident that our conclusion holds.
Although the present study is the first systematic assessment of replication rates in linguistics, our conclusions are hardly surprising. Academic incentive systems do not reward replication studies, and neither journals nor funders encourage them. For example, Martin and Clarke's (2017) survey results suggest that in 2015 only 3% of psychology journals explicitly stated that they would consider publishing replications. Similarly, out of the 98 journals in our sample, only 2 encouraged direct replications. And even if one manages to publish a replication, replication studies are characterized by much lower yearly citation counts compared to the corresponding initial studies, leading to a lack of perceived prestige (e.g., Koole & Lakens, 2012; Marsden et al., 2018; Nosek et al., 2012). Direct replications simply do not seem worth their costs.
In order to overcome the asymmetry between the cost of direct replication studies and the presently low academic payoff for it, we must re-evaluate the value of direct replications.
Funding agencies, journals, but also editors and reviewers need to start valuing direct replication attempts as much as they value novel findings. At the same time, we should attempt to find more resource-efficient ways to both identify replication targets and conduct replication studies. We believe most people would agree that not every study needs direct replication. Take, for example, the McGurk effect, i.e., perceiving a sound that lies between an auditorily presented component of one sound and a visually presented component of another (McGurk & MacDonald, 1976). This phenomenon is probably replicated in dozens of linguistics classrooms every semester across the globe. On the other hand, it might be a good idea to evaluate more critically whether a given study is worth replicating. Resources can be saved if studies with poor experimental designs, unsuitable measurement approaches, or inept model specifications are ruled out from direct replication attempts (Yarkoni, 2019). Finding convenient yet effective tools to identify worthwhile replication targets is an active metascientific field (e.g., Coles et al., 2018; Hardwicke et al., 2018; Isager et al., 2021a), and feasible algorithms are currently being developed and tested (Isager et al., 2021b). When it comes to more accessible ways of conducting replication studies, several authors have suggested involving our students more rigorously (e.g., de Leeuw et al., 2019; Frank & Saxe, 2012; Grahe et al., 2012; Roettger & Baer-Henney, 2019), possibly creating a rich learning experience for students while at the same time reducing the resource costs of replication studies. Alternatively, resources can be pooled across multi-lab replication efforts, effectively reducing the costs for individual researchers and labs (e.g., Frank et al., 2017; Nieuwland et al., 2018; Open Science Collaboration, 2015). The StudySwap platform, for example, allows researchers to identify independent labs for conducting a replication attempt of one's study, thus helping researchers to assess the independent replicability of their findings prior to publication (Chartier et al., 2018).
We are confident that the field of linguistics can function as a role model for neighboring fields.
Although major meta-scientific discourses are held in other fields, linguistics has demonstrated quick uptake of methodological reforms time and time again. A case in point is the swift uptake of Registered Reports, a new article format in which a study proposal is reviewed before the research is undertaken (see, e.g., https://royalsociety.org/blog/2018/10/reproducibility-meets-accountability/). While uptake across disciplines is generally slow, linguistics already has at least 12 high-impact journal outlets that offer Registered Reports. Moreover, an increasing number of reproducibility initiatives founded in the field during the last few years gives hope that the field will continue to evaluate its past, current, and future practices and successfully face the challenges ahead. This paper is an attempt to contribute to this development. We hope our assessment allows future efforts to track progress over time and calibrate policies across experimental linguistics.

Data availability
All data and analyses are available online at https://osf.io/9ceas/.