eScholarship
Open Access Publications from the University of California

HeCz: A large scale self-paced reading corpus of newspaper headlines

Abstract

Linguistic corpora have been a vital resource for understanding not only how we use language, but also how we process words and sentences. To better understand language processing, researchers have recently been creating corpora that integrate traditional text annotations with behavioural measurements collected from human participants. In this paper we introduce the HeCz Corpus, which to our knowledge is the largest behavioural corpus of this kind, containing 1,919 newspaper headlines taken from a Czech-language news website. The sample consisted of 1,872 participants, each reading approximately 120 headlines. Each headline was read in a self-paced reading task, meaning that every word in the corpus can be analyzed for reading time. After reading each headline, participants answered a question about a specific piece of information contained in the headline, providing a measure of comprehension. To facilitate better understanding of participant-level variation in how the headlines are processed, we collected data on each participant's mood state immediately prior to participation, along with basic demographic information. We also collected data from a subset of participants who read the stimuli in the initial testing round and then completed the same experiment in a second round after a one-month gap, which can provide new insights into how texts are processed and understood when re-read. To highlight the practical uses of the corpus, our analyses focus on how reading times are modulated by i) headline length in words, ii) trial order, and iii) testing round, and examine the role that the location of the targeted information plays in comprehension accuracy. HeCz thus provides a unique resource that psycholinguists and cognitive scientists more generally can use to gain new insights into how real-world language is processed and understood.
