eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Spoiler Recognition as Semantic Text Matching

Abstract

Engaging with a TV show in the age of the Internet often means avoiding show-related content for months out of fear of being spoiled. While spoiler detection research shows promising results for protecting viewers from generic spoilers, these approaches do not solve the problem of users having to avoid show-related content while they are partway through a show. This is because what constitutes a spoiler depends on where a viewer is in the show, and spoiler detection on its own is too coarse to capture this complexity. Instead, we propose the task of spoiler recognition, which seeks to assign an episode number to a spoiler, given a show. We pose this task as semantic text matching and present a dataset of comments and episode summaries for evaluating model performance. The dataset consists of ~3.1K manually labeled test comments, ~2.8K manually labeled validation comments, and over 200K auto-labeled comments for training. We experimentally demonstrate the utility of this training set and use it to benchmark the performance of BigBird, Nyströmformer, and Longformer on this task. Specifically, we cross-encode summaries with comments and examine the mean reciprocal rank scores. Our results show that Longformer is best suited for this task. We also perform an error analysis to shed some light on the kinds of challenges spoiler recognition poses. We present this dataset and these results to facilitate future research into spoiler recognition.
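As a rough illustration of the evaluation metric named in the abstract, the sketch below computes mean reciprocal rank (MRR) from per-episode relevance scores. It is not the thesis's implementation: the function names, the toy scores, and the tie-handling rule are assumptions made for the example, and the scores simply stand in for whatever a cross-encoder would output for each (episode summary, comment) pair.

```python
from typing import Sequence


def reciprocal_rank(scores: Sequence[float], true_index: int) -> float:
    """Return 1/rank of the true episode, where rank 1 is the highest score.

    Assumption: ties are resolved in favor of the true episode, i.e. only
    strictly higher scores push its rank down.
    """
    true_score = scores[true_index]
    rank = 1 + sum(1 for s in scores if s > true_score)
    return 1.0 / rank


def mean_reciprocal_rank(
    all_scores: Sequence[Sequence[float]], true_indices: Sequence[int]
) -> float:
    """Average the reciprocal rank over all comments."""
    rrs = [reciprocal_rank(s, t) for s, t in zip(all_scores, true_indices)]
    return sum(rrs) / len(rrs)


# Hypothetical scores: one row per comment, one column per episode summary.
toy_scores = [
    [0.1, 0.7, 0.2],  # true episode (index 1) is ranked first
    [0.5, 0.3, 0.2],  # true episode (index 1) is ranked second
]
toy_labels = [1, 1]

mrr = mean_reciprocal_rank(toy_scores, toy_labels)
print(f"MRR = {mrr:.2f}")
```

In this toy case the reciprocal ranks are 1 and 1/2, so the MRR is 0.75. A score of 1.0 would mean the correct episode is always ranked first.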