Scalar inferencing, polarity and cognitive load

According to the Polarity Hypothesis, the presence or absence of a processing cost for Scalar Inferences (SIs) depends on their polarity. This hypothesis predicts, among other things, that the processing of lower-bounding SIs should not be affected by cognitive load the same way upper-bounding SIs are. To date, evidence in support of this prediction comes from comparisons between upper-bounding and lower-bounding SIs elicited by disparate scalar words. In this paper, we report on two dual-task experiments testing this prediction in a more controlled way by comparing upper-bounding and lower-bounding SIs arising from the same scalar words or from scale-mates operating over the same dimension. Results show that, for these more minimal comparisons, lower-bounding SIs involve cognitive demands comparable to those of their upper-bounding counterparts. These findings challenge the idea that load effects are consistently modulated by SI polarity and suggest instead that these effects are relatively consistent across different types of SIs.


Introduction
An utterance of (1-a) commonly conveys the Scalar Inference (SI) in (1-c). On most accounts of SIs, this pragmatic enrichment is assumed to involve the generation and negation of alternatives, that is, sentences which were not uttered, but would have been equally relevant and more informative in the given context (for an overview, see Chemla & Singh, 2014; Gotzner & Romoli, 2022; Sauerland, 2012). In our example, the relevant SI corresponds to the negation of the alternative to (1-a) in (1-b), where the weak scalar word some has been replaced with its stronger scale-mate, all.
(1) a. Some of the apples are red. (WeakPos)
    b. All of the apples are red. (Alternative)
    c. ↝ Not all of the apples are red. (Upper-bounding SI)

There is converging experimental evidence that processing the interpretation with SI of a sentence like (1-a) can be cognitively demanding (for response delay effects, see Bott & Noveck, 2004; Bott et al., 2012; Breheny et al., 2006; Chemla & Bott, 2014; Chevallier et al., 2008; Cremers & Chemla, 2014; Huang & Snedeker, 2009; Noveck & Posada, 2003; Tomlinson et al., 2013; for cognitive load effects, see De Neys & Schaeken, 2007; Dieussaert et al., 2011; Marty & Chemla, 2013; Marty et al., 2013). The apparent 'cost' of SIs is one of the most replicated effects in truth-value judgement studies and is often thought to be an important marker of this sort of meaning-strengthening operation. To date, however, there has been no wide consensus as to the source of this extra cognitive cost (for an overview, see Khorsheed et al., 2022; Khorsheed & Gotzner, 2023), especially as it is absent from the comprehension and evaluation of semantically equivalent sentences with 'only' (Bott et al., 2012; Marty & Chemla, 2013).
A promising account has recently emerged. Investigating whether the response delay and cognitive load effects observed for some generalise to other scalars, van Tiel, Pankratz, and Sun (2019) tested the processing of 7 scalar words differing, inter alia, in their scalarity. Their studies included 5 positive scalars with a literal lower-bound meaning (i.e., some, most, or, might and try), giving rise to upper-bounding SIs (e.g., 'some of the food' implicates 'not all of the food'), and 2 negative scalars with a literal upper-bound meaning (i.e., scarce and low), giving rise to lower-bounding SIs (e.g., 'low on food' implicates 'some food'). Their results show that, while all scalars from the first category displayed the classical effects, neither of those from the second category did (see also van Tiel, Marty, Pankratz, & Sun, 2019; van Tiel & Pankratz, 2021). The authors explain these findings by hypothesising that only upper-bounding SIs are cognitively demanding and that the extra processing cost they incur stems from the fact that, unlike lower-bounding SIs, these SIs introduce negative propositions into the meaning of the sentence, the processing of which is independently known to be cognitively effortful (a.o., Clark & Chase, 1972; Geurts et al., 2010; Deschamps et al., 2015). This hypothesis, dubbed the Polarity Hypothesis (van Tiel & Pankratz, 2021; also referred to as the Scalarity Hypothesis in van Tiel, Pankratz, & Sun, 2019; van Tiel, Marty, Pankratz, & Sun, 2019), is stated in (2).

(2) Polarity Hypothesis
SIs are cognitively demanding insofar as they introduce an upper-bound on the dimension over which the scalar word quantifies.

The polarity-based explanation departs from the earlier explanation in Marty et al. (2013), where the extra cognitive cost associated with SI interpretations is linked to ambiguity resolution and located in the processing stage involving the decision to derive or not to derive the SI (see Gotzner, 2019 for a similar proposal).
As van Tiel, Pankratz, and Sun (2019) acknowledge, however, the scalar words in their sample differ in more respects than just the polarity of the SIs they can give rise to, such as the type of dimension over which they quantify or the part of speech they belong to (e.g., only the negative scalars were adjectival). In the absence of more minimal comparisons, van Tiel et al.'s results leave open the possibility that the contrasts they observed reflect idiosyncrasies of the negative scalars they tested, rather than a general difference in the processing signature of upper-bounding and lower-bounding SIs.¹ In this paper, we focus on the cognitive load effects associated with the derivation of SIs and offer a direct test of the predictions of the Polarity Hypothesis by comparing upper-bounding and lower-bounding SIs arising (i) from the same scalars and (ii) from different scalars belonging to the same scale. Crucially, these SIs differ in polarity but otherwise involve the same words and concepts: e.g., 'some' implicates not all, while 'some not' and 'not all' implicate some.
Our results demonstrate that, for such comparisons, lower-bounding SIs involve cognitive demands comparable to those of their upper-bounding counterparts.

Experiments
We conducted two experiments, both based on the same method and procedure as in van Tiel, Pankratz, and Sun (2019, Experiment 2). In both experiments, participants had to perform a sentence-picture verification task. In the target conditions, sentences were presented with a picture that made them false if the relevant SI was derived, but true otherwise. Participants' cognitive resources during sentence verification were experimentally burdened by adding a secondary memory task and by further modulating the complexity of the visual patterns to be memorised (see also De Neys & Schaeken, 2007; Marty & Chemla, 2013; van Tiel, Marty, et al., 2019). If the derivation of a given SI requires additional processing resources, then that SI should become less available under higher cognitive load, i.e., in situations where these resources are impaired by the concurrent memory task, resulting in higher acceptance rates in the target conditions.

¹ It should also be noted that the Polarity Hypothesis only offers a partial explanation of previous findings on SI costs. For instance, it does not explain why there is an extra cost to the SI interpretation of 'some'-sentences compared to the literal interpretation of their 'only'-variants (Bott et al., 2012; Marty & Chemla, 2013). Similarly, it does not explain the findings in Bott and Frisson (2022) that SI interpretations of 'some'-sentences come about faster when preceded by their canonical 'all'-alternative. Findings like the above suggest that the cost of (upper-bounding) SIs is not reducible to the processing of negative propositions.

Participants
For each experiment, 150 participants were recruited online through Prolific (first language: English; country of residence: UK, USA; minimum prior approval rate: 90%) and paid for their participation (£9.5/hr).

Design and Materials
Building upon the materials and method in van Tiel, Pankratz, and Sun (2019), we constructed, for each experiment, three tasks that manipulated the cognitive load on participants' executive resources during sentence comprehension: one sentence-picture verification task (NoLoad), and two dual-tasks in which participants had to perform that verification task while trying to remember either a simple square pattern (LowLoad) or a more complex one (HighLoad). Crossing sentence and picture types gave rise, in each experiment, to 18 conditions (3 scales × 2 sentence types × 3 picture types), each of which was instantiated 3 times by varying the contents of the pictures, resulting in 54 test trials. Following the suggestion in Marty and Chemla (2013), we added in both experiments 3 true and 3 false instances of the only-variants of the WeakPos some-sentences (e.g., Only some of the socks are pink) to serve as additional controls. The mean acceptance rate for these sentences was above 97.5% in their true conditions and below 11.5% in their false conditions in all three tasks of both experiments.
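For concreteness, the factorial crossing described above can be enumerated programmatically. This is only an illustrative sketch; the labels follow the paper's terminology, but the enumeration itself is ours.

```python
from itertools import product

# The three scales and the sentence/picture types tested in each experiment
# (WeakNeg is the negative variant in Exp 1; Exp 2 uses NegStrong instead).
scales = ["<some, all>", "<or, and>", "<possible, certain>"]
sentence_types = ["WeakPos", "WeakNeg"]
picture_types = ["True", "False", "Target"]
items_per_condition = 3  # each condition instantiated 3 times

conditions = list(product(scales, sentence_types, picture_types))
n_trials = len(conditions) * items_per_condition

print(len(conditions))  # 18 conditions
print(n_trials)         # 54 test trials
```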
These results are in line with the findings from Marty and Chemla (2013) in showing that the interpretation of only-sentences was largely unaffected by the load manipulations. These items are thus set aside in what follows.

² The NegStrong sentence for ⟨or, and⟩ involved a clausal conjunction embedded under a sentence-internal negation (i.e., Not both the apple and the pepper are red). We chose this construction because it was structurally closer to the WeakPos and other NegStrong sentences than other candidates involving external negation (e.g., It is not the case that both the apple and the pepper are red) or nominal conjunction (e.g., The apple and the pepper are not both red). We note, however, that this construction is marked and that its use is felicitous in fewer contexts than the other candidates just mentioned. The results for these NegStrong conditions should thus be interpreted with caution.

Procedure
Participants were pseudo-randomly assigned to one of the three tasks so as to reach a balanced number of subjects per task. They were presented with the instructions corresponding to the relevant task and were given one example trial. Each survey started with 4 unannounced practice trials and then continued with the test trials, presented in random order. In the NoLoad task, each trial consisted of the presentation of a sentence-picture item. Participants had to decide whether or not the sentence was a good description of the depicted situation by pressing one of two response keys on their keyboard. In the LowLoad and HighLoad tasks, each trial started with the brief presentation of a pattern of squares (1200 ms for low-load patterns and 1500 ms for high-load patterns). Afterwards, a sentence-picture item was displayed on the screen, exactly as in the NoLoad task. Once participants had entered their answer, they were presented with an empty matrix and asked to recreate the pattern of squares presented at the start of the trial.
Participants could fill or unfill squares in the matrix by clicking on them.

Data treatment
Four participants in Exp 1 and five in Exp 2 were excluded either for failing to complete the whole survey or for making mistakes in more than 25% of the control sentence-picture items. The mean accuracy rate on control items of the remaining participants was above 93% across all load conditions in both experiments, indicating that these participants had no problem judging the test sentences in their True and False conditions, even under high cognitive load. The mean number of correctly localised squares was above 0.92 for the simple 1-square patterns and above 2.85 for the complex 4-square patterns in both experiments, indicating that participants performed the memory tasks appropriately.
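Binomial proportions like these are reported in this paper with Wilson score intervals (the method used for the 95% CIs in Figure 3). As a minimal, self-contained sketch of the standard formula, with illustrative counts of our own rather than the experiments' data:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Illustrative only: 93 correct control responses out of 100
lo, hi = wilson_ci(93, 100)
print(f"[{lo:.3f}, {hi:.3f}]")
```

Unlike the naive Wald interval, the Wilson interval remains well-behaved for proportions near 0 or 1, which matters here given ceiling-level control accuracy.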

Data analysis
To analyse the effects of cognitive load and determine whether they differ across upper-bounding and lower-bounding SIs, we fitted a Bayesian mixed effects logistic regression model to the results of the experiments using the brms package (Bürkner, 2017, 2018, 2021). The priors were weakly informative priors, which we constructed based on the results of van Tiel, Marty, et al. (2019). Their experiment is identical to the ones reported here, except that no linguistic stimuli contained overt negation. We fitted a logistic mixed effects regression model to the data from their target conditions using the glmer function from the lme4 package, which predicted responses on the basis of two sum-coded categorical variables, Load and Scale, and their interaction. The mixed effects were by-item random intercepts and by-participant random intercepts and slopes for Scale, and their correlation. We took the estimates βi of this model and used N(βi, 1) as the priors for the respective fixed effect parameters. For the missing fixed effects, all of which have to do with Polarity, we assumed a fairly broad prior distribution N(1, 1). For mixed effects, the standard deviations were all assumed to come from the Half-Cauchy distribution with σ = 2 and the variance-covariance matrix from the Lewandowski-Kurowicka-Joe distribution with η = 1.³ The posterior distributions reported below were estimated using four Hamiltonian Monte Carlo chains implemented in Stan. Each of these chains consisted of 10,000 samples, of which 1,000 were used for warm-up. Both the trace plots (omitted here) and the R̂ values indicated convergence.
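The convergence check just mentioned relies on the R̂ statistic. The sketch below implements the classic (non-split) Gelman-Rubin computation from the textbook formula; it is our illustration, not the authors' code, and modern samplers such as Stan report a refined split-R̂ variant.

```python
import numpy as np

def gelman_rubin_rhat(chains: np.ndarray) -> float:
    """Classic (non-split) R-hat for an array of shape (m_chains, n_samples)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    grand_mean = chain_means.mean()
    # Between-chain variance B and mean within-chain variance W
    b = n / (m - 1) * np.sum((chain_means - grand_mean) ** 2)
    w = chains.var(axis=1, ddof=1).mean()
    # Pooled posterior variance estimate
    var_hat = (n - 1) / n * w + b / n
    return float(np.sqrt(var_hat / w))

rng = np.random.default_rng(0)
well_mixed = rng.normal(size=(4, 9000))  # four chains sampling the same target
print(round(gelman_rubin_rhat(well_mixed), 2))  # close to 1.0 for converged chains
```

Values of R̂ close to 1 indicate that the between-chain and within-chain variances agree, i.e. the chains are exploring the same posterior.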

Results
Figure 3 shows the observed mean acceptance rates in the verification task. Overall, results replicate previous findings that people derive fewer upper-bounding SIs when their executive cognitive resources are burdened. Responses to the WeakPos sentences were as expected in showing that, in the target conditions, participants accepted these sentences more often under load. For the Low-No comparisons, the posterior probabilities of WeakPos having larger differences than WeakNeg and NegStrong were 37% and 32%, with evidence ratios of 0.60 and 0.47, respectively, and the differences between the differences were estimated to be -0.13 and -0.24, with 90% quantiles [-0.82, 0.56] and [-1.11, 0.63]. For the High-No comparisons, the corresponding posterior probabilities were both 80%, with evidence ratios of 3.98 and 4.03, respectively, and the differences between the differences were estimated to be 0.37 and 0.45, with 90% quantiles [-0.34, 1.11] and [-0.42, 1.35]. Thus, our data provide evidence against Hypothesis 1 and weak evidence for Hypothesis 2.
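Conceptually, the evidence ratios reported here are just the ratio of posterior mass on either side of a one-sided hypothesis a > b, which can be sketched directly from posterior draws. The simulated draws below are illustrative stand-ins, not the paper's posteriors.

```python
import numpy as np

def evidence_ratio(draws_a: np.ndarray, draws_b: np.ndarray) -> float:
    """Posterior odds for the one-sided hypothesis a > b."""
    p_greater = np.mean(draws_a > draws_b)
    p_smaller = np.mean(draws_a < draws_b)
    return float(p_greater / p_smaller)

rng = np.random.default_rng(1)
a = rng.normal(0.4, 1.0, size=36000)  # hypothetical draws: load effect for WeakPos
b = rng.normal(0.0, 1.0, size=36000)  # hypothetical draws: load effect for a negative variant
er = evidence_ratio(a, b)
print(round(er, 2))  # e.g. an 80% posterior probability of a > b gives 0.8 / 0.2 = 4
```

This matches how brms's hypothesis() function defines the evidence ratio for one-sided hypotheses (see the footnote below the Results).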
Finally, the same two hypotheses were tested for each level of Scale. The results are summarised in Table 1. Very little variation was found between positive and negative scalar sentences among the three scales regarding the Load effects. Specifically, none of the test cases showed notable evidence for Hypothesis 1 and only two of them showed notable evidence for Hypothesis 2: the High-No difference for WeakPos was larger than that for WeakNeg with ⟨possible, certain⟩ and larger than that for NegStrong with ⟨or, and⟩. Hence, we conclude that there is no across-the-board difference in the effect of Load depending on Polarity.

³ We follow here the recommendations of Gelman (2006), Gelman, Carlin, Stern, and Rubin (2013) and McElreath (2015). In particular, the motivation for using (Half-)Cauchy distributions as priors for variance parameters is that Cauchy distributions have relatively long and even tails, especially compared to normal distributions, which makes them good candidates for weakly informative priors. Other commonly used priors for variance parameters include Half-Normal (Schad, Betancourt, & Vasishth, 2021; Vasishth, Yadav, Schad, & Nicenboim, 2023) and Exponential (McElreath, 2020). All the sources cited above recommend Lewandowski-Kurowicka-Joe distributions as weakly informative priors for correlation matrices.

⁴ M stands for mean acceptance rate and CI for (binomial proportion) confidence intervals. 95% CIs were calculated from participants' binary responses using Wilson's method and transformed into percentages.

⁵ Among other things, the hypothesis() function of brms computes an evidence ratio for each hypothesis. For a one-sided hypothesis of the form a > b, as in the present case, the evidence ratio is simply the ratio of the posterior probability of a > b to the posterior probability of a < b, that is, the posterior probability under the hypothesis against the posterior probability under its alternative.

Discussion
We tested the predictions of the Polarity Hypothesis by comparing upper-bounding and lower-bounding SIs arising from scalar words operating over the same dimension. Our results reproduce the load effects associated with upper-bounding SIs arising from positive sentences and show that comparable effects extend to the lower-bounding SIs associated with the negative variants of these sentences, whether they involve the same scalars or their stronger scale-mates. We take these results to show that load effects are not specific to upper-bounding SIs and to suggest instead that, for such minimal comparisons, these effects are relatively uniform across different types of SIs. These findings are challenging for the Polarity Hypothesis and for the idea that the polarity of an SI is the only or main explanation of the load effects. On the other hand, they remain compatible with Marty and Chemla's (2013) proposal that executive cognitive resources are needed to entertain and decide among competing readings and that, when these resources are impaired, speakers default to the more readily accessible interpretation; for scalar sentences, their literal interpretation.
This study leaves us with two open issues. First, it remains to be understood why the SI interpretations of certain negatively scalar words like scarce and low appear to be immune to load effects, as per the results of van Tiel, Pankratz, and Sun (2019). We note, however, that this pattern is not unattested: Marty et al. (2013) found a similar pattern with other scalar expressions, specifically numerals. Second, it remains the case that the response time results reported so far in the literature largely line up with the Polarity Hypothesis, since the classical response delay effects do not appear to generalise to lower-bounding SIs, whether they arise from negatively scalar words (van Tiel, Pankratz, & Sun, 2019; van Tiel & Pankratz, 2021) or negated scalars (Cremers & Chemla, 2014; Romoli & Schwarz, 2015). This suggests that dual-task and response time results need not pattern together for SIs and, consequently, that the two types of measures may reflect distinct cognitive effects (see Marty et al., 2020 for similar suggestions). More work is thus required to understand why this disparity emerges.

Experiment 1 tested WeakPos sentences like (1-a), where a weak scalar term appears in a positive sentence, and compared them to their WeakNeg variants, where negation is added below the same scalar term, as in (3-a). The latter give rise to lower-bounding SIs conveying what is literally expressed by their WeakPos counterparts. All theories of SIs explain the SI in (3-c) as arising from the alternative in (3-b).

(3) a. Some of the apples are not red. (WeakNeg)
    b. All of the apples are not red. (Alternative)
    c. ↝ Some of the apples are red. (Lower-bounding SI)

Experiment 2 tested the same WeakPos sentences as in Experiment 1, but compared them to their NegStrong variants, where the stronger scale-mate of the weaker term is embedded under negation, as in (4). The latter give rise to the same lower-bounding SIs as the WeakNeg sentences above. All theories of SIs explain the SI in (4-c) as arising from the alternative in (4-b).

(4) a. Not all of the apples are red. (NegStrong)
    b. Not some (= none) of the apples are red. (Alternative)
    c. ↝ Some of the apples are red. (Lower-bounding SI)

WeakPos sentences were expected to exhibit the load effects previously reported in the literature. Their negative variants provided us with novel and more minimal comparison points for testing the predictions of the Polarity Hypothesis. If the relevant effects are specific to upper-bounding SIs, responses to WeakNeg and NegStrong sentences should be left unaffected by our manipulation of cognitive load; consequently, these sentences should pattern distinctly from their WeakPos counterparts across load conditions.

These tasks are exemplified in Figure 1. Each participant in our studies only ever completed one of these tasks. The verification task tested three scales: ⟨some, all⟩, ⟨or, and⟩ and ⟨possible, certain⟩. For each scale, we constructed one positive sentence with the weaker term (WeakPos), one negative sentence where negation takes scope below the weaker term (WeakNeg), and one negative sentence where negation takes scope above the stronger term (NegStrong). Each sentence was paired with three types of pictures depicting a situation in which it was unambiguously true (True), unambiguously false (False), or in which its truth-value depended on whether the relevant SI was computed (Target). Both Exp 1 and Exp 2 tested WeakPos sentences along with one of their negative variants, WeakNeg in Exp 1 and NegStrong in Exp 2. Figure 2 shows example sentences and pictures for each scale.²

Figure 1: Examples of (a) low-load and (b) high-load patterns used in Exp 1 and Exp 2.

Figure 2: Sentences and example displays for each scale tested in Exp 1 and Exp 2.
The model was fitted in R version 4.1.2 (R Core Team, 2021). It predicted responses in the target conditions on the basis of three categorical predictors, each with three levels (Polarity: WeakPos, WeakNeg, NegStrong; Load: No, Low, High; Scale: ⟨some, all⟩, ⟨or, and⟩, ⟨possible, certain⟩), and the interactions among them. These predictor variables were all sum-coded. The mixed effects structure of the model consisted of by-item random intercepts and by-participant random intercepts and slopes for Polarity and Scale, and their interactions, with all correlations among them.
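Sum coding for a three-level factor such as Polarity replaces the k levels with k-1 contrast columns whose entries sum to zero, so the model's intercept estimates the grand mean rather than a reference level. A small sketch (the level ordering is illustrative):

```python
import numpy as np

# Sum-coded contrast matrix for a 3-level factor: rows are levels,
# columns are the two contrasts; the last level is coded as minus
# the sum of the others.
levels = ["WeakPos", "WeakNeg", "NegStrong"]
contrasts = np.array([
    [ 1,  0],   # WeakPos
    [ 0,  1],   # WeakNeg
    [-1, -1],   # NegStrong
])

# Each contrast column sums to zero across levels, which is what makes
# the fixed-effect coefficients deviations from the grand mean.
print(contrasts.sum(axis=0))  # [0 0]
```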

Figure 3: Mean acceptance rates for each scale by sentence type, cognitive load and picture condition in Exp 1 and Exp 2. Error bars represent 95% binomial confidence intervals.

Figure 4: Posterior predictions of the Polarity × Load coefficient. The dotted horizontal line represents the predicted grand mean, and the error bars represent 95% quantiles.

Table 1: Results of hypothesis testing about the Load effect at different Polarity levels for each Scale.