Lexicalization in the developing parser

We use children’s noun learning as a probe into the nature of their syntactic prediction mechanism and the statistical knowledge on which that prediction mechanism is based. We focus on verb-based predictions, considering two possibilities: children’s syntactic predictions might rely on distributional knowledge about specific verbs (i.e. they might be lexicalized), or they might rely on distributional knowledge that is general to all verbs. In an intermodal preferential looking experiment, we establish that, by as early as 19 months of age, verb-based predictions are lexicalized: children encode the syntactic distributions of particular verbs and use those distributions to make predictions, but they do not assume that these can be used for verbs in general. Our data suggests that syntactic knowledge begins with abstract categories and that lexically specific distributional information informs the development of parsing strategies, but not the knowledge itself. That knowledge is revealed when we take away children’s ability to rely on lexically specific knowledge, as in the current study.


Introduction
There is now a wealth of evidence that adult language comprehenders' parsing decisions are both predictive and guided, at least in part, by a language's distributional properties (Gordon and Chafetz 1990; Trueswell et al. 1993; MacDonald et al. 1994; Garnsey et al. 1997; Altmann and Kamide 1999). A major question in this literature is how these distributions are encoded and how these encodings are deployed for prediction (McRae et al. 1998; Hale 2001; Elman et al. 2004; Levy 2008; Linzen and Jaeger 2016).
In this paper, we approach this question of encoding and deployment from a developmental perspective, asking how predictive parsing interacts with syntactic bootstrapping. By 4-5 years of age, children appear to use prediction in the course of online sentence comprehension (Trueswell et al. 1999; Snedeker and Trueswell 2004; Fernald and Marchman 2006; Lew-Williams and Fernald 2007; Omaki 2010; Mani and Huettig 2012; Borovsky et al. 2012; Huang et al. 2013; Omaki et al. 2014). The nature of this developing prediction mechanism can often be seen most clearly in cases where children display interpretive biases that prevent them either from accessing a particular adult-like interpretation of a sentence or from accessing an adult-like interpretation in the first place.
Recent work has demonstrated that children utilize such predictive parsing mechanisms for the purposes of both comprehension and learning as early as 19 months of age.
But it remains unclear whether this predictive parsing mechanism is based on knowledge about the distributional characteristics of particular verbs (i.e. whether distributional knowledge is lexicalized) or on knowledge of the particular structures that are likely to occur, regardless of the lexical items that occur in those structures (i.e. whether distributional knowledge is generalized).
We investigate this question using an intermodal preferential looking experiment, showing that, by as early as 19 months of age, the predictive parsing mechanism children deploy is lexicalized. This experiment builds on a paradigm introduced by Lidz et al. (2017), which we review below.
Lidz et al. (2017) investigate 16- and 19-month-old children's predictive parsing mechanisms through the lens of syntactic bootstrapping (Gleitman 1990). Beginning with Brown 1957, a broad literature has shown that children use aspects of syntax to drive inferences about word meaning (see Lidz 2022 for a review). For example, children as young as 12 months have been shown to treat a novel word presented as a noun as referring to an object kind (Waxman and Booth 2001), and children as young as 18 months have been shown to expect a novel verb to refer to a category of events (He and Lidz 2017; Carvalho et al. 2019). Moreover, toddlers draw different inferences about verb meaning as a function of whether the novel verb occurs in a transitive or an intransitive clause (Naigles 1990; Yuan and Fisher 2009; Fisher et al. 2010).
Gertner and Fisher (2012) suggest that one way syntactic context is used in inferring verb meaning is through the distinct thematic relations associated with the subject and object position of a clause. The evidence they adduce for this claim is indirect, however, given that it is measured by the meanings children assign to entire clauses, rather than the noun phrases in those clauses. Lidz et al. (2017) test the link between syntactic position and thematic relation more directly by asking what meaning children assign to a novel noun as a function of its syntactic position. In their experiments, children are exposed to sentences like (1) and (2) along with a scene involving an agent acting on a patient using an instrument.
Lidz et al. find that by 16 months of age, children are able to appropriately infer that the tiv refers to the patient in (1) and to the instrument in (2), suggesting that knowledge of the link between syntactic position and thematic relation is in place by this age. However, at 19 months of age, children incorrectly infer that the tiv refers to the patient in both (1) and (2).
The authors argue that 19-month-olds' incorrect inferences are driven by a ballistic predictive parsing strategy that is based on the fact that all the verbs used in the study (and, as we show below, most verbs in children's input) are heavily biased toward at least taking a direct object and against only taking a prepositional phrase headed by with. This distributional bias, then, overshadows the contribution of the syntactic structure in children's noun learning because it leads them to erroneously represent (2) as though it were a simple transitive clause and consequently treat the tiv as though it were the direct object and hence as referring to the patient of the event.
Lidz et al. bolster this argument by showing that when 19-month-olds receive sentences that satisfy the purported prediction of a direct object, as in (3) and (4), they are able to correctly infer that the tiv refers to the patient in (3) and to the instrument in (4).
(3) She's wiping the tiv with that thing.
(4) She's wiping that thing with the tiv.
Further supporting this predictive parsing account, they show in a post hoc analysis that 19-month-old children with smaller verb vocabularies are better able to associate the tiv with the correct referent in (1) and (2) than are 19-month-old children with larger verb vocabularies.
One possible explanation suggested by Lidz et al. is that 19-month-old children with smaller verb vocabularies may not know the statistical distributions of known verbs well enough to use them for making predictions.
Figure 1. Ratio of [__ NP] frames to [__ with NP] frames for each verb in CHILDES, by verb frequency. The blue line shows the unweighted cumulative mean going from right to left. Add-1 smoothing has been applied to each verb's subcategorization frame counts to avoid zeros in the denominator.
One implication of this account is that children must track distributional properties in the input. This implication raises the question of how those distributional properties are encoded: as properties of the particular verbs themselves (lexicalized encoding) or as properties of the category verb (generalized encoding).
The predictions of the generalized encoding hypothesis rely crucially on verbs' subcategorization frame distributions in children's input. Nearly all verbs' distributions, at least in child-directed speech, turn out to be heavily biased toward transitive frames relative to intransitive frames with a prepositional phrase. This can be seen in Figure 1, which shows the ratio of [__ NP] frames to [__ with NP] frames extracted from all CHILDES corpora (MacWhinney 2014a, b) parsed using MEGRASP (Sagae et al. 2007). Each point in this figure is a verb, whose frequency is plotted on the x-axis. The blue line gives the unweighted cumulative mean ratio moving from right to left, with the idea that children are more likely to know higher frequency verbs. We see that this cumulative mean never dips below 10:1, suggesting a very heavy bias toward transitive frames across the frequency spectrum.
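The ratio and cumulative mean described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the verb names and counts are invented, the function names are hypothetical, and add-1 smoothing is assumed to be applied to both frame counts.

```python
def frame_ratio(np_count, with_np_count):
    """Smoothed ratio of [__ NP] to [__ with NP] frame counts.

    Add-1 smoothing on both counts (an assumption) keeps the
    denominator from being zero.
    """
    return (np_count + 1) / (with_np_count + 1)


def cumulative_mean_by_frequency(verbs):
    """Unweighted cumulative mean ratio, from most to least frequent verb.

    verbs: list of (name, frequency, np_count, with_np_count) tuples.
    Returns a list of (name, frequency, cumulative_mean) tuples.
    """
    ordered = sorted(verbs, key=lambda v: v[1], reverse=True)
    results = []
    running_total = 0.0
    for rank, (name, freq, np_count, with_count) in enumerate(ordered, 1):
        running_total += frame_ratio(np_count, with_count)
        results.append((name, freq, running_total / rank))
    return results


# Invented counts, for illustration only.
toy_verbs = [
    ("wipe", 120, 59, 2),
    ("cut", 300, 199, 9),
    ("hit", 50, 29, 0),
]
for name, freq, mean_ratio in cumulative_mean_by_frequency(toy_verbs):
    print(name, freq, round(mean_ratio, 2))
```

Each verb contributes equally to the mean regardless of its frequency, which is what "unweighted" is taken to mean here.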
Thus, both the lexicalized encoding hypothesis and the generalized encoding hypothesis are plausible descriptions of how children encode syntactic distributions for deployment during predictive parsing. We now describe an experiment aimed at pulling these two hypotheses apart.

Experiment
In this experiment, we examine how children use the syntactic context of a noun phrase (NP) to make inferences about its thematic relation. Using a word-learning task in the intermodal preferential looking paradigm (Spelke 1976; Hirsh-Pasek and Golinkoff 1999), we test children's abilities to assign a meaning to a novel noun contained in a direct object NP as compared to a prepositional object NP. For instance, given a scenario in which someone is using one object to wipe another, adults interpret the NP containing the novel word (the tiv) to refer to the thing being wiped (the patient) in (5) but to the thing being used to do the wiping (the instrument) in (6).
If children are similarly able to use this thematic role information to learn the meaning of a novel noun, in (5), we expect them to be able to link the tiv to the patient, and in (6), we expect them to be able to link the tiv to the instrument. Sentences (5) and (6) use the novel verb meek, in contrast to (1) and (2), which use the known verb wipe.
We do this replacement in order to test two hypotheses about how children make predictions about upcoming arguments. On the one hand, children's predictions might be lexicalized. In this case, children would use distributional information they have about a particular verb to make predictions. On the other hand, children's predictions might be generalized, in which case children would use their knowledge of the distribution of subcategorization frames that occur in all clauses, regardless of the verb found in that clause.
In the case of generalized predictions, we would expect 19-month-old children to use the same predictive mechanism to parse (5) and (6) as they do to parse (1) and (2), which contain the real verb wipe. This would mean that 19-month-olds who hear (5) or (6) would always associate the tiv with the patient, as they did in Lidz et al.'s Experiment 1. In contrast, in the case of verb-specific or lexicalized predictions, we would instead expect 19-month-old children to use a distinct predictive mechanism (or no predictive mechanism at all) to parse (5) and (6), since children do not have information about the distributional properties of the novel verb meek. This would mean that 19-month-olds who hear (5) or (6) would associate the tiv with the correct referent, similar to 16-month-old children in Lidz et al.'s Experiment 1 and 19-month-old children in their Experiment 3.
One possibility that arises here is that vocabulary knowledge may condition the parsing mechanism that children deploy. This is plausible in light of Lidz et al.'s finding that 19-month-old children with smaller verb vocabularies are better able to associate the tiv with the correct referent in (1) and (2) than are 19-month-old children with larger verb vocabularies. Here, we assess the possibility that a similar conditioning may be found in our paradigm by collecting information about children's vocabulary knowledge.

Apparatus and procedure
Each child arrived with their parent and was entertained by a researcher with toys while another researcher explained the experiment to the parent and obtained informed consent. The child and parent were then escorted into a soundproof room, where the child was seated either on the parent's lap or in a high chair, centered six feet from a 51" television on which the stimuli were presented at the child's eye level. If the child was on the parent's lap, the parent wore a visor to keep them from seeing what was on the screen. Each experiment lasted approximately 5 minutes, and the child was given a break if they were too restless or started crying. If the child did not complete the experiment or was extremely fussy throughout, this was noted for later exclusion from the sample.
The child was recorded during the entire experiment using a digital camcorder centered over the screen, with a sample rate of 30 frames/second. A researcher watched the entire trial on a monitor in an adjacent room with the audio off and was able to control the camcorder's pan and zoom in order to keep the child's face in focus throughout the trial. Videos were then coded offline, frame by frame and without audio, for direction of look by a research assistant blind to the experimental condition, using the SuperCoder program (Hollich 2005).

Design
Our design and stimuli were exactly the same as those used by Lidz et al. (2017) except for the audio stimuli. Participants were presented with eight trials, each involving a different verb and concomitant scene. Each of these trials was separated into two phases: the familiarization phase and the test phase. These phases are described below and Table 1 gives a sample script.

Familiarization Phase
During the familiarization phase, children were shown videos of 15-second dynamic scenes involving three objects: a human hand, an instrument manipulated by the hand, and a patient causally affected via the instrument. A recorded linguistic stimulus of the form either she's verbing the novel noun (V NP) or she's verbing with the novel noun (V with NP) was associated with each scene. Each of these pairings constitutes a level in the between-subjects structure factor.
The verb and novel noun in these frames were replaced with a known verb and a novel noun. All linguistic stimuli were recorded by the same adult female who recorded the stimuli for Lidz et al.'s experiments. The linguistic stimulus was presented three times as the scene progressed, with different lead-in words (e.g. Look!).

Test Phase
A blank screen was then shown for two seconds after each scene, during which the question where's the novel noun? was asked once. The test video began at the offset of the novel noun in the first of these questions, when a screen with separate static images of both the instrument and the patient from the previous dynamic scene was displayed. One of these images took up approximately one third, by both width and height, of the left portion of the screen and the other took up approximately one third, by both width and height, of the right portion, with a separation of approximately one third of the screen width in the middle. The side on which the instrument appeared was counterbalanced and pseudorandomized such that the instrument did not show up on the same side more than twice in a row. Two seconds after the two images were presented, the question which one's the novel noun? was played. The split screen was presented for five seconds total, after which the screen went blank. After a two-second blank screen, either the next learning phase started or an attention-getting phase involving a picture of a child and laughter was presented.

Materials
Eight verbs contained in the MacArthur-Bates Communicative Development Inventory (MCDI) checklist were chosen with the criterion that their associated event concept must support the use of an instrument. Eight novel nouns were constructed, one for each verb. Table 1 gives a sample script summarizing the above description. In the V with NP conditions, children heard with during the familiarization, while those in the V NP conditions did not, represented in the table by the parentheses. Table 2 shows each tuple of verb, novel noun, instrument object, and patient object. To control for possible order effects, we created two presentation orders for the trials by first building one pseudorandomized order according to the above sequencing criterion, then inverting it to create the second order. When crossed with the two linguistic structure levels (structure: V NP, V with NP), this yielded four stimulus sets.
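One way to picture the construction of the four stimulus sets is the sketch below. It is a hedged illustration, not the authors' script: it assumes counterbalancing means four left and four right instrument placements, uses rejection sampling to satisfy the no-more-than-twice-in-a-row criterion, and all function and label names are hypothetical.

```python
import random
from itertools import product


def max_run(seq):
    """Length of the longest run of identical adjacent items."""
    longest = current = 0
    previous = object()  # sentinel that equals no real item
    for item in seq:
        current = current + 1 if item == previous else 1
        longest = max(longest, current)
        previous = item
    return longest


def pseudorandom_sides(n_trials=8, max_repeat=2, seed=0):
    """Instrument side per trial: balanced left/right, limited repeats."""
    rng = random.Random(seed)
    sides = ["L"] * (n_trials // 2) + ["R"] * (n_trials - n_trials // 2)
    while True:
        rng.shuffle(sides)
        if max_run(sides) <= max_repeat:  # reject orders with long runs
            return list(sides)


order_1 = pseudorandom_sides()
order_2 = order_1[::-1]  # the inverted order also satisfies the criterion

structures = ["V NP", "V with NP"]
orders = {"order 1": order_1, "order 2": order_2}
stimulus_sets = list(product(structures, orders))
print(len(stimulus_sets))
```

Reversing a sequence preserves the length of its longest run, so the second order needs no separate check.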

Participants
We recruited 32 19-month-olds (16 females). Parents completed the MCDI checklist (Fenson 2007). By this index, participants' median productive verb vocabulary was 5 verbs (mean: 16 verbs, IQR: 1-30 verbs), and their median productive total vocabulary was 63 words (mean: 139.5 words, IQR: 41-251 words). The parent of one participant in the V NP condition did not submit an MCDI checklist, and for the purposes of analysis, that participant's verb vocabulary value was set to the mean across participants (but excluded from the above statistics).

Measures
Following Lidz et al., we compute two measures for each trial each child received. The first measure (familiarization proportion) is the proportion of the time each child was looking at the screen during the familiarization phase for a given trial. This measure provides a proxy for how well the child was paying attention to the pairing of the linguistic stimulus with the scene in the video. We expect that the less a child pays attention during a particular familiarization, the less likely it is that their behavior during the test phase associated with that familiarization provides evidence about the inferences they make based on the linguistic stimuli.
The second measure (object count) is the number of video frames on which each child was looking at the instrument (looks to instrument) paired with the number of frames on which they were looking at the patient (looks to patient) on each trial. This measure was calculated by converting the left-right coding of the test phase into an instrument-patient coding and then computing the relevant counts by trial for each child. Note that, unlike the first measure, this second measure is not a proportion, though we can compute a proportion from it. For the purposes of visualization and basic comparisons of means, we work with proportions computed from these counts; for the purposes of statistical analysis, we work with the counts themselves.
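The two measures can be sketched in a few lines. The frame codes ("L", "R", "away") and function names below are assumptions made for illustration; they are not the SuperCoder output format.

```python
def familiarization_proportion(on_screen):
    """Share of familiarization frames with the child looking at the screen.

    on_screen: one boolean per video frame.
    """
    if not on_screen:
        return 0.0
    return sum(on_screen) / len(on_screen)


def object_counts(looks, instrument_side):
    """Convert left-right frame codes into (instrument, patient) counts.

    looks: per-frame codes, "L", "R", or "away" (hypothetical labels).
    instrument_side: "L" or "R", the side showing the instrument.
    """
    patient_side = "R" if instrument_side == "L" else "L"
    instrument = sum(1 for code in looks if code == instrument_side)
    patient = sum(1 for code in looks if code == patient_side)
    return instrument, patient


def instrument_proportion(instrument, patient):
    """Proportion of object looks directed at the instrument."""
    total = instrument + patient
    return instrument / total if total else float("nan")


# Invented 5-second test phase at 30 frames/second (150 frames).
looks = ["L"] * 45 + ["away"] * 15 + ["R"] * 90
instrument, patient = object_counts(looks, instrument_side="R")
print(instrument, patient, round(instrument_proportion(instrument, patient), 2))
```

Frames coded as looks away from both images drop out of the counts, so the proportion is computed over object looks only.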
In addition to the measures used by Lidz et al., we also compute two measures of vocabulary knowledge based on verb vocabulary and total vocabulary in the MCDI. Because verb vocabulary and total vocabulary are highly correlated (r = 0.92), they cannot be entered into our analyses in their raw forms without giving rise to issues of collinearity. As such, we first apply principal component analysis to the logged form of these two measures. [...] observing an effect of verb knowledge, though it does not imply it. In the name of due diligence, however, we include both PC1 and PC2 in our statistical analyses with the caveat that PC2 is likely uninteresting because it explains so little variance; indeed, it may merely be capturing noise.
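A minimal sketch of the decorrelation step, assuming NumPy is available. The vocabulary values are invented, the function name is hypothetical, and log(1 + x) is used as an assumption so that children with zero productive verbs have a defined log; the source does not say how zeros were handled.

```python
import numpy as np


def vocab_pca(verb_vocab, total_vocab):
    """PCA on two logged vocabulary measures via SVD.

    Returns (scores, components, explained_variance_ratio).
    """
    # log1p is an assumption; it avoids log(0) for children with no verbs.
    X = np.column_stack([np.log1p(verb_vocab), np.log1p(total_vocab)])
    Xc = X - X.mean(axis=0)  # center each measure
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    if Vt[0].sum() < 0:
        Vt[0] = -Vt[0]  # orient PC1 so a higher score means a larger vocabulary
    scores = Xc @ Vt.T
    explained = S**2 / np.sum(S**2)
    return scores, Vt, explained


# Invented values for six children, for illustration only.
verb_vocab = np.array([0, 1, 5, 12, 30, 45])
total_vocab = np.array([20, 41, 63, 120, 251, 400])

scores, components, explained = vocab_pca(verb_vocab, total_vocab)
high_vocab = scores[:, 0] > np.median(scores[:, 0])  # median split on PC1
print(np.round(explained, 3), high_vocab)
```

Because the two component scores are uncorrelated by construction, both can enter a regression without the collinearity that the raw measures would introduce.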
3 An anonymous reviewer suggests that the effect of verb vocabulary knowledge and total vocabulary knowledge might be distinguished using a residualization strategy: residualize one vocabulary variable against the other, then analyze the effects of the residualized variable and the raw variable it was residualized against. Unfortunately, this method does not help in this context, precisely because the variables are so highly correlated. Residualizing either total vocabulary against verb vocabulary or vice versa will necessarily result in the unresidualized variable being highly correlated with the first principal component: log(verb vocabulary) has a 0.99 correlation with PC1 and log(total vocabulary) has a 0.96 correlation. The extent to which the residualized variable is correlated with PC2 will depend on the class of models selected for use in residualization. Thus, residualization not only introduces two additional researcher degrees of freedom (the direction in which to residualize and the class of models to use) but also raises the likelihood of misinterpretation: seeing a reliable effect for the raw variable does not mean that that variable indeed has an effect to the exclusion of the other. See Wurm and Fisicaro 2014 for further discussion.
For the purposes of reporting statistics, we use the continuous form of both variables. For the purpose of visualization, we discretize the first principal component at its median, referring to the group of children that have a vocabulary score above the median as the high vocab group and the group of children that have a vocabulary score below the median as the low vocab group, since scoring more positively on the first principal component implies having a larger total vocabulary and a larger verb vocabulary.
To assess the reliability of this pattern, we follow Lidz et al. in using a logistic mixed effects model with object count as the dependent variable, random intercepts for child and item, by-item random slopes for structure, and a loss weighted by familiarization proportion. We first fit such a model with fixed effects for structure, PC1, and PC2 as well as the two-way interaction between structure and PC1 and the two-way interaction between structure and PC2. We test the reliability of these interactions using a log-likelihood ratio test. We find that the model that includes both interactions is reliably better than the one that does not include the interaction between structure and PC1 (χ²(1) = 3.98, p < 0.05) but a similar pattern is not observed for the interaction between structure and PC2 (χ²(1) = 0.28, p = 0.60). Thus, the apparent interaction between structure and PC1 seen in Figure 3 is reliable.
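The p-values attached to the two likelihood ratio statistics can be checked with the standard library alone. The sketch below assumes the usual chi-square reference distribution with 1 degree of freedom, whose upper tail equals erfc(sqrt(x / 2)); the function names are hypothetical, and the log-likelihoods of the fitted models are not given in the text, so only the reported statistics are used.

```python
import math


def lrt_statistic(loglik_full, loglik_reduced):
    """Likelihood ratio statistic for two nested models."""
    return 2.0 * (loglik_full - loglik_reduced)


def lrt_p_value_df1(chi2_stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    With 1 df, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi2_stat / 2.0))


p_structure_by_pc1 = lrt_p_value_df1(3.98)  # statistic reported in the text
p_structure_by_pc2 = lrt_p_value_df1(0.28)  # statistic reported in the text
print(round(p_structure_by_pc1, 3), round(p_structure_by_pc2, 2))
```

Both values agree with the reported thresholds: the first falls just under 0.05 and the second rounds to 0.60.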

Discussion
What the failure of 19-month-olds with smaller vocabularies implies is less clear. We suggest two possibilities (see footnote 4). The first is that the failure of 19-month-olds with smaller vocabularies in our experiment is not indicative of the predictive parsing strategy these children use at all. It may simply be that having to process two novel words at once (both a verb and a noun) is particularly burdensome for children with smaller vocabularies for whatever reason it is that they have smaller vocabularies in the first place. Depending on what this reason is, this account might predict either that all 16-month-olds would similarly fail in our experiment (e.g. if the failure is simply about amount of vocabulary knowledge) or that, similar to the results of our experiment, 16-month-olds with larger vocabularies would succeed but those with smaller vocabularies would fail (e.g. because differences in vocabulary knowledge at a particular age index cognitive resources relevant to processing two novel words at once).
The second possibility is that 19-month-olds with smaller vocabularies, unlike those with larger vocabularies, make predictions in our experiment based on verb-general knowledge, plausibly because they are less certain about those specific verbs' distributional properties.
This uncertainty might arise in two different ways: (i) children who know fewer verbs tend to have less experience with the verbs they do know, e.g. because less vocabulary knowledge is indicative that the verbs that they do know were more recently learned; or (ii) children who know fewer verbs need additional evidence about a specific verb to become certain enough about its distribution to use that distribution in predictive parsing. This second version might be plausible insofar as knowledge of verbs' distributional properties is hierarchical (Perfors et al. 2010) and thus children who know more verbs require less evidence to acquire the distributional properties of a verb whose distribution is prototypical relative to the verbs they already know.
4 An anonymous reviewer suggests a third: that 19-month-olds with smaller vocabularies are more likely to replace the novel verbs in our experiment with known verbs and then rely on distributional knowledge about those known verbs for prediction. Because the known verbs corresponding to the actions in our experiment are biased to be found with direct objects, this account makes the same predictions as an account (discussed below) wherein these children use generalized encodings for prediction, while keeping constant that all 19-month-olds' predictions are based on lexicalized encodings.
While this account has the welcome consequence of keeping constant the knowledge on which children's predictive parsing is based, we suspect it will not turn out to be correct for two reasons. First, it is not clear why 19-month-olds with smaller vocabularies would be more likely than 19-month-olds with larger vocabularies to replace novel verbs with known verbs. And even if they did so, it is not clear why they would be able to use these verbs' distributional knowledge in our experiment when they were not able to in Lidz et al.'s Experiment 1. Second, the literature on fast mapping suggests that children do not assume that novel verbs are aliases for known verbs: as early as two years old, children tend to map novel verbs to novel actions (Merriman et al. 1996; Golinkoff et al. 1996). It is possible that their assumptions differ when it comes to predictive parsing, but we know of no evidence to this effect.
A major hurdle faced by either version of this account is that, if 19-month-olds with smaller vocabularies make predictive parsing decisions based on generalized encodings and thus fail in our experiment, it is unclear why they do not similarly fail in Lidz et al.'s Experiment 1. To overcome this hurdle, such an account would likely need to posit that 19-month-olds with smaller vocabularies are unable to deploy predictive parsing for known verbs, e.g. because they attempt to make predictions on the basis of lexicalized encodings but fail to do so in the face of the uncertainty inherent to that knowledge.
One way to test this account might be to turn novel verbs in our experiment into "known" verbs by exposing children to dialogues containing the novel verb and then testing their noun learning using the same stimuli we use here (Yuan and Fisher 2009; Arunachalam and Waxman 2010; Yuan et al. 2011).

Conclusion
The study just reported adds support to the view that 19-month-olds have knowledge of the link between syntactic position and thematic relation. The fact that they can use the syntactic position of an NP to assign it an interpretation supports theories of word learning that treat syntactic structure as informative (Gleitman 1990), and more indirectly, theories of verb-learning that use the thematic relations of the NPs in a clause as evidence about the meaning of the verb (Gertner and Fisher 2012; Perkins 2019). However, 19-month-olds' ability to deploy the link between syntactic position and thematic relation can be disrupted during sentence comprehension by lexicalized knowledge of verb-argument structure. Whereas prior work showed that 16-month-olds, but not 19-month-olds, successfully map a novel noun phrase to different referents depending on its syntactic position, the current work shows that 19-month-olds' failure in previous work resulted from their knowledge of specific verb distributions. In the current study, 19-month-olds with larger vocabularies were able to correctly identify the referent of a novel noun phrase as a function of syntactic position even with novel verbs. The fact that having a larger vocabulary helped these children to avoid a parsing error with novel verbs suggests that their prior failures derive from knowledge of specific verb distributions and not from a general knowledge that transitive clauses are more likely than intransitive clauses.
The finding that 19-month-olds' syntactic predictions are driven by lexicalized subcategorization frequencies comports well with work from older children and adults (Trueswell et al. 1993; Trueswell and Kim 1998; Snedeker and Trueswell 2004; Altmann and Kamide 2007; Borovsky et al. 2012). It further adds to this literature by showing that lexically driven syntactic predictions occur from the earliest stages of language development. As soon as children have acquired lexical statistics, they appear to use that information to drive parsing predictions.
Our data also informs a debate concerning the origins of children's early syntactic knowledge.
To what degree is early syntactic knowledge associated with specific lexical items (Tomasello and Kruger 1992; Theakston et al. 2015; Lieven 2016) and to what degree does syntactic knowledge abstract away from specific lexical items (Gertner et al. 2006; Naigles 2002; Fisher et al. 2010; Viau and Lidz 2011)? Our data suggests that syntactic knowledge begins with abstract categories and that lexically specific distributional information informs the development of parsing strategies, but not the knowledge itself. That knowledge is revealed when we take away children's ability to rely on lexically specific knowledge, as in the current study.

Additional File
The additional file for this article can be found as follows:
• Appendix A. Simulation-based post hoc power analysis. DOI: https://doi.org/10.5070/G601148.s1