Immature Vocalizations Simplify the Speech of Tseltal Mayan and U.S. Caregivers

What is the function of immature vocalizing in early learning environments? Previous work on infants in the United States indicates that prelinguistic vocalizations elicit caregiver speech which is simplified in its linguistic structure. However, there is substantial cross-cultural variation in the extent to which children’s vocalizations elicit responses from caregivers. In the current study, we ask whether children’s vocalizations elicit similar changes in their immediate caregivers’ speech structure across two cultural sites with differing perspectives on how to interact with infants and young children. Here, we compare Tseltal Mayan and U.S. caregivers’ verbal responses to their children’s vocalizations. Similar to findings from U.S. dyads, we found that children from the Tseltal community regulate the statistical structure of caregivers’ speech simply by vocalizing. Following the interaction burst hypothesis, where clusters of child-adult contingent response alternations facilitate learning from limited input, we reveal a stable source of information that may facilitate language learning within ongoing interaction.


Introduction
Across several species, adults' contingent responses to their offspring's immature vocalizations play a key role in communicative development. Vocal learning studies on humans (Goldstein & Schwade, 2008), songbirds (Carouso-Peck & Goldstein, 2019), and marmoset monkeys (Gultekin & Hage, 2017, 2018; Takahashi, Fenley, & Ghazanfar, 2016) demonstrate that adults who coordinate their vocalizations with those of their offspring create contingent social feedback that facilitates advancements in vocal communication. However, contingent verbal responses may be rare in human infants' experience (Fagan & Doveikis, 2017; Goldstein & Schwade, 2008; Goldstein, King, & West, 2003). When these responses are rare, how might they facilitate learning? Recent findings demonstrate that child-directed talk in both U.S. and non-Western communities is organized around children's vocalizations and predicted by routine activities throughout the day, although the style, context, and source of talk are highly variable between communities (Bergelson, Amatuni, Dailey, Koorathota, & Tor, 2019; Brown, 2011, 2014; Casillas, Brown, & Levinson, 2020). Per the interaction burst hypothesis, predictable clusters of interactive language learning opportunities may maximize learning across contexts that vary widely in their child-directed language style (Casillas et al., 2020, 2021). Learning may be maximized when adult speech within vocal turn-taking bouts with children is structurally simplified and thus easier to learn from. Simpler instances of utterance structure may be particularly useful for children when first breaking into the structure of their ambient language.
What role do children play in structuring the learnability of ambient language? By 9 months, English-learning infants in the United States regulate the complexity of their caregivers' speech by vocalizing (Elmlinger, Park, Schwade, & Goldstein, 2021; Elmlinger, Schwade, & Goldstein, 2019a, 2019b). Maternal utterances that are child-directed and contingent on infant vocalizations are lexically and syntactically simplified compared to utterances that are similarly child-directed but not contingent on infant vocalizations. The contingent speech was found to have fewer unique word types, fewer words per utterance, and higher proportions of one-word utterances (Elmlinger, Schwade, & Goldstein, 2019b). Thus, responses to babbling reduce the complexity of caregivers' speech in ways that may facilitate learning (Schwab & Lew-Williams, 2016). Presently, however, our understanding of children's effects on caregiver speech is limited to a North American cultural context. The cross-cultural extent of this effect is unknown. Why would it vary?
Differences in caregivers' attitudes about child language development and socialization across cultures, specifically about when and how to respond to child vocalizations, may predict whether contingent responses to infants' vocalizations are simplified. If the simplification effect of contingent speech relies upon the pedagogical attitudes of the speaker, then Tseltal adult caregivers, who are less likely than U.S. adult caregivers to engage infants in child-centric and pedagogical speech interactions (see below), may not show the simplification effect found in U.S. adults. Alternatively, the simplification effect may be independent of pedagogical attitudes, and the contingent simplification found in U.S. caregivers may present as a stable feature of language learnability across multiple languages and cultures. To shed light on these possibilities, we focus on comparing the speech simplification of Tseltal Mayan and U.S. caregivers.

Ethnographic background
The Tseltal participants in the present study live in a rural Mayan community in the mountains of southern Chiapas, Mexico. Most caregivers in the sample are horticulturalists and most children are raised in multigenerational patrilocal family compounds (i.e., typically near their nuclear family, paternal grandparents, and paternal uncles' nuclear families). Tseltal is the primary language spoken at home; Spanish is acquired in primary school, after the first couple of years in Tseltal.
Young children are carried for much of their first year and are socialized to attend to the social interactions occurring around them rather than expecting to be the center of adult attention. Longitudinal ethnographic research suggests that speech directed to children is often brief, involves three or more participants, and focuses on appropriate actions and responses rather than words and word meanings (Brown, 2014). Across a waking day, children under age 3;0 hear an average of 3.6 min of speech directly addressed to them per hour (Casillas et al., 2020), which may be comparable to averages from other communities (e.g., in the United States) but includes a greater preponderance of directed speech from other children (Bunce et al., under review). As children become more competent language users, they begin to more effectively engage other children and adults as conversational partners (Brown, 2011, 2014).

Infant-directed speech in Tseltal
Tseltal infant-directed speech is not recognizable as such by naïve Western listeners, who cannot effectively distinguish it from adult-directed speech by the same speakers, despite high reported confidence ratings (Soderstrom et al., 2021). Ethnographic reports suggest that imperatives, repetitive social routines, and immediate turn repetition are the most common types of speech to infants and young children, while questions and pedagogical talk are much less common (Brown, 2011, 2014). Pye's (1986) in-depth analysis of infant-directed speech in K'iche', a different Mayan language, demonstrates that while there is a distinct register for talking to infants, the ways in which pitch, phonology, lexical forms, and morphosyntactic choices are modified are language-specific and do not necessarily involve simplification. For example, there were no appreciable differences in MLU (mean length of utterance) in morphemes between adult- and infant-directed speech.

Present study
S. L. Elmlinger, M. H. Goldstein, M. Casillas / Topics in Cognitive Science 0 (2022)
We focused on three aspects of variation in caregiver linguistic structure (lexical diversity, syntactic complexity, and one-word utterances) that may influence the language acquisition process via simplification. We first measured lexical diversity, which has effects on learning that differ by timescale: partially repetitive speech positively predicts language learning in the short term (Onnis & Edelman, 2019; Schwab & Lew-Williams, 2016), while variability in lexical input predicts the same outcomes in the long term (Huttenlocher, Waterfall, Vasilyeva, Vevea, & Hedges, 2010). To measure lexical diversity, we counted the number of unique words (types) in parents' talk to children. We next measured syntactic complexity using utterance length as a proxy; the length of parents' utterances to children in everyday learning environments is sensitive to when individual words emerge in children's own production, getting shorter just before and then increasing again after a word emerges (Roy, Frank, & Roy, 2009). We assessed the syntactic complexity of parents' speech by determining the mean length of utterances in words (MLUw) (Parker & Brorson, 2005). Finally, we measured one-word utterances because isolated words spoken to children are predictive of the words that are most likely to be produced by children later in development (Brent & Siskind, 2001). We measured the proportion of utterances composed of a single word. By investigating the structure of contingent and non-contingent speech across Tseltal and U.S. caregivers, we can better understand the role that children's vocalizations may play in influencing their own communicative development.
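As an illustration of the three measures described above, the following minimal Python sketch computes the number of unique word types, MLUw, and the proportion of one-word utterances for a list of transcribed utterances. The function name `structure_measures`, the whitespace tokenization, and the lowercasing step are assumptions made for this example only; they are not the transcription or counting conventions used in the study.

```python
def structure_measures(utterances):
    """Return (unique word types, MLUw, proportion of one-word utterances).

    Hypothetical helper: tokenization by whitespace and lowercasing are
    illustrative assumptions, not the study's actual conventions.
    """
    tokenized = [u.lower().split() for u in utterances]
    tokenized = [t for t in tokenized if t]  # drop empty utterances
    if not tokenized:
        return 0, 0.0, 0.0

    # Lexical diversity: count of distinct word forms (types).
    types = {word for tokens in tokenized for word in tokens}

    # Syntactic complexity proxy: mean length of utterance in words (MLUw).
    total_words = sum(len(tokens) for tokens in tokenized)
    mluw = total_words / len(tokenized)

    # Proportion of utterances consisting of exactly one word.
    one_word = sum(1 for tokens in tokenized if len(tokens) == 1) / len(tokenized)

    return len(types), mluw, one_word


# Toy example (invented utterances, not study data).
example = ["look", "look at the ball", "ball"]
n_types, mluw, prop_one = structure_measures(example)
```

In the study, these quantities would be computed separately for contingent and non-contingent utterances and then compared within each site.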
Because of the limited comparability across samples included in this study, here we treated the Tseltal and U.S. measures as two individual case studies. We did not make direct statistical comparisons across sites due to differences in participants' age and recording context. The central question of interest is whether speech structure is altered in qualitatively similar ways when organized around children's vocalizations across different cultural contexts. The extent to which we see similar patterns of speech structure change across cultures serves as evidence for or against a generalized mechanism driving simplification in contingent caregiver talk.

Participants
Ten Tseltal children between 2 and 36 months were recorded in 2015 during their everyday routines in Chiapas, Mexico. Families were recruited via snowball sampling in the community and were given a small cash gift for their participation in the study.
Thirty U.S. caregiver-infant pairs participated when infants were 5 and 10 months of age (Table 1). We recruited these subjects from birth announcements in advertisements and local newspapers. As a gift for participation in the study, families received a t-shirt or a bib.

Recordings and procedure
The Tseltal recordings analyzed here are the same used in Casillas et al. (2020). The recordings were sampled from a total set of 55 to achieve an overall balance in sex, maternal education, and age range between 0 and 36 months (Soderstrom et al., 2021). On the morning of each recording, children donned an elastic vest containing a horizontally stored Olympus WS-832 stereo audio recorder and a small camera on a vertical shoulder strap (images are not analyzed here). Infants too small to comfortably wear both pieces of equipment were outfitted with a onesie shirt that had a horizontal pocket to store the recorder (Fig. 1). Tseltal children wore the recorder continuously throughout the ∼9-h recording unless they needed to be bathed or if wearing the equipment during a nap would inhibit their sleep; in this case, caregivers were instructed to place the recorder near the child. That same evening, the experimenters returned to collect the equipment.
All recordings of the U.S. data took place in a naturalistic environment in a 12-foot by 18-foot playroom which included a toy box, toys, and animal posters. This environment afforded infants the freedom to play and explore around the room as they wished. Three digital cameras were stationed in the room and remote-controlled by experimenters capturing the video recordings. Infants wore overalls which concealed a wireless microphone (Telex FLM-22) paired to a transmitter (Telex USR-100). Before each session, wireless lapel microphones (Telex FLM-22) were affixed to caregivers' shirts. Caregiver microphones were connected to transmitters hidden in a pouch at their waist (Telex USR-100) (Fig. 1). Infants' vocalizations and caregivers' speech were recorded on separate audio channels. See Table 1 for more details of the participants in the study across recording sites.
Each U.S. participant engaged in 15-min play sessions in the lab. During these sessions, caregivers were asked to play like they would at home, resulting in unstructured free play.

Speech transcription
Tseltal parents' speech sample consists of 60 min of transcription per recording. Forty-five of the 60 total minutes were randomly selected 5-min clips. These speech samples were annotated and transcribed jointly by the visiting Western researcher and a local member of the community who natively spoke Tseltal and knew all the recorded families personally. Annotations included full transcriptions of all hearable speech, with an indication for each utterance regarding to whom it was addressed (e.g., to the target child only "TCDS"; to any child(ren) present "CDS"; and to adults "ADS"). "TCDS" and "CDS" are examined in the present work. Note that because Tseltal is a mildly polysynthetic language, words typically contain multiple morphemes. The further 15 of the 60 total minutes were hand-selected from the remaining, unannotated portions of each recording. A comprehensive review of each audio recording, excluding the original random clips, allowed us to identify the five top 1-min segments of turn-taking between the target child and their interactants, then the five top 1-min segments of target child vocalization from the remaining recording times. The most active interaction captured in those 10 1-min clips was then expanded a further 5 min, with all additional 15 min of clip time per recording fully annotated using the same standards as the random clips. This process resulted in 1 h of fully transcribed and annotated recording time from each of the 10 daylong recordings, representing both baseline and high-activity speech periods (i.e., 10 h of audio in total). Speech in these recordings comes from many speakers; here, we focus exclusively on the speech from each target child's mother for better comparison to the U.S. data.
We considered two groups of Tseltal caregiver speech: target-child-directed speech (TCDS), which was caregiver speech directed to the target child being recorded, and all child-directed speech in general (+CDS), which included TCDS plus any caregiver speech which was directed at other children. Because Tseltal children may treat both TCDS and +CDS as relevant learning cues, we analyze both for changes when they are contingent and not contingent on the target child's noncry vocalizations.
The speech that U.S. caregivers produced was completely transcribed. If caregivers' utterances were separated by silence longer than 2 s in duration and/or if their utterance exhibited a terminal pitch contour, they were segmented into separate utterances (Stockman, 2010; Venker et al., 2015). All caregiver utterances were directed to their infant. We excluded caregivers' vocal sound effects from the analyses.
Tseltal and U.S. caregivers' utterances were considered contingent when they occurred within 2 s of the offset of the target child's vocalization (Elmlinger et al., 2019a). Caregiver utterances that occurred more than 2 s after the offset were considered non-contingent (Fig. 2). The 2-s window follows the previous studies that originally reported the simplification of contingent speech (Elmlinger et al., 2019b). Responses to infant vegetative vocalizations, such as coughs, cries, and fusses, were excluded from the analysis.
Fig. 2. U.S. and Tseltal caregiver utterances were considered contingent when they occurred within 2 s of the target child's vocalizations. This same contingency definition was used for Tseltal TCDS and +CDS analyses.
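The 2-s contingency criterion can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `label_contingency`, the use of caregiver utterance onsets, the inclusive window boundary, and the treatment of caregiver speech that begins before the child's offset (labeled non-contingent here) are all assumptions not specified in the text. The child offsets passed in are assumed to have already had vegetative vocalizations removed.

```python
CONTINGENCY_WINDOW_S = 2.0  # window length taken from the text


def label_contingency(child_offsets, caregiver_onsets, window=CONTINGENCY_WINDOW_S):
    """Label each caregiver utterance 'C' (contingent) or 'NC' (non-contingent).

    Assumptions for illustration: times are in seconds, an utterance is
    contingent if its onset falls 0 to `window` s after any child
    vocalization offset, and overlapping speech is treated as 'NC'.
    """
    labels = []
    for onset in caregiver_onsets:
        is_contingent = any(
            0.0 <= onset - offset <= window for offset in child_offsets
        )
        labels.append("C" if is_contingent else "NC")
    return labels


# Toy timeline (invented times, not study data).
child_offsets = [10.0, 30.0]
caregiver_onsets = [11.5, 14.0, 29.0, 31.9]
labels = label_contingency(child_offsets, caregiver_onsets)
# labels == ["C", "NC", "NC", "C"]
```

The same labeling would apply to both the TCDS and +CDS analyses, since the text states that a single contingency definition was used for both.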

Child utterances
The onsets and offsets of all Tseltal infant nonvegetative, communicative vocalizations (including laughter, fussing, and crying) were annotated and segmented approximately according to breath groups, with some exceptions (e.g., longer bouts of crying). Lexical vocalizations were transcribed; all other vocalizations were classified as containing canonical syllables or not, or as containing laughter or crying.

Analytic approach
We employed linear and logistic mixed regressions with the lme4 package in R (Bates, Mächler, Bolker, & Walker) to predict linguistic structure from contingency, controlling for target child age, with participants as a random effect. We used linear regression in comparing the proportion of child noncry vocalizations which elicited a response across sites and in comparing caregivers' distribution of contingent to non-contingent utterances.
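One way to write out the mixed regressions described above is given below. This is a sketch under stated assumptions: the symbols ($\beta$, $u_j$, $\varepsilon_{ij}$), the indexing of observations $i$ within caregiver $j$, and the pairing of the linear form with continuous measures and the logistic form with a binary single-word outcome are illustrative choices, since the text does not specify which measure was fit with which model.

```latex
% Linear mixed model (assumed for continuous measures such as MLUw)
y_{ij} = \beta_0 + \beta_1\,\mathrm{contingency}_{ij} + \beta_2\,\mathrm{age}_{ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)

% Logistic mixed model (assumed for a binary outcome, e.g., single-word utterance or not)
\operatorname{logit}\,\Pr(y_{ij} = 1) = \beta_0 + \beta_1\,\mathrm{contingency}_{ij} + \beta_2\,\mathrm{age}_{ij} + u_j
```

Here $u_j$ is the random intercept for participant $j$, and $\beta_1$ is the coefficient of interest: under this coding, a negative $\beta_1$ for unique words or MLUw, or a positive $\beta_1$ for single-word utterances, would indicate simplification in contingent speech.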

Caregiver speech: Linguistic comparisons
To compare lexical diversity across contingent and non-contingent speech, the number of unique words caregivers produced was calculated for contingent and non-contingent utterances (Fig. 5a and Table 2). When only considering Tseltal target-child-directed speech (TCDS), contingent and non-contingent speech contained equal counts of unique words. When including all child-directed speech (+CDS) in Tseltal unique word counts, there were significantly fewer unique words spoken in contingent speech. U.S. caregivers produced significantly fewer unique words in their contingent speech relative to non-contingent speech.
To compare syntactic complexity across speech types, caregivers' MLUw was calculated for contingent and non-contingent utterances (Fig. 5b and Table 2). Tseltal TCDS and +CDS had significantly shorter contingent utterances than non-contingent utterances. U.S. contingent utterances were significantly shorter than non-contingent utterances.
To further test syntactic complexity, the proportion of utterances which contained a single word was calculated for contingent and non-contingent utterances (Fig. 5c and Table 2). Tseltal TCDS and +CDS had a significantly higher proportion of contingent than non-contingent utterances that were a single word (Table 2). U.S. contingent utterances were also more likely to contain only a single word compared to non-contingent utterances (Table 2).

Discussion
U.S. and Tseltal caregivers simplified the statistical and syntactic structure of their speech in response to their child's vocalizations. The simplification pattern generally holds despite cultural differences in how caregivers talk to children, as documented in prior ethnographic work (Brown, 2011, 2014). Here, we observed the simplification pattern despite differences in the extent to which child vocalizations elicited caregiver responses and in the relative frequency of non-contingent utterances. In both groups, contingent speech largely contained fewer unique words, contained shorter utterances, and was more likely to be a single-word utterance. Together, these characteristics of caregivers' contingent speech suggest a stable form of influence of children's immature vocalizing on the ambient linguistic environment. Children's vocal behavior may create language learning opportunities by eliciting responses from caregivers that contain more learnable information.
Abbreviations: TCDS, Tseltal caregivers' target-child-directed speech; +CDS, all Tseltal child-directed speech (i.e., target-child-directed speech and other-child-directed speech).
The lexical and syntactic simplification found in contingent speech may benefit children at the beginning of their language-learning process. Reduced contingent lexical diversity is likely beneficial as clusters of successive word repetitions predict children's learning of the repeated words (Schwab & Lew-Williams, 2016). Shorter caregiver utterances, and single-word utterances in particular, simplify the task of finding word boundaries and thereby facilitate language learning (Lew-Williams, Pelucchi, & Saffran, 2011).
As the pattern of contingent simplification appears largely similar across Tseltal and U.S. caregivers, the simplification effect of contingent speech may occur independently of the attitudes toward language pedagogy in a given community. Reports demonstrate that adult Tseltal speech to children does not typically include the attention-getting and child-centric features typical of U.S. English infant-directed speech (Brown, 2011, 2014; Soderstrom et al., 2021). Thus, the source of the simplification effect of immediate responses to children's vocalizations may have more to do with the immaturity of children's vocalizations than adults' goals when interacting with young children. The language proficiency of one interaction partner may predict the extent of linguistic simplification in the other's contingent responses (Elmlinger et al., 2021). Future work may investigate this further with TCDS versus +CDS; here, we found similar but nonidentical simplification effects in Tseltal. If children's learning benefits from simplification, they may benefit indirectly from speech directed contingently to their peers. While the main focus of the present work was to test whether similar contingent speech simplification effects emerge across Tseltal and U.S. cultures, future research will need to extensively examine the underlying mechanisms of simplification, including the extent to which they are similarly or differently employed both within and across cultural groups.
Tseltal children's vocalizations elicited lower rates of caregiver responses than those of the U.S. infants. This difference may arise from a number of limitations of the present study: differences between the Tseltal and U.S. samples in recording duration, recording context, and child age may all have contributed. Recent research on home recordings of U.S. infants suggests that infant vocalizations may elicit caregiver responses at a rate of 21%, a rate comparable to the Tseltal data presented here (Lopez, Walle, Pretzer, & Warlaumont, 2020). However, because Lopez et al. (2020) relied upon automatic coding of caregiver and infant utterances, which may overestimate the amount of turn-taking in a recording, future work conducted with manual annotation is required to fully understand cross-cultural differences in response rates (Ferjan Ramírez, Hippe, & Kuhl, 2021).
Our results suggest that children, via immature vocalizing, play an important role in shaping their own language environment in distinct cultural contexts. Future research is required to address the comparative limitations in the present work, including the differences in recording duration, interpersonal context, and target child age. Currently, we are also adapting the primary measures so that they are based on lemmas (lexical diversity) and morphemes (MLU) rather than words. This more morphosyntactically informed approach should give us a more accurate view of simplification effects, given the differing morphological systems of the languages. In spite of these limitations, this work substantially advances our understanding of how children's real-time interaction with adults changes their linguistic experiences and thereby may facilitate the process of language learning.
Fig. 3. Proportion of child vocalizations which elicited a contingent response. Black lines indicate ± 1 standard error around the mean.

Fig. 4. Distribution of caregiver utterance type per site. Each bar along the x-axis represents an individual caregiver.

Fig. 5. Simplification of contingent speech across Tseltal and U.S. caregivers. Each of the three dependent measures is given per contingent (left) and non-contingent (right) utterances: (A) number of unique words, (B) mean length of utterance in words (MLUw), and (C) proportion of single-word utterances. Line-connected data points indicate data from an individual caregiver. Means ± standard error are shown in bold. Abbreviations: C, contingent; NC, non-contingent.

Table 2
Comparison of contingent and non-contingent speech structure across sites. Estimates derived from the following model structure: caregiver speech structure ∼ contingency + infant age + (1|subject).