It has been conjectured that verb learning is hard because verb meanings are not readily “packaged” in the physical world. To provide new empirical evidence on this account, we analyzed egocentric video collected during natural toy-play interaction and focused on naming events in which action verbs were uttered in parent speech. Using the Human Simulation Paradigm, we showed egocentric videos of those naming events to adult observers and asked them to guess the target verb in parent speech. We found that adult observers used many different verbs to describe the same visual event, and only one of those verbs matched the verb in parent speech. Analyzing the mismatched verbs, we identified several sources of mismatch: all of the mismatched verbs were related to the target verb, but they captured different properties (temporal, semantic, etc.) of the visual event. We also found that different naming events for the same verb differed in their degree of ambiguity. Taken together, the present results provide new evidence from the child’s view, showing that verb learning is hard not only because multiple candidate meanings are embedded in each learning situation, but also because these candidate meanings span multiple dimensions of the physical world, overlap with one another, and relate to the target meaning in many different ways.