Human infants have the remarkable ability to learn the associations between object names and visual objects from inherently ambiguous experiences. Researchers in cognitive science and developmental psychology have built formal models that implement in-principle learning algorithms, and then used pre-selected and pre-cleaned datasets to test the models' ability to find statistical regularities in the input data. In contrast to previous modeling approaches, the present study used egocentric video and gaze data collected from infant learners during natural toy play with their parents. This allowed us to capture the learning environment from the learner's own point of view. We then used a Convolutional Neural Network (CNN) model to process sensory data from the infant's point of view and learn name-object associations from scratch. As the first model that takes raw egocentric video to simulate infant word learning, the present study provides a proof of principle that the problem of early word learning can be solved using the actual visual data perceived by infant learners. Moreover, we conducted simulation experiments to systematically determine how visual, perceptual, and attentional properties of infants' sensory experiences may affect word learning.