This study introduces a neural network that models social interactions from a video corpus. The corpus consists of recordings of naturalistic observations of social interactions among children and their environment. The videos are annotated multimodally, including features such as gestures. We explore how this video corpus can be used for modelling by training our model on a portion of the annotated data extracted from the corpus and then using the model to predict novel interaction sequences. We evaluate the model by comparing its automatically generated sequences to an unseen portion of the corpus data. The initial results show strong similarities between the generated interactions and those observed in the corpus.