Conversations between people are characterized by complex, nonlinear combinations of nonverbal and neurocognitive responses that complement the words being spoken. New tools are needed to integrate these multimodal components into coherent models of conversation. We present a study and analysis pipeline for integrating multimodal measures of conversation. Data were collected using video recordings and functional near-infrared spectroscopy (fNIRS), a portable neuroimaging technology, during dyadic conversations between strangers (N = 70 dyads). Rather than running separate analyses of the neural and nonverbal data, we introduce a pipeline that combines time series from each modality, including channel-based fNIRS signals and OpenFace features that quantify facial expressions over time, into multimodal deep neural networks (DNNs) built on sequence-to-sequence RNN autoencoders (S2S-RNN-Autoencoders). We explored two measures of the resulting t-SNE embedding space: distance and synchrony. Across the dimensions integrating neural and nonverbal input features, conversing dyads tended to stay closer together than permuted pairs, and dyads exhibited significantly higher synchrony in their covariation in this space than permuted pairs. These results suggest that such mixed-methods integration may contribute to a deeper understanding of the dynamics of communication.
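
To make the pipeline concrete, the following is a minimal sketch, not the authors' implementation: a sequence-to-sequence GRU autoencoder (PyTorch) that embeds concatenated fNIRS and OpenFace time-series windows, projects the learned latent vectors with t-SNE (scikit-learn), and computes a simple dyad-distance measure. All names, dimensions, and hyperparameters (e.g. `Seq2SeqAutoencoder`, `N_FNIRS`, `LATENT`, window length `T`) are illustrative assumptions, not values from the study.

```python
# Sketch only: seq2seq RNN autoencoder over concatenated fNIRS + OpenFace
# windows, t-SNE projection of the embeddings, and mean dyad distance.
# All sizes below are placeholders, not the study's actual dimensions.
import numpy as np
import torch
import torch.nn as nn
from sklearn.manifold import TSNE

N_FNIRS, N_FACE, T, LATENT = 40, 35, 50, 16  # illustrative feature/window sizes


class Seq2SeqAutoencoder(nn.Module):
    def __init__(self, n_features, latent_dim):
        super().__init__()
        self.encoder = nn.GRU(n_features, latent_dim, batch_first=True)
        self.decoder = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.out = nn.Linear(latent_dim, n_features)

    def forward(self, x):
        # Encode each window into a single latent vector.
        _, h = self.encoder(x)                       # h: (1, batch, latent)
        z = h[-1]                                    # (batch, latent)
        # Decode by feeding the latent vector at every time step.
        dec_in = z.unsqueeze(1).repeat(1, x.size(1), 1)
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out), z                  # reconstruction, embedding


def train_and_embed(windows, epochs=20, lr=1e-3):
    """windows: (n_windows, T, N_FNIRS + N_FACE) array of z-scored features."""
    x = torch.tensor(windows, dtype=torch.float32)
    model = Seq2SeqAutoencoder(x.size(-1), LATENT)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        recon, _ = model(x)
        loss_fn(recon, x).backward()                 # reconstruction objective
        opt.step()
    with torch.no_grad():
        _, z = model(x)
    # Project the learned embeddings to 2-D with t-SNE for inspection.
    return TSNE(n_components=2, perplexity=30).fit_transform(z.numpy())


if __name__ == "__main__":
    # Synthetic data for two conversational partners, one window per segment.
    rng = np.random.default_rng(0)
    partner_a = rng.standard_normal((60, T, N_FNIRS + N_FACE))
    partner_b = rng.standard_normal((60, T, N_FNIRS + N_FACE))
    emb = train_and_embed(np.concatenate([partner_a, partner_b]))
    emb_a, emb_b = emb[:60], emb[60:]
    # Dyad distance: mean Euclidean distance between partners' trajectories.
    dyad_distance = np.linalg.norm(emb_a - emb_b, axis=1).mean()
    print(f"mean dyad distance in t-SNE space: {dyad_distance:.3f}")
```

In an analysis following the abstract's logic, this distance (and an analogous synchrony statistic over the partners' covarying trajectories) would be compared against a null distribution built from permuted, non-conversing pairs.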