With the integration of emoji into digital keyboards, people are increasingly using multimodal interactions between text and image in real-time communication. One technique of using emoji is to substitute them into sentences. Here we investigate the online processing of these interactions by modulating either the grammatical category of those substitutions (Experiment 1: nouns vs. verbs) or the type and location of substitutions (Experiment 2: emoji vs. logos, within sentences vs. at their end). We found a processing cost in self-paced reading times for images compared to words, a cost that extended past the emoji itself, but no difference in comprehensibility ratings between word and congruent-image substitutions. Overall, these results suggest that, despite costs of switching modalities, text and images can be integrated into holistic multimodal expressions.