Theoretical and computational models such as transfer-appropriate processing (TAP) and global matching models emphasize the encoding-retrieval interaction of memory representations in generating false memories, but the relevant neural mechanisms are still poorly understood. By manipulating the sensory modality (visual or auditory) at each processing stage (learning and test) in the Deese-Roediger-McDermott task, we found that the auditory-learning visual-test (AV) group produced more false memories (59%) than the other three groups (42–44%), i.e., the visual-learning visual-test (VV), auditory-learning auditory-test (AA), and visual-learning auditory-test (VA) groups. Functional imaging results showed that the AV group's proneness to false memories was associated with (i) a reduced representational match between the tested item and all studied items in the visual cortex, (ii) a weakened prefrontal monitoring process due to reliance on a frontal memory signal for both targets and lures, and (iii) enhanced neural similarity for semantically related words in the temporal pole as a result of auditory learning. These results are consistent with predictions based on the TAP and global matching models and highlight the complex interactions of representations during encoding and retrieval in distributed brain regions that contribute to false memories.