Most current research on commonsense question answering (CQA) has focused on techniques from natural language processing and text-based information retrieval. However, from the perspective of human cognition, retrieving and organizing answers to commonsense questions from textual knowledge alone is far less intuitive and comprehensive than drawing on multi-modal knowledge such as related images and videos. Motivated by this, we propose a framework that acquires knowledge from diverse modalities and embeds and integrates it into CQA tasks, improving both performance and user experience. Specifically, this paper proposes integrating multi-modal knowledge, including images, image description statements, image scene graphs, and knowledge sub-graphs, into a CQA system. It introduces a parallel embedding technique for these modalities and employs an alignment-interaction-fusion mechanism to integrate them seamlessly. Extensive experiments demonstrate the effectiveness and superiority of the proposed method.
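To make the named pipeline concrete, the following is a minimal PyTorch sketch, not the authors' implementation, of how parallel embedding and an alignment-interaction-fusion mechanism could be wired together: each modality is embedded in parallel and aligned to a shared space, the question interacts with each modality via cross-attention, and a simple gate fuses the results before answer scoring. All module names, dimensions, and the specific gating choice are illustrative assumptions.

```python
# Sketch of a parallel-embedding, alignment-interaction-fusion CQA model.
# Modalities, dimensions, and fusion details are assumed for illustration.
import torch
import torch.nn as nn

class ParallelMultiModalCQA(nn.Module):
    def __init__(self, dims, hidden=256, num_answers=5):
        super().__init__()
        # Parallel embedding / alignment: one projection per modality into a shared space.
        self.align = nn.ModuleDict({name: nn.Linear(d, hidden) for name, d in dims.items()})
        # Interaction: the question attends over each aligned modality.
        self.interact = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        # Fusion: gated combination of question representation and attended evidence.
        self.gate = nn.Linear(2 * hidden, hidden)
        self.scorer = nn.Linear(hidden, num_answers)

    def forward(self, question, modalities):
        # question: (B, Lq, d_q); modalities: dict of (B, L_m, d_m) feature sequences.
        q = self.align["question"](question)
        attended = []
        for name, feats in modalities.items():
            m = self.align[name](feats)                      # align modality to shared space
            out, _ = self.interact(q, m, m)                  # question-to-modality interaction
            attended.append(out.mean(dim=1))                 # pool over question tokens
        evidence = torch.stack(attended, dim=1).mean(dim=1)  # fuse evidence across modalities
        pooled_q = q.mean(dim=1)
        gate = torch.sigmoid(self.gate(torch.cat([pooled_q, evidence], dim=-1)))
        return self.scorer(gate * pooled_q + (1 - gate) * evidence)

# Usage with random features standing in for encoder outputs.
dims = {"question": 768, "image": 512, "caption": 768, "scene_graph": 300, "kg_subgraph": 300}
model = ParallelMultiModalCQA(dims)
question = torch.randn(2, 16, 768)
modalities = {
    "image": torch.randn(2, 36, 512),
    "caption": torch.randn(2, 20, 768),
    "scene_graph": torch.randn(2, 12, 300),
    "kg_subgraph": torch.randn(2, 10, 300),
}
logits = model(question, modalities)  # (2, num_answers)
```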