The unprecedented spread of smartphones and their various onboard sensors has attracted increasing interest in automatic mental health detection. However, two major barriers stand in the way of reliable mental health detection applications that can be adopted in real life: (a) the outputs of complex machine learning models are not explainable, which reduces users' trust and thus hinders deployment in real-life scenarios; and (b) sensor signal distributions differ across individuals, since each individual has their own characteristics, which makes accurate detection difficult. We propose an explainable mental health detection model. Spatial and temporal features of multiple sensor sequences are extracted and fused with weights generated by an attention mechanism, so that the differing contributions of the modalities to the classifier are taken into account in the model. A series of experiments on real-life datasets shows the effectiveness of our model compared with existing approaches.
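As a rough illustration of attention-weighted fusion across modalities, the following minimal sketch scores each modality's feature vector against a query vector, normalizes the scores with a softmax, and sums the features with the resulting weights. The abstract does not specify the form of the attention mechanism, so the scaled dot-product scoring, the `query` vector (which would be learned in practice), the function names, and the modality labels are all assumptions made for illustration and are not taken from the proposed model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, query):
    """Fuse per-modality feature vectors with attention weights.

    features: list of M vectors (one per modality), each of length D.
    query:    vector of length D (assumed here; learned in a real model).
    Returns the fused D-dimensional vector and the M attention weights.
    """
    dim = len(query)
    # Scaled dot-product score for each modality (an assumed scoring rule).
    scores = [
        sum(f * q for f, q in zip(vec, query)) / math.sqrt(dim)
        for vec in features
    ]
    weights = softmax(scores)
    # Weighted sum of the modality features, one output value per dimension.
    fused = [
        sum(w * vec[i] for w, vec in zip(weights, features))
        for i in range(dim)
    ]
    return fused, weights


# Hypothetical extracted features for three sensor modalities.
features = [
    [0.2, 0.1, 0.4, 0.3],  # e.g. accelerometer (illustrative label)
    [0.9, 0.8, 0.7, 0.6],  # e.g. screen usage (illustrative label)
    [0.1, 0.0, 0.2, 0.1],  # e.g. GPS (illustrative label)
]
query = [1.0, 1.0, 1.0, 1.0]

fused, weights = attention_fuse(features, query)
print("attention weights:", weights)
print("fused features:", fused)
```

Because the weights are non-negative and sum to one, they can be read per modality as relative contributions to the fused representation, which is one way such a design may support explanation.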