Emotion recognition based on electroencephalograms (EEGs) currently has a wide range of applications. Although many approaches have been proposed for automatic emotion recognition with favorable performance, several challenges remain: (1) how to sufficiently model the discrepancies between long- and short-term temporal features of EEGs while suppressing their redundant spatial information, and (2) how to alleviate the negative impact of the ambiguity among emotion classes. To tackle these issues, we propose CSET-CCA, a novel framework for EEG-based emotion recognition. The feature extractor of this model combines a 1D convolutional neural network (CNN), a channel Squeeze-and-Excitation (SE) module, and a Transformer. It extracts the temporal features of EEG signals from both local and global perspectives and selects the channels most critical for emotion recognition. Moreover, to adaptively perceive the degree of confusion between classes and increase the model's attention to easily confused emotion classes, we design a class confusion-aware (CCA) attention mechanism. We evaluate CSET-CCA on the SEED and SEED-V datasets. The experimental results show that the proposed approach outperforms state-of-the-art methods.
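For intuition, the channel SE mechanism named above can be sketched in NumPy: each EEG channel is "squeezed" to a summary statistic, passed through a small bottleneck, and mapped to a per-channel gate that reweights the signal. All shapes, the reduction ratio, and the random weights below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical sketch of channel Squeeze-and-Excitation (SE) over EEG
# channels; shapes, ratio, and weights are illustrative, not the paper's.
rng = np.random.default_rng(0)
C, T, r = 8, 128, 4                 # channels, time steps, reduction ratio
x = rng.standard_normal((C, T))     # one EEG segment: C channels x T samples

w1 = rng.standard_normal((C, C // r)) * 0.1  # squeeze -> bottleneck
w2 = rng.standard_normal((C // r, C)) * 0.1  # bottleneck -> channel gates

s = x.mean(axis=1)                        # squeeze: global average over time
z = np.maximum(s @ w1, 0.0)               # excitation: ReLU bottleneck
attn = 1.0 / (1.0 + np.exp(-(z @ w2)))    # sigmoid gate per channel, in (0, 1)
out = x * attn[:, None]                   # reweight each channel's signal
```

Because the gates lie strictly in (0, 1), channels judged unimportant are attenuated rather than hard-masked, which is what lets such a module "select" critical channels in a differentiable way.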