With the rapid growth of online video, analysis and prediction of the affective impact that video content will have on viewers has attracted much attention in the community. To address this challenge, several different kinds of information about video clips have been exploited. Traditional methods normally focused on a single modality, either audio or visual. Later, some researchers tried to establish multi-modal schemes and spent a lot of time choosing and extracting features under different fusion strategies. In this research, we propose an end-to-end model that automatically extracts features and targets an emotional classification task by integrating audio and visual features and also incorporating the temporal characteristics of the video. An experimental study on the commonly used MediaEval 2015 Affective Impact of Movies benchmark shows this method’s potential, and we expect that this work can provide some insight for future video emotion recognition from a feature fusion perspective.
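
As a rough illustration of what integrating audio and visual features with temporal information can look like, the following minimal sketch concatenates per-time-step audio and visual feature vectors, summarizes them over time, and applies a linear softmax classifier. It is not the model proposed in this work: the function names, the feature dimensions, the number of classes, the random weights, and the exponential moving average (used here as a stand-in for a learned temporal layer) are all assumptions made for illustration.

```python
import math
import random


def fuse(audio_t, visual_t):
    """Feature-level fusion: concatenate both modality vectors for one time step."""
    return audio_t + visual_t


def temporal_summary(fused_seq, decay=0.5):
    """Exponential moving average over time.

    Assumption: a simple stand-in for a learned recurrent/temporal layer.
    """
    state = [0.0] * len(fused_seq[0])
    for x in fused_seq:
        state = [decay * s + (1.0 - decay) * xi for s, xi in zip(state, x)]
    return state


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(audio_seq, visual_seq, weights, biases):
    """Fuse per-step features, summarize over time, and return class probabilities."""
    fused_seq = [fuse(a, v) for a, v in zip(audio_seq, visual_seq)]
    summary = temporal_summary(fused_seq)
    logits = [
        sum(w * s for w, s in zip(row, summary)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Toy data: all sizes below are arbitrary, hypothetical choices.
random.seed(0)
T, AUDIO_DIM, VISUAL_DIM, NUM_CLASSES = 5, 3, 4, 3
audio_seq = [[random.random() for _ in range(AUDIO_DIM)] for _ in range(T)]
visual_seq = [[random.random() for _ in range(VISUAL_DIM)] for _ in range(T)]
# Untrained random weights; a real model would learn these end to end.
weights = [
    [random.uniform(-1.0, 1.0) for _ in range(AUDIO_DIM + VISUAL_DIM)]
    for _ in range(NUM_CLASSES)
]
biases = [0.0] * NUM_CLASSES

probs = classify(audio_seq, visual_seq, weights, biases)
print(probs)
```

In an end-to-end setting, the hand-written summary and the random weights would be replaced by trainable layers optimized jointly with the feature extractors; the sketch only shows how data from the two modalities flows together.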