This dissertation proposes a hierarchical computational framework for social interaction understanding, addressing four core challenges: interactiveness, shared attention, gaze communication, and triadic belief dynamics. The framework aims to represent and model the underlying structure of social interaction, from outward appearance to inner mental states and from the bottom level up, bridging typical pattern recognition in computer vision and Theory of Mind reasoning in Artificial Intelligence. To this end, we first build a computational model that perceives interactiveness from motion trajectories. Second, we study the common phenomenon of shared attention and propose a model that detects, both temporally and spatially, whether shared attention occurs in a social interaction scene and, if so, where it is located. Third, in follow-up work, we extend this study to a systematic treatment of gaze communication in general. Drawing on prior studies in psychology, computer vision, and robotics, we define gaze communication at two hierarchical levels, the atomic level and the event level, and propose a graph neural network that recognizes which type of gaze communication is taking place in a social interaction. Finally, we step beyond appearance observations into the underlying mental world, learning the ongoing belief dynamics that nonverbal communication induces in a triadic social interaction. We jointly parse the social interaction scene into a six-level parse graph: at the bottom are detected entities, such as objects and human agents; entities belong to frames; frames are temporally clustered into interactive segments; we recognize the nonverbal communication category of each interactive segment; and finally we infer the belief dynamics that this communication causes.
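To make the hierarchy concrete, the parse-graph levels described above can be sketched as nested data structures. This is a minimal illustrative sketch, not the dissertation's actual representation: all class names, fields, and the level annotations are assumptions introduced here for exposition.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of the six-level parse graph, read bottom-up:
# entity -> frame -> interactive segment -> communication label
# -> belief dynamics -> scene root. Names are illustrative only.

@dataclass
class Entity:                     # bottom level: a detected object or human agent
    entity_id: int
    category: str                 # e.g. "agent" or "object"

@dataclass
class Frame:                      # entities grouped per video frame
    index: int
    entities: List[Entity] = field(default_factory=list)

@dataclass
class InteractiveSegment:         # frames temporally clustered into one interaction
    frames: List[Frame]
    communication: str = ""       # recognized nonverbal communication category

@dataclass
class BeliefDynamics:             # inferred change in the agents' beliefs
    description: str

@dataclass
class ParseGraph:                 # top level: the whole social scene
    segments: List[InteractiveSegment] = field(default_factory=list)
    beliefs: List[BeliefDynamics] = field(default_factory=list)

# Toy construction: two agents and one object over three frames,
# with a single "pointing" segment and the belief change it induces.
agents = [Entity(0, "agent"), Entity(1, "agent")]
obj = Entity(2, "object")
frames = [Frame(i, agents + [obj]) for i in range(3)]
segment = InteractiveSegment(frames, communication="pointing")
graph = ParseGraph(
    segments=[segment],
    beliefs=[BeliefDynamics("agent 1 believes agent 0 attends to object 2")],
)
```

Inference then amounts to filling in this structure jointly from video: detecting entities, clustering frames into segments, labeling each segment's communication, and propagating belief updates.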
Beyond simple pattern perception and recognition, our hierarchical computational framework thus offers a more holistic and deeper understanding of social interaction in videos.