Modeling spatio-temporal contextual information is fundamental in computer vision, with particular relevance to robotic intelligence and autonomous driving. We develop several frameworks for context modeling over image, video, multi-modal, and multi-cue data, with applications to human-robot interactivity, in particular in the domain of intelligent vehicles. Toward the goal of developing contextual systems for interactivity, this thesis makes several key contributions: (1) a contextual framework for robust image-level scene understanding, including the detection and localization of vehicles, pedestrians, and parts of humans (e.g., hands) in on-road settings; (2) a spatio-temporal, multi-modal, and multi-cue model that reasons over the complex interplay among human cues (hand, head, and foot coordination), vehicle cues (speed, yaw rate, etc.), and the surrounding spatio-temporal context (agents, scene information) for understanding behavior and predicting activities; (3) a human-centric framework for object recognition and visual scene analysis, developed by studying a notion of object importance and relevance as measured in the spatio-temporal context of navigating a vehicle. A final contribution unifies the aforementioned components of the thesis, including spatio-temporal object recognition, human perception modeling, and behavior and intent prediction, into a single research task. Although the data and case studies in this work emphasize the safety-critical setting of navigating a vehicle, the contributions of this thesis are general and can therefore be applied to a wider array of applications involving human-machine interactivity.