Human activity analysis in unconstrained environments using far-field sensors is a challenging task. The fusion of audio and visual cues enables us to build robust and efficient human activity analysis systems. Traditional fusion schemes, including feature-level, classifier-level, and decision-level fusion, have been explored in task-specific contexts to provide robustness to sensor and environmental noise. However, human activity analysis involves the extraction of information from audio and visual cues at multiple levels of semantic abstraction. This naturally leads to a hierarchical fusion framework. In this dissertation, the limitations of existing fusion schemes are explored and new algorithms are developed to address some of these limitations. The iterative decoding algorithm (IDA) fuses the audio and video modalities at the decision level but, unlike other schemes, uses an iterative strategy to infer the joint likelihood of the hidden states from the unimodal likelihoods. Iterative decoding has advantages over joint modeling and other decision-level fusion schemes in ease of model training and in performance under low-SNR conditions. The extension of the IDA to more complex tasks, such as audio-visual person tracking and meeting scene analysis, leads to hierarchical fusion frameworks. The multilevel iterative decoding framework for audio-visual person tracking (MID-AVT) applies iterative decoding to track multiple subjects using both audio and visual cues from multiple cameras and microphone arrays. The local sensor-level tracks are fused using the IDA to obtain globally consistent tracks. The MID-AVT framework is robust to sensor calibration errors and requires only a rough calibration step to learn the correspondences between different sensors. The location-specific speaker modeling (LSSM) framework for audio-visual meeting scene analysis augments the tracking information with speaker recognition information.
Speaker recognition using far-field microphones is a challenging task. The LSSM framework addresses this issue by using the speaker's location information to select the corresponding location-specific speaker recognition model. In practice, training such contextual models requires intensive labeling of audio-visual datasets. Semi-supervised techniques for model learning and sensor calibration are presented in this dissertation to address this issue. A particular case, learning the LSSM models using face recognition information, is explored in detail and found to perform well in practice. The overall contribution of this dissertation is the exploration of various aspects of hierarchical fusion in audio-visual human activity analysis and the extensive analysis of these hierarchical fusion frameworks on real-world audio-visual testbeds.