Robust pose estimation and attribute classification are tasks of particular interest and importance to the computer vision community, and they frequently serve as intermediate representations for other high-level tasks, such as human identification, visual search, and human tracking. In this dissertation, we present a unified framework for jointly inferring human body pose and human attributes in a parse graph, with attributes augmented to nodes in the hierarchical representation. In particular, unlike most existing approaches, which train models for the two tasks separately and combine their inference sequentially, we build a unified framework by integrating three traditional grammar formulations in an And-Or graph representation: (i) a phrase structure grammar representing the hierarchical decomposition of the human body; (ii) a dependency grammar modeling the geometric articulation; and (iii) an attribute grammar accounting for the compatibility relations between different parts in the hierarchy. Furthermore, we propose an extension of our model that efficiently integrates deep learned features, providing better performance through a richer and deeper appearance representation. We also propose a technique to handle large variations in appearance and geometry. In particular, unlike previous approaches that define parts by drawing square bounding boxes around keypoints or annotating precise bounding boxes for parts, our approach defines parts through a separate part proposal process. Finally, we demonstrate the effectiveness of both the joint modeling and the integration of deep models by showing state-of-the-art performance on several recent public benchmark datasets. We also collect our own dataset and compare our approach with existing methods.