Intelligent environments can be viewed as systems in which humans and machines (rooms) collaborate. Intelligent (or smart) environments need to extract and maintain an awareness of a wide range of events and human activities occurring in these spaces. This requirement is crucial for supporting efficient and effective interactions among humans as well as between humans and intelligent spaces. Visual information plays an important role in developing accurate and useful representations of the static and dynamic states of an intelligent environment. Accurate and efficient capture, analysis, and summarization of the dynamic context require the vision system to work robustly at multiple levels of semantic abstraction. In this paper, we present details of a long-term, ongoing research project in which indoor intelligent spaces endowed with a range of useful functionalities are designed, built, and systematically evaluated. Key functionalities include: intruder detection; multiple-person tracking; body pose and posture analysis; person identification; human body modeling and movement analysis; and integrated systems for intelligent meeting rooms, teleconferencing, and performance spaces. The paper presents an overall system architecture to support the design and development of intelligent environments. Details of panoramic (omnidirectional) video camera arrays, calibration, video stream synchronization, and real-time capture/processing are discussed. Modules for multicamera multiperson tracking, event detection and event-based servoing for selective attention, voxelization, and streaming face recognition are also discussed. The paper includes experimental studies that systematically evaluate the performance of individual video analysis modules, as well as the basic feasibility of an integrated system for dynamic context capture, event-based servoing, and semantic information summarization.