Semantic-guided Visual Analysis and Synthesis with Spatio-temporal Models
- Author(s): Tsai, Yi-Hsuan
- Advisor(s): Yang, Ming-Hsuan
- et al.
Visual analysis is concerned with problems to identify object status or scene layout in images or videos. There are numerous concepts that are of great interest for visual analysis and understanding in the computer vision and machine learning communities. For instance, researchers have been working on developing algorithms to recognize, detect and segment objects/scenes in images. To understand such content, numerous challenges make these problems significantly challenging in the real world scenario, since objects or scenes usually appear in different conditions such as viewpoints, scales, and background noise, and even may deform with different shapes, parts or poses.
In addition to images, video understanding has drawn much attention in various research areas due to the ease of obtaining video data and the importance of video applications, such as virtual reality, autonomous driving and video surveillance. Different from images, videos contain richer information in the temporal domain, thereby it also produces difficulties and requires larger computational powers to fully exploit video content. In this thesis, we propose to construct optimization frameworks for video object tracking and segmentation tasks. First, we utilize a spatial-temporal model to jointly optimize video object segmentation and optical flow estimation, and show that both results can be improved in the proposed framework. Second, we introduce a co-segmentation algorithm to further understand object semantics by considering relations between objects among a collection of videos. As a result, our proposed algorithms achieve state-of-the-art performance in video object segmentation.
With such visual understanding in images and videos, the following question would be how to use them in real world applications. In this thesis, we focus on the visual synthesis problem, where it is a task for people to create or edit contents in the original data. For instance, numerous image editing problems have been studied widely, such as inpainting, harmonization and colorization. For these tasks, as the human can easily discover unrealistic artifacts after the original data is edited, one important challenge is to create realistic contents. To tackle this challenge, we propose to extract semantics by utilizing visual analysis as the guidance to improve the realism of synthesized outputs. With such guidance, we show that our visual synthesis systems produce visually pleasing and realistic results on sky replacement and object/scene composition tasks.