Human Video Generation under Novel Views and Poses
- Wang, Tiantian
- Advisor(s): Yang, Ming-Hsuan
Abstract
Human video generation focuses on developing methodologies to synthesize high-fidelity, controllable, and temporally coherent video representations of humans, with broad applications in virtual reality, digital entertainment, telepresence, and human–computer interaction. This task is particularly challenging when generating novel views and poses from limited input data, such as a single image or monocular video. Issues such as unnatural wrinkles, distorted limbs, and lack of motion consistency frequently arise under these conditions.
To address these challenges, this dissertation leverages recent advancements in neural rendering and diffusion-based generative modeling to synthesize photorealistic human videos. First, we propose a neural rendering framework that generates realistic human appearances under unseen poses and novel views from monocular video input. This is achieved by (i) encoding frame-wise appearance features and (ii) integrating temporal information across frames using a temporal transformer. This framework captures fine-grained details and reconstructs missing structures and occluded regions in query frames. Second, we introduce a geometry-guided tri-plane representation that significantly improves the efficiency and robustness of feature representation over conventional tri-plane optimization in the Neural Radiance Fields (NeRF) framework. We apply this technique to 4D (3D space + time) video stylization, unifying style transfer, novel view synthesis, and human animation in a single framework. Third, we extend human video synthesis from single-video optimization to generalization from a single image by designing a latent diffusion model. This model generates high-quality, 360-degree, spatiotemporally coherent human videos with controllable 3D pose and viewpoint, enabling animation from limited input.
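As a rough illustration of the first component, the sketch below shows one common way to fuse frame-wise appearance features with a temporal transformer. It is a minimal example under assumed tensor shapes (T frames, N feature tokens per frame, C channels) and is not the dissertation's actual implementation; the class name, dimensions, and pooling choice are hypothetical.

```python
# Minimal sketch (assumed shapes and hyperparameters, not the thesis code):
# fuse per-frame appearance features across time with a transformer encoder.
import torch
import torch.nn as nn

class TemporalAggregator(nn.Module):
    def __init__(self, dim=128, heads=4, layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, frame_feats):
        # frame_feats: (B, T, N, C) frame-wise appearance features
        B, T, N, C = frame_feats.shape
        # attend across the T frames independently for each spatial token
        tokens = frame_feats.permute(0, 2, 1, 3).reshape(B * N, T, C)
        fused = self.temporal(tokens)        # (B*N, T, C)
        fused = fused.mean(dim=1)            # pool over time
        return fused.view(B, N, C)           # temporally fused features

feats = torch.randn(1, 5, 64, 128)           # 5 frames, 64 tokens, 128-dim
fused = TemporalAggregator()(feats)
print(fused.shape)                           # torch.Size([1, 64, 128])
```

The fused features can then be decoded by a rendering head to fill in occluded or missing regions of the query frame, as described above.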
To evaluate the effectiveness of our approach, we conduct extensive experiments on multiple benchmarks. Results show that our methods consistently outperform existing techniques in generating high-quality, temporally stable, and view-consistent human videos from either a single image or a monocular video.