This dissertation advances subtle realism in digital representations by developing methods for
physiologically-aware video synthesis, efficient 3D reconstruction, and temporally consistent
4D modeling. We propose a scalable framework that generates bio-realistic face videos by
preserving physiological signals, addressing demographic bias in remote health sensing. In
3D reconstruction, we introduce ALTO, a method that alternates between latent topologies
to achieve high-fidelity shape recovery with fast inference. Extending to 3D generation,
we present a multi-view depth diffusion model (MVDD) that synthesizes detailed 3D shapes from
multi-view depth maps, improving upon point-based generative models. Finally, we develop
a framework for dynamic 4D surface reconstruction from monocular video, ensuring temporal
coherence for applications such as simulation and editing. Collectively, these contributions form
a cohesive progression toward realistic, scalable digital modeling of the physical world, with
applications across healthcare, graphics, and AI-driven simulation. Future directions include
training-free 3D reasoning, real-time dynamic modeling, and broader cross-modal generative
modeling.