Recent large-scale text-to-image (T2I) diffusion models have achieved remarkable success, enabling the generation of complex and realistic images from a text prompt describing the target concept. Despite these advantages, T2I diffusion models offer poor spatial controllability when conditioned on text alone. This thesis focuses on augmenting pre-trained T2I diffusion models so that they can take spatial reference as additional input.
The first part of this thesis presents FreeControl, a training-free, guidance-based approach for controllable T2I generation that supports multiple conditions, architectures, and checkpoints simultaneously. FreeControl enforces structure guidance to achieve global alignment with a guidance image and appearance guidance to collect visual details from images generated without control. Extensive qualitative and quantitative experiments demonstrate the superior performance of FreeControl across a variety of pre-trained T2I models. In particular, FreeControl enables convenient training-free control over many different architectures and checkpoints, handles challenging input conditions on which most existing training-free methods fail, and achieves competitive synthesis quality compared to training-based approaches.
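To make the guidance mechanism concrete, the following is a minimal PyTorch-style sketch of one guided denoising step, assuming the denoiser also exposes an intermediate feature map. The names (`ToyDenoiser`, `guided_denoise_step`, `struct_basis`, the guidance weights) and the toy network are illustrative placeholders, not FreeControl's actual implementation.

```python
import torch
import torch.nn.functional as F


class ToyDenoiser(torch.nn.Module):
    """Tiny placeholder for a diffusion UNet that also returns intermediate features."""

    def __init__(self, channels: int = 8):
        super().__init__()
        self.feat = torch.nn.Conv2d(3, channels, 3, padding=1)
        self.out = torch.nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x, t):
        feats = torch.relu(self.feat(x))
        return self.out(feats), feats  # (noise prediction, features)


def guided_denoise_step(x_t, t, denoiser, struct_basis, target_struct, sibling_app,
                        w_struct=400.0, w_app=0.2):
    """One sampling step with structure/appearance guidance (hypothetical sketch).

    struct_basis:  semantic basis of diffusion features, shape [c, k]
    target_struct: projected features of the guidance image, shape [hw, k]
    sibling_app:   pooled features of an unguided sibling generation, shape [c]
    """
    x_t = x_t.detach().requires_grad_(True)
    eps_pred, feats = denoiser(x_t, t)              # feats: [b, c, h, w]
    b, c, h, w = feats.shape
    flat = feats.permute(0, 2, 3, 1).reshape(b, h * w, c)

    # Structure guidance: align feature projections with those of the guidance image.
    proj = flat @ struct_basis                      # [b, hw, k]
    e_struct = F.mse_loss(proj, target_struct.expand_as(proj))

    # Appearance guidance: match pooled feature statistics of the unguided sibling.
    e_app = F.mse_loss(feats.mean(dim=(2, 3)), sibling_app.expand(b, c))

    # Classifier-guidance-style correction: add the energy gradient to the noise prediction.
    energy = w_struct * e_struct + w_app * e_app
    grad = torch.autograd.grad(energy, x_t)[0]
    return eps_pred + grad
```

In this sketch, the gradient of a weighted structure/appearance energy is added to the predicted noise at every step, which is how guidance-based control steers sampling without any model training.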
The second part of this thesis presents Ctrl-X, a training-free and guidance-free method that supports structure and appearance customization from a broad spectrum of image modalities. Ctrl-X introduces feed-forward structure control to enable structure alignment with a structure image and semantic-aware appearance transfer to transfer appearance from a user-input image. Extensive qualitative and quantitative experiments illustrate the superior performance of Ctrl-X on various condition inputs and model checkpoints.
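The sketch below illustrates the two feed-forward operations in PyTorch, under the assumption that diffusion features are available from the structure, appearance, and output branches at matching layers. The function names, the layer selection, and the AdaIN-style statistics matching are illustrative choices, not Ctrl-X's exact implementation.

```python
import torch
import torch.nn.functional as F


def inject_structure(output_feats, structure_feats, layer_idx, control_layers=(0, 1, 2)):
    """Feed-forward structure control (sketch): replace the output branch's features
    with the structure branch's features at selected (hypothetical) layers."""
    return structure_feats if layer_idx in control_layers else output_feats


def semantic_appearance_transfer(output_feats, appearance_feats, eps=1e-5):
    """Semantic-aware appearance transfer (sketch): use feature correspondence as
    attention weights to pull appearance from the appearance branch per location.

    output_feats, appearance_feats: diffusion features of shape [b, c, h, w].
    """
    b, c, h, w = output_feats.shape
    q = output_feats.flatten(2).transpose(1, 2)        # [b, hw, c]
    k = appearance_feats.flatten(2).transpose(1, 2)    # [b, hw, c]

    # Normalize before matching so correspondence reflects semantics, not feature scale.
    attn = torch.softmax(
        F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(1, 2) / (c ** 0.5),
        dim=-1,
    )                                                   # [b, hw, hw]
    transferred = (attn @ k).transpose(1, 2).reshape(b, c, h, w)

    # Re-normalize output features to the transferred appearance statistics (AdaIN-style).
    mu_o = output_feats.mean((2, 3), keepdim=True)
    std_o = output_feats.std((2, 3), keepdim=True)
    mu_t = transferred.mean((2, 3), keepdim=True)
    std_t = transferred.std((2, 3), keepdim=True)
    return (output_feats - mu_o) / (std_o + eps) * std_t + mu_t
```

Because both operations are applied directly to features during the forward pass, no gradient computation through the diffusion model is needed, which is what makes the method guidance-free as well as training-free.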