Generative Models for Content Creation
- Author(s): Lee, Hsin-Ying
- Advisor(s): Yang, Ming-Hsuan
- et al.
Content creation can be broadly defined as conveying thoughts and expressing ideas through a medium such as speech, text, or any of the various arts. Its general goal is to produce content that makes information accessible and understandable to audiences. In recent years, with the rapid progress of artificial intelligence, it has become increasingly clear that the future of content creation lies in a powerful blend of machine technology and human creativity. However, it remains extremely challenging for machines to truly emulate what a content creator does. Therefore, instead of attempting to take over the role of professional content creators, we focus on (1) narrowing the gap between professional content creators and general users with the help of machines, and (2) leveraging machines to facilitate the creation process. In this work, we propose efficient algorithms based on generative models to tackle several content creation tasks that are traditionally time-consuming and costly.
First, we address the problem of image-to-image translation. We propose a disentangled representation image-to-image translation (DRIT) framework that performs diverse translation without paired training data. Our model disentangles images into a domain-invariant content space and a domain-specific attribute space, which enables diverse translation. Furthermore, to improve the diversity of the generated images, we propose a simple yet effective mode-seeking regularization term. This term can be plugged into a variety of conditional generation tasks.
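The idea behind mode-seeking regularization can be illustrated with a small sketch. Given one condition and two latent codes z1 and z2, the generator is encouraged to maximize the ratio d_I(G(c, z1), G(c, z2)) / d_z(z1, z2), so that distinct latent codes do not collapse onto the same output. The sketch below assumes mean L1 distances for both d_I and d_z and returns the reciprocal of the ratio (plus a small eps) so it can be minimized alongside other generator losses; the function names are hypothetical, and images and latents are flattened to plain lists for illustration.

```python
def mean_abs_diff(a, b):
    """Mean absolute (L1) difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mode_seeking_loss(img1, img2, z1, z2, eps=1e-5):
    """Mode-seeking term (hypothetical helper, L1 distances assumed).

    img1, img2: generator outputs for the same condition with latents z1, z2.
    The ratio output_distance / latent_distance is large when different
    latents give different outputs; we return 1 / (ratio + eps) so that
    minimizing the loss maximizes the ratio. Assumes z1 != z2.
    """
    ratio = mean_abs_diff(img1, img2) / mean_abs_diff(z1, z2)
    return 1.0 / (ratio + eps)


# A collapsed generator (identical outputs) is penalized heavily,
# while one that produces distinct outputs receives a small penalty.
collapsed = mode_seeking_loss([0.5, 0.5], [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
diverse = mode_seeking_loss([0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0])
print(collapsed, diverse)
```

Because the term only needs two samples drawn under the same condition, it can be added to an existing conditional generator's objective without changing the architecture, which is what makes it usable as a plug-in.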
Second, we address the music-to-dance translation task. Given an input music clip, we aim to generate a corresponding dance sequence. We propose a synthesis-by-analysis learning framework: the model first learns to perform basic movements, then learns how to combine these basic movements according to the input music. The generated dance sequences are consistent with the input music clips in terms of both music style and audio beats.
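Consistency with audio beats can be made concrete with a simple measurement. The sketch below assumes that a "kinematic beat" is a local minimum of overall joint speed and counts the fraction of such beats that fall within a tolerance of a music beat. The function names, the speed-minimum heuristic, and the 0.1 s tolerance are illustrative assumptions, not the evaluation protocol of this work.

```python
def kinematic_beats(speeds, fps):
    """Times (s) of local minima in per-frame joint speed.

    Assumption: a dancer briefly slows down on a beat, so a local speed
    minimum is used as a proxy for a movement beat.
    """
    return [
        i / fps
        for i in range(1, len(speeds) - 1)
        if speeds[i] < speeds[i - 1] and speeds[i] <= speeds[i + 1]
    ]


def beat_hit_rate(speeds, fps, music_beats, tol=0.1):
    """Fraction of kinematic beats lying within `tol` seconds of a music beat."""
    beats = kinematic_beats(speeds, fps)
    if not beats:
        return 0.0
    hits = sum(1 for t in beats if any(abs(t - m) <= tol for m in music_beats))
    return hits / len(beats)


# Toy motion sampled at 2 fps: speed dips at 0.5 s and 2.0 s.
speeds = [3.0, 1.0, 3.0, 3.0, 1.0, 3.0]
print(kinematic_beats(speeds, fps=2))
print(beat_hit_rate(speeds, fps=2, music_beats=[0.5, 2.0]))
```

A higher rate indicates that movement accents land on the music's beats; in practice the music beat times would come from an audio beat tracker.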
Third, we address the design layout generation task. Given a set of desired components and user-specified constraints, we aim to generate visually reasonable and appealing layouts. We propose a multi-stage framework: we first predict the complete set of relationships among components given the user-specified constraints, then predict the bounding boxes of all components, and finally refine the predicted boxes to further improve alignment and visual quality.
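To illustrate what the final refinement stage aims for, the sketch below applies a hand-coded heuristic to predicted boxes: left edges that lie within a few pixels of each other are snapped to a shared value, so nearly aligned components become exactly aligned. This is a stand-in for illustration only. The `Box` type, the `snap_left_edges` helper, and the 4-pixel threshold are hypothetical and are not the refinement model described above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Box:
    name: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def snap_left_edges(boxes, threshold=4.0):
    """Greedy alignment pass (hypothetical heuristic).

    Boxes are visited in order of their left edge. A box whose left edge is
    within `threshold` of an existing anchor is moved to that anchor;
    otherwise its own left edge becomes a new anchor. Returns boxes sorted by x.
    """
    anchors = []
    refined = []
    for box in sorted(boxes, key=lambda b: b.x):
        for anchor in anchors:
            if abs(box.x - anchor) <= threshold:
                refined.append(replace(box, x=anchor))
                break
        else:
            anchors.append(box.x)
            refined.append(box)
    return refined


layout = [
    Box("title", 10.0, 10.0, 200.0, 40.0),
    Box("body", 12.5, 60.0, 200.0, 120.0),
    Box("logo", 100.0, 200.0, 50.0, 50.0),
]
for b in snap_left_edges(layout):
    print(b.name, b.x)
```

A learned refinement can account for relationships and overall visual balance rather than a single fixed threshold; the heuristic only shows the kind of adjustment the refinement stage performs.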