Progressive Growing of GANs for Improved Quality, Stability, and Variation
Tero Karras FI | 2017-11-01 | Submission video of our paper, published at ICLR 2018. Please see the final version at youtu.be/G06dEcZ-QTg
Authors: Tero Karras (NVIDIA), Timo Aila (NVIDIA), Samuli Laine (NVIDIA), Jaakko Lehtinen (NVIDIA and Aalto University)
For business inquiries, please contact researchinquiries@nvidia.com. For press and other inquiries, please contact Hector Marinez at hmarinez@nvidia.com.
Abstract: We describe a new training methodology for generative adversarial networks. The key idea is to grow both the generator and discriminator progressively: starting from a low resolution, we add new layers that model increasingly fine details as training progresses. This both speeds the training up and greatly stabilizes it, allowing us to produce images of unprecedented quality, e.g., CelebA images at 1024². We also propose a simple way to increase the variation in generated images, and achieve a record inception score of 8.80 in unsupervised CIFAR10. Additionally, we describe several implementation details that are important for discouraging unhealthy competition between the generator and discriminator. Finally, we suggest a new metric for evaluating GAN results, both in terms of image quality and variation. As an additional contribution, we construct a higher-quality version of the CelebA dataset.
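The fade-in mechanism behind this growing schedule can be sketched in a few lines. The following is a minimal PyTorch illustration under assumed layer sizes, not the paper's actual architecture: a newly added higher-resolution block is blended with an upsampled copy of the previous resolution's RGB output, with the blend weight alpha ramping from 0 to 1 as training progresses.

```python
# Minimal sketch of progressively adding a resolution block with a fade-in.
# All module names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FadeInGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.low_res_block = nn.Sequential(       # produces 8x8 features
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.new_block = nn.Sequential(           # newly added 16x16 block
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.to_rgb_low = nn.Conv2d(32, 3, 1)     # 8x8 features -> RGB
        self.to_rgb_new = nn.Conv2d(16, 3, 1)     # 16x16 features -> RGB

    def forward(self, x, alpha):
        # alpha ramps 0 -> 1 during training as the new layer is faded in.
        low = self.low_res_block(x)
        skip = F.interpolate(self.to_rgb_low(low), scale_factor=2, mode='nearest')
        new = self.to_rgb_new(self.new_block(low))
        return (1 - alpha) * skip + alpha * new

g = FadeInGenerator()
z = torch.randn(4, 64, 4, 4)                      # toy latent feature map
print(g(z, alpha=0.3).shape)                      # torch.Size([4, 3, 16, 16])
```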
A Style-Based Generator Architecture for Generative Adversarial Networks
Tero Karras FI | 2019-03-03 | Paper (PDF): http://stylegan.xyz/paper
Authors: Tero Karras (NVIDIA), Samuli Laine (NVIDIA), Timo Aila (NVIDIA)
Abstract: We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied and high-quality dataset of human faces.
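The style-based modulation described above can be sketched roughly as follows. The layer sizes and the use of adaptive instance normalization (AdaIN) as the modulation mechanism are illustrative assumptions for this sketch, not a restatement of the paper's exact architecture: a small mapping network transforms the latent z into an intermediate code w, which then sets per-channel scales and biases at each synthesis layer, giving the scale-specific control mentioned in the abstract.

```python
# Minimal sketch of a style-based generator layer; sizes are assumptions.
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, w_dim, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)
        self.affine = nn.Linear(w_dim, channels * 2)   # w -> per-channel scale, bias

    def forward(self, x, w):
        scale, bias = self.affine(w).chunk(2, dim=1)
        scale = scale[:, :, None, None]
        bias = bias[:, :, None, None]
        return (1 + scale) * self.norm(x) + bias

class TinyStyleGenerator(nn.Module):
    def __init__(self, z_dim=64, w_dim=64):
        super().__init__()
        self.mapping = nn.Sequential(                  # z -> w mapping network
            nn.Linear(z_dim, w_dim), nn.LeakyReLU(0.2),
            nn.Linear(w_dim, w_dim))
        self.const = nn.Parameter(torch.randn(1, 32, 4, 4))  # learned constant input
        self.conv = nn.Conv2d(32, 32, 3, padding=1)
        self.adain = AdaIN(w_dim, 32)
        self.to_rgb = nn.Conv2d(32, 3, 1)

    def forward(self, z):
        w = self.mapping(z)
        x = self.const.expand(z.shape[0], -1, -1, -1)
        x = self.adain(self.conv(x), w)                # style modulates this layer
        return self.to_rgb(x)

g = TinyStyleGenerator()
print(g(torch.randn(4, 64)).shape)                     # torch.Size([4, 3, 4, 4])
```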
Progressive Growing of GANs for Improved Quality, Stability, and Variation
Tero Karras FI | 2018-02-23 | Final result video of our paper, published at ICLR 2018. We strongly recommend viewing at maximum quality (1080p @ 60).
One hour of imaginary celebrities
Tero Karras FI | 2017-11-01 | http://research.nvidia.com/publication/2017-10_Progressive-Growing-of

Speech and Facial Animation at SIGGRAPH 2017 comparison of technical papers
Tero Karras FI | 2017-11-01 | Comparison of three technical papers presented at SIGGRAPH 2017 in the "Speech and Facial Animation" session. http://s2017.siggraph.org/technical-papers/sessions/speech-and-facial-animation
Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing Obama: Learning Lip Sync from Audio. ACM Trans. Graph. 36, 4, Article 95 (July 2017). https://grail.cs.washington.edu/projects/AudioToObama/
Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. A Deep Learning Approach for Generalized Speech Animation. ACM Trans. Graph. 36, 4, Article 93 (July 2017). disneyresearch.com/publication/deep-learning-speech-animation
Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion
Tero Karras FI | 2017-11-01 | ACM Transactions on Graphics (Proc. SIGGRAPH 2017)
Authors: Tero Karras (NVIDIA), Timo Aila (NVIDIA), Samuli Laine (NVIDIA), Antti Herva (Remedy Entertainment), Jaakko Lehtinen (NVIDIA and Aalto University)
Abstract: We present a machine learning technique for driving 3D facial animation by audio input in real time and with low latency. Our deep neural network learns a mapping from input waveforms to the 3D vertex coordinates of a face model, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone. During inference, the latent code can be used as an intuitive control for the emotional state of the face puppet.
We train our network with 3-5 minutes of high-quality animation data obtained using traditional, vision-based performance capture methods. Even though our primary goal is to model the speaking style of a single actor, our model yields reasonable results even when driven with audio from other speakers with different gender, accent, or language, as we demonstrate with a user study. The results are applicable to in-game dialogue, low-cost localization, virtual reality avatars, and telepresence.
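The mapping described here, from audio input plus a latent emotion code to mesh vertex positions, can be sketched as follows. Everything below (layer sizes, the audio feature dimension, the mesh size) is an illustrative assumption, not the paper's actual network.

```python
# Minimal sketch: audio features + latent emotion code -> 3D vertex positions.
import torch
import torch.nn as nn

NUM_VERTICES = 500        # assumed mesh size, for illustration only

class AudioToVertices(nn.Module):
    def __init__(self, audio_dim=128, emotion_dim=16):
        super().__init__()
        self.audio_encoder = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU())
        self.decoder = nn.Sequential(              # audio features + emotion code
            nn.Linear(128 + emotion_dim, 256), nn.ReLU(),
            nn.Linear(256, NUM_VERTICES * 3))      # x, y, z per vertex

    def forward(self, audio_window, emotion_code):
        h = self.audio_encoder(audio_window)
        h = torch.cat([h, emotion_code], dim=1)
        return self.decoder(h).view(-1, NUM_VERTICES, 3)

net = AudioToVertices()
audio = torch.randn(2, 128)      # e.g. a window of spectral audio features
emotion = torch.randn(2, 16)     # latent code controlling expression style
print(net(audio, emotion).shape) # torch.Size([2, 500, 3])
```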
Production-Level Facial Performance Capture Using Deep Convolutional Neural Networks
Tero Karras FI | 2017-11-01 | In Proceedings of SCA'17, Los Angeles, CA, USA, July 28-30, 2017
Authors: Samuli Laine (NVIDIA), Tero Karras (NVIDIA), Timo Aila (NVIDIA), Antti Herva (Remedy Entertainment), Shunsuke Saito (Pinscreen, University of Southern California), Ronald Yu (Pinscreen, University of Southern California), Hao Li (USC Institute for Creative Technologies, University of Southern California, Pinscreen), Jaakko Lehtinen (NVIDIA and Aalto University)
Abstract: We present a real-time deep learning framework for video-based facial performance capture -- the dense 3D tracking of an actor's face given a monocular video. Our pipeline begins with accurately capturing a subject using a high-end production facial capture pipeline based on multi-view stereo tracking and artist-enhanced animations. With 5-10 minutes of captured footage, we train a convolutional neural network to produce high-quality output, including self-occluded regions, from a monocular video sequence of that subject. Since this 3D facial performance capture is fully automated, our system can drastically reduce the amount of labor involved in the development of modern narrative-driven video games or films involving realistic digital doubles of actors and potentially hours of animated dialogue per character. We compare our results with several state-of-the-art monocular real-time facial capture techniques and demonstrate compelling animation inference in challenging areas such as eyes and lips.
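As a rough sketch of this setup (input resolution, channel counts, and mesh size are assumptions, not the paper's network): a small convolutional network regresses per-frame 3D vertex positions from a monocular frame and is trained against targets produced by the multi-view capture pipeline.

```python
# Minimal sketch: monocular frame -> per-frame 3D vertex positions,
# supervised by targets from a multi-view capture pipeline.
import torch
import torch.nn as nn

NUM_VERTICES = 500        # assumed mesh size, for illustration only

class FrameToMesh(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())  # 16 -> 8
        self.regress = nn.Linear(64 * 8 * 8, NUM_VERTICES * 3)

    def forward(self, frame):
        h = self.features(frame).flatten(1)
        return self.regress(h).view(-1, NUM_VERTICES, 3)

net = FrameToMesh()
frames = torch.randn(2, 1, 64, 64)            # grayscale crops of the face
targets = torch.randn(2, NUM_VERTICES, 3)     # from the multi-view pipeline
loss = nn.functional.mse_loss(net(frames), targets)
loss.backward()
print(loss.item())
```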