Design and Control of a Bipedal Robotic Character
DisneyResearchHub | 2024-10-17

Video Compression with Entropy-Constrained Neural Representations
DisneyResearchHub | 2023-06-03

Encoding videos as neural networks is a recently proposed approach that allows new forms of video processing. However, traditional techniques still outperform such neural video representation (NVR) methods for the task of video compression. This performance gap can be explained by the fact that current NVR methods: i) use architectures that do not efficiently obtain a compact representation of temporal and spatial information; and ii) minimize rate and distortion disjointly (first overfitting a network on a video and then using heuristic techniques such as post-training quantization or weight pruning to compress the model). We propose a novel convolutional architecture for video representation that better represents spatio-temporal information and a training strategy capable of jointly optimizing rate and distortion.

Continuous Landmark Detection with 3D Queries
DisneyResearchHub | 2023-06-03

We propose the first facial landmark detection network that can predict continuous, unlimited landmarks, allowing the number and location of the desired landmarks to be specified at inference time. Our method combines a simple image feature extractor with a queried landmark predictor, and the user can specify any continuous query points relative to a 3D template face mesh as input.

Kernel Aware Resampler
DisneyResearchHub | 2023-06-03

Deep learning based methods for super-resolution have become state-of-the-art and outperform traditional approaches by a significant margin. From the initial models designed for fixed integer scaling factors (e.g. x2 or x4), efforts were made to explore different directions such as modeling blur kernels or addressing non-integer scaling factors.
However, existing works do not provide a sound framework to handle them jointly.

Frame Interpolation Transformer and Uncertainty Guidance
DisneyResearchHub | 2023-06-03

We propose a transformer-based VFI architecture that processes both source and target frames in a unified framework and compensates for motion through tightly integrated optical flow estimation and cross-backward warping. Our model improves over the current state-of-the-art, as supported by our extensive quantitative experiments and a user study.

Physics-Informed Neural Corrector for Deformation-based Fluid
DisneyResearchHub | 2023-05-08

We present a method to rectify deformed fluid flows using neural networks. Our neural corrector ensures the physical plausibility of edited simulation footprints at test time, enabling interactive control of fluids without re-simulations.

Self-Supervised Effective Resolution Estimation with Adversarial Augmentations
DisneyResearchHub | 2023-01-10

High-resolution, high-quality images of human faces are desired as training data and output for many modern applications, such as avatar generation, face super-resolution, and face swapping. The terms high-resolution and high-quality are often used interchangeably; however, the two concepts are not equivalent, and high resolution does not always imply high quality. To address this, we motivate and precisely define the concept of effective resolution in this paper. We thereby draw connections to signal and information theory and show why baselines based on frequency analysis or compression fail. Instead, we propose a novel self-supervised learning scheme to train a neural network for effective resolution estimation without human-labeled data. It leverages adversarial augmentations to bridge the domain gap between synthetic and real, authentic degradations, thus allowing us to train on domains, such as human faces, for which no or only few human labels exist.
Finally, we demonstrate that our method outperforms state-of-the-art image quality assessment methods in estimating the sharpness of real and generated human faces, despite using only unlabeled data during training.
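To make the self-supervised idea above concrete, here is a minimal sketch of how degradations with a known strength can yield free training labels for an effective-resolution regressor. This is our reading of the abstract, not the paper's actual pipeline; the box-downsample degradation and the `1/factor` label are illustrative assumptions (the paper uses learned adversarial augmentations).

```python
import numpy as np

def degrade(image: np.ndarray, factor: int) -> np.ndarray:
    """Box-downsample by `factor`, then upsample back via nearest neighbour.
    The result keeps the original pixel count but only 1/factor of the detail."""
    h, w = image.shape
    small = image[:h - h % factor, :w - w % factor]
    small = small.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)

def make_training_pair(image: np.ndarray, factor: int):
    """Self-supervised pair: a degraded image plus its known effective resolution."""
    label = 1.0 / factor  # e.g. factor 2 -> half the effective resolution
    return degrade(image, factor), label

rng = np.random.default_rng(0)
img = rng.random((64, 64))
x, y = make_training_pair(img, 2)  # x has full pixel count, halved detail
```

A regressor trained on many such pairs never needs human sharpness labels, which is the core appeal of the scheme.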
Publication link: studios.disneyresearch.com/app/uploads/2022/12/Self-Supervised-Effective-Resolution-Estimation-with-Adversarial-Augmentations_supplementary.pdf

Efficient Neural Style Transfer For Volumetric Simulations
DisneyResearchHub | 2022-11-30

Artistically controlling fluids has always been a challenging task. Recently, volumetric Neural Style Transfer (NST) techniques have been used to artistically manipulate smoke simulation data with 2D images. In this work, we revisit previous volumetric NST techniques for smoke, proposing a suite of upgrades that enable stylizations that are significantly faster, simpler, more controllable, and less prone to artifacts. Moreover, the energy minimization solved by previous methods is camera dependent; avoiding this requires a computationally expensive iterative optimization over multiple views sampled around the original simulation, which can take up to several minutes per frame. We propose a simple feed-forward neural network architecture that is able to infer view-independent stylizations that are three orders of magnitude faster than their optimization-based counterpart.
Publication link: studios.disneyresearch.com/2022/11/30/efficient-neural-style-transfer-for-volumetric-simulations

Production Ready Face Re-Aging for Visual Effects
DisneyResearchHub | 2022-11-30

Photorealistic digital re-aging of faces in video is becoming increasingly common in entertainment and advertising. But the predominant 2D painting workflow often requires frame-by-frame manual work that can take days to accomplish, even by skilled artists. Although research on facial image re-aging has attempted to automate and solve this problem, current techniques are of little practical use as they typically suffer from facial identity loss, poor resolution, and unstable results across subsequent video frames. In this paper, we present the first practical, fully-automatic and production-ready method for re-aging faces in video. Our first key insight is in addressing the problem of collecting longitudinal training data for learning to re-age faces over extended periods of time, a task that is nearly impossible to accomplish for a large number of real people. We show how such a longitudinal dataset can be constructed by leveraging the current state-of-the-art in facial re-aging that, although failing on real images, does provide photoreal re-aging results on synthetic faces. Our second key insight is then to leverage such synthetic data and formulate facial re-aging as a practical image-to-image translation task that can be performed by training a well-understood U-Net architecture, without the need for more complex network designs. We demonstrate how the simple U-Net, surprisingly, allows us to advance the state of the art for re-aging real faces on video, with unprecedented temporal stability and preservation of facial identity across variable expressions, viewpoints, and lighting conditions.
Finally, our new face re-aging network (FRAN) incorporates simple and intuitive mechanisms that provide artists with localized control and creative freedom to direct and fine-tune the re-aging effect, a feature that is highly important in real production pipelines and often overlooked in related research work.
Publication link: studios.disneyresearch.com/2022/11/30/production-ready-face-re-aging-for-visual-effects

TempFormer: Temporally Consistent Transformer for Video Denoising
DisneyResearchHub | 2022-11-23

Video denoising is a low-level vision task that aims to restore high-quality videos from noisy content. The Vision Transformer (ViT) is a new machine learning architecture that has shown promising performance on both high-level and low-level image tasks in the past year, e.g., object detection, classification, and image restoration. In this paper, we propose a modified ViT architecture for video processing tasks, introducing a new training strategy and loss function to enhance temporal consistency without compromising spatial quality. Specifically, we propose an efficient hybrid Transformer-based model, TempFormer, which is composed of SpatioTemporal Transformer Blocks (STTB) and 3D convolutional layers. The proposed STTB learns the temporal information between neighboring frames implicitly by utilizing the proposed Joint Spatio-Temporal Mixer module for attention calculation and feature aggregation in each ViT block. Moreover, existing methods suffer from temporal inconsistency artifacts that are problematic in practical cases and distracting to viewers. We propose a sliding block strategy with a recurrent architecture and use a new loss term, Overlap Loss, to alleviate the flickering between adjacent frames. Our method produces state-of-the-art spatio-temporal denoising quality with significantly improved temporal coherency, and requires fewer computational resources than competing methods to achieve comparable denoising quality.
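The sliding-block idea behind the Overlap Loss can be sketched as follows: adjacent temporal blocks share frames, and the loss penalizes disagreement between the two denoised versions of those shared frames. The block splitting and the L1 penalty here are illustrative assumptions based on the abstract, not the paper's exact formulation.

```python
import numpy as np

def split_into_blocks(frames, block_len=4, overlap=1):
    """Slice a [T, H, W] clip into temporal blocks that share `overlap` frames."""
    step = block_len - overlap
    return [frames[i:i + block_len]
            for i in range(0, len(frames) - block_len + 1, step)]

def overlap_loss(blocks, overlap=1):
    """Mean absolute difference between adjacent blocks on their shared frames."""
    total, count = 0.0, 0
    for a, b in zip(blocks, blocks[1:]):
        total += np.abs(a[-overlap:] - b[:overlap]).mean()
        count += 1
    return total / max(count, 1)

frames = np.zeros((10, 8, 8))                     # toy "denoised" clip
blocks = split_into_blocks(frames, block_len=4, overlap=1)
loss = overlap_loss(blocks)                       # consistent blocks -> zero loss
```

When two blocks disagree on a shared frame, the loss grows, which is exactly the flicker the term is meant to suppress.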
Publication link: studios.disneyresearch.com/2022/10/11/tempformer-temporally-consistent-transformer-for-video-denoising

Learning Dynamic 3D Geometry and Texture for Video Face Swapping
DisneyResearchHub | 2022-10-11

Face swapping is the process of applying a source actor’s appearance to a target actor’s performance in a video. This is a challenging visual effect that has seen increasing demand in film and television production. Recent work has shown that data-driven methods based on deep learning can produce compelling effects at production quality in a fraction of the time required for a traditional 3D pipeline. However, the dominant approach operates only on 2D imagery without reference to the underlying facial geometry or texture, resulting in poor generalization under novel viewpoints and little artistic control. Methods that do incorporate geometry rely on pre-learned facial priors that do not adapt well to particular geometric features of the source and target faces. We approach the problem of face swapping from the perspective of learning simultaneous convolutional facial autoencoders for the source and target identities, using a shared encoder network with identity-specific decoders. The key novelty in our approach is that each decoder first lifts the latent code into a 3D representation, comprising a dynamic face texture and a deformable 3D face shape, before projecting this 3D face back onto the input image using a differentiable renderer. The coupled autoencoders are trained only on videos of the source and target identities, without requiring 3D supervision. By leveraging the learned 3D geometry and texture, our method achieves face swapping with higher quality than when using off-the-shelf monocular 3D face reconstruction, and overall lower FID score than state-of-the-art 2D methods. Furthermore, our 3D representation allows for efficient artistic control over the result, which can be hard to achieve with existing 2D approaches.
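The shared-encoder, identity-specific-decoder layout described above can be sketched in miniature. The tiny classes below are pure-Python stand-ins for the real convolutional networks; the routing logic (encode once, decode with the other identity's decoder) is the point of the sketch, and all names are illustrative.

```python
class SharedEncoder:
    """Stand-in for the shared convolutional encoder."""
    def __call__(self, image):
        # toy "latent code": per-row means of the image
        return [sum(row) / len(row) for row in image]

class IdentityDecoder:
    """Stand-in for one identity-specific decoder."""
    def __init__(self, identity):
        self.identity = identity
    def __call__(self, code):
        # a real decoder would lift `code` to a 3D texture + shape
        # and re-render it with a differentiable renderer
        return {"identity": self.identity, "code": code}

encoder = SharedEncoder()
decoders = {"source": IdentityDecoder("source"),
            "target": IdentityDecoder("target")}

def face_swap(image, to_identity):
    """Encode with the shared encoder, decode with the chosen identity."""
    return decoders[to_identity](encoder(image))

result = face_swap([[0.0, 1.0], [1.0, 1.0]], "target")
```

Because the encoder is shared, the latent code is identity-agnostic, which is what makes swapping a matter of picking the other decoder.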
Publication link: studios.disneyresearch.com/2022/10/05/learning-dynamic-3d-geometry-and-texture-for-video-face-swapping

Facial Animation with Disentangled Identity and Motion using Transformers
DisneyResearchHub | 2022-09-12

We propose a 3D+time framework for modeling dynamic sequences of 3D facial shapes, representing realistic non-rigid motion during a performance. Our work extends neural 3D morphable models by learning a motion manifold using a transformer architecture. More specifically, we derive a novel transformer-based autoencoder that can model and synthesize 3D geometry sequences of arbitrary length. This transformer naturally determines frame-to-frame correlations required to represent the motion manifold, via the internal self-attention mechanism. Furthermore, our method disentangles the constant facial identity from the time-varying facial expressions in a performance, using two separate codes to represent neutral identity and the performance itself within separate latent subspaces. Thus, the model represents identity-agnostic performances that can be paired with an arbitrary new identity code and fed through our new identity-modulated performance decoder; the result is a sequence of 3D meshes for the performance with the desired identity and temporal length. We demonstrate how our disentangled motion model has natural applications in performance synthesis, performance retargeting, key-frame interpolation and completion of missing data, performance denoising and retiming, and other potential applications that include full 3D body modeling.
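The pairing step described above can be shown in miniature: because the performance code is identity-agnostic, the same performance can be decoded with any identity code. The additive "decoder" below is a toy stand-in for the identity-modulated transformer decoder, and all values are illustrative.

```python
def decode(identity_code, performance_codes):
    """Produce one 'mesh' (here: a small vector) per frame of the performance."""
    return [[i + p for i, p in zip(identity_code, frame)]
            for frame in performance_codes]

neutral_a = [1.0, 0.0]                   # identity code, actor A
neutral_b = [0.0, 1.0]                   # identity code, actor B
performance = [[0.1, 0.1], [0.2, 0.2]]   # 2-frame identity-agnostic performance

retargeted_a = decode(neutral_a, performance)
retargeted_b = decode(neutral_b, performance)  # same motion, new identity
```

Swapping `neutral_a` for `neutral_b` retargets the whole sequence without touching the motion code, which is the disentanglement the abstract describes.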
Publication link: studios.disneyresearch.com/2022/09/13/facial-animation-with-disentangled-identity-and-motion-using-transformers

MoRF: Morphable Radiance Fields for Multiview Neural Head Modeling
DisneyResearchHub | 2022-07-26

Recent research work has developed powerful generative models (e.g., StyleGAN2) that can synthesize complete human head images with impressive photorealism, enabling applications such as photorealistically editing real photographs. While these models can be trained on large collections of unposed images, their lack of explicit 3D knowledge makes it difficult to achieve even basic control over 3D viewpoint without unintentionally altering identity. On the other hand, recent Neural Radiance Field (NeRF) methods have already achieved multiview-consistent, photorealistic renderings but they are so far limited to a single facial identity. In this paper, we propose a new Morphable Radiance Field (MoRF) method that extends a NeRF into a generative neural model that can realistically synthesize multiview-consistent images of complete human heads, with variable and controllable identity. MoRF allows for morphing between particular identities and synthesizing arbitrary new identities, all while providing realistic and consistent rendering under novel viewpoints. We train MoRF in a simple supervised fashion by leveraging a high-quality database of multiview portrait images of several people, captured in studio with polarization-based separation of diffuse and specular reflection. Here, we demonstrate how MoRF is a strong new step towards 3D morphable neural head modeling.
Publication link: studios.disneyresearch.com/2022/07/24/morf-morphable-radiance-fields-for-multiview-neural-head-modeling

Facial Hair Tracking for High Fidelity Performance Capture
DisneyResearchHub | 2022-07-23

Facial hair is a largely overlooked topic in facial performance capture. Most production pipelines in the entertainment industry do not have a way to automatically capture facial hair or track the skin underneath it. Thus, actors are asked to shave clean before face capture, which is very often undesirable. In this paper, we propose the first multiview reconstruction pipeline that tracks both the dense 3D facial hair, as well as the underlying 3D skin for entire performances. We demonstrate the proposed capture pipeline on a variety of different facial hair styles and lengths, ranging from sparse and short to dense full-beards.
Publication link: studios.disneyresearch.com/2022/07/24/facial-hair-tracking-for-high-fidelity-performance-capture

Local Anatomically-Constrained Facial Performance Retargeting
DisneyResearchHub | 2022-07-23

We present a new method for high-fidelity offline facial performance retargeting that is neither expensive nor artifact-prone. Our two-step method first transfers local expression details to the target, and is followed by a global face surface prediction that uses anatomical constraints in order to stay in the feasible shape space of the target character. Our method further offers artists familiar blendshape-based controls to perform fine adjustments to the retargeted animation. As such, our method is ideally suited for the complex task of human-to-human 3D facial performance retargeting, where the quality bar is extremely high in order to avoid the uncanny valley.
Publication link: studios.disneyresearch.com/2022/07/24/local-anatomically-constrained-facial-performance-retargeting

Implicit Neural Representation for Physics-driven Actuated Soft Bodies
DisneyResearchHub | 2022-07-23

Active soft bodies can affect their shape through an internal actuation mechanism that induces a deformation. Similar to recent work, this paper utilizes a differentiable, quasi-static, and physics-based simulation layer to optimize for actuation signals parameterized by neural networks. Our key contribution is a general and implicit formulation to control active soft bodies by defining a function that enables a continuous mapping from a spatial point in the material space to the actuation value. This property allows us to capture the signal’s dominant frequencies, making the method discretization-agnostic and widely applicable. We extend our implicit model to mandible kinematics for the particular case of facial animation and show that we can reliably reproduce facial expressions captured with high-quality capture systems. We apply the method to volumetric soft bodies, human poses, and facial expressions, demonstrating artist-friendly properties, such as simple control over the latent space and resolution invariance at test time.
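The key idea of an implicit actuation field, a continuous function from a material-space point to an actuation value that can be queried at any resolution, can be sketched as follows. The tiny fixed-weight network below is purely illustrative (the paper learns such a function); the weights and the one-hidden-unit shape are assumptions for the sake of a runnable example.

```python
import math

def actuation_field(x, y, z):
    """Map a material-space point to a scalar actuation value in (0, 1)."""
    h = math.tanh(0.5 * x + 0.3 * y - 0.2 * z)  # one hidden unit, fixed weights
    return 1.0 / (1.0 + math.exp(-2.0 * h))     # squash to an actuation level

# Resolution invariance: the same field can be sampled on a coarse or fine grid.
coarse = [actuation_field(x / 2, 0.0, 0.0) for x in range(3)]
fine = [actuation_field(x / 8, 0.0, 0.0) for x in range(9)]
```

Because the field is defined on continuous coordinates rather than on a fixed mesh, refining the sampling grid requires no retraining, which is what "resolution invariance at test time" refers to.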
Publication link: studios.disneyresearch.com/2022/07/24/implicit-neural-representation-for-physics-driven-actuated-soft-bodies

Shape Transformers: Topology-Independent 3D Shape Models Using Transformers
DisneyResearchHub | 2022-04-24

Parametric 3D shape models (e.g., for faces) are heavily utilized in computer graphics and vision applications to provide priors on the observed variability of an object’s geometry. Original models were linear and operated on the entire shape at once. They were later enhanced to provide localized control on different shape parts separately. In deep shape models, nonlinearity was introduced via a sequence of fully-connected layers and activation functions, and locality was introduced in recent models that use mesh convolution networks. As common limitations, these models often dictate, in one way or another, the allowed extent of spatial correlations and also require that a fixed mesh topology be specified ahead of time. To overcome these limitations, we present a new nonlinear parametric 3D shape model based on transformer architectures. A key benefit of this new model comes from using the transformer’s “self-attention” mechanism to automatically learn nonlinear spatial correlations for a class of 3D shapes. This is in contrast to global models that correlate everything and local models that dictate the correlation extent. Our transformer 3D shape autoencoder is a better alternative to mesh convolution models, which require specially-crafted convolution and down/up-sampling operators that can be difficult to design. Additionally, our model is topologically independent: it can be trained once and then evaluated on any mesh topology, unlike previous methods. We demonstrate the application of our model to different datasets, including 3D faces, 3D hand shapes and full human bodies. Our experiments demonstrate the strong potential of our transformer-based 3D shape model in several applications in computer graphics and vision.
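A small sketch of why a transformer shape model can be topology-independent: self-attention treats vertices as a set of tokens, so the same operation applies to a mesh with any vertex count. The single-head, weight-free attention pass below is an illustration of this property, not the paper's architecture (which learns projection weights).

```python
import numpy as np

def self_attention(vertices: np.ndarray) -> np.ndarray:
    """vertices: [N, 3] -> [N, 3]; N may differ between meshes."""
    scores = vertices @ vertices.T / np.sqrt(vertices.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # softmax over vertices
    return weights @ vertices                     # attention-weighted mixing

small_mesh = np.random.default_rng(0).random((10, 3))   # 10-vertex topology
big_mesh = np.random.default_rng(1).random((250, 3))    # 250-vertex topology

out_small = self_attention(small_mesh)  # same function, no retraining,
out_big = self_attention(big_mesh)      # works for either topology
```

Contrast this with a mesh convolution, whose neighborhood and pooling operators are tied to one fixed connectivity.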
Publication link: studios.disneyresearch.com/2022/04/25/shape-transformers-topology-independent-3d-shape-models-using-transformers

Improved Lighting Models for Facial Appearance Capture
DisneyResearchHub | 2022-04-24

Facial appearance capture techniques estimate geometry and reflectance properties of facial skin by performing a computationally intensive inverse rendering optimization in which one or more images are re-rendered a large number of times and compared to real images coming from multiple cameras. Due to the high computational burden, these techniques often make several simplifying assumptions to tame complexity and make the problem more tractable. For example, it is common to assume that the scene consists of only distant light sources, and ignore indirect bounces of light (on the surface and within the surface). Also, methods based on polarized lighting often simplify the light interaction with the surface and assume perfect separation of diffuse and specular reflectance. In this paper, we move in the opposite direction and demonstrate the impact on facial appearance capture quality when departing from these idealized conditions towards models that seek to more accurately represent the lighting, while at the same time minimally increasing computational burden. We compare the results obtained with a state-of-the-art appearance capture method [RGB∗20], with and without our proposed improvements to the lighting model.
Link to publication file: studios.disneyresearch.com/2022/04/25/improved-lighting-models-for-facial-appearance-capture

Neural Frame Interpolation for Rendered Content
DisneyResearchHub | 2021-11-30

The demand for creating rendered content continues to grow drastically. As rendering high-quality computer generated images is often extremely computationally expensive and thus costly, there is a high incentive to reduce this computational burden. Recent advances in learning-based frame interpolation methods have shown exciting progress but still have not achieved the production-level quality that would be required to render fewer pixels and achieve savings in rendering times and costs. Therefore, in this paper we propose a method specifically targeted at achieving high-quality frame interpolation for rendered content. In this setting, we assume that we have full input every n-th frame, in addition to auxiliary feature buffers that are cheap to evaluate (e.g. depth, normals, albedo) for every frame. We propose solutions for leveraging such auxiliary features to obtain better motion estimates and more accurate occlusion handling, and to correctly reconstruct non-linear motion between keyframes. With this, our method is able to significantly push the state-of-the-art in frame interpolation for rendered content, and we are able to obtain production-level quality results.
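The rendering schedule implied by the setting above can be sketched as follows: full renders every n-th frame, cheap auxiliary buffers (depth, normals, albedo) for every frame, and the remaining color frames left for the interpolation network. The function and field names are illustrative, not from the paper.

```python
def render_schedule(num_frames: int, n: int):
    """Return per-frame work items for an n-keyframe interpolation setup."""
    plan = []
    for t in range(num_frames):
        item = {"frame": t, "aux_buffers": True}   # depth/normals/albedo: cheap
        item["full_render"] = (t % n == 0)         # expensive full render
        item["interpolate"] = not item["full_render"]
        plan.append(item)
    return plan

plan = render_schedule(num_frames=9, n=4)
full = [p["frame"] for p in plan if p["full_render"]]    # rendered keyframes
interp = [p["frame"] for p in plan if p["interpolate"]]  # frames to interpolate
```

With n = 4, only every fourth frame pays the full rendering cost, which is where the claimed savings come from, provided the interpolated frames reach production quality.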
Link to publication file: studios.disneyresearch.com/2021/11/30/neural-frame-interpolation-for-rendered-content

Rendering with Style: Combining Traditional and Neural Approaches for High Quality Face Rendering
DisneyResearchHub | 2021-11-30

In this work we propose to combine incomplete, high-quality renderings showing only facial skin with recent methods for neural rendering of faces, in order to automatically and seamlessly create photo-realistic full-head portrait renders from captured data without the need for artist intervention. Our method begins with traditional face rendering, where the skin is rendered with the desired appearance, expression, viewpoint, and illumination. These skin renders are then projected into the latent space of a pre-trained neural network that can generate arbitrary photo-real face images (StyleGAN2). The result is a sequence of realistic face images that match the identity and appearance of the 3D character at the skin level, but are completed naturally with synthesized hair, eyes, inner mouth and surroundings.
Link to publication file: studios.disneyresearch.com/2021/11/30/rendering-with-style-combining-traditional-and-neural-approaches-for-high-quality-face-rendering

Adaptive Convolutions for Structure-Aware Style Transfer
DisneyResearchHub | 2021-06-18

Style transfer between images is an artistic application of CNNs, where the ‘style’ of one image is transferred onto another image without modifying its content. The current state-of-the-art in neural style transfer uses a technique called Adaptive Instance Normalization (AdaIN), which transfers the statistical properties of style features to a content image, and can transfer an infinite number of styles in real time. However, AdaIN is a global operation, and thus local geometric structures in the style image are often ignored during the transfer. We propose adaptive convolutions, a generic extension of AdaIN which allows for the simultaneous transfer of both statistical and structural styles in real time. Apart from style transfer, our method can also be readily extended to style-based image generation, and other tasks where AdaIN has already been adopted.
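For reference, the AdaIN operation that the adaptive convolutions above extend is simple to state: content features are renormalized to match the channel-wise mean and standard deviation of the style features. A minimal numpy sketch (the standard formulation, applied to toy feature maps):

```python
import numpy as np

def adain(content: np.ndarray, style: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """content, style: [C, H, W] feature maps; statistics are per channel."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    # normalize content statistics, then re-scale to the style statistics
    return s_std * (content - c_mean) / (c_std + eps) + s_mean

rng = np.random.default_rng(0)
content = rng.random((2, 4, 4))
style = 3.0 * rng.random((2, 4, 4)) + 5.0
out = adain(content, style)   # content structure, style statistics
```

Because only channel-wise statistics are transferred, the operation is global per channel; local geometric structure in the style image never enters the computation, which is the limitation the paper's adaptive convolutions address.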
Link to publication file: studios.disneyresearch.com/2021/06/19/adaptive-convolutions-for-structure-aware-style-transfer

A Versatile Inverse Kinematics Formulation for Retargeting Motions onto Robots with Kinematic Loops
DisneyResearchHub | 2021-02-09

Robots with kinematic loops are known to have superior mechanical performance. However, due to these loops, their modeling and control is challenging, which prevents a more widespread use. In this paper, we describe a versatile Inverse Kinematics (IK) formulation for the retargeting of expressive motions onto mechanical systems with loops. We support the precise control of the position and orientation of several end-effectors, and the Center of Mass (CoM) of slowly walking robots. Our formulation safeguards against a disassembly when IK targets are moved outside the workspace of the robot, and we introduce a regularizer that smoothly circumvents kinematic singularities where velocities go to infinity. With several validation examples and three physical robots, we demonstrate the versatility and efficacy of our IK on overactuated systems with loops, and for the retargeting of an expressive motion onto a bipedal robot.

FaceMagic: Real-time Facial Detail Effects on Mobile
DisneyResearchHub | 2020-12-03

We present a novel real-time face detail reconstruction method capable of recovering high-quality geometry on consumer mobile devices. Our system first uses a morphable model and semantic segmentation of facial parts to achieve robust self-calibration. We then capture fine-scale surface details using a patch-based Shape from Shading (SfS) approach. We pre-compute the patch-wise constant Moore–Penrose inverse matrix of the resulting linear system to achieve real-time performance. Our method achieves high interactive frame-rates, and experiments show that our new approach is capable of reconstructing high-fidelity geometry with results comparable to off-line techniques.
We illustrate this through comparisons with off-line and on-line related works, and include demonstrations of novel face detail shader effects.

ADD: Analytically Differentiable Dynamics for Multi-Body Systems with Frictional Contact
DisneyResearchHub | 2020-12-03

We present a differentiable dynamics solver that is able to handle frictional contact for rigid and deformable objects within a unified framework. Through a principled mollification of normal and tangential contact forces, our method circumvents the main difficulties inherent to the non-smooth nature of frictional contact. We combine this new contact model with fully implicit time integration to obtain a robust and efficient dynamics solver that is analytically differentiable. In conjunction with adjoint sensitivity analysis, our formulation enables gradient-based optimization with adaptive trade-offs between simulation accuracy and smoothness of objective function landscapes. We thoroughly analyse our approach on a set of simulation examples involving rigid bodies, visco-elastic materials, and coupled multibody systems. We furthermore showcase applications of our differentiable simulator to parameter estimation for deformable objects, motion planning for robotic manipulation, trajectory optimization for compliant walking robots, as well as efficient self-supervised learning of control policies.

Automated Routing of Muscle Fibers for Soft Robots
DisneyResearchHub | 2020-12-02

This video introduces a computational approach for routing thin artificial muscle actuators through hyperelastic soft robots, in order to achieve a desired deformation behavior. Provided with a robot design and a set of example deformations, we continuously co-optimize the routing of actuators, and their actuation, to approximate the example deformations as closely as possible.
We introduce a data-driven model for McKibben muscles, modeling their contraction behavior when embedded in a silicone elastomer matrix. To enable the automated routing, a differentiable hyperelastic material simulation is presented. Because standard finite elements are not differentiable at element boundaries, we implement a Moving Least Squares formulation, making the deformation gradient twice-differentiable.
Our robots are fabricated in a two-step molding process, with the complex mold design steps automated. While most soft robotic designs utilize bending, we study the use of our technique in approximating twisting deformations on a bar example. To demonstrate the efficacy of our technique in soft robotic design, we show a continuum robot, a tentacle, and a 4-legged walking robot.

RobotSculptor: Artist-Directed Robotic Sculpting of Clay
DisneyResearchHub | 2020-11-30

We present an interactive design system that allows users to create sculpting styles and fabricate clay models using a standard 6-axis robot arm. Given a general mesh as input, the user iteratively selects sub-areas of the mesh through decomposition and embeds the design expression into an initial set of toolpaths by modifying key parameters that affect the visual appearance of the sculpted surface finish. These parameters were identified and extracted through a series of design experiments, using a customized loop tool to cut the water-based clay material. The initialized toolpaths are then fed into the optimization component of our system for optimal path planning, which aims to find robotic sculpting motions that match the target surface, maintain the design expression, and resolve collisions and reachability issues. We demonstrate the versatility of our approach by designing and fabricating different sculpting styles over a wide range of clay models.

Designing Robotically-Constructed Metal Frame Structures
DisneyResearchHub | 2020-11-30

We present a computational technique that aids with the design of structurally-sound metal frames, tailored for robotic fabrication using an existing process that integrates automated bar bending, welding, and cutting.
By aligning frames with structurally-favorable orientations and decomposing models into fabricable units, we make the fabrication process scale-invariant, and the frames align globally in an aesthetically-pleasing and structurally-informed manner.

Semantic Deep Face Models
DisneyResearchHub | 2020-11-24

Face models built from 3D face databases are often used in computer vision and graphics tasks such as face reconstruction, replacement, tracking and manipulation. For such tasks, commonly used multi-linear morphable models, which provide semantic control over facial identity and expression, often lack quality and expressivity due to their linear nature. Deep neural networks offer the possibility of non-linear face modeling, where so far most research has focused on generating realistic facial images with less focus on 3D geometry, and methods that do produce geometry have little or no notion of semantic control, thereby limiting their artistic applicability. We present a method for nonlinear 3D face modeling using neural architectures that provides intuitive semantic control over both identity and expression by disentangling these dimensions from each other, essentially combining the benefits of both multi-linear face models and nonlinear deep face networks. The result is a powerful, semantically controllable, nonlinear, parametric face model. We demonstrate the value of our semantic deep face model with applications of 3D face synthesis, facial performance transfer, performance editing, and 2D landmark-based performance retargeting.
Link to publication file: studios.disneyresearch.com/2020/11/25/semantic-deep-face-models

Realistic and Interactive Robot Gaze
DisneyResearchHub | 2020-10-29

This video describes the development of a system for lifelike gaze in human-robot interactions using a humanoid animatronic bust. We present a general architecture that seeks not only to create gaze interactions from a technological standpoint, but also through the lens of character animation, where the fidelity and believability of motion is paramount; that is, we seek to create an interaction which demonstrates the illusion of life.
Link to publication page: la.disneyresearch.com/publication/realistic-and-in…ctive-robot-gazeData-driven Extraction and Composition of Secondary Dynamics in Facial Performance CaptureDisneyResearchHub2020-08-17 | Performance capture of expressive subjects will inevitably incorporate some fraction of motion that is due to inertial effects and dynamic overshoot due to ballistic motion. Normally these secondary dynamic effects are unwanted, as the captured facial performance is often retargeted to different head motion. This paper advances the hypothesis that, for a highly constrained elastic medium such as the human face, these secondary inertial effects are predominantly due to the motion of the underlying bony structures, and presents the ability to either subtract parasitic secondary dynamics that resulted from unintentional motion during capture, or to compose such effects on top of a quasistatic performance to simulate a new dynamic motion of the actor's body and skull, either artist-prescribed or acquired via motion capture.
Link to publication file: http://studios.disneyresearch.com/2020/08/17/data-driven-extraction-and-composition-of-secondary-dynamics-in-facial-performance-captureSingle Shot High Quality Facial Geometry and Skin Appearance CaptureDisneyResearchHub2020-08-14 | We propose a new light-weight face capture system capable of reconstructing both high-quality geometry and detailed appearance maps from a single exposure. Unlike currently employed appearance acquisition systems, the proposed technology does not require active illumination and hence can readily be integrated with passive photogrammetry solutions. The proposed algorithm leverages images captured under two different polarization states to reconstruct the geometry and to recover appearance properties. We do so by means of an inverse rendering framework, which solves per-texel diffuse albedo, specular intensity, and high-resolution normals, as well as global specular roughness considering the subsurface scattering nature of skin.
Link to publication page: studios.disneyresearch.com/2020/08/14/single-shot-high-quality-facial-geometry-and-skin-appearance-captureRig space Neural RenderingDisneyResearchHub2020-07-16 | Movie productions use high resolution 3d characters with complex proprietary rigs to create the highest quality images possible for large displays. Unfortunately, these 3d assets are typically not compatible with real-time graphics engines used for games, mixed reality and real-time pre-visualization. Consequently, the 3d characters need to be re-modeled and re-rigged for these new applications, requiring weeks of work and artistic approval. Our solution to this problem is to learn a compact image-based rendering of the original 3d character, conditioned directly on the rig parameters. Our idea is to render the character in many different poses and views, and to train a deep neural network to render high resolution images from the rig parameters directly. Many neural rendering techniques have been proposed to render from 2d skeletons, or geometry and UV maps. However, these require additional steps to create the input structure (e.g. a low res mesh), often hold ambiguities between front and back (e.g. 2d skeletons) and, most importantly, do not preserve the animator's workflow of manipulating specific types of rigs, as well as the real-time game engine pipeline of interpolating rig parameters. In contrast, our model learns to render an image directly from the rig parameters at a high resolution. We extend our architecture to support dynamic re-lighting and composition with other objects in the scene. By generating normals, depth, albedo and a mask, we can produce occlusion depth tests and lighting effects through the normals.
Link to publication file: http://studios.disneyresearch.com/2020/03/24/rig-space-neural-renderingInteractive Sculpting of Digital Faces Using an Anatomical Modeling ParadigmDisneyResearchHub2020-07-03 | Digitally sculpting 3D human faces is a very challenging task. It typically requires either 1) highly-skilled artists using complex software packages for high quality results, or 2) highly-constrained simple interfaces for consumer-level avatar creation, such as in game engines. We propose a novel interactive method for the creation of digital faces that is simple and intuitive to use, even for novice users, while consistently producing plausible 3D face geometry, and allowing editing freedom beyond traditional video game avatar creation. At the core of our system lies a specialized anatomical local face model (ALM), which is constructed from a dataset of several hundred 3D face scans. User edits are propagated to constraints for an optimization of our data-driven ALM model, ensuring the resulting face remains plausible even for simple edits like clicking and dragging surface points. We show how several natural interaction methods can be implemented in our framework, including direct control of the surface, indirect control of semantic features like age, ethnicity, gender, and BMI, as well as indirect control through manipulating the underlying bony structures. The result is a simple new method for creating digital human faces, for artists and novice users alike. Our method is attractive for low-budget VFX and animation productions, and our anatomical modeling paradigm can complement traditional game engine avatar design packages.
Link to publication file: http://studios.disneyresearch.com/2020/07/06/interactive-sculpting-of-digital-faces-using-an-anatomical-modeling-paradigmHigh Resolution Neural Face Swapping for Visual EffectsDisneyResearchHub2020-06-29 | We propose an algorithm for fully automatic neural face swapping in images and videos. To the best of our knowledge, this is the first method capable of rendering photo-realistic and temporally coherent results at megapixel resolution. To this end, we introduce a progressively trained multi-way comb network and a light- and contrast-preserving blending method. We also show that while progressive training enables generation of high-resolution images, extending the architecture and training data beyond two people allows us to achieve higher fidelity in generated expressions. When compositing the generated expression onto the target face, we show how to adapt the blending strategy to preserve contrast and low-frequency lighting. Finally, we incorporate a refinement strategy into the face landmark stabilization algorithm to achieve temporal stability, which is crucial for working with high-resolution videos. We conduct an extensive ablation study to show the influence of our design choices on the quality of the swap and compare our work with popular state-of-the-art methods.
Link to publication file: http://studios.disneyresearch.com/2020/06/29/high-resolution-neural-face-swapping-for-visual-effectsAttention Driven Cropping for Very High Resolution Facial Landmark DetectionDisneyResearchHub2020-06-17 | Facial landmark detection is a fundamental task for many consumer and high-end applications. Today, landmark detection is almost entirely solved by machine learning methods that are trained on a dataset of hand annotated images. Existing datasets are primarily made up of only low resolution images, and current algorithms are limited to inputs of comparable quality and resolution as the training dataset. On the other hand, high resolution imagery is becoming increasingly more common as consumer cameras improve in quality every year. Therefore, there is a need for algorithms that can leverage the rich information available in high resolution imagery. Naïvely attempting to reuse existing network architectures on high resolution imagery is prohibitive due to memory bottlenecks on GPUs. The only current solution is to downsample the images, sacrificing resolution and quality. Building on top of recent progress in attention-based networks, we present a novel, fully convolutional regional architecture that is specially designed for predicting landmarks on very high resolution facial images without downsampling. We demonstrate the flexibility of our architecture by training the proposed model with images of resolutions ranging from 256 x 256 to 4K. In addition to being the first method for facial landmark detection on high resolution images, our approach achieves superior performance over traditional (holistic) state-of-the-art architectures across ALL resolutions, leading to a general-purpose, extremely flexible, high quality landmark detector.
Link to publication file: http://studios.disneyresearch.com/2020/06/16/attention-driven-cropping-for-very-high-resolution-facial-landmark-detectionFast Nonlinear Least Squares Optimization of Large Scale Semi Sparse ProblemsDisneyResearchHub2020-05-26 | Many problems in computer graphics and vision can be formulated as a nonlinear least squares optimization problem, for which numerous off-the-shelf solvers are readily available. Depending on the structure of the problem, however, existing solvers may be more or less suitable, and in some cases the solution comes at the cost of lengthy convergence times. One such case is semi-sparse optimization problems, emerging for example in localized facial performance reconstruction, where the nonlinear least squares problem can be composed of hundreds of thousands of cost functions, each one involving many of the optimization parameters. While such problems can be solved with existing solvers, the computation time can severely hinder the applicability of these methods. We introduce a novel iterative solver for nonlinear least squares optimization of large-scale semi-sparse problems. We use the nonlinear Levenberg-Marquardt method to locally linearize the problem in parallel, based on its first-order approximation. Then, we decompose the linear problem into small blocks, using the local Schur complement, leading to a more compact linear system without loss of information. The resulting system is dense but its size is small enough to be solved using a parallel direct method in a short amount of time. The main benefit we get by using such an approach is that the overall optimization process is entirely parallel and scalable, making it suitable to be mapped onto graphics hardware (GPU). By using our minimizer, results are obtained up to one order of magnitude faster than other existing solvers, without sacrificing the generality and the accuracy of the model.
We provide a detailed analysis of our approach and validate our results with the application of performance-based facial capture using a recently-proposed anatomical local face deformation model.
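The block elimination at the heart of such a solver can be illustrated on a single dense system (a minimal sketch of the standard Schur-complement trick, not the authors' parallel GPU implementation): partitioning the linearized normal equations into global and local unknowns, the local block is eliminated first, leaving a small dense reduced system for the global unknowns.

```python
import numpy as np

def schur_solve(A, B, D, b_g, b_l):
    """Solve [[A, B], [B.T, D]] @ [x_g; x_l] = [b_g; b_l] by
    eliminating the local block x_l via the Schur complement S."""
    D_inv_Bt = np.linalg.solve(D, B.T)      # D^{-1} B^T
    D_inv_bl = np.linalg.solve(D, b_l)      # D^{-1} b_l
    S = A - B @ D_inv_Bt                    # Schur complement (small, dense)
    x_g = np.linalg.solve(S, b_g - B @ D_inv_bl)
    x_l = D_inv_bl - D_inv_Bt @ x_g         # back-substitute local unknowns
    return x_g, x_l
```

The reduced system `S` is what would be handed to a parallel direct solver; the local unknowns are recovered afterwards by back-substitution without any loss of information.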
Link to publication file: http://studios.disneyresearch.com/2020/05/25/fast-nonlinear-least-squares-optimization-of-large-scale-semi-sparse-problemsFacial Expression Synthesis using a Global-Local Multilinear FrameworkDisneyResearchHub2020-05-26 | We present a practical method to synthesize plausible 3D facial expressions that preserve the identity of a target subject. The ability to synthesize an entire facial rig from a single neutral expression has a large range of applications both in computer graphics and computer vision, ranging from the efficient and cost-effective creation of CG characters to scalable data generation for machine learning purposes. Unlike previous methods based on multilinear models, the proposed approach is capable of extrapolating well outside the sample pool, which allows it to accurately reproduce the identity of the target subject and create artifact-free expression shapes while requiring only a small input dataset. We introduce local-global multilinear models that leverage the strengths of expression-specific and identity-specific local models combined with coarse motion estimations from a global model. Experimental results show that we achieve high-quality, identity-preserving facial expression synthesis results that outperform existing methods both quantitatively and qualitatively.
Link to publication file: http://studios.disneyresearch.com/2020/05/25/facial-expression-synthesis-using-a-local-global-multilinear-frameworkPoseMMR: A Collaborative Mixed Reality Authoring Tool for Character AnimationDisneyResearchHub2020-03-19 | Augmented reality devices enable new approaches for character animation, e.g., given that character posing is three dimensional in nature, it follows that interfaces with higher degrees-of-freedom (DoF) should outperform 2D interfaces. We present PoseMMR, allowing Multiple users to animate characters in a Mixed Reality environment, much as a stop-motion animator would manipulate a physical puppet, frame-by-frame, to create the scene. We explore how PoseMMR can facilitate immersive posing, animation editing, version control and collaboration, and provide a set of guidelines to foster the development of immersive technologies as tools for collaborative authoring of character animation.
Link to publication file: la.disneyresearch.com/publication/posemmr-a-collaborative-mixed-reality-authoring-tool-for-character-animationMakeSense: Automated Sensor Design for Proprioceptive Soft RobotsDisneyResearchHub2020-01-22 | Soft robots have applications in safe human-robot interactions, manipulation of fragile objects, and locomotion in challenging and unstructured environments. In this paper, we present a computational method for augmenting soft robots with proprioceptive sensing capabilities. Our method automatically computes a minimal stretch-receptive sensor network for user-provided soft robotic designs, which is optimized to perform well under a set of user-specified deformation-force pairs. The sensorized robots are able to reconstruct their full deformation state under interaction forces. We cast our sensor design as a sub-selection problem, selecting from a large set of fabricable candidates a minimal set of sensors that minimizes the error when sensing the specified deformation-force pairs. Unique to our approach is the use of an analytical gradient of our reconstruction performance measure with respect to the selection variables. We demonstrate our technique on a bending bar and gripper example, illustrating more complex designs with a simulated tentacle.
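The sub-selection problem itself can be illustrated with a simple greedy forward-selection baseline (the paper instead relaxes the selection variables and exploits analytical gradients; the names below are hypothetical and the sketch only shows the combinatorial structure of the problem):

```python
def greedy_select(candidates, n_select, error_fn):
    """Greedily pick n_select sensors from `candidates`, at each step
    adding the one that most reduces the reconstruction error of the
    current set, as scored by the user-supplied error_fn(selection)."""
    selected = []
    remaining = list(candidates)
    for _ in range(n_select):
        best = min(remaining, key=lambda c: error_fn(selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

In this toy setting `error_fn` would evaluate how well a candidate sensor set reconstructs the specified deformation-force pairs; greedy selection is a common baseline for such subset problems, but unlike the gradient-based relaxation it offers no optimality guarantee.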
Link to publication file: la.disneyresearch.com/publication/makesense-automated-sensor-design-for-proprioceptive-soft-robotsX-CAD: Optimizing CAD Models with Extended Finite ElementsDisneyResearchHub2019-11-19 | We propose a novel generic shape optimization method for CAD models based on the eXtended Finite Element Method (XFEM). Our method works directly on the intersection between the model and a regular simulation grid, without the need to mesh or remesh, thus removing a bottleneck of classical shape optimization strategies. This is made possible by a novel hierarchical integration scheme that accurately integrates finite element quantities with sub-element precision. For optimization, we efficiently compute analytical shape derivatives of the entire framework, from model intersection to integration rule generation and XFEM simulation. Moreover, we describe a differentiable projection of shape parameters onto a constraint manifold spanned by user-specified shape preservation, consistency, and manufacturability constraints. We demonstrate the utility of our approach by optimizing mass distribution, strength-to-weight ratio, and inverse elastic shape design objectives directly on parameterized 3D CAD models.
Link to publication page: la.disneyresearch.com/publication/x-cad-optimizing-cad-models-with-extended-finite-elementsRecycling a Landmark Dataset for Real-time Face Tracking with Low Cost HMD Integrated CamerasDisneyResearchHub2019-11-15 | Preparing datasets for use in the training of real-time face tracking algorithms for HMDs is costly. Manually annotated facial landmarks are accessible for regular photography datasets, but introspectively mounted cameras for VR face tracking have requirements incompatible with these existing datasets. Such requirements include operating ergonomically at close range with wide angle lenses, low-latency short exposures, and near infrared sensors. In order to train a suitable face solver without the costs of producing new training data, we automatically repurpose an existing landmark dataset to these specialist HMD camera intrinsics with a radial warp reprojection. Our method separates training into local regions of the source photos, i.e. the mouth and eyes, for more accurate local correspondence to the mounted camera locations underneath and inside the fully functioning HMD. We combine per-camera solved landmarks to yield a live animated avatar driven from the user's face expressions. Critical robustness is achieved with measures for mouth region segmentation, blink detection and pupil tracking. We quantify results against the unprocessed training dataset and provide empirical comparisons with commercial face trackers.
Fast Handovers with a Robot Character: Small Sensorimotor Delays Improve Perceived Qualities | The system has the appearance of a robot character, with a bear-like head and a soft anthropomorphic hand, and uses Bézier curves to achieve smooth minimum-jerk motions. Fast timing is enabled by low latency motion capture and real-time trajectory generation: the robot initially moves towards an expected handover location and the trajectory is updated on-the-fly to converge smoothly to the actual handover location. A hybrid automaton provides robustness to failure and unexpected human actions.
In a 3×3 user study, we vary the speed of the robot and add variable sensorimotor delays. We evaluate the social perception of the robot using the Robot Social Attribute Scale (RoSAS). Inclusion of a small delay, mimicking the delay of the human sensorimotor system, leads to an improvement in perceived qualities over both no delay and long delay conditions. Specifically, with no delay the robot is perceived as more discomforting and with a long delay, it is perceived as less warm.
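The minimum-jerk profile behind such handover motions can be sketched in a few lines (the classic quintic formulation from the motor-control literature, not the system's actual trajectory generator): a fifth-order blend with zero velocity and acceleration at both endpoints, whose goal point can be swapped on-the-fly simply by re-evaluating with the updated handover location.

```python
def min_jerk(x0, xf, T, t):
    """Minimum-jerk interpolation from x0 to xf over duration T.
    The quintic blend 10s^3 - 15s^4 + 6s^5 has zero velocity and
    acceleration at s = 0 and s = 1, giving smooth starts and stops."""
    s = min(max(t / T, 0.0), 1.0)   # clamp normalized time to [0, 1]
    blend = 10 * s**3 - 15 * s**4 + 6 * s**5
    return x0 + (xf - x0) * blend
```

Applying the same blend per coordinate yields a straight-line Cartesian path; updating `xf` mid-motion shifts the remainder of the trajectory toward the new goal while the blend keeps the profile smooth.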
Link to publication page: la.disneyresearch.com/publication/fast-handovers-with-a-robot-character-small-sensorimotor-delays-improve-perceived-qualitiesTowards a Natural Motion Generator: a Pipeline to Control a Humanoid based on Motion DataDisneyResearchHub2019-11-02 | This paper studies the imitation of upper-body motions of human demonstrators or animation characters by human-shaped robots. We present a pipeline for motion retargeting by defining the joints of interest (JOI) of both the source skeleton and the target humanoid robot. To this end, we deploy an optimization-based motion transfer method utilizing link length modifications of the source skeleton and a task (Cartesian) space fine-tuning of JOI motion descriptors. To evaluate the effectiveness of the proposed pipeline, we use two different 3-D motion datasets from three human demonstrators and an Ogre animation character, Bork, and successfully transfer the motions to four different humanoid robots: DARwIn-OP, COmpliant HuMANoid Platform (COMAN), THORMANG, and Atlas. Furthermore, COMAN and THORMANG are controlled in hardware to show that the proposed method can be deployed to physical robots.
Link to publication page: projects.disneyresearch.com/pubproc/1259Parameterized Animated ActivitiesDisneyResearchHub2019-10-28 | This work addresses the development of a character animation editing method that accommodates animation changes while preserving the animator’s original artistic intent. Our goal is to give the artist control over the automatic editing of animations by extending them with artist-defined metadata. We propose a metadata representation that describes which aspects of an animation can be varied. To make the authoring process easier, we have developed an interface for specifying the metadata. Our method extracts a collection of trajectories of both effectors and objects for the animation. We approximate and parameterize the trajectories with a series of cubic Bézier curves. Then, we generate a set of high-level parameters for editing which are related to trajectory deformations. The only possible deformations are those that preserve the fine structure of the original motion. From the trajectories, we use inverse kinematics to generate a new animation that conforms to the user’s edits while preserving the overall character of the original.
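The cubic Bézier parameterization of trajectories can be illustrated with the standard Bernstein-form evaluation (a generic sketch; the control-point names are illustrative, not the paper's representation):

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1].
    p0 and p3 are interpolated endpoints; p1 and p2 are the control
    points that shape the curve between them."""
    u = 1.0 - t
    return (u**3 * p0 + 3 * u**2 * t * p1
            + 3 * u * t**2 * p2 + t**3 * p3)
```

Editing a trajectory segment then amounts to displacing the control points of its Bézier pieces: moves that keep the pieces joined smoothly deform the path while preserving the fine structure of the original motion.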
Link to publication page: http://studios.disneyresearch.com/2019/10/27/parameterized-animated-activitiesVibration-Minimizing Motion Retargeting for Robotic CharactersDisneyResearchHub2019-07-29 | Creating animations for robotic characters is very challenging due to the constraints imposed by their physical nature. In particular, the combination of fast motions and unavoidable structural deformations leads to mechanical oscillations that negatively affect their performances. Our goal is to automatically transfer motions created using traditional animation software to robotic characters while avoiding such artifacts. To this end, we develop an optimization-based, dynamics-aware motion retargeting system that adjusts an input motion such that visually salient low-frequency, large amplitude vibrations are suppressed. The technical core of our animation system consists of a differentiable dynamics simulator that provides constraint-based two-way coupling between rigid and flexible components. We demonstrate the efficacy of our method through experiments performed on a total of five robotic characters including a child-sized animatronic figure that features highly dynamic drumming and boxing motions.
Link to publication page: la.disneyresearch.com/publication/publication-process-vibration-minimizing-motion-retargeting-for-robotic-charactersNeural Importance SamplingDisneyResearchHub2019-07-17 | We propose to use deep neural networks for generating samples in Monte Carlo integration. Our work is based on non-linear independent components estimation (NICE), which we extend in numerous ways to improve performance and enable its application to integration problems. First, we introduce piecewise-polynomial coupling transforms that greatly increase the modeling power of individual coupling layers. Second, we propose to preprocess the inputs of neural networks using one-blob encoding, which stimulates localization of computation and improves inference. Third, we derive a gradient-descent-based optimization for the KL and the chi-square divergence for the specific application of Monte Carlo integration with unnormalized stochastic estimates of the target distribution. Our approach enables fast and accurate inference and efficient sample generation independently of the dimensionality of the integration domain. We show its benefits on generating natural images and in two applications to light-transport simulation: first, we demonstrate learning of joint path-sampling densities in the primary sample space and importance sampling of multi-dimensional path prefixes thereof. Second, we use our technique to extract conditional directional densities driven by the product of incident illumination and the BSDF in the rendering equation and we leverage the densities for path guiding. In all applications, our approach yields on-par or higher performance than competing techniques at equal sample count.
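The one-blob encoding mentioned above generalizes one-hot encoding: a scalar in [0, 1] is discretized into k bins, and instead of activating a single bin, a Gaussian kernel centered at the input is evaluated at the k bin centers. A minimal sketch (function and parameter names are assumptions, with kernel width 1/k):

```python
import math

def one_blob(x, k=32):
    """One-blob encoding of a scalar x in [0, 1]: a Gaussian kernel
    of width 1/k centered at x, sampled at the k bin centers. Nearby
    bins receive nonzero activation, which aids localized inference."""
    sigma = 1.0 / k
    centers = [(i + 0.5) / k for i in range(k)]
    return [math.exp(-0.5 * ((c - x) / sigma) ** 2) for c in centers]
```

Each input dimension is encoded independently, so a d-dimensional network input becomes a d*k-dimensional vector with a soft "blob" of activation around each coordinate's value.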
Link to publication page: http://studios.disneyresearch.com/2019/07/12/neural-importance-samplingAccurate Markerless Jaw Tracking for Facial Performance CaptureDisneyResearchHub2019-07-12 | We present the first method to accurately track the invisible jaw based solely on the visible skin surface, without the need for any markers or augmentation of the actor. As such, the method can readily be integrated with off-the-shelf facial performance capture systems. The core idea is to learn a non-linear mapping from the skin deformation to the underlying jaw motion on a dataset where ground-truth jaw poses have been acquired, and then to retarget the mapping to new subjects. Solving for the jaw pose plays a central role in visual effects pipelines, since accurate jaw motion is required when retargeting to fantasy characters and for physical simulation. Currently, this task is performed mostly manually to achieve the desired level of accuracy, and the presented method has the potential to fully automate this labour-intensive and error-prone process.
Link to publication page: studios.disneyresearch.com/2019/07/12/accurate-markerless-jaw-tracking-for-facial-performance-captureTangent Space Optimization of Controls for Character AnimationDisneyResearchHub2019-07-12 | We formulate the control of interpolations in animation with positional constraints over time as a space-time optimization problem in the tangent space of the curves driving the animation controls. Our method has the key properties that it (1) allows for the manipulation of positions and orientations over time, extending inverse kinematics, (2) does not add new keyframes, and (3) works in the space of editable animation curves and hence integrates seamlessly with current pipelines.
Link to publication page: http://studios.disneyresearch.com/2019/07/12/tangent-space-optimization-of-controls-for-character-animationPractical Person Specific Eye RiggingDisneyResearchHub2019-05-27 | We present a novel parametric eye rig for eye animation, including a new multi-view imaging system that can reconstruct eye poses at submillimeter accuracy to which we fit our new rig. This allows us to accurately estimate person-specific eyeball shape, rotation center, interocular distance, visual axis, and other rig parameters resulting in an animation-ready eye rig. We demonstrate the importance of several aspects of eye modeling that are often overlooked, for example that the visual axis is not identical to the optical axis, that it is important to model rotation about the optical axis, and that the rotation center of the eye should be measured accurately for each person. Since accurate rig fitting requires hand annotation of multi-view imagery for several eye gazes, we additionally propose a more user-friendly “lightweight” fitting approach, which leverages an average rig created from several pre-captured accurate rigs. Our lightweight rig fitting method allows for the estimation of eyeball shape and eyeball position given only a single pose with a known look-at point (e.g. looking into a camera) and few manual annotations.
Link to publication page: studios.disneyresearch.com/2019/05/06/practical-person-specific-eye-riggingTrajectory-based Probabilistic Policy Gradient for Learning Locomotion BehaviorsDisneyResearchHub2019-05-16 | We propose a trajectory-based reinforcement learning method named deep latent policy gradient (DLPG) for learning locomotion skills. We define the policy function as a probability distribution over trajectories and train the policy using a deep latent variable model to achieve sample efficient skill learning. We first evaluate the sample efficiency of DLPG compared to the state-of-the-art reinforcement learning methods in simulated environments. Then, we apply the proposed method to a four-legged walking robot named Snapbot to learn three basic locomotion skills of turn left, go straight, and turn right. We demonstrate that, by properly designing two reward functions for curriculum learning, Snapbot successfully learns the desired locomotion skills with moderate sample complexity.