Andrew Chan, Jaymo Kang, Julia Luo
1: CS 184/CS 280.
2: CS 280.
We propose a full face reenactment program that, given source and target videos of individual actors’ facial expressions, can synthesize a convincing, photo-realistic re-animation of the target actor driven by the source actor’s 3D head position, rotation, and facial expression.
Face reenactment is a recent, disruptive application of existing computer vision and graphics techniques in which facial expressions are transferred from an actor in a source video to a person in a target video. Face2Face, a CVPR 2016 Oral Presentation from five researchers at three different institutions, attracted considerable media attention when its real-time facial reenactment method was used to transfer facial expressions onto videos of public figures including Putin, Trump, Obama, and Bush. Deep Video Portraits, a SIGGRAPH 2018 paper, improved on this method by allowing transfer from source to target of not only facial expressions, but also 3D head position, rotation, eye gaze, and eye blinking, using a novel deep-learning-based renderer instead of a compositor in the video synthesis step.
The above approaches for face reenactment utilize a wide array of vision and graphics algorithms. At a high level, Kim et al.’s method can be divided into three stages: (1) monocular face reconstruction, (2) conditioning input synthesis, and (3) rendering-to-video translation.
Monocular Face Reconstruction
3D Morphable Models are a statistical method from Blanz and Vetter that decomposes facial information into shape and texture vectors, allowing faces to be mapped robustly onto vector spaces and various facial attributes to be manipulated independently. Following Booth et al.’s formulation, we first consider the shape vector $\mathbf{v} \in \mathbb{R}^{3N}$, where $N$ is the number of 3D vertices in the mesh, and decompose it into a contribution from one’s identity and a contribution from one’s expression. With PCA on neutral face scans, one can procure the identity basis, then use the expression basis to capture the remaining variation in non-neutral face scans. As a result, we can write a mesh vector as $\mathbf{v} = \bar{\mathbf{a}} + E_{\mathrm{id}}\boldsymbol{\alpha} + E_{\mathrm{exp}}\boldsymbol{\delta}$, where $\bar{\mathbf{a}}$ is the mean shape vector, the columns of $E_{\mathrm{id}}$ are orthogonal (but weighted by singular values) identity vectors, the columns of $E_{\mathrm{exp}}$ are orthogonal expression vectors, and $\boldsymbol{\alpha}$ and $\boldsymbol{\delta}$ are the identity and expression coefficient vectors. For this basis, we will be using the 2009 Basel Face Model.
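The linear 3DMM decomposition above can be sketched in a few lines of NumPy. The dimensions and random bases below are placeholders for illustration, not the actual Basel Face Model data:

```python
import numpy as np

# Illustrative dimensions; these are NOT the exact Basel Face Model sizes.
N_VERTS = 1000  # number of mesh vertices
N_ID = 80       # size of identity basis (assumed)
N_EXP = 64      # size of expression basis (assumed)

rng = np.random.default_rng(0)
a_mean = rng.standard_normal(3 * N_VERTS)          # mean shape, flattened (x, y, z per vertex)
E_id = rng.standard_normal((3 * N_VERTS, N_ID))    # identity basis (columns scaled by singular values)
E_exp = rng.standard_normal((3 * N_VERTS, N_EXP))  # expression basis

def reconstruct_mesh(alpha, delta):
    """v = a_mean + E_id @ alpha + E_exp @ delta, the linear 3DMM."""
    return a_mean + E_id @ alpha + E_exp @ delta

alpha = rng.standard_normal(N_ID)   # one identity vector per video
delta = rng.standard_normal(N_EXP)  # one expression vector per frame
v = reconstruct_mesh(alpha, delta)
```

Setting both coefficient vectors to zero recovers the mean face, which is a quick sanity check on any implementation.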
For extracting texture, we will use Trần et al.’s CNN, which regresses textures in the same basis as the Basel Face Model; the architecture, as well as the weights, are available online. The camera parameters can also be expressed as a vector containing the intrinsic camera parameters, three rotation parameters, and three translation parameters.
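As a rough sketch of how such a camera parameter vector might be used, the following pinhole projection assumes a hypothetical layout (focal length, principal point, three Euler angles, three translations); the exact parameterization in our pipeline may differ:

```python
import numpy as np

def euler_to_rot(rx, ry, rz):
    """Rotation matrix from three Euler angles (XYZ order; the convention is an assumption)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def project(points, cam):
    """Pinhole projection of an Nx3 array of mesh vertices given a camera vector
    cam = (f, cx, cy, rx, ry, rz, tx, ty, tz); this layout is an assumption."""
    f, cx, cy, rx, ry, rz, tx, ty, tz = cam
    R = euler_to_rot(rx, ry, rz)
    p_cam = points @ R.T + np.array([tx, ty, tz])          # rigid transform into camera space
    return f * p_cam[:, :2] / p_cam[:, 2:3] + np.array([cx, cy])  # perspective divide
```

This is the projection needed later when comparing mesh vertices and landmarks against the input frames.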
To extract these vectors from a video, we minimize an energy function $E = E_{\mathrm{photo}} + w_{\mathrm{land}} E_{\mathrm{land}} + w_{\mathrm{temp}} E_{\mathrm{temp}}$, where $E_{\mathrm{photo}}$ is an $\ell_2$ loss between the input image’s colors at the points given by the projected mesh vertices and the texture given by the model, $E_{\mathrm{land}}$ is an $\ell_2$ loss between the mesh’s projected facial landmarks and the landmarks found directly in the input video frames, and $E_{\mathrm{temp}}$ imposes a loss on the change in parameters between consecutive frames to prevent jittering. Because a person has only one identity throughout the course of a video, a single identity vector $\boldsymbol{\alpha}$ and texture vector $\boldsymbol{\beta}$ are generated per video by averaging. However, we estimate an expression vector $\boldsymbol{\delta}$ and a pose from every video frame, and these change the person’s expression over time.
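A minimal sketch of this fitting energy, with the term grouping and the weights `w_land` and `w_temp` as assumed hyperparameters:

```python
import numpy as np

def fitting_energy(frame_colors, model_colors,
                   proj_landmarks, detected_landmarks,
                   params_t, params_prev,
                   w_land=1.0, w_temp=1.0):
    """Sketch of the fitting energy: photometric + landmark + temporal terms.
    The weights w_land and w_temp are assumed, not tuned values."""
    e_photo = np.sum((frame_colors - model_colors) ** 2)          # image colors vs. model texture at vertices
    e_land = np.sum((proj_landmarks - detected_landmarks) ** 2)   # projected vs. detected landmarks
    e_temp = np.sum((params_t - params_prev) ** 2)                # penalize frame-to-frame parameter jitter
    return e_photo + w_land * e_land + w_temp * e_temp
```

A perfect fit with no parameter change between frames drives all three terms, and hence the total energy, to zero.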
Conditioning Input Synthesis
Using our monocular face reconstruction procedure, we can reconstruct the parameterized face in each frame of both the source and target video. Given the source and target face parameters, we transfer expression and pose from source to target by copying the parameters over in a relative manner. Then we render conditioning images of the target actor’s face by synthesizing and rasterizing a morphable model mesh with the modified parameters. These images are used as input to our rendering-to-video translation network.
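One simple way to realize the relative transfer, sketched under the assumption that each frame’s source parameters are applied as deviations from the source actor’s neutral parameters on top of the target’s neutral parameters:

```python
import numpy as np

def transfer_relative(src_frame_params, src_neutral_params, tgt_neutral_params):
    """Relative parameter transfer (an assumed scheme): apply the source's
    per-frame deviation from its own neutral face on top of the target's
    neutral face, rather than copying the source values verbatim."""
    return tgt_neutral_params + (src_frame_params - src_neutral_params)
```

Copying deviations rather than absolute values helps preserve the target actor’s identity while still following the source actor’s motion.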
Specifically, for temporal coherence, to generate the $t$-th frame of our output we stack the current conditioning image with the last $T - 1$ rasterized conditioning images and use the resulting $3T$-channel tensor (3 channels for each image, and $T$ images in our sliding window) as input.
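The sliding-window stacking can be sketched as follows; the default window size `T` here is an assumption, and frames before the start of the video are padded by repeating the first image:

```python
import numpy as np

def stack_window(cond_images, t, T=11):
    """Stack conditioning image t with the previous T-1 images along the
    channel axis. cond_images is a list of HxWx3 arrays; returns an
    HxWx(3T) array. T=11 is an assumed window size, and indices before
    frame 0 are clamped to repeat the first frame."""
    frames = [cond_images[max(i, 0)] for i in range(t - T + 1, t + 1)]
    return np.concatenate(frames, axis=-1)
```

For example, with `T=3` the network input for frame `t` is the concatenation of frames `t-2`, `t-1`, and `t`, giving 9 channels.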
After obtaining the video frames of our rasterized mesh, our goal is to convert them into our final output video frames, which should resemble our target video. Here we will try a network architecture similar to NVIDIA’s video-to-video synthesis paper (vid2vid). Specifically, given our sequence of input frames $s_1^T = \{s_1, \dots, s_T\}$ and a sequence of target video frames $x_1^T = \{x_1, \dots, x_T\}$, we want to output reconstructed video frames $\tilde{x}_1^T = \{\tilde{x}_1, \dots, \tilde{x}_T\}$ such that the conditional distribution $p(\tilde{x}_1^T \mid s_1^T)$ matches $p(x_1^T \mid s_1^T)$.
Here, vid2vid uses a conditional Generative Adversarial Network with a single generator $G$ and two conditional discriminators, $D_I$ and $D_V$. The generator produces video frames sequentially under a Markov assumption in which the current video frame depends on only the past $L$ frames (they ultimately set $L = 2$ for their experiments). The image discriminator $D_I$ ensures that our reconstructed video resembles the target video and thus discriminates between image frames in our reconstructed video and those in the original target video (i.e., it outputs 0 for “fake” video frames and 1 for “real” video frames). The video discriminator $D_V$ ensures that our reconstructed video has temporal dynamics similar to the original video’s, discriminating between consecutive frames of the reconstructed video and those of the original video, given the optical flow for the past frames of the original video. They then train both discriminators with the generator using the GAN minimax loss, finding the generator that minimizes the sum of both discriminator losses as well as a flow estimation loss term.
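A toy sketch of this two-discriminator objective; the binary cross-entropy formulation and the flow-loss weight `lam_flow` are illustrative assumptions, not vid2vid’s exact losses:

```python
import numpy as np

def bce(pred, label):
    """Binary cross-entropy on discriminator scores in (0, 1)."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return -(label * np.log(pred) + (1 - label) * np.log(1 - pred)).mean()

def discriminator_loss(d_real, d_fake):
    # A discriminator should score real frames as 1 and generated frames as 0.
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_img_fake, d_vid_fake, flow_loss, lam_flow=10.0):
    # The generator tries to fool both the image and the video discriminator,
    # plus a flow estimation term; lam_flow is an assumed weight.
    return bce(d_img_fake, 1.0) + bce(d_vid_fake, 1.0) + lam_flow * flow_loss
```

Note how the generator’s loss drops as both discriminators’ scores on its output approach 1, which is exactly the minimax tension that drives training.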
generate_conditioning_images.py: a program that can synthesize a sequence of 2000 conditioning images of Putin with modified expressions, given an arbitrary source video and a target video of Putin.