CS 184 Final Project Proposal

Deep Puppets: Face Expression Transfer for Videos

Andrew Chan, Jaymo Kang, Julia Luo

1: CS 184/CS 280.

2: CS 280.


We propose a full face reenactment program that, given source and target videos of individual actors' facial expressions, can synthesize a convincing, photo-realistic re-animation of the target actor using the source actor's 3D head position, rotation, and facial expression.

Problem Overview

Face reenactment is a recent, disruptive application of existing computer vision and graphics techniques in which facial expressions are transferred from an actor in a source video to a person in a target video. Face2Face, a CVPR 2016 oral presentation from five researchers at three different institutions, attracted considerable media attention when its real-time facial reenactment method was used to transfer facial expressions onto videos of public figures including Putin, Trump, Obama, and Bush. Deep Video Portraits, a SIGGRAPH 2018 paper, improved on this method by allowing transfer from source to target not only of facial expressions, but also of 3D head position, rotation, eye gaze, and eye blinking, using a novel deep-learning-based renderer instead of a compositor in the video-synthesis step.

The above approaches for face reenactment utilize a wide array of vision and graphics algorithms. At a high level, Kim et al.’s approach is the following:

  1. Given a source and target video, construct a low-dimensional parametric representation of both videos using monocular face reconstruction.
  2. Transfer the head pose and expression from source parameters to target parameters.
  3. Render conditioning input images that are converted to a photo-realistic video of the target actor.

The method can thus be divided into 3 stages: (1) monocular face reconstruction, (2) conditioning input synthesis, and (3) rendering-to-video translation.
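The three stages above can be sketched as the following control flow. All function bodies here are trivial stand-ins (hypothetical placeholders, not a real implementation) so the end-to-end pipeline can be illustrated:

```python
# High-level sketch of the three-stage reenactment pipeline.
# Every helper below is a toy stand-in for the stage it names.

def reconstruct_face(frame):
    """Stage 1 stand-in: monocular reconstruction to a parameter dict."""
    return {"pose": frame["pose"], "expression": frame["expression"]}

def transfer(src, tgt):
    """Stage 2 stand-in: move pose and expression from source to target."""
    return {**tgt, "pose": src["pose"], "expression": src["expression"]}

def rasterize(params):
    """Stage 2 stand-in: a real system would render a conditioning image."""
    return params

def render_network(conditioning):
    """Stage 3 stand-in: a real system would run the translation network."""
    return conditioning

def reenact(source_frames, target_frames):
    src = [reconstruct_face(f) for f in source_frames]
    tgt = [reconstruct_face(f) for f in target_frames]
    transferred = [transfer(s, t) for s, t in zip(src, tgt)]
    conditioning = [rasterize(p) for p in transferred]
    return [render_network(c) for c in conditioning]
```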

Monocular Face Reconstruction

3D Morphable Models are a statistical method from Blanz and Vetter that decomposes facial information into a shape vector and a texture vector, allowing faces to be mapped robustly onto vector spaces and individual facial attributes to be manipulated independently. Following Booth et al.'s formulation, we take the shape vector v ∈ R^(3N), where N is the number of 3D vertices in the mesh, and decompose it into a contribution from one's identity and a contribution from one's expression. PCA on neutral face scans yields the identity basis, while the expression basis captures the remaining variation in non-neutral face scans. As a result, we can write a mesh vector as v = v̄ + E_id α + E_exp δ, where v̄ is the mean shape, the columns of E_id are orthogonal (but weighted by singular values) identity vectors, the columns of E_exp are orthogonal expression vectors, and α and δ are the corresponding coefficient vectors. For this basis, we will be using the 3D Basel Face Model from 2009.
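As an illustration, synthesizing a mesh from this decomposition might look like the NumPy sketch below. The basis matrices here are random stand-ins (the real bases come from the Basel Face Model's PCA), and the dimensions are toy sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100                 # number of mesh vertices (toy size; Basel uses ~53k)
k_id, k_exp = 80, 64    # numbers of identity / expression basis vectors

mean = rng.normal(size=3 * N)            # mean face shape, length 3N
E_id = rng.normal(size=(3 * N, k_id))    # identity basis (PCA on neutral scans)
E_exp = rng.normal(size=(3 * N, k_exp))  # expression basis (residual variation)

alpha = rng.normal(size=k_id)   # identity coefficients (fixed per person)
delta = rng.normal(size=k_exp)  # expression coefficients (vary per frame)

# Mesh vertices as a linear combination of the bases.
v = mean + E_id @ alpha + E_exp @ delta
assert v.shape == (3 * N,)
```

Because the model is linear, identity and expression can be edited independently: changing δ re-poses the face without touching who the person is.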

For extracting texture, we will use Tran et al.'s CNN, which regresses textures in the same basis as the Basel Face Model. The architecture, as well as the weights, are available online. The camera parameters can also be expressed as a vector containing the intrinsic camera parameters, three rotation parameters, and three translation parameters.

To extract these vectors from a video, we minimize an energy function E = E_photo + w_land E_land + w_temp E_temp, where E_photo is an ℓ2 loss between the input image's colors at the points given by the mesh vertices and the texture given by the model, E_land is an ℓ2 loss between the mesh's projected facial landmarks and the landmarks detected directly in the input video frames, and E_temp imposes a loss on the frame-to-frame change in parameters to prevent jittering. Because a person has only one identity throughout the course of a video, only one identity vector α and one texture vector β are generated per video, by averaging over frames. However, we obtain an expression vector δ and a set of camera parameters from every video frame, and these capture the person's changing expression and pose.
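A toy version of this per-frame fitting energy is sketched below. The term weights, names, and sampling scheme are illustrative assumptions, not the exact formulation from the papers:

```python
import numpy as np

def energy(params, prev_params,
           sampled_colors, model_colors,
           proj_landmarks, detected_landmarks,
           w_land=10.0, w_temp=100.0):
    """Toy per-frame fitting energy: photometric + landmark + temporal terms."""
    # Photometric term: image colors at mesh vertices vs. model texture.
    e_photo = np.sum((sampled_colors - model_colors) ** 2)
    # Landmark term: projected mesh landmarks vs. detected 2D landmarks.
    e_land = np.sum((proj_landmarks - detected_landmarks) ** 2)
    # Temporal term: penalize parameter change between frames (anti-jitter).
    e_temp = np.sum((params - prev_params) ** 2)
    return e_photo + w_land * e_land + w_temp * e_temp
```

In a real fitter this scalar would be minimized over the parameter vector with a nonlinear least-squares solver, one frame at a time.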

Conditioning Input Synthesis

Using our monocular face reconstruction procedure, we can reconstruct the parameterized face in each frame of both the source and target video. Given the source and target face parameters, we can transfer expression and pose from source to target by copying over the parameters in a relative manner. Then we render conditioning images of the target actor’s face using the modified parameters to synthesize and rasterize a morphable model mesh. These images are used as input to our rendering-to-video translation network.
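One plausible way to realize the relative transfer (an assumption about the exact scheme; Kim et al.'s formulation may differ in detail) is to apply the source's deviation from a reference frame, such as a neutral pose, on top of the target's own reference:

```python
import numpy as np

def relative_transfer(source, source_ref, target_ref):
    """Apply the source's deviation from its own reference (e.g. a neutral
    frame) on top of the target's reference, rather than copying directly."""
    return target_ref + (source - source_ref)

# Example: the source head yaw moved from a reference of 1.0 to 2.0, so the
# target, whose reference yaw is 5.0, should move to 6.0.
```

Transferring deviations rather than raw values keeps the target actor within their own habitual range of pose and expression.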

Specifically, for temporal coherence, to generate the i-th frame of our output we stack the current conditioning image with the last W rasterized conditioning images, where W is the window size, and use the resulting 3(W + 1)-channel tensor (3 channels for each image, with W + 1 images in the sliding window) as input.
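The sliding-window stacking might look like the following sketch, with `window` a hypothetical window size and frames clamped at the start of the video:

```python
import numpy as np

def stack_window(conditioning, i, window):
    """Stack the conditioning image at frame i with the previous `window`
    rasterized images along the channel axis, clamping at the first frame."""
    frames = [conditioning[max(j, 0)] for j in range(i - window, i + 1)]
    return np.concatenate(frames, axis=0)  # shape: (3 * (window + 1), H, W)

# Toy video: 10 frames of shape (3, 4, 4), each filled with its frame index.
video = [np.full((3, 4, 4), float(t)) for t in range(10)]
x = stack_window(video, 5, window=2)   # stacks frames 3, 4, 5
```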

Rendering-to-Video Translation

After obtaining the video frames of our rasterized mesh, our goal is to convert them into our final output video frames, which should resemble the target video. Here we will try a network architecture similar to NVIDIA's video-to-video synthesis paper (vid2vid). Specifically, given a sequence of input frames s_1, ..., s_T and a sequence of target video frames x_1, ..., x_T, we want to output reconstructed video frames x̃_1, ..., x̃_T such that the conditional distribution of the reconstructed frames given the inputs matches that of the true target frames given the inputs.

Here, vid2vid uses a conditional generative adversarial network with a single generator G and two conditional discriminators, an image discriminator D_I and a video discriminator D_V. The generator produces sequential video frames under a Markov assumption in which the current video frame depends on only the past L frames (the authors use L = 2 in their experiments). The discriminator D_I ensures that our reconstructed video resembles the target video, and thus discriminates between image frames of the reconstructed video and those of the original target video (i.e., it outputs 0 for "fake" video frames and 1 for "real" video frames). The discriminator D_V ensures that our reconstructed video has temporal dynamics similar to the original video, and discriminates between consecutive frames of the reconstructed video and those of the original video, given the optical flow for the past frames of the original video. Both discriminators are then trained jointly with the generator using the GAN minimax loss, and the final generator minimizes the sum of both discriminator losses as well as a flow-estimation loss term.
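A schematic of the two-discriminator objective is sketched below. This is a simplified sketch using a plain cross-entropy GAN loss, omitting vid2vid's conditioning inputs, flow network, and feature-matching details:

```python
import numpy as np

def bce(pred, label):
    """Binary cross-entropy on discriminator scores in (0, 1)."""
    pred = np.clip(pred, 1e-7, 1 - 1e-7)
    return float(-(label * np.log(pred) + (1 - label) * np.log(1 - pred)).mean())

def discriminator_loss(d_real, d_fake):
    # Each discriminator pushes real samples toward 1 and fakes toward 0.
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_img_fake, d_vid_fake, flow_loss):
    # The generator tries to fool both the image discriminator and the
    # video discriminator, plus a flow-estimation penalty.
    return bce(d_img_fake, 1.0) + bce(d_vid_fake, 1.0) + flow_loss
```

In the actual minimax game, the discriminator scores would come from networks evaluated on frames (for D_I) and on short flow-warped frame sequences (for D_V), with the two loss functions optimized in alternation.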

Goals and Deliverables

Planned Goals

Stretch Goals



  1. Kim, Hyeongwoo, et al. “Deep video portraits.” ACM Transactions on Graphics (TOG) 37.4 (2018): 163.
  2. Thies, Justus, et al. “Face2face: Real-time face capture and reenactment of RGB videos.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
  3. Zollhöfer, Michael, et al. “State of the art on monocular 3D face reconstruction, tracking, and applications.” Computer Graphics Forum. Vol. 37. No. 2. 2018.
  4. Paysan, Pascal, et al. “A 3D face model for pose and illumination invariant face recognition.” 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance. IEEE, 2009.
  5. Tuan Tran, Anh, et al. “Regressing robust and discriminative 3D morphable models with a very deep neural network.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
  6. Wang, Ting-Chun, et al. “Video-to-video synthesis.” arXiv preprint arXiv:1808.06601 (2018).
  7. Blanz, Volker, and Thomas Vetter. “A morphable model for the synthesis of 3D faces.” Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’99). 1999. 187-194.