Real-time 3D-aware Portrait Video Relighting

¹ Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences

² Beijing Jiaotong University ³ University of California San Diego ⁴ Cardiff University ⁵ City University of Hong Kong

⁶ The Hong Kong University of Science and Technology ⁷ University of Chinese Academy of Sciences

⁸ National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University

⁹ National Engineering Research Center of Visual Technology, School of Computer Science, Peking University

Figure: Given a portrait video (leftmost column), our method reconstructs a 3D relightable face for each video frame. Users can then adjust the viewpoint and lighting conditions interactively. The second column shows relit video frames with a head pose yaw of 0.3, and the third column shows faces relit under an alternative lighting condition with a frontal head pose. The rightmost column shows the predicted albedo and geometry of the reconstructed face.

Synthesizing realistic videos of talking faces under custom lighting conditions and viewing angles benefits downstream applications such as video conferencing. However, most existing relighting methods are either time-consuming or unable to adjust the viewpoint. In this paper, we present the first real-time 3D-aware method for relighting in-the-wild videos of talking faces based on Neural Radiance Fields (NeRF). Given an input portrait video, our method synthesizes talking faces under both novel views and novel lighting conditions using a photo-realistic and disentangled 3D representation. Specifically, for each video frame, we use fast dual encoders to infer an albedo tri-plane, together with a shading tri-plane conditioned on the desired lighting condition. We also leverage a temporal consistency network to ensure smooth transitions and reduce flickering artifacts. Our method runs at 32.98 fps on consumer-level hardware and achieves state-of-the-art results in terms of reconstruction quality, lighting error, lighting instability, temporal consistency, and inference speed. We demonstrate the effectiveness and interactivity of our method on various portrait videos with diverse lighting and viewing conditions.
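
To make the albedo/shading disentanglement concrete, the following is a minimal sketch of how a 3D point might query two tri-planes and combine the results. It is an illustration under stated assumptions, not the method's implementation: the helper names (`bilinear`, `sample_triplane`, `relit_color`, `constant_plane`) are hypothetical, features from the three planes are assumed to be summed, the shading is assumed to be a single scalar that multiplies an RGB albedo, and the feature decoder, volume rendering, encoders, and temporal consistency network are all omitted.

```python
from typing import List, Sequence

# A feature plane stored as nested lists: plane[row][col][channel].
Plane = List[List[List[float]]]


def bilinear(plane: Plane, u: float, v: float) -> List[float]:
    """Bilinearly interpolate a square feature plane at (u, v) in [-1, 1]."""
    res = len(plane)
    # Clamp to [-1, 1], then map to continuous pixel coordinates [0, res - 1].
    x = (min(max(u, -1.0), 1.0) + 1.0) * 0.5 * (res - 1)
    y = (min(max(v, -1.0), 1.0) + 1.0) * 0.5 * (res - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, res - 1), min(y0 + 1, res - 1)
    fx, fy = x - x0, y - y0
    channels = len(plane[0][0])
    return [
        (1 - fy) * ((1 - fx) * plane[y0][x0][c] + fx * plane[y0][x1][c])
        + fy * ((1 - fx) * plane[y1][x0][c] + fx * plane[y1][x1][c])
        for c in range(channels)
    ]


def sample_triplane(planes: Sequence[Plane], point: Sequence[float]) -> List[float]:
    """Sum the features a 3D point picks up from the XY, XZ, and YZ planes."""
    x, y, z = point
    feats = [
        bilinear(planes[0], x, y),  # projection onto the XY plane
        bilinear(planes[1], x, z),  # projection onto the XZ plane
        bilinear(planes[2], y, z),  # projection onto the YZ plane
    ]
    return [sum(f[c] for f in feats) for c in range(len(feats[0]))]


def relit_color(albedo_planes: Sequence[Plane],
                shading_planes: Sequence[Plane],
                point: Sequence[float]) -> List[float]:
    """Assumed composition: RGB albedo scaled by a scalar shading value."""
    albedo = sample_triplane(albedo_planes, point)        # 3 channels (RGB)
    shading = sample_triplane(shading_planes, point)[0]   # 1 channel
    return [a * shading for a in albedo]


def constant_plane(res: int, value: Sequence[float]) -> Plane:
    """Build a res x res plane whose every texel holds `value` (toy data only)."""
    return [[list(value) for _ in range(res)] for _ in range(res)]


if __name__ == "__main__":
    albedo = [constant_plane(4, [0.2, 0.1, 0.05]) for _ in range(3)]
    shading = [constant_plane(4, [0.5]) for _ in range(3)]
    print(relit_color(albedo, shading, (0.0, 0.0, 0.0)))
```

In this sketch, changing the lighting only requires replacing the shading planes, while the albedo planes stay fixed for a given frame, which is the practical appeal of keeping the two representations separate.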