Emote Portrait Alive (EMO): Generating Videos from a Photo and an Audio Clip

What is EMO Technology?
Technical Features and Functions
Experience and Applications
Future Possibilities

Dear friends, have you ever imagined making your photos speak or sing? Today, I want to introduce an exciting technology from Alibaba called EMO (Emote Portrait Alive), which turns that idea into reality.

What is EMO Technology?

EMO is a recent development from Alibaba: a technology that combines a static portrait photo with an audio file to generate a video in which the subject speaks or sings. This means that with just your photo and a recording of your voice, EMO can create a lively representation of you, with a range of facial expressions and head movements.

EMO is an audio-driven framework for generating expressive portrait videos. Given a single reference image and an audio clip (such as speech or singing), it generates a video with rich facial expressions and varied head poses. The length of the generated video is determined by the length of the input audio.
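Because the output length follows the audio, the number of frames to generate can be derived from the clip duration. The following is a minimal sketch of that arithmetic, not EMO's code: the function names `frames_for_audio` and `audio_window` are hypothetical, and the 25 fps default is an illustrative assumption rather than a value stated for EMO.

```python
import math
from typing import Tuple


def frames_for_audio(duration_s: float, fps: float = 25.0) -> int:
    """Number of video frames needed to cover an audio clip.

    Assumption: constant frame rate; fps=25 is illustrative only.
    """
    if duration_s < 0 or fps <= 0:
        raise ValueError("duration_s must be >= 0 and fps must be > 0")
    # Round up so the final partial frame interval is still covered.
    return math.ceil(duration_s * fps)


def audio_window(frame_idx: int, fps: float = 25.0) -> Tuple[float, float]:
    """Start and end time (in seconds) of the audio slice aligned with one frame."""
    return frame_idx / fps, (frame_idx + 1) / fps
```

Under these assumptions, a 4-second clip needs `frames_for_audio(4.0)` frames, and `audio_window(i)` gives the stretch of audio that frame `i` should be synchronized with.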

The framework consists of two main stages. In the first stage, "Frames Encoding", ReferenceNet extracts features from the reference image and the motion frames. In the second stage, the "Diffusion Process", a pre-trained audio encoder turns the audio into an embedding, and a facial region mask is combined with multi-frame noise to control where the face is generated. The backbone network then performs the denoising. It uses two attention mechanisms: reference attention, which preserves the character's identity, and audio attention, which modulates the character's movements. In addition, temporal modules operate along the time dimension and adjust the velocity of motion.
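To make the two attention mechanisms more concrete, here is a toy sketch in plain Python. It is not EMO's implementation: both mechanisms are modeled as simple scaled dot-product cross-attention without learned projections or multiple heads, and the names `cross_attention` and `denoise_block`, the residual additions, and the order of the two steps are all assumptions made for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per token, one column per feature


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention: each query row becomes a weighted mix of value rows."""
    scale = math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        output.append(mixed)
    return output


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum, used here for residual connections."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def denoise_block(latent: Matrix, ref_feats: Matrix, audio_feats: Matrix) -> Matrix:
    """Hypothetical block: attend to reference features, then to audio features."""
    # Reference attention: pulls identity information from the reference image.
    h = add(latent, cross_attention(latent, ref_feats, ref_feats))
    # Audio attention: pulls motion cues from the audio embedding.
    h = add(h, cross_attention(h, audio_feats, audio_feats))
    return h
```

The point of the sketch is the data flow: the noisy latent acts as the query in both steps, while the keys and values come from a different source each time (reference image features for identity, audio features for motion).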

Technical Features and Functions

Audio-Driven Portrait Video Generation

With just a photo and an audio clip as input, EMO can generate a virtual portrait video with natural expressions and dynamic head movements. Whatever emotion the audio conveys, EMO aims to have your digital alter ego express it much as you would.

Rich and Natural Dynamic Rendering of Expressions

EMO emphasizes the naturalness of facial movements and the diversity of expressions in videos. It captures subtle emotional differences in the audio and accurately reflects them in the virtual portrait's expressions.

Support for Multiple Head Pose Variations

Besides facial expressions, EMO can generate various changes in head poses, making the video more vivid and realistic.

Compatible with Multiple Languages and Portrait Styles

This technology is not limited to any specific language or musical style. It can handle audio input in various languages and supports diverse portrait styles, including real portraits, historical figures, and artworks.

Rapid Rhythm Synchronization

For fast-paced audio, such as quick songs or rapid speech, EMO keeps the virtual portrait's movements in sync with the rhythm of the audio.

Cross-Actor Performance Transformation

This technology can also transfer a performance from one actor to another, allowing your virtual image to deliver a specific performance.

Experience and Applications

Imagine uploading just a photo and a voice recording, and EMO creates a video in which your virtual self moves and speaks. Whether it's a special birthday wish for a friend or entertaining social media content, the result is likely to be striking and unique. A number of generated demo videos have been officially released.

Future Possibilities

Alibaba's EMO technology not only opens up new forms of entertainment but also creates possibilities for digital art, online education, remote communication, and more. As the technology continues to advance, we have reason to believe that in the not-too-distant future, everyone could have their own virtual image and use it to communicate and interact with others across time and space.