AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text


Abstract

Existing text-to-avatar methods either produce static avatars that cannot be animated or struggle to generate animatable avatars with high quality and precise pose control. To address these limitations, we propose AvatarStudio, a coarse-to-fine generative model that generates explicit textured 3D meshes for animatable human avatars. Specifically, AvatarStudio begins with a low-resolution NeRF-based representation for coarse generation, then incorporates SMPL-guided articulation into the explicit mesh representation to support avatar animation and high-resolution rendering. To ensure view consistency and pose controllability of the resulting avatars, we introduce a 2D diffusion model conditioned on DensePose for Score Distillation Sampling (SDS) supervision. By leveraging the synergy between the articulated mesh representation and the DensePose-conditional diffusion model, AvatarStudio creates high-quality, animation-ready avatars from text, significantly outperforming previous methods. Moreover, it supports a range of applications, e.g., multimodal avatar animation and style-guided avatar creation.
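To make the Score Distillation Sampling supervision mentioned in the abstract more concrete, here is a minimal sketch of the standard SDS gradient with an extra conditioning input. It is an illustration under stated assumptions, not the AvatarStudio implementation: `sds_gradient`, `toy_denoiser`, and `ALPHAS_CUMPROD` are hypothetical names, the noise schedule is arbitrary, and the toy denoiser merely stands in for a DensePose-conditioned diffusion model.

```python
import numpy as np

# Hypothetical noise schedule (assumption): cumulative alphas for 1000 timesteps.
ALPHAS_CUMPROD = np.linspace(0.999, 0.01, 1000)


def sds_gradient(rendered, predict_noise, condition, t, alphas_cumprod, rng):
    """SDS gradient with respect to the rendered image.

    rendered:      (H, W, C) image rendered from the 3D avatar
    predict_noise: callable (noisy, t, condition) -> predicted noise (assumed interface)
    condition:     conditioning signal, e.g. a DensePose map rendered from SMPL
    """
    alpha_bar = alphas_cumprod[t]
    noise = rng.standard_normal(rendered.shape)
    # Forward diffusion: blend the render with Gaussian noise at timestep t.
    noisy = np.sqrt(alpha_bar) * rendered + np.sqrt(1.0 - alpha_bar) * noise
    noise_pred = predict_noise(noisy, t, condition)
    # One common weighting choice: w(t) = 1 - alpha_bar_t.
    weight = 1.0 - alpha_bar
    # SDS skips the denoiser Jacobian: the gradient is the weighted noise residual.
    return weight * (noise_pred - noise)


def toy_denoiser(noisy, t, condition):
    """Toy stand-in for a conditioned diffusion model (assumption).

    It behaves as if the clean image were exactly `condition`.
    """
    alpha_bar = ALPHAS_CUMPROD[t]
    return (noisy - np.sqrt(alpha_bar) * condition) / np.sqrt(1.0 - alpha_bar)
```

In a real pipeline this gradient would be backpropagated through the differentiable renderer into the 3D representation. With the toy denoiser, the gradient is proportional to `rendered - condition`, so a gradient-descent step moves the render toward the conditioned target.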

Figure: AvatarStudio pipeline.

Example generated avatars

AvatarStudio generates high-quality avatars in a multi-view consistent way.

Messi
Bruce Lee
Donald Trump
Kim Kardashian
Terracotta Warriors
Albert Einstein
A man wearing a kilt
Captain America
A chef dressed in white
A man with dreadlocks
Lara Croft in Tomb Raider
A karate master wearing a black belt
A professional boxer
A man with curly hair wearing glasses
An American football player
Wolfgang Amadeus Mozart
Harry Potter
Michael Jackson
A ninja
Abraham Lincoln

Avatar creation with more complicated prompts

AvatarStudio also handles more complex prompts, generating avatars that align with their detailed descriptions.

Elderly woman, dressed in a traditional Native American outfit, holding dream catchers, braided hair
Cute chibi Lara Croft, game, Pixar design, studio lighting, modern Disney style, 3D character
Chibi Thor with Mjolnir, cute, volumetric lighting, reflective textures, game, character
Medieval soldier holding two longswords on hands, fantasy, game, character
Tesla trooper, wearing Mecha suit, scifi, game character, unreal, 3D rendering, fantasy
Chibi, single boy, cute, magician's outfit, top hat, magic wand, curly hair, shiny shoes
Young man, dressed in a futuristic cyberpunk outfit, neon accents, holding a high-tech gadget
Elderly gentleman, dressed in a vintage suit, monocle, holding walking canes on hands
Teenage boy, dressed in a modern hip-hop style, baseball cap tilted, holding basketballs
Chibi, 1boy, cute, knight armor, helmet, holding toy knife on hands, Pixar design
Elderly man, dressed in a traditional samurai outfit, holding katana
Chibi, 1girl, hanfu, cat ears, cat girl, silk robe, wavy hair, wearing traditional sandals
Stealthy ninja holding dual katanas, 3D, game character, unreal
A little girl dressed as Wonder Woman, chibi style, volumetric lighting, Disney style
Strong Slayer, holding machete on hands, game character, 3D rendering, unreal
Cute chibi Son Goku, Sporty style outfit, shoes, nike jacket, little boy, cartoon

Comparison Results

We compare AvatarStudio with other text-guided generation methods.

DreamFusion

Magic3D-Fine

DreamAvatar

DreamWaltz

Ours

Assassin's Creed

A standing Captain Jack Sparrow from Pirates of the Caribbean

DreamFusion

Magic3D-Fine

DreamHuman

AvatarVerse

Ours

A man wearing a bomber jacket

A karate master wearing a black belt


Multimodal Avatar Animation

AvatarStudio provides high-quality and easy-to-use animation, allowing users to drive the generated avatars with multimodal signals, such as text or video.

Text-driven animation. We adopt MDM to convert text prompts, like "A person is punching a bag", into SMPL sequences for animation.

Video-driven animation. We use VIBE to estimate SMPL sequences from driving videos for animation.
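Both driving modes above produce SMPL pose sequences, which then deform the avatar mesh. The sketch below illustrates the underlying idea with plain linear blend skinning in NumPy. It is a simplified illustration, not the authors' code: real SMPL also applies shape and pose-corrective blend shapes, and the function names (`rodrigues`, `joint_transforms`, `skin`) are assumptions made for this example. Joints are assumed to be ordered so that each parent index precedes its children, with `-1` marking the root.

```python
import numpy as np


def rodrigues(axis_angle):
    """Convert an axis-angle vector (3,) to a rotation matrix (3, 3)."""
    angle = np.linalg.norm(axis_angle)
    if angle < 1e-8:
        return np.eye(3)
    k = axis_angle / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


def joint_transforms(pose, joints, parents):
    """Compose per-joint rotations along the kinematic tree.

    pose:    (J, 3) axis-angle rotation per joint
    joints:  (J, 3) rest-pose joint locations
    parents: length-J list of parent indices, -1 for the root
    Returns (J, 4, 4) transforms mapping rest-pose points to posed points.
    """
    num_joints = len(parents)
    world = np.zeros((num_joints, 4, 4))
    for j in range(num_joints):
        local = np.eye(4)
        local[:3, :3] = rodrigues(pose[j])
        # Offset from the parent joint (absolute position for the root).
        offset = joints[j] - joints[parents[j]] if parents[j] >= 0 else joints[j]
        local[:3, 3] = offset
        world[j] = local if parents[j] < 0 else world[parents[j]] @ local
    # Subtract the rest-pose joint location so transforms act on rest-pose vertices.
    rest_inv = np.tile(np.eye(4), (num_joints, 1, 1))
    rest_inv[:, :3, 3] = -joints
    return world @ rest_inv


def skin(vertices, weights, transforms):
    """Linear blend skinning: each vertex is moved by a weighted mix of joint transforms."""
    homo = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)
    blended = np.einsum("vj,jab->vab", weights, transforms)
    return np.einsum("vab,vb->va", blended, homo)[:, :3]
```

To animate, one would call `joint_transforms` and `skin` once per frame of the pose sequence, reusing the same rest-pose vertices and skinning weights.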

More animation results

AvatarStudio can, to a certain extent, produce plausible animation results for avatars in loose clothing, such as ballet costumes, skirts, or dresses.

A ballerina
A woman wearing a short jean skirt and a cropped top
A pregnant person of color

Stylized avatar creation

AvatarStudio supports stylized avatar creation by simply providing an additional style image.

Style image
A chef
A karate master
A girl wearing skirt
Style image
Kratos
A karate master
Gintoki
Style image
A girl wearing dress
A karate master
A girl wearing skirt
Style image
A chef
A karate master
A ninja

Comparison between different SDS guidances

We compare avatar results generated using different SDS guidances, i.e., Stable Diffusion (left), Skeleton-based ControlNet (middle) and DensePose-conditional ControlNet (right) guidances.

Stable Diffusion
Skeleton-based ControlNet
DensePose-based ControlNet
Albus Dumbledore
Wolverine, Marvel Character
Zeus

Comparison between different 3D representations

We compare avatar results generated using different 3D representations: DMTet only (left), NeRF only (middle), and ours (right). The NeRF-only representation achieves reasonable results but still suffers from coarse faces and noisy geometry with floating artifacts.

DMTet
NeRF
Ours
Albert Einstein
Harry Potter

Citation

@article{anon2024avatarstudio,
  author = {Anonymous},
  title = {AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text},
  journal = {ECCV},
  year = {2024},
}