Existing text-to-avatar methods either produce static avatars that cannot be animated or struggle to generate animatable avatars with high quality and precise pose control. To address these limitations, we propose AvatarStudio, a coarse-to-fine generative model that produces explicit textured 3D meshes for animatable human avatars. Specifically, AvatarStudio begins with a low-resolution NeRF-based representation for coarse generation, then incorporates SMPL-guided articulation into an explicit mesh representation to support avatar animation and high-resolution rendering. To ensure view consistency and pose controllability of the resulting avatars, we introduce a 2D diffusion model conditioned on DensePose for Score Distillation Sampling (SDS) supervision. By effectively leveraging the synergy between the articulated mesh representation and the DensePose-conditioned diffusion model, AvatarStudio creates high-quality, animation-ready avatars from text, significantly outperforming previous methods. It also enables various applications, e.g., multimodal avatar animation and style-guided avatar creation.
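The SDS supervision mentioned above optimizes the 3D representation so that its renderings score highly under a 2D diffusion prior. The sketch below illustrates only the generic SDS update rule (Poole et al., DreamFusion) on a toy per-pixel "rendering" with a hypothetical oracle denoiser standing in for the text- and DensePose-conditioned diffusion model; the noise schedule, learning rate, and `oracle_denoiser` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cosine noise schedule: alpha_bar(t) in (0, 1) for t in (0, 1).
def alpha_bar(t):
    return np.cos(0.5 * np.pi * t) ** 2

# Hypothetical stand-in for the diffusion prior: an oracle denoiser that
# pulls noisy renderings toward a fixed target image. A real setup would
# query a text-conditioned model (here, a DensePose-conditioned one);
# this stub only illustrates the shape of the SDS update.
def oracle_denoiser(x_t, t, target):
    ab = alpha_bar(t)
    return (x_t - np.sqrt(ab) * target) / np.sqrt(1.0 - ab)

def sds_grad(x, target, t, eps, w=1.0):
    """SDS gradient w.r.t. the rendering x:
    w(t) * (eps_hat - eps); the renderer Jacobian is identity here."""
    ab = alpha_bar(t)
    x_t = np.sqrt(ab) * x + np.sqrt(1.0 - ab) * eps  # forward diffusion
    eps_hat = oracle_denoiser(x_t, t, target)         # predicted noise
    return w * (eps_hat - eps)

# Optimize an 8x8 "rendering" toward the target with SDS updates.
target = rng.standard_normal((8, 8))
x = np.zeros((8, 8))
for step in range(200):
    t = rng.uniform(0.02, 0.98)          # random timestep each step
    eps = rng.standard_normal(x.shape)   # fresh Gaussian noise each step
    x -= 0.05 * sds_grad(x, target, t, eps)

print(np.linalg.norm(x - target))  # residual distance to the target
```

In the full method, `x` would be a differentiable rendering of the NeRF or articulated mesh, and the gradient would be backpropagated through the renderer to the 3D parameters.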
AvatarStudio generates high-quality avatars in a multi-view consistent way.
AvatarStudio shows promising results, faithfully aligning the generated avatars with detailed, complex text prompts.
We compare AvatarStudio with other text-guided generation methods.
DreamFusion
Magic3D-Fine
DreamAvatar
DreamWaltz
Ours
Assassin's Creed
A standing Captain Jack Sparrow from Pirates of the Caribbean
DreamFusion
Magic3D-Fine
DreamHuman
AvatarVerse
Ours
A man wearing a bomber jacket
A karate master wearing a black belt
AvatarStudio provides high-quality and easy-to-use animation, allowing users to drive the generated avatars with multimodal signals, such as text or video.
AvatarStudio achieves plausible animation results even on loose clothing to a certain extent, such as a person wearing a ballet costume, skirt, or dress.
AvatarStudio supports stylized avatar creation by simply providing an additional style image.
We compare avatars generated using different SDS guidance: Stable Diffusion (left), skeleton-conditioned ControlNet (middle), and DensePose-conditioned ControlNet (right).
We compare avatars generated using different 3D representations: DMTet-only (left), NeRF-only (middle), and ours (right). The NeRF-only representation, despite achieving reasonable results, still suffers from coarse faces and noisy geometry with floating artifacts.
@inproceedings{anon2024avatarstudio,
  author    = {Anonymous},
  title     = {AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text},
  booktitle = {ECCV},
  year      = {2024},
}