---
license: apache-2.0
tags:
- text-to-motion
- motion-generation
- diffusion-forcing
- humanml3d
- computer-animation
library_name: transformers
pipeline_tag: other
---

# FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

<div align="center">

**A state-of-the-art text-to-motion generation model based on Latent Diffusion Forcing**

[Paper](https://arxiv.org/abs/2512.03520) | [GitHub](https://github.com/ShandaAI/FloodDiffusion) | [Project Page](https://shandaai.github.io/FloodDiffusion/)

</div>

## Overview

We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency.

## Model Architecture

The model consists of three main components:

1. **Text Encoder**: UMT5-XXL encoder for text feature extraction
2. **Latent Diffusion Model**: Transformer-based diffusion model operating in latent space
3. **VAE Decoder**: 1D convolutional VAE for decoding latent features to motion sequences

**Technical Specifications:**

- Input: Natural language text
- Output: Motion sequences in two formats:
  - 263-dimensional HumanML3D features (default)
  - 22×3 joint coordinates (optional, with EMA smoothing support)
- Latent dimension: 4
- Upsampling factor: 4× (VAE decoder)
- Frame rate: 20 FPS

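
The 4× upsampling factor and the 20 FPS frame rate together determine how the `length` argument (counted in latent tokens) maps to output frames and seconds. The helpers below are a minimal sketch of that arithmetic; the function names are illustrative and not part of the model's API, and the frame count is approximate in practice.

```python
FPS = 20       # output frame rate from the specifications above
UPSAMPLE = 4   # VAE decoder upsampling factor (frames per latent token)


def tokens_to_frames(length: int) -> int:
    """Approximate number of output frames for `length` latent tokens."""
    return length * UPSAMPLE


def tokens_to_seconds(length: int) -> float:
    """Approximate clip duration in seconds for `length` latent tokens."""
    return tokens_to_frames(length) / FPS


def seconds_to_tokens(seconds: float) -> int:
    """Nearest whole number of latent tokens for a target duration (at least 1)."""
    return max(1, round(seconds * FPS / UPSAMPLE))


print(tokens_to_frames(60), tokens_to_seconds(60), seconds_to_tokens(8))
```

For example, `length=60` corresponds to roughly 240 frames, or about 12 seconds of motion.
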
## Installation

### Prerequisites

- Python 3.8+
- CUDA-capable GPU with 16GB+ VRAM (recommended)
- 16GB+ system RAM

### Dependencies

**Step 1: Install basic dependencies**

```bash
pip install torch transformers huggingface_hub
pip install lightning diffusers omegaconf ftfy numpy
```

**Step 2: Install Flash Attention (Required)**

Flash Attention requires CUDA and may need to be compiled from source during installation:

```bash
pip install flash-attn --no-build-isolation
```

**Note:** Flash Attention is **required** for this model. If installation fails, please refer to the [official flash-attention installation guide](https://github.com/Dao-AILab/flash-attention#installation-and-features).

## Quick Start

### Basic Usage

```python
from transformers import AutoModel

# Load the model (trust_remote_code=True is required for the custom model code)
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True
)

# Generate motion from text (263-dim HumanML3D features)
motion = model("a person walking forward", length=60)
print(f"Generated motion: {motion.shape}")  # (~240, 263)

# Generate motion as joint coordinates (22 joints × 3 coords) with EMA smoothing (alpha in 0.0-1.0)
motion_joints = model("a person walking forward", length=60, output_joints=True, smoothing_alpha=0.5)
print(f"Generated joints: {motion_joints.shape}")  # (~240, 22, 3)
```

### Batch Generation

```python
# Generate multiple motions efficiently
texts = [
    "a person walking forward",
    "a person running quickly",
    "a person jumping up and down"
]
lengths = [60, 50, 40]  # Different lengths for each motion

motions = model(texts, length=lengths)

for i, motion in enumerate(motions):
    print(f"Motion {i}: {motion.shape}")
```

### Multi-Text Motion Transitions

```python
# Generate a motion sequence with smooth transitions between actions
motion = model(
    text=[["walk forward", "turn around", "run back"]],
    length=[120],
    text_end=[[40, 80, 120]]  # Transition points in latent tokens
)

# Output: ~480 frames showing all three actions smoothly connected
print(f"Transition motion: {motion[0].shape}")
```

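
Because `text_end` is expressed in latent tokens rather than seconds, it can be convenient to derive it from per-action durations. The sketch below does this under two assumptions: about 5 latent tokens per second (20 FPS divided by the 4× upsampling factor), and a final endpoint equal to the total `length`, as in the example above. `plan_transitions` is a hypothetical helper, not part of the model's API.

```python
from itertools import accumulate

FPS = 20       # frame rate from the model specifications
UPSAMPLE = 4   # frames per latent token


def plan_transitions(segments):
    """Turn (prompt, seconds) pairs into prompts, cumulative endpoints, and total length."""
    prompts = [prompt for prompt, _ in segments]
    # Duration of each segment in latent tokens, at least one token per prompt
    tokens = [max(1, round(seconds * FPS / UPSAMPLE)) for _, seconds in segments]
    # Each endpoint is the running total of tokens up to and including that segment
    text_end = list(accumulate(tokens))
    return prompts, text_end, text_end[-1]


prompts, text_end, total = plan_transitions(
    [("walk forward", 8.0), ("turn around", 8.0), ("run back", 8.0)]
)
print(prompts, text_end, total)

# Assumed usage, mirroring the call shown above:
# motion = model(text=[prompts], length=[total], text_end=[text_end])
```

With three 8-second segments this reproduces the endpoints `[40, 80, 120]` and the total length of 120 used in the example.
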
## API Reference

### `model(text, length=60, text_end=None, num_denoise_steps=None, output_joints=False, smoothing_alpha=1.0)`

Generate motion sequences from text descriptions.

**Parameters:**

- **text** (`str`, `List[str]`, or `List[List[str]]`): Text description(s)
  - Single string: Generate one motion
  - List of strings: Batch generation
  - Nested list: Multiple text prompts per motion (for transitions)

- **length** (`int` or `List[int]`, default=60): Number of latent tokens to generate
  - Output frames ≈ `length × 4` (due to VAE upsampling)
  - Example: `length=60` → ~240 frames (~12 seconds at 20 FPS)

- **text_end** (`List[int]` or `List[List[int]]`, optional): Latent token positions for text transitions
  - Only used when `text` is a nested list
  - Specifies the latent token position at which each text description ends and the next one takes over
  - **IMPORTANT**: Must have the same length as the corresponding text list
  - Example: `text=[["walk", "turn", "sit"]]` requires `text_end=[[20, 40, 60]]` (3 endpoints for 3 texts)
  - Must be in ascending order

- **num_denoise_steps** (`int`, optional): Number of denoising iterations
  - Higher values generally improve quality at the cost of slower generation
  - Recommended range: 10-50

- **output_joints** (`bool`, default=False): Output format selector
  - `False`: Returns 263-dimensional HumanML3D features
  - `True`: Returns 22×3 joint coordinates for direct visualization

- **smoothing_alpha** (`float`, default=1.0): EMA smoothing factor for joint positions (only used when `output_joints=True`)
  - `1.0`: No smoothing (default)
  - `0.5`: Medium smoothing (recommended for smoother animations)
  - `0.0`: Maximum smoothing
  - Range: 0.0 to 1.0

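
To give a feel for what `smoothing_alpha` controls, here is a minimal sketch of an exponential moving average (EMA) over frames. It assumes the common recurrence `y[t] = alpha * x[t] + (1 - alpha) * y[t-1]`, which is consistent with `1.0` meaning no smoothing; the model's actual implementation is not documented here and may differ. `ema_smooth` is an illustrative name, and each frame is represented as a flat list of floats to keep the example dependency-free.

```python
def ema_smooth(frames, alpha):
    """Exponential moving average over time (assumed convention, see above).

    frames: sequence of frames, each a flat list of coordinates
    alpha:  weight of the current frame, between 0.0 and 1.0
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0.0, 1.0]")
    smoothed = []
    previous = None
    for frame in frames:
        if previous is None:
            current = list(frame)  # the first frame passes through unchanged
        else:
            # Blend the new frame with the previous smoothed frame
            current = [alpha * x + (1.0 - alpha) * p for x, p in zip(frame, previous)]
        smoothed.append(current)
        previous = current
    return smoothed


jittery = [[0.0], [1.0], [0.0], [1.0]]  # one coordinate oscillating between frames
print(ema_smooth(jittery, 1.0))
print(ema_smooth(jittery, 0.5))
```

Under this assumed recurrence, `alpha=1.0` returns the input unchanged, lower values damp frame-to-frame jitter, and `alpha=0.0` would hold the first frame for the whole clip, so intermediate values such as `0.5` are the practical choice.
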
**Returns:**

- Single motion:
  - `output_joints=False`: `numpy.ndarray` of shape `(frames, 263)`
  - `output_joints=True`: `numpy.ndarray` of shape `(frames, 22, 3)`
- Batch: `List[numpy.ndarray]` with shapes as above

**Example:**

```python
# Single generation (263-dim features)
motion = model("walk forward", length=60)  # Returns an array of shape (~240, 263)

# Single generation (joint coordinates)
joints = model("walk forward", length=60, output_joints=True)  # Returns an array of shape (~240, 22, 3)

# Batch generation
motions = model(["walk", "run"], length=[60, 50])  # Returns list of 2 arrays

# Multi-text transitions
motion = model(
    [["walk", "turn"]],
    length=[60],
    text_end=[[30, 60]]
)  # Returns list with 1 array of shape (~240, 263)
```

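
The `text_end` rules listed under Parameters (one endpoint per prompt, ascending order) are easy to get wrong when prompts are assembled programmatically. The following is a small pre-flight check, sketched as a hypothetical helper; it is not part of the model's API. Whether repeated (equal) endpoints are accepted is not stated, so this sketch only rejects endpoints that decrease.

```python
def check_transition_args(text, text_end):
    """Raise ValueError if nested `text` and `text_end` break the documented rules."""
    if len(text) != len(text_end):
        raise ValueError("text and text_end must describe the same number of motions")
    for i, (prompts, ends) in enumerate(zip(text, text_end)):
        # Rule 1: one endpoint per prompt
        if len(prompts) != len(ends):
            raise ValueError(
                f"motion {i}: {len(prompts)} prompts but {len(ends)} endpoints"
            )
        # Rule 2: endpoints must not decrease
        if any(a > b for a, b in zip(ends, ends[1:])):
            raise ValueError(f"motion {i}: text_end must be in ascending order")


# Passes silently: 3 prompts, 3 ascending endpoints
check_transition_args([["walk", "turn", "sit"]], [[20, 40, 60]])
```

Calling the check before `model(...)` surfaces a mismatch as a clear error message rather than a failure inside generation.
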
## Update History

- **2025/12/8**: Added EMA smoothing option for joint positions during rendering

## Citation

If you use this model in your research, please cite:

```bibtex
@article{cai2025flooddiffusion,
  title={FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation},
  author={Yiyi Cai and Yuhan Wu and Kunhang Li and You Zhou and Bo Zheng and Haiyang Liu},
  journal={arXiv preprint arXiv:2512.03520},
  year={2025}
}
```

## Troubleshooting

### Common Issues

**ImportError with trust_remote_code:**

```python
# Solution: Add trust_remote_code=True
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True  # Required!
)
```

**Out of Memory:**

```python
# Solution: Generate shorter sequences
motion = model("walk", length=30)  # Shorter = less memory
```

**Slow first load:**

The first load downloads ~14GB of model files and may take 5-30 minutes depending on internet speed. Subsequent loads reuse the cached files and skip the download.

**Module import errors:**

Ensure all dependencies are installed:

```bash
pip install lightning diffusers omegaconf ftfy numpy
```
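
To see which dependency is actually missing before reinstalling everything, a quick check with the standard library can help. This is a minimal sketch: the module list mirrors the packages named in the Installation section (with `flash-attn` imported as `flash_attn`), and `missing_modules` is an illustrative helper rather than part of the model's code.

```python
from importlib.util import find_spec

# Import names for the packages listed in the Installation section
REQUIRED_MODULES = [
    "torch", "transformers", "huggingface_hub", "lightning",
    "diffusers", "omegaconf", "ftfy", "numpy", "flash_attn",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules()
print("Missing modules:", ", ".join(missing) if missing else "none")
```

Any name printed here points to the package to install; note that a module can be found and still fail at import time (for example, a Flash Attention build that does not match the installed CUDA version).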