	---
	license: apache-2.0
	tags:
	- text-to-motion
	- motion-generation
	- diffusion-forcing
	- humanml3d
	- computer-animation
	library_name: transformers
	pipeline_tag: other
	---

	# FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

	<div align="center">

	A state-of-the-art text-to-motion generation model based on Latent Diffusion Forcing

	[Paper](https://arxiv.org/abs/2512.03520) | [GitHub](https://github.com/ShandaAI/FloodDiffusion) | [Project Page](https://shandaai.github.io/FloodDiffusion/)

	</div>

	## Overview

	We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency.

	## Model Architecture

	The model consists of three main components:

	1. Text Encoder: UMT5-XXL encoder for text feature extraction
	2. Latent Diffusion Model: Transformer-based diffusion model operating in latent space
	3. VAE Decoder: 1D convolutional VAE for decoding latent features to motion sequences

	Technical Specifications:
	- Input: Natural language text
	- Output: Motion sequences in one of two formats:
	  - 263-dimensional HumanML3D features (default)
	  - 22×3 joint coordinates (optional, with EMA smoothing support)
	- Latent dimension: 4
	- Upsampling factor: 4× (VAE decoder)
	- Frame rate: 20 FPS
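
The upsampling factor and frame rate above determine how a requested number of latent tokens maps to clip length. The following is a minimal sketch of that arithmetic, assuming the two factors apply exactly (the card itself describes frame counts as approximate). The helper names are hypothetical and not part of the model's API.

```python
import math

UPSAMPLING_FACTOR = 4  # VAE decoder upsampling, from the specifications above
FRAME_RATE = 20        # frames per second, from the specifications above


def tokens_to_frames(num_tokens: int) -> int:
    """Approximate number of output frames for a given number of latent tokens."""
    return num_tokens * UPSAMPLING_FACTOR


def tokens_to_seconds(num_tokens: int) -> float:
    """Approximate clip duration in seconds for a given number of latent tokens."""
    return tokens_to_frames(num_tokens) / FRAME_RATE


def seconds_to_tokens(seconds: float) -> int:
    """Smallest number of latent tokens that covers at least `seconds` of motion."""
    frames = math.ceil(seconds * FRAME_RATE)
    return math.ceil(frames / UPSAMPLING_FACTOR)
```

For example, `tokens_to_seconds(60)` evaluates to `12.0`, which corresponds to roughly 240 frames at 20 FPS.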

	## Installation

	### Prerequisites

	- Python 3.8+
	- CUDA-capable GPU with 16GB+ VRAM (recommended)
	- 16GB+ system RAM

	### Dependencies

	Step 1: Install basic dependencies

	```bash
	pip install torch transformers huggingface_hub
	pip install lightning diffusers omegaconf ftfy numpy
	```

	Step 2: Install Flash Attention (Required)

	Flash Attention requires CUDA and may need to be compiled from source during installation:

	```bash
	pip install flash-attn --no-build-isolation
	```

	Note: Flash attention is required for this model. If installation fails, please refer to the [official flash-attention installation guide](https://github.com/Dao-AILab/flash-attention#installation-and-features).
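
Before loading the model, it can help to confirm that the required packages are importable. The sketch below uses only the standard library. The module names are assumed to be the usual import names for the pip packages listed above (for example `flash_attn` for `flash-attn`), and `missing_modules` is a hypothetical helper, not something the model provides.

```python
import importlib.util

# Import names assumed to correspond to the pip packages listed above.
REQUIRED_MODULES = [
    "torch", "transformers", "huggingface_hub", "lightning",
    "diffusers", "omegaconf", "ftfy", "numpy", "flash_attn",
]


def missing_modules(names):
    """Return the names that cannot be located on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Report anything that still needs to be installed in this environment.
print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty list means every listed module was found. It does not verify that Flash Attention was built against the installed CUDA and PyTorch versions.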

	## Quick Start

	### Basic Usage

	```python
	from transformers import AutoModel

	# Load model
	model = AutoModel.from_pretrained(
	    "ShandaAI/FloodDiffusion",
	    trust_remote_code=True
	)

	# Generate motion from text (263-dim HumanML3D features)
	motion = model("a person walking forward", length=60)
	print(f"Generated motion: {motion.shape}") # (~240, 263)

	# Generate motion as joint coordinates (22 joints × 3 coords) with EMA smoothing (smoothing_alpha in the range 0.0-1.0)
	motion_joints = model("a person walking forward", length=60, output_joints=True, smoothing_alpha=0.5)
	print(f"Generated joints: {motion_joints.shape}") # (~240, 22, 3)
	```

	### Batch Generation

	```python
	# Generate multiple motions efficiently
	texts = [
	    "a person walking forward",
	    "a person running quickly",
	    "a person jumping up and down"
	]
	lengths = [60, 50, 40] # Different lengths for each motion

	motions = model(texts, length=lengths)

	for i, motion in enumerate(motions):
	    print(f"Motion {i}: {motion.shape}")
	```

	### Multi-Text Motion Transitions

	```python
	# Generate a motion sequence with smooth transitions between actions
	motion = model(
	    text=[["walk forward", "turn around", "run back"]],
	    length=[120],
	    text_end=[[40, 80, 120]]  # Transition points in latent tokens
	)

	# Output: ~480 frames showing all three actions smoothly connected
	print(f"Transition motion: {motion[0].shape}")
	```
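
Because `text_end` holds cumulative end positions rather than per-segment lengths, it can be convenient to derive it from the length of each segment. The sketch below does this with a running sum. `build_text_end` is a hypothetical helper, and setting the total `length` to the final end position is an assumption taken from the example above rather than a documented requirement.

```python
from itertools import accumulate


def build_text_end(segment_lengths):
    """Turn per-segment lengths (in latent tokens) into cumulative end positions."""
    if any(n <= 0 for n in segment_lengths):
        raise ValueError("every segment must span at least one latent token")
    return list(accumulate(segment_lengths))


prompts = ["walk forward", "turn around", "run back"]
text_end = build_text_end([40, 40, 40])  # [40, 80, 120]
total_length = text_end[-1]              # 120 latent tokens in total

# These values could then be passed as in the example above:
# model(text=[prompts], length=[total_length], text_end=[text_end])
```

Requiring positive segment lengths keeps the resulting positions strictly ascending, and the output always has one entry per prompt.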

	## API Reference

	### `model(text, length=60, text_end=None, num_denoise_steps=None, output_joints=False, smoothing_alpha=1.0)`

	Generate motion sequences from text descriptions.

	Parameters:

	- text (`str`, `List[str]`, or `List[List[str]]`): Text description(s)
	  - Single string: Generate one motion
	  - List of strings: Batch generation
	  - Nested list: Multiple text prompts per motion (for transitions)

	- length (`int` or `List[int]`, default=60): Number of latent tokens to generate
	  - Output frames ≈ `length × 4` (due to VAE upsampling)
	  - Example: `length=60` → ~240 frames (~12 seconds at 20 FPS)

	- text_end (`List[int]` or `List[List[int]]`, optional): Latent token positions for text transitions
	  - Only used when `text` is a nested list
	  - Specifies when to switch between different text descriptions
	  - IMPORTANT: Must have the same length as the corresponding text list
	  - Example: `text=[["walk", "turn", "sit"]]` requires `text_end=[[20, 40, 60]]` (3 endpoints for 3 texts)
	  - Must be in ascending order

	- num_denoise_steps (`int`, optional): Number of denoising iterations
	  - Higher values produce better quality but slower generation
	  - Recommended range: 10-50

	- output_joints (`bool`, default=False): Output format selector
	  - `False`: Returns 263-dimensional HumanML3D features
	  - `True`: Returns 22×3 joint coordinates for direct visualization

	- smoothing_alpha (`float`, default=1.0): EMA smoothing factor for joint positions (only used when `output_joints=True`)
	  - `1.0`: No smoothing (default)
	  - `0.5`: Medium smoothing (recommended for smoother animations)
	  - `0.0`: Maximum smoothing
	  - Range: 0.0 to 1.0
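
The effect of `smoothing_alpha` can be illustrated with a standard exponential moving average over frames. The sketch below assumes the common recurrence `s[t] = alpha * x[t] + (1 - alpha) * s[t-1]`; the model's actual smoothing code is not shown in this card and may differ. `ema_smooth` is a hypothetical name used for illustration only.

```python
import numpy as np


def ema_smooth(joints: np.ndarray, alpha: float) -> np.ndarray:
    """Apply an exponential moving average along the frame axis.

    joints: array of shape (frames, 22, 3); alpha: smoothing factor in [0, 1].
    Illustrative sketch only, not the model's own smoothing implementation.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0.0 and 1.0")
    smoothed = np.empty_like(joints, dtype=np.float64)
    if len(joints) == 0:
        return smoothed
    smoothed[0] = joints[0]  # the first frame has no history to blend with
    for t in range(1, len(joints)):
        # Blend the current frame with the previous smoothed frame.
        smoothed[t] = alpha * joints[t] + (1.0 - alpha) * smoothed[t - 1]
    return smoothed
```

Under this recurrence, `alpha=1.0` returns the input unchanged and `alpha=0.0` holds every frame at the first pose, which matches the "no smoothing" and "maximum smoothing" ends of the range described above.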

	Returns:
	- Single motion:
	  - `output_joints=False`: `numpy.ndarray` of shape `(frames, 263)`
	  - `output_joints=True`: `numpy.ndarray` of shape `(frames, 22, 3)`
	- Batch: `List[numpy.ndarray]` with shapes as above
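
Since the return type depends on whether a single prompt or a batch was passed, downstream code may want to normalize it before processing. The sketch below relies only on the shapes documented above. Both helper names are hypothetical and not part of the model's API.

```python
import numpy as np


def as_motion_list(result):
    """Wrap a single array in a list so single and batch outputs are handled alike."""
    return [result] if isinstance(result, np.ndarray) else list(result)


def motion_format(motion: np.ndarray) -> str:
    """Classify an output array by its documented shape."""
    if motion.ndim == 2 and motion.shape[1] == 263:
        return "humanml3d"  # (frames, 263) feature vectors
    if motion.ndim == 3 and motion.shape[1:] == (22, 3):
        return "joints"     # (frames, 22, 3) joint coordinates
    raise ValueError(f"unexpected motion shape: {motion.shape}")
```

A loop such as `for m in as_motion_list(result): ...` then works for both single and batch calls.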

	Example:
	```python
	# Single generation (263-dim features)
	motion = model("walk forward", length=60)  # Returns array of shape (~240, 263)

	# Single generation (joint coordinates)
	joints = model("walk forward", length=60, output_joints=True)  # Returns array of shape (~240, 22, 3)

	# Batch generation
	motions = model(["walk", "run"], length=[60, 50]) # Returns list of 2 arrays

	# Multi-text transitions
	motion = model(
	    [["walk", "turn"]],
	    length=[60],
	    text_end=[[30, 60]]
	)  # Returns list with 1 array of shape (~240, 263)
	```

	## Update History

	- 2025/12/8: Added EMA smoothing option for joint positions during rendering

	## Citation

	If you use this model in your research, please cite:

	```bibtex
	@article{cai2025flooddiffusion,
	  title={FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation},
	  author={Yiyi Cai and Yuhan Wu and Kunhang Li and You Zhou and Bo Zheng and Haiyang Liu},
	  journal={arXiv preprint arXiv:2512.03520},
	  year={2025}
	}
	```

	## Troubleshooting

	### Common Issues

	ImportError with trust_remote_code:
	```python
	# Solution: pass trust_remote_code=True when loading the model
	from transformers import AutoModel

	model = AutoModel.from_pretrained(
	    "ShandaAI/FloodDiffusion",
	    trust_remote_code=True  # Required!
	)
	```

	Out of Memory:
	```python
	# Solution: Generate shorter sequences
	motion = model("walk", length=30) # Shorter = less memory
	```

	Slow first load:
	The first load downloads ~14GB of model files and may take 5-30 minutes depending on internet speed. Subsequent loads reuse the cached files and skip the download, so they are much faster.

	Module import errors:
	Ensure all dependencies are installed:
	```bash
	pip install lightning diffusers omegaconf ftfy numpy
	```