This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

Pipeline Type: HeliosAutoBlocks

Description: Auto Modular pipeline for text-to-video, image-to-video, and video-to-video tasks using Helios.

This pipeline uses a 4-block architecture that can be customized and extended.

Example Usage

[TODO]
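Until the official example is filled in, the sketch below shows how a modular Diffusers pipeline is typically loaded and run. The repository id is a placeholder, and the exact `ModularPipeline` API surface (`from_pretrained`, `load_components`, the `output=` selector) may differ across diffusers versions — treat this as a sketch, not the confirmed interface.

```python
def run_helios_text_to_video():
    """Hypothetical text-to-video call; the repo id, dtype, and device
    are placeholders, and the API details are assumptions."""
    import torch
    from diffusers import ModularPipeline  # modular pipeline entry point

    # Placeholder repository id -- replace with the actual Helios checkpoint.
    pipe = ModularPipeline.from_pretrained("<repo-id>", trust_remote_code=True)
    pipe.load_components(torch_dtype=torch.bfloat16)
    pipe.to("cuda")

    # Runs the full text_encoder -> denoise -> decode chain; "videos" is the
    # output name declared in the Input/Output Specification below.
    videos = pipe(
        prompt="A sailboat crossing a calm bay at sunset",
        num_frames=132,
        num_inference_steps=50,
        output="videos",
    )
    return videos
```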

Pipeline Architecture

This modular pipeline is composed of the following blocks:

  1. text_encoder (HeliosTextEncoderStep)
    • Text Encoder step that generates text embeddings to guide the video generation
  2. vae_encoder (HeliosAutoVaeEncoderStep)
    • Encoder step that encodes video or image inputs. This is an auto pipeline block.
    • video_encoder: HeliosVideoVaeEncoderStep
      • Video Encoder step that encodes an input video into VAE latent space, producing image_latents (first frame) and video_latents (chunked video frames) for video-to-video generation.
    • image_encoder: HeliosImageVaeEncoderStep
      • Image Encoder step that encodes an input image into VAE latent space, producing image_latents (first frame prefix) and fake_image_latents (history seed) for image-to-video generation.
  3. denoise (HeliosAutoCoreDenoiseStep)
    • Core denoise step that selects the appropriate denoising block.
    • video2video: HeliosV2VCoreDenoiseStep
      • V2V denoise block that seeds history with video latents and uses I2V-aware chunk preparation.
    • image2video: HeliosI2VCoreDenoiseStep
      • I2V denoise block that seeds history with image latents and uses I2V-aware chunk preparation.
    • text2video: HeliosCoreDenoiseStep
      • Denoise block that takes encoded conditions and runs the chunk-based denoising process.
  4. decode (HeliosDecodeStep)
    • Decodes all chunk latents with the VAE, concatenates them, trims to the target frame count, and postprocesses into the final video output.
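The auto blocks in steps 2 and 3 dispatch on which conditioning inputs are present. A minimal, framework-free sketch of that selection logic (block names are taken from the list above; the dispatch rule itself is an assumption inferred from the block descriptions):

```python
def select_denoise_branch(video=None, image=None):
    """Pick the denoise sub-block the way the auto blocks are described
    above: a video input routes to V2V, an image input to I2V,
    otherwise plain T2V."""
    if video is not None:
        return "HeliosV2VCoreDenoiseStep"
    if image is not None:
        return "HeliosI2VCoreDenoiseStep"
    return "HeliosCoreDenoiseStep"

# The four top-level blocks always run in this order; only the branch
# inside the denoise step changes.
PIPELINE_ORDER = ["text_encoder", "vae_encoder", "denoise", "decode"]
```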

Model Components

  1. text_encoder (UMT5EncoderModel)
  2. tokenizer (AutoTokenizer)
  3. guider (ClassifierFreeGuidance)
  4. vae (AutoencoderKLWan)
  5. video_processor (VideoProcessor)
  6. transformer (HeliosTransformer3DModel)
  7. scheduler (HeliosScheduler)
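The guider component applies classifier-free guidance to the transformer's predictions. In scalar form this is the standard CFG update (the guidance-scale value below is illustrative, not a documented default of this pipeline):

```python
def apply_cfg(noise_pred_uncond, noise_pred_cond, guidance_scale=5.0):
    """Standard classifier-free guidance: push the conditional prediction
    away from the unconditional one by `guidance_scale`."""
    return noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)
```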

Input/Output Specification

Inputs Required:

  • prompt (str): The prompt or prompts to guide video generation.
  • history_sizes (list): Sizes of long/mid/short history buffers for temporal context.
  • sigmas (list): Custom sigmas for the denoising process.
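`history_sizes` configures three rolling buffers of temporal context. A simplified sketch of how long/mid/short histories might be maintained as chunks are generated (the buffer semantics and the example sizes are assumptions; the real blocks operate on latent tensors, not Python objects):

```python
from collections import deque

def make_history_buffers(history_sizes):
    """One bounded FIFO per history level, e.g. history_sizes=[16, 8, 4]
    for long/mid/short temporal context (sizes here are illustrative)."""
    long_n, mid_n, short_n = history_sizes
    return {
        "long": deque(maxlen=long_n),
        "mid": deque(maxlen=mid_n),
        "short": deque(maxlen=short_n),
    }

def push_chunk(buffers, chunk_latent):
    # Every finished chunk enters all three buffers; the shorter ones
    # simply forget it sooner.
    for buf in buffers.values():
        buf.append(chunk_latent)
```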

Optional:

  • negative_prompt (str): The prompt or prompts not to guide the video generation.
  • max_sequence_length (int), default: 512: Maximum sequence length for prompt encoding.
  • video (Any): Input video for video-to-video generation
  • height (int), default: 384: The height in pixels of the generated video frames.
  • width (int), default: 640: The width in pixels of the generated video frames.
  • num_latent_frames_per_chunk (int), default: 9: Number of latent frames per temporal chunk.
  • generator (Generator): Torch generator for deterministic generation.
  • image (PIL.Image.Image | list[PIL.Image.Image]): Reference image(s) for denoising. Can be a single image or list of images.
  • num_videos_per_prompt (int), default: 1: Number of videos to generate per prompt.
  • image_latents (Tensor): Image latents used to guide video generation. Can be generated by the vae_encoder step.
  • video_latents (Tensor): Encoded video latents for V2V generation.
  • image_noise_sigma_min (float), default: 0.111: Minimum sigma for image latent noise.
  • image_noise_sigma_max (float), default: 0.135: Maximum sigma for image latent noise.
  • video_noise_sigma_min (float), default: 0.111: Minimum sigma for video latent noise.
  • video_noise_sigma_max (float), default: 0.135: Maximum sigma for video latent noise.
  • num_frames (int), default: 132: Total number of video frames to generate.
  • keep_first_frame (bool), default: True: Whether to keep the first frame as a prefix in history.
  • num_inference_steps (int), default: 50: The number of denoising steps.
  • latents (Tensor): Pre-generated noisy latents for video generation.
  • timesteps (Tensor): Timesteps for the denoising process.
  • denoiser conditioning inputs (Any): Conditional model inputs consumed by the denoiser, e.g. prompt_embeds, negative_prompt_embeds.
  • attention_kwargs (dict): Additional kwargs for attention processors.
  • fake_image_latents (Tensor): Fake image latents used as history seed for I2V generation.
  • output_type (str), default: 'np': Output format; one of 'pil', 'np', or 'pt'.
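The chunked inputs above relate to each other roughly as follows: pixel frames are mapped to latent frames by the VAE, latent frames are grouped into chunks of `num_latent_frames_per_chunk`, and the decode step trims the concatenated result back to `num_frames`. The temporal compression factor of 4 in this sketch is an assumption (typical for Wan-style video VAEs), as is the exact rounding rule:

```python
import math

def plan_chunks(num_frames=132, num_latent_frames_per_chunk=9,
                temporal_compression=4):
    """Map pixel frames to latent frames and chunks. Assumes the first
    frame is encoded on its own and the rest in groups of
    `temporal_compression` (an assumption, not a documented formula)."""
    num_latent_frames = 1 + math.ceil((num_frames - 1) / temporal_compression)
    num_chunks = math.ceil(num_latent_frames / num_latent_frames_per_chunk)
    return num_latent_frames, num_chunks

def trim_decoded(frames, num_frames):
    # The decode step decodes every chunk, concatenates, then trims the
    # tail so exactly `num_frames` frames remain.
    return frames[:num_frames]
```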

Outputs:

  • videos (list): The generated videos.
