This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

Pipeline Type: HeliosAutoBlocks

Description: Auto Modular pipeline for text-to-video, image-to-video, and video-to-video tasks using Helios.

This pipeline uses a 4-block architecture that can be customized and extended.

Example Usage

[TODO]
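Until the official example is filled in, the sketch below shows how a modular Diffusers pipeline is typically loaded and run. The repository id is a placeholder, and the exact `ModularPipeline` API surface (`from_pretrained`, `load_components`, the `output=` selector) may differ across diffusers versions — treat this as a sketch, not the confirmed interface.

```python
def run_helios_text_to_video():
    """Hypothetical text-to-video call; the repo id, dtype, and device
    are placeholders, and the API details are assumptions."""
    import torch
    from diffusers import ModularPipeline  # modular pipeline entry point

    # Placeholder repository id -- replace with the actual Helios checkpoint.
    pipe = ModularPipeline.from_pretrained("<repo-id>", trust_remote_code=True)
    pipe.load_components(torch_dtype=torch.bfloat16)
    pipe.to("cuda")

    # Runs the full text_encoder -> denoise -> decode chain; "videos" is the
    # output name declared in the Input/Output Specification below.
    videos = pipe(
        prompt="A sailboat crossing a calm bay at sunset",
        num_frames=132,
        num_inference_steps=50,
        output="videos",
    )
    return videos
```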

Pipeline Architecture

This modular pipeline is composed of the following blocks:

  1. text_encoder (HeliosTextEncoderStep)
    • Text Encoder step that generates text embeddings to guide the video generation
  2. vae_encoder (HeliosAutoVaeEncoderStep)
    • Encoder step that encodes video or image inputs. This is an auto pipeline block.
    • video_encoder: HeliosVideoVaeEncoderStep
      • Video Encoder step that encodes an input video into VAE latent space, producing image_latents (first frame) and video_latents (chunked video frames) for video-to-video generation.
    • image_encoder: HeliosImageVaeEncoderStep
      • Image Encoder step that encodes an input image into VAE latent space, producing image_latents (first frame prefix) and fake_image_latents (history seed) for image-to-video generation.
  3. denoise (HeliosAutoCoreDenoiseStep)
    • Core denoise step that selects the appropriate denoising block.
    • video2video: HeliosV2VCoreDenoiseStep
      • V2V denoise block that seeds history with video latents and uses I2V-aware chunk preparation.
    • image2video: HeliosI2VCoreDenoiseStep
      • I2V denoise block that seeds history with image latents and uses I2V-aware chunk preparation.
    • text2video: HeliosCoreDenoiseStep
      • Denoise block that takes encoded conditions and runs the chunk-based denoising process.
  4. decode (HeliosDecodeStep)
    • Decodes all chunk latents with the VAE, concatenates them, trims to the target frame count, and postprocesses into the final video output.
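The auto blocks in steps 2 and 3 dispatch on which conditioning inputs are present. A minimal, framework-free sketch of that selection logic (block names are taken from the list above; the dispatch rule itself is an assumption inferred from the block descriptions):

```python
def select_denoise_branch(video=None, image=None):
    """Pick the denoise sub-block the way the auto blocks are described
    above: a video input routes to V2V, an image input to I2V,
    otherwise plain T2V."""
    if video is not None:
        return "HeliosV2VCoreDenoiseStep"
    if image is not None:
        return "HeliosI2VCoreDenoiseStep"
    return "HeliosCoreDenoiseStep"

# The four top-level blocks always run in this order; only the branch
# inside the denoise step changes.
PIPELINE_ORDER = ["text_encoder", "vae_encoder", "denoise", "decode"]
```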

Model Components

  1. text_encoder (UMT5EncoderModel)
  2. tokenizer (AutoTokenizer)
  3. guider (ClassifierFreeGuidance)
  4. vae (AutoencoderKLWan)
  5. video_processor (VideoProcessor)
  6. transformer (HeliosTransformer3DModel)
  7. scheduler (HeliosScheduler)
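The guider component applies classifier-free guidance to the transformer's predictions. In scalar form this is the standard CFG update (the guidance-scale value below is illustrative, not a documented default of this pipeline):

```python
def apply_cfg(noise_pred_uncond, noise_pred_cond, guidance_scale=5.0):
    """Standard classifier-free guidance: push the conditional prediction
    away from the unconditional one by `guidance_scale`."""
    return noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)
```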

Input/Output Specification

Inputs Required:

  • prompt (str): The prompt or prompts to guide video generation.
  • history_sizes (list): Sizes of long/mid/short history buffers for temporal context.
  • sigmas (list): Custom sigmas for the denoising process.
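`history_sizes` configures three rolling buffers of temporal context. A simplified sketch of how long/mid/short histories might be maintained as chunks are generated (the buffer semantics and the example sizes are assumptions; the real blocks operate on latent tensors, not Python objects):

```python
from collections import deque

def make_history_buffers(history_sizes):
    """One bounded FIFO per history level, e.g. history_sizes=[16, 8, 4]
    for long/mid/short temporal context (sizes here are illustrative)."""
    long_n, mid_n, short_n = history_sizes
    return {
        "long": deque(maxlen=long_n),
        "mid": deque(maxlen=mid_n),
        "short": deque(maxlen=short_n),
    }

def push_chunk(buffers, chunk_latent):
    # Every finished chunk enters all three buffers; the shorter ones
    # simply forget it sooner.
    for buf in buffers.values():
        buf.append(chunk_latent)
```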

Optional:

  • negative_prompt (str): The prompt or prompts not to guide the video generation.
  • max_sequence_length (int), default: 512: Maximum sequence length for prompt encoding.
  • video (Any): Input video for video-to-video generation
  • height (int), default: 384: The height in pixels of the generated video frames.
  • width (int), default: 640: The width in pixels of the generated video frames.
  • num_latent_frames_per_chunk (int), default: 9: Number of latent frames per temporal chunk.
  • generator (Generator): Torch generator for deterministic generation.
  • image (PIL.Image.Image | list[PIL.Image.Image]): Reference image(s) for denoising. Can be a single image or list of images.
  • num_videos_per_prompt (int), default: 1: Number of videos to generate per prompt.
  • image_latents (Tensor): Image latents used to guide video generation. Can be generated by the vae_encoder step.
  • video_latents (Tensor): Encoded video latents for V2V generation.
  • image_noise_sigma_min (float), default: 0.111: Minimum sigma for image latent noise.
  • image_noise_sigma_max (float), default: 0.135: Maximum sigma for image latent noise.
  • video_noise_sigma_min (float), default: 0.111: Minimum sigma for video latent noise.
  • video_noise_sigma_max (float), default: 0.135: Maximum sigma for video latent noise.
  • num_frames (int), default: 132: Total number of video frames to generate.
  • keep_first_frame (bool), default: True: Whether to keep the first frame as a prefix in history.
  • num_inference_steps (int), default: 50: The number of denoising steps.
  • latents (Tensor): Pre-generated noisy latents for video generation.
  • timesteps (Tensor): Timesteps for the denoising process.
  • denoiser conditioning inputs (Any): Conditional model inputs consumed by the denoiser, e.g. prompt_embeds, negative_prompt_embeds.
  • attention_kwargs (dict): Additional kwargs for attention processors.
  • fake_image_latents (Tensor): Fake image latents used as history seed for I2V generation.
  • output_type (str), default: 'np': Output format; one of 'pil', 'np', or 'pt'.
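The chunked inputs above relate to each other roughly as follows: pixel frames are mapped to latent frames by the VAE, latent frames are grouped into chunks of `num_latent_frames_per_chunk`, and the decode step trims the concatenated result back to `num_frames`. The temporal compression factor of 4 in this sketch is an assumption (typical for Wan-style video VAEs), as is the exact rounding rule:

```python
import math

def plan_chunks(num_frames=132, num_latent_frames_per_chunk=9,
                temporal_compression=4):
    """Map pixel frames to latent frames and chunks. Assumes the first
    frame is encoded on its own and the rest in groups of
    `temporal_compression` (an assumption, not a documented formula)."""
    num_latent_frames = 1 + math.ceil((num_frames - 1) / temporal_compression)
    num_chunks = math.ceil(num_latent_frames / num_latent_frames_per_chunk)
    return num_latent_frames, num_chunks

def trim_decoded(frames, num_frames):
    # The decode step decodes every chunk, concatenates, then trims the
    # tail so exactly `num_frames` frames remain.
    return frames[:num_frames]
```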

Outputs:

  • videos (list): The generated videos.
