This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.
Pipeline Type: HeliosAutoBlocks
Description: Auto Modular pipeline for text-to-video, image-to-video, and video-to-video tasks using Helios.
This pipeline uses a 4-block architecture that can be customized and extended.
Example Usage
[TODO]
Pipeline Architecture
This modular pipeline is composed of the following blocks:
- text_encoder (HeliosTextEncoderStep): Text encoder step that generates text embeddings to guide the video generation.
- vae_encoder (HeliosAutoVaeEncoderStep): Encoder step that encodes video or image inputs. This is an auto pipeline block.
  - video_encoder: HeliosVideoVaeEncoderStep - Video encoder step that encodes an input video into VAE latent space, producing image_latents (first frame) and video_latents (chunked video frames) for video-to-video generation.
  - image_encoder: HeliosImageVaeEncoderStep - Image encoder step that encodes an input image into VAE latent space, producing image_latents (first frame prefix) and fake_image_latents (history seed) for image-to-video generation.
- denoise (HeliosAutoCoreDenoiseStep): Core denoise step that selects the appropriate denoising block.
  - video2video: HeliosV2VCoreDenoiseStep - V2V denoise block that seeds history with video latents and uses I2V-aware chunk preparation.
  - image2video: HeliosI2VCoreDenoiseStep - I2V denoise block that seeds history with image latents and uses I2V-aware chunk preparation.
  - text2video: HeliosCoreDenoiseStep - Denoise block that takes encoded conditions and runs the chunk-based denoising process.
- decode (HeliosDecodeStep): Decodes all chunk latents with the VAE, concatenates them, trims to the target frame count, and postprocesses into the final video output.
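The branch selection performed by the two auto blocks above can be pictured with a small standalone sketch. It does not call Diffusers. The trigger names and their priority are assumptions inferred from the order in which the sub-blocks are listed and from the input names in this card, and `select_branch` is a hypothetical helper, not part of the library.

```python
from typing import Any, Optional

# ASSUMPTION: each branch is selected by the presence of one trigger input,
# checked in the order the sub-blocks are listed in this card. A trigger of
# None marks the default branch that runs when nothing else matches.
VAE_ENCODER_BRANCHES = [
    ("video", "video_encoder"),
    ("image", "image_encoder"),
]
DENOISE_BRANCHES = [
    ("video_latents", "video2video"),
    ("image_latents", "image2video"),
    (None, "text2video"),
]


def select_branch(
    branches: list[tuple[Optional[str], str]], inputs: dict[str, Any]
) -> Optional[str]:
    """Return the name of the first branch whose trigger input is present.

    Hypothetical helper for illustration only. Returns None when no branch
    matches and there is no default, meaning the auto block is skipped.
    """
    for trigger, name in branches:
        if trigger is None or inputs.get(trigger) is not None:
            return name
    return None


# Text-to-video: no image or video is supplied, so the VAE encoder is skipped
# and the default denoise branch runs.
t2v_inputs = {"prompt": "a red fox running through snow"}
print(select_branch(VAE_ENCODER_BRANCHES, t2v_inputs))  # None
print(select_branch(DENOISE_BRANCHES, t2v_inputs))      # text2video

# Video-to-video: the video encoder produces both image_latents and
# video_latents, so video_latents has to be checked first.
v2v_inputs = {"prompt": "a fox", "image_latents": object(), "video_latents": object()}
print(select_branch(DENOISE_BRANCHES, v2v_inputs))      # video2video
```

Checking video_latents before image_latents matters in this sketch because the video encoder described above emits both, and only the V2V branch should run in that case.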
Model Components
- text_encoder (UMT5EncoderModel)
- tokenizer (AutoTokenizer)
- guider (ClassifierFreeGuidance)
- vae (AutoencoderKLWan)
- video_processor (VideoProcessor)
- transformer (HeliosTransformer3DModel)
- scheduler (HeliosScheduler)
Input/Output Specification
Inputs
Required:
- prompt (str): The prompt or prompts to guide video generation.
- history_sizes (list): Sizes of the long/mid/short history buffers for temporal context.
- sigmas (list): Custom sigmas for the denoising process.
Optional:
- negative_prompt (str): The prompt or prompts not to guide the video generation.
- max_sequence_length (int), default: 512: Maximum sequence length for prompt encoding.
- video (Any): Input video for video-to-video generation.
- height (int), default: 384: The height in pixels of the generated video.
- width (int), default: 640: The width in pixels of the generated video.
- num_latent_frames_per_chunk (int), default: 9: Number of latent frames per temporal chunk.
- generator (Generator): Torch generator for deterministic generation.
- image (PIL.Image.Image | list[PIL.Image.Image]): Reference image(s) for denoising. Can be a single image or a list of images.
- num_videos_per_prompt (int), default: 1: Number of videos to generate per prompt.
- image_latents (Tensor): Image latents used to guide the video generation. Can be generated by the vae_encoder step.
- video_latents (Tensor): Encoded video latents for V2V generation.
- image_noise_sigma_min (float), default: 0.111: Minimum sigma for image latent noise.
- image_noise_sigma_max (float), default: 0.135: Maximum sigma for image latent noise.
- video_noise_sigma_min (float), default: 0.111: Minimum sigma for video latent noise.
- video_noise_sigma_max (float), default: 0.135: Maximum sigma for video latent noise.
- num_frames (int), default: 132: Total number of video frames to generate.
- keep_first_frame (bool), default: True: Whether to keep the first frame as a prefix in history.
- num_inference_steps (int), default: 50: The number of denoising steps.
- latents (Tensor): Pre-generated noisy latents for video generation.
- timesteps (Tensor): Timesteps for the denoising process.
- None (Any): Conditional model inputs for the denoiser, e.g. prompt_embeds, negative_prompt_embeds, etc.
- attention_kwargs (dict): Additional kwargs for attention processors.
- fake_image_latents (Tensor): Fake image latents used as the history seed for I2V generation.
- output_type (str), default: np: Output format, one of 'pil', 'np', or 'pt'.
Outputs
- videos (list): The generated videos.
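To see how num_frames and num_latent_frames_per_chunk could relate to the chunked decode-and-trim behavior described for the decode step, here is a rough arithmetic sketch. It assumes a temporal VAE compression factor of 4 and that a chunk of n latent frames decodes to (n - 1) * 4 + 1 video frames; neither figure is stated in this card, and the function names are hypothetical.

```python
import math

# ASSUMPTION: the VAE compresses time by a factor of 4, so n latent frames
# decode to (n - 1) * 4 + 1 video frames. This card does not state the factor.
ASSUMED_TEMPORAL_COMPRESSION = 4


def frames_per_chunk(num_latent_frames_per_chunk: int = 9) -> int:
    """Video frames produced by decoding one chunk of latent frames."""
    return (num_latent_frames_per_chunk - 1) * ASSUMED_TEMPORAL_COMPRESSION + 1


def plan_chunks(num_frames: int = 132, num_latent_frames_per_chunk: int = 9) -> dict:
    """Estimate how many chunks are needed and how many frames get trimmed."""
    per_chunk = frames_per_chunk(num_latent_frames_per_chunk)
    chunks = math.ceil(num_frames / per_chunk)
    decoded = chunks * per_chunk
    return {
        "frames_per_chunk": per_chunk,
        "num_chunks": chunks,
        "decoded_frames": decoded,
        "trimmed_frames": decoded - num_frames,
    }


# With the defaults listed above: 33 frames per chunk, 4 chunks, nothing trimmed.
print(plan_chunks())
# Requesting 100 frames still needs 4 chunks; 32 decoded frames are trimmed.
print(plan_chunks(num_frames=100))
```

Under these assumptions the default num_frames of 132 is an exact multiple of the per-chunk frame count, which would explain why that default needs no trimming.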