---
license: cc-by-nc-sa-4.0
language:
- en
tags:
- video-generation
- vision-language-navigation
- embodied-ai
- pytorch
---

# SparseVideoNav: Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation
|
|
| ## Model Details |
|
|
| ### Model Description |
|
|
| SparseVideoNav introduces video generation models to real-world beyond-the-view vision-language navigation for the first time. It pioneers a paradigm shift from continuous to sparse video generation for longer prediction horizons. By guiding trajectory inference with a generated sparse future spanning a 20-second horizon, it achieves sub-second inference (a 27× speed-up). It also marks the first realization of beyond-the-view navigation in challenging night scenes. |
|
|
| - **Developed by:** Hai Zhang, Siqi Liang, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang, Hongyang Li |
| - **Shared by:** The University of Hong Kong & OpenDriveLab |
| - **Model type:** Video Generation-based Model for Vision-Language Navigation |
| - **Language(s) (NLP):** English (Instruction prompts) |
| - **License:** CC BY-NC-SA 4.0 |
- **Finetuned from model:** Built on the UMT5-XXL text encoder and the Wan2.1 VAE.
|
|
| ### Model Sources |
|
|
| - **Repository:** [https://github.com/OpenDriveLab/SparseVideoNav](https://github.com/OpenDriveLab/SparseVideoNav) |
| - **Paper:** [arXiv:2602.05827](https://arxiv.org/abs/2602.05827) |
| - **Project Page:** [https://opendrivelab.com/SparseVideoNav](https://opendrivelab.com/SparseVideoNav) |
|
|
| ## Uses |
|
|
| ### Direct Use |
|
|
| The model is designed for generating sparse future video frames based on a current visual observation (video) and a natural language instruction (e.g., "turn right"). It is primarily intended for research in Embodied AI, specifically Vision-Language Navigation (VLN) in real-world environments. |
|
|
| ### Out-of-Scope Use |
|
|
| The model is a research prototype and is not intended for deployment in safety-critical real-world autonomous driving or robotic navigation systems without further extensive testing, safety validation, and fallback mechanisms. |
|
|
| ## How to Get Started with the Model |
|
|
First, clone the [GitHub repository](https://github.com/OpenDriveLab/SparseVideoNav) and install its requirements. Then use the code below to run the model through our custom pipeline.
|
|
| ```python |
| from omegaconf import OmegaConf |
| from inference import SVNPipeline |
| |
| # Load configuration |
| cfg = OmegaConf.load("config/inference.yaml") |
| cfg.ckpt_path = "/path/to/models/SparseVideoNav-Models" # Path to your downloaded weights |
| cfg.inference.device = "cuda:0" |
| |
| # Initialize pipeline |
| pipeline = SVNPipeline.from_pretrained(cfg) |
| |
| # Run inference (Returns np.ndarray (T, H, W, C) uint8) |
| video = pipeline(video="/path/to/input.mp4", text="turn right") |
| ``` |
|
|
## BibTeX
```bibtex
@article{zhang2026sparse,
  title={Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation},
  author={Zhang, Hai and Liang, Siqi and Chen, Li and Li, Yuxian and Xu, Yukuan and Zhong, Yichao and Zhang, Fu and Li, Hongyang},
  journal={arXiv preprint arXiv:2602.05827},
  year={2026}
}
```