H-EmbodVis/VEGA-3D-Spatial-Reasoning
Video-Text-to-Text • 10B • Updated • 21 • 1
HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models