InternVideo3 enhances long-horizon multimodal tasks through Multimodal Contextual Reasoning and efficient attention mechanisms
AI & ML interests
Computer Vision
Papers
Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction
RIVER: A Real-Time Interaction Benchmark for Video LLMs
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
InternVL3.5-Flash is a fast variant of InternVL3.5 that uses a semantic-aware dynamic high-resolution strategy.
- ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution
  Paper • 2510.12793 • Published • 5
- OpenGVLab/InternVL3_5-241B-A28B-Flash
  Image-Text-to-Text • 242B • Updated • 186 • 7
- OpenGVLab/InternVL3_5-38B-Flash
  Image-Text-to-Text • 40B • Updated • 592 • 8
- OpenGVLab/InternVL3_5-30B-A3B-Flash
  Image-Text-to-Text • 31B • Updated • 107 • 7
This collection includes all released checkpoints of InternVL3.5, covering different training stages (e.g., Pretraining, SFT, MPO, Cascade RL).
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  Paper • 2508.18265 • Published • 224
- OpenGVLab/InternVL3_5-241B-A28B
  Image-Text-to-Text • 241B • Updated • 1.05k • 139
- OpenGVLab/InternVL3_5-38B
  Image-Text-to-Text • 38B • Updated • 12.9k • 44
- OpenGVLab/InternVL3_5-30B-A3B
  Image-Text-to-Text • 31B • Updated • 94.7k • 43
Sequential Diffusion Language Models
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
[NeurIPS 2024 Spotlight (Ranking Top 10), TPAMI 2025] Parameter-Inverted Image Pyramid Networks
- Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
  Paper • 2501.07783 • Published • 8
- OpenGVLab/PIIP
  Object Detection • Updated • 5
- OpenGVLab/PIIP-LLaVA_CLIP-BL_512-256_7B
  Image-Text-to-Text • 7B • Updated • 11
- OpenGVLab/PIIP-LLaVA_ConvNeXt-B_CLIP-L_640-224_7B
  Image-Text-to-Text • 7B • Updated • 11
- OpenGVLab/InternVideo2_5_Chat_8B
  Video-Text-to-Text • 8B • Updated • 6.04k • 91
- OpenGVLab/InternVL_2_5_HiCo_R16
  Video-Text-to-Text • 8B • Updated • 208 • 6
- OpenGVLab/InternVL_2_5_HiCo_R64
  Video-Text-to-Text • 8B • Updated • 66 • 4
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
  Paper • 2501.12386 • Published • 1
Faster and more powerful VideoChat.
- OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448
  Video-Text-to-Text • 2B • Updated • 382 • 27
- OpenGVLab/VideoChat-Flash-Qwen2-7B_res224
  Video-Text-to-Text • 8B • Updated • 101 • 7
- OpenGVLab/VideoChat-Flash-Qwen2-7B_res448
  Video-Text-to-Text • 8B • Updated • 703 • 13
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
  Paper • 2501.00574 • Published • 7
Enhancing the Reasoning Ability of MLLMs via Mixed Preference Optimization
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
  Paper • 2411.10442 • Published • 87
- OpenGVLab/InternVL2_5-78B-MPO
  Image-Text-to-Text • 78B • Updated • 36 • 54
- OpenGVLab/InternVL2_5-38B-MPO
  Image-Text-to-Text • 38B • Updated • 41 • 20
- OpenGVLab/InternVL2_5-26B-MPO
  Image-Text-to-Text • 26B • Updated • 60 • 14
A Pioneering Open-Source Alternative to GPT-4V
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
  Paper • 2404.16821 • Published • 60
- OpenGVLab/InternVL-Chat-V1-5
  Image-Text-to-Text • 26B • Updated • 6.76k • 417
- OpenGVLab/InternViT-6B-448px-V1-5
  Image Feature Extraction • 6B • Updated • 413 • 76
- OpenGVLab/InternViT-300M-448px
  Image Feature Extraction • 0.3B • Updated • 2.85k • 62
Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
InternVideo2
- InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
  Paper • 2403.15377 • Published • 29
- OpenGVLab/InternVideo2-Chat-8B
  Video-Text-to-Text • 8B • Updated • 225 • 26
- OpenGVLab/InternVideo2_chat_8B_HD
  Video-Text-to-Text • 8B • Updated • 57 • 18
- OpenGVLab/InternVideo2_Chat_8B_InternLM2_5
  Video-Text-to-Text • 9B • Updated • 67 • 8
State Space Model for Efficient Video Understanding
A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
- InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
  Paper • 2211.05778 • Published
- OpenGVLab/internimage_t_1k_224
  Image Classification • 29.9M • Updated • 122 • 2
- OpenGVLab/internimage_s_1k_224
  Image Classification • 50.1M • Updated • 70 • 1
- OpenGVLab/internimage_b_1k_224
  Image Classification • 97.5M • Updated • 394 • 1
InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
- InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
  Paper • 2512.01342 • Published • 21
- revliter/internvideo_next_base_p14_res224_f16
  91M • Updated • 625 • 5
- revliter/internvideo_next_large_p14_res224_f16
  0.3B • Updated • 6.76k • 7
- revliter/internvideo_next_large_p14_res224_f16_stage1
  Updated • 11 • 2
[NeurIPS 2025] Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
This collection includes only the InternVL3.5 checkpoints that have completed the full training pipeline (i.e., Pretraining, SFT, MPO, Cascade RL).
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  Paper • 2508.18265 • Published • 224
- OpenGVLab/InternVL3_5-241B-A28B-HF
  Image-Text-to-Text • 241B • Updated • 71 • 11
- OpenGVLab/InternVL3_5-38B-HF
  Image-Text-to-Text • 38B • Updated • 2.05k • 6
- OpenGVLab/InternVL3_5-30B-A3B-HF
  Image-Text-to-Text • 31B • Updated • 592 • 6
[CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  Paper • 2504.10479 • Published • 311
- OpenGVLab/InternVL3-1B
  Image-Text-to-Text • 0.9B • Updated • 232k • 85
- OpenGVLab/InternVL3-2B
  Image-Text-to-Text • 2B • Updated • 56k • 46
- OpenGVLab/InternVL3-8B
  Image-Text-to-Text • 8B • Updated • 70.2k • 106
A Pioneering Monolithic MLLM
- Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
  Paper • 2410.08202 • Published • 6
- Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
  Paper • 2507.12566 • Published • 16
- OpenGVLab/Mono-InternVL-2B
  Image-Text-to-Text • 3B • Updated • 1.17k • 38
- OpenGVLab/Mono-InternVL-2B-S1-1
  Image-Text-to-Text • 3B • Updated • 9
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
- VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
  Paper • 2303.16727 • Published • 1
- OpenGVLab/VideoMAEv2-Base
  Video Classification • 86.2M • Updated • 11.7k • 17
- OpenGVLab/VideoMAEv2-Large
  Video Classification • 0.3B • Updated • 2.12k • 2
- OpenGVLab/VideoMAEv2-Huge
  Video Classification • 0.6B • Updated • 1.09k • 1
Better than InternVL 2.0
- InternVL
  ⚡ 513 • Chat with an AI that understands images and text
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  Paper • 2412.05271 • Published • 162
- OpenGVLab/InternVL2_5-78B
  Image-Text-to-Text • 78B • Updated • 273 • 193
- OpenGVLab/InternVL2_5-78B-AWQ
  Image-Text-to-Text • Updated • 533 • 14
Expanding Performance Boundaries of Open-Source MLLM
Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  Paper • 2312.14238 • Published • 21
- OpenGVLab/InternViT-6B-224px
  Image Feature Extraction • Updated • 106 • 24
- OpenGVLab/InternVL-14B-224px
  Image Feature Extraction • 14B • Updated • 1.76k • 35
- OpenGVLab/InternVL-Chat-V1-2-Plus
  Image-Text-to-Text • 40B • Updated • 41 • 34
Adaptation Models for Specific Domains
- OpenGVLab/Mini-InternVL2-4B-DA-DriveLM
  Image-Text-to-Text • 4B • Updated • 136 • 5
- OpenGVLab/Mini-InternVL2-4B-DA-Medical
  Image-Text-to-Text • 4B • Updated • 22 • 7
- OpenGVLab/Mini-InternVL2-4B-DA-BDD
  Image-Text-to-Text • 4B • Updated • 93
- OpenGVLab/Mini-InternVL2-2B-DA-DriveLM
  Image-Text-to-Text • 2B • Updated • 70 • 2
Chat-Centric Video Understanding
A Large-Scale Video-Text Dataset
Improved Baselines with Pyramid Vision Transformer