OpenGVLab

community

https://github.com/opengvlab

AI & ML interests

Computer Vision

Recent Activity

shepnerd updated a collection about 13 hours ago

InternVideo3

Papers

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

RIVER: A Real-Time Interaction Benchmark for Video LLMs

OpenGVLab's collections (35)

InternVideo3

InternVideo3 enhances long-horizon multimodal tasks through Multimodal Contextual Reasoning and efficient attention mechanisms

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Paper • 2606.12195 • Published 15 days ago • 23
yanziang/InternVideo3_Dataset

Viewer • Updated 14 days ago • 380k • 187 • 2
yanziang/InternVideo3-8B-Instruct

Video-Text-to-Text • 9B • Updated 14 days ago • 537 • 6

Vlaser

Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

OpenGVLab/Vlaser-2B

2B • Updated Oct 11, 2025 • 11 • 1
OpenGVLab/Vlaser-8B

8B • Updated Oct 11, 2025 • 604 • 2
OpenGVLab/Vlaser-2B-VLA

Updated Oct 11, 2025 • 3
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

Paper • 2510.11027 • Published Oct 13, 2025 • 23

InternVL3.5-Flash

InternVL3.5-Flash is a fast variant of InternVL3.5 that uses a semantic-aware dynamic high-resolution strategy.

ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution

Paper • 2510.12793 • Published Oct 14, 2025 • 5
OpenGVLab/InternVL3_5-241B-A28B-Flash

Image-Text-to-Text • 242B • Updated Sep 28, 2025 • 186 • 7
OpenGVLab/InternVL3_5-38B-Flash

Image-Text-to-Text • 40B • Updated Sep 28, 2025 • 592 • 8
OpenGVLab/InternVL3_5-30B-A3B-Flash

Image-Text-to-Text • 31B • Updated Sep 28, 2025 • 107 • 7

InternVL3.5

This collection includes all released checkpoints of InternVL3.5, covering different training stages (e.g., Pretraining, SFT, MPO, Cascade RL).

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Paper • 2508.18265 • Published Aug 25, 2025 • 224
OpenGVLab/InternVL3_5-241B-A28B

Image-Text-to-Text • 241B • Updated Aug 29, 2025 • 1.05k • 139
OpenGVLab/InternVL3_5-38B

Image-Text-to-Text • 38B • Updated Aug 29, 2025 • 12.9k • 44
OpenGVLab/InternVL3_5-30B-A3B

Image-Text-to-Text • 31B • Updated Aug 29, 2025 • 94.7k • 43

SDLM

Sequential Diffusion Language Models

Sequential Diffusion Language Models

Paper • 2509.24007 • Published Sep 28, 2025 • 47
OpenGVLab/SDLM-32B-D4

Text Generation • 33B • Updated Dec 27, 2025 • 14 • 18
OpenGVLab/SDLM-3B-D4

Text Generation • 3B • Updated Dec 27, 2025 • 26 • 7
OpenGVLab/SDLM-3B-D8

Text Generation • 3B • Updated Dec 27, 2025 • 25 • 3

ZeroGUI

ZeroGUI: Automating Online GUI Learning at Zero Human Cost

OpenGVLab/ZeroGUI-AndroidLab-7B

Image-Text-to-Text • 8B • Updated May 30, 2025 • 15 • 5
OpenGVLab/ZeroGUI-OSWorld-7B

Image-Text-to-Text • 8B • Updated Jun 20, 2025 • 14 • 7
ZeroGUI: Automating Online GUI Learning at Zero Human Cost

Paper • 2505.23762 • Published May 29, 2025 • 45

VisualPRM

VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

Paper • 2503.10291 • Published Mar 13, 2025 • 36
OpenGVLab/VisualPRM-8B

Image-Text-to-Text • 8B • Updated May 6, 2025 • 107 • 18
OpenGVLab/VisualPRM-8B-v1_1

Image-Text-to-Text • 8B • Updated May 29, 2025 • 42 • 9
OpenGVLab/VisualPRM400K

Preview • Updated Apr 15, 2025 • 160 • 15

PIIP

[NeurIPS 2024 Spotlight (Ranking Top 10), TPAMI 2025] Parameter-Inverted Image Pyramid Networks

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

Paper • 2501.07783 • Published Jan 14, 2025 • 8
OpenGVLab/PIIP

Object Detection • Updated Apr 16, 2025 • 5
OpenGVLab/PIIP-LLaVA_CLIP-BL_512-256_7B

Image-Text-to-Text • 7B • Updated Apr 20, 2025 • 11
OpenGVLab/PIIP-LLaVA_ConvNeXt-B_CLIP-L_640-224_7B

Image-Text-to-Text • 7B • Updated Apr 20, 2025 • 11

InternVideo2.5

OpenGVLab/InternVideo2_5_Chat_8B

Video-Text-to-Text • 8B • Updated Aug 4, 2025 • 6.04k • 91
OpenGVLab/InternVL_2_5_HiCo_R16

Video-Text-to-Text • 8B • Updated Feb 13, 2025 • 208 • 6
OpenGVLab/InternVL_2_5_HiCo_R64

Video-Text-to-Text • 8B • Updated May 13, 2025 • 66 • 4
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

Paper • 2501.12386 • Published Jan 21, 2025 • 1

VideoChat-Flash

Faster and more powerful VideoChat.

OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448

Video-Text-to-Text • 2B • Updated Mar 16, 2025 • 382 • 27
OpenGVLab/VideoChat-Flash-Qwen2-7B_res224

Video-Text-to-Text • 8B • Updated Mar 16, 2025 • 101 • 7
OpenGVLab/VideoChat-Flash-Qwen2-7B_res448

Video-Text-to-Text • 8B • Updated Mar 16, 2025 • 703 • 13
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

Paper • 2501.00574 • Published Dec 31, 2024 • 7

InternVL2.5-MPO

Enhancing the Reasoning Ability of MLLMs via Mixed Preference Optimization

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Paper • 2411.10442 • Published Nov 15, 2024 • 87
OpenGVLab/InternVL2_5-78B-MPO

Image-Text-to-Text • 78B • Updated Sep 11, 2025 • 36 • 54
OpenGVLab/InternVL2_5-38B-MPO

Image-Text-to-Text • 38B • Updated Sep 11, 2025 • 41 • 20
OpenGVLab/InternVL2_5-26B-MPO

Image-Text-to-Text • 26B • Updated Mar 25, 2025 • 60 • 14

InternVL1.5

A Pioneering Open-Source Alternative to GPT-4V

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Paper • 2404.16821 • Published Apr 25, 2024 • 60
OpenGVLab/InternVL-Chat-V1-5

Image-Text-to-Text • 26B • Updated Mar 25, 2025 • 6.76k • 417
OpenGVLab/InternViT-6B-448px-V1-5

Image Feature Extraction • 6B • Updated Dec 9, 2024 • 413 • 76
OpenGVLab/InternViT-300M-448px

Image Feature Extraction • 0.3B • Updated Jan 8, 2025 • 2.85k • 62

V2PE

Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

OpenGVLab/V2PE

Updated Dec 13, 2024 • 4
OpenGVLab/V2PE-Data

Preview • Updated Dec 14, 2024 • 255 • 8
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Paper • 2412.09616 • Published Dec 12, 2024 • 1

InternVideo2

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

Paper • 2403.15377 • Published Mar 22, 2024 • 29
OpenGVLab/InternVideo2-Chat-8B

Video-Text-to-Text • 8B • Updated Oct 10, 2024 • 225 • 26
OpenGVLab/InternVideo2_chat_8B_HD

Video-Text-to-Text • 8B • Updated Dec 18, 2024 • 57 • 18
OpenGVLab/InternVideo2_Chat_8B_InternLM2_5

Video-Text-to-Text • 9B • Updated Sep 19, 2024 • 67 • 8

VideoMamba

State Space Model for Efficient Video Understanding

VideoMamba: State Space Model for Efficient Video Understanding

Paper • 2403.06977 • Published Mar 11, 2024 • 29
OpenGVLab/VideoMamba

Video Classification • Updated Apr 14, 2024 • 32
VideoMamba 🐍

Space • Identify actions and objects in videos and images • Configuration error • 97
Andy1621/VideoMamba

Updated Mar 13, 2024 • 2

OmniCorpus

A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Paper • 2406.08418 • Published Jun 12, 2024 • 33
OpenGVLab/OmniCorpus-CC

Viewer • Updated Mar 20, 2025 • 872M • 12k • 28
OpenGVLab/OmniCorpus-CC-210M

Viewer • Updated Mar 20, 2025 • 208M • 801 • 34
OpenGVLab/OmniCorpus-YT

Updated Mar 20, 2025 • 450 • 16

InternImage

Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Paper • 2211.05778 • Published Nov 10, 2022
OpenGVLab/internimage_t_1k_224

Image Classification • 29.9M • Updated Mar 25, 2025 • 122 • 2
OpenGVLab/internimage_s_1k_224

Image Classification • 50.1M • Updated Mar 25, 2025 • 70 • 1
OpenGVLab/internimage_b_1k_224

Image Classification • 97.5M • Updated Mar 25, 2025 • 394 • 1

InternVL Data

OpenGVLab/InternVL-Chat-V1-2-SFT-Data

Viewer • Updated Sep 20, 2024 • 573k • 470 • 29
OpenGVLab/InternVL-SA-1B-Caption

Viewer • Updated Sep 21, 2024 • 8.63M • 237 • 22
OpenGVLab/ShareGPT-4o

Viewer • Updated Aug 17, 2024 • 59.4k • 367 • 198
OpenGVLab/MMPR

Preview • Updated Apr 11, 2025 • 133 • 52

InternVideo-Next

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

Paper • 2512.01342 • Published Dec 1, 2025 • 21
revliter/internvideo_next_base_p14_res224_f16

91M • Updated Dec 18, 2025 • 625 • 5
revliter/internvideo_next_large_p14_res224_f16

0.3B • Updated Dec 18, 2025 • 6.76k • 7
revliter/internvideo_next_large_p14_res224_f16_stage1

Updated Dec 18, 2025 • 11 • 2

NaViL

[NeurIPS 2025] Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

Paper • 2510.08565 • Published Oct 9, 2025 • 22
OpenGVLab/NaViL-2B

4B • Updated Oct 10, 2025 • 26 • 1
OpenGVLab/NaViL-9B

16B • Updated Oct 10, 2025 • 13 • 1

InternVL3.5-Core

This collection includes only the InternVL3.5 checkpoints that have completed the full training pipeline (i.e., Pretraining, SFT, MPO, Cascade RL).

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Paper • 2508.18265 • Published Aug 25, 2025 • 224
OpenGVLab/InternVL3_5-241B-A28B-HF

Image-Text-to-Text • 241B • Updated Sep 8, 2025 • 71 • 11
OpenGVLab/InternVL3_5-38B-HF

Image-Text-to-Text • 38B • Updated Sep 8, 2025 • 2.05k • 6
OpenGVLab/InternVL3_5-30B-A3B-HF

Image-Text-to-Text • 31B • Updated Sep 8, 2025 • 592 • 6

ScaleCUA

OpenGVLab/ScaleCUA-3B

Image-Text-to-Text • 4B • Updated Sep 17, 2025 • 71 • 12
OpenGVLab/ScaleCUA-7B

Image-Text-to-Text • 8B • Updated Sep 18, 2025 • 32 • 10
OpenGVLab/ScaleCUA-32B

Image-Text-to-Text • 33B • Updated Sep 18, 2025 • 20 • 21
OpenGVLab/ScaleCUA-Data

Preview • Updated Sep 27, 2025 • 2.65k • 31

Docopilot

[CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding

OpenGVLab/Docopilot-2B

Image-Text-to-Text • 2B • Updated Jul 20, 2025 • 19 • 8
OpenGVLab/Docopilot-8B

Image-Text-to-Text • 8B • Updated Jul 20, 2025 • 14 • 3
OpenGVLab/Doc-750K

Preview • Updated Jul 22, 2025 • 275 • 17

InternVL3

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Paper • 2504.10479 • Published Apr 14, 2025 • 311
OpenGVLab/InternVL3-1B

Image-Text-to-Text • 0.9B • Updated Sep 11, 2025 • 232k • 85
OpenGVLab/InternVL3-2B

Image-Text-to-Text • 2B • Updated Sep 11, 2025 • 56k • 46
OpenGVLab/InternVL3-8B

Image-Text-to-Text • 8B • Updated Sep 11, 2025 • 70.2k • 106

Mono-InternVL

A Pioneering Monolithic MLLM

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Paper • 2410.08202 • Published Oct 10, 2024 • 6
Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models

Paper • 2507.12566 • Published Jul 16, 2025 • 16
OpenGVLab/Mono-InternVL-2B

Image-Text-to-Text • 3B • Updated 18 days ago • 1.17k • 38
OpenGVLab/Mono-InternVL-2B-S1-1

Image-Text-to-Text • 3B • Updated Jul 22, 2025 • 9

VideoChat-R1

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

OpenGVLab/VideoChat-R1-thinking_7B

Video-Text-to-Text • 8B • Updated Apr 13, 2025 • 21
OpenGVLab/VideoChat-R1_7B

Video-Text-to-Text • 8B • Updated Apr 22, 2025 • 84 • 7
OpenGVLab/VideoChat-R1_7B_caption

Video-Text-to-Text • 8B • Updated Apr 22, 2025 • 23 • 5
OpenGVLab/VideoChat-R1_5-7B

Video-Text-to-Text • 8B • Updated Oct 2, 2025 • 105 • 10

VideoMAE-v2

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Paper • 2303.16727 • Published Mar 29, 2023 • 1
OpenGVLab/VideoMAEv2-Base

Video Classification • 86.2M • Updated Jan 14, 2025 • 11.7k • 17
OpenGVLab/VideoMAEv2-Large

Video Classification • 0.3B • Updated Jan 14, 2025 • 2.12k • 2
OpenGVLab/VideoMAEv2-Huge

Video Classification • 0.6B • Updated Feb 25, 2025 • 1.09k • 1

InternVL2.5

Better than InternVL 2.0

InternVL ⚡

Space • Chat with an AI that understands images and text • Running • 513
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 162
OpenGVLab/InternVL2_5-78B

Image-Text-to-Text • 78B • Updated Sep 11, 2025 • 273 • 193
OpenGVLab/InternVL2_5-78B-AWQ

Image-Text-to-Text • Updated Sep 11, 2025 • 533 • 14

InternVL2.0

Expanding Performance Boundaries of Open-Source MLLM

OpenGVLab/InternVL2-Pretrain-Models

Image-Text-to-Text • Updated Mar 25, 2025 • 12
OpenGVLab/InternVL2-Llama3-76B

Image-Text-to-Text • 76B • Updated Mar 25, 2025 • 367 • 213
OpenGVLab/InternVL2-Llama3-76B-AWQ

Image-Text-to-Text • Updated Mar 25, 2025 • 247 • 25
OpenGVLab/InternVL2-40B

Image-Text-to-Text • 40B • Updated Mar 25, 2025 • 114 • 93

InternVL1.0

Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Paper • 2312.14238 • Published Dec 21, 2023 • 21
OpenGVLab/InternViT-6B-224px

Image Feature Extraction • Updated Dec 9, 2024 • 106 • 24
OpenGVLab/InternVL-14B-224px

Image Feature Extraction • 14B • Updated Dec 9, 2024 • 1.76k • 35
OpenGVLab/InternVL-Chat-V1-2-Plus

Image-Text-to-Text • 40B • Updated Mar 25, 2025 • 41 • 34

InternVL Adaptation

Adaptation Models for Specific Domains

OpenGVLab/Mini-InternVL2-4B-DA-DriveLM

Image-Text-to-Text • 4B • Updated Dec 9, 2024 • 136 • 5
OpenGVLab/Mini-InternVL2-4B-DA-Medical

Image-Text-to-Text • 4B • Updated Dec 9, 2024 • 22 • 7
OpenGVLab/Mini-InternVL2-4B-DA-BDD

Image-Text-to-Text • 4B • Updated Dec 9, 2024 • 93
OpenGVLab/Mini-InternVL2-2B-DA-DriveLM

Image-Text-to-Text • 2B • Updated Mar 26, 2025 • 70 • 2

VideoChat

Chat-Centric Video Understanding

OpenGVLab/VideoChat2_stage3_Mistral_7B

Updated May 22, 2024 • 4
OpenGVLab/VideoChat2_stage2_Mistral_7B

Updated May 22, 2024 • 2
OpenGVLab/VideoChat2-IT

Viewer • Updated Jun 29, 2024 • 1.82M • 331 • 52
VideoChat2 ⚡

Space • View a remote web page within the app • Sleeping • 31

InternVid

A Large-Scale Video-Text Dataset

OpenGVLab/InternVid

Viewer • Updated Aug 13, 2024 • 21.3M • 350 • 98
OpenGVLab/InternVid-10M-FLT-INFO

Viewer • Updated Jan 24, 2024 • 10.6M • 57 • 8
OpenGVLab/ViCLIP

Updated Jun 7, 2024 • 52
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Paper • 2307.06942 • Published Jul 13, 2023 • 26

All-Seeing Project

OpenGVLab/ASMv2

Text Generation • Updated Feb 29, 2024 • 415 • 17
OpenGVLab/ASM-FT

Updated Feb 21, 2024 • 12 • 6
OpenGVLab/AS-Core

Preview • Updated Mar 21, 2024 • 68 • 10
OpenGVLab/ASM-Pretrain

Updated Feb 21, 2024 • 11 • 3

PVT v2

Improved Baselines with Pyramid Vision Transformer

PVT v2: Improved Baselines with Pyramid Vision Transformer

Paper • 2106.13797 • Published Jun 25, 2021
OpenGVLab/pvt_v2_b1

Image Classification • 14M • Updated Mar 12, 2024 • 13 • 1
OpenGVLab/pvt_v2_b2

Image Classification • 25.4M • Updated Mar 12, 2024 • 308 • 1
OpenGVLab/pvt_v2_b2_linear

Image Classification • 22.6M • Updated Mar 12, 2024 • 14 • 1

Team members 118

OpenGVLab 's collections 35

VideoMamba

InternVL

VideoChat2