MOSS-VL-Instruct-0408

📌 Introduction

MOSS-VL-Instruct-0408 is the instruction-tuned checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing visual understanding.

Built on top of MOSS-VL-Base-0408 through supervised fine-tuning (SFT), this checkpoint is designed as a high-performance offline multimodal engine. It delivers strong, well-rounded performance across the full spectrum of vision-language tasks, including image understanding, OCR, document parsing, visual reasoning, and instruction following. It particularly excels at video understanding, from long-form comprehension to fine-grained temporal reasoning and action recognition.

✨ Highlights

  • 🎬 Outstanding Video Understanding — A core strength of MOSS-VL. The model excels at long-form video comprehension, temporal reasoning, action recognition, and second-level event localization, delivering top-tier results on benchmarks such as VideoMME and MLVU.
  • 🖼️ Strong General Multimodal Perception — Robust image understanding, fine-grained object recognition, OCR, and document parsing.
  • 💬 Reliable Instruction Following — Substantially improved alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.

🏗 Model Architecture

MOSS-VL-Instruct-0408 adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. This design drives latency down to the millisecond level, enabling near-instantaneous responses to dynamic video streams. Natively supporting interleaved modalities, it processes complex sequences of images and videos within a unified pipeline, eliminating the need for heavy pre-processing.

MOSS-VL Architecture
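The core idea, text queries attending over vision keys/values rather than interleaving vision tokens into the language sequence, can be illustrated with a minimal single-head sketch. This is a pure-Python illustration of generic cross-attention, not MOSS-VL's actual implementation; dimensions and naming are assumptions.

```python
import math

# Minimal single-head cross-attention sketch: each text-token query attends
# over vision-token vectors (used here as both keys and values for brevity),
# so the language decoder consumes visual features without growing its own
# token sequence. Illustrative only; not the model's real code.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(text_q, vision_kv):
    """text_q: list of query vectors (one per text token).
    vision_kv: list of vectors (one per vision token).
    Returns one attended vector per text token."""
    d = len(vision_kv[0])
    out = []
    for q in text_q:
        # Scaled dot-product scores against every vision token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vision_kv]
        weights = softmax(scores)
        # Weighted sum of vision vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, vision_kv))
                    for j in range(d)])
    return out
```

With one query aligned to the first vision token, the output is pulled toward that token's vector, which is the behavior the architecture relies on to route visual evidence into the language stream.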

🧩 Absolute Timestamps

To ensure the model accurately perceives the pacing and duration of events, MOSS-VL-Instruct-0408 injects absolute timestamps alongside each sampled frame, grounding the reasoning process in a precise temporal reference.

Timestamped Sequence Input Illustration
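The exact prompt template is internal to the processor, but the injection idea can be sketched as interleaving a timestamp marker before each sampled frame. The `<t=...s>` marker format and the helper names below are illustrative assumptions, not the model's actual tokens:

```python
# Sketch: interleave absolute timestamps with sampled frame placeholders,
# so the model reasons over real elapsed time rather than frame indices.
# The "<t=...s>" format is an assumption for illustration only.

def sample_timestamps(duration_s: float, fps: float, max_frames: int) -> list[float]:
    """Absolute timestamps (seconds) for frames sampled at `fps`,
    capped at `max_frames` by uniform spacing."""
    n = max(1, min(int(duration_s * fps), max_frames))
    step = duration_s / n
    return [round(i * step, 2) for i in range(n)]

def build_timestamped_sequence(duration_s: float, fps: float, max_frames: int) -> list[str]:
    # Each frame placeholder is preceded by its absolute timestamp marker.
    seq = []
    for t in sample_timestamps(duration_s, fps, max_frames):
        seq.append(f"<t={t}s>")
        seq.append("<frame>")
    return seq

print(build_timestamped_sequence(duration_s=4.0, fps=1.0, max_frames=256))
# → ['<t=0.0s>', '<frame>', '<t=1.0s>', '<frame>', '<t=2.0s>', '<frame>', '<t=3.0s>', '<frame>']
```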

🧬 Cross-attention RoPE (XRoPE)

MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention based vision–language architecture. This mechanism maps text tokens and video patches into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w).

MOSS-VL mRoPE Architecture Illustration
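Conceptually, each video patch receives a (t, h, w) grid coordinate, while text tokens advance all three axes in lockstep so that pure text reduces to ordinary 1D RoPE. MOSS-VL's exact XRoPE indexing rules are not spelled out here, so the sketch below follows the common mRoPE-style convention as an assumption:

```python
# Sketch of 3D (t, h, w) position-id assignment in the spirit of XRoPE.
# Follows the common mRoPE-style convention for illustration; MOSS-VL's
# actual indexing may differ.

def video_position_ids(n_frames: int, grid_h: int, grid_w: int) -> list[tuple[int, int, int]]:
    """Each video patch gets its (time, height, width) grid coordinate."""
    return [
        (t, h, w)
        for t in range(n_frames)
        for h in range(grid_h)
        for w in range(grid_w)
    ]

def text_position_ids(n_tokens: int, start: int = 0) -> list[tuple[int, int, int]]:
    """Text tokens advance all three axes together, reducing to 1D RoPE."""
    return [(start + i, start + i, start + i) for i in range(n_tokens)]
```

For a 2-frame, 2x2-patch clip this yields eight coordinates from (0, 0, 0) to (1, 1, 1), giving the rotary embedding a shared temporal-spatial frame of reference for text and vision.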

📊 Model Performance

We conducted a comprehensive evaluation of MOSS-VL-Instruct-0408 across four key dimensions: Multimodal Perception, Multimodal Reasoning, Document/OCR, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in general multimodal perception and complex video analysis.

🌟 Key Highlights

  • 🚀 Leading Video Intelligence: MOSS-VL achieves a score of 65.8 on Video Understanding, outperforming Qwen3-VL by 2 points. It shows strong temporal consistency and action recognition across benchmarks such as VideoMME, MLVU, EgoSchema, and VSI-bench (where it outperforms Qwen3-VL-8B-Instruct by 8.3 points).
  • 👁️ Outstanding Multimodal Perception: MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like BLINK and MMBench.
  • 🧠 Robust Multimodal Reasoning: MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites.
  • 📄 Reliable Document Understanding: While primarily optimized for general perception, MOSS-VL still scores 83.9 on OCR and document analysis benchmarks, ensuring dependable extraction of text and structured information.

MOSS-VL Benchmark Results

🚀 Quickstart

🛠️ Installation

conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt

🏃 Run Inference

Single-image offline inference (Python)
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"
prompt = "Describe this image."


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_image_generate(
    processor,
    prompt=prompt,
    image=image_path,
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
)

print(text)

Single-video offline inference (Python)
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"
prompt = "Describe this video."


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_video_generate(
    processor,
    prompt=prompt,
    video=video_path,
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
)

print(text)
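In the video call above, `video_fps`, `min_frames`, and `max_frames` jointly determine how many frames are sampled. A minimal sketch of the likely selection rule (assumed behavior; the processor's exact rounding may differ):

```python
# Sketch of frame-count selection from video duration and sampling settings.
# Assumed behavior for illustration; the actual processor logic may differ.

def num_sampled_frames(duration_s: float, video_fps: float,
                       min_frames: int, max_frames: int) -> int:
    """Sample roughly `video_fps` frames per second, clamped to
    [min_frames, max_frames]."""
    n = round(duration_s * video_fps)
    return max(min_frames, min(n, max_frames))

print(num_sampled_frames(600.0, 1.0, 1, 256))  # → 256 (a 10-minute clip hits the cap)
```

So at the default `video_fps=1.0`, a two-minute clip yields 120 frames, while long videos are capped at `max_frames` and very short clips are padded up to `min_frames`.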
Batched offline inference (Python)
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

queries = [
    {
        "prompt": "Describe sample A.",
        "images": [],
        "videos": ["data/sample_a.mp4"],
        "media_kwargs": {"video_fps": 1.0, "min_frames": 8, "max_frames": 256},
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    },
    {
        "prompt": "Describe sample B.",
        "images": [],
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": {"video_fps": 1.0, "min_frames": 8, "max_frames": 256},
        "generate_kwargs": {
            "temperature": 1.0,
            "top_k": 50,
            "top_p": 1.0,
            "max_new_tokens": 256,
            "repetition_penalty": 1.0,
            "do_sample": False,
        },
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(processor, queries, vision_chunked_length=64)

texts = [item["text"] for item in result["results"]]

🚧 Limitations and Future Work

MOSS-VL-Instruct-0408 represents an early milestone in the MOSS-VL roadmap, and we're actively working on several directions to push it further:

  • 🧮 Math & Code Reasoning — While the current checkpoint already exhibits great general reasoning, we plan to substantially strengthen its mathematical reasoning and code reasoning capabilities, especially in multimodal contexts.
  • 🎯 RL Post-Training — We are working on a reinforcement learning post-training stage to further align the model with human preferences and to unlock stronger multi-step reasoning behaviors on top of the SFT foundation.

We welcome community feedback and contributions on any of these directions.

📜 Citation

@misc{moss_vl_2026,
  title         = {{MOSS-VL Technical Report}},
  author        = {OpenMOSS Team},
  year          = {2026},
  howpublished  = {\url{https://github.com/OpenMOSS/MOSS-VL}},
  note          = {GitHub repository}
}
Model size: 11B params · Tensor type: BF16 (Safetensors)