Model Overview

Description:

OneFormer (One Transformer to Rule Universal Image Segmentation) is a universal segmentation architecture that handles panoptic, instance, and semantic image segmentation. A task token conditions the architecture on the desired segmentation type ("semantic," "instance," or "panoptic"), so a single model, trained once, produces task-specific predictions at inference time.

This model is ready for commercial use.

License/Terms of Use:

GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model Agreement

Deployment Geography:

Global

Use Case:

Intended Users: This model is intended for use by computer vision engineers, robotics engineers, and researchers who require a comprehensive and detailed pixel-level understanding of an image.
Intended Use Cases: The model's ability to perform semantic, instance, and panoptic segmentation with a single set of weights makes it well suited for:

  • Autonomous Systems: Providing full scene understanding for self-driving cars or drones by identifying both "stuff" (road, sky) and "things" (car 1, car 2, pedestrian 1).
  • Robotic Perception: Enabling robots to identify, locate, and separate individual objects for "pick-and-place" tasks in cluttered environments.
  • Medical Image Analysis: Segmenting and counting individual cells or tumors (instances) while also classifying surrounding anatomical regions or tissues (semantic).
  • Geospatial Analysis: Analyzing satellite imagery to map land use (e.g., "forest," "water") while also detecting and counting individual objects (e.g., "buildings," "vehicles").
  • Computational Photography: Powering features like AR effects or portrait mode by creating precise masks that separate subjects from their background.

Release Date:

NGC 05/25/2026 via [URL]

Reference(s):

J. Jain, J. Li, M. Chiu, A. Hassani, N. Orlov, H. Shi: OneFormer: One Transformer to Rule Universal Image Segmentation

Model Architecture:

Architecture Type: The model is a unified segmentor that takes color (RGB) images as inputs and generates segmentation masks and associated labels as outputs.

Network Architecture:

  • The backbone feature extractor of this model is the DiNaT-L model.
  • The multi-scale features from the backbone are then fed into a Pixel Decoder (similar to an FPN) to generate high-resolution, multi-scale feature maps.
  • The core of the architecture is a transformer decoder that takes two sets of inputs: the multi-scale feature maps and a set of learnable queries.
  • A key innovation of OneFormer is the use of a task token. This single, learnable token is added to the queries to "prompt" the model, conditioning it to perform a specific task (semantic, instance, or panoptic segmentation) using the exact same weights.
  • Finally, the refined query embeddings from the decoder are passed to two parallel prediction heads:
    • A classification head (a linear layer or small MLP) to predict the class label for each query.
    • A mask head (also an MLP) to dynamically generate the final mask for each query by combining the decoder outputs with the pixel decoder's feature maps.
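
The two prediction heads described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the model's implementation: the tensor shapes, the single linear classification layer, the omission of the mask MLP, and all names (`predict`, `w_cls`) are hypothetical, and the dot-product combination follows the general MaskFormer-style design rather than anything this card specifies.

```python
import numpy as np

def predict(queries, pixel_features, w_cls):
    """Toy prediction heads (illustrative only).

    queries:        (B, Q, C) refined query embeddings from the decoder
    pixel_features: (B, C, H, W) high-resolution pixel decoder features
    w_cls:          (C, K) weights of a hypothetical linear classification head
    """
    # Classification head: one vector of K class logits per query.
    class_logits = queries @ w_cls  # (B, Q, K)
    # Mask head: dot product between each query and every pixel feature,
    # giving one H x W mask logit map per query.
    mask_logits = np.einsum("bqc,bchw->bqhw", queries, pixel_features)
    return class_logits, mask_logits

rng = np.random.default_rng(0)
B, Q, C, K, H, W = 2, 5, 8, 4, 16, 16
queries = rng.standard_normal((B, Q, C))
pixel_features = rng.standard_normal((B, C, H, W))
w_cls = rng.standard_normal((C, K))

class_logits, mask_logits = predict(queries, pixel_features, w_cls)
print(class_logits.shape, mask_logits.shape)
```

Each query yields one class prediction and one mask, which is why the outputs share the same leading (batch, query) dimensions.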

More Details: This is a universal segmentation model that takes an RGB image and a text string (e.g., "The task is semantic") as input and produces masks and classes as output. The DiNaT-Large backbone was trained in a supervised manner on NV-ImageNet, an NVIDIA proprietary dataset that allows commercial usage. The text input is used to generate a task token that conditions the model to perform a specific segmentation task (semantic, instance, or panoptic) using the same set of weights. Finally, OneFormer was trained and fine-tuned on a combination of datasets, including OpenImages and ITS. All raw images used during training carry licenses that permit commercial usage.

Number of model parameters: 230*10^7

Input(s):

Input Type(s): Image, Text

Input Format(s):

  • Image: Red, Green, Blue (RGB)
  • Text: String

Input Parameters:

  • Image: Two-Dimensional (2D)
  • Text: One-Dimensional (1D)

Other Properties Related to Input: The image size should be divisible by 32, and the text should state "This task is semantic/instance/panoptic."
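
The input constraints above can be checked with a small helper. This is a sketch under assumptions: padding up to the next multiple of 32 is only one way to satisfy the size requirement (resizing is another), the function names are hypothetical, and the prompt wording is copied from the sentence above, so the exact string the model expects should be confirmed against the TAO documentation.

```python
VALID_TASKS = ("semantic", "instance", "panoptic")

def task_prompt(task):
    """Build the task string using the wording given in this card."""
    if task not in VALID_TASKS:
        raise ValueError(f"unknown task: {task!r}")
    return f"This task is {task}."

def padded_size(height, width, divisor=32):
    """Smallest (height, width) not below the input that `divisor` divides."""
    # -(-x // d) is integer ceiling division.
    return (-(-height // divisor) * divisor,
            -(-width // divisor) * divisor)

print(task_prompt("panoptic"))
print(padded_size(720, 1280))
```

For a 720x1280 image, only the height needs padding, since 1280 is already a multiple of 32.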

Output(s)

Output Type(s): Label, Mask and Score for each detected object in the input image.

Output Format(s):

  • Label: Integer
  • Mask: Red, Green, Blue (RGB)
  • Score: Float

Output Parameters:

  • Label: One-Dimensional (1D)
  • Mask: Two-Dimensional (2D)
  • Score: One-Dimensional (1D)

Other Properties Related to Output:

  • pred_classes: Batch size x Number of queries
  • pred_masks: Batch size x Number of queries x Height x Width
  • pred_scores: Batch size x Number of queries
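
Given the output shapes listed above, a typical post-processing step keeps only the queries whose score passes a threshold. The following is a sketch with dummy data: the function name, the 0.5 threshold, and the boolean mask dtype are assumptions for illustration, not values documented for this model.

```python
import numpy as np

def keep_confident(pred_classes, pred_masks, pred_scores, threshold=0.5):
    """Return, per image, the (classes, masks, scores) above `threshold`."""
    results = []
    for classes, masks, scores in zip(pred_classes, pred_masks, pred_scores):
        keep = scores >= threshold  # boolean selector over the query axis
        results.append((classes[keep], masks[keep], scores[keep]))
    return results

# Dummy batch: 1 image, 3 queries, 2x2 masks.
pred_classes = np.array([[1, 0, 3]])
pred_masks = np.zeros((1, 3, 2, 2), dtype=bool)
pred_scores = np.array([[0.9, 0.2, 0.7]])

classes, masks, scores = keep_confident(pred_classes, pred_masks, pred_scores)[0]
print(classes, masks.shape, scores)
```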

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • TAO v6.25.11

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Pascal
  • NVIDIA Turing
  • NVIDIA Volta

Preferred/Supported Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

v2.0

Training, Testing, and Evaluation Datasets:

Dataset Overview

  • Total Number of Datasets: 4
    • Training: NV-ImageNet, Subset of OpenImagesv5, Subset of MSCOCO 2017, ITS (train)
    • Validation: ITS (val), COCO 2017 validation set

  • Data Modality: Image

Images are scaled, typically by resizing the shorter edge to a fixed size (e.g., 800 pixels) while maintaining the aspect ratio. From these scaled images, large random crops (e.g., 1024x1024) are taken to create training patches. Standard augmentations, including random horizontal flipping (with 50% probability) and photometric jitter (adjusting brightness, contrast, and saturation), are applied to improve model robustness.
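
The geometric part of this preprocessing can be sketched with the standard library alone. The helper names are hypothetical, and clipping the crop window to the image when the crop exceeds an image dimension is an assumption (a real pipeline might pad instead). Photometric jitter is omitted.

```python
import random

def resized_shape(height, width, shorter_edge=800):
    """Scale so the shorter edge equals `shorter_edge`, keeping aspect ratio."""
    scale = shorter_edge / min(height, width)
    return round(height * scale), round(width * scale)

def random_crop_box(height, width, crop=1024, rng=random):
    """Sample a crop window (top, left, bottom, right), clipped to the image."""
    crop_h, crop_w = min(crop, height), min(crop, width)
    top = rng.randint(0, height - crop_h)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return top, left, top + crop_h, left + crop_w

def should_flip(rng=random, probability=0.5):
    """Decide whether to apply a horizontal flip."""
    return rng.random() < probability

rng = random.Random(0)  # seeded for repeatability
h, w = resized_shape(1080, 1920)
box = random_crop_box(h, w, rng=rng)
print((h, w), box, should_flip(rng))
```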

NV-ImageNet

  • Data Modality: Image
  • Data Collection Method by dataset: Automated
  • Labeling Method by dataset: Human
  • Properties: NV-ImageNet is a commercially friendly image dataset whose category names are aligned with the original ImageNet-1k category names. The images were collected from 84 websites that allow commercial use of their images, and from Bing image search restricted to results that are free to share and use commercially.
    • Training Set: 5,458,059 images
    • Total Classes: 1000 classes
    • Total Annotations: 5,458,059 annotations

Subset of OpenImagesv5

  • Link: https://storage.googleapis.com/openimages/web/download_v5.html
  • Data Modality: Image
  • Data Collection Method by dataset: Automated
  • Labeling Method by dataset: Synthetic
  • Properties:
    • Training Set: 670,612 images (train2017)
    • Total Classes: 133 total categories
      • 80 "Thing" Categories (for instance segmentation, e.g., 'person', 'dog', 'stop sign')
      • 53 "Stuff" Categories (for semantic segmentation, e.g., 'sky', 'grass', 'road')
    • Total Annotations: 10,944,849 annotations

ITS

  • Data Modality: Image
  • Data Collection Method by dataset: Automated
  • Labeling Method by dataset: Synthetic
  • Properties:
    • Training Set: 227,763 images
    • Validation Set: 5,404 images
    • Total Classes: 4 categories (person, car, bicycle and road_sign)
    • Total Annotations: 5,612,043 annotations for training and 55,582 annotations for validation

Evaluation and Testing

ITS - KPI set

  • Data Modality: Image
  • Data Collection Method by dataset: Automated
  • Labeling Method by dataset: Synthetic
  • Properties:
    • Validation Set: 5,404 images
    • Total Classes: 4 categories (person, car, bicycle and road_sign)
    • Total Annotations: 5,612,043 annotations for training and 55,582 annotations for validation

Inference:

Acceleration Engine: TensorRT
Test Hardware:

  • 1x NVIDIA A100 80GB
  • 1x NVIDIA H100 80GB

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included.

Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
