πŸš€ CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation

Mingzhu Xu1  Tianxiang Xiao1  Yutong Liu1  Haoyu Tang1  Yupeng Hu1βœ‰  Liqiang Nie1

1Affiliation (Please update if needed)

This repository provides the official implementation details and pre-trained models for CMIRNet, a Cross-Modal Interactive Reasoning Network for Referring Image Segmentation (RIS).

πŸ”— Paper: IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2024
πŸ”— Task: Referring Image Segmentation (RIS)
πŸ”— Framework: PyTorch


πŸ“Œ Model Information

1. Model Name

CMIRNet (Cross-Modal Interactive Reasoning Network)


2. Task Type & Applicable Tasks

  • Task Type: Vision-Language / Multimodal Learning
  • Core Task: Referring Image Segmentation (RIS)
  • Applicable Scenarios:
    • Language-guided object segmentation
    • Cross-modal reasoning
    • Vision-language alignment
    • Scene understanding with textual queries

3. Project Introduction

Referring Image Segmentation (RIS) aims to segment target objects in an image based on natural language descriptions. The key challenge lies in fine-grained cross-modal alignment and complex reasoning between visual and linguistic modalities.

CMIRNet proposes a Cross-Modal Interactive Reasoning framework, which:

  • Introduces interactive reasoning mechanisms between visual and textual features
  • Enhances semantic alignment via multi-stage cross-modal fusion
  • Incorporates graph-based reasoning to capture complex relationships
  • Improves robustness under ambiguous or complex referring expressions (see the illustrative sketch below)
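
The PyTorch sketch below illustrates the general shape of such a fusion-then-reasoning step. It is not taken from the CMIRNet source; the module names, dimensions, and the similarity-graph construction are assumptions chosen only to make the idea concrete.

# Illustrative only: a minimal cross-modal fusion + graph-reasoning step in PyTorch.
# Module names, dimensions, and the graph construction are hypothetical, not CMIRNet's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """One interaction step: visual tokens attend to word features."""
    def __init__(self, vis_dim=256, lang_dim=768, heads=8):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, vis_dim)
        self.attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens, lang_tokens):
        # vis_tokens: (B, HW, vis_dim); lang_tokens: (B, L, lang_dim)
        lang = self.lang_proj(lang_tokens)
        fused, _ = self.attn(query=vis_tokens, key=lang, value=lang)
        return self.norm(vis_tokens + fused)

class GraphReasoning(nn.Module):
    """Propagates information between visual tokens over a similarity graph."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens):
        # tokens: (B, N, dim); adjacency from scaled pairwise similarity
        adj = F.softmax(torch.bmm(tokens, tokens.transpose(1, 2)) / tokens.size(-1) ** 0.5, dim=-1)
        return tokens + F.relu(self.proj(torch.bmm(adj, tokens)))

vis = torch.randn(2, 26 * 26, 256)   # flattened visual feature map
txt = torch.randn(2, 20, 768)        # word embeddings (e.g. from a BERT encoder)
out = GraphReasoning()(CrossModalFusion()(vis, txt))
print(out.shape)                      # torch.Size([2, 676, 256])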

4. Training Data Source

The model is trained and evaluated on:

  • RefCOCO
  • RefCOCO+
  • RefCOCOg
  • RefCLEF

Image data is based on:

  • MS COCO 2014 Train Set (83K images)

πŸš€ Usage & Basic Inference

Step 1: Prepare Pre-trained Weights

Download ImageNet-pretrained backbone weights (an example loading snippet follows the list):

  • ResNet-50
  • ResNet-101
  • Swin-Transformer-Base
  • Swin-Transformer-Large
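
The exact checkpoint files and paths the scripts expect are defined in the repository's training code. As an assumed starting point, the ResNet weights can be exported from torchvision (0.13+) as shown below; the Swin-Base/Large ImageNet-22K checkpoints are typically obtained from the official Swin Transformer release page.

# Assumption: standard ImageNet-pretrained ResNet checkpoints exported via torchvision >= 0.13.
# The training scripts may expect specific filenames/paths -- check them before relying on this.
import os
import torch
import torchvision

os.makedirs("./pretrained", exist_ok=True)  # hypothetical target directory
resnet50 = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1)
resnet101 = torchvision.models.resnet101(weights=torchvision.models.ResNet101_Weights.IMAGENET1K_V1)
torch.save(resnet50.state_dict(), "./pretrained/resnet50.pth")
torch.save(resnet101.state_dict(), "./pretrained/resnet101.pth")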

Step 2: Dataset Preparation

  1. Download the COCO 2014 training images
  2. Extract them to ./data/images/
  3. Download the referring expression datasets from https://github.com/lichengunc/refer (a quick sanity-check snippet follows these steps)
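
After extraction, a quick sanity check with the refer toolkit linked above confirms that the annotations and images are found. The data_root layout and splitBy value below are assumptions; adjust them to match your setup and the repository's data-loading code.

# Sanity-check sketch using the refer toolkit (https://github.com/lichengunc/refer).
# Assumes the toolkit is importable and the dataset lives under ./data -- adjust as needed.
from refer import REFER

refer = REFER(data_root='./data', dataset='refcoco', splitBy='unc')
train_ref_ids = refer.getRefIds(split='train')
print(f"{len(train_ref_ids)} training referring expressions")

ref = refer.loadRefs(train_ref_ids[0])[0]
print("example expressions:", [s['sent'] for s in ref['sentences']])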

Step 3: Training

ResNet-based Training

python train_resnet.py --model_id cmirnet_refcoco_res --device cuda:0

python train_resnet.py --model_id cmirnet_refcocop_res --device cuda:0 --dataset refcoco+

python train_resnet.py --model_id cmirnet_refcocog_res --device cuda:0 --dataset refcocog --splitBy umd

Swin-Transformer-based Training

python train_swin.py --model_id cmirnet_refcoco_swin --device cuda:0

python train_swin.py --model_id cmirnet_refcocop_swin --device cuda:0 --dataset refcoco+

python train_swin.py --model_id cmirnet_refcocog_swin --device cuda:0 --dataset refcocog --splitBy umd

Step 4: Testing / Inference

ResNet-based Testing

python test_resnet.py --device cuda:0 --resume path/to/weights

python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcoco+

python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd

Swin-Transformer-based Testing

python test_swin.py --device cuda:0 --resume path/to/weights --window12

python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcoco+ --window12

python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd --window12
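
The test scripts report the standard RIS metrics (overall/mean IoU and precision at IoU thresholds). As a rough, self-contained illustration of how a single prediction could be scored, assuming the model outputs per-pixel logits, the sketch below thresholds a mask and computes its IoU against the ground truth; it is not taken from the repository's evaluation code.

# Illustrative scoring sketch (not the repo's evaluation code): threshold per-pixel
# logits into a binary mask and compute IoU against the ground-truth mask.
import torch

def mask_iou(pred_logits: torch.Tensor, gt_mask: torch.Tensor, thresh: float = 0.5) -> float:
    # pred_logits, gt_mask: (H, W); gt_mask is binary {0, 1}
    pred = (pred_logits.sigmoid() > thresh).float()
    inter = (pred * gt_mask).sum()
    union = ((pred + gt_mask) > 0).float().sum()
    return (inter / union.clamp(min=1)).item()

pred = torch.randn(480, 480)                # toy logits standing in for a model output
gt = (torch.rand(480, 480) > 0.7).float()   # toy ground-truth mask
print(f"IoU = {mask_iou(pred, gt):.4f}")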

⚠️ Limitations & Notes

  • For academic research use only
  • Performance depends on dataset quality and referring expression clarity
  • May degrade under:
    • ambiguous language
    • complex scenes
    • domain shift
  • Requires substantial GPU resources for training

πŸ“β­οΈ Citation

@ARTICLE{CMIRNet,
  author={Xu, Mingzhu and Xiao, Tianxiang and Liu, Yutong and Tang, Haoyu and Hu, Yupeng and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology}, 
  title={CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation}, 
  year={2024},
  pages={1-1},
  keywords={Referring Image Segmentation; Vision-Language; Cross Modal Reasoning; Graph Neural Network},
  doi={10.1109/TCSVT.2024.3508752}
}

⭐ Acknowledgement

This work builds upon advances in:

  • Vision-language modeling
  • Transformer architectures
  • Graph neural networks

πŸ“¬ Contact

For questions or collaboration, please contact the corresponding author.
