(ICASSP 2026) HINT: Composed Image Retrieval with Dual-Path Compositional Contextualized Network (Model Weights)
Model Information
1. Model Name
HINT (dual-patH composItional coNtextualized neTwork) Checkpoints.
2. Task Type & Applicable Tasks
- Task Type: Composed Image Retrieval (CIR) / Vision-Language Retrieval.
- Applicable Tasks: Retrieving target images based on a reference image and a modification text.
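To make the task concrete, here is a minimal sketch of the retrieval step and the Recall@K metric commonly reported for CIR. It assumes a query embedding has already been produced from the reference image and the modification text; the toy vectors and the helper names `rank_candidates` and `recall_at_k` are illustrative assumptions, not part of the HINT codebase, and the sketch does not show how HINT builds its embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_candidates(query: Sequence[float], gallery: Dict[str, Sequence[float]]) -> List[str]:
    """Return gallery image ids sorted from most to least similar to the query."""
    return sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)


def recall_at_k(rankings: List[List[str]], targets: List[str], k: int) -> float:
    """Fraction of queries whose target appears among the top-k ranked ids."""
    hits = sum(target in ranked[:k] for ranked, target in zip(rankings, targets))
    return hits / len(targets)


# Toy data: in practice the query would be the model's joint embedding of
# (reference image, modification text); here it is a hand-written vector.
gallery = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]
ranking = rank_candidates(query, gallery)
print(ranking)
print(recall_at_k([ranking], ["img_c"], k=2))
```

Recall@K only asks whether the target lands in the top K results, so it is insensitive to the exact position within that window.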
3. Project Introduction
Existing Composed Image Retrieval (CIR) methods often neglect contextual information when discriminating matching samples, and so struggle to understand complex modifications and implicit dependencies in real-world scenarios. HINT addresses this through:
- Dual Context Extraction (DCE): Extracts both intra-modal and cross-modal context, enhancing the joint semantic representation by integrating multimodal contextual information.
- Quantification of Contextual Relevance (QCR): Measures the relevance between cross-modal contextual information and the target image semantics, which quantifies the implicit dependencies.
- Dual-Path Consistency Constraints (DPCC): Optimizes training by constraining representation consistency, so that similarity rises stably for matching instances and falls for non-matching ones.
Built on the BLIP-2 architecture, HINT achieves state-of-the-art (SOTA) retrieval performance on both open-domain and fashion-domain benchmarks.
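The goal of raising similarity for matching pairs while lowering it for non-matching ones is often pursued with an in-batch contrastive loss. The sketch below shows that generic loss only; it is not the DPCC formulation from the paper, and the function name and the temperature value of 0.07 are assumptions chosen for illustration.

```python
import math
from typing import List


def in_batch_contrastive_loss(sim: List[List[float]], temperature: float = 0.07) -> float:
    """Mean cross-entropy over rows of a square similarity matrix.

    sim[i][j] is the similarity between query i and target j; the matching
    target for query i sits on the diagonal (j == i).
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


# A batch where matches score clearly higher than non-matches ...
well_separated = [[0.9, 0.1], [0.2, 0.8]]
# ... and one where the model cannot tell them apart.
confused = [[0.5, 0.5], [0.5, 0.5]]
print(in_batch_contrastive_loss(well_separated))
print(in_batch_contrastive_loss(confused))
```

The loss is lower for the well-separated batch, which is the behavior a training objective of this kind rewards.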
4. Training Data Source & Hosted Weights
The models were trained on the FashionIQ and CIRR datasets. This Hugging Face repository provides the corresponding .pt checkpoint files, one per dataset:
- fashioniq.pt (trained on FashionIQ)
- cirr.pt (trained on CIRR)
Usage & Basic Inference
These weights are intended to be evaluated with the official HINT GitHub repository.
Step 1: Prepare the Environment
Clone the GitHub repository and install dependencies:
git clone https://github.com/iLearn-Lab/ICASSP26-HINT
cd ICASSP26-HINT
conda create -n hint python=3.8 -y
conda activate hint
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
pip install open-clip-torch==2.24.0 scikit-learn==1.3.2 transformers==4.25.0 salesforce-lavis==1.0.2 timm==0.9.16
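After installation, it can help to confirm that the environment matches the pinned versions. The following is a minimal sketch using only the Python standard library; the `PINNED` table simply mirrors the install commands above, and the helper names `installed_versions` and `report` are hypothetical, not part of the repository.

```python
from importlib import metadata
from typing import Dict, Optional

# Versions pinned by the install commands above.
PINNED = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "torchaudio": "2.1.0",
    "open-clip-torch": "2.24.0",
    "scikit-learn": "1.3.2",
    "transformers": "4.25.0",
    "salesforce-lavis": "1.0.2",
    "timm": "0.9.16",
}


def installed_versions(packages: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(packages: Dict[str, str]) -> None:
    """Print one line per package, flagging missing or mismatched versions."""
    for name, have in installed_versions(packages).items():
        want = packages[name]
        # CUDA wheels report a local suffix such as "2.1.0+cu121"; compare the base part.
        base = have.split("+")[0] if have else None
        status = "ok" if base == want else "MISMATCH" if have else "MISSING"
        print(f"{name:<18} want {want:<8} have {have or '-':<14} {status}")


report(PINNED)
```

Run it inside the activated `hint` environment; any line marked MISSING or MISMATCH points at a package to reinstall.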
Step 2: Download Model Weights
Download the specific .pt files you wish to evaluate from this Hugging Face repository. Place them into a checkpoints/ directory within your cloned GitHub repo. For example, to evaluate the CIRR model:
ICASSP26-HINT/
└── checkpoints/
    └── cirr.pt   <-- (Rename to best_model.pt if required by your specific test script)
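The layout above can also be set up programmatically. This is a small sketch, assuming the checkpoint has already been downloaded to a local path; `stage_checkpoint` is a hypothetical helper, and the source path in the example is a placeholder to adjust.

```python
import shutil
from pathlib import Path


def stage_checkpoint(downloaded: Path, repo_root: Path, rename_to: str = "") -> Path:
    """Copy a downloaded .pt file into <repo_root>/checkpoints/ and return the new path.

    If rename_to is given (for example "best_model.pt"), the copy uses that name.
    """
    if not downloaded.is_file():
        raise FileNotFoundError(f"checkpoint not found: {downloaded}")
    target_dir = repo_root / "checkpoints"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / (rename_to or downloaded.name)
    shutil.copy2(downloaded, target)
    return target


if __name__ == "__main__":
    # Assumed locations; adjust to where the file was downloaded and the repo was cloned.
    src = Path("~/Downloads/cirr.pt").expanduser()
    if src.is_file():
        print(stage_checkpoint(src, Path("ICASSP26-HINT")))
    else:
        print(f"nothing to stage: {src} does not exist")
```

Copying rather than moving keeps the original download intact in case the test script expects a different file name later.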
Step 3: Run Testing / Evaluation
To generate prediction files on the CIRR dataset for the CIRR Evaluation Server, point the test script to the directory containing your downloaded checkpoint:
python src/cirr_test_submission.py checkpoints/
(The script will automatically output .json files based on the checkpoint for online evaluation.)
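Before uploading, a quick sanity check of the generated files can catch truncated or empty output. The sketch below only confirms that each .json file parses and counts its top-level entries; it makes no assumption about the submission schema, and both the helper name and the directory it scans are assumptions to adapt to wherever the test script writes its output.

```python
import json
from pathlib import Path
from typing import Dict


def summarize_json_outputs(directory: Path) -> Dict[str, int]:
    """Return {file name: number of top-level entries} for every .json file found."""
    summary = {}
    for path in sorted(directory.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)  # raises json.JSONDecodeError on a corrupt file
        summary[path.name] = len(data)
    return summary


if __name__ == "__main__":
    # Assumed output location; use whichever directory the test script wrote to.
    for name, count in summarize_json_outputs(Path("checkpoints")).items():
        print(f"{name}: {count} top-level entries")
```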
Limitations & Notes
- Hardware Requirements: HINT is built on the BLIP-2 architecture, so inference and further fine-tuning require GPUs with sufficient memory (an NVIDIA A40 48 GB or V100 32 GB is recommended).
- Intended Use: These weights are provided for academic research and to facilitate reproducibility of the ICASSP 2026 paper.
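To compare a machine against the hardware note above, the following sketch lists the total memory of each visible CUDA device. It relies on PyTorch's `torch.cuda` utilities when PyTorch is installed and degrades gracefully otherwise; `gpu_memory_gib` is a hypothetical helper name, and the sketch reports capacity only, not how much memory HINT actually needs.

```python
from typing import List, Optional, Tuple


def gpu_memory_gib() -> Optional[List[Tuple[str, float]]]:
    """Return (device name, total memory in GiB) per visible CUDA GPU.

    Returns None when PyTorch is not installed, and an empty list when
    PyTorch is installed but no CUDA device is visible.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return []
    devices = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        devices.append((props.name, props.total_memory / 1024 ** 3))
    return devices


result = gpu_memory_gib()
if result is None:
    print("PyTorch is not installed in this environment")
elif not result:
    print("no CUDA device visible")
else:
    for name, gib in result:
        print(f"{name}: {gib:.1f} GiB")
```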
Citation
If you find our work, code, or these model weights useful in your research, please consider starring our GitHub repository and citing our paper:
@inproceedings{HINT2026,
title={HINT: Composed Image Retrieval with Dual-Path Compositional Contextualized Network},
author={Zhang, Mingyu and Li, Zixu and Chen, Zhiwei and Fu, Zhiheng and Zhu, Xiaowei and Nie, Jiajia and Wei, Yinwei and Hu, Yupeng},
booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2026}
}