# DiffiT: Diffusion Vision Transformers for Image Generation
This repository hosts the pretrained model weights for DiffiT (ECCV 2024), a diffusion model built on Vision Transformers that achieves state-of-the-art image generation quality with improved parameter efficiency.
## Overview
DiffiT (Diffusion Vision Transformers) is a generative model that combines the expressive power of diffusion models with Vision Transformers (ViTs), introducing Time-dependent Multihead Self-Attention (TMSA) for fine-grained control over the denoising process at each diffusion timestep. DiffiT achieves state-of-the-art performance on class-conditional ImageNet generation at multiple resolutions, attaining an FID score of 1.73 on ImageNet-256 while using 19.85% and 16.88% fewer parameters than comparable Transformer-based diffusion models such as MDT and DiT, respectively.
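The core idea of TMSA can be sketched in a few lines: the query, key, and value projections each receive both the spatial token and a time-token embedding, so the attention pattern itself changes with the diffusion timestep. The sketch below is illustrative only (single head, NumPy, relative positional bias omitted); the weight names are assumptions, not the repository's actual implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def tmsa_single_head(x_s, x_t, W_qs, W_qt, W_ks, W_kt, W_vs, W_vt):
    """Single-head time-dependent self-attention (illustrative sketch).

    x_s: (N, d) spatial tokens; x_t: (d,) time-token embedding shared by
    all tokens at the current diffusion step. Each of q, k, v is the sum
    of a spatial projection and a temporal projection, so the attention
    weights depend on the timestep, not only on image content.
    """
    q = x_s @ W_qs + x_t @ W_qt           # (N, d), time term broadcasts
    k = x_s @ W_ks + x_t @ W_kt
    v = x_s @ W_vs + x_t @ W_vt
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))  # (N, N), timestep-dependent
    return attn @ v                       # (N, d)
```

Because the temporal terms shift q and k jointly, the same image tokens can attend differently early in denoising (coarse structure) than late (fine detail).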
## Models
### ImageNet-256
| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|---|---|---|---|---|---|
| DiffiT | ImageNet | 256×256 | 1.73 | 276.49 | model |
### ImageNet-512
| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|---|---|---|---|---|---|
| DiffiT | ImageNet | 512×512 | 2.67 | 252.12 | model |
## Usage
Please refer to the official GitHub repository for full setup instructions, training code, and evaluation scripts.
### Sampling Images
Image sampling is performed with `sample.py` from the DiffiT repository. To reproduce the reported numbers, use the commands below.
ImageNet-256:
```bash
python sample.py \
    --log_dir $LOG_DIR \
    --cfg_scale 4.4 \
    --model_path $MODEL \
    --image_size 256 \
    --model Diffit \
    --num_sampling_steps 250 \
    --num_samples 50000 \
    --cfg_cond True
```
ImageNet-512:
```bash
python sample.py \
    --log_dir $LOG_DIR \
    --cfg_scale 1.49 \
    --model_path $MODEL \
    --image_size 512 \
    --model Diffit \
    --num_sampling_steps 250 \
    --num_samples 50000 \
    --cfg_cond True
```
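Both commands use classifier-free guidance (`--cfg_scale`). Under the standard convention, the guided noise prediction interpolates beyond the conditional output along the direction away from the unconditional one; a minimal sketch of that combination rule follows (it is an assumption that `sample.py` uses this exact convention, where a scale of 1.0 means no extra guidance):

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, cfg_scale):
    """Standard classifier-free guidance combination.

    cfg_scale = 1.0 recovers the purely conditional prediction; larger
    values push samples harder toward the class condition, trading
    diversity for fidelity. The reported FIDs use 4.4 at 256x256 and
    1.49 at 512x512.
    """
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)
```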
## Evaluation
Once images have been sampled, you can compute FID and other metrics with the provided `eval_run.sh` script in the repository. The evaluation pipeline follows the protocol from `openai/guided-diffusion/evaluations`.
```bash
bash eval_run.sh
```
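For reference, FID compares Gaussian fits to Inception features of real and generated images: FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}). The self-contained NumPy sketch below implements just this formula on precomputed statistics; the actual protocol extracts Inception-V3 activations over 50K samples, which this sketch does not do.

```python
import numpy as np

def _sqrtm_psd(a):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """FID between two Gaussians N(mu1, sigma1) and N(mu2, sigma2).

    Uses the identity Tr((S1 S2)^{1/2}) = Tr((S1^{1/2} S2 S1^{1/2})^{1/2})
    so only symmetric-PSD matrix square roots are needed.
    """
    diff = mu1 - mu2
    s1_half = _sqrtm_psd(sigma1)
    covmean = _sqrtm_psd(s1_half @ sigma2 @ s1_half)
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```

Identical statistics give a distance of zero; shifting only the mean by a vector `delta` adds exactly `||delta||^2`.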
## Citation
```bibtex
@inproceedings{hatamizadeh2025diffit,
  title={Diffit: Diffusion vision transformers for image generation},
  author={Hatamizadeh, Ali and Song, Jiaming and Liu, Guilin and Kautz, Jan and Vahdat, Arash},
  booktitle={European Conference on Computer Vision},
  pages={37--55},
  year={2025},
  organization={Springer}
}
```
## License
Copyright © 2024, NVIDIA Corporation. All rights reserved.
The code is released under the NVIDIA Source Code License-NC. The pretrained models are shared under the CC BY-NC-SA 4.0 license: if you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

