JoyAI-Image-Edit-Plus (ComfyUI weights)
Single-file .safetensors checkpoints of JoyAI-Image-Edit-Plus, repackaged for native ComfyUI support (no custom node required).
JoyAI-Image-Edit-Plus is the multi-image instruction-guided editing model of the JoyAI-Image family. It accepts 1–6 reference images and a text instruction, and generates a new image that combines elements from the references according to the instruction.
Files
| File | Size | Goes into | Component |
|---|---|---|---|
| diffusion_models/joy_image_edit_plus_bf16.safetensors | ~31 GB | ComfyUI/models/diffusion_models/ | JoyImageEditPlusTransformer3DModel (bf16) |
| text_encoders/qwen3vl_joyimage_bf16.safetensors | ~17 GB | ComfyUI/models/text_encoders/ | Qwen3-VL-8B text encoder (bf16) |
| vae/joy_image_edit_vae.safetensors | ~243 MB | ComfyUI/models/vae/ | AutoencoderKLWan |
The repo layout already matches ComfyUI/models/, so a single hf download into your models root drops every file where it needs to go.
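After downloading, a quick check that the three files from the table landed in the right subfolders can save a confusing "model not found" in the loader nodes. The following is a minimal sketch using only the standard library; the helper name `missing_weights` and the default path are illustrative, not part of ComfyUI.

```python
from pathlib import Path

# Relative paths taken from the Files table above.
EXPECTED_FILES = (
    "diffusion_models/joy_image_edit_plus_bf16.safetensors",
    "text_encoders/qwen3vl_joyimage_bf16.safetensors",
    "vae/joy_image_edit_vae.safetensors",
)


def missing_weights(models_root):
    """Return the expected relative paths that are absent under models_root."""
    root = Path(models_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    import sys

    # Hypothetical default; pass your own ComfyUI/models path as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "ComfyUI/models"
    missing = missing_weights(root)
    print("all weights in place" if not missing else f"missing: {missing}")
```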
Model architecture
- Transformer: 40-layer DiT, hidden size 4096, 32 heads, in/out channels 16, patch size [1, 2, 2], 3D RoPE (rope_dim_list = [16, 56, 56], theta 10000). Each reference image is patchified independently and concatenated on the sequence dimension with a per-image temporal offset in the 3D RoPE grid, so references may differ in resolution.
- Text encoder: Qwen3VLForConditionalGeneration (text dim 4096). The instruction is wrapped with one <|vision_start|><|image_pad|><|vision_end|> block per reference image.
- VAE: AutoencoderKLWan (z_dim 16, spatial downscale 8, temporal downscale 4), the same VAE used by the single-image edit model.
- Scheduler: FlowMatch (Euler), sampling shift 1.5.
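The numbers above fit together arithmetically: 4096 / 32 gives a head dimension of 128, which equals 16 + 56 + 56, and one transformer token covers 8 × 2 = 16 pixels per side. The sketch below works this out and shows how references of different sizes could occupy consecutive spans of one sequence. Two things are assumptions for illustration only: that rope_dim_list is ordered (temporal, height, width), and that the i-th reference simply gets temporal index i. The actual offset scheme is defined by the model code, not by this snippet.

```python
HIDDEN_SIZE = 4096
NUM_HEADS = 32
ROPE_DIM_LIST = (16, 56, 56)  # assumed order: temporal, height, width
VAE_SPATIAL_DOWNSCALE = 8
PATCH_SIZE = (1, 2, 2)        # temporal, height, width


def head_dim():
    """Per-head channel count of the transformer."""
    return HIDDEN_SIZE // NUM_HEADS


def token_grid(height, width):
    """(rows, cols) of transformer tokens for one image of height x width pixels."""
    step_h = VAE_SPATIAL_DOWNSCALE * PATCH_SIZE[1]
    step_w = VAE_SPATIAL_DOWNSCALE * PATCH_SIZE[2]
    if height % step_h or width % step_w:
        raise ValueError(f"size must be a multiple of {step_h}x{step_w} pixels")
    return height // step_h, width // step_w


def reference_layout(sizes):
    """Give each (height, width) reference a token span in the concatenated sequence.

    ASSUMPTION: the temporal RoPE index of the i-th reference is i (illustrative).
    """
    layout, start = [], 0
    for t_index, (height, width) in enumerate(sizes):
        rows, cols = token_grid(height, width)
        count = rows * cols
        layout.append({"t": t_index, "grid": (rows, cols), "tokens": (start, start + count)})
        start += count
    return layout
```

Because each reference keeps its own (rows, cols) grid and only the temporal index separates it from the others, nothing forces the references to share a resolution.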
Weight names are byte-identical to the diffusers checkpoint (894 transformer keys, zero renaming); ComfyUI auto-detects the model as joyimage.
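To inspect the key names yourself without loading 31 GB of weights, the safetensors header can be read directly: the file starts with an 8-byte little-endian length followed by a JSON header that maps tensor names to dtype, shape, and offsets. The helper below is a minimal standard-library sketch; `tensor_names` is a hypothetical name, not a safetensors or ComfyUI function.

```python
import json
import struct


def tensor_names(path):
    """Read tensor names from a .safetensors header without loading any weights."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sorted(name for name in header if name != "__metadata__")
```

Calling `len(tensor_names(...))` on the diffusion model file and comparing the result with the 894 keys stated above is a cheap way to confirm the download is the file you expect.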
Installation
The model runs natively in ComfyUI. Native support is proposed upstream in Comfy-Org/ComfyUI#14428; until it is merged, install the fork branch:
git clone -b joyimage-edit-pr https://github.com/feice-huang/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
Once the PR is merged upstream, the stock ComfyUI release will run these weights with no fork needed.
Then download the weights straight into ComfyUI/models/:
hf download jdopensource/JoyAI-Image-Edit-Plus-ComfyUI \
--local-dir /path/to/ComfyUI/models
Restart ComfyUI.
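For scripted setups, the download step can also be driven from Python through `huggingface_hub.snapshot_download`. This is a sketch, assuming `huggingface_hub` is installed; the `weights_only` switch is an illustrative addition that skips non-weight files such as the README, so leave it off if you also want the example workflow JSON.

```python
REPO_ID = "jdopensource/JoyAI-Image-Edit-Plus-ComfyUI"


def download_weights(models_root, weights_only=True):
    """Fetch the repo into models_root, mirroring the hf download command above."""
    # Imported lazily so this module can be imported without huggingface_hub present.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        local_dir=models_root,
        # Optional filter: only fetch the .safetensors files.
        allow_patterns=["*.safetensors"] if weights_only else None,
    )
```

Usage would look like `download_weights("/path/to/ComfyUI/models")`, followed by the same restart.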
Usage
Example workflow: workflow_joyimage_edit.json
Build the graph from these native nodes:
- Load Diffusion Model (UNETLoader) → diffusion_models/joy_image_edit_plus_bf16.safetensors
- Load CLIP (CLIPLoader) → text_encoders/qwen3vl_joyimage_bf16.safetensors, type joyimage
- Load VAE (VAELoader) → vae/joy_image_edit_vae.safetensors
- Load Image (LoadImage) for each reference (1–6)
- TextEncodeJoyImageEditPlus: feed clip, vae, the instruction, and the reference images into image1…image6. Wire one instance for the positive prompt and one (empty prompt, same images) for the negative. Each node bucket-resizes the references to the 1024-base buckets, VAE-encodes them, and appends the reference latents to the conditioning; its image output feeds VAEDecode / empty-latent sizing.
- KSampler → VAEDecode → SaveImage
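If you drive ComfyUI through API-format workflows rather than the graph editor, the positive/negative pairing described above can be sketched as two node entries that share the same links. In that format each node is a dict with `class_type` and `inputs`, and a link is a `[node_id, output_index]` pair. The input name `prompt` for the instruction is an assumption here (the list above only names `clip`, `vae`, and `image1`…`image6`); check the node definition in your ComfyUI build or the example workflow for the real name.

```python
MAX_REFERENCES = 6


def encode_nodes(instruction, image_links, clip_link, vae_link):
    """Build positive and negative TextEncodeJoyImageEditPlus entries (API format).

    Every link is a [node_id, output_index] pair pointing at an upstream node.
    """
    if not 1 <= len(image_links) <= MAX_REFERENCES:
        raise ValueError(
            f"expected 1 to {MAX_REFERENCES} reference images, got {len(image_links)}"
        )

    def node(text):
        # ASSUMPTION: "prompt" is a placeholder name for the instruction input.
        inputs = {"clip": clip_link, "vae": vae_link, "prompt": text}
        for index, link in enumerate(image_links, start=1):
            inputs[f"image{index}"] = link
        return {"class_type": "TextEncodeJoyImageEditPlus", "inputs": inputs}

    # Negative branch: empty prompt, same reference images.
    return node(instruction), node("")
```

Keeping the image links identical on both branches matters: the negative conditioning then differs from the positive one only in the text, which is what the recommended wiring describes.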
Recommended parameters
| Parameter | Value |
|---|---|
| Steps | 30 |
| CFG | 4.0 |
| Sampler | euler |
| Scheduler | simple |
| dtype | bf16 |
| Resolution | auto (1024-base buckets, per reference) |
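To give a feel for what these settings do, the sketch below computes a 30-step noise schedule with the sampling shift of 1.5 from the architecture section, using the time-shift formula commonly used for flow-matching models, `shift * s / (1 + (shift - 1) * s)`. It is an approximation for intuition, assuming uniformly spaced levels before the shift; ComfyUI's simple scheduler indexes a precomputed sigma table, so its values may differ slightly. The guidance and Euler helpers operate on plain floats purely for illustration.

```python
def shift_sigma(sigma, shift=1.5):
    """Flow-matching time shift; shift > 1 spends more steps at high noise."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def sigma_schedule(steps=30, shift=1.5):
    """steps + 1 noise levels from 1.0 down to 0.0, uniform before the shift."""
    return [shift_sigma(1.0 - i / steps, shift) for i in range(steps + 1)]


def cfg_combine(cond, uncond, cfg=4.0):
    """Classifier-free guidance applied to two velocity predictions."""
    return uncond + cfg * (cond - uncond)


def euler_step(x, velocity, sigma, sigma_next):
    """One Euler update of the flow ODE dx/dsigma = velocity."""
    return x + (sigma_next - sigma) * velocity
```

With a shift of 1.5, the midpoint of the unshifted schedule (0.5) maps to 0.6, so slightly more of the 30 steps are spent at higher noise levels than with an unshifted schedule.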
Example
Prompt: "The woman is lovingly holding the cute puppy in her arms"
Model details
- Developed by: JD.com
- License: Apache-2.0
- Framework: PyTorch / ComfyUI
Links
- Source code and documentation: github.com/jd-opensource/JoyAI-Image
- Original Diffusers-format weights: jdopensource/JoyAI-Image-Edit-Plus-Diffusers
- Single-image edit model (ComfyUI): jdopensource/JoyAI-Image-Edit-ComfyUI
Citation
@misc{joyai-image-2025,
title={JoyAI-Image: A Unified Multimodal Foundation Model for Image Understanding, Generation, and Editing},
author={Joy Future Academy, JD},
year={2025},
url={https://github.com/jd-opensource/JoyAI-Image}
}


