Part of the FIRM-Reward collection: the data and models of "Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation".
This model is a fine-tuned version of Qwen/Qwen3-VL-8B-Instruct on the `instruction_following_train_v3` and `consistency_train_v3` datasets. Its training and validation losses on the evaluation set are reported in the table below.
More information needed
The following hyperparameters were used during training:

Training results:
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 0.5910 | 0.2182 | 500 | 0.5827 |
| 0.5605 | 0.4364 | 1000 | 0.5460 |
| 0.5252 | 0.6546 | 1500 | 0.5199 |
| 0.5075 | 0.8728 | 2000 | 0.5055 |
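
The epoch and step columns together imply how many optimizer steps make up one epoch, and the validation-loss column shows how quickly the improvement tapers off. The snippet below is a minimal sketch of that arithmetic using only the four rows logged above; the helper names `steps_per_epoch` and `validation_loss_drops` are hypothetical and not part of any training script for this model.

```python
# Rows copied from the training-results table:
# (training loss, epoch, step, validation loss)
rows = [
    (0.591, 0.2182, 500, 0.5827),
    (0.5605, 0.4364, 1000, 0.5460),
    (0.5252, 0.6546, 1500, 0.5199),
    (0.5075, 0.8728, 2000, 0.5055),
]


def steps_per_epoch(rows):
    """Estimate steps per epoch as step / epoch, averaged over all rows."""
    estimates = [step / epoch for _, epoch, step, _ in rows]
    return sum(estimates) / len(estimates)


def validation_loss_drops(rows):
    """Decrease in validation loss between consecutive evaluations."""
    losses = [row[3] for row in rows]
    return [round(prev - curr, 4) for prev, curr in zip(losses, losses[1:])]


est = steps_per_epoch(rows)
drops = validation_loss_drops(rows)
print(f"estimated steps per epoch: {est:.1f}")
print(f"validation loss drops per 500 steps: {drops}")
```

Under these numbers one epoch comes to roughly 2,291 steps, so the last logged evaluation (step 2000, epoch 0.8728) falls before the end of the first epoch, and each successive 500-step interval lowers the validation loss by less than the previous one.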