# TinyFlux

A roughly 1/12-scale Flux architecture for experimentation and research. TinyFlux maintains the core MMDiT (Multimodal Diffusion Transformer) design of Flux while dramatically reducing the parameter count for faster iteration and lower resource requirements.
## Model Description
TinyFlux is a miniaturized version of FLUX.1-schnell that preserves the essential architectural components:
- Double-stream blocks (MMDiT style) - separate text/image pathways with joint attention
- Single-stream blocks - concatenated text+image with shared weights
- AdaLN-Zero modulation - adaptive layer norm with gating (sketched after this list)
- 3D RoPE - rotary position embeddings for temporal + spatial positions
- Flow matching - rectified flow training objective
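
As a concrete illustration of the AdaLN-Zero component above, here is a minimal PyTorch sketch of the modulation mechanism. This is a hypothetical module written for clarity, not the exact code in model.py:

```python
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    """Adaptive LayerNorm with zero-initialized gating.

    A conditioning vector (e.g. timestep + pooled text embedding)
    produces per-channel shift, scale, and gate terms.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size, elementwise_affine=False)
        self.proj = nn.Linear(hidden_size, 3 * hidden_size)
        nn.init.zeros_(self.proj.weight)  # "Zero": each block starts as an identity map
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor):
        # cond: (batch, hidden) -> shift / scale / gate, each (batch, hidden)
        shift, scale, gate = self.proj(nn.functional.silu(cond)).chunk(3, dim=-1)
        x_mod = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x_mod, gate.unsqueeze(1)  # gate scales the sub-block's residual output
```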
## Architecture Comparison
| Component | Flux | TinyFlux | Scale |
|---|---|---|---|
| Hidden size | 3072 | 256 | /12 |
| Attention heads | 24 | 2 | /12 |
| Head dimension | 128 | 128 | preserved |
| Double-stream layers | 19 | 3 | /6 |
| Single-stream layers | 38 | 3 | /12 |
| VAE channels | 16 | 16 | preserved |
| Total params | ~12B | ~8M | /1500 |
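
Expressed as a configuration, the table corresponds roughly to the following (a hypothetical dataclass for illustration; the authoritative field names are in model.py):

```python
from dataclasses import dataclass

@dataclass
class TinyFluxConfigSketch:
    hidden_size: int = 256         # Flux: 3072  (/12)
    num_attention_heads: int = 2   # Flux: 24    (/12); 2 heads * 128 dim = 256
    head_dim: int = 128            # preserved
    num_double_layers: int = 3     # Flux: 19    (~/6)
    num_single_layers: int = 3     # Flux: 38    (~/12)
    vae_channels: int = 16         # preserved
```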
## Text Encoders
TinyFlux uses smaller text encoders than standard Flux:
| Role | Flux | TinyFlux |
|---|---|---|
| Sequence encoder | T5-XXL (4096 dim) | flan-t5-base (768 dim) |
| Pooled encoder | CLIP-L (768 dim) | CLIP-L (768 dim) |
## Training

### Dataset
Trained on AbstractPhil/flux-schnell-teacher-latents:
- 10,000 samples
- Pre-computed VAE latents (16, 64, 64) from 512×512 images
- Diverse prompts covering people, objects, scenes, styles
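
For inspection, the dataset can presumably be loaded with the `datasets` library (assuming a standard Hub layout; the column names below are illustrative):

```python
from datasets import load_dataset

ds = load_dataset("AbstractPhil/flux-schnell-teacher-latents", split="train")
print(len(ds))       # expected: 10,000 samples
print(ds[0].keys())  # actual column names depend on how the latents were stored
```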
### Training Details
- Objective: Flow matching (rectified flow)
- Timestep sampling: Logit-normal with Flux shift (s=3.0); see the sketch after this list
- Loss weighting: Min-SNR-γ (γ=5.0)
- Optimizer: AdamW (lr=1e-4, β=(0.9, 0.99), wd=0.01)
- Schedule: Cosine with warmup
- Precision: bfloat16
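
The timestep sampling can be sketched as follows; this assumes the standard Flux shift form t' = s·t / (1 + (s − 1)·t) and is an approximation of, not a copy of, the code in train_colab.py:

```python
import torch

def sample_timesteps(batch_size: int, shift: float = 3.0, device: str = "cuda") -> torch.Tensor:
    # Logit-normal: sigmoid of a standard normal concentrates samples
    # in the middle of the trajectory, where velocity is hardest to predict.
    t = torch.sigmoid(torch.randn(batch_size, device=device))
    # Flux-style shift (assumed form): biases sampling along the schedule.
    return shift * t / (1 + (shift - 1) * t)
```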
### Flow Matching Formulation
```
x_t      = (1 - t) * noise + t * data      # interpolation, t in [0, 1]
v_target = data - noise                    # constant velocity along the straight path
loss     = MSE(v_pred, v_target) * min_snr_weight(t)
```
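
Putting the pieces together, a hedged sketch of the weighted loss (assuming SNR(t) = (t / (1 − t))² under this interpolation and a min(SNR, γ)/SNR-style clip; the exact weighting is defined in train_colab.py):

```python
import torch

def min_snr_weight(t: torch.Tensor, gamma: float = 5.0, eps: float = 1e-8) -> torch.Tensor:
    # Under x_t = (1 - t) * noise + t * data, signal coeff = t and noise
    # coeff = 1 - t, so SNR(t) = (t / (1 - t))^2. Clipping the SNR at gamma
    # down-weights easy (high-SNR) timesteps.
    snr = (t / (1 - t + eps)) ** 2
    return torch.clamp(snr, max=gamma) / (snr + eps)

def flow_matching_loss(v_pred, data, noise, t, gamma: float = 5.0):
    v_target = data - noise  # constant velocity of the straight noise->data path
    per_sample = ((v_pred - v_target) ** 2).flatten(1).mean(dim=1)
    return (per_sample * min_snr_weight(t, gamma)).mean()
```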
## Usage
### Installation

```bash
pip install torch transformers diffusers safetensors huggingface_hub
```
### Inference
```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import T5EncoderModel, T5Tokenizer, CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL

# Load model (copy the TinyFlux class definition from model.py first)
config = TinyFluxConfig()
model = TinyFlux(config).to("cuda").to(torch.bfloat16)
weights = load_file(hf_hub_download("AbstractPhil/tiny-flux", "model.safetensors"))
model.load_state_dict(weights)
model.eval()

# Load text encoders and the Flux VAE
t5_tok = T5Tokenizer.from_pretrained("google/flan-t5-base")
t5_enc = T5EncoderModel.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16).to("cuda")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.bfloat16).to("cuda")
vae = AutoencoderKL.from_pretrained("black-forest-labs/FLUX.1-schnell", subfolder="vae", torch_dtype=torch.bfloat16).to("cuda")

# Encode the prompt: T5 supplies the sequence embedding, CLIP the pooled embedding
prompt = "a photo of a cat"
with torch.no_grad():
    t5_in = t5_tok(prompt, max_length=128, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
    t5_out = t5_enc(**t5_in).last_hidden_state
    clip_in = clip_tok(prompt, max_length=77, padding="max_length", truncation=True, return_tensors="pt").to("cuda")
    clip_out = clip_enc(**clip_in).pooler_output

# Euler sampling (t: 0 -> 1, noise -> data); 64*64 latent tokens, 16 channels each
x = torch.randn(1, 64 * 64, 16, device="cuda", dtype=torch.bfloat16)
img_ids = TinyFlux.create_img_ids(1, 64, 64, "cuda")
timesteps = torch.linspace(0, 1, 21, device="cuda")
guidance = torch.tensor([3.5], device="cuda", dtype=torch.bfloat16)

with torch.no_grad():
    for i in range(20):
        t = timesteps[i].unsqueeze(0)
        dt = timesteps[i + 1] - timesteps[i]
        v = model(
            hidden_states=x,
            encoder_hidden_states=t5_out,
            pooled_projections=clip_out,
            timestep=t,
            img_ids=img_ids,
            guidance=guidance,
        )
        x = x + v * dt  # Euler step along the predicted velocity field

    # Decode: unpack the (1, 4096, 16) token sequence into (1, 16, 64, 64) latents
    latents = x.reshape(1, 64, 64, 16).permute(0, 3, 1, 2)
    latents = latents / vae.config.scaling_factor
    image = vae.decode(latents.float()).sample
    image = (image / 2 + 0.5).clamp(0, 1)
```
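
To save the decoded tensor, a standard tensor-to-PIL conversion works (not TinyFlux-specific):

```python
import numpy as np
from PIL import Image

array = (image[0].permute(1, 2, 0).cpu().numpy() * 255).round().astype(np.uint8)
Image.fromarray(array).save("tinyflux_sample.png")
```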
### Full Inference Script
See inference_colab.py in this repository for a complete generation pipeline with:
- Classifier-free guidance (sketched below)
- Batch generation
- Image saving
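
A minimal sketch of the classifier-free guidance step, assuming null embeddings `t5_null` / `clip_null` computed from an empty prompt exactly as in the inference snippet above (a hypothetical helper; see inference_colab.py for the real implementation):

```python
def guided_velocity(model, x, t, img_ids, guidance,
                    t5_out, clip_out, t5_null, clip_null, scale: float = 3.5):
    # Standard CFG: extrapolate from the unconditional velocity toward
    # the conditional one by the guidance scale.
    v_cond = model(hidden_states=x, encoder_hidden_states=t5_out,
                   pooled_projections=clip_out, timestep=t,
                   img_ids=img_ids, guidance=guidance)
    v_uncond = model(hidden_states=x, encoder_hidden_states=t5_null,
                     pooled_projections=clip_null, timestep=t,
                     img_ids=img_ids, guidance=guidance)
    return v_uncond + scale * (v_cond - v_uncond)
```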
## Files
```
AbstractPhil/tiny-flux/
├── model.safetensors      # Model weights (~32MB)
├── config.json            # Model configuration
├── README.md              # This file
├── model.py               # Model architecture definition
├── inference_colab.py     # Inference script
├── train_colab.py         # Training script
├── checkpoints/           # Training checkpoints
│   └── step_*.safetensors
├── logs/                  # TensorBoard logs
└── samples/               # Generated samples during training
```
## Limitations
- Resolution: Trained on 512×512 only
- Quality: Significantly lower than full Flux due to reduced capacity
- Text understanding: Limited by smaller T5 encoder (768 vs 4096 dim)
- Fine details: May struggle with complex scenes or fine-grained details
- Experimental: Intended for research and learning, not production use
## Intended Use
- Understanding Flux/MMDiT architecture
- Rapid prototyping and experimentation
- Educational purposes
- Resource-constrained environments
- Baseline for architecture modifications
## Citation
If you use TinyFlux in your research, please cite:
```bibtex
@misc{tinyflux2025,
  title={TinyFlux: A Miniaturized Flux Architecture for Experimentation},
  author={AbstractPhil},
  year={2025},
  url={https://huggingface.co/AbstractPhil/tiny-flux}
}
```
## Acknowledgments
- Black Forest Labs for the original Flux architecture
- Hugging Face for the diffusers and transformers libraries
## License
MIT License - See LICENSE file for details.
**Note:** This is an experimental research model. For high-quality image generation, use the full FLUX.1-schnell or FLUX.1-dev models.