Image model · MMDiT
Qwen-Image requirements
MMDiT image model · 20B params · 1328×1328 · 20-50 steps · released Aug 2025. Realistic minimum to run: NVIDIA GeForce RTX 4060 Ti (16GB) at Q4_K_M GGUF.
Backbone size by precision
| Precision | Size |
|---|---|
| fp16 / bf16 | 40.9 GB |
| fp8 | 20.4 GB |
| Q8 GGUF | 21.8 GB |
| Q4 GGUF (recommended) | 13.1 GB |
| Q2 GGUF | 7.06 GB |
Sizes are for the backbone weights only. The device verdicts use peak VRAM consumed at Q4_K_M GGUF, not the file size.
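As a rough sanity check on the table, the file sizes can be converted to effective bits per weight. This is a minimal sketch that assumes the nominal 20B parameter count from the spec line and 1 GB = 1e9 bytes. The true parameter count may differ slightly, so treat the results as approximate.

```python
# Nominal parameter count from the spec line (20B); an approximation.
PARAMS = 20e9

# Backbone file sizes in GB, copied from the table above.
SIZES_GB = {
    "fp16 / bf16": 40.9,
    "fp8": 20.4,
    "Q8 GGUF": 21.8,
    "Q4 GGUF": 13.1,
    "Q2 GGUF": 7.06,
}


def bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    """Effective bits stored per parameter, assuming 1 GB = 1e9 bytes."""
    return size_gb * 1e9 * 8 / params


for name, size_gb in SIZES_GB.items():
    print(f"{name:12s} {bits_per_weight(size_gb):5.2f} bits/weight")
```

The GGUF rows come out above their nominal bit width. That is plausibly because quantized files also store scaling data and may keep some tensors at higher precision, but this is an inference from the numbers and not something the table states.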
Pipeline components
| Component | Size |
|---|---|
| Qwen2.5-VL 7B text encoder (offloaded) | 9.38 GB |
| VAE | 0.25 GB |
Encoders marked “offloaded” move to CPU before denoising, so they do not count toward peak VRAM.
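The effect of offloading on peak VRAM can be sketched as simple addition over whichever components are resident during denoising. The function name and the overhead parameter are illustrative assumptions, not the method used to produce the figures on this page. The sizes come from the two tables above.

```python
def peak_vram_gb(backbone_gb, resident_components_gb=(), overhead_gb=0.0):
    """Sum the weights resident on the GPU during denoising.

    overhead_gb is a placeholder for activations and runtime buffers;
    its real value depends on resolution and runtime (an assumption here).
    """
    return backbone_gb + sum(resident_components_gb) + overhead_gb


Q4_BACKBONE_GB = 13.1   # Q4 GGUF backbone, from the precision table
VAE_GB = 0.25           # from the components table
TEXT_ENCODER_GB = 9.38  # from the components table

# Text encoder offloaded to CPU before denoising: it does not count.
offloaded = peak_vram_gb(Q4_BACKBONE_GB, [VAE_GB])

# Text encoder kept on the GPU: it counts toward the peak.
resident = peak_vram_gb(Q4_BACKBONE_GB, [VAE_GB, TEXT_ENCODER_GB])

print(f"encoder offloaded: {offloaded:.2f} GB of weights")
print(f"encoder resident:  {resident:.2f} GB of weights")
```

The weights-only total with the encoder offloaded lands a little under the ~14 GB peak quoted in the FAQ. The remainder is presumably activations and runtime overhead, which this sketch leaves at zero.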
Run it
Qwen-Image runs in ComfyUI or Nunchaku (SVDQuant 4-bit). Load the Q4_K_M GGUF backbone together with its text encoder and VAE. Unlike a text LLM, there is no single chat-style command to launch it.
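Because three separate files have to be in place before a workflow will load, a small pre-flight check can help. The sketch below assumes the common ComfyUI folder layout (`models/unet` or `models/diffusion_models` for the backbone, `models/text_encoders`, `models/vae`). Verify these against your own install. The file names are hypothetical placeholders, not the actual release file names.

```python
from pathlib import Path

# Assumed ComfyUI subfolders per component; check your install and custom nodes.
EXPECTED_DIRS = {
    "backbone": ("models/unet", "models/diffusion_models"),
    "text_encoder": ("models/text_encoders",),
    "vae": ("models/vae",),
}


def find_component(root: Path, kind: str, filename: str):
    """Return the first existing path for a component, or None if absent."""
    for sub in EXPECTED_DIRS[kind]:
        candidate = root / sub / filename
        if candidate.is_file():
            return candidate
    return None


def missing_components(root, files: dict) -> list:
    """List the component kinds whose file was not found under root."""
    return [
        kind for kind, name in files.items()
        if find_component(Path(root), kind, name) is None
    ]


if __name__ == "__main__":
    # Hypothetical file names; substitute the ones you downloaded.
    wanted = {
        "backbone": "qwen-image-Q4_K_M.gguf",
        "text_encoder": "qwen2.5-vl-7b-text-encoder.safetensors",
        "vae": "qwen-image-vae.safetensors",
    }
    print("missing:", missing_components("ComfyUI", wanted))
```

An empty list means all three components were found where this sketch expects them. It does not confirm that the files are valid or compatible.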
Which devices can run Qwen-Image?
Apple Silicon Macs
- Apple M1 (8GB) No
- Apple M2 (16GB) No
- Apple M4 (16GB) No
- Apple M5 (16GB) No
- Apple M3 Pro (18GB) No
- Apple M4 (24GB) Yes
- Apple M4 Pro (24GB) Yes
- Apple M5 (32GB) Yes
- Apple M4 Pro (48GB) Yes
- Apple M5 Pro (48GB) Yes
- Apple M4 Max (64GB) Yes
- Apple M4 Max (128GB) Yes
- Apple M5 Max (128GB) Yes
- Apple M3 Ultra (256GB) Yes
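The Yes/No pattern above is consistent with comparing the ~14 GB peak from the FAQ against the share of unified memory a Mac can give to the GPU. The sketch below is a guess at that kind of check, not the page's actual method. The 0.7 usable fraction is a hypothetical value chosen for illustration.

```python
PEAK_GB = 14.0          # realistic Q4_K_M GGUF peak quoted in the FAQ
USABLE_FRACTION = 0.7   # hypothetical share of unified memory usable by the GPU


def fits(unified_memory_gb, peak_gb=PEAK_GB, usable_fraction=USABLE_FRACTION):
    """True if the assumed usable share of unified memory covers the peak."""
    return unified_memory_gb * usable_fraction >= peak_gb


for ram_gb in (8, 16, 18, 24, 32, 48, 64, 128, 256):
    print(f"{ram_gb:>3} GB: {'Yes' if fits(ram_gb) else 'No'}")
```

With this assumed fraction the 8, 16 and 18 GB configurations fail and 24 GB and above pass, matching the list. A discrete GPU would need a different rule, since its VRAM is not shared with the system.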
RAM-only laptops
No mainstream local runtime for a 20B image model on RAM-only laptops yet.
iPhone & iPad
No mainstream local runtime for a 20B image model on iPhone & iPad yet.
Android
No mainstream local runtime for a 20B image model on Android yet.
NVIDIA GPUs
AMD GPUs
FAQ
How much VRAM does Qwen-Image need?
At Q4_K_M GGUF the realistic peak is ~14 GB, versus ~57 GB for full bf16 with every component resident. With aggressive CPU offload (Nunchaku SVDQuant 4-bit) it can drop to ~3 GB, at the cost of much slower generation.
Why is peak VRAM lower than the sum of the files?
The text encoder runs once to encode your prompt and is then offloaded to CPU before the denoising steps, so it is not resident at the memory peak.
Can I use Qwen-Image commercially?
Yes. Qwen-Image is licensed Apache-2.0, which permits commercial use.
Qwen-Image is a 20B MMDiT with a 7B Qwen2.5-VL text encoder. The encoder is offloaded after prompt encoding, so the Q4_K_M GGUF backbone (13.1 GB) sustains ~14 GB during denoising, validated on a 16GB RTX A4000. Full bf16 with the encoder resident is ~57 GB (multi-GPU). Nunchaku SVDQuant 4-bit can offload down to ~3 GB VRAM, slowly. License: Apache-2.0. Sources: Qwen-Image HF card, city96 Qwen-Image GGUF, ComfyUI Qwen-Image tutorial, sandner.art GGUF benchmark.
Sources
The VRAM figure is a sourced peak-usage value at Q4_K_M GGUF, validated 2026-06-15. See methodology.