Image model · MMDiT

Stable Diffusion 3.5 Large requirements

MMDiT image model · 8.1B params · 1024×1024 · 28-40 steps · released Oct 2024. Realistic minimum to run: NVIDIA GeForce RTX 3060 (12GB) at Q4 GGUF.

Stability Community License · Commercial use: conditional

Free for commercial use under $1M annual revenue; an enterprise license is required above that.

Peak VRAM (Q4 GGUF): ~7 GB
All resident: ~19 GB
Offload floor: ~5 GB
Resolution: 1024×1024

Backbone size by precision

Precision	Size on disk
fp16 / bf16	16.5 GB
fp8	14.9 GB
Q8 GGUF	8.78 GB
Q4 GGUF (recommended)	4.77 GB

Backbone weights only. The verdict uses peak VRAM consumed at Q4 GGUF, not the file size.

Pipeline components

Component	Size
CLIP-L text encoder	0.25 GB
OpenCLIP-G text encoder	1.39 GB
T5-XXL text encoder (offloaded)	2.9 GB
VAE	0.17 GB

Encoders marked “offloaded” move to CPU before denoising, so they do not count toward peak VRAM.
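The ~7 GB peak can be roughly reproduced from the two tables above. The following is a minimal sketch, assuming peak VRAM is approximately the Q4 GGUF backbone plus every component that stays on the GPU. Activation memory and runtime overhead are not modeled, and the names used are illustrative only.

```python
# Sizes in GB, copied from the tables above.
BACKBONE_Q4_GGUF = 4.77

COMPONENTS = {
    # name: (size_gb, offloaded_to_cpu_before_denoising)
    "CLIP-L text encoder": (0.25, False),
    "OpenCLIP-G text encoder": (1.39, False),
    "T5-XXL text encoder": (2.9, True),
    "VAE": (0.17, False),
}


def resident_weights_gb(backbone_gb, components):
    """Sum the backbone and every component that is not offloaded to CPU."""
    resident = sum(size for size, offloaded in components.values() if not offloaded)
    return backbone_gb + resident


weights_gb = resident_weights_gb(BACKBONE_Q4_GGUF, COMPONENTS)
print(f"Resident weights at Q4 GGUF: {weights_gb:.2f} GB")  # prints 6.58 GB
```

The resident weights alone come to about 6.58 GB. The gap up to the quoted ~7 GB is assumed here to be activations and runtime overhead, which this sketch does not measure.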

Run it

Stable Diffusion 3.5 Large runs in ComfyUI, Draw Things or diffusers. Load the Q4 GGUF backbone with its text encoder and VAE; there is no single chat command like a text LLM.
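For the diffusers route, loading could look roughly like the sketch below. It assumes a recent diffusers release with GGUF support (plus the `gguf` and `accelerate` packages), access to the gated `stabilityai/stable-diffusion-3.5-large` repository for the text encoders and VAE, and a Q4 GGUF backbone file you have already downloaded. The helper names `build_pipeline` and `generate` and the `gguf_path` argument are hypothetical, and the encoders here load unquantized from the base repository, so system RAM use will be higher than the component table suggests.

```python
def build_pipeline(gguf_path, base_repo="stabilityai/stable-diffusion-3.5-large"):
    """Load a GGUF-quantized SD3.5 Large backbone into a diffusers pipeline."""
    # Imported lazily so this file can be read or imported without a GPU stack.
    import torch
    from diffusers import (
        GGUFQuantizationConfig,
        SD3Transformer2DModel,
        StableDiffusion3Pipeline,
    )

    # Backbone: quantized weights from the GGUF file, computed in bf16.
    transformer = SD3Transformer2DModel.from_single_file(
        gguf_path,
        quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
        torch_dtype=torch.bfloat16,
    )

    # Text encoders and VAE come from the base repository.
    pipe = StableDiffusion3Pipeline.from_pretrained(
        base_repo,
        transformer=transformer,
        torch_dtype=torch.bfloat16,
    )

    # Keep only the sub-model currently in use on the GPU.
    pipe.enable_model_cpu_offload()
    return pipe


def generate(pipe, prompt, steps=28, size=1024):
    """Run one generation at the page's stated resolution and step count."""
    result = pipe(prompt, num_inference_steps=steps, height=size, width=size)
    return result.images[0]


# Example (not executed here; the path is a placeholder):
# pipe = build_pipeline("/path/to/your-q4-backbone.gguf")
# generate(pipe, "a lighthouse at dusk").save("out.png")
```

ComfyUI and Draw Things wire up the same pieces (backbone, text encoders, VAE) through their own interfaces rather than through code.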

Which devices can run Stable Diffusion 3.5 Large?

Apple Silicon Macs

RAM-only laptops

No mainstream local runtime for an 8.1B image model on RAM-only laptops yet.

iPhone & iPad

Android

No mainstream local runtime for an 8.1B image model on Android yet.

NVIDIA GPUs

AMD GPUs

AMD Radeon RX 7900 XTX (24GB): Yes

FAQ

How much VRAM does Stable Diffusion 3.5 Large need?

At Q4 GGUF the realistic peak is ~7 GB, versus ~19 GB with every component resident. With aggressive CPU offload it drops to ~5 GB, at the cost of much slower generation.
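As a rough illustration, the three figures in this answer can be turned into a lookup. This is a sketch only: the thresholds are the approximate values quoted above, the function name `pick_mode` is hypothetical, and a real GPU needs some headroom beyond these numbers for the display and other processes.

```python
# Approximate VRAM thresholds in GB, taken from the answer above.
THRESHOLDS = [
    (19.0, "all components resident"),
    (7.0, "Q4 GGUF with the text encoder offloaded"),
    (5.0, "Q4 GGUF with aggressive CPU offload (much slower)"),
]


def pick_mode(vram_gb):
    """Return the least restrictive mode whose threshold fits in vram_gb."""
    for minimum_gb, mode in THRESHOLDS:
        if vram_gb >= minimum_gb:
            return mode
    return "below the offload floor"


for gpu_vram in (24, 12, 6, 4):
    print(f"{gpu_vram} GB -> {pick_mode(gpu_vram)}")
```

Under these assumptions a 12 GB card such as the RTX 3060 lands in the Q4 GGUF tier, which matches the realistic minimum stated at the top of the page.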

Why is peak VRAM lower than the sum of the files?

The T5-XXL text encoder runs once to encode your prompt and is then offloaded to CPU before the denoising steps, so it is not resident at the memory peak.

Can I use Stable Diffusion 3.5 Large commercially?

Conditionally. Free for commercial use under $1M annual revenue; an enterprise license is required above that.

Stable Diffusion 3.5 Large is an 8.1B MMDiT with three text encoders (CLIP-L, OpenCLIP-G, T5-XXL). The 9.8 GB T5-XXL is offloaded to CPU after prompt encoding, so peak VRAM tracks the backbone rather than the sum of all files. At Q4 GGUF (4.77 GB backbone) with T5 offloaded, peak is ~7 GB; this figure is synthesized from the city96 component sizes plus the diffusers offload behavior, not taken from a single measurement. Stability's full-fp16 baseline is 19 GB (11 GB with TensorRT fp8). Sources: Stability SD3.5 announcement, city96 SD3.5 GGUF repo, Stability TensorRT note, diffusers SD3 docs.

Sources

VRAM is a sourced peak-usage anchor at Q4 GGUF (composed from component sizes, not a single measurement), validated 2026-06-15. See methodology.