Stable Video Diffusion (img2vid-XT) requirements
UNet video model · 1.5B params · 1024×576, 25 frames (~4s) · released Nov 2023. Realistic minimum to run: NVIDIA GeForce RTX 3060 (12GB) at fp16 with CPU offload.
Non-commercial research by default; commercial use requires a Stability AI license.
Backbone size by precision
| Precision | Size |
|---|---|
| fp16 / bf16 (recommended) | 9.56 GB |
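This is the size of the distributed svd_xt.safetensors checkpoint, which bundles the UNet, the temporal VAE and the CLIP image encoder. Peak VRAM is dominated by the activation memory for 25 frames at 1024×576, not the file size.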
Pipeline components
| Component | Size |
|---|---|
| CLIP image encoder | 0.5 GB |
| VAE (3D) | 0.5 GB |
Video VAEs are larger than image VAEs because they decode a temporal stack of frames.
Run it
Stable Video Diffusion (img2vid-XT) runs in ComfyUI or Diffusers. Generating more frames or higher resolution raises peak VRAM sharply; the fp16 + offload figure is for the default 25-frame clip.
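A minimal Diffusers sketch of a low-memory run, using the settings this page credits for the under-8 GB figure (model CPU offload, feed-forward chunking, a small decode chunk size). The function name `generate_clip`, the seed, `decode_chunk_size=2` and the output frame rate of 7 fps are illustrative assumptions, not values taken from this page. Treat it as a starting point rather than a tuned configuration.

```python
MODEL_ID = "stabilityai/stable-video-diffusion-img2vid-xt"


def generate_clip(image_path: str, out_path: str = "generated.mp4", seed: int = 42) -> str:
    """Generate the default 25-frame clip from one still image (hypothetical helper)."""
    # Heavy imports live inside the function so the module imports without a GPU stack.
    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import export_to_video, load_image

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16, variant="fp16"
    )
    # Keep only the active component on the GPU; the rest waits in system RAM.
    pipe.enable_model_cpu_offload()
    # Run the UNet feed-forward layers in chunks to lower activation memory.
    pipe.unet.enable_forward_chunking()

    # The model conditions on an image, not a text prompt.
    image = load_image(image_path).resize((1024, 576))
    generator = torch.manual_seed(seed)

    # decode_chunk_size=2 (assumed value) decodes a few frames at a time in the VAE,
    # trading speed for lower peak VRAM.
    frames = pipe(
        image,
        num_frames=25,
        decode_chunk_size=2,
        generator=generator,
    ).frames[0]

    export_to_video(frames, out_path, fps=7)  # fps is an assumed playback rate
    return out_path
```

Calling `generate_clip("input.png")` would write `generated.mp4` next to the script. Dropping the offload and chunking calls should run faster but moves peak VRAM toward the higher figures quoted on this page.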
Which devices can run Stable Video Diffusion (img2vid-XT)?
Apple Silicon Macs
- Apple M1 (8GB) No
- Apple M2 (16GB) Yes
- Apple M4 (16GB) Yes
- Apple M5 (16GB) Yes
- Apple M3 Pro (18GB) Yes
- Apple M4 (24GB) Yes
- Apple M4 Pro (24GB) Yes
- Apple M5 (32GB) Yes
- Apple M4 Pro (48GB) Yes
- Apple M5 Pro (48GB) Yes
- Apple M4 Max (64GB) Yes
- Apple M4 Max (128GB) Yes
- Apple M5 Max (128GB) Yes
- Apple M3 Ultra (256GB) Yes
RAM-only laptops
No mainstream local runtime for a 1.5B video model on RAM-only laptops yet.
iPhone & iPad
No mainstream local runtime for a 1.5B video model on iPhone & iPad yet.
Android
No mainstream local runtime for a 1.5B video model on Android yet.
NVIDIA GPUs
AMD GPUs
FAQ
How much VRAM does Stable Video Diffusion (img2vid-XT) need?
At fp16 with CPU offload the realistic peak is ~8 GB, versus ~22 GB with every component resident on the GPU. The offloaded run is much slower.
Why is peak VRAM lower than the sum of the files?
The pipeline moves each stage off the GPU between passes (model CPU offload), so peak VRAM stays near the active stage rather than the sum of every file.
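The effect can be illustrated with a weights-only sketch. The CLIP and VAE sizes come from the component table on this page; the 3.0 GB UNet figure is an assumption derived from 1.5B parameters at 2 bytes each. Activation memory is deliberately ignored, so the outputs are not the real ~8 GB and ~22 GB peaks, only a picture of "sum of stages" versus "largest stage".

```python
# Weights-only illustration; activation memory is NOT modelled here.
STAGE_WEIGHTS_GB = {
    "clip_image_encoder": 0.5,  # from the component table
    "unet": 3.0,                # assumption: 1.5e9 params * 2 bytes (fp16)
    "vae": 0.5,                 # from the component table
}


def resident_weights_gb(stages: dict, offload: bool) -> float:
    """Weights held on the GPU at the worst moment of a run."""
    if offload:
        # Only the active stage is on the GPU, so the largest stage sets the peak.
        return max(stages.values())
    # Without offload every stage stays resident for the whole run.
    return sum(stages.values())


print(resident_weights_gb(STAGE_WEIGHTS_GB, offload=False))  # 4.0
print(resident_weights_gb(STAGE_WEIGHTS_GB, offload=True))   # 3.0
```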
Can I use Stable Video Diffusion (img2vid-XT) commercially?
Conditionally. Non-commercial research by default; commercial use requires a Stability AI license.
Stability's image-to-video model (~1.5B UNet): you feed it a still image, not a text prompt. The distributed svd_xt.safetensors is a single 9.56 GB checkpoint that bundles the UNet, the temporal VAE and the CLIP image encoder, so its file size is larger than the UNet parameter count alone implies. With model CPU offload, feed-forward chunking and a small decode chunk size, the diffusers docs put it under 8 GB for 25 frames at 1024×576; a straightforward run is ~14-22 GB. There is no text encoder (it conditions on CLIP image embeddings). Community License: non-commercial by default; commercial use requires a Stability AI license. Sources: Stability card, diffusers SVD guide.
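A quick check of the file-size remark above: the sketch converts the ~1.5B UNet parameter count into bytes at two common precisions and compares the result with the 9.56 GB checkpoint. The helper name `weights_gb` is hypothetical, and the calculation says nothing about which precision the checkpoint is actually stored in.

```python
def weights_gb(params: float, bytes_per_param: int) -> float:
    """Raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


UNET_PARAMS = 1.5e9   # approximate parameter count quoted on this page
CHECKPOINT_GB = 9.56  # size of svd_xt.safetensors quoted on this page

unet_fp16 = weights_gb(UNET_PARAMS, 2)  # 2 bytes per parameter
unet_fp32 = weights_gb(UNET_PARAMS, 4)  # 4 bytes per parameter

print(unet_fp16)  # 3.0
print(unet_fp32)  # 6.0

# Either way the UNet alone accounts for less than the full file,
# which is consistent with the checkpoint bundling other components.
print(unet_fp32 < CHECKPOINT_GB)  # True
```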
Sources
VRAM is a sourced peak-usage anchor at fp16 + offload for the default clip length, validated 2026-06-15. See methodology.