Stable Video Diffusion (img2vid-XT) requirements

UNet video model · 1.5B params · 1024×576, 25 frames (~4 s) · released Nov 2023. Realistic minimum to run: NVIDIA GeForce RTX 3060 (12GB) at fp16 + offload.

Stable Video Diffusion Community License · Commercial use: conditional

Non-commercial research by default; commercial use requires a Stability AI license.

Peak VRAM (fp16 + offload): ~8 GB
All resident: ~22 GB
Offload floor: ~8 GB
Clip: 25f / ~4s

Backbone size by precision

Precision	Size
fp16 / bf16 (recommended)	9.56 GB

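This is the size of the distributed svd_xt.safetensors checkpoint, which bundles the UNet, the temporal VAE and the CLIP image encoder. Peak VRAM is dominated by the activation memory for 25 frames at 1024×576, not by the file size.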

Pipeline components

Component	Size
CLIP image encoder	0.5 GB
VAE (3D)	0.5 GB

Video VAEs are larger than image VAEs because they decode a temporal stack of frames.

Run it

Stable Video Diffusion (img2vid-XT) runs in ComfyUI or Diffusers. Generating more frames or higher resolution raises peak VRAM sharply; the fp16 + offload figure is for the default 25-frame clip.
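For the Diffusers route, a minimal sketch of a low-VRAM run might look like the following. It assumes the `torch`, `diffusers` and `accelerate` packages, a CUDA GPU, and the Hugging Face repository id `stabilityai/stable-video-diffusion-img2vid-xt`, none of which are named on this page. The function name `generate_clip`, the file paths, `decode_chunk_size=2`, `fps=7` and the seed are illustrative choices, not measured settings.

```python
def generate_clip(image_path, out_path="clip.mp4", num_frames=25,
                  decode_chunk_size=2, fps=7, seed=42):
    """Image-to-video sketch with the memory-saving switches turned on."""
    # Heavy imports stay inside the function so the module loads without a GPU stack.
    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import export_to_video, load_image

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",  # assumed repo id
        torch_dtype=torch.float16,
        variant="fp16",
    )
    # Keep only the active sub-model on the GPU (requires accelerate).
    pipe.enable_model_cpu_offload()
    # Run the UNet feed-forward layers in chunks to trim activation memory.
    pipe.unet.enable_forward_chunking()

    # The model is conditioned on a still image, not a text prompt.
    image = load_image(image_path).resize((1024, 576))
    generator = torch.manual_seed(seed)

    # A small decode_chunk_size decodes fewer frames per VAE pass: less VRAM, more time.
    frames = pipe(
        image,
        num_frames=num_frames,
        decode_chunk_size=decode_chunk_size,
        generator=generator,
    ).frames[0]
    export_to_video(frames, out_path, fps=fps)
    return out_path
```

Calling `generate_clip("input.png")` downloads the weights on first use. Raising `num_frames` or `decode_chunk_size` trades VRAM for speed in the direction described above.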

Which devices can run Stable Video Diffusion (img2vid-XT)?

Apple Silicon Macs

RAM-only laptops

No mainstream local runtime for a 1.5B video model on RAM-only laptops yet.

iPhone & iPad

No mainstream local runtime for a 1.5B video model on iPhone & iPad yet.

Android

No mainstream local runtime for a 1.5B video model on Android yet.

NVIDIA GPUs

AMD GPUs

AMD Radeon RX 7900 XTX (24GB): Yes

FAQ

How much VRAM does Stable Video Diffusion (img2vid-XT) need?

At fp16 + offload the realistic peak is ~8 GB, versus ~22 GB with every component resident. The ~8 GB figure is the offload floor, and reaching it costs speed: generation is much slower than with everything resident.
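As a rough illustration, the two figures above can be turned into a simple fit check. This is a sketch only: the function name `pick_mode` is hypothetical, the thresholds are the approximate ~8 GB and ~22 GB values from this page, and a real card needs some headroom beyond them.

```python
OFFLOAD_PEAK_GB = 8.0    # approximate peak at fp16 + offload (from this page)
RESIDENT_PEAK_GB = 22.0  # approximate peak with every component resident

def pick_mode(vram_gb: float) -> str:
    """Return the least restrictive run mode that fits in the given VRAM."""
    if vram_gb >= RESIDENT_PEAK_GB:
        return "all resident"
    if vram_gb >= OFFLOAD_PEAK_GB:
        return "fp16 + offload"
    return "insufficient"

# The two cards named on this page.
for name, vram in [("GeForce RTX 3060", 12), ("Radeon RX 7900 XTX", 24)]:
    print(f"{name} ({vram}GB): {pick_mode(vram)}")
```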

Why is peak VRAM lower than the sum of the files?

The pipeline moves each stage off the GPU between passes (model CPU offload), so peak VRAM stays near the footprint of the active stage rather than the sum of every file.
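A toy calculation shows the difference between holding every stage at once and holding one stage at a time. The CLIP and VAE sizes come from the component table on this page; the UNet figure is an assumption (1.5B parameters at 2 bytes each). Only weights are counted here, so the absolute numbers sit well below the real peaks, which are dominated by activation memory.

```python
# Weight footprint per pipeline stage, in GB.
weights_gb = {
    "CLIP image encoder": 0.5,  # from the component table
    "UNet": 3.0,                # assumption: 1.5B params x 2 bytes (fp16)
    "VAE (3D)": 0.5,            # from the component table
}

# Everything resident: the GPU holds all stages at the same time.
all_resident = sum(weights_gb.values())

# Offload: only the active stage is on the GPU, so the largest stage sets the peak.
one_at_a_time = max(weights_gb.values())

print(f"all resident:  {all_resident:.1f} GB of weights")
print(f"with offload:  {one_at_a_time:.1f} GB of weights")
```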

Can I use Stable Video Diffusion (img2vid-XT) commercially?

Conditionally. Non-commercial research by default; commercial use requires a Stability AI license.

Stability's image-to-video model (~1.5B UNet): you feed it a still image, not a text prompt. There is no text encoder; it conditions on CLIP image embeddings.

The distributed svd_xt.safetensors is a single 9.56 GB checkpoint that bundles the UNet, the temporal VAE and the CLIP image encoder, so its file size is larger than the UNet parameter count alone implies.

With model CPU offload, feed-forward chunking and a small decode chunk size, the diffusers docs put it under 8 GB for 25 frames at 1024×576; a straightforward run is ~14-22 GB.

Community License: non-commercial by default, commercial needs Stability approval. Sources: Stability card, diffusers SVD guide.
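The gap between the parameter count and the checkpoint size can be checked with back-of-the-envelope arithmetic. This sketch assumes decimal gigabytes and the usual 2 bytes per parameter for fp16/bf16 and 4 for fp32; it makes no claim about the precision in which the checkpoint is actually stored.

```python
UNET_PARAMS = 1.5e9   # ~1.5B parameters, from this page
CHECKPOINT_GB = 9.56  # svd_xt.safetensors, from this page

BYTES_PER_PARAM = {"fp16 / bf16": 2, "fp32": 4}

def weights_gb(params: float, bytes_per_param: int) -> float:
    """Size of the raw weights in decimal gigabytes."""
    return params * bytes_per_param / 1e9

for precision, nbytes in BYTES_PER_PARAM.items():
    size = weights_gb(UNET_PARAMS, nbytes)
    print(f"UNet at {precision}: {size:.1f} GB (checkpoint: {CHECKPOINT_GB} GB)")
```

Either estimate for the UNet alone is smaller than the 9.56 GB file, which is consistent with the checkpoint bundling more than the UNet.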

Sources

VRAM is a sourced peak-usage anchor at fp16 + offload for the default clip length, validated 2026-06-15. See methodology.