CogVideoX-5B requirements

DiT video model · 5B params · 720×480, 49f (~6s) · released Aug 2024. Realistic minimum to run: NVIDIA GeForce RTX 4090 (24GB) at INT8 / fp8.

License: CogVideoX License · Commercial use: conditional

Free for research; commercial use requires registration and is capped at 1 million visits per month.

Metric	Value
Peak VRAM (INT8 / fp8)	~16 GB
All resident	~26 GB
Offload floor	~5 GB
Clip	49f / ~6s

Backbone size by precision

Precision	Size on disk
fp16 / bf16	11 GB
fp8	5.5 GB
Q4 GGUF	3.5 GB

Backbone weights only. Peak VRAM is dominated by the activation memory for 49 frames at 720×480, not the file size.
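The on-disk sizes in the table follow from parameter count times bytes per parameter. The sketch below reproduces that arithmetic. The parameter count of 5.5e9 is an assumption back-computed from the table's 11 GB fp16 figure (the page rounds the model to "5B"), and `weights_gb` and `effective_bits` are illustrative helper names, not part of any library.

```python
# Assumption: ~5.5e9 backbone parameters, inferred from 11 GB at 2 bytes/param.
N_PARAMS = 5.5e9


def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight file size in GB (1 GB = 1e9 bytes), ignoring file metadata."""
    return n_params * bytes_per_param / 1e9


def effective_bits(file_gb: float, n_params: float) -> float:
    """Average bits stored per parameter for a file of the given size."""
    return file_gb * 1e9 * 8 / n_params


if __name__ == "__main__":
    print("fp16 / bf16:", weights_gb(N_PARAMS, 2.0), "GB")
    print("fp8:", weights_gb(N_PARAMS, 1.0), "GB")
    # A 3.5 GB Q4 GGUF file averages somewhat more than 4 bits per weight
    # under this assumed parameter count.
    print("Q4 GGUF bits/param:", round(effective_bits(3.5, N_PARAMS), 2))
```

None of this covers activation memory, which is what sets peak VRAM for a 49-frame clip.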

Pipeline components

Component	Size
T5-XXL text encoder (offloaded)	2.9 GB
VAE (3D)	0.5 GB

Video VAEs are larger than image VAEs because they decode a temporal stack of frames.

Run it

CogVideoX-5B runs in Diffusers or ComfyUI. Generating more frames or higher resolution raises peak VRAM sharply; the INT8 / fp8 figure is for the default 49-frame clip.
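A minimal sketch of the Diffusers route for the default 49-frame clip follows. It loads the weights in bf16 with model-level CPU offload and VAE tiling; it does not apply the INT8 / fp8 quantization the ~16 GB figure refers to, which needs a separate quantization backend. The repository ID `THUDM/CogVideoX-5b`, the 50 inference steps, and the guidance scale of 6.0 are assumptions, not values stated on this page.

```python
# Defaults for the standard clip. Step count and guidance scale are assumptions.
DEFAULTS = {
    "num_frames": 49,
    "fps": 8,
    "num_inference_steps": 50,
    "guidance_scale": 6.0,
}


def clip_seconds(num_frames: int, fps: int) -> float:
    """Clip duration in seconds for a given frame count and frame rate."""
    return num_frames / fps


def generate(prompt: str, out_path: str = "output.mp4",
             model_id: str = "THUDM/CogVideoX-5b") -> str:
    """Generate one clip and write it to out_path. Requires a CUDA GPU."""
    # Heavy imports live here so the helpers above work without a GPU stack.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    # Keep only the currently active component on the GPU.
    pipe.enable_model_cpu_offload()
    # Decode the video in tiles to reduce VAE memory at the end of the run.
    pipe.vae.enable_tiling()

    frames = pipe(
        prompt=prompt,
        num_frames=DEFAULTS["num_frames"],
        num_inference_steps=DEFAULTS["num_inference_steps"],
        guidance_scale=DEFAULTS["guidance_scale"],
    ).frames[0]
    export_to_video(frames, out_path, fps=DEFAULTS["fps"])
    return out_path


if __name__ == "__main__":
    print(clip_seconds(DEFAULTS["num_frames"], DEFAULTS["fps"]), "seconds")
```

Replacing `enable_model_cpu_offload()` with `enable_sequential_cpu_offload()` trades speed for a lower memory floor.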

Which devices can run CogVideoX-5B?

Apple Silicon Macs

No mainstream local runtime for a 5B video model on Apple Silicon Macs yet.

RAM-only laptops

No mainstream local runtime for a 5B video model on RAM-only laptops yet.

iPhone & iPad

No mainstream local runtime for a 5B video model on iPhone & iPad yet.

Android

No mainstream local runtime for a 5B video model on Android yet.

NVIDIA GPUs

AMD GPUs

AMD Radeon RX 7900 XTX (24GB) Yes

FAQ

How much VRAM does CogVideoX-5B need?

At INT8 / fp8 the realistic peak is ~16 GB, versus ~26 GB with every component resident. With aggressive CPU offload the peak drops to ~5 GB, but generation is much slower.
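The three figures in this answer can be read as thresholds. The sketch below maps an amount of free VRAM to the fastest mode that should fit. `pick_mode` is a hypothetical helper, and the cut-offs are the page's approximate peak figures with no headroom added, so a card sitting exactly at a threshold may still run out of memory.

```python
from typing import Optional

# Approximate peak-VRAM figures from this page, in GB, fastest mode first.
MODES = [
    (26.0, "all resident"),
    (16.0, "INT8 / fp8"),
    (5.0, "INT8 + sequential CPU offload (very slow)"),
]


def pick_mode(free_vram_gb: float) -> Optional[str]:
    """Return the fastest mode whose peak fits in free_vram_gb, else None."""
    for peak_gb, name in MODES:
        if free_vram_gb >= peak_gb:
            return name
    return None


if __name__ == "__main__":
    for gb in (48, 24, 8, 4):
        print(gb, "GB ->", pick_mode(gb))
```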

Why is peak VRAM lower than the sum of the files?

The text encoder is run once to encode your prompt, then offloaded to CPU before the frames are generated, so it is not resident at the memory peak.
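One way to see the effect is to treat a run as non-overlapping phases and take the largest per-phase total instead of the sum of everything. The sketch below uses the component sizes listed on this page. The activation figure is a hypothetical placeholder chosen only so the illustration lands near the ~16 GB peak; it is not a measurement. `peak_gb` is an illustrative helper.

```python
T5_GB = 2.9            # text encoder, from the components table
BACKBONE_FP8_GB = 5.5  # backbone at fp8, from the precision table
VAE_GB = 0.5           # 3D VAE, from the components table
ACTIVATIONS_GB = 10.0  # HYPOTHETICAL placeholder, not a measured value


def peak_gb(phases: list) -> float:
    """Peak memory is the largest per-phase total, since phases do not overlap."""
    return max(sum(phase.values()) for phase in phases)


# Encoder runs first, then leaves the GPU before frames are generated.
offloaded = [
    {"text_encoder": T5_GB},
    {"backbone": BACKBONE_FP8_GB, "vae": VAE_GB, "activations": ACTIVATIONS_GB},
]

# Encoder stays on the GPU for the whole run.
resident = [
    {"text_encoder": T5_GB, "backbone": BACKBONE_FP8_GB,
     "vae": VAE_GB, "activations": ACTIVATIONS_GB},
]

if __name__ == "__main__":
    print("encoder offloaded:", peak_gb(offloaded), "GB")
    print("encoder resident: ", round(peak_gb(resident), 1), "GB")
```

Under these assumptions, offloading the encoder saves exactly its own size at the peak. The toy model is not meant to reproduce the ~26 GB all-resident figure, which the notes attribute to a bf16 path.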

Can I use CogVideoX-5B commercially?

Conditionally. Free for research; commercial use requires registration and is capped at 1 million visits per month.

THUDM's 5B video DiT (720×480, ~6s at 8 fps). The Diffusers docs cite ~16 GB VRAM for the INT8 / fp8-quantized model; the SAT bf16 path without offload needs 26 GB, and full sequential CPU offload with INT8 drops the floor to ~5 GB (very slow). The T5 encoder is offloaded. Custom license: free for research; commercial use requires registration and is capped at 1 million visits per month. Sources: THUDM model card, Diffusers CogVideoX docs, the license.

Sources

VRAM is a sourced peak-usage anchor at INT8 / fp8 for the default clip length, validated 2026-06-15. See methodology.