CogVideoX-5B requirements
DiT video model · 5B params · 720×480, 49 frames (~6s) · released Aug 2024. Realistic minimum to run: NVIDIA GeForce RTX 4090 (24 GB) at INT8 / fp8.
Free for research; commercial use requires registration and is capped at 1 million visits per month.
Backbone size by precision
| Precision | Size |
|---|---|
| fp16 / bf16 | 11 GB |
| fp8 | 5.5 GB |
| Q4 GGUF | 3.5 GB |
Backbone weights only. Peak VRAM is dominated by the activation memory for 49 frames at 720×480, not the file size.
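The sizes in the table follow from parameter count times bits per weight. The sketch below reproduces that arithmetic; the parameter count of 5.5 billion is an assumption inferred from the 11 GB fp16 row (11 GB at 2 bytes per weight), not a figure stated on this page, and the helper names are illustrative.

```python
# Assumption: ~5.5e9 backbone parameters, inferred from 11 GB / 2 bytes per weight.
ASSUMED_PARAMS = 5.5e9


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in decimal gigabytes (no activations)."""
    return params * bits_per_weight / 8 / 1e9


def implied_bits_per_weight(size_gb: float, params: float) -> float:
    """Effective bits per weight implied by a file size."""
    return size_gb * 1e9 * 8 / params


for name, bits in (("fp16 / bf16", 16), ("fp8", 8)):
    print(f"{name}: {weights_gb(ASSUMED_PARAMS, bits):.1f} GB")

# The Q4 GGUF row is larger than a flat 4 bits per weight would give.
q4_bits = implied_bits_per_weight(3.5, ASSUMED_PARAMS)
print(f"Q4 GGUF at 3.5 GB implies about {q4_bits:.2f} bits per weight")
```

Under this assumption the Q4 row works out to roughly 5 bits per weight rather than 4, which is plausible because 4-bit GGUF formats store per-block scales alongside the quantized values. None of this accounts for activation memory, which is what sets the peak.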
Pipeline components
| Component | Size |
|---|---|
| T5-XXL text encoder (offloaded) | 2.9 GB |
| VAE (3D) | 0.5 GB |
Video VAEs are larger than image VAEs because they decode a temporal stack of frames.
Run it
CogVideoX-5B runs in Diffusers or ComfyUI. Generating more frames or higher resolution raises peak VRAM sharply; the INT8 / fp8 figure is for the default 49-frame clip.
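As a rough illustration of the Diffusers route, the sketch below loads the pipeline in bf16 with CPU offload and renders the default 49-frame clip at 720×480 and 8 fps. It is a minimal sketch, not the INT8 / fp8 setup the VRAM figures refer to: the repo id `THUDM/CogVideoX-5b`, the 50 steps, and the guidance scale of 6.0 are assumptions, and `build_call_kwargs` and `generate` are hypothetical helper names.

```python
def build_call_kwargs(prompt, num_frames=49, width=720, height=480,
                      steps=50, guidance_scale=6.0):
    """Collect pipeline call arguments; steps and guidance_scale are assumed defaults."""
    return {
        "prompt": prompt,
        "num_frames": num_frames,  # more frames raises peak VRAM sharply
        "width": width,
        "height": height,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
    }


def generate(prompt, out_path="output.mp4", offload="model", fps=8):
    """Render one clip. Heavy imports stay inside so the module loads without a GPU."""
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    # Assumed Hugging Face repo id for the 5B checkpoint.
    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
    )
    if offload == "sequential":
        # Lowest VRAM, much slower: submodules are streamed to the GPU one by one.
        pipe.enable_sequential_cpu_offload()
    else:
        # Whole components (text encoder, backbone, VAE) move to the GPU only when used.
        pipe.enable_model_cpu_offload()
    # Decode the video in slices and tiles to limit VAE memory.
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    frames = pipe(**build_call_kwargs(prompt)).frames[0]
    export_to_video(frames, out_path, fps=fps)
    return out_path
```

Calling `generate("a short text prompt")` downloads the weights on first use and writes `output.mp4`. Passing `offload="sequential"` trades speed for a lower memory floor.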
Which devices can run CogVideoX-5B?
Apple Silicon Macs
No mainstream local runtime for a 5B video model on Apple Silicon Macs yet.
RAM-only laptops
No mainstream local runtime for a 5B video model on RAM-only laptops yet.
iPhone & iPad
No mainstream local runtime for a 5B video model on iPhone & iPad yet.
Android
No mainstream local runtime for a 5B video model on Android yet.
NVIDIA GPUs
AMD GPUs
FAQ
How much VRAM does CogVideoX-5B need?
At INT8 / fp8 the realistic peak is ~16 GB, versus ~26 GB in bf16 with every component resident. With full sequential CPU offload plus INT8 it drops to ~5 GB, but generation is much slower.
Why is peak VRAM lower than the sum of the files?
The text encoder is run once to encode your prompt, then offloaded to CPU before the frames are generated, so it is not resident at the memory peak.
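The point can be sketched as "peak is the maximum over phases, not the sum over files". The example below uses the file sizes from the tables above and assumes a simplified three-phase schedule in which only one component is on the GPU at a time; the phase names are illustrative, and activation memory is deliberately left out, so the numbers are weights only and will not reproduce the ~16 GB peak.

```python
# Weight sizes in GB, taken from the tables on this page (fp8 backbone).
WEIGHTS_GB = {"text_encoder": 2.9, "backbone_fp8": 5.5, "vae": 0.5}

# Assumed schedule: which components are resident on the GPU in each phase.
PHASES = {
    "encode_prompt": ["text_encoder"],
    "denoise": ["backbone_fp8"],
    "decode": ["vae"],
}


def resident_gb(components):
    """Total weight size of the components resident at the same time."""
    return sum(WEIGHTS_GB[name] for name in components)


sum_of_files = resident_gb(WEIGHTS_GB)  # everything resident at once
peak_phase = max(PHASES, key=lambda phase: resident_gb(PHASES[phase]))
peak_weights = resident_gb(PHASES[peak_phase])

print(f"sum of files: {sum_of_files:.1f} GB")
print(f"largest phase: {peak_phase} at {peak_weights:.1f} GB of weights")
```

With offload, the weights resident at the peak are those of the denoise phase alone, and the activations for 49 frames are added on top of that.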
Can I use CogVideoX-5B commercially?
Conditionally. Free for research; commercial use requires registration and is capped at 1 million visits per month.
THUDM's 5B video DiT (720×480, 6s at 8 fps). The diffusers docs cite ~16 GB VRAM for the INT8/fp8-quantized model; the SAT bf16 path without offload needs 26 GB, and full sequential CPU offload plus INT8 lowers the floor to ~5 GB (very slow). The T5 encoder is offloaded. Custom license: free for research; commercial use needs registration and has a 1M-visit monthly cap. Sources: THUDM card, diffusers CogVideoX docs, the license.
Sources
The VRAM figure is a sourced peak-usage value at INT8 / fp8 for the default clip length, validated 2026-06-15. See methodology.