CogVideoX-2B requirements
DiT video model · 2B params · 720×480, 49 frames (~6 s) · released Aug 2024. Realistic minimum to run: NVIDIA GeForce RTX 3060 (12 GB) at fp16 + offload.
Backbone size by precision
| Precision | Size |
|---|---|
| fp16 / bf16 (recommended) | 5 GB |
| fp8 | 2.5 GB |
Backbone weights only. Peak VRAM is dominated by the activation memory for 49 frames at 720×480, not the file size.
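To make the activation point concrete, the rough sketch below counts how many tokens the transformer processes for the default clip. It assumes the commonly cited CogVideoX layout (a 3D VAE with 8× spatial and 4× temporal compression, followed by a 2×2 spatial patch in the transformer). Those factors are not stated on this page, so treat them as assumptions rather than confirmed specifications.

```python
def latent_token_count(width=720, height=480, frames=49,
                       spatial_down=8, temporal_down=4, patch=2):
    """Estimate video tokens seen by the transformer.

    The compression factors and patch size are assumptions about the
    CogVideoX architecture; they do not come from this page.
    """
    # The first frame is kept, the rest are compressed in groups of temporal_down.
    latent_frames = (frames - 1) // temporal_down + 1
    # Spatial downsampling by the VAE, then patchification in the transformer.
    tokens_per_frame = (height // spatial_down // patch) * (width // spatial_down // patch)
    return latent_frames, latent_frames * tokens_per_frame


latent_frames, tokens = latent_token_count()
print(latent_frames, tokens)  # 13 17550
```

Under these assumptions, every denoising step attends over roughly 17,500 video tokens, and the count grows with both frame count and resolution, which is why activations rather than weights set the peak.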
Pipeline components
| Component | Size |
|---|---|
| T5-XXL text encoder (offloaded) | 2.9 GB |
| VAE (3D) | 0.5 GB |
Video VAEs are larger than image VAEs because they decode a temporal stack of frames.
Run it
CogVideoX-2B runs in Diffusers or ComfyUI. Generating more frames or higher resolution raises peak VRAM sharply; the fp16 + offload figure is for the default 49-frame clip.
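A minimal Diffusers sketch of the fp16 + offload path might look like the following. The repository id `THUDM/CogVideoX-2b`, the step count, the guidance scale, and the output frame rate are assumptions based on commonly used example settings, not values stated on this page; only the 49-frame default comes from the text above.

```python
# Settings for the default clip. Everything except num_frames is an assumption.
DEFAULT_SETTINGS = {
    "model_id": "THUDM/CogVideoX-2b",  # assumed Hugging Face repo id
    "num_frames": 49,                  # default clip length from this page
    "num_inference_steps": 50,         # assumed
    "guidance_scale": 6.0,             # assumed
    "fps": 8,                          # assumed output frame rate
}


def generate_clip(prompt, out_path="output.mp4", settings=DEFAULT_SETTINGS):
    """Generate one clip with model-level CPU offload and VAE tiling."""
    # Heavy imports are kept local so the module loads without a GPU stack.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(
        settings["model_id"], torch_dtype=torch.float16
    )
    # Keep only the component currently in use on the GPU.
    pipe.enable_model_cpu_offload()
    # Decode the video in tiles to lower the VAE's memory spike.
    pipe.vae.enable_tiling()

    result = pipe(
        prompt=prompt,
        num_frames=settings["num_frames"],
        num_inference_steps=settings["num_inference_steps"],
        guidance_scale=settings["guidance_scale"],
    )
    export_to_video(result.frames[0], out_path, fps=settings["fps"])
    return out_path


# Example call (downloads weights on first use and needs a CUDA GPU):
# generate_clip("A panda playing guitar in a bamboo forest")
```

Raising `num_frames` or the resolution in this sketch would move peak VRAM above the figure quoted for the default clip.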
Which devices can run CogVideoX-2B?
Apple Silicon Macs
No mainstream local runtime for a 2B video model on Apple Silicon Macs yet.
RAM-only laptops
No mainstream local runtime for a 2B video model on RAM-only laptops yet.
iPhone & iPad
No mainstream local runtime for a 2B video model on iPhone & iPad yet.
Android
No mainstream local runtime for a 2B video model on Android yet.
NVIDIA GPUs
AMD GPUs
FAQ
How much VRAM does CogVideoX-2B need?
At fp16 + offload the realistic peak is ~8 GB, versus ~18 GB with every component resident. With full sequential CPU offload, INT8 quantization, and VAE tiling it drops to ~4 GB, but generation is much slower.
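As a quick way to read those three figures, the sketch below picks the least aggressive mode whose quoted peak fits a given card. The function name and the `headroom_gb` parameter are hypothetical helpers for illustration, and the peaks are the approximate values quoted above, not fresh measurements.

```python
# Approximate peak VRAM per mode in GB, ordered from fastest to slowest.
PEAK_VRAM_GB = [
    ("all components resident", 18.0),
    ("fp16 + offload", 8.0),
    ("aggressive CPU offload", 4.0),
]


def pick_mode(vram_gb, headroom_gb=0.0):
    """Return the fastest mode whose peak (plus optional headroom) fits, else None."""
    for mode, peak in PEAK_VRAM_GB:
        if peak + headroom_gb <= vram_gb:
            return mode
    return None


print(pick_mode(12))  # fp16 + offload
```

A 12 GB card such as the RTX 3060 lands on the fp16 + offload path, which matches the minimum named at the top of the page.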
Why is peak VRAM lower than the sum of the files?
The text encoder is run once to encode your prompt, then offloaded to CPU before the frames are generated, so it is not resident at the memory peak.
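The effect of staging can be sketched with the component sizes from the tables above. This toy calculation counts weights only and assumes exactly one component is on the GPU per stage, which is a simplification of how model offload behaves; activation memory is deliberately left out, so the result is not the real peak.

```python
# Component weight sizes in GB, taken from the tables on this page.
WEIGHTS_GB = {"text_encoder": 2.9, "backbone": 5.0, "vae": 0.5}

# Assumed staging: one component resident on the GPU at a time.
STAGES = [
    ("encode prompt", ["text_encoder"]),
    ("denoise frames", ["backbone"]),
    ("decode video", ["vae"]),
]


def resident_weights(components):
    """Total weight memory for the components on the GPU in one stage."""
    return sum(WEIGHTS_GB[name] for name in components)


all_resident = sum(WEIGHTS_GB.values())
staged_peak = max(resident_weights(components) for _, components in STAGES)
print(f"{all_resident:.1f} {staged_peak:.1f}")  # 8.4 5.0
```

With staging, the weight footprint at the memory peak is set by the largest single component rather than the total, and the remaining gap up to the quoted ~8 GB is activation memory.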
Can I use CogVideoX-2B commercially?
Yes. CogVideoX-2B is licensed Apache-2.0, which permits commercial use.
CogVideoX-2B is the Apache-2.0 tier of the CogVideoX family. The official model card lists SAT fp16 inference without offload at 18 GB, the model-offload paths at about 8 GB, and a floor of about 4 GB with full sequential offload, INT8, and VAE tiling. The T5 encoder is offloaded. The Apache-2.0 license permits commercial use, unlike the 5B model's custom license. The VRAM anchor is the model-offload path (a synthesized figure). Sources: THUDM model card, Diffusers CogVideoX docs.
Sources
VRAM is a sourced peak-usage anchor at fp16 + offload (composed from component sizes, not a single measurement) for the default clip length, validated 2026-06-15. See methodology.