localmodel.run


CogVideoX-2B requirements

DiT video model · 2B params · 720×480, 49f (~6s) · released Aug 2024. Realistic minimum to run: NVIDIA GeForce RTX 3060 (12 GB) at fp16 + offload.

Apache-2.0 · Commercial use OK

Peak VRAM (fp16 + offload): ~8 GB
All resident: ~18 GB
Offload floor: ~4 GB
Clip: 49f / ~6s

Backbone size by precision

Precision                    Size on disk
fp16 / bf16 (recommended)    5 GB
fp8                          2.5 GB

Backbone weights only. Peak VRAM is dominated by the activation memory for 49 frames at 720×480, not the file size.

Pipeline components

Component                            Size
T5-XXL text encoder (offloaded)      2.9 GB
VAE (3D)                             0.5 GB

Video VAEs are larger than image VAEs because they decode a temporal stack of frames.

Run it

CogVideoX-2B runs in Diffusers or ComfyUI. Generating more frames or higher resolution raises peak VRAM sharply; the fp16 + offload figure is for the default 49-frame clip.
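As a rough illustration of the Diffusers route, the sketch below loads the pipeline at fp16 with model CPU offload and generates the default 49-frame clip. It assumes the Hub repo id THUDM/CogVideoX-2b, an 8 fps export rate (which matches 49 frames being about 6 seconds), and placeholder values for the step count, guidance scale, prompt, and output path. Treat it as a starting point under those assumptions, not a tuned recipe.

```python
# Sketch: text-to-video with CogVideoX-2B in Diffusers, using model CPU offload.
# Assumptions: torch and diffusers are installed and a CUDA GPU is present.
# Values marked "assumed" below are illustrative, not taken from this page.

SETTINGS = {
    "model_id": "THUDM/CogVideoX-2b",  # assumed Hub repo id
    "num_frames": 49,                  # default clip length from this page
    "fps": 8,                          # assumed export rate (49 / 8 is about 6 s)
    "num_inference_steps": 50,         # assumed
    "guidance_scale": 6.0,             # assumed
}


def clip_seconds(num_frames: int, fps: int) -> float:
    """Length of the exported clip in seconds."""
    return num_frames / fps


def generate(prompt: str, out_path: str = "output.mp4") -> str:
    """Generate one clip and write it to out_path. Needs a GPU and a model download."""
    # Heavy imports stay local so the helpers above work without the GPU stack.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(
        SETTINGS["model_id"], torch_dtype=torch.float16
    )
    # Keep only the active component on the GPU; the others wait in system RAM.
    pipe.enable_model_cpu_offload()
    # Decode the video latents in tiles to reduce the VAE memory spike.
    pipe.vae.enable_tiling()

    frames = pipe(
        prompt=prompt,
        num_frames=SETTINGS["num_frames"],
        num_inference_steps=SETTINGS["num_inference_steps"],
        guidance_scale=SETTINGS["guidance_scale"],
    ).frames[0]
    export_to_video(frames, out_path, fps=SETTINGS["fps"])
    return out_path


# Example call (downloads the weights on first use):
# generate("A panda playing guitar in a bamboo forest")
```

Raising num_frames or the resolution is the change most likely to push peak VRAM past the fp16 + offload figure, so it is worth changing one setting at a time.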


Which devices can run CogVideoX-2B?

Apple Silicon Macs

No mainstream local runtime for a 2B video model on Apple Silicon Macs yet.

RAM-only laptops

No mainstream local runtime for a 2B video model on RAM-only laptops yet.

iPhone & iPad

No mainstream local runtime for a 2B video model on iPhone & iPad yet.

Android

No mainstream local runtime for a 2B video model on Android yet.

NVIDIA GPUs

AMD GPUs

FAQ

How much VRAM does CogVideoX-2B need?

At fp16 + offload the realistic peak is ~8 GB, versus ~18 GB with every component resident. With aggressive CPU offload it drops to ~4 GB, but generation is much slower.
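The three figures in this answer can be read as a simple decision table. The sketch below picks the fastest mode whose anchor fits a given amount of VRAM. The function name pick_mode and the mapping to Diffusers calls in the comments are illustrative assumptions, and the thresholds are the approximate anchors from this page rather than guaranteed limits.

```python
# Sketch: choose an offload mode from the approximate VRAM anchors on this page.
# Modes are ordered from fastest to slowest. The "how" strings name the
# Diffusers calls assumed to correspond to each mode.

MODES = [
    # (anchor in GB, mode name, how it would be enabled in Diffusers)
    (18.0, "all resident", "pipe.to('cuda')"),
    (8.0, "model offload", "pipe.enable_model_cpu_offload()"),
    (4.0, "sequential offload", "pipe.enable_sequential_cpu_offload()"),
]


def pick_mode(vram_gb: float):
    """Return the fastest mode whose anchor fits in vram_gb, or None if none fits."""
    for anchor_gb, name, _how in MODES:
        if vram_gb >= anchor_gb:
            return name
    return None


for card_gb in (24, 12, 6, 3):
    print(card_gb, "GB ->", pick_mode(card_gb))
```

Because the anchors are approximate peaks, a card sitting exactly on a threshold may still run out of memory; some headroom above the anchor is the safer reading.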

Why is peak VRAM lower than the sum of the files?

The text encoder is run once to encode your prompt, then offloaded to CPU before the frames are generated, so it is not resident at the memory peak.
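A small arithmetic sketch makes the difference concrete. It uses the component sizes listed on this page and assumes a simplified three-stage pipeline in which model offload keeps only one component on the GPU at a time. The stage names are illustrative, and activation memory is deliberately left out because this page does not give a separate figure for it.

```python
# Sketch: sum of weight files versus weights resident at each pipeline stage.
# Sizes in GB are the figures listed on this page. The stage layout is a
# simplifying assumption about how model offload schedules components.

COMPONENTS_GB = {
    "T5-XXL text encoder": 2.9,
    "backbone (fp16)": 5.0,
    "VAE (3D)": 0.5,
}

# Which component is on the GPU during each stage when model offload is on.
STAGES = {
    "encode prompt": ["T5-XXL text encoder"],
    "denoise frames": ["backbone (fp16)"],
    "decode video": ["VAE (3D)"],
}


def sum_of_files(components: dict) -> float:
    """Total weight size if every component stayed resident."""
    return round(sum(components.values()), 1)


def resident_per_stage(components: dict, stages: dict) -> dict:
    """Weights on the GPU during each stage."""
    return {
        stage: round(sum(components[name] for name in names), 1)
        for stage, names in stages.items()
    }


per_stage = resident_per_stage(COMPONENTS_GB, STAGES)
print("sum of files:", sum_of_files(COMPONENTS_GB), "GB")
print("largest resident weights:", max(per_stage.values()), "GB")
```

Under these assumptions the weights on the GPU never exceed the backbone alone, and the rest of the ~8 GB peak is activations and runtime overhead during frame generation.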

Can I use CogVideoX-2B commercially?

Yes. CogVideoX-2B is licensed Apache-2.0, which permits commercial use.

CogVideoX-2B is the Apache-2.0 tier of the CogVideoX family. The official model card lists 18 GB for SAT fp16 without offload, around 8 GB for the model-offload paths, and a floor of about 4 GB with full sequential offload, INT8, and VAE tiling. The T5 encoder is offloaded. The license permits commercial use, unlike the 5B model's custom license. The anchor figure is the model-offload path, synthesized from these sources rather than taken from a single measurement. Sources: THUDM model card, Diffusers CogVideoX docs.

Sources

VRAM is a sourced peak-usage anchor at fp16 + offload (composed from component sizes, not a single measurement) for the default clip length, validated 2026-06-15. See methodology.