Methodology: how we estimate local model memory

Step 01

Weights memory

We prefer the measured on-disk size of each GGUF file from Ollama and HuggingFace repos (bartowski, unsloth). When a quant is not published, we estimate it from effective bits-per-weight:

Formula: weights

weights_GB = params_B × bpw / 8

Effective bits-per-weight (from the llama.cpp quantize benchmark table):

Q4_K_M: 4.89 bpw
Q8_0: 8.5 bpw
FP16: 16 bpw
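
As a rough illustration of the formula, the sketch below applies it to a model with 8 billion parameters. The function name estimate_weights_gb and the BPW dictionary are hypothetical names chosen for this example; the bpw values are the ones listed above.

```python
# Effective bits-per-weight values quoted in the table above.
BPW = {"Q4_K_M": 4.89, "Q8_0": 8.5, "FP16": 16.0}


def estimate_weights_gb(params_b: float, quant: str) -> float:
    """Apply weights_GB = params_B * bpw / 8, with params_B in billions."""
    return params_b * BPW[quant] / 8


# Print the estimate for an 8B-parameter model at each quant level.
for quant in BPW:
    print(f"8B at {quant}: {estimate_weights_gb(8, quant):.2f} GB")
```

This fallback only applies when no published file exists; a measured on-disk size takes priority.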

Step 02

KV cache

Context adds memory on top of weights. The full KV cache cost in bytes is 2 × n_layers × n_kv_heads × head_dim × context × bytes_per_element, where the leading 2 covers keys and values. Most modern models use Grouped Query Attention (GQA), which shares each KV head across several query heads. Llama 3.1 8B, for example, has 8 KV heads against 32 query heads, so its KV cache is about 4x smaller than a calculation that assumes one KV head per query head. Because the dataset does not store per-model layer and head counts, we approximate the cost with a sqrt(params) scaling calibrated to real GQA models. The result is an estimate; the exact cost requires those architecture details.
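
To make the exact formula concrete, here is a minimal sketch for Llama 3.1 8B. The head counts come from the text; the layer count (32), head dimension (128), FP16 cache (2 bytes per element), and 8192-token context are assumptions added for this example. It shows the exact formula only, not the sqrt(params) approximation, whose calibration constants are not given here.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context, bytes_per_elem=2):
    """Exact KV cache size; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem


# Assumed architecture values for Llama 3.1 8B (not stored in the dataset).
N_LAYERS, HEAD_DIM, CONTEXT = 32, 128, 8192

gqa = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, CONTEXT)     # 8 KV heads (GQA)
naive = kv_cache_bytes(N_LAYERS, 32, HEAD_DIM, CONTEXT)  # one KV head per query head

print(f"GQA:   {gqa / 2**30:.1f} GiB")
print(f"Naive: {naive / 2**30:.1f} GiB")
print(f"Ratio: {naive / gqa:.0f}x")
```

Under these assumptions the GQA cache comes to 1 GiB and the naive figure to 4 GiB, which matches the 4x reduction described above.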

Step 03

Mixture-of-Experts

MoE models (like Qwen3 30B-A3B) activate only a few billion parameters per token, so they run fast. All experts must still be held in memory, however, so memory tracks total parameters, not active ones. We size MoE models by total parameters, a point many calculators get wrong.
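
The gap between the two sizing choices can be sketched as follows. The parameter counts are the nominal figures from the model name (30B total, 3B active), which is an approximation, and the helper weights_gb is a hypothetical name for this example.

```python
Q4_K_M_BPW = 4.89  # effective bits-per-weight for Q4_K_M


def weights_gb(params_b: float, bpw: float = Q4_K_M_BPW) -> float:
    """weights_GB = params_B * bpw / 8, with params_B in billions."""
    return params_b * bpw / 8


# Nominal counts read off the name "30B-A3B" (assumed, not exact).
TOTAL_B, ACTIVE_B = 30.0, 3.0

by_total = weights_gb(TOTAL_B)    # what must actually sit in memory
by_active = weights_gb(ACTIVE_B)  # the mistaken active-only figure

print(f"Sized by total params:  {by_total:.1f} GB")
print(f"Sized by active params: {by_active:.1f} GB")
```

Sizing by active parameters would understate the weights by roughly a factor of ten for a model of this shape.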

Step 04

Usable memory

We compare the model's need to the memory you can actually give it:

Apple Silicon: unified memory, ~66% usable on machines under 64 GB and ~75% at 64 GB or above (Metal recommendedMaxWorkingSetSize).
Discrete GPU: VRAM minus ~1 GB for driver and display.
CPU-only: ~60% of RAM, leaving room for the OS and apps.
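
The three rules above can be sketched as a single function. The function name usable_memory_gb and the device-kind strings are hypothetical; the percentages and the 1 GB reservation are the approximate figures from the list.

```python
def usable_memory_gb(total_gb: float, kind: str) -> float:
    """Approximate memory available to a model, per the rules above."""
    if kind == "apple_silicon":
        # ~66% under 64 GB, ~75% at 64 GB or above.
        return total_gb * (0.66 if total_gb < 64 else 0.75)
    if kind == "discrete_gpu":
        # VRAM minus ~1 GB for driver and display.
        return max(total_gb - 1.0, 0.0)
    if kind == "cpu":
        # ~60% of RAM, leaving room for the OS and apps.
        return total_gb * 0.60
    raise ValueError(f"unknown device kind: {kind!r}")


for total, kind in [(32, "apple_silicon"), (64, "apple_silicon"),
                    (24, "discrete_gpu"), (16, "cpu")]:
    print(f"{total} GB {kind}: {usable_memory_gb(total, kind):.1f} GB usable")
```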

Step 05

The verdict

We pick the highest-quality quant that fits in usable memory. "Yes" means it fits with headroom, "tight" means it barely fits, and "no" means it does not fit even at Q4_K_M. These are estimates; always verify before a large download.
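
A minimal sketch of this selection logic follows. The 10% headroom threshold that separates "yes" from "tight" is an assumption made for this example, since the exact cutoff is not stated here, and the function and constant names are hypothetical. Weights are estimated from bits-per-weight rather than read from measured file sizes.

```python
BPW = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.89}
QUANTS_BEST_FIRST = ["FP16", "Q8_0", "Q4_K_M"]
HEADROOM = 0.10  # assumed threshold; the real cutoff is not specified


def verdict(params_b: float, kv_gb: float, usable_gb: float):
    """Return (label, quant) for the highest-quality quant that fits."""
    for quant in QUANTS_BEST_FIRST:
        need_gb = params_b * BPW[quant] / 8 + kv_gb
        if need_gb <= usable_gb:
            roomy = need_gb <= usable_gb * (1 - HEADROOM)
            return ("yes" if roomy else "tight"), quant
    return "no", None  # does not fit even at Q4_K_M


# An 8B model with a 1 GB KV cache on three different memory budgets.
for usable in (23.0, 9.6, 5.0):
    print(usable, verdict(8, 1.0, usable))
```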

Dataset

Data sources

95 models and 39 devices, validated against 60+ primary sources: Ollama library pages, HuggingFace model cards and GGUF repositories, vendor documentation (Stability AI, Coqui, OpenAI Whisper), the llama.cpp documentation, and Apple / NVIDIA / AMD spec pages.


FAQ

Common questions

Are these numbers exact?

They are validated estimates. Where a quant is published, its size is read from the real GGUF file on Ollama or HuggingFace, so the weights figure is accurate; unpublished quants are estimated from bits-per-weight. KV cache and overhead are estimated, so total memory can vary by about ±15% depending on context length and runtime.

Why does usable memory differ from total memory?

macOS reserves part of unified memory for the OS (the Metal working-set limit is ~66% on machines under 64 GB and ~75% at 64 GB or above). GPUs lose ~1 GB of VRAM to the driver and display. CPU-only machines need headroom for the OS and apps.

How often is the data updated?

The dataset is reviewed and updated periodically. Last validated 2026-06-15.