# localmodel.run, full dataset

Validated 2026-06-15. Memory = weights + KV cache + ~0.8 GB overhead. Estimates only.

## Models (memory at Q4_K_M, 4k context)

### Llama 3.1 8B (https://localmodel.run/model/llama-3.1-8b)
- Params: 8B
- Q4_K_M on disk: 4.92 GB; Q8_0: 8.54 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~6.4 GB
- Context: 128k; Ollama: llama3.1:8b; Released: 2024-07
- Sources: https://ollama.com/library/llama3.1, https://ollama.com/library/llama3.1/tags, https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, https://lmarena.ai/leaderboard

### Llama 3.3 70B (https://localmodel.run/model/llama-3.3-70b)
- Params: 70B
- Q4_K_M on disk: 42.52 GB; Q8_0: 74.98 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~45.3 GB
- Context: 128k; Ollama: llama3.3:70b; Released: 2024-12
- Sources: https://ollama.com/library/llama3.3, https://ollama.com/library/llama3.3/tags, https://huggingface.co/bartowski/Llama-3.3-70B-Instruct-GGUF, https://lmarena.ai/leaderboard

### Llama 3.2 3B (https://localmodel.run/model/llama-3.2-3b)
- Params: 3B
- Q4_K_M on disk: 2.02 GB; Q8_0: 3.42 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~3.2 GB
- Context: 128k; Ollama: llama3.2:3b; Released: 2024-09
- Sources: https://ollama.com/library/llama3.2, https://ollama.com/library/llama3.2/tags, https://huggingface.co/unsloth/Llama-3.2-3B-Instruct-GGUF, https://lmarena.ai/leaderboard

### Llama 3.2 1B (https://localmodel.run/model/llama-3.2-1b)
- Params: 1B
- Q4_K_M on disk: 0.81 GB; Q8_0: 1.32 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~1.8 GB
- Context: 128k; Ollama: llama3.2:1b; Released: 2024-09
- Sources: https://ollama.com/library/llama3.2, https://ollama.com/library/llama3.2/tags, https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF, https://lmarena.ai/leaderboard

### Mistral 7B (https://localmodel.run/model/mistral-7b)
- Params: 7B
- Q4_K_M on disk: 4.37 GB; Q8_0: 7.7 GB
- Est.
memory to run (Q4_K_M, 4k ctx): ~5.8 GB
- Context: 32k; Ollama: mistral:7b; Released: 2024-05
- Sources: https://ollama.com/library/mistral, https://ollama.com/library/mistral/tags, https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-GGUF, https://lmarena.ai/leaderboard

### Mistral Small 3 24B (https://localmodel.run/model/mistral-small-3-24b)
- Params: 24B
- Q4_K_M on disk: 14.33 GB; Q8_0: 25.05 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~16.3 GB
- Context: 128k; Ollama: mistral-small3.2:24b; Released: 2025-06
- Sources: https://ollama.com/library/mistral-small3.2, https://ollama.com/library/mistral-small3.2/tags, https://huggingface.co/bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF, https://docs.mistral.ai/models/model-cards/mistral-small-3-2-25-06, https://lmarena.ai/leaderboard

### Phi-4 14B (https://localmodel.run/model/phi-4-14b)
- Params: 14B
- Q4_K_M on disk: 9.05 GB; Q8_0: 15.58 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~10.8 GB
- Context: 16k; Ollama: phi4:14b; Released: 2024-12
- Sources: https://ollama.com/library/phi4, https://ollama.com/library/phi4/tags, https://huggingface.co/bartowski/phi-4-GGUF, https://lmarena.ai/leaderboard

### Phi-4-mini 3.8B (https://localmodel.run/model/phi-4-mini-3.8b)
- Params: 3.8B
- Q4_K_M on disk: 2.49 GB; Q8_0: 4.08 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~3.8 GB
- Context: 128k; Ollama: phi4-mini:3.8b; Released: 2025-02
- Sources: https://ollama.com/library/phi4-mini, https://ollama.com/library/phi4-mini/tags, https://huggingface.co/bartowski/microsoft_Phi-4-mini-instruct-GGUF

### Gemma 2 9B (https://localmodel.run/model/gemma-2-9b)
- Params: 9B
- Q4_K_M on disk: 5.76 GB; Q8_0: 9.83 GB
- Est.
memory to run (Q4_K_M, 4k ctx): ~7.3 GB
- Context: 8k; Ollama: gemma2:9b; Released: 2024-06
- Sources: https://ollama.com/library/gemma2, https://ollama.com/library/gemma2/tags, https://huggingface.co/bartowski/gemma-2-9b-it-GGUF, https://lmarena.ai/leaderboard

### Gemma 2 27B (https://localmodel.run/model/gemma-2-27b)
- Params: 27B
- Q4_K_M on disk: 16.65 GB; Q8_0: 28.94 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~18.7 GB
- Context: 8k; Ollama: gemma2:27b; Released: 2024-06
- Sources: https://ollama.com/library/gemma2, https://ollama.com/library/gemma2/tags, https://huggingface.co/bartowski/gemma-2-27b-it-GGUF, https://lmarena.ai/leaderboard

### Gemma 3 4B (https://localmodel.run/model/gemma-3-4b)
- Params: 4B
- Q4_K_M on disk: 2.49 GB; Q8_0: 4.13 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~3.8 GB
- Context: 128k; Ollama: gemma3:4b; Released: 2025-03
- Sources: https://ollama.com/library/gemma3, https://ollama.com/library/gemma3/tags, https://huggingface.co/bartowski/google_gemma-3-4b-it-GGUF, https://lmarena.ai/leaderboard

### Gemma 3 12B (https://localmodel.run/model/gemma-3-12b)
- Params: 12B
- Q4_K_M on disk: 7.3 GB; Q8_0: 12.51 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~8.9 GB
- Context: 128k; Ollama: gemma3:12b; Released: 2025-03
- Sources: https://ollama.com/library/gemma3, https://ollama.com/library/gemma3/tags, https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF, https://lmarena.ai/leaderboard

### Gemma 3 27B (https://localmodel.run/model/gemma-3-27b)
- Params: 27B
- Q4_K_M on disk: 16.55 GB; Q8_0: 28.71 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~18.6 GB
- Context: 128k; Ollama: gemma3:27b; Released: 2025-03
- Sources: https://ollama.com/library/gemma3, https://ollama.com/library/gemma3/tags, https://huggingface.co/bartowski/google_gemma-3-27b-it-GGUF, https://lmarena.ai/leaderboard

### Qwen2.5 7B (https://localmodel.run/model/qwen2.5-7b)
- Params: 7B
- Q4_K_M on disk: 4.68 GB; Q8_0: 8.1 GB
- Est.
memory to run (Q4_K_M, 4k ctx): ~6.1 GB
- Context: 128k; Ollama: qwen2.5:7b; Released: 2024-09
- Sources: https://ollama.com/library/qwen2.5, https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF, https://qwenlm.github.io/blog/qwen2.5/

### Qwen2.5 14B (https://localmodel.run/model/qwen2.5-14b)
- Params: 14B
- Q4_K_M on disk: 8.99 GB; Q8_0: 15.7 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~10.7 GB
- Context: 128k; Ollama: qwen2.5:14b; Released: 2024-09
- Sources: https://ollama.com/library/qwen2.5, https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-GGUF, https://qwenlm.github.io/blog/qwen2.5/

### Qwen2.5 32B (https://localmodel.run/model/qwen2.5-32b)
- Params: 32B
- Q4_K_M on disk: 19.85 GB; Q8_0: 34.82 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~22.1 GB
- Context: 128k; Ollama: qwen2.5:32b; Released: 2024-09
- Sources: https://ollama.com/library/qwen2.5, https://huggingface.co/bartowski/Qwen2.5-32B-Instruct-GGUF, https://qwenlm.github.io/blog/qwen2.5/

### Qwen2.5 72B (https://localmodel.run/model/qwen2.5-72b)
- Params: 72B
- Q4_K_M on disk: 47.42 GB; Q8_0: 77.26 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~50.2 GB
- Context: 128k; Ollama: qwen2.5:72b; Released: 2024-09
- Sources: https://ollama.com/library/qwen2.5, https://huggingface.co/bartowski/Qwen2.5-72B-Instruct-GGUF, https://qwenlm.github.io/blog/qwen2.5/, https://lmarena.ai/leaderboard

### Qwen3 8B (https://localmodel.run/model/qwen3-8b)
- Params: 8B
- Q4_K_M on disk: 5.03 GB; Q8_0: 8.71 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~6.5 GB
- Context: 32k; Ollama: qwen3:8b; Released: 2025-04
- Sources: https://ollama.com/library/qwen3/tags, https://huggingface.co/Qwen/Qwen3-8B-GGUF, https://qwenlm.github.io/blog/qwen3/

### Qwen3 14B (https://localmodel.run/model/qwen3-14b)
- Params: 14B
- Q4_K_M on disk: 9 GB; Q8_0: 15.7 GB
- Est.
memory to run (Q4_K_M, 4k ctx): ~10.7 GB
- Context: 32k; Ollama: qwen3:14b; Released: 2025-04
- Sources: https://ollama.com/library/qwen3/tags, https://huggingface.co/Qwen/Qwen3-14B-GGUF, https://qwenlm.github.io/blog/qwen3/

### Qwen3 32B (https://localmodel.run/model/qwen3-32b)
- Params: 32B
- Q4_K_M on disk: 19.8 GB; Q8_0: 34.8 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~22 GB
- Context: 32k; Ollama: qwen3:32b; Released: 2025-04
- Sources: https://ollama.com/library/qwen3/tags, https://huggingface.co/Qwen/Qwen3-32B-GGUF, https://qwenlm.github.io/blog/qwen3/, https://lmarena.ai/leaderboard

### Qwen3 30B-A3B (https://localmodel.run/model/qwen3-30b-a3b)
- Params: 30.5B (MoE, 3.3B active per token)
- Q4_K_M on disk: 18.6 GB; Q8_0: 32.5 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~20.7 GB
- Context: 32k; Ollama: qwen3:30b-a3b; Released: 2025-04
- Sources: https://ollama.com/library/qwen3/tags, https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF, https://huggingface.co/Qwen/Qwen3-30B-A3B-GGUF, https://qwenlm.github.io/blog/qwen3/, https://lmarena.ai/leaderboard

### DeepSeek-R1-Distill-Qwen 7B (https://localmodel.run/model/deepseek-r1-distill-qwen-7b)
- Params: 7B
- Q4_K_M on disk: 4.68 GB; Q8_0: 8.1 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~6.1 GB
- Context: 128k; Ollama: deepseek-r1:7b; Released: 2025-01
- Sources: https://ollama.com/library/deepseek-r1, https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF, https://github.com/deepseek-ai/DeepSeek-R1

### DeepSeek-R1-Distill-Qwen 14B (https://localmodel.run/model/deepseek-r1-distill-qwen-14b)
- Params: 14B
- Q4_K_M on disk: 8.99 GB; Q8_0: 15.7 GB
- Est.
memory to run (Q4_K_M, 4k ctx): ~10.7 GB
- Context: 128k; Ollama: deepseek-r1:14b; Released: 2025-01
- Sources: https://ollama.com/library/deepseek-r1, https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF, https://github.com/deepseek-ai/DeepSeek-R1

### DeepSeek-R1-Distill-Llama 8B (https://localmodel.run/model/deepseek-r1-distill-llama-8b)
- Params: 8B
- Q4_K_M on disk: 4.92 GB; Q8_0: 8.54 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~6.4 GB
- Context: 128k; Ollama: deepseek-r1:8b; Released: 2025-01
- Sources: https://ollama.com/library/deepseek-r1, https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF, https://github.com/deepseek-ai/DeepSeek-R1

### DeepSeek-R1-Distill-Qwen 32B (https://localmodel.run/model/deepseek-r1-distill-qwen-32b)
- Params: 32B
- Q4_K_M on disk: 19.85 GB; Q8_0: 34.82 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~22.1 GB
- Context: 128k; Ollama: deepseek-r1:32b; Released: 2025-01
- Sources: https://ollama.com/library/deepseek-r1, https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF, https://github.com/deepseek-ai/DeepSeek-R1

### DeepSeek-V2-Lite (https://localmodel.run/model/deepseek-v2-lite)
- Params: 16B (MoE, 2.4B active per token)
- Q4_K_M on disk: 10.4 GB; Q8_0: 16.8 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~12.2 GB
- Context: 32k; Ollama: deepseek-v2:16b; Released: 2024-05
- Sources: https://ollama.com/library/deepseek-v2, https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite, https://huggingface.co/mradermacher/DeepSeek-V2-Lite-GGUF, https://github.com/deepseek-ai/DeepSeek-V2

### SmolLM2 1.7B (https://localmodel.run/model/smollm2-1.7b)
- Params: 1.7B
- Q4_K_M on disk: 1.06 GB; Q8_0: 1.82 GB
- Est.
memory to run (Q4_K_M, 4k ctx): ~2.2 GB
- Context: 8k; Ollama: smollm2:1.7b; Released: 2024-11
- Sources: https://ollama.com/library/smollm2, https://huggingface.co/bartowski/SmolLM2-1.7B-Instruct-GGUF, https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF, https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B, https://lmarena.ai/leaderboard

### Qwen2.5 0.5B (https://localmodel.run/model/qwen2.5-0.5b)
- Params: 0.494B
- Q4_K_M on disk: 0.491 GB; Q8_0: 0.676 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~1.5 GB
- Context: 128k; Ollama: qwen2.5:0.5b; Released: 2024-09-19
- Sources: https://ollama.com/library/qwen2.5:0.5b, https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF, https://qwenlm.github.io/blog/qwen2.5/

### Qwen2.5 1.5B (https://localmodel.run/model/qwen2.5-1.5b)
- Params: 1.54B
- Q4_K_M on disk: 1.12 GB; Q8_0: 1.89 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~2.2 GB
- Context: 128k; Ollama: qwen2.5:1.5b; Released: 2024-09-19
- Sources: https://ollama.com/library/qwen2.5:1.5b, https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF, https://qwenlm.github.io/blog/qwen2.5/

### Qwen2.5 3B (https://localmodel.run/model/qwen2.5-3b)
- Params: 3.09B
- Q4_K_M on disk: 2.1 GB; Q8_0: 3.62 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~3.3 GB
- Context: 128k; Ollama: qwen2.5:3b; Released: 2024-09-19
- Sources: https://ollama.com/library/qwen2.5:3b, https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF, https://qwenlm.github.io/blog/qwen2.5/

### Qwen3 0.6B (https://localmodel.run/model/qwen3-0.6b)
- Params: 0.6B
- Q4_K_M on disk: 0.48 GB; Q8_0: 0.8 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~1.5 GB
- Context: 32k; Ollama: qwen3:0.6b; Released: 2025-04-29
- Sources: https://ollama.com/library/qwen3:0.6b, https://huggingface.co/bartowski/Qwen_Qwen3-0.6B-GGUF, https://github.com/QwenLM/Qwen3, https://apxml.com/models/qwen3-1-7b

### Qwen3 1.7B (https://localmodel.run/model/qwen3-1.7b)
- Params: 1.7B
- Q4_K_M on disk: 1.28 GB; Q8_0: 2.17 GB
- Est.
memory to run (Q4_K_M, 4k ctx): ~2.4 GB
- Context: 32k; Ollama: qwen3:1.7b; Released: 2025-04-29
- Sources: https://ollama.com/library/qwen3:1.7b, https://huggingface.co/bartowski/Qwen_Qwen3-1.7B-GGUF, https://github.com/QwenLM/Qwen3, https://apxml.com/models/qwen3-1-7b

### Qwen3 4B (https://localmodel.run/model/qwen3-4b)
- Params: 4B
- Q4_K_M on disk: 2.5 GB; Q8_0: 4.28 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~3.8 GB
- Context: 32k; Ollama: qwen3:4b; Released: 2025-04-29
- Sources: https://ollama.com/library/qwen3:4b, https://huggingface.co/Qwen/Qwen3-4B-GGUF, https://github.com/QwenLM/Qwen3

### Gemma 2 2B (https://localmodel.run/model/gemma-2-2b)
- Params: 2.61B
- Q4_K_M on disk: 1.71 GB; Q8_0: 2.78 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~2.9 GB
- Context: 8k; Ollama: gemma2:2b; Released: 2024-07-31
- Sources: https://ollama.com/library/gemma2:2b, https://huggingface.co/bartowski/gemma-2-2b-it-GGUF, https://huggingface.co/blog/gemma-july-update, https://huggingface.co/blog/gemma2

### Gemma 3 1B (https://localmodel.run/model/gemma-3-1b)
- Params: 1B
- Q4_K_M on disk: 0.81 GB; Q8_0: 1.07 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~1.8 GB
- Context: 32k; Ollama: gemma3:1b; Released: 2025-03-12
- Sources: https://ollama.com/library/gemma3:1b, https://huggingface.co/bartowski/google_gemma-3-1b-it-GGUF, https://developers.googleblog.com/en/introducing-gemma3/, https://llm-stats.com/models/gemma-3-1b-it

### SmolLM2 135M (https://localmodel.run/model/smollm2-135m)
- Params: 0.135B
- Q4_K_M on disk: 0.105 GB; Q8_0: 0.145 GB
- Est.
memory to run (Q4_K_M, 4k ctx): ~1 GB
- Context: 2k; Ollama: smollm2:135m; Released: 2024-11-02
- Sources: https://ollama.com/library/smollm2:135m, https://huggingface.co/bartowski/SmolLM2-135M-Instruct-GGUF, https://venturebeat.com/ai/ai-on-your-smartphone-hugging-faces-smollm2-brings-powerful-models-to-the-palm-of-your-hand

### SmolLM2 360M (https://localmodel.run/model/smollm2-360m)
- Params: 0.362B
- Q4_K_M on disk: 0.271 GB; Q8_0: 0.386 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~1.2 GB
- Context: 2k; Ollama: smollm2:360m; Released: 2024-11-02
- Sources: https://ollama.com/library/smollm2:360m, https://huggingface.co/bartowski/SmolLM2-360M-Instruct-GGUF, https://venturebeat.com/ai/ai-on-your-smartphone-hugging-faces-smollm2-brings-powerful-models-to-the-palm-of-your-hand

### TinyLlama 1.1B (https://localmodel.run/model/tinyllama-1.1b)
- Params: 1.1B
- Q4_K_M on disk: 0.669 GB; Q8_0: 1.17 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~1.8 GB
- Context: 2k; Ollama: tinyllama:1.1b; Released: 2024-01-04
- Sources: https://ollama.com/library/tinyllama:1.1b, https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF, https://github.com/jzhang38/TinyLlama

### Granite 3.1 2B (https://localmodel.run/model/granite-3.1-2b)
- Params: 2.53B
- Q4_K_M on disk: 1.55 GB; Q8_0: 2.69 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~2.8 GB
- Context: 128k; Ollama: granite3.1-dense:2b; Released: 2024-12-18
- Sources: https://ollama.com/library/granite3.1-dense:2b, https://huggingface.co/bartowski/granite-3.1-2b-instruct-GGUF, https://community.ibm.com/community/user/blogs/nickolus-plowden/2025/01/12/granite-31-delivers-powerful-performance-longer-co, https://huggingface.co/ibm-granite/granite-3.1-2b-instruct

### Phi-3.5-mini 3.8B (https://localmodel.run/model/phi-3.5-mini)
- Params: 3.82B
- Q4_K_M on disk: 2.39 GB; Q8_0: 4.06 GB
- Est.
memory to run (Q4_K_M, 4k ctx): ~3.7 GB
- Context: 128k; Ollama: phi3.5:3.8b; Released: 2024-08-23
- Sources: https://ollama.com/library/phi3.5:3.8b, https://huggingface.co/bartowski/Phi-3.5-mini-instruct-GGUF, https://azure.microsoft.com/en-us/blog/new-models-added-to-the-phi-3-family-available-on-microsoft-azure/, https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/discover-the-new-multi-lingual-high-quality-phi-3-5-slms/4225280

### Sarvam-M 24B (https://localmodel.run/model/sarvam-m-24b)
- Params: 24B
- Q4_K_M on disk: 14.3 GB; Q8_0: 25.1 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~16.3 GB
- Context: 32k; Ollama: n/a; Released: 2025-05
- Sources: https://huggingface.co/sarvamai/sarvam-m, https://huggingface.co/lmstudio-community/sarvam-m-GGUF, https://huggingface.co/sarvamai/sarvam-m-q8-gguf

### Sarvam-1 2B (https://localmodel.run/model/sarvam-1-2b)
- Params: 2B
- Q4_K_M on disk: 1.55 GB; Q8_0: 2.69 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~2.7 GB
- Context: 8k; Ollama: n/a; Released: 2024-10
- Sources: https://huggingface.co/sarvamai/sarvam-1, https://huggingface.co/bartowski/sarvam-1-GGUF

### Sarvam-30B (https://localmodel.run/model/sarvam-30b)
- Params: 30B (MoE, 2.4B active per token)
- Q4_K_M on disk: 19.6 GB; Q8_0: n/a
- Est. memory to run (Q4_K_M, 4k ctx): ~21.7 GB
- Context: 64k; Ollama: n/a; Released: 2026-03
- Sources: https://huggingface.co/sarvamai/sarvam-30b, https://huggingface.co/sarvamai/sarvam-30b-gguf, https://www.sarvam.ai/blogs/sarvam-30b-105b

### Sarvam-105B (https://localmodel.run/model/sarvam-105b)
- Params: 105B (MoE, 10.3B active per token)
- Q4_K_M on disk: 64.2 GB; Q8_0: n/a
- Est.
memory to run (Q4_K_M, 4k ctx): ~67.5 GB
- Context: 128k; Ollama: n/a; Released: 2026-03
- Sources: https://huggingface.co/sarvamai/sarvam-105b, https://huggingface.co/sarvamai/sarvam-105b-gguf, https://www.sarvam.ai/blogs/sarvam-30b-105b

### Qwen2.5 Coder 0.5B (https://localmodel.run/model/qwen2.5-coder-0.5b)
- Params: 0.494B
- Q4_K_M on disk: 0.37 GB; Q8_0: 0.49 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~1.4 GB
- Context: 32k; Ollama: qwen2.5-coder:0.5b; Released: 2024-11
- Sources: https://ollama.com/library/qwen2.5-coder, https://ollama.com/library/qwen2.5-coder/tags, https://huggingface.co/bartowski/Qwen2.5-Coder-0.5B-Instruct-GGUF, https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct

### Qwen2.5 Coder 1.5B (https://localmodel.run/model/qwen2.5-coder-1.5b)
- Params: 1.54B
- Q4_K_M on disk: 0.92 GB; Q8_0: 1.53 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~2 GB
- Context: 32k; Ollama: qwen2.5-coder:1.5b; Released: 2024-09
- Sources: https://ollama.com/library/qwen2.5-coder, https://ollama.com/library/qwen2.5-coder/tags, https://huggingface.co/bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF, https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct

### Qwen2.5 Coder 3B (https://localmodel.run/model/qwen2.5-coder-3b)
- Params: 3.09B
- Q4_K_M on disk: 1.8 GB; Q8_0: 3.06 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~3 GB
- Context: 32k; Ollama: qwen2.5-coder:3b; Released: 2024-11
- Sources: https://ollama.com/library/qwen2.5-coder, https://ollama.com/library/qwen2.5-coder/tags, https://huggingface.co/bartowski/Qwen2.5-Coder-3B-Instruct-GGUF, https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct

### Qwen2.5 Coder 7B (https://localmodel.run/model/qwen2.5-coder-7b)
- Params: 7B
- Q4_K_M on disk: 4.36 GB; Q8_0: 7.54 GB
- Est.
memory to run (Q4_K_M, 4k ctx): ~5.8 GB
- Context: 32k; Ollama: qwen2.5-coder:7b; Released: 2024-09
- Sources: https://ollama.com/library/qwen2.5-coder, https://ollama.com/library/qwen2.5-coder/tags, https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF, https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct

### Qwen2.5 Coder 14B (https://localmodel.run/model/qwen2.5-coder-14b)
- Params: 14B
- Q4_K_M on disk: 8.37 GB; Q8_0: 14.62 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~10.1 GB
- Context: 32k; Ollama: qwen2.5-coder:14b; Released: 2024-11
- Sources: https://ollama.com/library/qwen2.5-coder, https://ollama.com/library/qwen2.5-coder/tags, https://huggingface.co/bartowski/Qwen2.5-Coder-14B-Instruct-GGUF, https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct

### Qwen2.5 Coder 32B (https://localmodel.run/model/qwen2.5-coder-32b)
- Params: 32B
- Q4_K_M on disk: 18.49 GB; Q8_0: 32.43 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~20.7 GB
- Context: 32k; Ollama: qwen2.5-coder:32b; Released: 2024-11
- Sources: https://ollama.com/library/qwen2.5-coder, https://ollama.com/library/qwen2.5-coder/tags, https://huggingface.co/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF, https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct

### Mistral Nemo 12B (https://localmodel.run/model/mistral-nemo-12b)
- Params: 12.2B
- Q4_K_M on disk: 6.96 GB; Q8_0: 12.13 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~8.6 GB
- Context: 128k; Ollama: mistral-nemo:12b; Released: 2024-07
- Sources: https://ollama.com/library/mistral-nemo, https://ollama.com/library/mistral-nemo/tags, https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF, https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407

### Mixtral 8x7B (https://localmodel.run/model/mixtral-8x7b)
- Params: 46.7B (MoE, 12.9B active per token)
- Q4_K_M on disk: 26.49 GB; Q8_0: 46.22 GB
- Est.
memory to run (Q4_K_M, 4k ctx): ~28.9 GB
- Context: 32k; Ollama: mixtral:8x7b; Released: 2023-12
- Sources: https://ollama.com/library/mixtral, https://ollama.com/library/mixtral/tags, https://huggingface.co/MaziyarPanahi/Mixtral-8x7B-Instruct-v0.1-GGUF, https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

### DeepSeek R1 (https://localmodel.run/model/deepseek-r1)
- Params: 671B (MoE, 37B active per token)
- Q4_K_M on disk: 376.66 GB; Q8_0: n/a
- Est. memory to run (Q4_K_M, 4k ctx): ~383.7 GB
- Context: 128k; Ollama: deepseek-r1:671b; Released: 2025-01
- Sources: https://ollama.com/library/deepseek-r1, https://ollama.com/library/deepseek-r1/tags, https://huggingface.co/unsloth/DeepSeek-R1-GGUF, https://huggingface.co/deepseek-ai/DeepSeek-R1

### DeepSeek V3 (https://localmodel.run/model/deepseek-v3)
- Params: 671B (MoE, 37B active per token)
- Q4_K_M on disk: 376.66 GB; Q8_0: n/a
- Est. memory to run (Q4_K_M, 4k ctx): ~383.7 GB
- Context: 128k; Ollama: deepseek-v3:671b; Released: 2025-03
- Sources: https://ollama.com/library/deepseek-v3, https://ollama.com/library/deepseek-v3/tags, https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF, https://huggingface.co/deepseek-ai/DeepSeek-V3-0324

### Qwen3 235B A22B (https://localmodel.run/model/qwen3-235b-a22b)
- Params: 235B (MoE, 22B active per token)
- Q4_K_M on disk: 132.39 GB; Q8_0: n/a
- Est. memory to run (Q4_K_M, 4k ctx): ~136.9 GB
- Context: 128k; Ollama: qwen3:235b; Released: 2025-04
- Sources: https://ollama.com/library/qwen3, https://ollama.com/library/qwen3/tags, https://huggingface.co/unsloth/Qwen3-235B-A22B-GGUF, https://huggingface.co/Qwen/Qwen3-235B-A22B

### Llama 4 Scout (https://localmodel.run/model/llama-4-scout)
- Params: 109B (MoE, 17B active per token)
- Q4_K_M on disk: 60.87 GB; Q8_0: n/a
- Est.
memory to run (Q4_K_M, 4k ctx): ~64.2 GB
- Context: 128k; Ollama: llama4:scout; Released: 2025-04
- Sources: https://ollama.com/library/llama4, https://ollama.com/library/llama4/tags, https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF, https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct

### gpt-oss 20B (https://localmodel.run/model/gpt-oss-20b)
- Params: 21B (MoE, 3.6B active per token)
- Q4_K_M on disk: 11.28 GB; Q8_0: n/a
- Est. memory to run (Q4_K_M, 4k ctx): ~13.2 GB
- Context: 128k; Ollama: gpt-oss:20b; Released: 2025-08
- Sources: https://ollama.com/library/gpt-oss, https://ollama.com/library/gpt-oss/tags, https://huggingface.co/ggml-org/gpt-oss-20b-GGUF, https://huggingface.co/openai/gpt-oss-20b

### gpt-oss 120B (https://localmodel.run/model/gpt-oss-120b)
- Params: 117B (MoE, 5.1B active per token)
- Q4_K_M on disk: 59.03 GB; Q8_0: n/a
- Est. memory to run (Q4_K_M, 4k ctx): ~62.4 GB
- Context: 128k; Ollama: gpt-oss:120b; Released: 2025-08
- Sources: https://ollama.com/library/gpt-oss, https://ollama.com/library/gpt-oss/tags, https://huggingface.co/ggml-org/gpt-oss-120b-GGUF, https://huggingface.co/openai/gpt-oss-120b

### Yi 1.5 34B (https://localmodel.run/model/yi-1.5-34b)
- Params: 34B
- Q4_K_M on disk: 19.24 GB; Q8_0: 34.03 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~21.4 GB
- Context: 32k; Ollama: n/a; Released: 2024-05
- Sources: https://huggingface.co/bartowski/Yi-1.5-34B-Chat-GGUF, https://huggingface.co/01-ai/Yi-1.5-34B-Chat

### Command R 35B (https://localmodel.run/model/command-r-35b)
- Params: 35B
- Q4_K_M on disk: 20.05 GB; Q8_0: 34.63 GB
- Est.
memory to run (Q4_K_M, 4k ctx): ~22.3 GB
- Context: 128k; Ollama: command-r:35b; Released: 2024-03
- Sources: https://ollama.com/library/command-r, https://ollama.com/library/command-r/tags, https://huggingface.co/bartowski/c4ai-command-r-v01-GGUF, https://huggingface.co/CohereForAI/c4ai-command-r-v01

### GLM-4 9B (https://localmodel.run/model/glm-4-9b)
- Params: 9B
- Q4_K_M on disk: 5.82 GB; Q8_0: 9.31 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~7.3 GB
- Context: 128k; Ollama: glm4:9b; Released: 2024-06
- Sources: https://ollama.com/library/glm4, https://ollama.com/library/glm4/tags, https://huggingface.co/bartowski/glm-4-9b-chat-GGUF, https://huggingface.co/THUDM/glm-4-9b-chat

### Falcon3 10B (https://localmodel.run/model/falcon3-10b)
- Params: 10B
- Q4_K_M on disk: 5.86 GB; Q8_0: 10.2 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~7.5 GB
- Context: 32k; Ollama: falcon3:10b; Released: 2024-12
- Sources: https://ollama.com/library/falcon3, https://ollama.com/library/falcon3/tags, https://huggingface.co/bartowski/Falcon3-10B-Instruct-GGUF, https://huggingface.co/tiiuae/Falcon3-10B-Instruct

### Granite 4.0 H Small (https://localmodel.run/model/granite-4.0-h-small)
- Params: 32B (MoE, 9B active per token)
- Q4_K_M on disk: 18.23 GB; Q8_0: 31.91 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~20.4 GB
- Context: 128k; Ollama: granite4:small-h; Released: 2025-10
- Sources: https://ollama.com/library/granite4, https://ollama.com/library/granite4/tags, https://huggingface.co/unsloth/granite-4.0-h-small-GGUF, https://huggingface.co/ibm-granite/granite-4.0-h-small

### SmolLM3 3B (https://localmodel.run/model/smollm3-3b)
- Params: 3B
- Q4_K_M on disk: 1.78 GB; Q8_0: 3.05 GB
- Est.
memory to run (Q4_K_M, 4k ctx): ~3 GB
- Context: 128k; Ollama: n/a; Released: 2025-07
- Sources: https://huggingface.co/unsloth/SmolLM3-3B-GGUF, https://huggingface.co/HuggingFaceTB/SmolLM3-3B

### Qwen2.5-VL 3B (https://localmodel.run/model/qwen2.5-vl-3b)
- Params: 3.75B
- Q4_K_M on disk: 3.05 GB; Q8_0: 5.1 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~4.4 GB
- Context: 32k; Ollama: n/a; Released: 2025-01
- Sources: https://huggingface.co/ggml-org/Qwen2.5-VL-3B-Instruct-GGUF, https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct

### Qwen2.5-VL 7B (https://localmodel.run/model/qwen2.5-vl-7b)
- Params: 8.29B
- Q4_K_M on disk: 5.62 GB; Q8_0: 9.59 GB
- Est. memory to run (Q4_K_M, 4k ctx): ~7.1 GB
- Context: 32k; Ollama: n/a; Released: 2025-01
- Sources: https://huggingface.co/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF, https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct

### Llama 3.2 Vision 11B (https://localmodel.run/model/llama-3.2-vision-11b)
- Params: 10.7B
- Q4_K_M on disk: 7.36 GB; Q8_0: 11.49 GB
- Est.
memory to run (Q4_K_M, 4k ctx): ~9 GB
- Context: 128k; Ollama: llama3.2-vision:11b; Released: 2024-09
- Sources: https://ollama.com/library/llama3.2-vision, https://ollama.com/library/llama3.2-vision/tags, https://huggingface.co/leafspark/Llama-3.2-11B-Vision-Instruct-GGUF, https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct

## Image generation models (peak VRAM consumed)

### Stable Diffusion 1.5 (https://localmodel.run/model/stable-diffusion-1-5)
- Params: 860M
- Peak VRAM: ~3.7 GB at fp16; ~4 GB all-resident; ~2 GB with aggressive offload
- License: CreativeML OpenRAIL-M (commercial use: yes); Released: 2022-10
- Sources: https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5, https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Optimizations, https://huggingface.co/Comfy-Org/stable-diffusion-v1-5-archive/blob/main/v1-5-pruned-emaonly-fp16.safetensors

### Stable Diffusion XL 1.0 (https://localmodel.run/model/sdxl-1-0)
- Params: 2.6B
- Peak VRAM: ~7.5 GB at fp16; ~8.5 GB all-resident; ~4 GB with aggressive offload
- License: CreativeML OpenRAIL++-M (commercial use: yes); Released: 2023-07
- Sources: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0, https://stability.ai/news-updates/sdxl-09-stable-diffusion, https://github.com/Comfy-Org/ComfyUI/issues/2855, https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Optimum-SDXL-Usage

### Stable Diffusion 3.5 Large (https://localmodel.run/model/stable-diffusion-3-5-large)
- Params: 8.1B
- Peak VRAM: ~7 GB at Q4 GGUF; ~19 GB all-resident; ~5 GB with aggressive offload
- License: Stability Community License (commercial use: conditional); Released: 2024-10
- Sources: https://stability.ai/news-updates/introducing-stable-diffusion-3-5, https://huggingface.co/city96/stable-diffusion-3.5-large-gguf, https://stability.ai/news-updates/stable-diffusion-35-models-optimized-with-tensorrt-deliver-2x-faster-performance-and-40-less-memory-on-nvidia-rtx-gpus,
https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/stable_diffusion_3

### FLUX.1 dev (https://localmodel.run/model/flux-1-dev)
- Params: 12B
- Peak VRAM: ~6.5 GB at Q4 GGUF; ~33 GB all-resident; ~3 GB with aggressive offload
- License: FLUX.1-dev Non-Commercial License (commercial use: no); Released: 2024-08
- Sources: https://huggingface.co/black-forest-labs/FLUX.1-dev, https://huggingface.co/city96/FLUX.1-dev-gguf, https://huggingface.co/city96/FLUX.1-dev-gguf/discussions/9, https://huggingface.co/docs/diffusers/optimization/memory, https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/LICENSE.md

### FLUX.1 schnell (https://localmodel.run/model/flux-1-schnell)
- Params: 12B
- Peak VRAM: ~6.5 GB at Q4 GGUF; ~33 GB all-resident; ~3 GB with aggressive offload
- License: Apache-2.0 (commercial use: yes); Released: 2024-08
- Sources: https://huggingface.co/black-forest-labs/FLUX.1-schnell, https://huggingface.co/city96/FLUX.1-schnell-gguf, https://huggingface.co/docs/diffusers/optimization/memory

### Qwen-Image (https://localmodel.run/model/qwen-image)
- Params: 20B
- Peak VRAM: ~14 GB at Q4_K_M GGUF; ~57 GB all-resident; ~3 GB with aggressive offload
- License: Apache-2.0 (commercial use: yes); Released: 2025-08
- Sources: https://huggingface.co/Qwen/Qwen-Image, https://huggingface.co/city96/Qwen-Image-gguf, https://docs.comfy.org/tutorials/image/qwen/qwen-image, https://sandner.art/qwen-image-and-edit-local-gguf-generations-with-lightning/, https://github.com/QwenLM/Qwen-Image

## Video generation models (peak VRAM consumed)

### Wan 2.1 T2V 1.3B (https://localmodel.run/model/wan-2-1-t2v-1-3b)
- Params: 1.3B
- Peak VRAM: ~6 GB at Q4 GGUF; ~20 GB all-resident; ~5 GB with aggressive offload
- License: Apache-2.0 (commercial use: yes); Released: 2025-02
- Sources: https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B, https://huggingface.co/samuelchristlie/Wan2.1-T2V-1.3B-GGUF, https://huggingface.co/city96/umt5-xxl-encoder-gguf,
https://github.com/Wan-Video/Wan2.1

### Wan 2.1 T2V 14B (https://localmodel.run/model/wan-2-1-t2v-14b)
- Params: 14B
- Peak VRAM: ~12 GB at Q4 GGUF; ~40 GB all-resident; ~8 GB with aggressive offload
- License: Apache-2.0 (commercial use: yes); Released: 2025-02
- Sources: https://huggingface.co/Wan-AI/Wan2.1-T2V-14B, https://huggingface.co/city96/Wan2.1-T2V-14B-gguf, https://huggingface.co/docs/diffusers/v0.33.1/api/pipelines/wan

### Wan 2.2 TI2V 5B (https://localmodel.run/model/wan-2-2-ti2v-5b)
- Params: 5B
- Peak VRAM: ~8 GB at Q4 GGUF; ~24 GB all-resident; ~5 GB with aggressive offload
- License: Apache-2.0 (commercial use: yes); Released: 2025-07
- Sources: https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B, https://huggingface.co/QuantStack/Wan2.2-TI2V-5B-GGUF, https://docs.comfy.org/tutorials/video/wan/wan2_2, https://github.com/Wan-Video/Wan2.2

### Wan 2.2 T2V A14B (https://localmodel.run/model/wan-2-2-t2v-a14b)
- Params: 27B (MoE, 14B active)
- Peak VRAM: ~16 GB at Q4 GGUF; ~80 GB all-resident; ~8 GB with aggressive offload
- License: Apache-2.0 (commercial use: yes); Released: 2025-07
- Sources: https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B, https://huggingface.co/QuantStack/Wan2.2-T2V-A14B-GGUF, https://github.com/Wan-Video/Wan2.2

### LTX-Video 2B (https://localmodel.run/model/ltx-video-2b)
- Params: 2B
- Peak VRAM: ~10 GB at fp8 + offload; ~12 GB all-resident; ~6 GB with aggressive offload
- License: LTX-Video Open Weights (OpenRAIL-M) (commercial use: yes); Released: 2024-11
- Sources: https://huggingface.co/Lightricks/LTX-Video, https://huggingface.co/docs/diffusers/api/pipelines/ltx_video, https://huggingface.co/city96/LTX-Video-gguf

### LTX-Video 13B (https://localmodel.run/model/ltx-video-13b)
- Params: 13B
- Peak VRAM: ~20 GB at fp8; ~38 GB all-resident; ~12 GB with aggressive offload
- License: LTX-Video Open Weights (OpenRAIL-M) (commercial use: yes); Released: 2025-05
- Sources: https://huggingface.co/Lightricks/LTX-Video,
https://huggingface.co/Lightricks/LTX-Video-0.9.8-13B-distilled, https://huggingface.co/docs/diffusers/api/pipelines/ltx_video ### CogVideoX-5B (https://localmodel.run/model/cogvideox-5b) - Params: 5B - Peak VRAM: ~16 GB at INT8 / fp8; ~26 GB all-resident; ~5 GB with aggressive offload - License: CogVideoX License (commercial use: conditional); Released: 2024-08 - Sources: https://huggingface.co/THUDM/CogVideoX-5b, https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox, https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE ### CogVideoX-2B (https://localmodel.run/model/cogvideox-2b) - Params: 2B - Peak VRAM: ~8 GB at fp16 + offload; ~18 GB all-resident; ~4 GB with aggressive offload - License: Apache-2.0 (commercial use: yes); Released: 2024-08 - Sources: https://huggingface.co/THUDM/CogVideoX-2b, https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox ### HunyuanVideo (https://localmodel.run/model/hunyuanvideo) - Params: 13B - Peak VRAM: ~16 GB at Q4 GGUF; ~60 GB all-resident; ~8 GB with aggressive offload - License: Tencent Hunyuan Community License (commercial use: conditional); Released: 2024-12 - Sources: https://huggingface.co/tencent/HunyuanVideo, https://huggingface.co/city96/HunyuanVideo-gguf, https://github.com/Tencent-Hunyuan/HunyuanVideo ### Mochi 1 (https://localmodel.run/model/mochi-1) - Params: 10B - Peak VRAM: ~20 GB at fp8 + offload; ~60 GB all-resident; ~18 GB with aggressive offload - License: Apache-2.0 (commercial use: yes); Released: 2024-10 - Sources: https://huggingface.co/genmo/mochi-1-preview, https://huggingface.co/docs/diffusers/en/api/pipelines/mochi, https://huggingface.co/Comfy-Org/mochi_preview_repackaged ### Stable Video Diffusion (img2vid-XT) (https://localmodel.run/model/stable-video-diffusion) - Params: 1.5B - Peak VRAM: ~8 GB at fp16 + offload; ~22 GB all-resident; ~8 GB with aggressive offload - License: Stable Video Diffusion Community License (commercial use: conditional); Released: 2023-11 
- Sources: https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt, https://huggingface.co/docs/diffusers/using-diffusers/svd ## Audio & voice models (peak memory consumed) ### Whisper large-v3 (https://localmodel.run/model/whisper-large-v3) - Params: 1.55B - Peak memory: ~2.5 GB at int8 - License: MIT (commercial use: yes); Released: 2023-11 - Sources: https://github.com/openai/whisper, https://huggingface.co/openai/whisper-large-v3, https://github.com/SYSTRAN/faster-whisper, https://github.com/ggml-org/whisper.cpp ### Whisper large-v3-turbo (https://localmodel.run/model/whisper-large-v3-turbo) - Params: 809M - Peak memory: ~1.5 GB at int8 - License: MIT (commercial use: yes); Released: 2024-10 - Sources: https://huggingface.co/openai/whisper-large-v3-turbo, https://github.com/SYSTRAN/faster-whisper/issues/1030, https://github.com/ggml-org/whisper.cpp ### Whisper small (https://localmodel.run/model/whisper-small) - Params: 244M - Peak memory: ~0.85 GB at fp16 (whisper.cpp) - License: MIT (commercial use: yes); Released: 2022-09 - Sources: https://github.com/openai/whisper, https://github.com/ggml-org/whisper.cpp ### Kokoro-82M (https://localmodel.run/model/kokoro-82m) - Params: 82M - Peak memory: ~1 GB at fp32 - License: Apache-2.0 (commercial use: yes); Released: 2025-01 - Sources: https://huggingface.co/hexgrad/Kokoro-82M, https://huggingface.co/FluidInference/kokoro-82m-coreml, https://github.com/remsky/Kokoro-FastAPI, https://github.com/puff-dayo/Kokoro-82M-Android ### Orpheus 3B (https://localmodel.run/model/orpheus-3b) - Params: 3B - Peak memory: ~4 GB at Q4_K_M GGUF - License: Apache-2.0 (commercial use: yes); Released: 2025-03 - Sources: https://huggingface.co/canopylabs/orpheus-3b-0.1-ft, https://github.com/canopyai/Orpheus-TTS, https://huggingface.co/Mungert/orpheus-3b-0.1-ft-GGUF ### Bark (https://localmodel.run/model/bark) - Params: 900M - Peak memory: ~5 GB at fp32 - License: MIT (commercial use: yes); Released: 2023-04 - Sources: 
https://huggingface.co/suno/bark, https://huggingface.co/blog/optimizing-bark, https://github.com/suno-ai/bark ### Dia 1.6B (https://localmodel.run/model/dia-1-6b) - Params: 1.6B - Peak memory: ~10 GB at fp16 - License: Apache-2.0 (commercial use: yes); Released: 2025-04 - Sources: https://huggingface.co/nari-labs/Dia-1.6B, https://github.com/nari-labs/dia/issues/34 ### MusicGen small (https://localmodel.run/model/musicgen-small) - Params: 300M - Peak memory: ~3 GB at fp32 - License: CC-BY-NC-4.0 (commercial use: no); Released: 2023-06 - Sources: https://huggingface.co/facebook/musicgen-small, https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md ### MusicGen medium (https://localmodel.run/model/musicgen-medium) - Params: 1.5B - Peak memory: ~14 GB at fp32 - License: CC-BY-NC-4.0 (commercial use: no); Released: 2023-06 - Sources: https://huggingface.co/facebook/musicgen-medium, https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md ### MusicGen large (https://localmodel.run/model/musicgen-large) - Params: 3.3B - Peak memory: ~20 GB at fp32 - License: CC-BY-NC-4.0 (commercial use: no); Released: 2023-06 - Sources: https://huggingface.co/facebook/musicgen-large, https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md ### Stable Audio Open 1.0 (https://localmodel.run/model/stable-audio-open) - Params: 1.3B - Peak memory: ~15 GB at fp32 - License: Stability Community License (commercial use: conditional); Released: 2024-06 - Sources: https://huggingface.co/stabilityai/stable-audio-open-1.0, https://arxiv.org/html/2407.14358v1, https://huggingface.co/stabilityai/stable-audio-open-1.0/blob/main/LICENSE.md ## Devices (usable memory for model weights) ### Apple M1 (8GB) (https://localmodel.run/best-llm-for/apple-m1-8gb) - Memory: 8 GB unified; usable for weights: ~5.5 GB - Best runtime: Ollama (llama.cpp Metal backend) - Notes: Base MacBook Air M1 config. 
recommendedMaxWorkingSetSize ~66% of RAM for <64GB configs = ~5.3–5.5GB usable for model weights. macOS needs ~2–3GB for OS overhead on top. Practical limit: 3–4B parameter models at Q4_K_M. - Sources: https://support.apple.com/en-us/111883, https://developer.apple.com/forums/thread/732035, https://blog.peddals.com/en/fine-tune-vram-size-of-mac-for-llm/ ### Apple M2 (16GB) (https://localmodel.run/best-llm-for/apple-m2-16gb) - Memory: 16 GB unified; usable for weights: ~10.5 GB - Best runtime: Ollama (llama.cpp Metal backend) / MLX - Notes: Available in MacBook Pro 13-inch M2 (2022) and MacBook Air M2. recommendedMaxWorkingSetSize ~66% for <64GB = ~10.5GB. Fits 7–8B models at Q4_K_M comfortably. - Sources: https://support.apple.com/en-us/111869, https://developer.apple.com/forums/thread/732035, https://stencel.io/posts/apple-silicon-limitations-with-usage-on-local-llm%20.html ### Apple M3 Pro (18GB) (https://localmodel.run/best-llm-for/apple-m3-18gb) - Memory: 18 GB unified; usable for weights: ~12 GB - Best runtime: Ollama (llama.cpp Metal backend) / MLX - Notes: Base M3 Pro config in MacBook Pro 14/16-inch 2023. recommendedMaxWorkingSetSize ~66% for <64GB = ~12GB. Fits 13B models at Q4_K_M, 7B with headroom. - Sources: https://support.apple.com/en-us/117736, https://developer.apple.com/forums/thread/732035, https://blog.peddals.com/en/fine-tune-vram-size-of-mac-for-llm/ ### Apple M4 (16GB) (https://localmodel.run/best-llm-for/apple-m4-16gb) - Memory: 16 GB unified; usable for weights: ~10.5 GB - Best runtime: Ollama (MLX backend, preview) / MLX direct - Notes: Base Mac mini M4 (2024) and MacBook Pro 14-inch M4 config. recommendedMaxWorkingSetSize ~66% for <64GB = ~10.5GB. MLX backend now available in Ollama. Fits 7–8B models at Q4_K_M. 
- Sources: https://support.apple.com/en-us/121555, https://support.apple.com/en-us/121552, https://developer.apple.com/forums/thread/732035 ### Apple M4 (24GB) (https://localmodel.run/best-llm-for/apple-m4-24gb) - Memory: 24 GB unified; usable for weights: ~16 GB - Best runtime: Ollama (MLX backend, preview) / MLX direct - Notes: Configurable Mac mini M4 (2024). recommendedMaxWorkingSetSize ~66% for <64GB = ~16GB. Fits 13B models at Q4_K_M comfortably. - Sources: https://support.apple.com/en-us/121555, https://developer.apple.com/forums/thread/732035, https://stencel.io/posts/apple-silicon-limitations-with-usage-on-local-llm%20.html ### Apple M4 Pro (24GB) (https://localmodel.run/best-llm-for/apple-m4-pro-24gb) - Memory: 24 GB unified; usable for weights: ~16 GB - Best runtime: Ollama (MLX backend, preview) / MLX direct - Notes: Base M4 Pro config in Mac mini and MacBook Pro 14/16-inch 2024. Same 66% rule applies for <64GB = ~16GB. Higher memory bandwidth than base M4 chip benefits throughput. - Sources: https://support.apple.com/en-us/121553, https://support.apple.com/en-us/121555, https://developer.apple.com/forums/thread/732035 ### Apple M4 Pro (48GB) (https://localmodel.run/best-llm-for/apple-m4-pro-48gb) - Memory: 48 GB unified; usable for weights: ~32 GB - Best runtime: Ollama (MLX backend) / MLX direct - Notes: Top M4 Pro config in MacBook Pro 14/16-inch 2024 and Mac mini. recommendedMaxWorkingSetSize ~66% for <64GB = ~32GB. Fits 34B models at Q4_K_M. Strong sweet-spot for local LLM. - Sources: https://support.apple.com/en-us/121553, https://support.apple.com/en-us/121555, https://blog.peddals.com/en/fine-tune-vram-size-of-mac-for-llm/ ### Apple M4 Max (64GB) (https://localmodel.run/best-llm-for/apple-m4-max-64gb) - Memory: 64 GB unified; usable for weights: ~48 GB - Best runtime: MLX direct / Ollama (MLX backend) - Notes: Base M4 Max config. At 64GB, rule shifts to ~75% (per Apple Metal docs) = ~48GB usable. 546GB/s memory bandwidth. 
Comfortably fits 34B models; 70B at lower quant. - Sources: https://support.apple.com/en-us/121553, https://www.apple.com/newsroom/2024/10/apple-introduces-m4-pro-and-m4-max/, https://developer.apple.com/forums/thread/732035 ### Apple M4 Max (128GB) (https://localmodel.run/best-llm-for/apple-m4-max-128gb) - Memory: 128 GB unified; usable for weights: ~96 GB - Best runtime: MLX direct / Ollama (MLX backend) - Notes: Top M4 Max config. 75% rule at ≥64GB = 96GB confirmed usable (matches 128GB M1 Ultra precedent from stencel.io). Fits 70B models at Q8 (~75GB on disk); 70B at fp16 (~140GB) and 405B at Q4-class quants (well over 200GB) exceed the ~96GB budget. - Sources: https://support.apple.com/en-us/121553, https://www.apple.com/newsroom/2024/10/apple-introduces-m4-pro-and-m4-max/, https://stencel.io/posts/apple-silicon-limitations-with-usage-on-local-llm%20.html ### Apple M3 Ultra (256GB) (https://localmodel.run/best-llm-for/apple-m3-ultra-256gb) - Memory: 256 GB unified; usable for weights: ~192 GB - Best runtime: MLX direct / Ollama (MLX backend) - Notes: Max config Mac Studio M3 Ultra (2025). 75% rule at ≥64GB = 192GB usable. The 512GB config was discontinued in March 2026; 256GB remains available. Supports 405B-class open-weight models only at low-bit quants; 405B at Q4-class quants (well over 200GB) exceeds the ~192GB budget. - Sources: https://support.apple.com/en-us/122211, https://www.apple.com/newsroom/2025/03/apple-reveals-m3-ultra-taking-apple-silicon-to-a-new-extreme/, https://stencel.io/posts/apple-silicon-limitations-with-usage-on-local-llm%20.html ### Nvidia GeForce RTX 3060 (12GB) (https://localmodel.run/best-llm-for/nvidia-rtx-3060-12gb) - Memory: 12 GB vram; usable for weights: ~11 GB - Best runtime: Ollama (CUDA) / llama.cpp CUDA - Notes: 12GB GDDR6 on 192-bit bus. Unusually high VRAM for its tier. ~1GB reserved for driver/OS = ~11GB usable. Fits 7B Q4_K_M (4.1GB) and 13B Q4_K_M (7.4GB) with room. CUDA support via Ampere architecture. 
- Sources: https://www.nvidia.com/en-us/geforce/graphics-cards/30-series/rtx-3060-3060ti/, https://marketplace.nvidia.com/en-us/consumer/graphics-cards/msi-gaming-geforce-rtx-3060-12gb-15-gbps-gdrr6-192-bit-hdmi-dp-pcie-4-torx-twin-fan-ampere-oc-graphics-card/ ### Nvidia GeForce RTX 4060 Ti (16GB) (https://localmodel.run/best-llm-for/nvidia-rtx-4060-ti-16gb) - Memory: 16 GB vram; usable for weights: ~15 GB - Best runtime: Ollama (CUDA) / llama.cpp CUDA - Notes: 16GB GDDR6 on 128-bit bus (narrow bandwidth: 288 GB/s). ~1GB reserved = ~15GB usable. Bandwidth bottleneck limits throughput vs wider-bus alternatives. Fits 13B Q4_K_M easily; 34B at low quant. - Sources: https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/rtx-4060-4060ti/, https://www.techspot.com/specs/gpu/280961-nvidia-geforce-rtx-4060-ti-16gb.html ### Nvidia GeForce RTX 4070 (12GB) (https://localmodel.run/best-llm-for/nvidia-rtx-4070-12gb) - Memory: 12 GB vram; usable for weights: ~11 GB - Best runtime: Ollama (CUDA) / vLLM (Linux) - Notes: 12GB GDDR6X on 192-bit bus (~504 GB/s). ~1GB reserved = ~11GB usable. Better bandwidth than 4060 Ti 16GB despite less VRAM. Fits 7B comfortably, 13B Q4 squeezed. - Sources: https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/rtx-4070/, https://www.nvidia.com/en-us/geforce/news/rtx-40-series-vram-video-memory-explained/ ### Nvidia GeForce RTX 4080 (16GB) (https://localmodel.run/best-llm-for/nvidia-rtx-4080-16gb) - Memory: 16 GB vram; usable for weights: ~15 GB - Best runtime: vLLM (Linux) / Ollama (CUDA) - Notes: 16GB GDDR6X on 256-bit bus (~716 GB/s). ~1GB reserved = ~15GB usable. Good bandwidth for 13B models; 34B models require offloading. Strong for inference throughput. 
- Sources: https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/rtx-4080/, https://www.nvidia.com/en-us/geforce/news/rtx-40-series-vram-video-memory-explained/ ### Nvidia GeForce RTX 4090 (24GB) (https://localmodel.run/best-llm-for/nvidia-rtx-4090-24gb) - Memory: 24 GB vram; usable for weights: ~23 GB - Best runtime: vLLM (Linux) / Ollama (CUDA) - Notes: 24GB GDDR6X on 384-bit bus (~1008 GB/s). ~1GB reserved = ~23GB usable. Flagship Ada Lovelace. Fits 34B Q4_K_M (19GB); 70B requires offloading or very low quant. Best single-GPU consumer option of the Ada Lovelace generation for throughput. - Sources: https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/rtx-4090/, https://www.nvidia.com/en-us/geforce/news/rtx-40-series-vram-video-memory-explained/ ### Nvidia GeForce RTX 5090 (32GB) (https://localmodel.run/best-llm-for/nvidia-rtx-5090-32gb) - Memory: 32 GB vram; usable for weights: ~31 GB - Best runtime: vLLM (Linux) / Ollama (CUDA) - Notes: 32GB GDDR7 on 512-bit bus (~1792 GB/s). ~1GB reserved = ~31GB usable. Blackwell architecture. Fits 34B Q4_K_M (~19GB) with ample room for context; 34B Q8 (~36GB) and 70B Q4_K_M (42.52GB on disk per the Llama 3.3 70B entry) exceed ~31GB and need offloading or a lower quant. Best single-GPU consumer option as of 2026. - Sources: https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5090/, https://www.spheron.network/blog/nvidia-rtx-5090-specs/ ### Nvidia GeForce RTX 3090 (24GB) (https://localmodel.run/best-llm-for/nvidia-rtx-3090-24gb) - Memory: 24 GB vram; usable for weights: ~23 GB - Best runtime: vLLM (Linux) / Ollama (CUDA) - Notes: 24GB GDDR6X on 384-bit bus (~936 GB/s). ~1GB reserved = ~23GB usable. Ampere architecture. Excellent value-per-GB for LLM inference; same VRAM ceiling as RTX 4090 at lower bandwidth. Fits 34B Q4_K_M. 
- Sources: https://www.nvidia.com/en-us/geforce/graphics-cards/30-series/rtx-3090/, https://www.hardware-corner.net/gpu-llm-benchmarks/rtx-3090/ ### AMD Radeon RX 7900 XTX (24GB) (https://localmodel.run/best-llm-for/amd-rx-7900-xtx-24gb) - Memory: 24 GB vram; usable for weights: ~23 GB - Best runtime: Ollama (ROCm) / llama.cpp ROCm (Linux) - Notes: 24GB GDDR6 on 384-bit bus (~960 GB/s). ~1GB reserved = ~23GB usable. RDNA 3 architecture. ROCm support on Linux; ROCm on Windows is experimental/limited. The software stack for LLM inference is less mature than CUDA. Fits 34B Q4_K_M. Windows users typically prefer llama.cpp with the ROCm (HIP) or Vulkan backend. - Sources: https://www.notebookcheck.net/AMD-Radeon-RX-7900-XTX-with-24-GB-VRAM-review-Already-available-for-less-than-1000-Euros.810630.0.html, https://www.asus.com/us/motherboards-components/graphics-cards/asus/rx7900xtx-24g/techspec/, https://bestgpuforllm.com/articles/nvidia-vs-amd-for-llm/ ### 8GB RAM Laptop (CPU/iGPU only) (https://localmodel.run/best-llm-for/laptop-8gb) - Memory: 8 GB ram; usable for weights: ~5 GB - Best runtime: Ollama (llama.cpp backend) - Notes: OS overhead ~2-3GB (Windows/macOS), leaving ~5GB usable. Practical ceiling is 1B-3B Q4_K_M models. Llama 3.2 1B Q4_K_M ~0.8GB file, needs ~1.2GB at runtime. Phi-3 Mini 3.8B Q4_K_M ~2.4GB file, needs ~3.6GB at runtime, fits. 7B+ models will OOM. Usable figure is community estimate, not vendor spec. - Sources: https://ollama.com/library/llama3.2:1b, https://ollama.com/library/phi3:mini, https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF ### 16GB RAM Laptop (CPU/iGPU only) (https://localmodel.run/best-llm-for/laptop-16gb) - Memory: 16 GB ram; usable for weights: ~12 GB - Best runtime: Ollama (llama.cpp backend) - Notes: OS overhead ~2-4GB, leaving ~12GB usable. Practical ceiling is 7B-8B Q4_K_M models. Llama 3.1 8B Q4_K_M ~4.7GB file, needs ~7GB at runtime. Mistral 7B Q4_K_M ~4.1GB file, needs ~6GB at runtime. 13B models are marginal and will be slow. 
Usable figure is community estimate, not vendor spec. - Sources: https://ollama.com/library/llama3.1:8b, https://ollama.com/library/mistral:7b, https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF ### 32GB RAM Laptop (CPU/iGPU only) (https://localmodel.run/best-llm-for/laptop-32gb) - Memory: 32 GB ram; usable for weights: ~28 GB - Best runtime: Ollama (llama.cpp backend) - Notes: OS overhead ~2-4GB, leaving ~28GB usable. Practical ceiling is 13B-34B Q4_K_M models comfortably; 70B at Q2_K (lossy) marginally fits (~28GB). Llama 2 13B Q4_K_M ~7.8GB file, needs ~11.7GB. Llama 3.3 70B Q4_K_M ~43GB, does NOT fit; Q2_K ~24GB fits but quality degraded. Usable figure is community estimate, not vendor spec. - Sources: https://ollama.com/library/llama3.1:70b, https://ollama.com/library/mixtral:8x7b, https://huggingface.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF ### iPhone 15 Pro (https://localmodel.run/best-llm-for/iphone-15-pro) - Memory: 8 GB unified; usable for weights: ~4.5 GB - Best runtime: llama.cpp + Metal (via PocketPal or Off Grid app) - Notes: 8GB LPDDR5 unified memory confirmed via TechInsights teardown (Micron D1b LPDDR5 chips) and GSMArena. iOS reserves ~3-4GB for OS+system, leaving ~4-4.5GB usable for model weights (community estimate). Practical ceiling 1B-3B Q4_K_M on-device; Apple Intelligence on-device models (~3B-parameter class) run at ~3-4GB. CoreML path is the most memory-efficient but has the slowest generation speed; llama.cpp+Metal gives best throughput. Benchmarks for on-device LLM cite iPhone 15 Pro achieving ~20-30 tok/s on 3B Q4 models. 
- Sources: https://www.gsmarena.com/apple_iphone_15_pro-12178.php, https://wccftech.com/iphone-15-pro-ram-type-lpddr5-confirmed/, https://www.techinsights.com/blog/apple-iphone-15-pro-teardown ### iPhone 16 (https://localmodel.run/best-llm-for/iphone-16) - Memory: 8 GB unified; usable for weights: ~4.5 GB - Best runtime: llama.cpp + Metal (via PocketPal or Off Grid app) - Notes: All iPhone 16 models confirmed 8GB RAM per MacRumors (Sep 2024, citing Apple Intelligence requirement). Apple specs page confirms 8GB. Unified memory architecture; iOS reserves ~3-4GB, leaving ~4-4.5GB usable (community estimate). Identical LLM capability envelope to iPhone 15 Pro. On-device model ceiling: 1B-3B Q4_K_M class. Apple Intelligence on-device model is ~3B parameter class. - Sources: https://www.gsmarena.com/apple_iphone_16-12557.php, https://www.macrumors.com/2024/09/09/all-iphone-16-models-equipped-with-8gb-of-ram/, https://www.apple.com/iphone-16/specs/ ### iPhone 16 Pro (https://localmodel.run/best-llm-for/iphone-16-pro) - Memory: 8 GB unified; usable for weights: ~4.5 GB - Best runtime: llama.cpp + Metal (via PocketPal or Off Grid app) - Notes: 8GB unified memory confirmed (same MacRumors source as iPhone 16, all 4 iPhone 16 models have 8GB). A18 Pro chip with 6-core GPU gives higher inference throughput vs A16/A17 but same memory envelope. iOS overhead ~3-4GB; usable ~4.5GB (community estimate). Practical ceiling 1B-4B Q4_K_M. Off Grid app (llama.cpp+Metal) and PocketPal are primary consumer runtimes. Apple Intelligence 3B on-device model fits comfortably. 
- Sources: https://www.gsmarena.com/apple_iphone_16_pro-12560.php, https://www.macrumors.com/2024/09/09/all-iphone-16-models-equipped-with-8gb-of-ram/, https://www.apple.com/iphone-16-pro/specs/ ### iPad Pro M4 (16GB, 1TB/2TB config) (https://localmodel.run/best-llm-for/ipad-pro-m4-16gb) - Memory: 16 GB unified; usable for weights: ~12 GB - Best runtime: MLX (via Python or Swift; mlx-lm package) - Notes: iPad, not iPhone, categorized as 'iphone' due to schema enum limitation. 16GB unified memory only available on 1TB and 2TB storage configurations per Apple Support KB article (support.apple.com/en-us/119891). 256GB/512GB configs ship with 8GB. M4 chip with 10-core GPU. iPadOS overhead ~3-4GB; ~12GB usable for model weights (community estimate, ~75% of total per Apple Silicon architecture). MLX framework (Apple's ML framework for Apple Silicon) achieves ~61 tok/s on Llama 3.2 3B. Practical ceiling: 7B-13B Q4_K_M models fit; 30B Q4_K_M (~19GB) exceeds the ~12GB usable and does not fit. - Sources: https://support.apple.com/en-us/119891, https://www.apple.com/ipad-pro/specs/, https://www.gsmarena.com/apple_ipad_pro_13_(2024)-12342.php ### Google Pixel 9 Pro (https://localmodel.run/best-llm-for/pixel-9-pro) - Memory: 16 GB ram; usable for weights: ~10.5 GB - Best runtime: llama.cpp (PocketPal) or MLC-LLM (Mali GPU path) - Notes: 16GB LPDDR5X confirmed via GSMArena. Android Police investigation confirmed 2.6GB hardware-partitioned for Google Gemini/on-device ML features, not available to apps. Android OS overhead ~3-4GB additional; usable estimate ~10-10.5GB (community estimate). Practical ceiling: 7B Q4_K_M fits (~7GB runtime); 13B Q4_K_M (~11.7GB) exceeds the usable estimate and will likely OOM. MLC-LLM can use the Tensor G4's Mali GPU for faster inference vs CPU-only path. Tensor G4 chip handles 1B-4B models well natively via Google AI Edge. 
- Sources: https://www.gsmarena.com/google_pixel_9_pro-12424.php, https://www.androidpolice.com/google-pixel-9-pro-ram-partition-gemini/, https://mlc.ai/blog/2024/05/08/mlc-llm-android ### Samsung Galaxy S24 Ultra (https://localmodel.run/best-llm-for/samsung-s24-ultra) - Memory: 12 GB ram; usable for weights: ~8.5 GB - Best runtime: llama.cpp (PocketPal) or MLC-LLM (Adreno GPU path) - Notes: 12GB LPDDR5X at 4800MHz confirmed via GSMArena review page (explicitly states RAM type and speed). All S24 Ultra configs are 12GB (no 16GB tier unlike S25 Ultra). Snapdragon 8 Gen 3 (US). Android OS overhead ~3-4GB; usable ~8-8.5GB (community estimate). Practical ceiling: 7B Q4_K_M (~7GB runtime) fits; 13B Q4_K_M (~11.7GB) will OOM. MLC-LLM uses Adreno 750 GPU for faster inference. Samsung Galaxy AI on-device features use 1B-3B class models. - Sources: https://www.gsmarena.com/samsung_galaxy_s24_ultra-12419.php, https://www.gsmarena.com/samsung_galaxy_s24_ultra-review-2754p5.php, https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF ### Samsung Galaxy S25 Ultra (16GB, 1TB config only) (https://localmodel.run/best-llm-for/samsung-s25-ultra-16gb) - Memory: 16 GB ram; usable for weights: ~12 GB - Best runtime: llama.cpp (PocketPal) or MLC-LLM (Adreno GPU path) - Notes: 16GB LPDDR5X (Micron, 12nm process) confirmed via SammyFans and SammyGuru. CRITICAL: 16GB RAM is ONLY available on the 1TB storage configuration. 256GB and 512GB configs ship with 12GB RAM. Android OS overhead ~3-4GB; usable ~12GB (community estimate). Snapdragon 8 Elite. Practical ceiling: 7B-13B Q4_K_M. Llama 3.1 8B Q4_K_M (~7GB runtime) fits with headroom; 13B Q4_K_M (~11.7GB) marginal. MLC-LLM + Adreno 830 GPU recommended for best throughput. 
- Sources: https://www.gsmarena.com/samsung_galaxy_s25_ultra-12750.php, https://www.sammyfans.com/2025/01/22/galaxy-s25-ultra-ram-lpddr5x-micron/, https://www.sammyguru.com/samsung-galaxy-s25-ultra-ram-specs-confirmed/ ### Generic Android Phone (8GB RAM) (https://localmodel.run/best-llm-for/android-generic-8gb) - Memory: 8 GB ram; usable for weights: ~4.5 GB - Best runtime: llama.cpp (PocketPal or SmolChat) - Notes: Representative of mid-range Android devices (e.g., Pixel 8a, OnePlus 12R, Samsung A55). RAM type varies (LPDDR4X on lower-end, LPDDR5 on higher-end). Android OS overhead ~3-4GB; usable ~4-4.5GB (community estimate). Practical ceiling: 1B-3B Q4_K_M only. Llama 3.2 1B Q4_K_M (~1.2GB runtime) fits well. Phi-3 Mini 3.8B Q4_K_M (~3.6GB runtime) fits with minimal headroom. 7B+ models will reliably OOM. PocketPal (llama.cpp backend) is the recommended consumer app for CPU inference on Android. - Sources: https://ollama.com/library/llama3.2:1b, https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF, https://github.com/a-ghorbani/pocketpal-ai ### Generic Android Phone (12GB RAM) (https://localmodel.run/best-llm-for/android-generic-12gb) - Memory: 12 GB ram; usable for weights: ~8.5 GB - Best runtime: llama.cpp (PocketPal) or MLC-LLM - Notes: Representative of upper-mid and flagship Android devices (e.g., OnePlus 12, Pixel 9, Samsung S24+). LPDDR5 or LPDDR5X depending on SoC. Android OS overhead ~3-4GB; usable ~8-8.5GB (community estimate). Practical ceiling: 7B Q4_K_M models. Llama 3.1 8B Q4_K_M (~7GB runtime) fits with ~1.5GB headroom. 13B Q4_K_M (~11.7GB) will OOM. MLC-LLM leverages GPU (Adreno or Mali) for faster inference if supported. CPU-only inference on 7B is slow (~3-8 tok/s); GPU path via MLC-LLM can reach ~15-25 tok/s on flagship Adreno. 
- Sources: https://ollama.com/library/llama3.1:8b, https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, https://github.com/a-ghorbani/pocketpal-ai ### Apple M5 (16GB) (https://localmodel.run/best-llm-for/apple-m5-16gb) - Memory: 16 GB unified; usable for weights: ~10.5 GB - Best runtime: MLX direct / Ollama (MLX backend) - Notes: Base MacBook Pro 14-inch M5 (2025). recommendedMaxWorkingSetSize ~66% for <64GB = ~10.5GB usable. Fits 7-8B models at Q4_K_M. M5 adds a Neural Accelerator per GPU core for faster prompt processing. - Sources: https://support.apple.com/en-us/125405, https://www.apple.com/macbook-pro/specs/, https://developer.apple.com/forums/thread/732035 ### Apple M5 (32GB) (https://localmodel.run/best-llm-for/apple-m5-32gb) - Memory: 32 GB unified; usable for weights: ~21 GB - Best runtime: MLX direct / Ollama (MLX backend) - Notes: Top base-M5 config (MacBook Pro 14-inch M5, 2025); 32GB is the max on the base M5 chip. ~66% for <64GB = ~21GB usable. Fits 14B at Q4_K_M comfortably, 32B tight. - Sources: https://support.apple.com/en-us/125405, https://www.apple.com/macbook-pro/specs/, https://developer.apple.com/forums/thread/732035 ### Apple M5 Pro (48GB) (https://localmodel.run/best-llm-for/apple-m5-pro-48gb) - Memory: 48 GB unified; usable for weights: ~32 GB - Best runtime: MLX direct / Ollama (MLX backend) - Notes: M5 Pro (MacBook Pro 14/16-inch). Configurable to 64GB; 48GB config shown. ~66% for <64GB = ~32GB usable. ~307GB/s memory bandwidth. Fits 34B at Q4_K_M. - Sources: https://support.apple.com/en-us/126318, https://www.apple.com/macbook-pro/specs/, https://blog.peddals.com/en/fine-tune-vram-size-of-mac-for-llm/ ### Apple M5 Max (128GB) (https://localmodel.run/best-llm-for/apple-m5-max-128gb) - Memory: 128 GB unified; usable for weights: ~96 GB - Best runtime: MLX direct / Ollama (MLX backend) - Notes: M5 Max top config. 75% rule at >=64GB = ~96GB usable. ~614GB/s memory bandwidth. 
Fits 70B at Q8 (~75GB on disk); a 235B-class MoE at Q4 (well over 100GB) exceeds ~96GB and needs a lower quant. - Sources: https://support.apple.com/en-us/126318, https://support.apple.com/en-us/126319, https://www.apple.com/macbook-pro/specs/ ### iPhone 17 (https://localmodel.run/best-llm-for/iphone-17) - Memory: 8 GB unified; usable for weights: ~4.5 GB - Best runtime: llama.cpp + Metal (via PocketPal or Off Grid app) - Notes: iPhone 17 confirmed 8GB RAM (MacRumors, Sep 2025). A19 chip. iOS reserves ~3-4GB; ~4.5GB usable for weights (community estimate). On-device ceiling 1B-4B Q4_K_M. - Sources: https://www.macrumors.com/2025/09/09/iphone-17-pro-iphone-air-ram-amounts/, https://www.apple.com/iphone-17/specs/ ### iPhone 17 Pro (https://localmodel.run/best-llm-for/iphone-17-pro) - Memory: 12 GB unified; usable for weights: ~8 GB - Best runtime: llama.cpp + Metal (via PocketPal or Off Grid app) - Notes: iPhone 17 Pro confirmed 12GB RAM (MacRumors, Sep 2025), up from 8GB on iPhone 16 Pro. A19 Pro. iOS overhead ~3.5-4GB; ~8GB usable (community estimate). The extra RAM lifts the on-device ceiling toward 7-8B Q4_K_M, a first for iPhone. - Sources: https://www.macrumors.com/2025/09/09/iphone-17-pro-iphone-air-ram-amounts/, https://www.apple.com/iphone-17-pro/specs/ ### iPhone Air (https://localmodel.run/best-llm-for/iphone-air) - Memory: 12 GB unified; usable for weights: ~8 GB - Best runtime: llama.cpp + Metal (via PocketPal or Off Grid app) - Notes: iPhone Air confirmed 12GB RAM (MacRumors, Sep 2025). A19 Pro. iOS overhead ~3.5-4GB; ~8GB usable (community estimate). On-device ceiling ~7-8B Q4_K_M. - Sources: https://www.macrumors.com/2025/09/09/iphone-17-pro-iphone-air-ram-amounts/, https://www.apple.com/iphone-air/specs/ ### Google Pixel 10 Pro (https://localmodel.run/best-llm-for/pixel-10-pro) - Memory: 16 GB ram; usable for weights: ~10.5 GB - Best runtime: llama.cpp (PocketPal) or MLC-LLM (GPU path, where supported; Tensor chips do not use Adreno GPUs) - Notes: 16GB LPDDR5X confirmed via GSMArena. Tensor G5. 
Memory is reserved for on-device Gemini Nano plus ~3-4GB Android OS overhead; usable ~10-10.5GB (community estimate). Practical ceiling 7B Q4_K_M; 13B tight. - Sources: https://www.gsmarena.com/google_pixel_10_pro_5g-13987.php, https://store.google.com/product/pixel_10_pro_specs ### Samsung Galaxy S26 Ultra (16GB, 1TB config) (https://localmodel.run/best-llm-for/samsung-s26-ultra) - Memory: 16 GB ram; usable for weights: ~12 GB - Best runtime: llama.cpp (PocketPal) or MLC-LLM (Adreno GPU path) - Notes: 16GB LPDDR5X on the 1TB config (256/512GB ship 12GB) per GSMArena and SammyFans. Snapdragon 8 Elite Gen 5. Android overhead ~3-4GB; usable ~12GB (community estimate). Fits 7B-13B Q4_K_M. MLC-LLM + Adreno for best throughput. - Sources: https://www.gsmarena.com/samsung_galaxy_s26_ultra_5g-14320.php, https://www.sammyfans.com/2026/03/13/samsung-galaxy-s26-ultra-details/
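
The device notes above lean on two rules of thumb: Apple Silicon exposes roughly 66% of unified memory to the GPU below 64GB and roughly 75% at 64GB and above, and discrete GPUs lose about 1GB to the driver and OS. The Python sketch below applies those rules to a few "Est. memory to run" figures copied from the Models section. It is only an illustration under those assumptions: the function names and the `MODELS` sample list are hypothetical, not part of any localmodel.run API, and the results are estimates just like the figures they come from.

```python
"""Sketch: apply the usable-memory rules of thumb from the device notes."""

# Sample rows copied from the Models section: (name, est. GB to run at Q4_K_M, 4k ctx).
# The list itself is a hypothetical stand-in for the full dataset.
MODELS = [
    ("Llama 3.2 3B", 3.2),
    ("Llama 3.1 8B", 6.4),
    ("Mistral Small 3 24B", 16.3),
    ("Llama 3.3 70B", 45.3),
]


def usable_unified_gb(total_gb: float) -> float:
    """Apple Silicon rule of thumb: ~66% below 64 GB, ~75% at 64 GB and above."""
    fraction = 0.75 if total_gb >= 64 else 0.66
    return total_gb * fraction


def usable_vram_gb(total_gb: float, reserved_gb: float = 1.0) -> float:
    """Discrete GPU rule of thumb: total VRAM minus ~1 GB for driver/OS."""
    return max(total_gb - reserved_gb, 0.0)


def fits(est_memory_gb: float, usable_gb: float) -> bool:
    """A model 'fits' when its estimated run memory is within the usable budget."""
    return est_memory_gb <= usable_gb


def models_that_fit(usable_gb: float) -> list[str]:
    """Names of the sample models whose estimate fits the given budget."""
    return [name for name, est in MODELS if fits(est, usable_gb)]


if __name__ == "__main__":
    devices = [
        ("Apple M1 (8GB)", usable_unified_gb(8)),
        ("Apple M4 Max (64GB)", usable_unified_gb(64)),
        ("Nvidia GeForce RTX 5090 (32GB)", usable_vram_gb(32)),
    ]
    for label, usable in devices:
        print(f"{label}: ~{usable:.1f} GB usable -> {models_that_fit(usable)}")
```

A fit here says nothing about speed or about longer contexts: the estimates assume a 4k context, and the KV cache grows with context length, so treat a borderline fit as a likely miss.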