5 platforms · runtimes guide

Best tool to run LLMs locally

The runtime that is fastest on a Mac will fail to start on Windows. Here is what to install, who each tool is actually for, and the one mistake people make on each platform.

macOS runtime guide

Beginner pick

LM Studio

Polished GUI, ships MLX on Apple Silicon, one-click model downloads.

Power user

mlx-lm

Apple's MLX framework, usually the fastest on Apple Silicon for the same quant.

$ pip install mlx-lm && mlx_lm.generate --model mlx-community/Llama-3.1-8B-Instruct-4bit

Runtime	What it is	Status
MLX / mlx-lm	Apple-native inference, usually fastest on M-series	stable
LM Studio	GUI app, MLX + GGUF backends	stable
Ollama	CLI/server, simple ollama run UX	stable
llama.cpp (Metal)	Reference cross-platform engine	stable

Gotcha: vLLM is NOT a Mac tool, it is a CUDA/Linux serving engine. Unified memory is not a fixed VRAM slice; ~70% is usable for weights.

Realistic ceiling M-series with 16GB runs up to ~8B comfortably; 64GB+ runs 70B at Q4.

Windows runtime guide

Beginner pick

LM Studio

Best GUI on Windows, auto-detects CUDA/Vulkan backends.

Power user

Ollama (CUDA)

Scriptable server; CUDA path is fastest on NVIDIA.

$ ollama run llama3.1:8b

Runtime	What it is	Status
LM Studio	GUI, CUDA/Vulkan/ROCm	stable
Ollama	CLI/server	stable
llama.cpp	Reference engine	stable

Gotcha: AMD GPUs run via Vulkan/ROCm at roughly half CUDA throughput. NVIDIA is the smooth path on Windows.

Realistic ceiling 12GB VRAM runs up to ~14B at Q4; 24GB runs 32B at Q4.

Linux runtime guide

Beginner pick

Ollama

One-line install, simple single-user UX.

Power user

vLLM

Highest throughput for serving many requests; PagedAttention + continuous batching.

$ pip install vllm && vllm serve meta-llama/Llama-3.1-8B-Instruct

Runtime	What it is	Status
vLLM	OpenAI-compatible serving engine	stable
Ollama	CLI/server	stable
llama.cpp	Reference engine	stable

Gotcha: vLLM shines for multi-user serving/throughput. For a single local chat, Ollama or llama.cpp is simpler and lighter.

Realistic ceiling Scales with VRAM; multi-GPU for 70B+ at higher precision.

iOS runtime guide

Beginner pick

Apple Foundation Models

Built into iOS 26, ~3B on-device model, zero download, fully private.

Power user

PocketPal AI

Run any GGUF from HuggingFace fully offline.

Runtime	What it is	Status
Apple Foundation Models	On-device ~3B model API (iOS 26+)	stable
PocketPal AI	llama.cpp wrapper app	stable
LLM Farm	Open-source on-device runner	stable

Gotcha: Phones realistically run 1B-4B class models. Anything larger thermally throttles or OOMs.

Realistic ceiling 1B-4B at Q4 on 8GB iPhones; iPad Pro M4 (16GB) handles more.

Android runtime guide

Beginner pick

PocketPal AI

Polished app, download GGUF and run offline.

Power user

MLC LLM / LiteRT-LM

GPU/NPU acceleration paths for supported chips.

Runtime	What it is	Status
PocketPal AI	llama.cpp/llama.rn app	stable
MLC LLM	Compiled on-device inference	stable
Google AI Edge / LiteRT-LM	On-device LLM runtime for Gemma	stable

Gotcha: NPU acceleration is limited and chip-specific; most apps run on CPU. Expect 1B-4B class.

Realistic ceiling 1B-4B at Q4 on 8-16GB phones.

FAQ

Is vLLM good for Mac?

No. vLLM is a CUDA/Linux serving engine built for throughput across many requests. On Apple Silicon use MLX (mlx-lm) or Ollama/LM Studio instead.

Ollama or LM Studio?

LM Studio is the friendliest GUI and ships MLX on Apple Silicon. Ollama is a scriptable CLI/server that is great for integrating local models into tools. Both support GGUF models; on Apple Silicon, LM Studio defaults to its MLX backend instead of llama.cpp.

What is the fastest way to run a model on Apple Silicon?

mlx-lm. Same model, same quant, consistently higher tokens per second than Ollama's llama.cpp Metal backend on M-series chips. Ollama has an MLX backend in preview that narrows the gap, but mlx-lm still leads. LM Studio can use the same MLX backend if you want a GUI.

Best tool to run LLMs locally

Sources