5 platforms · runtimes guide
Best tool to run LLMs locally
The runtime that is fastest on a Mac will fail to start on Windows. Here is what to install, who each tool is actually for, and the one mistake people make on each platform.
Polished GUI, ships MLX on Apple Silicon, one-click model downloads.
Apple's MLX framework, usually the fastest on Apple Silicon for the same quant.
pip install mlx-lm && mlx_lm.generate --model mlx-community/Llama-3.1-8B-Instruct-4bit | Runtime | What it is | Status |
|---|---|---|
| MLX / mlx-lm | Apple-native inference, usually fastest on M-series | stable |
| LM Studio | GUI app, MLX + GGUF backends | stable |
| Ollama | CLI/server, simple ollama run UX | stable |
| llama.cpp (Metal) | Reference cross-platform engine | stable |
Best GUI on Windows, auto-detects CUDA/Vulkan backends.
Scriptable server; CUDA path is fastest on NVIDIA.
ollama run llama3.1:8b | Runtime | What it is | Status |
|---|---|---|
| LM Studio | GUI, CUDA/Vulkan/ROCm | stable |
| Ollama | CLI/server | stable |
| llama.cpp | Reference engine | stable |
One-line install, simple single-user UX.
Highest throughput for serving many requests; PagedAttention + continuous batching.
pip install vllm && vllm serve meta-llama/Llama-3.1-8B-Instruct | Runtime | What it is | Status |
|---|---|---|
| vLLM | OpenAI-compatible serving engine | stable |
| Ollama | CLI/server | stable |
| llama.cpp | Reference engine | stable |
Built into iOS 26, ~3B on-device model, zero download, fully private.
Run any GGUF from HuggingFace fully offline.
| Runtime | What it is | Status |
|---|---|---|
| Apple Foundation Models | On-device ~3B model API (iOS 26+) | stable |
| PocketPal AI | llama.cpp wrapper app | stable |
| LLM Farm | Open-source on-device runner | stable |
Polished app, download GGUF and run offline.
GPU/NPU acceleration paths for supported chips.
| Runtime | What it is | Status |
|---|---|---|
| PocketPal AI | llama.cpp/llama.rn app | stable |
| MLC LLM | Compiled on-device inference | stable |
| Google AI Edge / LiteRT-LM | On-device LLM runtime for Gemma | stable |
FAQ
Is vLLM good for Mac?
No. vLLM is a CUDA/Linux serving engine built for throughput across many requests. On Apple Silicon use MLX (mlx-lm) or Ollama/LM Studio instead.
Ollama or LM Studio?
LM Studio is the friendliest GUI and ships MLX on Apple Silicon. Ollama is a scriptable CLI/server that is great for integrating local models into tools. Both support GGUF models; on Apple Silicon, LM Studio defaults to its MLX backend instead of llama.cpp.
What is the fastest way to run a model on Apple Silicon?
mlx-lm. Same model, same quant, consistently higher tokens per second than Ollama's llama.cpp Metal backend on M-series chips. Ollama has an MLX backend in preview that narrows the gap, but mlx-lm still leads. LM Studio can use the same MLX backend if you want a GUI.