Skip to content
localmodel.run

5 platforms · runtimes guide

Best tool to run LLMs locally

The runtime that is fastest on a Mac will fail to start on Windows. Here is what to install, who each tool is actually for, and the one mistake people make on each platform.

macOS runtime guide
Beginner pick
LM Studio

Polished GUI, ships MLX on Apple Silicon, one-click model downloads.

Power user
mlx-lm

Apple's MLX framework, usually the fastest on Apple Silicon for the same quant.

$ pip install mlx-lm && mlx_lm.generate --model mlx-community/Llama-3.1-8B-Instruct-4bit
Runtime What it is Status
MLX / mlx-lm Apple-native inference, usually fastest on M-series stable
LM Studio GUI app, MLX + GGUF backends stable
Ollama CLI/server, simple ollama run UX stable
llama.cpp (Metal) Reference cross-platform engine stable
Gotcha: vLLM is NOT a Mac tool, it is a CUDA/Linux serving engine. Unified memory is not a fixed VRAM slice; ~70% is usable for weights.
Windows runtime guide
Beginner pick
LM Studio

Best GUI on Windows, auto-detects CUDA/Vulkan backends.

Power user
Ollama (CUDA)

Scriptable server; CUDA path is fastest on NVIDIA.

$ ollama run llama3.1:8b
Runtime What it is Status
LM Studio GUI, CUDA/Vulkan/ROCm stable
Ollama CLI/server stable
llama.cpp Reference engine stable
Gotcha: AMD GPUs run via Vulkan/ROCm at roughly half CUDA throughput. NVIDIA is the smooth path on Windows.
Linux runtime guide
Beginner pick
Ollama

One-line install, simple single-user UX.

Power user
vLLM

Highest throughput for serving many requests; PagedAttention + continuous batching.

$ pip install vllm && vllm serve meta-llama/Llama-3.1-8B-Instruct
Runtime What it is Status
vLLM OpenAI-compatible serving engine stable
Ollama CLI/server stable
llama.cpp Reference engine stable
Gotcha: vLLM shines for multi-user serving/throughput. For a single local chat, Ollama or llama.cpp is simpler and lighter.
iOS runtime guide
Beginner pick
Apple Foundation Models

Built into iOS 26, ~3B on-device model, zero download, fully private.

Power user
PocketPal AI

Run any GGUF from HuggingFace fully offline.

Runtime What it is Status
Apple Foundation Models On-device ~3B model API (iOS 26+) stable
PocketPal AI llama.cpp wrapper app stable
LLM Farm Open-source on-device runner stable
Gotcha: Phones realistically run 1B-4B class models. Anything larger thermally throttles or OOMs.
Android runtime guide
Beginner pick
PocketPal AI

Polished app, download GGUF and run offline.

Power user
MLC LLM / LiteRT-LM

GPU/NPU acceleration paths for supported chips.

Runtime What it is Status
PocketPal AI llama.cpp/llama.rn app stable
MLC LLM Compiled on-device inference stable
Google AI Edge / LiteRT-LM On-device LLM runtime for Gemma stable
Gotcha: NPU acceleration is limited and chip-specific; most apps run on CPU. Expect 1B-4B class.

FAQ

Is vLLM good for Mac?

No. vLLM is a CUDA/Linux serving engine built for throughput across many requests. On Apple Silicon use MLX (mlx-lm) or Ollama/LM Studio instead.

Ollama or LM Studio?

LM Studio is the friendliest GUI and ships MLX on Apple Silicon. Ollama is a scriptable CLI/server that is great for integrating local models into tools. Both support GGUF models; on Apple Silicon, LM Studio defaults to its MLX backend instead of llama.cpp.

What is the fastest way to run a model on Apple Silicon?

mlx-lm. Same model, same quant, consistently higher tokens per second than Ollama's llama.cpp Metal backend on M-series chips. Ollama has an MLX backend in preview that narrows the gap, but mlx-lm still leads. LM Studio can use the same MLX backend if you want a GUI.

Sources