Running LLMs locally on a MacBook Pro
I've been experimenting with running large language models locally on my MacBook Pro (M3, 16GB unified memory). Here's what I found actually works well versus what's too slow to be useful.
Why run locally?
A few reasons drove me to try this:
- Privacy — no API calls, nothing leaves the machine
- Cost — no per-token billing for experiments
- Latency — for short prompts, local can be faster than a round trip to an API
- Offline — works on a plane
Tools I tried
Ollama
Ollama is by far the easiest way to get started. Install it, run ollama pull llama3.2, and you have a model running in minutes. It serves an OpenAI-compatible API at http://localhost:11434/v1.
```shell
ollama pull llama3.2
ollama run llama3.2
```
On 16GB unified memory, llama3.2:3b runs at around 40 tokens/second, fast enough for interactive use. The 8B model (Llama 3.1, since Llama 3.2 has no 8B variant) drops to about 15 tokens/second, still usable.
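If you want to check throughput figures like these on your own machine, a short script is enough. This is a minimal sketch, assuming Ollama is running on its default port and using its native /api/generate endpoint (rather than the OpenAI-compatible one), whose non-streaming response reports eval_count (generated tokens) and eval_duration (nanoseconds). The helper names and the prompt are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (ns)."""
    duration_s = response["eval_duration"] / 1e9
    return response["eval_count"] / duration_s


def generate(model: str, prompt: str) -> dict:
    """Send one non-streaming generation request and return the parsed JSON."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Example usage (needs a running Ollama server with the model pulled):
#   result = generate("llama3.2", "Explain unified memory in two sentences.")
#   print(result["response"])
#   print(f"{tokens_per_second(result):.1f} tokens/second")
```

The first request after a pull includes model load time, but eval_duration only covers generation, so the number should be comparable between runs.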
llama.cpp
More control, more friction. You build from source and download GGUF-format model weights yourself. The upside is that you choose the quantisation level: Q4_K_M is a good balance of quality and speed.
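To get a feel for what the quantisation level buys you, here is a rough back-of-the-envelope estimate of weight size. It is a sketch under stated assumptions: the bits-per-weight values are approximate averages (Q4_K_M mixes precisions, so ~4.85 is only a ballpark), and the function name is my own. It covers the weights only, not the context cache or runtime overhead.

```python
# Approximate bits stored per weight; rough figures, not exact specs.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,  # mixed-precision scheme, so this is an average
}


def weight_size_gb(params_billion: float, quant: str) -> float:
    """Estimate weight size in GB (10^9 bytes) for a given parameter count."""
    bits = params_billion * 1e9 * BITS_PER_WEIGHT[quant]
    return bits / 8 / 1e9


for quant in BITS_PER_WEIGHT:
    print(f"7B at {quant}: ~{weight_size_gb(7, quant):.1f} GB")
```

For a 7B model this works out to roughly 14 GB at F16, 7.4 GB at Q8_0, and 4.2 GB at Q4_K_M, which is why unquantised weights are not practical on a 16GB machine.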
What fits in 16GB
| Model | Quantisation | Memory used | Speed |
|---|---|---|---|
| Llama 3.2 3B | Q4_K_M | ~2.5GB | ~40 tok/s |
| Llama 3.1 8B | Q4_K_M | ~5.5GB | ~15 tok/s |
| Mistral 7B | Q4_K_M | ~5GB | ~18 tok/s |
| Llama 3.1 70B | Q4_K_M | too large | — |
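The fit-or-not judgement in the table can be written down as a simple budget check. This is only a sketch: the 6 GB reserved for macOS and other apps is an assumed figure rather than a measurement, and the 70B size is an estimate (about 70 billion weights at roughly 4.85 bits each) since the table gives no number for it.

```python
TOTAL_MEMORY_GB = 16.0
RESERVED_FOR_SYSTEM_GB = 6.0  # assumed headroom for macOS and apps, not measured

# (model size, approximate memory in GB); the first three come from the table,
# the 70B figure is an estimate (~70e9 weights * ~4.85 bits / 8).
MODELS = [
    ("Llama 3B", 2.5),
    ("Llama 8B", 5.5),
    ("Mistral 7B", 5.0),
    ("Llama 70B", 42.0),
]


def fits(model_gb: float, total_gb: float = TOTAL_MEMORY_GB,
         reserved_gb: float = RESERVED_FOR_SYSTEM_GB) -> bool:
    """True if the model fits in what is left after the reserved headroom."""
    return model_gb <= total_gb - reserved_gb


for name, size_gb in MODELS:
    verdict = "fits" if fits(size_gb) else "too large"
    print(f"{name}: ~{size_gb} GB -> {verdict}")
```

Adjust the reserved figure to match what you normally have open; a browser with many tabs can easily eat more than that.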
Conclusion
For a 16GB machine, 7–8B models are the sweet spot. They're capable enough for coding assistance, summarisation, and general Q&A, while leaving headroom for the rest of the OS. I wouldn't try anything larger without 32GB.