Home / Blog / Running LLMs locally on a MacBook Pro

Running LLMs locally on a MacBook Pro

I've been experimenting with running large language models locally on my MacBook Pro (M3, 16GB unified memory). Here's what I found actually works well versus what's too slow to be useful.

Why run locally?

A few reasons drove me to try this:

Tools I tried

Ollama

Ollama is by far the easiest way to get started. Install it, run ollama pull llama3, and you have a model running in minutes. It exposes an OpenAI-compatible API on localhost:11434.

ollama pull llama3.2
ollama run llama3.2

On 16GB unified memory, llama3.2:3b runs at around 40 tokens/second — fast enough for interactive use. The 8B model drops to about 15 tokens/second, still usable.

llama.cpp

More control, more friction. You build from source and download GGUF-format model weights yourself. The upside is you can tune quantisation levels — Q4_K_M is a good balance of quality and speed.

What fits in 16GB

Model Quantisation VRAM used Speed
Llama 3.2 3B Q4_K_M ~2.5GB ~40 tok/s
Llama 3.2 8B Q4_K_M ~5.5GB ~15 tok/s
Mistral 7B Q4_K_M ~5GB ~18 tok/s
Llama 3.1 70B Q4_K_M too large

Conclusion

For a 16GB machine, 7–8B models are the sweet spot. They're capable enough for coding assistance, summarisation, and general Q&A, while leaving headroom for the rest of the OS. I wouldn't try anything larger without 32GB.