Home / Blog / Running LLMs locally on a MacBook Pro

Running LLMs locally on a MacBook Pro

30 April 2026 6 min read AI LLM macOS

I've been experimenting with running large language models locally on my MacBook Pro (M3, 16GB unified memory). Here's what I found actually works well versus what's too slow to be useful.

Why run locally?

A few reasons drove me to try this:

Privacy — no API calls, nothing leaves the machine
Cost — no per-token billing for experiments
Latency — for short prompts, local can be faster than a round trip to an API
Offline — works on a plane

Tools I tried

Ollama

Ollama is by far the easiest way to get started. Install it, run ollama pull llama3, and you have a model running in minutes. It exposes an OpenAI-compatible API on localhost:11434.

ollama pull llama3.2
ollama run llama3.2

On 16GB unified memory, llama3.2:3b runs at around 40 tokens/second — fast enough for interactive use. The 8B model drops to about 15 tokens/second, still usable.

llama.cpp

More control, more friction. You build from source and download GGUF-format model weights yourself. The upside is you can tune quantisation levels — Q4_K_M is a good balance of quality and speed.

What fits in 16GB

Model	Quantisation	VRAM used	Speed
Llama 3.2 3B	Q4_K_M	~2.5GB	~40 tok/s
Llama 3.2 8B	Q4_K_M	~5.5GB	~15 tok/s
Mistral 7B	Q4_K_M	~5GB	~18 tok/s
Llama 3.1 70B	Q4_K_M	too large	—

Conclusion

For a 16GB machine, 7–8B models are the sweet spot. They're capable enough for coding assistance, summarisation, and general Q&A, while leaving headroom for the rest of the OS. I wouldn't try anything larger without 32GB.