Mistral 7B Q4_K_M running local — notes from the first week
Running Mistral 7B Q4_K_M locally via llama.cpp on a Mac Mini M2 (16GB unified memory). One week in.
Setup
# Build llama.cpp with Metal support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make LLAMA_METAL=1

# Download the model (Q4_K_M quantization, ~4.4GB) into the current directory
huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF \
  mistral-7b-v0.1.Q4_K_M.gguf --local-dir .

# Run interactively: 4K context, 32 layers offloaded to the GPU
./llama-cli -m mistral-7b-v0.1.Q4_K_M.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 32 \
  -i
With 32 layers offloaded to the GPU (Metal), generation runs at about 35 tokens/second. That’s fast enough to feel responsive. Without GPU offload: ~8 tok/s. Not usable for conversation.
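If you want to measure this rather than eyeball it, llama.cpp builds a llama-bench binary alongside llama-cli. A minimal sketch, assuming the default build layout; -ngl takes a comma-separated list, so one run covers both configurations:

# Benchmark generation speed with no offload vs. 32 layers on Metal
# (-n 128 measures throughput over 128 generated tokens)
./llama-bench -m mistral-7b-v0.1.Q4_K_M.gguf -ngl 0,32 -n 128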
What I actually use it for
- Code review drafts: paste a diff, ask for a review (a sketch of the plumbing is after this list). The output needs editing but it catches the obvious things and saves me time structuring the feedback.
- Rewriting sentences: when I’ve read something five times and can’t tell anymore if it makes sense, I ask the model. It usually catches the actual problem.
- Offline work: when I’m on a train without signal, I still have a capable LLM. This turns out to matter more than I expected.
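The code-review drafts are just prompt-file plumbing around llama-cli. One non-interactive way to do it, using the same model and flags as the setup above; the prompt wording and temp-file paths are illustrative placeholders:

# Build a review prompt from a diff and run it non-interactively
# (-f reads the prompt from a file, -n caps the response length)
git diff main...HEAD > /tmp/change.diff
{
  echo "Review the following diff. Flag likely bugs, risky changes, and unclear naming."
  echo
  cat /tmp/change.diff
} > /tmp/review-prompt.txt
./llama-cli -m mistral-7b-v0.1.Q4_K_M.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 32 \
  -f /tmp/review-prompt.txt \
  -n 512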
What doesn’t work
- Anything requiring up-to-date knowledge. The training cutoff is early 2023. Don’t ask it about current events.
- Long-context tasks. 4096 tokens is fine for individual files but not for “review this entire codebase.”
- Reliable tool use. The 7B model follows instructions reasonably well but drops the requested format often enough to be unreliable; the 13B+ models are better here. (A partial workaround is sketched below.)
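One partial workaround for the format problem, if structured output is all you need: llama-cli can constrain generation with a GBNF grammar, and the repo ships a JSON grammar. This forces syntactically valid JSON; it does not make the 7B model choose the right tool or arguments. A sketch, with an illustrative prompt:

# Constrain output to valid JSON with the grammar bundled in the repo
# (guarantees the shape of the reply, not its correctness)
./llama-cli -m mistral-7b-v0.1.Q4_K_M.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 32 \
  --grammar-file grammars/json.gbnf \
  -p "Return a JSON object with keys 'tool' and 'args' for this request: list files changed today"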
Verdict
Worth the setup time. The privacy and offline availability are the main draws. For pure capability, the hosted models are still ahead — but the gap is smaller than it was six months ago.