Mistral 7B Q4_K_M running locally — notes from the first week

Running Mistral 7B Q4_K_M locally via llama.cpp on a Mac Mini M2 (16GB unified memory). One week in.

Setup

# Build llama.cpp with Metal support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make LLAMA_METAL=1

# Download model (Q4_K_M quantization ~4.4GB)
huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF \
  mistral-7b-v0.1.Q4_K_M.gguf --local-dir .

./llama-cli -m mistral-7b-v0.1.Q4_K_M.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 32 \
  -i

With 32 layers offloaded to the GPU (Metal), generation runs at about 35 tokens/second. That’s fast enough to feel responsive. Without GPU offload: ~8 tok/s. Not usable for conversation.
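
To reproduce the comparison on another machine, llama.cpp ships a small benchmark tool. A minimal sketch, assuming the llama-bench binary from the same build (-ngl is the number of layers offloaded to the GPU):

# Throughput with and without Metal offload
./llama-bench -m mistral-7b-v0.1.Q4_K_M.gguf -ngl 0
./llama-bench -m mistral-7b-v0.1.Q4_K_M.gguf -ngl 32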

What I actually use it for

  • Code review drafts: paste a diff, ask for a review (rough sketch after this list). The output needs editing, but it catches obvious things and saves me time structuring the feedback.
  • Rewriting sentences: when I’ve read something five times and can’t tell anymore if it makes sense, I ask the model. It usually catches the actual problem.
  • Offline work: when I’m on a train without signal, I still have a capable LLM. This turns out to matter more than I expected.
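
For the code-review case, roughly what I do: dump a diff, prepend the instruction, and hand the whole thing to the model as a prompt file. A sketch, assuming the model and flags from the setup above; the prompt wording and paths are just examples:

# Build a review prompt from the last commit's diff
git diff HEAD~1 > /tmp/diff.patch
{ echo "Review this diff. List bugs, style issues, and missing tests:"; cat /tmp/diff.patch; } > /tmp/review-prompt.txt
./llama-cli -m mistral-7b-v0.1.Q4_K_M.gguf \
  --ctx-size 4096 --n-gpu-layers 32 \
  -f /tmp/review-prompt.txt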

What doesn’t work

  • Anything requiring up-to-date knowledge. The training cutoff is early 2023. Don’t ask it about current events.
  • Long-context tasks. 4096 tokens is fine for individual files but not for “review this entire codebase.”
  • Reliable tool use. The 7B model follows instructions reasonably well but breaks the requested output format often enough that I can’t script around it. The 13B+ models are better here.
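
One partial workaround for the format problem: llama.cpp can constrain generation with a GBNF grammar. A sketch using the json.gbnf grammar bundled in the repo’s grammars/ directory (the prompt is just an illustration; how much this helps depends on the task):

# Constrain output to valid JSON with the bundled grammar
./llama-cli -m mistral-7b-v0.1.Q4_K_M.gguf \
  --ctx-size 4096 --n-gpu-layers 32 \
  --grammar-file grammars/json.gbnf \
  -p "Return a JSON array of the three largest Swiss cities."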

Verdict

Worth the setup time. The privacy and offline availability are the main draws. For pure capability, the hosted models are still ahead — but the gap is smaller than it was six months ago.