Mistral 7B Q4_K_M running local — notes from the first week
Running Mistral 7B Q4_K_M locally via llama.cpp on a Mac Mini M2 (16GB unified memory). One week in.
Setup
# Build llama.cpp with Metal support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make LLAMA_METAL=1

# Download the model (Q4_K_M quantization, ~4.4GB) into the current directory
huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF \
  mistral-7b-v0.1.Q4_K_M.gguf --local-dir .

# Run interactively: 4K context, 32 layers offloaded to the GPU
./llama-cli -m mistral-7b-v0.1.Q4_K_M.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 32 \
  -i
With 32 layers offloaded to the GPU (Metal), generation runs at about 35 tokens/second. That’s fast enough to feel responsive. Without GPU offload: ~8 tok/s. Not usable for conversation.
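If you want to measure this rather than eyeball it, llama.cpp builds a llama-bench binary alongside llama-cli. A minimal sketch, assuming the default build layout; -ngl takes a comma-separated list, so one run covers both configurations:

# Benchmark generation speed with no offload vs. 32 layers on Metal
# (-n 128 measures throughput over 128 generated tokens)
./llama-bench -m mistral-7b-v0.1.Q4_K_M.gguf -ngl 0,32 -n 128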
What I actually use it for
- Code review drafts: paste a diff, ask for a review (a sketch of the plumbing is after this list). The output needs editing but it catches the obvious things and saves me time structuring the feedback.
- Rewriting sentences: when I’ve read something five times and can’t tell anymore if it makes sense, I ask the model. It usually catches the actual problem.
- Offline work: when I’m on a train without signal, I still have a capable LLM. This turns out to matter more than I expected.
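The code-review drafts are just prompt-file plumbing around llama-cli. One non-interactive way to do it, using the same model and flags as the setup above; the prompt wording and temp-file paths are illustrative placeholders:

# Build a review prompt from a diff and run it non-interactively
# (-f reads the prompt from a file, -n caps the response length)
git diff main...HEAD > /tmp/change.diff
{
  echo "Review the following diff. Flag likely bugs, risky changes, and unclear naming."
  echo
  cat /tmp/change.diff
} > /tmp/review-prompt.txt
./llama-cli -m mistral-7b-v0.1.Q4_K_M.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 32 \
  -f /tmp/review-prompt.txt \
  -n 512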
What doesn’t work
- Anything requiring up-to-date knowledge. The training cutoff is early 2023. Don’t ask it about current events.
- Long-context tasks. 4096 tokens is fine for individual files but not for “review this entire codebase.”
- Reliable tool use. The 7B model follows instructions reasonably well but drops the requested format often enough to be unreliable; the 13B+ models are better here. (A partial workaround is sketched below.)
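One partial workaround for the format problem, if structured output is all you need: llama-cli can constrain generation with a GBNF grammar, and the repo ships a JSON grammar. This forces syntactically valid JSON; it does not make the 7B model choose the right tool or arguments. A sketch, with an illustrative prompt:

# Constrain output to valid JSON with the grammar bundled in the repo
# (guarantees the shape of the reply, not its correctness)
./llama-cli -m mistral-7b-v0.1.Q4_K_M.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 32 \
  --grammar-file grammars/json.gbnf \
  -p "Return a JSON object with keys 'tool' and 'args' for this request: list files changed today"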
Verdict
Worth the setup time. The privacy and offline availability are the main draws. For pure capability, the hosted models are still ahead — but the gap is smaller than it was six months ago.