Local LLM Lab
What Actually Works on a Mac Mini
Six models. One Mac Mini M4 with 24GB RAM.
Some were fast. Some were accurate. Some ran out of memory before finishing a sentence.
Here's what I learned — and what I'm still running.
Hermes 3 Llama 3.1 8B (Q8_0)
First local model. llama.cpp, localhost:8080. Chat only — no automation, no scheduled work. Still running as a legacy fallback.
Ollama — Various Models
Installed and available, multiple models pulled. Not used in scheduled jobs — Abacus and MLX proved faster and more reliable for automation workloads.
Qwen3.5-9B-MLX-4bit
Current primary local model. MLX framework, localhost:8888. Zero marginal cost. Handles all draft and volume work via the Hermes "local" profile.
Qwen3 14B (MLX) — Benchmark
10.68 tokens/second average throughput. 0.764s time to first token. 10GB RAM usage. Baseline benchmark for local capability.
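For context on how figures like these are typically derived, here is a minimal sketch. It assumes you have recorded an arrival timestamp for each streamed token; the page does not say how the numbers above were measured, and the names `stream_stats` and `StreamStats` are illustrative.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamStats:
    ttft_s: float        # time to first token, in seconds
    tokens_per_s: float  # decode throughput after the first token


def stream_stats(request_sent_at: float, token_times: Sequence[float]) -> StreamStats:
    """Compute TTFT and decode throughput from per-token arrival times.

    All timestamps are seconds on the same clock, for example
    values returned by time.perf_counter().
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure throughput")
    ttft = token_times[0] - request_sent_at
    decode_window = token_times[-1] - token_times[0]
    if decode_window <= 0:
        raise ValueError("token timestamps must be increasing")
    # The first token marks the start of the decode window, so it is
    # not counted among the tokens produced inside that window.
    tps = (len(token_times) - 1) / decode_window
    return StreamStats(ttft_s=ttft, tokens_per_s=tps)
```

Separating the two numbers matters because time to first token is dominated by prompt processing, while tokens per second reflects steady-state generation.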
Qwen3.6 35B-A3B — Planned
Pending benchmark. Hypothesis: a 35B MoE model at Q4 quantisation might fit in 24GB and produce meaningfully better output for complex reasoning tasks.
The best local model is the one that runs unattended at 02:00 without crashing. Token speed is a distant second.
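One cheap way to find out whether a local server survived the night is a TCP liveness probe. The sketch below is an assumption about how such a check could look, not the setup described on this page; it only confirms that something is listening on the port, not that the model can still generate text.

```python
import socket


def port_is_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Ports taken from the cards above: llama.cpp on 8080, MLX on 8888.
    for name, port in [("llama.cpp", 8080), ("MLX", 8888)]:
        state = "up" if port_is_open("127.0.0.1", port) else "down"
        print(f"{name} on port {port}: {state}")
```

Run on a timer, a probe like this turns a silent 02:00 crash into something you can see the next morning.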
1. Speed benchmarks don't tell you what you need to know.
The question isn't tokens per second. It's: does it finish a 2,000-word draft without hallucinating a URL that doesn't exist? Does it follow a system prompt consistently across a 90-minute session? Does it stay running at 02:00 when nothing is watching it? Those tests are not in any benchmark suite.
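The hallucinated-URL question can be turned into a small automated check. The sketch below assumes you keep a set of URLs you actually supplied as sources; the function name `unverified_urls` and the regular expression are illustrative choices, and anything the function returns still needs a human look.

```python
import re

# Deliberately simple pattern: stop at whitespace and common closing delimiters.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def unverified_urls(draft: str, known_urls: set[str]) -> list[str]:
    """Return URLs found in the draft that are not in the known-good set."""
    seen: set[str] = set()
    result: list[str] = []
    for match in URL_PATTERN.findall(draft):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if url not in known_urls and url not in seen:
            seen.add(url)
            result.append(url)
    return result


draft = "See https://example.com/a and https://example.org/made-up."
print(unverified_urls(draft, {"https://example.com/a"}))
```

This does not prove a URL is wrong, only that the model produced one you did not give it, which is the case worth reviewing by hand.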
2. The model format matters as much as the model.
Qwen3.5 9B via MLX runs faster and uses less RAM than the same model via Ollama on Apple Silicon. MLX is Apple's own machine-learning framework, designed around the unified memory of Apple Silicon. GGUF via llama.cpp is more portable but slower on the M4. Choosing a model without choosing a runtime is only half the decision.
3. Scheduled jobs need cloud reliability, not local cost savings.
The 31 automated jobs that run daily don't use local models. They use Abacus RouteLLM. The reason: a local model that crashes at 06:00 means no morning briefing, no grid intelligence snapshot, no content review. The cost saving is not worth the reliability risk for unattended automation. Local models are for interactive, supervised work — drafts, research, brainstorming — where a failure is visible and recoverable.
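The rule above reduces to a single decision: is anyone watching the run? A minimal sketch follows, with `Job`, `Backend`, and `route` as illustrative names rather than the actual configuration behind this page.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    CLOUD = "cloud"  # hosted router, used where a failure would go unnoticed
    LOCAL = "local"  # model served on this machine


@dataclass(frozen=True)
class Job:
    name: str
    unattended: bool  # True when nobody is watching the run


def route(job: Job) -> Backend:
    """Unattended work goes to the cloud; supervised work stays local."""
    return Backend.CLOUD if job.unattended else Backend.LOCAL


jobs = [
    Job("morning briefing", unattended=True),
    Job("blog draft", unattended=False),
]
for job in jobs:
    print(f"{job.name}: {route(job).value}")
```

Keying the decision on supervision rather than cost keeps the rule stable even when API prices change.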
4. Zero marginal cost changes how you use it.
When a model is free at the point of use, you stop rationing. You run a draft, don't like it, run it again with a different prompt. You use it for throwaway tasks you'd never pay API fees for. The creative and exploratory value of a local model comes from the psychology of free — not the token speed.
5. What I'm still working out.
The 35B model benchmark is pending. The hypothesis: a 35B MoE model at Q4 quantisation might fit in 24GB and produce meaningfully better output for complex tasks — strategy, analysis, long-form writing. If it does, the routing logic changes: small model for drafts, large model for anything requiring reasoning depth.
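The arithmetic behind the hypothesis can be sketched in a few lines. This estimates memory for the weights only; the headroom reserved for the KV cache, the runtime, and macOS is a guess you should replace with a measured value, and real 4-bit formats usually store scaling factors that push the effective size above 4 bits per weight.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billion: float, bits_per_weight: float,
         ram_gb: float, reserved_gb: float) -> bool:
    """True if the weights plus a reserved headroom fit within ram_gb.

    reserved_gb is a hypothetical allowance for KV cache, runtime, and OS.
    """
    return weight_memory_gb(params_billion, bits_per_weight) + reserved_gb <= ram_gb


for bits in (4.0, 4.5):
    print(f"35B at {bits} bits/weight: {weight_memory_gb(35, bits):.2f} GB")
```

At exactly 4 bits per weight, 35B parameters come to 17.5 GB, which leaves limited room in 24GB for everything else. An MoE model must keep every expert in memory, so the smaller active-parameter count suggested by "A3B" helps generation speed rather than footprint.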
Why I Run a Local LLM on My Mac Mini
The cost, control, and psychology of zero-marginal-cost inference.
Benchmarking Qwen3 14B on Apple Silicon
MLX vs llama.cpp vs Ollama — real numbers, not marketing claims.
When to Use Local vs Cloud: My Actual Routing Logic
Why scheduled jobs stay on cloud and interactive work stays local.