Local LLM Lab

What Actually Works on a Mac Mini

Six models. One Mac Mini M4 with 24GB RAM.

Some were fast. Some were accurate. Some ran out of memory before finishing a sentence.

Here's what I learned — and what I'm still running.

Models tested across llama.cpp, MLX, Ollama, cloud

Local models actively running

M4 24GB

Mac Mini — Apple Silicon unified memory

10.68 t/s

Qwen3 14B benchmark — average throughput

0.764s

Time to first token — Qwen3 14B on MLX

10GB

RAM usage — Qwen3 14B during inference

Zero

Marginal cost — Qwen3.5-9B handles all draft work

Scheduled jobs using local — cloud for reliability

Timeline

2025

Hermes 3 Llama 3.1 8B (Q8_0)

First local model. llama.cpp, localhost:8080. Chat only — no automation, no scheduled work. Still running as a legacy fallback.

Early 2026

Ollama — Various Models

Installed and available, multiple models pulled. Not used in scheduled jobs — Abacus and MLX proved faster and more reliable for automation workloads.

2026

Qwen3.5-9B-MLX-4bit

Current primary local model. MLX framework, localhost:8888. Zero marginal cost. Handles all draft and volume work via the Hermes "local" profile.

2026

Qwen3 14B (MLX) — Benchmark

10.68 tokens/second average throughput. 0.764s time to first token. 10GB RAM usage. Baseline benchmark for local capability.

2026 (Pending)

Qwen3.6 35B-A3B — Planned

Pending benchmark. Hypothesis: a 35B MoE model at Q4 quantisation might fit in 24GB and produce meaningfully better output for complex reasoning tasks.

The best local model is the one that runs unattended at 02:00 without crashing. Token speed is a distant second.

Key learnings

1. Speed benchmarks don't tell you what you need to know.

The question isn't tokens per second. It's: does it finish a 2,000-word draft without hallucinating a URL that doesn't exist? Does it follow a system prompt consistently across a 90-minute session? Does it stay running at 02:00 when nothing is watching it? Those tests are not in any benchmark suite.

2. The model format matters as much as the model.

Qwen3.5 9B via MLX runs faster and uses less RAM than the same model via Ollama on Apple Silicon. The MLX framework is native to the chip. GGUF via llama.cpp is more portable but not as fast on M4. Choosing a model without choosing a runtime is only half the decision.

3. Scheduled jobs need cloud reliability, not local cost savings.

The 31 automated jobs that run daily don't use local models. They use Abacus RouteLLM. The reason: a local model that crashes at 06:00 means no morning briefing, no grid intelligence snapshot, no content review. The cost saving is not worth the reliability risk for unattended automation. Local models are for interactive, supervised work — drafts, research, brainstorming — where a failure is visible and recoverable.

4. Zero marginal cost changes how you use it.

When a model is free at the point of use, you stop rationing. You run a draft, don't like it, run it again with a different prompt. You use it for throwaway tasks you'd never pay API fees for. The creative and exploratory value of a local model comes from the psychology of free — not the token speed.

5. What I'm still working out.

The 35B model benchmark is pending. The hypothesis: a 35B MoE model at Q4 quantisation might fit in 24GB and produce meaningfully better output for complex tasks — strategy, analysis, long-form writing. If it does, the routing logic changes: small model for drafts, large model for anything requiring reasoning depth.

Supporting evidence

Why I Run a Local LLM on My Mac Mini

The cost, control, and psychology of zero-marginal-cost inference.

published

Benchmarking Qwen3 14B on Apple Silicon

MLX vs llama.cpp vs Ollama — real numbers, not marketing claims.

published

When to Use Local vs Cloud — My Actual Routing Logic

Why scheduled jobs stay on cloud and interactive work stays local.

published

Full AI Journey