Benchmarking Qwen3 14B on Apple Silicon

My main local model is Qwen3.5 9B at Q4 quantisation via MLX. I ran a formal benchmark on the larger Qwen3 14B to find out whether more parameters at the same quantisation level were worth the trade-off in speed and RAM.

Short answer: yes, with caveats.

The test setup

Hardware: Mac Mini M4, 24GB unified memory, Apple Silicon neural engine
Framework: MLX — Apple's machine learning framework, native to Apple Silicon
Model: Qwen3 14B, MLX-converted, Q4 quantisation
Metrics: tokens per second (throughput), time to first token (TTFT), RAM usage during inference

This isn't a synthetic benchmark on a test cluster. It's what the model actually does on the machine that runs my full automation stack, with other background jobs alive.
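Two of the metrics listed above, throughput and TTFT, can be captured by timing any token stream. What follows is a minimal sketch rather than the harness used for this post: `measure_stream` and `fake_model` are hypothetical names, and the stand-in generator only simulates a prefill delay followed by steady decoding. RAM usage is not measured here.

```python
import time
from typing import Iterable, Iterator


def measure_stream(tokens: Iterable[str]) -> dict:
    """Time a token stream: time to first token plus decode throughput.

    `tokens` is any iterable that yields one generated token at a time.
    Wrapping a real model's streaming API is left to the caller.
    """
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_token_at - start
    decode_time = end - first_token_at
    # Throughput counts the tokens after the first, over the time spent decoding them.
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return {"tokens": count, "ttft_s": ttft, "tokens_per_s": tps}


def fake_model(n_tokens: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model: a prefill pause, then evenly spaced tokens."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_model(20, prefill_s=0.05, per_token_s=0.01)))
```

Keeping the first token out of the throughput figure stops prompt processing time from dragging down the decode rate, which is why TTFT and tokens per second are reported separately.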

The results

Metric	Result
Average throughput	10.68 tokens/second
Time to first token	0.764 seconds
RAM usage (inference)	~10GB

For comparison: Qwen3.5 9B at Q4 runs at around 15–18 t/s in my experience, using 6–7GB RAM. The 14B has roughly 55% more parameters (14B against 9B), costs roughly 30–40% in speed, and needs 3–4GB more RAM.
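The trade-off can be recomputed from the figures quoted here. This is a small sketch under one assumption: the parameter counts are the nominal 9B and 14B from the model names, not exact counts.

```python
def pct_change(new: float, old: float) -> float:
    """Percentage change going from `old` to `new` (positive means an increase)."""
    return (new - old) / old * 100.0


# Nominal parameter counts in billions, read off the model names (assumption).
PARAMS_9B, PARAMS_14B = 9.0, 14.0
# Throughput figures quoted in the post, in tokens per second.
TPS_14B = 10.68
TPS_9B_LOW, TPS_9B_HIGH = 15.0, 18.0

if __name__ == "__main__":
    print(f"extra parameters:     {pct_change(PARAMS_14B, PARAMS_9B):.1f}%")
    print(f"speed cost vs 15 t/s: {-pct_change(TPS_14B, TPS_9B_LOW):.1f}%")
    print(f"speed cost vs 18 t/s: {-pct_change(TPS_14B, TPS_9B_HIGH):.1f}%")
```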

What these numbers mean in practice

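At 10.68 t/s, a 500-word response takes about 47 seconds end-to-end if you count one token per word. English text usually tokenises to more tokens than words, so a minute is a safer estimate. That is slower than cloud for interactive use, but not painfully slow for draft-and-revise workflows where you're reading while the model is still writing.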
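The estimate is a single division, sketched below so it can be rerun for other response lengths. The function name and the `tokens_per_word` parameter are illustrative. The default of 1.0 reproduces the 47-second figure, and the 1.3 in the second call is a rough rule of thumb for English text, not something measured in this benchmark.

```python
def response_seconds(
    words: int,
    tokens_per_s: float,
    tokens_per_word: float = 1.0,
    ttft_s: float = 0.0,
) -> float:
    """Estimate wall-clock time for a response of `words` words."""
    tokens = words * tokens_per_word
    return ttft_s + tokens / tokens_per_s


if __name__ == "__main__":
    # One token per word, decode time only.
    print(round(response_seconds(500, 10.68), 1))
    # Assumed 1.3 tokens per word, plus the measured time to first token.
    print(round(response_seconds(500, 10.68, tokens_per_word=1.3, ttft_s=0.764), 1))
```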

The 0.764s TTFT is the more important number for how the model feels. Under a second to first token means it doesn't feel stuck when you submit. That threshold matters psychologically — anything over a second starts to feel like a loading spinner.

The 10GB RAM footprint is significant for a 24GB machine. It leaves 14GB headroom for the OS, other processes, and multi-model setups. You could run Qwen3 14B and Hermes 3 8B simultaneously with room to spare.
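The headroom arithmetic can be kept as a small budget check. Only the 24GB total and the ~10GB footprint for the 14B come from the measurements above. The 5GB figure for a second model and the 6GB reserve for the OS and background jobs are placeholder assumptions to replace with your own numbers.

```python
from typing import Iterable


def headroom_gb(total_gb: float, footprints_gb: Iterable[float], reserved_gb: float = 0.0) -> float:
    """Memory left after loading the given models and setting aside a reserve."""
    return total_gb - reserved_gb - sum(footprints_gb)


if __name__ == "__main__":
    # Measured case: the 14B alone on a 24GB machine.
    print(headroom_gb(24, [10]))
    # Hypothetical case: add a second model (assumed 5GB) and reserve 6GB (assumed).
    print(headroom_gb(24, [10, 5], reserved_gb=6))
```

A negative result means the combination does not fit in the budget you set.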

The verdict

Qwen3 14B is meaningfully better than the 9B for complex tasks: reasoning depth, instruction following on multi-step prompts, and long-form coherence. For quick drafts and simple reformatting, the 9B is still the right call: faster, lighter, good enough.
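As a routing rule, that verdict might look like the sketch below. It illustrates the split described here and is not the routing code behind this setup. The task categories, model labels, and `pick_model` are all hypothetical names.

```python
from enum import Enum


class TaskKind(Enum):
    QUICK_DRAFT = "quick_draft"
    REFORMAT = "reformat"
    MULTI_STEP = "multi_step"
    LONG_FORM = "long_form"
    REASONING = "reasoning"


# Hypothetical model labels; substitute whatever identifiers your runtime uses.
SMALL_MODEL = "qwen3.5-9b-q4"
LARGE_MODEL = "qwen3-14b-q4"

# Tasks where the faster, lighter model is good enough.
_LIGHT_TASKS = {TaskKind.QUICK_DRAFT, TaskKind.REFORMAT}


def pick_model(task: TaskKind) -> str:
    """Send light tasks to the 9B and everything else to the 14B."""
    return SMALL_MODEL if task in _LIGHT_TASKS else LARGE_MODEL


if __name__ == "__main__":
    for kind in TaskKind:
        print(kind.value, "->", pick_model(kind))
```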

The model I haven't benchmarked yet is Qwen3.6 35B-A3B — a 35B mixture-of-experts model at Q4 quantisation. The hypothesis: 35B MoE might fit in 24GB and produce qualitatively better output for strategy and analysis tasks. That benchmark is coming, and it would change the routing logic if it holds.
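Whether a 35B model might fit can be roughed out before downloading anything. The sketch below estimates weight storage only. The default of 4.5 effective bits per weight is an assumption (4-bit values plus per-group scale and bias), not a figure from this benchmark, and the parameter counts are the nominal ones from the model names.

```python
def quantised_weights_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes) for a quantised model."""
    # Billions of parameters times bits per weight, divided by 8 bits per byte.
    return params_billions * bits_per_weight / 8.0


if __name__ == "__main__":
    for name, params in [("14B", 14.0), ("35B MoE", 35.0)]:
        print(f"{name}: ~{quantised_weights_gb(params):.1f} GB of weights")
```

This counts weights only: the KV cache and runtime overhead sit on top of it. A mixture-of-experts model still has to keep every expert in memory even though only a fraction is active per token, so the full 35B is the number that matters for fit.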

Why I benchmark my own hardware

Published benchmark tables (MMLU scores, HumanEval, GPQA) don't tell me whether a model will hallucinate a supplier name at 02:00 when nothing is watching. They also say nothing about speed or memory use on Apple Silicon; most published performance figures come from NVIDIA hardware.

The only benchmark that matters for my use case: does it work, reliably, on my machine, for my tasks?

Everything else is someone else's data.

This post is part of the Local LLM Lab case study.
