My main local model is Qwen3.5 9B at Q4 quantisation via MLX. I ran a formal benchmark on the larger Qwen3 14B to find out whether more parameters at the same quantisation level were worth the trade-off in speed and RAM.
Short answer: yes, with caveats.
The test setup
- Hardware: Mac Mini M4, 24GB unified memory, Apple Silicon neural engine
- Framework: MLX — Apple's machine learning framework, native to Apple Silicon
- Model: Qwen3 14B, MLX-converted, Q4 quantisation
- Metrics: tokens per second (throughput), time to first token (TTFT), RAM usage during inference
This isn't a synthetic benchmark on a test cluster. It's what the model actually does on the machine that runs my full automation stack, with other background jobs alive.
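A minimal sketch of how TTFT and throughput could be collected from any streaming generator. This is not the harness behind the numbers in this post: `measure_stream` is a hypothetical helper, and the choice to compute throughput over the tokens after the first one is an assumption about how to separate decode speed from prompt processing. RAM usage is not measured here.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    tokens: Iterable[object],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and report TTFT and decode throughput."""
    start = clock()
    first_token_at: Optional[float] = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = clock()  # the first token has arrived
        count += 1
    end = clock()

    if first_token_at is None:
        # The stream produced nothing, so there is nothing to time.
        return {"tokens": 0, "ttft_s": None, "tokens_per_s": 0.0}

    decode_s = end - first_token_at
    # Assumption: throughput counts only the tokens after the first one,
    # so prompt processing (already captured by TTFT) is not counted twice.
    tokens_per_s = (count - 1) / decode_s if decode_s > 0 else 0.0
    return {
        "tokens": count,
        "ttft_s": first_token_at - start,
        "tokens_per_s": tokens_per_s,
    }
```

Anything that yields one item per generated token can be passed in. The mlx-lm package provides a `stream_generate` function that should work as the `tokens` argument, though the exact arguments depend on the installed version. Averaging several runs with the usual background jobs alive keeps the result representative of the real machine.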
The results
| Metric | Result |
|---|---|
| Average throughput | 10.68 tokens/second |
| Time to first token | 0.764 seconds |
| RAM usage (inference) | ~10GB |
For comparison: Qwen3.5 9B at Q4 runs at around 15–18 t/s in my experience, using 6–7GB RAM. On nominal sizes, the 14B gives roughly 55% more parameters at a 30–40% speed cost, with 3–4GB more RAM.
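The comparison can be checked with a few lines of arithmetic. This sketch takes the nominal sizes in the model names (9 and 14 billion) at face value, which is an assumption since real parameter counts differ slightly. On those sizes the parameter increase works out to about 56%. The throughput and RAM figures are the ones quoted above.

```python
def pct_change(new: float, old: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


# Nominal sizes from the model names (assumption: real counts differ a little).
PARAMS_9B, PARAMS_14B = 9.0, 14.0

# Figures quoted in this post.
TPS_14B = 10.68
TPS_9B_RANGE = (15.0, 18.0)
RAM_14B_GB = 10.0
RAM_9B_RANGE_GB = (6.0, 7.0)

param_increase_pct = pct_change(PARAMS_14B, PARAMS_9B)
# Speed cost against the low and high end of the 9B range.
speed_cost_pct = [-pct_change(TPS_14B, tps) for tps in TPS_9B_RANGE]
# Extra RAM against the low and high end of the 9B range.
extra_ram_gb = [RAM_14B_GB - ram for ram in RAM_9B_RANGE_GB]

if __name__ == "__main__":
    print(f"parameters: +{param_increase_pct:.0f}%")
    print(f"speed cost: {speed_cost_pct[0]:.0f}% to {speed_cost_pct[1]:.0f}%")
    print(f"extra RAM:  {extra_ram_gb[1]:.0f} to {extra_ram_gb[0]:.0f} GB")
```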
What these numbers mean in practice
At 10.68 t/s, a 500-token response takes about 47 seconds end-to-end; a 500-word response runs to more tokens than words, so expect closer to a minute. Slower than cloud for interactive use, but not painfully slow for draft-and-revise workflows where you're reading while the model is still writing.
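The estimate is TTFT plus token count divided by throughput. Here is a sketch using the measured figures as defaults. The 1.3 tokens-per-word ratio is an assumed rule of thumb for English text, not something measured here, and it varies by tokenizer and content.

```python
def response_seconds(
    n_tokens: int,
    tokens_per_s: float = 10.68,  # measured average throughput
    ttft_s: float = 0.764,        # measured time to first token
) -> float:
    """Estimated wall-clock time to receive a full response."""
    return ttft_s + n_tokens / tokens_per_s


def words_to_tokens(words: int, tokens_per_word: float = 1.3) -> int:
    """Rough word-to-token conversion (assumed ratio, tokenizer dependent)."""
    return round(words * tokens_per_word)


if __name__ == "__main__":
    print(f"500 tokens: {response_seconds(500):.1f} s")
    print(f"500 words:  {response_seconds(words_to_tokens(500)):.1f} s")
```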
The 0.764s TTFT is the more important number for how the model feels. Under a second to first token means it doesn't feel stuck when you submit. That threshold matters psychologically — anything over a second starts to feel like a loading spinner.
The 10GB RAM footprint is significant for a 24GB machine. It leaves 14GB headroom for the OS, other processes, and multi-model setups. You could run Qwen3 14B and Hermes 3 8B simultaneously with room to spare.
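A small sketch of the headroom arithmetic, useful for checking whether a second model fits before loading it. The 24GB total and the ~10GB footprint come from this post. The `reserve_gb` value for the OS and other processes is a placeholder to adjust, and any second model's footprint has to be measured rather than guessed.

```python
TOTAL_GB = 24.0       # unified memory on the test machine
QWEN3_14B_GB = 10.0   # approximate footprint from the results table


def headroom_gb(
    loaded_gb: list[float],
    total_gb: float = TOTAL_GB,
    reserve_gb: float = 0.0,
) -> float:
    """Memory left after the loaded models and a reserve for everything else."""
    return total_gb - reserve_gb - sum(loaded_gb)


def fits(candidate_gb: float, loaded_gb: list[float], reserve_gb: float) -> bool:
    """True if a further model of candidate_gb fits in the remaining headroom."""
    return candidate_gb <= headroom_gb(loaded_gb, reserve_gb=reserve_gb)


if __name__ == "__main__":
    # Raw headroom with only the 14B loaded.
    print(headroom_gb([QWEN3_14B_GB]))
    # Budget for a second model, with a hypothetical 6GB reserve for the OS.
    print(headroom_gb([QWEN3_14B_GB], reserve_gb=6.0))
```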
The verdict
Qwen3 14B is meaningfully better than 9B for complex tasks — reasoning depth, instruction following on multi-step prompts, long-form coherence. For quick drafts and simple reformatting, the 9B is still the right call: faster, lighter, good enough.
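That split can be written down as a simple routing rule. This is an illustrative sketch only: the task labels and model identifiers below are hypothetical names invented for the example, not the ones used in the actual automation stack.

```python
# Hypothetical task labels, grouped by which model the verdict favours.
COMPLEX_TASKS = {"reasoning", "multi_step", "long_form"}

# Hypothetical model identifiers for illustration.
MODEL_14B = "qwen3-14b-q4"
MODEL_9B = "qwen3.5-9b-q4"


def pick_model(task_type: str) -> str:
    """Send complex work to the 14B; everything else stays on the faster 9B."""
    if task_type in COMPLEX_TASKS:
        return MODEL_14B
    # Quick drafts, reformatting, and unknown task types default to the 9B.
    return MODEL_9B
```

Defaulting unknown task types to the 9B keeps the common path fast and light; only work that is known to benefit pays the speed and RAM cost.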
The model I haven't benchmarked yet is Qwen3.6 35B-A3B — a 35B mixture-of-experts model at Q4 quantisation. The hypothesis: 35B MoE might fit in 24GB and produce qualitatively better output for strategy and analysis tasks. That benchmark is coming, and it would change the routing logic if it holds.
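A back-of-envelope estimate shows why "might fit" is the right level of confidence. The sketch below estimates weight size only, assuming 4-bit weights in groups of 64 with a 16-bit scale and a 16-bit bias per group, which works out to about 4.5 bits per weight. Those quantisation parameters are an assumption and should be checked against the actual conversion. KV cache, activations, and runtime overhead come on top. With a mixture-of-experts model, all 35B parameters have to sit in memory even though only about 3B are active per token.

```python
def quantised_weight_gb(
    params_billion: float,
    bits: float = 4.0,
    group_size: int = 64,                 # assumed quantisation group size
    overhead_bits_per_group: float = 32.0,  # assumed 16-bit scale + 16-bit bias
) -> float:
    """Approximate size of the quantised weights in GB (weights only)."""
    bits_per_weight = bits + overhead_bits_per_group / group_size
    return params_billion * bits_per_weight / 8.0


if __name__ == "__main__":
    # Sanity check against the measured ~10GB total for the 14B.
    print(f"14B weights: {quantised_weight_gb(14):.1f} GB")
    print(f"35B weights: {quantised_weight_gb(35):.1f} GB")
```

Under these assumptions the 14B weights come to just under 8GB, which is consistent with the ~10GB measured during inference. The 35B weights come to just under 20GB, leaving only around 4GB of a 24GB machine for everything else.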
Why I benchmark my own hardware
Published benchmark tables (MMLU scores, HumanEval, GPQA) don't tell me whether a model will hallucinate a supplier name at 02:00 when nothing is watching. They also say nothing about speed on Apple Silicon; most published performance numbers come from NVIDIA hardware.
The only benchmark that matters for my use case: does it work, reliably, on my machine, for my tasks?
Everything else is someone else's data.
This post is part of the Local LLM Lab case study.


