I run two local models and three cloud model tiers. The question that comes up most: when do you actually use each one?
Here's the routing logic I've settled on. Not a framework. Just the actual decision I make every time.
The core question
Would I feel stupid paying API fees for this?
If yes — it's a draft, a quick summary, a reformatted list, a brainstorm — it goes local. Qwen3.5 9B via MLX, zero marginal cost, runs at localhost:8888 through my Hermes local profile.
If no — the task requires reliable reasoning depth, it's going into production, I can't afford a retry — it goes to cloud.
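The core question can be sketched as a small function. This is a minimal sketch, not the actual Hermes configuration: the `Task` fields and the `route` function are hypothetical names, and treating the three cloud conditions as "any one is enough" is my reading of the rule above.

```python
from dataclasses import dataclass

LOCAL = "local"  # Qwen3.5 9B via MLX at localhost:8888
CLOUD = "cloud"


@dataclass
class Task:
    # Hypothetical flags mirroring the three cloud conditions in the text.
    needs_deep_reasoning: bool = False
    goes_to_production: bool = False
    retry_is_costly: bool = False


def route(task: Task) -> str:
    """Return CLOUD if any high-stakes condition holds, otherwise LOCAL."""
    if task.needs_deep_reasoning or task.goes_to_production or task.retry_is_costly:
        return CLOUD
    return LOCAL


print(route(Task()))                         # a quick draft, no flags set
print(route(Task(goes_to_production=True)))  # production output
```

Anything with no flags set is a task I would feel stupid paying for, so the default is local.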
The 31 jobs problem
I run 31 automated jobs via launchd on this machine. They fire at scheduled times: 07:00 for the morning briefing, six times daily for grid intelligence snapshots, 02:00 for the knowledge synthesiser, and so on.
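For readers who have not used launchd: a scheduled job is a property list with a `Label`, `ProgramArguments`, and a `StartCalendarInterval`. The sketch below builds one with Python's standard `plistlib`. The label, script path, and `calendar_plist` helper are hypothetical placeholders, not my real job definitions.

```python
import plistlib


def calendar_plist(label, program_args, times):
    """Build a launchd job definition that fires at each (hour, minute) in times."""
    intervals = [{"Hour": hour, "Minute": minute} for hour, minute in times]
    job = {
        "Label": label,
        "ProgramArguments": list(program_args),
        # launchd accepts either a single dict or an array of dicts here.
        "StartCalendarInterval": intervals[0] if len(intervals) == 1 else intervals,
    }
    return plistlib.dumps(job)


# Hypothetical label and command, for illustration only.
briefing = calendar_plist(
    "com.example.morning-briefing",
    ["/usr/bin/env", "python3", "/path/to/morning_briefing.py"],
    [(7, 0)],
)
print(briefing.decode())
```

A job that runs several times a day passes more than one `(hour, minute)` pair, which produces an array of intervals in the same plist.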
None of them use local models.
The reason is simple: a local model that crashes at 06:00 and stays down means no 07:00 morning briefing. A grid intelligence snapshot that fails silently means a day of missing data. A content review job that times out means the content flywheel stalls.
For unattended automation, reliability is non-negotiable. Local models are good — but the failure modes are different from cloud APIs: cold start, RAM pressure from other processes, model file integrity after an OS update. For 31 daily jobs that nobody's watching, that risk profile is wrong.
All scheduled jobs use Abacus RouteLLM via the default Hermes profile. It routes automatically to the best available model — Opus 4.7, Qwen3-235B, whatever fits the task — and the reliability record is clean.
The full routing table
| Task type | Route | Why |
|---|---|---|
| Drafts, reformatting, summaries | Qwen3.5 9B local (Hermes local) | Zero cost, good enough |
| Brainstorming, ideation | Qwen3.5 9B local | Quantity over quality |
| Scheduled automation | Abacus RouteLLM (Hermes default) | Reliability non-negotiable |
| Deep research, competitor analysis | Claude Sonnet (Hermes researcher) | Reasoning depth justified |
| Strategy, roadmap decisions | Claude Sonnet (Hermes strategist) | Stakes too high for smaller models |
| Code architecture, review | Claude Code | Multi-file context, tool use |
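The table above can also be read as a plain lookup. This is a sketch under assumptions: the short task-type keys and the `route_for` helper are hypothetical names I chose for illustration, and the route strings are copied from the table rather than from any real configuration file.

```python
# Task-type keys are hypothetical shorthand for the table rows.
ROUTES = {
    "draft": "Qwen3.5 9B local (Hermes local)",
    "reformat": "Qwen3.5 9B local (Hermes local)",
    "summary": "Qwen3.5 9B local (Hermes local)",
    "brainstorm": "Qwen3.5 9B local",
    "scheduled": "Abacus RouteLLM (Hermes default)",
    "research": "Claude Sonnet (Hermes researcher)",
    "strategy": "Claude Sonnet (Hermes strategist)",
    "code": "Claude Code",
}


def route_for(task_type: str) -> str:
    """Look up the route for a task type; unknown types raise instead of guessing."""
    try:
        return ROUTES[task_type]
    except KeyError:
        raise ValueError(f"no route defined for task type {task_type!r}") from None


print(route_for("brainstorm"))
print(route_for("scheduled"))
```

Raising on an unknown task type is a design choice in this sketch: a silent default would hide exactly the kind of misrouting the table is meant to prevent.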
What changed when I built the local tier
Before local models, everything went to cloud. That meant paying API fees — or burning subscription tokens — for tasks that genuinely didn't need it. First drafts, content reformatting, vault summaries, brainstorming sessions. All of it going to Sonnet or GPT-4 because there was no cheaper option.
The local tier absorbed that volume. Cloud usage dropped for draft work. Cloud spend shifted toward tasks that actually benefit from the capability gap — strategy, complex reasoning, code review.
The net effect: better allocation, lower cost, and — counterintuitively — better outputs overall. When you stop rationing prompts, you iterate. Iteration produces better work than one careful expensive shot.
What I'm still working out
The 35B model benchmark is pending. If Qwen3.6 35B-A3B fits in 24GB at Q4 and produces meaningfully better strategy and analysis output, the routing changes: 9B for drafts, 35B for anything requiring reasoning depth, cloud only for scheduled jobs and the highest-stakes work.
That changes the cost/quality equation significantly. I'll report back when the benchmark runs.
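Until the benchmark runs, the "fits in 24GB at Q4" question can at least be estimated on paper. The arithmetic below is a rough sketch, not a measurement. It assumes the 35B in the model name is the total parameter count (with roughly 3B active per token), that all weights must be resident in memory, and that Q4 averages about 4.5 bits per weight once quantization scales are included. It ignores the KV cache, runtime overhead, and whatever the OS needs.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB: parameters * bits / 8."""
    return params_billion * bits_per_weight / 8


TOTAL_PARAMS_B = 35    # assumption: total parameters, taken from the model name
BITS_PER_WEIGHT = 4.5  # assumption: 4-bit weights plus per-group scale overhead
MEMORY_GB = 24

weights = weight_footprint_gb(TOTAL_PARAMS_B, BITS_PER_WEIGHT)
headroom = MEMORY_GB - weights
print(f"weights ~{weights:.1f} GB, headroom ~{headroom:.1f} GB")
# prints: weights ~19.7 GB, headroom ~4.3 GB
```

On these assumptions the weights fit, but only about 4 GB is left for context and everything else on the machine, which is why the real benchmark matters more than the estimate.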
This post is part of the Local LLM Lab case study.


