When to Use Local vs Cloud — My Actual Routing Logic

I run two local models and three cloud model tiers. The question that comes up most: when do you actually use each one?

Here's the routing logic I've settled on. Not a framework. Just the actual decision I make every time.

The core question

Would I feel stupid paying API fees for this?

If yes — it's a draft, a quick summary, a reformatted list, a brainstorm — it goes local. Qwen3.5 9B via MLX, zero marginal cost, runs at localhost:8888 through my Hermes local profile.

If no — the task requires reliable reasoning depth, it's going into production, I can't afford a retry — it goes to cloud.

The 31 jobs problem

I run 31 automated jobs via launchd on this machine. They fire at scheduled times: 07:00 for the morning briefing, six times daily for grid intelligence snapshots, 02:00 for the knowledge synthesiser, and so on.
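As a rough illustration of what one of these schedules looks like, here is a minimal sketch that builds a launchd property list for a daily 07:00 job using Python's standard `plistlib`. The label `com.example.morning-briefing` and the script path are hypothetical placeholders, not the actual job definitions; only the `Label`, `ProgramArguments`, and `StartCalendarInterval` keys are standard launchd keys.

```python
import plistlib


def daily_job_plist(label, program_args, hour, minute=0):
    """Build a launchd property list that runs a command once a day."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    job = {
        "Label": label,
        "ProgramArguments": list(program_args),
        # StartCalendarInterval is launchd's calendar-based trigger.
        "StartCalendarInterval": {"Hour": hour, "Minute": minute},
    }
    return plistlib.dumps(job)


# Hypothetical label and script path, for illustration only.
briefing = daily_job_plist(
    "com.example.morning-briefing",
    ["/usr/bin/python3", "/path/to/morning_briefing.py"],
    hour=7,
)
print(briefing.decode("utf-8"))
```

A job that fires several times a day would use a list of such `Hour`/`Minute` dictionaries under `StartCalendarInterval` instead of a single one.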

None of them use local models.

The reason is simple: a local model that crashes at 06:00 and stays down means no 07:00 morning briefing. A grid intelligence snapshot that fails silently means a day of missing data. A content review job that times out means the content flywheel stalls.

For unattended automation, reliability is non-negotiable. Local models are good — but the failure modes are different from cloud APIs: cold start, RAM pressure from other processes, model file integrity after an OS update. For 31 daily jobs that nobody's watching, that risk profile is wrong.

All scheduled jobs use Abacus RouteLLM via the default Hermes profile. It routes automatically to the best available model — Opus 4.7, Qwen3-235B, whatever fits the task — and the reliability record is clean.

The full routing table

Task type	Route	Why
Drafts, reformatting, summaries	Qwen3.5 9B local (Hermes `local`)	Zero cost, good enough
Brainstorming, ideation	Qwen3.5 9B local	Quantity over quality
Scheduled automation	Abacus RouteLLM (Hermes `default`)	Reliability non-negotiable
Deep research, competitor analysis	Claude Sonnet (Hermes `researcher`)	Reasoning depth justified
Strategy, roadmap decisions	Claude Sonnet (Hermes `strategist`)	Stakes too high for smaller models
Code architecture, review	Claude Code	Multi-file context, tool use
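The table can be read as a plain lookup. The sketch below encodes it that way in Python, purely as an illustration: the task-type keys (`"draft"`, `"scheduled"`, and so on), the `Route` class, and `route_for` are names made up for this example, and nothing here is Hermes configuration syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    model: str
    profile: Optional[str]  # Hermes profile name, or None where the table lists none
    reason: str


# Keys are hypothetical task labels; values mirror the rows of the table.
ROUTES = {
    "draft": Route("Qwen3.5 9B local", "local", "Zero cost, good enough"),
    "brainstorm": Route("Qwen3.5 9B local", "local", "Quantity over quality"),
    "scheduled": Route("Abacus RouteLLM", "default", "Reliability non-negotiable"),
    "research": Route("Claude Sonnet", "researcher", "Reasoning depth justified"),
    "strategy": Route("Claude Sonnet", "strategist", "Stakes too high for smaller models"),
    "code": Route("Claude Code", None, "Multi-file context, tool use"),
}


def route_for(task_type):
    """Return the route for a task type, failing loudly on unknown types."""
    try:
        return ROUTES[task_type]
    except KeyError:
        raise ValueError(f"no route defined for task type: {task_type!r}") from None


for task in ("draft", "scheduled", "strategy"):
    chosen = route_for(task)
    print(f"{task}: {chosen.model} ({chosen.reason})")
```

Raising on an unknown task type, rather than silently defaulting to one tier, keeps a new kind of task from landing on the wrong model by accident.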

What changed when I built the local tier

Before local models, everything went to cloud. That meant paying API fees, or burning subscription tokens, on tasks that genuinely didn't need a cloud model: first drafts, content reformatting, vault summaries, brainstorming sessions. All of it went to Sonnet or GPT-4 because there was no cheaper option.

The local tier absorbed that volume. Cloud usage dropped for draft work. Cloud spend shifted toward tasks that actually benefit from the capability gap — strategy, complex reasoning, code review.

The net effect: better allocation, lower cost, and — counterintuitively — better outputs overall. When you stop rationing prompts, you iterate. Iteration produces better work than one careful expensive shot.

What I'm still working out

The 35B model benchmark is pending. If Qwen3.6 35B-A3B fits in 24GB at Q4 and produces meaningfully better strategy and analysis output, the routing changes: 9B for drafts, 35B for anything requiring reasoning depth, cloud only for scheduled jobs and the highest-stakes work.

That would change the cost/quality equation significantly. I'll report back when the benchmark runs.
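Until the benchmark runs, a back-of-envelope estimate gives a feel for whether 35B parameters at Q4 is plausible in 24 GB. The sketch below counts weight storage only, as parameters × bits per weight ÷ 8. The 4.5 bits-per-weight case is an assumption standing in for quantization metadata such as per-group scales; real usage also needs room for the KV cache, activations, and the OS, none of which is modelled here.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight storage in decimal GB: parameters * bits / 8."""
    return params_billions * bits_per_weight / 8


BUDGET_GB = 24  # memory budget named in the post

# 4.0 is the nominal Q4 figure; 4.5 is an assumed allowance for quantization overhead.
for bits in (4.0, 4.5):
    weights_gb = weight_memory_gb(35, bits)
    headroom_gb = BUDGET_GB - weights_gb
    print(f"{bits} bits/weight: ~{weights_gb:.1f} GB weights, ~{headroom_gb:.1f} GB headroom")
```

At a nominal 4 bits the weights alone come to roughly 17.5 GB, so the fit is tight but not obviously impossible. Only an actual run will settle it.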

This post is part of the Local LLM Lab case study.
