The Method

The Failures First.

1,031 hours. ~1,015 sessions. Five platforms. Three years.

Not one of them started the same way twice — until I built a system that made them.

This is what working with AI at scale actually taught me. The failures first.

1,031h

Total hours of AI operator work

~1,015

Total sessions across all platforms

119

Consecutive sessions, zero hallucination resets

346+

Formally documented architectural decisions

149

Session handover documents produced

38

Sessions, longest undetected data error

5

Platforms used across 3 years

20–25

Sessions between documentation audits

Timeline

Jan 2023 – Jul 2024

Phase 1: Interrogative. No System.

330+ Perplexity sessions, 45 ChatGPT conversations. Treated AI as a search engine with better answers. No session continuity, no handovers, no decisions log. Each session started cold. The same ground re-covered constantly. No artefacts survived.

Jun 2025

No North Star Criteria From Session 1

The PIP build started without a scoring gate for feature decisions. By V2 (Sessions 41+), a North Star 3/5 scoring system was introduced. Every feature decision before that had no formal acceptance criteria. The cost: features built that didn't survive V2 scoping. The fix came 40+ sessions too late.

Sessions 1–11

FDB Built Without a Spec. Error Surfaces at Session 49.

The data foundation phase was built without a written design document. That FDB work produced DQ-001: a data quality error that went undetected for 38 sessions and was caught at Session 49. By contrast, RECONCILIATION_DESIGN.md was written before the reconciliation build (Session 50), and a near-perfect implementation followed. Design before build wins, consistently. The cost of skipping it is paid later, with interest.

Ongoing

Context Flooding

The instinct was to pre-load everything — every relevant file, every previous decision, every open question — into the session start. The effect: diffuse focus, multi-objective sessions, agent attention diluted across too many concerns simultaneously. Sessions with a single clear goal outperformed multi-objective sessions consistently across all 5 build streams.

Ongoing

Agent Prompt Drift

Hermes profiles and Cortex agent system prompts go stale without active maintenance. The degradation isn't visible until it causes an error — a hallucinated field name, a wrong assumption about schema structure, a stale priority stack driving the wrong decisions. Agent prompts are first-class artefacts. Treat them like code.

Ongoing

Prompt Regression

Good prompts — anti-fabrication protocols, context pre-loading sequences, constraint structures that produced reliable output — were not preserved. They were rebuilt from scratch each time, slightly differently, losing the accumulated calibration. No prompt library. No versioning. The same calibration work repeated across sessions.
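What a minimal prompt library with versioning might look like is sketched below. It is an illustration only, not the tooling used in the build: the class name `PromptLibrary`, its methods, and the in-memory storage are all assumptions.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass
class PromptLibrary:
    """Hypothetical in-memory store that keeps every version of every named prompt."""

    _versions: dict[str, list[str]] = field(default_factory=dict)

    def save(self, name: str, text: str) -> int:
        """Store text as a new version unless it matches the latest one.

        Returns the 1-based version number that now holds this text.
        """
        history = self._versions.setdefault(name, [])
        if not history or history[-1] != text:
            history.append(text)
        return len(history)

    def latest(self, name: str) -> str:
        """Return the most recent version of a prompt."""
        return self._versions[name][-1]

    def get(self, name: str, version: int) -> str:
        """Return a specific 1-based version, so older calibration stays recoverable."""
        return self._versions[name][version - 1]

    def fingerprint(self, name: str) -> str:
        """Short content hash for recording which prompt text a session used."""
        digest = hashlib.sha256(self.latest(name).encode("utf-8"))
        return digest.hexdigest()[:12]
```

Earlier versions are never overwritten, so a calibrated prompt can be reused verbatim instead of being rebuilt from memory.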

The constraint structure, not model quality, is the explanatory variable. Any model, properly constrained, delivers reliably. The same model without constraints hallucinates constantly.

Key learnings

1. Handover-as-start-prompt

Every session closes with a verbatim handover document — a complete, self-contained start prompt for the next session. Copy. Paste. Send. Zero warm-up. Zero re-explanation of context. Zero drift across the session boundary.
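As a rough sketch, a handover can be modelled as a small record that renders itself into a ready-to-paste start prompt. The field names and section headings below are assumptions chosen for illustration; the actual handover template is not shown here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def _bullets(items: list[str]) -> str:
    """Render items as a bullet list, with an explicit marker when empty."""
    return "\n".join(f"- {item}" for item in items) if items else "- (none)"


@dataclass
class Handover:
    """Hypothetical handover record written at session close."""

    session: int                 # number of the session that just ended
    next_objective: str          # the objective for the next session
    completed: list[str] = field(default_factory=list)
    open_items: list[str] = field(default_factory=list)
    read_first: list[str] = field(default_factory=list)

    def to_start_prompt(self) -> str:
        """Return self-contained text to paste as the next session's first message."""
        sections = [
            f"HANDOVER FROM SESSION {self.session}",
            f"Objective for session {self.session + 1}: {self.next_objective}",
            "Completed:",
            _bullets(self.completed),
            "Open items:",
            _bullets(self.open_items),
            "Read before starting:",
            _bullets(self.read_first),
        ]
        return "\n".join(sections)
```

Empty sections are rendered explicitly as "(none)", so the next session does not have to guess whether something was omitted.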

2. GUARDRAILS as a Living Document

A GUARDRAILS file loaded at every session open. Not a static rulebook: a living document updated in real time. Every rule traces to a specific incident. When a new constraint is added, the incident that caused it is recorded as the reason. The file self-corrects the system.
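The rule-traces-to-incident idea can be sketched as a structure that refuses any rule without a recorded incident. This is an assumed shape, not the real GUARDRAILS format; `Rule`, `Guardrails`, and the rendered layout are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    text: str      # the constraint the agent must follow
    incident: str  # the specific incident that motivated the rule
    session: int   # session in which the rule was added


@dataclass
class Guardrails:
    """Hypothetical model of a GUARDRAILS file."""

    rules: list[Rule] = field(default_factory=list)

    def add(self, text: str, incident: str, session: int) -> Rule:
        """Add a rule; reject any rule that does not name its incident."""
        if not incident.strip():
            raise ValueError("every rule must trace to a specific incident")
        rule = Rule(text.strip(), incident.strip(), session)
        self.rules.append(rule)
        return rule

    def render(self) -> str:
        """Render the text that would be loaded at session open."""
        lines = ["GUARDRAILS"]
        for number, rule in enumerate(self.rules, start=1):
            lines.append(f"{number}. {rule.text}")
            lines.append(f"   Incident (session {rule.session}): {rule.incident}")
        return "\n".join(lines)
```

Making the incident a required field is one way to keep the file from accumulating rules that nobody can justify later.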

3. Append-Only Decisions Log

346+ formally documented architectural decisions across 5 build streams. Every decision marked APPROVED, REVERSED, SUPERSEDED, or DEPRECATED. Nothing deleted.
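One way to get the append-only property is to record every status change as a new entry and derive the current status from the most recent one. The sketch below assumes that approach; the four statuses come from the text, while the class names and the decision ID format are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    APPROVED = "APPROVED"
    REVERSED = "REVERSED"
    SUPERSEDED = "SUPERSEDED"
    DEPRECATED = "DEPRECATED"


@dataclass(frozen=True)
class Entry:
    decision_id: str  # hypothetical ID scheme, e.g. "D-001"
    status: Status
    note: str


class DecisionsLog:
    """Append-only log: entries are added, never edited or removed."""

    def __init__(self) -> None:
        self._entries: list[Entry] = []

    def append(self, decision_id: str, status: Status, note: str) -> None:
        """Record a decision or a status change as a new entry."""
        self._entries.append(Entry(decision_id, status, note))

    def current_status(self, decision_id: str) -> Status:
        """The latest entry for a decision determines its current status."""
        for entry in reversed(self._entries):
            if entry.decision_id == decision_id:
                return entry.status
        raise KeyError(decision_id)

    def history(self, decision_id: str) -> list[Entry]:
        """Full trail for one decision, oldest first."""
        return [e for e in self._entries if e.decision_id == decision_id]
```

Reversing a decision adds a second entry rather than rewriting the first, so the reasoning behind the original approval stays readable.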

4. Tiered Read List

Every session has a read list: mandatory core (always loaded), conditional reference (loaded if in scope), never re-read (stable archive). Controlling what enters the context window is as important as controlling what the agent does with it.
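The three tiers map naturally onto a small selection function. The following is a sketch under assumptions: the topic tags, the `Doc` record, and the matching rule (load a conditional document when any of its topics is in scope) are illustrative, not the documented mechanism.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CORE = "mandatory core"                 # always loaded
    CONDITIONAL = "conditional reference"   # loaded only if in scope
    ARCHIVE = "never re-read"               # stable archive, never loaded


@dataclass(frozen=True)
class Doc:
    path: str
    tier: Tier
    topics: frozenset[str] = frozenset()    # hypothetical scope tags


def files_to_load(read_list: list[Doc], scope: set[str]) -> list[str]:
    """Return the paths that should enter the context window for this session."""
    loaded = []
    for doc in read_list:
        if doc.tier is Tier.CORE:
            loaded.append(doc.path)
        elif doc.tier is Tier.CONDITIONAL and doc.topics & scope:
            loaded.append(doc.path)
        # ARCHIVE documents are skipped deliberately
    return loaded
```

With a session scope of `{"reconciliation"}`, for example, a conditional design document tagged `reconciliation` would be loaded while archived material stays out of the context window.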

5. One Objective Per Session

Sessions with a single clear goal delivered more reliably than multi-objective sessions. Not because AI can't handle multiple objectives — because the operator can't effectively review the output across multiple objectives without the session becoming a context management exercise.

6. Documentation Audit Every 20–25 Sessions

One meta-session, no delivery obligations. Review every live document: GUARDRAILS, BACKLOG, DECISIONS_LOG, agent system prompts. Update what's stale. Archive what's done. Align the documentation with the current state of the build.
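The audit cadence can be checked mechanically. The helper below is a minimal sketch that treats the 20–25 session range as a window; the function name and the three status labels are assumptions.

```python
def audit_status(current_session: int, last_audit_session: int,
                 window: tuple = (20, 25)) -> str:
    """Classify how urgent a documentation audit is.

    Returns "not due" before the window opens, "due" inside it,
    and "overdue" once the upper bound has been passed.
    """
    gap = current_session - last_audit_session
    if gap < 0:
        raise ValueError("last audit cannot be after the current session")
    low, high = window
    if gap < low:
        return "not due"
    if gap <= high:
        return "due"
    return "overdue"
```

A check like this could run at session open, so the audit is prompted by the session count rather than by memory.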

Supporting evidence

The Handover Document: The Single Practice That Changed Everything

Copy. Paste. Send. Zero warm-up. 119 consecutive sessions without a hallucination-forced reset.

Why I Keep a GUARDRAILS File — and What Happens When I Don't

A living document that self-corrects the system. Every rule traces to a specific incident.

38 Sessions of Bad Data: The Most Expensive Oversight in the Build

DQ-001 — a data quality error undetected for 38 sessions. The cost of skipping a design document.

One Objective Per Session — What I Stopped Doing That Made Everything Work Better

Multi-objective sessions diffuse focus. Single-objective sessions deliver consistently.
