
There’s a genre of bloke every pub has. Perfectly pleasant sober. Two schooners in, a different animal entirely — loud, magnetic in the wrong way, and somehow never, ever holding the card when it’s his shout. The phone rings. It always rings.
So I made a video game about getting him banned.
Last Drinks at The Local is a ~30-minute browser game, played on your phone, set in a pixel-art version of my actual local (which goes unnamed in-game, because I’d like to keep drinking there). You play Macca, the newest regular, still on probation for a seat at Table 1, building a case file — Operation Last Drinks — to get Two-Schooner Trev banned before close. Eight mini-games, a shout ledger that doubles as Exhibit A, four endings, and a cast made up of my actual mates, who all gave the nod and are now, legally speaking, NPCs. Trev is the one exception: wholly invented, a composite, never based on a real likeness, for reasons that should be obvious to anyone who has ever owned a defamation lawyer’s phone number.
The game is the joke. The experiment underneath it is the Lab Note: could I ship the whole thing as a conductor — directing, never coding? Four milestones, four git tags, every line of the 12,000-odd written by Claude Code running autonomously, while I sat in a chat window playing publican, art director and bureaucrat. It’s live at thelocal.vasko.com.au. Here’s what the build taught me.
An LLM is a noisy channel for hex
The single most useful artefact in this project isn’t code. It’s a file called MANIFEST.sha256, and the story of why it exists is the story of the whole methodology.
Early on, the project’s “canon” — the design doc, the tech spec, 32 dialogue trees, the trivia pool — lived as documents with their SHA-256 hashes recorded, so an autonomous run could verify it hadn’t drifted from the approved spec. Sensible. Except the hashes were being copied into that document by the chat model. Across three consecutive saves, it made three transcription errors. Sixty-four hex characters is exactly the kind of string a language model will confidently mangle — it’s high-entropy, meaningless, and looks correct at every point where it’s wrong.
The fix wasn’t “be more careful.” It was architectural: route the hashes around the model entirely. Per-file hashes are generated machine-to-machine into MANIFEST.sha256 at the repo root. The project document holds exactly one hash — the hash of the manifest — which travels terminal → clipboard → and back for a read-check, never through model fingers. Every autonomous run verifies the anchor, then sha256sum -c the lot, at the start, mid-run, and before tagging.
That doctrine paid for itself three separate times in a single day. It caught me typing a stray space into the middle of a UUID while hand-assembling a download manifest — the exact failure mode, hours after I’d written the rule. It proved that a script everyone believed had crashed had actually succeeded — fresh downloads hash-compared byte-identical to what was on disk. And it let four unsupervised milestone runs tear through the codebase with the spec mathematically incapable of drifting. If you take one thing from this post: when correctness matters, don’t ask the model to copy it. Ask the machine to verify it.
Four milestones, one interruption each
The other deliberate choice was scheduled human contact. Each big run got exactly one planned pause — a style gate in M3 (pick the art direction from rendered candidates), a feel-check in M4 (load the deployed build on my actual phone and report whether the pub felt like a pub). Otherwise: autonomous. M3 ran about two hours, fanned out eight parallel scene-builder agents, landed 22 commits and a 334-test suite. M4 — lighting, sound, sprites, telemetry, performance, full launch verification — ran 1 hour 50 with that one interruption, and shipped a pub you can hear: an entire ambient soundscape, murmur that swells as Trev gets louder, seventeen sound effects, a jukebox — synthesised at runtime in WebAudio, for a transfer cost of zero bytes.
The cost arc tells its own story. M1 on the cheaper effort setting clocked a notional $26. M2, on the heavy setting, $102 — and earned it, because its eleven-agent audit found a bug the cheaper run had shipped. A single portrait-and-wiring day later registered a notional $489.50, most of it one model reading a third of a billion cached tokens. (Notional, because this ran inside a subscription’s usage limits — which became its own subplot. The weekly meter sat at 66% going into the final stretch, and I learned the hard way that leaving certain integrations connected to a session burns roughly a fifth of it idling. The fix: the agent writes its run log to a file, I relay the report to the tracker from the chat side. Bureaucracy is cheap in a browser tab and expensive in a context window.)
My favourite engineering moment came from a review pass on the portrait wiring. Two adversarial reviewer agents correctly flagged a real bug — the dialogue card re-fetching an image could diverge from the boot-time texture check. Their suggested fix was to clone the loader’s image element. Live QA promptly proved the fix broken: the engine revokes its blob object URLs after texture creation, so the clone renders a dead image. The shipped solution blits the decoded bitmap straight off the loaded texture. Review catches problems; live verification catches reviews. And — to keep myself honest — there was a third leg: when a later bug appeared, my confident diagnosis of the cause was also wrong, and only the prompt’s standing instruction to diagnose before fixing stopped a phantom fix from shipping green. Nothing in this stack gets trusted on vibes. Including me.
The day the regulars got faces
For most of the build, dialogue speakers were coloured rectangles with names on them. Then came the portrait factory: long sessions generating Stardew-style pixel portraits through an image-model integration, me approving every face like a publican checking IDs. Dozens of generations, three expressions per character, each chained off an approved “anchor” so the laugh and the shock are recognisably the same person.
The rules mattered more than the prompts. Real people get affectionate, flattering caricature — never photo-real, never traced. The villain gets no real likeness, ever. Consent nods from everyone depicted before launch, collected and logged like the compliance artefact it genuinely is. And the details my mates supplied themselves turned out to be the best assets in the game: one regular’s signature fishing-company jumper, logo faithfully pixelated; another lit from below by the glow of a phone that never appears on screen; the news-hound tradie who shows up in his hi-vis straight off site, because being on every job in the suburb is how a man knows everything before it’s finished happening. You don’t write characters that good. You just ask people what they’re like and pixelate the answer.
Adding a brand-new character, late, became a clean little assembly line: photo → approve a portrait anchor in chat → generate the expression set → drop the spec into the canon → an autonomous run wires the sprite, the dialogue, the behaviour, reseals the manifest, and deploys to a branch preview. The person becomes a playable NPC in an afternoon, and the canon’s hash moves before the code does, so the run that builds them is verifying against a spec that already contains them.
The soft launch humbled me in ten minutes
Here’s the part no agent could do for me. I shared the “finished” game — four milestones, hundreds of passing tests, verified end-to-end — with a handful of mates. Within ten minutes, two of them, independently, said the same thing: “I don’t know how to play it.”
Three-hundred-and-something green tests, and the game never told a first-time player that the whole screen is a joystick and the people are tappable. Then it got worse, in the instructive way. Someone reported two characters were “too close together and hard to tap” — and the real cause turned out to be that every character only registered taps on a tiny cell at their feet; the visible top three-quarters of each sprite was untappable air. The human reported a symptom (“too close”); verification found the disease (“75% of every NPC is dead zone”). Another player finished all the mini-games and asked, simply, “what now?” — because the game’s own climax, the Hearing, had no signpost pointing at it.
Each of these became a systemic patch — fat tap targets clamped to a real finger size, a taught first minute with a ghost-thumb hint, the climax announced with a banner and a glowing marker — and each shipped with regression tests so it can’t come back. But the lesson is the one every builder re-learns forever: your test suite proves the game does what you built it to do. It cannot tell you that what you built is bewildering to someone seeing it for the first time. There is no agent for “hand it to a stranger and watch their thumbs.” There is only the stranger.
One more, my favourite, because it indicts the cleverness directly. During the Hearing, the character portraits — the ones I’d so carefully generated — were rendering as coloured rectangles. The cause: a placeholder panel from an early milestone that drew rectangles unconditionally and was never wired to portraits, and the graceful fallback I’d built was so polite it logged no error. Launch verification passed clean because the failure looked exactly like a design choice. The fix included making the silent fallback loud — a portrait that should resolve and doesn’t is now a screaming console warning, not a shrug. Graceful degradation is wonderful in production and treacherous in verification, because it makes your failures look like decisions.
What it proves, if anything
A small footnote that’s actually the whole thesis: midway through the final content run, the frontier model I’d been building with had an outage. I switched the run to a different model and it carried on without breaking stride. If the methodology had been “this model writes the code,” an outage stops you cold. Because the methodology is “the canon is the source of truth and the model is an interchangeable executor, verified by machine,” I swapped engines mid-sentence and the spec didn’t notice. Resilience through replaceability.
The execution was never the hard part — it’s been commoditised out from under us, and good riddance. The hard parts were exactly the ones that survived: deciding what’s canon and making drift impossible; knowing which single moment per run genuinely needs a human; writing the spec well enough that an unsupervised two-hour run lands somewhere worth keeping; and then handing the polished thing to a mate and watching them not understand it. The question matters more than the answer. The manifest matters more than the model. And the stranger’s thumbs matter more than the test suite.
The real acceptance test
Then I put a QR code on Table 1 and watched.
They loved it — but that’s not the interesting part, because mates are generous and a free game about themselves is an easy sell. The interesting part was what happened after the laugh. Almost nobody stopped at “this is great.” They went straight to “how did you build this?” — and then, within the same breath, “can you put me in it?” and “can we make one of these to sell?” People started pulling up photos of themselves on the spot, telling me stories that were obviously angling for a character slot, pitching scenarios for the next one. I walked in with a finished game and walked out with a content pipeline and a half-dozen unsolicited business proposals.
Two reactions told me the experiment had actually landed. The first: everyone loved Trev — the villain, the one wholly-invented character, the one nobody could point at. They didn’t love him despite being unable to identify him; they loved what he represents. Every pub has a Trev, so everyone brought their own. The composite worked better than any real likeness could have, which is the rare case where the legal-safe choice was also the creatively superior one.
The second: a noticeable number of people, after playing, went and looked at vasko.com.au — and came back understanding what I actually do for a living in a way that no amount of me explaining “fractional CTO, AI orchestration” ever achieved. A daft game about getting a bloke banned from the pub turned out to be the clearest portfolio piece I’ve ever shipped. It doesn’t describe the methodology. It is the methodology, running, in their hands, about people they know.
The live banter system — small models improvising in-character heckles — produced, unprompted, the line I’d have put on the poster. Sean, surveying Trev’s empty column on the shout ledger: “I’ve seen better charity from a poker machine.” Nobody wrote that. It emerged from a spec. Which is, I suppose, the entire point: you don’t write the answer anymore. You write the conditions under which a good answer becomes inevitable, you make it impossible for the machine to drift from them, and you keep exactly enough humans in the loop to catch what the machine can’t see — the broken fix, the wrong diagnosis, the stranger who doesn’t know it’s a joystick.
The Local is open. The next one’s already being cast. Mine’s a Fernet. 🍻