TL;DR
- A mini PC ordered off Amazon Australia arrived within 24 hours. By mid-morning the next day it was out of bubble wrap and running an autonomous AI engineering team.
- That team (gstack, an open-source Claude Code skill pack) planned, built, reviewed, tested, and shipped a working feature to production in about five hours.
- Three AI-run review phases caught two plan-level mistakes and one critical bug before a line of code was written — including a bug that would have silently truncated exported data once the project scaled.
- The AI asked me for help five times, each time with clear options. It never guessed.
- While verifying the deployment, the process surfaced a separate production bug that had been silently broken for weeks.
- Total cost of the AI tooling: zero. It’s all open source.
What this actually means: the interesting shift in AI-assisted software work right now isn’t bigger or smarter models — those keep improving in the background. It’s the scaffolding built around them. Structured review workflows like this catch the kinds of mistakes that standard AI tools ship, while staying honest about what they don’t know. For anyone watching AI transform knowledge work, this is what the next step actually looks like in practice: less a single robot writing code, more a process that thinks before it builds.
At 7:47 AM on a Tuesday, a Minisforum UM890 Pro mini PC was still wrapped in bubble wrap on my desk. By mid-afternoon it had shipped a feature end-to-end — Linear tickets to merged PR to production canary — run by an autonomous AI engineering team installed on it earlier that morning. Eight hours. Bubble wrap to merged PR.
The team was gstack, Garry Tan’s open-source Claude Code skill pack. 23 skills organised around a sprint structure: think → plan → build → review → test → ship. I pointed it at one Linear epic — a CSV export feature for a personal domain-tracker project, nine tickets — and ran /autoplan. The full loop ran autonomously, except for five moments where it escalated to me.
The stack
- Hardware — Minisforum UM890 Pro (AMD Ryzen 9 8945HS, 32GB RAM, ~$1,340 AUD)
- OS — Ubuntu Server 24.04 LTS, headless
- Remote access — Tailscale (mesh VPN + SSH)
- AI tooling — Claude Code + gstack (Garry Tan’s skill pack, MIT)
- Project management — Linear MCP
- Browser testing — Playwright + headless Chromium
- Target app — Next.js 14 App Router + Supabase + Vitest + TypeScript
- Deployment — Vercel + Cloudflare DNS
/autoplan runs three review phases on any plan before a line of code is written: a CEO-mode review (scope, premise, commercial sanity), a design review (hierarchy, states, accessibility, microcopy), and an engineering review (architecture, tests, performance, security). Three independent perspectives on the same plan, each scoring it and proposing edits. Findings that surface in two or three phases become consensus catches — the ones the author missed in their own framing. Two landed immediately.
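The consensus mechanic is simple enough to sketch. This is my own illustration, not gstack's code: the phase names match the article, but the deduplication key (a normalised finding label) is an assumption.

```typescript
// Illustrative sketch of "consensus catches": findings raised independently
// by two or more review phases. Keys are phase names; values are that
// phase's findings, already normalised to comparable labels (an assumption).
function consensusCatches(findings: Record<string, string[]>): string[] {
  const seenIn = new Map<string, number>();
  for (const phaseFindings of Object.values(findings)) {
    // Set() so a phase repeating itself doesn't count as consensus.
    for (const f of new Set(phaseFindings)) {
      seenIn.set(f, (seenIn.get(f) ?? 0) + 1);
    }
  }
  return [...seenIn].filter(([, n]) => n >= 2).map(([f]) => f);
}
```

A finding only one phase raised stays a single-phase note; anything two or three phases converge on gets promoted.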
The first was the export modal. Design flagged the primary action as inverted: the plan had button → modal → download, when the right flow is button → download, with a modal only as an escape hatch for the five per cent of cases that actually want to pick columns. CEO flagged the same thing through a different lens — scope expansion, two extra days of work, no extra value. Same finding, two seats at the table. Resolution: button-as-primary-action with a kebab menu for the edge cases. Standard Claude Code, left on its own, would have built the modal.
Tests were the second catch. The plan had deferred unit tests on the grounds that the repo had no test runner configured and setting one up was “its own initiative”. Engineering’s third finding and CEO’s eighth landed on the same observation from opposite ends: defer tests on a data-export endpoint and you will never come back to them. Resolution: vitest plus five tests in this PR, thirty minutes of cost, non-negotiable. By the time it shipped, the count was fourteen.
Engineering review contributed the single most valuable catch. Rated critical: Supabase’s PostgREST layer defaults to a thousand-row cap on query results, and exceeding it returns a truncated response with no error thrown and no indication anything was dropped. The CSV export, as planned, would have returned partial data the moment the domain portfolio crossed a thousand entries. Not a broken download. Not a timeout. A correct-looking file missing the last rows. Fix: one line, .range(0, 49_999), plus a count: 'exact' truncation check that throws if the server reports more rows than it returned. Five seconds to apply. Would have taken months to catch in the wild.
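The guard is worth seeing in shape. A hedged reconstruction, not the shipped diff: the QueryResult type here mirrors what supabase-js hands back when a query runs with count: 'exact'.

```typescript
// Sketch of the truncation guard. Assumption: the result shape mirrors
// supabase-js, where `count: 'exact'` makes the server report the total
// matching row count alongside the page it actually returned.
interface QueryResult<T> {
  data: T[];
  count: number | null; // total rows matching the query, per count: 'exact'
}

const EXPORT_ROW_CAP = 50_000; // pairs with .range(0, 49_999) on the query

function assertComplete<T>(result: QueryResult<T>): T[] {
  // PostgREST truncates silently; the only signal is count > rows returned.
  if (result.count !== null && result.count > result.data.length) {
    throw new Error(
      `Export truncated: server reports ${result.count} rows, received ${result.data.length}`
    );
  }
  return result.data;
}
```

The point is the invariant, not the API: if the server knows about more rows than it returned, fail loudly instead of writing a plausible-looking file.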
The run, in numbers: 22 decisions logged, 4 escalated to me, 3 review phases, 2 cross-phase catches, 1 critical ship-blocker avoided, 14 tests added. Total autonomous engineering time: roughly 3 hours. Total AI tooling cost: zero.
How this maps to a human team
- /office-hours — YC partner asking the forcing questions before you build
- /plan-ceo-review — a founder challenging scope
- /plan-design-review — a senior designer catching spec gaps
- /plan-eng-review — a staff engineer finding the ship-blockers
- /review — code review before QA
- /qa — QA lead clicking through the feature in a real browser
- /ship — release engineer opening the PR
- /land-and-deploy — merge, deploy, canary verification
The autonomous part of “autonomous” is where agents most often earn their bad reputation: pressing on through ambiguity, guessing at credentials, committing things nobody wanted committed. This run had five escalations, each with enumerated options and a recommended path. Not once did it guess.
Five moments it stopped and asked
- AppArmor blocking headless Chromium. Three options offered (system-wide unload, per-profile exception, --no-sandbox); recommended per-profile.
- Missing .env.local. Asked for the values rather than guessing defaults.
- QA credentials. Refused to accept a password in chat. Suggested a shreddable temp file so secrets never entered the transcript.
- git push over SSH failing inside the agent’s subshell. Proposed switching to HTTPS via gh’s credential helper. One command.
- /land-and-deploy refused to canary against a Vercel preview URL. Asked for the real production URL. I didn’t have one configured — adding it revealed that production had been broken for weeks.
That fifth escalation turned out to be the one that mattered most. I added domains.vasko.com.au via Cloudflare DNS, pointed it at the Vercel project, and fired the canary. The canary failed. The production site had been returning 500 errors for weeks because the Supabase auth allowlist didn’t include any origin I’d actually been using. Nobody had noticed — I hadn’t sent the URL to anyone. The bug was mine. The discovery was the process. If /land-and-deploy had been willing to canary against whatever URL was handy, I would never have looked.
The 500s were also being amplified by LeakIX scanners from DigitalOcean hammering every path the site had ever exposed — put a domain on the public internet and you have three hours before researchers start probing. That part I expected.
Before the PR opened, gstack ran its own post-implementation retrospective. I read it back. It had caught itself making a mistake.
During implementation it had swapped the planned test coverage layer — route-level, as specified in the reviewed plan — for something more pragmatic: module-level, fourteen tests instead of five. The tests were better than the plan called for. But the swap was unilateral, and gstack flagged itself for it:
Wrong threshold. Since the plan went through autoplan’s three-review gauntlet, the layer spec was itself a reviewed decision — not an implementation detail. Logged as feedback memory: feedback_escalate_plan_deviations.md.
It named the mistake precisely, diagnosed why (confused “implementation detail” for “reviewed-plan decision”), and wrote the correction to persistent memory so future runs escalate instead of deciding on their own. This is rare behaviour in current AI agents. It is ordinary behaviour in senior engineers after a retrospective.
I want to name something plainly: Garry Tan built gstack, Peter Steinberger built OpenClaw (247K stars, essentially solo), and Andrej Karpathy articulated the shift in how software is written that put pressure on the whole scene to catch up. None of them charged me anything this morning, and none of them gatekept anything behind a waitlist. Open source drifted for years toward VC-backed “open core” and projects that closed the second they got traction; gstack is 75K stars, MIT, no paid tier, and OpenClaw is the same shape. The pre-corporate open-source ethos is still alive in places, and this morning it did a day’s work for me.
Right. Back to the receipts.
Who built this
- Andrej Karpathy — articulated the shift.
- Peter Steinberger — OpenClaw (247K stars, MIT).
- Garry Tan — gstack (75K stars, MIT; also President of YC).
My own Sally agent runs on OpenClaw, which is how I found gstack. The open-source AI agent ecosystem isn’t a dozen competing platforms — it’s a small set of composable primitives that different people wire together differently.

The takeaway is not that AI is replacing engineers, and it is not that AI is the future. Hold your existing position on both — this article doesn’t challenge either. What it shows is narrower. Standard Claude Code, left alone, would have built the modal, deferred the tests, and silently truncated at a thousand rows. Same underlying model. Different scaffolding. Different outcome. The frontier worth watching is not bigger models. It is better process around them.
This was a Tuesday.
Timeline
- 07:47 — UM890 Pro out of bubble wrap
- ~09:30 — Ubuntu Server 24.04 installed, Tailscale configured, headless SSH working
- ~11:15 — Claude Code + gstack installed (after a detour for unzip, AppArmor, and one missing .env.local)
- ~12:00 — /autoplan completes: 22 decisions logged, 4 taste gates approved, 2 cross-phase catches (modal overbuilt, tests deferred = never)
- ~12:45 — Implementation ships: 8 atomic commits across csv-columns, POST /api/domains/export, download hook, UI, 14 vitest tests
- ~13:00 — /review flags a stale setTimeout race and missing zod input caps. Fixes both.
- ~13:45 — /qa runs headless Chromium against local dev server, 5/5 golden paths pass, 95/100 health score
- ~14:15 — /ship opens PR #1, all gates green
- ~14:30 — /land-and-deploy prompts for a production URL I didn’t have. Adds domains.vasko.com.au via Cloudflare DNS → discovers prod has been quietly returning 500s and auth is broken on the new origin
- ~15:15 — Supabase auth allowlist fixed, Vercel env vars reconciled, production healthy
- ~15:35 — PR merges. Canary verifies production. Done.
The technical receipts
For the technically curious, a slightly-less-breezy version of what gstack actually did at each phase. Every phase produced an artefact; every artefact fed the next phase.
Plan phase — /autoplan
- Input: 9 Linear tickets pulled via MCP (research → interface → endpoint → UI → client logic → filtering → security → tests → docs).
- Three review subagents ran in sequence: CEO mode (scope/premise), Design (hierarchy/states/a11y/microcopy), Engineering (architecture/tests/perf/security).
- Each review independently scored the plan, flagged findings, proposed edits. Findings that surfaced in two or three reviews became consensus catches.
- Output: a single plan file (main-csv-export-plan-*.md) with 22 logged decisions, 4 escalated to me, ASCII architecture diagrams, locked microcopy, locked a11y spec, and 9 Linear tickets mapped to one PR with explicit dispositions (merged, deferred, scoped-down, closed-no-work).
- ~5 minutes end-to-end.
Build phase — implementation
- Executed against the approved plan, not a fresh prompt. The plan file was the source of truth.
- 8 atomic commits, one per logical unit:
  - vitest infrastructure (no test runner existed before this).
  - csv-columns module (single source of truth for what gets exported).
  - POST /api/domains/export (route with zod validation, explicit projection, truncation guard, per-column transforms, UTF-8 BOM, CSV-injection guard).
  - use-csv-download client hook.
  - Export button + kebab menu + options modal (design-reviewed button-as-primary-action, not the modal-first flow in the original plan).
  - 14 vitest tests plus bug fix for empty-result path.
  - docs/csv-export.md.
  - Two /review fix commits (zod max caps, stale setTimeout ref).
- ~25 minutes.
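Two of those route-level guards are standard enough to sketch from first principles. This is illustrative code with hypothetical names, not gstack's: it shows the formula-injection prefix guard and the UTF-8 BOM.

```typescript
// Hedged sketch of two CSV safety measures: a spreadsheet formula-injection
// guard and a UTF-8 BOM so Excel detects the encoding. Not the shipped code.
const FORMULA_PREFIXES = ['=', '+', '-', '@', '\t', '\r'];

function escapeCell(value: string): string {
  // Neutralise formula injection by prefixing a single quote; a cell like
  // =SUM(A1) must not execute when the export is opened in a spreadsheet.
  const guarded = FORMULA_PREFIXES.some((p) => value.startsWith(p))
    ? `'${value}`
    : value;
  // Standard CSV quoting: double internal quotes, wrap cells that need it.
  return /[",\n]/.test(guarded) ? `"${guarded.replace(/"/g, '""')}"` : guarded;
}

function toCsv(rows: string[][]): string {
  const body = rows.map((r) => r.map(escapeCell).join(',')).join('\n');
  return '\uFEFF' + body; // BOM first so Excel reads the file as UTF-8
}
```

One known tradeoff of the prefix guard: negative numbers exported as strings also get quoted, which is why real implementations usually apply it per-column.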
Review phase — /review
- Static analysis of the diff against main.
- Ran its own subagent review. Flagged 2 issues: a stale setTimeout race in the download hook (real but low-impact) and missing .max(200) zod caps on array inputs (defensive).
- Asked before fixing. I said fix both. Five-line changes.
- Produced a PR quality score with deductions explained (8/10: −1 race, −1 plan deviation on test layer).
- ~5 minutes.
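The stale-setTimeout race is a classic. The real fix lives in a React hook; this dependency-free sketch with illustrative names shows the underlying pattern: track the latest timer so an older cleanup can never fire against a newer download.

```typescript
// Abstract sketch of the race fix (not the shipped hook): each new download
// cancels the cleanup timer scheduled by the previous one, so a stale timer
// can't revoke a resource the user is still downloading.
function makeCleanupScheduler(cleanup: (id: number) => void, delayMs: number) {
  let timer: ReturnType<typeof setTimeout> | null = null;
  return (resourceId: number) => {
    if (timer !== null) clearTimeout(timer); // cancel the stale cleanup
    timer = setTimeout(() => {
      timer = null;
      cleanup(resourceId); // e.g. URL.revokeObjectURL in the real hook
    }, delayMs);
  };
}
```

In the React version the same idea lands as a ref holding the timer id, cleared both on re-invocation and on unmount.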
QA phase — /qa
- Launched headless Chromium via Playwright. Required an AppArmor exception for unprivileged user namespaces on Ubuntu 24.04 — escalated, I applied it.
- Required a .env.local pointing at the same Supabase project as production. Escalated, I populated it.
- Required auth credentials to test the signed-in flow. Refused to accept a password in chat. Suggested a shreddable tmp file; credentials never entered chat history or persistent disk.
- Ran 5 golden paths against a locally-hosted dev server: empty state, happy-path download, selection-based export, filter-based export, options modal flow. 5/5 passed. 95/100 health score (5 points off for minor lint warnings).
- ~10 minutes.
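The shreddable-tmp-file handoff is also easy to sketch. An illustrative version, not the one gstack proposed: the secret touches disk once, is overwritten, then unlinked, so it never lands in the transcript.

```typescript
import {
  mkdtempSync,
  writeFileSync,
  readFileSync,
  unlinkSync,
  existsSync,
} from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Read a secret the human dropped into a temp file, then destroy the file.
// Best-effort shredding: journaling filesystems and SSDs may retain copies,
// but the secret never appears in chat history or shell history.
function readAndShred(path: string): string {
  const secret = readFileSync(path, 'utf8');
  writeFileSync(path, '\0'.repeat(secret.length)); // overwrite before unlink
  unlinkSync(path);
  return secret;
}
```

Usage: the human writes the password to a fresh mkdtempSync directory out-of-band, tells the agent the path, and the agent calls readAndShred once.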
Ship phase — /ship → /land-and-deploy
- Re-ran all test suites (14/14 vitest, TypeScript clean, lint clean).
- Branch pushed, PR opened via gh. SSH auth failed in the agent’s subshell; suggested switching the origin to HTTPS with gh’s credential helper. One-time fix.
- /land-and-deploy blocked on a missing production URL. I added domains.vasko.com.au via Cloudflare DNS plus Vercel. First canary failed — the Supabase auth allowlist didn’t include the new origin. Fixed in the Supabase dashboard. Re-ran canary. Green.
- PR merged to main. Vercel auto-deployed. Canary verified: /login returns 200, /domains redirects to /login via middleware (expected), root/catch-all returns 404 (by design).
- Pre-merge readiness report written. Deploy report saved to .gstack/deploy-reports/. Session close-out written, classifying every change in the working tree by who made it.
- ~15 minutes, including the ~10 minutes I spent fixing the production URL and auth allowlist.
Every phase produced an artefact. The plan file fed /review. The review feedback fed /qa. The QA report fed /ship. /ship fed /land-and-deploy. No phase started with a fresh prompt; each one had receipts from the previous phase to work from. This is what “sprint structure” actually means in practice.
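That chaining discipline can be stated as code. A toy sketch, nothing more, with all names illustrative: each phase is a function of the previous phase's artefact, so no phase ever starts from zero.

```typescript
// Toy model of the artefact chain: every phase consumes the accumulated
// receipts of the phases before it and appends its own.
type Artefact = { phase: string; notes: string[] };

const phases: Array<(prev: Artefact) => Artefact> = [
  (p) => ({ phase: 'review', notes: [...p.notes, 'review findings'] }),
  (p) => ({ phase: 'qa', notes: [...p.notes, 'qa report'] }),
  (p) => ({ phase: 'ship', notes: [...p.notes, 'pr opened'] }),
];

function runSprint(plan: Artefact): Artefact {
  // Fold the plan through every phase; nothing is re-prompted from scratch.
  return phases.reduce((acc, phase) => phase(acc), plan);
}
```

The contrast with a fresh-prompt workflow is the accumulator: drop it and each phase re-derives context, which is where agents usually start guessing.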
Bubble wrap to shipped PR: roughly eight hours, with five moments of “the AI stopped and asked” and one genuinely embarrassing discovery.