
Bubble wrap to merged PR: eight hours with gstack

A mini PC, an open-source Claude Code skill pack, and a Tuesday afternoon spent watching an autonomous AI engineering team catch the bugs I would have shipped.

TL;DR

  • A mini PC ordered off Amazon Australia arrived within 24 hours. By mid-morning the next day it was out of bubble wrap and running an autonomous AI engineering team.
  • That team (gstack, an open-source Claude Code skill pack) planned, built, reviewed, tested, and shipped a working feature to production in about five hours.
  • Three AI-run review phases caught two design mistakes and one critical bug before a line of code was written — including a bug that would have silently corrupted data once the project scaled.
  • The AI asked me for help five times, each time with clear options. It never guessed.
  • While verifying the deployment, the process surfaced a separate production bug that had been silently broken for weeks.
  • Total cost of the AI tooling: zero. It’s all open source.

What this actually means: the interesting shift in AI-assisted software work right now isn’t bigger or smarter models — those keep improving in the background. It’s the scaffolding built around them. Structured review workflows like this catch the kinds of mistakes that standard AI tools ship, while staying honest about what they don’t know. For anyone watching AI transform knowledge work, this is what the next step actually looks like in practice: less a single robot writing code, more a process that thinks before it builds.

At 7:47 AM on a Tuesday, a Minisforum UM890 Pro mini PC was still wrapped in bubble wrap on my desk. By mid-afternoon it had shipped a feature end-to-end — Linear tickets to merged PR to production canary — run by an autonomous AI engineering team installed on it earlier that morning. Eight hours. Bubble wrap to merged PR.

The team was gstack, Garry Tan’s open-source Claude Code skill pack: 23 skills organised around a sprint structure — think → plan → build → review → test → ship. I pointed it at one Linear epic — a CSV export feature for a personal domain-tracker project, nine tickets — and ran /autoplan. The full loop ran autonomously, except for five moments where it escalated to me.

The stack

  • Hardware — Minisforum UM890 Pro (AMD Ryzen 9 8945HS, 32GB RAM, ~$1,340 AUD)
  • OS — Ubuntu Server 24.04 LTS, headless
  • Remote access — Tailscale (mesh VPN + SSH)
  • AI tooling — Claude Code + gstack (Garry Tan’s skill pack, MIT)
  • Project management — Linear MCP
  • Browser testing — Playwright + headless Chromium
  • Target app — Next.js 14 App Router + Supabase + Vitest + TypeScript
  • Deployment — Vercel + Cloudflare DNS

/autoplan runs three review phases on any plan before a line of code is written: a CEO-mode review (scope, premise, commercial sanity), a design review (hierarchy, states, accessibility, microcopy), and an engineering review (architecture, tests, performance, security). Three independent perspectives on the same plan, each scoring it and proposing edits. Findings that surface in two or three phases become consensus catches — the ones the author missed in their own framing. Two landed immediately.
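The consensus mechanic is simple enough to sketch. Assuming each review phase emits a list of findings tagged by the underlying issue (the type names and data shape here are mine, not gstack's actual internals), a consensus catch is any issue reported by two or more independent phases:

```typescript
// Hypothetical sketch of the consensus-catch idea: a finding that
// surfaces in two or more independent review phases gets promoted.
type Phase = "ceo" | "design" | "engineering";

interface Finding {
  phase: Phase;
  tag: string; // normalised identifier for the underlying issue
  note: string;
}

// Group findings by tag, keep those reported by >= 2 distinct phases.
function consensusCatches(findings: Finding[]): string[] {
  const phasesByTag = new Map<string, Set<Phase>>();
  for (const f of findings) {
    const set = phasesByTag.get(f.tag) ?? new Set<Phase>();
    set.add(f.phase);
    phasesByTag.set(f.tag, set);
  }
  return [...phasesByTag.entries()]
    .filter(([, phases]) => phases.size >= 2)
    .map(([tag]) => tag);
}
```

Feed it the modal finding from both the design and CEO seats and it surfaces as a consensus catch; the engineering-only row-cap finding stays a single-phase finding.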

The first was the export modal. Design flagged the primary action as inverted: the plan had button → modal → download, when the right flow is button → download, with a modal only as an escape hatch for the five per cent of cases that actually want to pick columns. CEO flagged the same thing through a different lens — scope expansion, two extra days of work, no extra value. Same finding, two seats at the table. Resolution: button-as-primary-action with a kebab menu for the edge cases. Standard Claude Code, left on its own, would have built the modal.

Tests were the second catch. The plan had deferred unit tests on the grounds that the repo had no test runner configured and setting one up was “its own initiative”. Engineering’s third finding and CEO’s eighth landed on the same observation from opposite ends: defer tests on a data-export endpoint and you will never come back to them. Resolution: vitest plus five tests in this PR, thirty minutes of cost, non-negotiable. By the time it shipped, the count was fourteen.

Engineering review contributed the single most valuable catch. Rated critical: Supabase’s PostgREST layer defaults to a thousand-row cap on query results, and exceeding it returns a truncated response with no error thrown and no indication anything was dropped. The CSV export, as planned, would have returned partial data the moment the domain portfolio crossed a thousand entries. Not a broken download. Not a timeout. A correct-looking file missing the last rows. Fix: one line, .range(0, 49_999), plus a count: 'exact' truncation check that throws if the server reports more rows than it returned. Five seconds to apply. Would have taken months to catch in the wild.
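The guard pattern is worth spelling out. Here is a minimal sketch of the idea — the pure check is runnable; the supabase-js wiring in the comment follows its documented `select`/`range`/`count` options, though the helper name is mine, not the project's actual code:

```typescript
// Sketch of the truncation guard. PostgREST silently caps results
// (default 1,000 rows) with no error thrown, so the only reliable
// signal is the server-side row count.

// Pure check: throws if the server reports more rows than it returned.
function assertNotTruncated<T>(rows: T[], totalCount: number | null): T[] {
  if (totalCount !== null && totalCount > rows.length) {
    throw new Error(
      `Export truncated: server holds ${totalCount} rows, received ${rows.length}`
    );
  }
  return rows;
}

// Wiring with supabase-js would look roughly like this (assumption,
// based on its documented API):
//
//   const { data, count, error } = await supabase
//     .from("domains")
//     .select("*", { count: "exact" })
//     .range(0, 49_999);
//   if (error) throw error;
//   const rows = assertNotTruncated(data ?? [], count);
```

The crucial property: a truncated export fails loudly instead of producing a correct-looking file with missing rows.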

The run, in numbers: 22 decisions logged, 4 of them escalated to me during planning, 5 more escalations during the run itself, 3 review phases, 2 cross-phase catches, 1 critical ship-blocker avoided, 14 tests added. Total autonomous engineering time: roughly 3 hours. Total AI tooling cost: zero.

How this maps to a human team

  • /office-hours — YC partner asking the forcing questions before you build
  • /plan-ceo-review — a founder challenging scope
  • /plan-design-review — a senior designer catching spec gaps
  • /plan-eng-review — a staff engineer finding the ship-blockers
  • /review — code review before QA
  • /qa — QA lead clicking through the feature in a real browser
  • /ship — release engineer opening the PR
  • /land-and-deploy — merge, deploy, canary verification

The autonomous part of “autonomous” is where agents most often earn their bad reputation: pressing on through ambiguity, guessing at credentials, committing things nobody wanted committed. This run had five escalations, each with enumerated options and a recommended path. Not once did it guess.

Five moments it stopped and asked

  • AppArmor blocking headless Chromium. Three options offered (system-wide unload, per-profile exception, --no-sandbox); recommended per-profile.
  • Missing .env.local. Asked for the values rather than guessing defaults.
  • QA credentials. Refused to accept a password in chat. Suggested a shreddable temp file so secrets never entered the transcript.
  • git push over SSH failing inside the agent’s subshell. Proposed switching to HTTPS via gh’s credential helper. One command.
  • /land-and-deploy refused to canary against a Vercel preview URL. Asked for the real production URL. I didn’t have one configured — adding it revealed that production had been broken for weeks.
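The credentials escalation is easy to reproduce in miniature. A sketch of the shreddable-temp-file idea (file naming and helper are mine, not gstack's): write the secret to a mode-0600 file out-of-band, have the agent read it once, overwrite the bytes, then unlink — so the value never enters the transcript and doesn't linger on disk.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Read a secret from a throwaway file, then shred it: overwrite the
// bytes before unlinking so the plaintext does not linger on disk.
// (Best-effort on journaling/SSD filesystems, but strictly better
// than pasting a password into a chat transcript.)
function readAndShredSecret(filePath: string): string {
  const secret = fs.readFileSync(filePath, "utf8").trim();
  const size = fs.statSync(filePath).size;
  fs.writeFileSync(filePath, Buffer.alloc(size)); // overwrite with zeros
  fs.unlinkSync(filePath);
  return secret;
}

// Example: the human writes the password out-of-band, the agent consumes it.
const tmp = path.join(os.tmpdir(), `qa-secret-${process.pid}`);
fs.writeFileSync(tmp, "hunter2\n", { mode: 0o600 });
const password = readAndShredSecret(tmp);
```

After the call, the file is gone and the secret exists only in the process's memory for the duration of the QA run.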

That fifth escalation turned out to be the one that mattered most. I added domains.vasko.com.au via Cloudflare DNS, pointed it at the Vercel project, and fired the canary. The canary failed. The production site had been returning 500 errors for weeks because the Supabase auth allowlist didn’t include any origin I’d actually been using. Nobody had noticed — I hadn’t sent the URL to anyone. The bug was mine. The discovery was the process. If /land-and-deploy had been willing to canary against whatever URL was handy, I would never have looked.

The 500s were also being amplified by LeakIX scanners from DigitalOcean hammering every path the site had ever exposed — put a domain on the public internet and you have three hours before researchers start probing. That part I expected.

Before the PR opened, gstack ran its own post-implementation retrospective. I read it back. It had caught itself making a mistake.

During implementation it had swapped the planned test coverage layer — route-level, as specified in the reviewed plan — for something more pragmatic: module-level, fourteen tests instead of five. The tests were better than the plan called for. But the swap was unilateral, and gstack flagged itself for it:

“Wrong threshold. Since the plan went through autoplan’s three-review gauntlet, the layer spec was itself a reviewed decision — not an implementation detail. Logged as feedback memory: feedback_escalate_plan_deviations.md.”

It named the mistake precisely, diagnosed why (confused “implementation detail” for “reviewed-plan decision”), and wrote the correction to persistent memory so future runs escalate instead of deciding on their own. This is rare behaviour in current AI agents. It is ordinary behaviour in senior engineers after a retrospective.

I want to name something plainly. Garry Tan built gstack; Peter Steinberger built OpenClaw (247K stars, essentially solo); Andrej Karpathy articulated the shift in how software is written that pushed the whole scene to catch up. None of them charged me anything this morning, and none of them gated anything behind a waitlist. Open source drifted for years toward VC-backed “open core” and projects that closed up the moment they got traction; gstack is 75K stars, MIT, no paid tier, and OpenClaw is the same shape. The pre-corporate open-source ethos is still alive in places, and this morning it did a day’s work for me.

Right. Back to the receipts.

Who built this

  • Andrej Karpathy — articulated the shift.
  • Peter Steinberger — OpenClaw (247K stars, MIT).
  • Garry Tan — gstack (75K stars, MIT; also President of YC).

My own Sally agent runs on OpenClaw, which is how I found gstack. The open-source AI agent ecosystem isn’t a dozen competing platforms — it’s a small set of composable primitives that different people wire together differently.

[Screenshot: gstack session close-out, showing a table classifying every change in the working tree by who made it (me, gstack, Next.js), with suggested actions, followed by a plain-language summary of the full chain that ran.]
gstack’s session close-out. After shipping, it classified every change in the working tree by who made it, flagged one as mild scope creep it should have asked about first, and stated the outcome without ceremony.

The takeaway is not that AI is replacing engineers, and it is not that AI is the future. Hold your existing position on both — this article doesn’t challenge either. What it shows is narrower. Standard Claude Code, left alone, would have built the modal, deferred the tests, and silently truncated at a thousand rows. Same underlying model. Different scaffolding. Different outcome. The frontier worth watching is not bigger models. It is better process around them.

This was a Tuesday.


Timeline

  • 07:47 — UM890 Pro out of bubble wrap
  • ~09:30 — Ubuntu Server 24.04 installed, Tailscale configured, headless SSH working
  • ~11:15 — Claude Code + gstack installed (after a detour for unzip, AppArmor, and one missing .env.local)
  • ~12:00 — /autoplan completes: 22 decisions logged, 4 taste gates approved, 2 cross-phase catches (modal overbuilt, tests deferred = never)
  • ~12:45 — Implementation ships: 8 atomic commits across csv-columns, POST /api/domains/export, download hook, UI, 14 vitest tests
  • ~13:00 — /review flags a stale setTimeout race and missing zod input caps. Fixes both.
  • ~13:45 — /qa runs headless Chromium against local dev server, 5/5 golden paths pass, 95/100 health score
  • ~14:15 — /ship opens PR #1, all gates green
  • ~14:30 — /land-and-deploy prompts for a production URL I didn’t have. Adds domains.vasko.com.au via Cloudflare DNS → discovers prod has been quietly returning 500s and auth is broken on the new origin
  • ~15:15 — Supabase auth allowlist fixed, Vercel env vars reconciled, production healthy
  • ~15:35 — PR merges. Canary verifies production. Done.

The technical receipts

For the technically curious, a slightly-less-breezy version of what gstack actually did at each phase. Every phase produced an artefact; every artefact fed the next phase.

Plan phase — /autoplan

  • Input: 9 Linear tickets pulled via MCP (research → interface → endpoint → UI → client logic → filtering → security → tests → docs).
  • Three review subagents ran in sequence: CEO mode (scope/premise), Design (hierarchy/states/a11y/microcopy), Engineering (architecture/tests/perf/security).
  • Each review independently scored the plan, flagged findings, proposed edits. Findings that surfaced in two or three reviews became consensus catches.
  • Output: a single plan file (main-csv-export-plan-*.md) with 22 logged decisions, 4 escalated to me, ASCII architecture diagrams, locked microcopy, locked a11y spec, and 9 Linear tickets mapped to one PR with explicit dispositions (merged, deferred, scoped-down, closed-no-work).
  • ~5 minutes end-to-end.

Build phase — implementation

  • Executed against the approved plan, not a fresh prompt. The plan file was the source of truth.
  • 8 atomic commits, one per logical unit:
    • vitest infrastructure (no test runner existed before this).
    • csv-columns module (single source of truth for what gets exported).
    • POST /api/domains/export (route with zod validation, explicit projection, truncation guard, per-column transforms, UTF-8 BOM, CSV-injection guard).
    • use-csv-download client hook.
    • Export button + kebab menu + options modal (design-reviewed button-as-primary-action, not the modal-first flow in the original plan).
    • 14 vitest tests plus bug fix for empty-result path.
    • docs/csv-export.md.
    • Two /review fix commits (zod max caps, stale setTimeout ref).
  • ~25 minutes.
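The shape of the csv-columns module plus the route's CSV hardening can be sketched together. This is illustrative only — the column names and helper functions are mine, not the project's — but it shows the pattern the commits describe: one column spec drives the header row, projection, and per-column transforms; formula-looking cells are neutralised; a UTF-8 BOM is prepended so Excel detects the encoding.

```typescript
// Illustrative sketch of a single-source-of-truth column spec plus the
// CSV guards mentioned in the build phase; not the project's actual code.
interface DomainRow {
  name: string;
  registrar: string;
  expiresAt: string; // ISO datetime
}

interface Column {
  header: string;
  value: (row: DomainRow) => string;
}

// One spec drives header row, projection, and per-column transforms.
const CSV_COLUMNS: Column[] = [
  { header: "Domain", value: (r) => r.name },
  { header: "Registrar", value: (r) => r.registrar },
  { header: "Expires", value: (r) => r.expiresAt.slice(0, 10) },
];

// CSV-injection guard: prefix cells a spreadsheet would execute as a
// formula, then always quote and escape embedded quotes.
function sanitiseCell(cell: string): string {
  const risky = /^[=+\-@\t\r]/.test(cell);
  const safe = risky ? `'${cell}` : cell;
  return `"${safe.replace(/"/g, '""')}"`;
}

function toCsv(rows: DomainRow[]): string {
  const lines = [
    CSV_COLUMNS.map((c) => sanitiseCell(c.header)).join(","),
    ...rows.map((row) =>
      CSV_COLUMNS.map((c) => sanitiseCell(c.value(row))).join(",")
    ),
  ];
  return "\uFEFF" + lines.join("\r\n"); // UTF-8 BOM for Excel
}
```

Adding a column means touching exactly one array, which is what "single source of truth" buys you.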

Review phase — /review

  • Static analysis of the diff against main.
  • Ran its own subagent review. Flagged 2 issues: a stale setTimeout race in the download hook (real but low-impact) and missing .max(200) zod caps on array inputs (defensive).
  • Asked before fixing. I said fix both. Five-line changes.
  • Produced a PR quality score with deductions explained (8/10: −1 race, −1 plan deviation on test layer).
  • ~5 minutes.
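The missing input caps are a one-idea fix. The project's version used zod's `.max(200)` on array inputs; here is a dependency-free sketch of the same defensive principle, with names of my own choosing:

```typescript
// Dependency-free sketch of the defensive cap /review asked for.
// The real fix used zod's .max(); the principle is identical: never
// let a client-supplied array grow unbounded before you iterate it.
function requireCappedArray<T>(
  value: unknown,
  max: number,
  field: string
): T[] {
  if (!Array.isArray(value)) {
    throw new Error(`${field}: expected an array`);
  }
  if (value.length > max) {
    throw new Error(
      `${field}: at most ${max} items allowed, got ${value.length}`
    );
  }
  return value as T[];
}

// Hypothetical usage inside the export route's request handling:
//   const ids = requireCappedArray<string>(body.selectedIds, 200, "selectedIds");
```

Without the cap, a single request carrying a huge array can turn a cheap export endpoint into a denial-of-service vector.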

QA phase — /qa

  • Launched headless Chromium via Playwright. Required an AppArmor exception for unprivileged user namespaces on Ubuntu 24.04 — escalated, I applied it.
  • Required a .env.local pointing at the same Supabase project as production. Escalated, I populated it.
  • Required auth credentials to test the signed-in flow. Refused to accept a password in chat. Suggested a shreddable tmp file; credentials never entered chat history or persistent disk.
  • Ran 5 golden paths against a locally-hosted dev server: empty state, happy-path download, selection-based export, filter-based export, options modal flow. 5/5 passed. 95/100 health score (5 points off for minor lint warnings).
  • ~10 minutes.

Ship phase — /ship → /land-and-deploy

  • Re-ran all test suites (14/14 vitest, TypeScript clean, lint clean).
  • Branch pushed, PR opened via gh. SSH auth failed in the agent’s subshell; suggested switching the origin to HTTPS with gh’s credential helper. One-time fix.
  • /land-and-deploy blocked on a missing production URL. I added domains.vasko.com.au via Cloudflare DNS plus Vercel. First canary failed — the Supabase auth allowlist didn’t include the new origin. Fixed in the Supabase dashboard. Re-ran canary. Green.
  • PR merged to main. Vercel auto-deployed. Canary verified: /login returns 200, /domains redirects to /login via middleware (expected), root / catch-all returns 404 (by design).
  • Pre-merge readiness report written. Deploy report saved to .gstack/deploy-reports/. Session close-out written, classifying every change in the working tree by who made it.
  • ~15 minutes, including the ~10 minutes I spent fixing the production URL and auth allowlist.
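The canary's pass/fail logic reduces to comparing observed responses against an expected map. A minimal sketch, using the three routes this run actually verified (helper names are mine, and the exact redirect status code is my assumption — Next.js middleware redirects default to 307):

```typescript
// Minimal sketch of the canary assertion from the ship phase: each
// route has an expected status, and any mismatch fails the deploy.
interface CanaryCheck {
  path: string;
  expectedStatus: number;
}

// The three checks this run verified against production.
const CHECKS: CanaryCheck[] = [
  { path: "/login", expectedStatus: 200 },
  { path: "/domains", expectedStatus: 307 }, // middleware redirect to /login
  { path: "/", expectedStatus: 404 },        // root catch-all, by design
];

// Compare observed statuses (e.g. gathered via fetch with
// redirect: "manual") against expectations; return the failures.
function canaryFailures(
  observed: Record<string, number>,
  checks: CanaryCheck[] = CHECKS
): string[] {
  return checks
    .filter((c) => observed[c.path] !== c.expectedStatus)
    .map(
      (c) =>
        `${c.path}: expected ${c.expectedStatus}, got ${observed[c.path] ?? "no response"}`
    );
}
```

A healthy deploy returns an empty failure list; the broken-allowlist production state from earlier in the day would have reported `/login: expected 200, got 500` and blocked the merge.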

Every phase produced an artefact. The plan file fed /review. The review feedback fed /qa. The QA report fed /ship. /ship fed /land-and-deploy. No phase started with a fresh prompt; each one had receipts from the previous phase to work from. This is what “sprint structure” actually means in practice.

Bubble wrap to shipped PR: roughly eight hours, with five moments of “the AI stopped and asked” and one genuinely embarrassing discovery.