A few days ago, one of my agents edited a production config file, invented a setting that doesn’t exist in the API it was configuring, committed the change under my name, and told me it had restarted the service to apply it.
None of it had happened. The setting was fabricated. The commit was misattributed. The service had been running, untouched, for sixteen hours. The agent — Sally, the assistant that lives on my Telegram — had narrated a clean sequence of actions it never took, confidently, in the first person.
That’s logged as VAS-869, and it’s the most useful bug I’ve hit this year. Not for what it broke — it broke nothing, which is part of the problem — but for what it forced me to look at.
If you’ve spent any real time wiring AI agents into real work, the failure mode is the one you’ve probably already met. More capability doesn’t fix it. More memory doesn’t fix it. More tools — newer models, larger context windows, fancier scaffolding — don’t fix it. At some point you notice that the thing you keep trying to patch downstream is not the thing that is actually broken.
The bottleneck has moved. It has moved upstream — to the question. To the spec. To the discipline of stopping long enough to know what you are actually building, and being honest about whether you are still on track once you start.
I’ve been building a personal AI operating stack to address that. Three components. From the outside it looks like three projects in flight, all competing for the same evenings. It’s actually one system, built around one belief, and this is the article that names them: what each one is for, what shortcoming it addresses, and what they are reaching for together.
If you read the rest of this and the only thing you carry away is the next sentence, you’ve got the article.
Execution is a commodity. The question matters more than the answer.
The operating system — Vasko OS
Start with the system the other two pieces live inside, because the parts only make sense once you know what the whole is trying to do.
For most of my career I’ve generated ideas faster than any one person can ship them. Forty-six million things, circling. The good ones rarely die — they just queue up behind the next good one. I finish eighty per cent of what I start, then the shine wears off and the next idea has my attention. The cost compounds quietly. Half-shipped projects don’t return capital. Clients don’t see full delivery. The gap between can do and did do eats the thing that should be obvious by now.
That gap, in 2026, is no longer about capability. The middle — the doing — is the part the machines can take. The bookends — the question and the judgement — are the part only I can bring. The system has to be built around that division of labour. Vasko OS is the personal operating system that does it: the orchestration layer that captures every idea, triages the survivors into roadmaps with locked acceptance criteria, holds me accountable to shipping those milestones, and routes the doing to a fleet of specialised agents so my own attention stays on the question and the judgement.
The animating belief
Most productivity systems are built on the opposite assumption — that the bottleneck is throughput, so the answer is to do more, faster. Capture every task. Optimise every workflow. Automate every step. But more throughput on the wrong question is just more wrong, faster. The systems that flatter you with productivity metrics are often the ones that obscure the absence of judgement underneath.
The system I needed had to invert that. It had to make the question hard to skip and the answer easy to ship. It had to refuse to let me confuse motion for progress. It had to free me to compose, not to type.
The user, honestly
I’m a generalist who knows what specialists look like. A commercially-minded technical architect. I used to write code; I read it now. I’m a director, not a soloist. My strengths are taste, judgement, and orchestration. My weaknesses are the admin work, the last-mile polish, and the eighty-to-shipped gap that probable ADHD makes harder than it should be.
If I’m going to deliver to my potential, the system has to be honest about that user. Not aspirational. Not “what Vasko would be if he were different.” The actual one. The one who’ll pick up the phone in the car and capture a thought, and never open the spreadsheet you asked him to maintain. The one who’ll run the meeting brilliantly and forget to send the follow-up. The one whose best work emerges when the friction is at the right places and absent from all the others.
A system that requires me to be someone I’m not will fail. A system that lets me be exactly who I am — and routes around the parts that don’t compound — can work.
The job to be done
Three jobs, ranked by how much they unlock.
The three jobs, ranked by what they unlock
1. Capture
Every idea, without losing it
Voice, text, browser clip, email — wherever the thought lands. Friction approaches zero.
2. Triage
Adversarial spec, or honest archive
FortyTwo’s job. Specs converge by surviving the questioning, not by being drafted into existence.
3. Accountability
Ship the next milestone
Surface the highest-value next action; notice when something stalls at eighty per cent and ask why.
Capture every idea without losing it. Friction approaches zero. Voice, text, browser clip, email forward, wherever it lands. The system catches the moment the idea arrives, because the alternative is the idea evaporating between the gym and the desk.
Triage captured ideas into structured roadmaps. This is where FortyTwo lives, and it gets its own section. Ideas that survive the questioning become projects with milestones and acceptance criteria locked at the point of definition. The ones that don’t survive are honestly archived. Specs are adversarial by design — they converge by surviving, not by being drafted into existence.
Hold me accountable to milestone completion. The system knows what every active project’s next milestone is, what “shipped” means for that milestone, and whether I’m on track. It surfaces the biggest-value next action, not the most exciting one. It notices when I stall at eighty per cent and asks why.
If those three jobs are done well, I deliver to my potential. If they’re not, no amount of execution capability matters.
Projects never finish. Milestones do.
This is the operating principle that makes the eighty per cent problem tractable.
A project is a long-lived thing — vasko.com.au, my health, my mother’s care, the agent fleet itself. Projects evolve, grow, pivot, occasionally sunset. They don’t have a finish line. Asking “is the project done?” is the wrong question. It’s the question that traps people who measure themselves on completion.
A milestone is a fixed bundle of related work with acceptance criteria locked at definition time. It has a finish line. Shipped, or not shipped. Binary.
Vasko OS holds me accountable to milestones, not projects. The reframe is small and the effect is large. “Did I ship the next milestone for vasko.com.au this quarter?” is a question I can answer. “Is vasko.com.au done?” is a question that never resolves and so never holds.
A fleet of specialists, not one assistant
The first version of this system was a single agent named Sally, handling everything. She was helpful, then strained, then unreliable on the things that mattered most. The failure mode was instructive: a single prompt can’t be optimised both for fast casual capture and for the rigour required when a health-critical threshold gets crossed. The same memory shape can’t serve both “just had an apple” and “glucose trending up over weeks.” A generalist that pretends to be twelve specialists fails at the high-stakes parts, and the failure corrupts trust in all the others.
So the new shape is a small team. The metaphor I use is an orchestra. I’m the composer. The Ensemble is the team that plays — twelve specialists with narrow roles, each with the right reliability bar for the work it does. A Coach for health. A Capture assistant for the mouth of the funnel. A Project Manager for the work surface. An Engineering Manager for the technical side. Coders and Testers and a Researcher and an Editor. A Concertmaster who coordinates the fleet so I don’t have to. An Educator who surfaces the right saved material at the right moment.
The principle underneath: if the process wouldn’t work for a small human team, it probably won’t work for a small agent fleet either. Specialists, clear scopes, explicit escalation, audit trails on the things that matter. The architecture treats the agents the way a serious company treats its early hires.
There’s a posture behind this that isn’t the startup default. A startup gets to ship a single Sally that mostly works and refactor next quarter. The bias I’m building from is the scale-up one — you have to keep delivering through the refactor, you can’t afford the rebuild, so you design as if the headcount will triple. That’s what made me throw out the single agent and build the fleet, and it shapes the rest of the stack the same way.
The single design rule
There is one rule that cuts across the whole system and constrains every decision underneath it.
Anything an agent needs to know, the agent must derive automatically. Anything dependent on manual capture from me will eventually fail.
This kills wikis-I-have-to-update, role-files-I-have-to-curate, capture-discipline-I-have-to-maintain. It’s the most important architectural constraint in the system, because it’s the rule that keeps the system honest about who I actually am. If a workflow only functions when I’m at my best, it doesn’t function. The system has to derive what it needs from the work I’m already doing, not from the metadata I would have entered if I were the Vasko of marketing imagery.
What “done” looks like
I’ll know Vasko OS is working when
- No idea I’ve had in the last thirty days has been lost.
- Every active project carries a current roadmap with the next milestone defined and acceptance criteria locked.
- I ship milestones to acceptance rather than projects to eighty per cent.
- I can name the biggest-value next action across all active work in under sixty seconds.
- My mother’s care, my health, and my work all run without me being the single point of failure.
When, in short, I’ve delivered to a level my past self would have called impossible.
That’s what the operating system is for. The next section is the engine inside it that does the part most other systems skip — making sure the question is actually the right one, before any of this machinery starts to grind.
The upstream spec engine — FortyTwo
You’ve probably had the version of this experience that doesn’t show up in a blog post. You give the agent a brief. It produces output. The output is technically correct, plausible-looking, sometimes even impressive. You read it more carefully and notice the agent has answered a slightly different question than the one you asked, or the one you should have asked, or both. You patch the brief. The next output is closer, also wrong in a new way. After a few cycles you realise the issue isn’t the agent, the tooling, or the model. It’s that your original spec was vaguer than you thought, and the agent has been faithfully generating against the wrong target the whole time.
Now scale that. Hand twelve agents the same vague brief. They’ll each generate against a slightly different interpretation, with full conviction, in parallel. Add memory and they’ll remember the wrong target. Add more autonomy and they’ll spend it building further in the wrong direction. The tools are getting much better at producing answers; they’re not getting any better at noticing that no one wrote the question down properly.
This is the bottleneck. Spec quality is the bottleneck for autonomous execution. Everything downstream of the spec is, by 2026, mostly solved or rapidly being solved. Coding agents are good and getting better. Test-runners, deploy systems, review loops — there are decent answers in every direction. What no one is being especially careful about is the part before the agent starts: framing the work with enough rigour that the autonomy can actually be trusted.
That’s the gap FortyTwo exists to fill.
FortyTwo isn’t a faster way to produce code. It isn’t a clever prompt template or a new agent harness. It’s a discipline. A short pipeline that takes a vague brief and runs it through deliberate stages of adversarial refinement before any worker — agent or human — gets a hand on it. Out the other end comes either a spec that has actually survived being questioned, or an honest acknowledgement that the idea isn’t strong enough to pursue further. Both outcomes are wins. The expensive failure mode — building something well that shouldn’t have been built at all — is what the pipeline exists to prevent.
The name is a Douglas Adams joke and the joke is doing real work. Forty-two is the answer to a question nobody understood. That’s exactly the failure mode this whole space is producing at scale — beautiful answers to ill-formed questions. FortyTwo is what happens when you take the question seriously enough that the answer becomes almost an implementation detail.
The pipeline runs the brief through a sequence of personas — each one with a different lens, each one obliged to find the part of the spec that hasn’t been thought through hard enough. The casting is from the books, and the casting is not a joke. Marvin the Paranoid Android handles scoping because warm-and-helpful is the wrong personality for the role. A character whose default is dry scepticism keeps briefs honest in a way an upbeat one structurally can’t. Character is operational discipline disguised as personality. Each persona’s voice carries the rigour the stage requires. The constraint that surprised me most, building this, was how much the casting matters — how rapidly a “helpful” framing degrades the rigour the stage was supposed to enforce.
The architecture isn’t novel. It rhymes with what a small, decent agency actually does. A PM scopes the brief and refuses to advance an under-specified one. An engineer pressure-tests for feasibility and dependencies. A QA mind asks what shipped means, in observable terms. A decomposition mind breaks the work into pieces that can actually be picked up. The same principle that organised Vasko OS shows up here too: if a process wouldn’t work for a small human team, it probably won’t work for a small agent fleet either. The agent-native version isn’t fundamentally different from the human-team version. The bottlenecks rhyme; the patterns that solve them rhyme too.
What this changes for me is the role I play. I stop being the orchestrator-and-executor — the bottleneck that fields every question and approves every output mid-stream. I become the spec-setter and the final reviewer. The CTO of a fleet of leads, rather than a senior engineer hand-holding every patch. Most of the strategic judgement lives in the question. Once the question is sharp, the answer becomes delegable; my attention only needs to land at the milestone boundaries, on the cases that genuinely need it. That role shift is the leverage FortyTwo is reaching for. Not faster code. Compounded judgement.
Generation scales effortlessly. Validation does not.
Here’s the honest read. The agent space is busy building the wrong end of the problem. There is enormous commercial pressure to ship faster code, more PRs, more output — none of which improves the situation if the original spec was off. Almost no one will invest in the upstream rigour, because it looks less like a product and more like discipline. That’s the moat. Anyone with a credit card and a weekend can wire a coding agent to a queue. Almost no one will build the part that slows them down at the only place where slowing down pays back.
The runtime that doesn’t fabricate — Hermes
So far this article has been about parts of the stack that exist mostly on paper. Vasko OS is a vision document and ten architecture decision records. FortyTwo is structurally designed and operationally dormant. The third component is the one that has bled. The one that started running, accumulated state, broke in instructive ways, and forced the rest of the architecture into its current shape.
Sally and Rocky — the capture agent and the health agent, two of the Ensemble’s specialists — currently run on OpenClaw, the open-source agent runtime I’ve written about here before. OpenClaw has served well. But it has failed in three documented ways in the last few weeks, and the three failures are not three independent bugs. They are one missing idea showing up in three places.
Three failures, one missing idea
- It fabricated production state. VAS-869 — the config edit above. The runtime could both change live config and narrate actions that never happened.
- It lost a month of memory. VAS-878 — a routine session rotation dropped roughly thirty days of Telegram context. Working memory wasn’t durable, because the runtime was the memory.
- Its config silently drifted. VAS-872 — the gateway loads a config file no git repo tracks. The versioned copy is stale and never read. Nothing reconciles the two.
The common root cause is structural. OpenClaw has no clean seam between the runtime — the loop that drives the agent — and the substrate — the memory, the audit trail, the identity, the config. When the runtime is the memory, memory dies when the session does. When the runtime owns its own unversioned config, the config drifts. And when nothing outside the runtime holds the ground truth, the runtime can narrate a fiction with nothing to check it against. VAS-869 wasn’t Sally being careless. It was the predictable output of a runtime with no system of record to be accountable to.
This is also where the experience of building inside larger organisations starts to matter. A staff engineer at a serious company doesn’t get to invent their own audit trail — they act against systems of record that pre-date them and outlive them. The version of “move fast” that survives the move from startup to scale-up is not the one where you build less; it’s the one where you build less narrowly. The seams have to be in the right places from the start. Hermes is the runtime that draws those seams in the right places.
Hermes Agent (from NousResearch) is the replacement. The reason it’s the right one is the same metaphor that has organised the rest of the architecture: a human chief of staff doesn’t also be the filing cabinet, the accountant, and the calendar. They delegate to systems of record, and act against them. Hermes is built that way. It is the loop and the gateway, and deliberately almost nothing else. Everything structural is delegated to substrate it does not own:
- Model calls route through LiteLLM — a gateway that gives one consolidated cost ledger and lets a caller ask for a capability (“fast and cheap”, “careful and deep”) instead of hard-coding a model. Callers stop naming models and start naming intent.
- Memory is held by Honcho — a memory layer that lives outside the session, so a session rotation can’t carry thirty days of context off with it.
- Side effects — every write to Linear, to GitHub, to the file system — run through n8n, a deterministic workflow engine.
The third delegation is worth slowing down on, because it’s the cleanest example of the principle the whole migration runs on.
The earlier fix for agents writing badly to external systems was a prompt-level guard — an instruction in a config file telling the agent not to do the thing. It was tested. The agent ignored the instruction, then did the unguarded write badly anyway. The conclusion, written almost verbatim into the decision record: the real fix is architectural enforcement, not behavioural instruction. You don’t tell the agent to be careful with the write tool. You take the write tool away. Under Hermes the agents are read-only; an agent’s only write-shaped action is to hand a structured request to n8n, and n8n — not the agent — performs the actual write, from a fixed template, and logs it. An agent can’t fabricate a write it has no capability to perform.
The structural enforcement isn’t a clever pattern. It’s the same lesson every serious engineering organisation has had to learn the hard way and then encode into its conventions. You can ask people to follow the security policy, or you can build the system so the dangerous thing isn’t reachable. The mature version of both has always been the second one. Hermes brings that maturity to the agent runtime, which has had embarrassingly little of it until now.
Where it is now: Hermes is a set of decisions, not yet a running system. The four architecture decision records that define the new stack — Hermes, LiteLLM, Honcho, n8n — are written and proposed, awaiting my sign-off. The LiteLLM gateway is installed and live. Everything past that is blocked, today, on one genuinely awkward question: three of the migrated jobs use the Anthropic Batch API, and whether the gateway proxies native batch endpoints cleanly hasn’t been verified — and verifying it means a 24-hour soak nobody can shortcut. So the migration is real, underway, and stuck on its first hard step. That’s the accurate picture, and I’d rather give you that than a tidier one.
Three streams, one system
This is the part that took me longest to see, and it’s the reason this is one article and not three.
Start with the plain layering. Vasko OS is the whole system. FortyTwo is one role inside it — the chief of staff. Hermes is the runtime the system’s always-on agents stand on. Drawn as a stack it’s almost dull. A vision on top, a role inside it, a runtime underneath.
The layering isn’t the interesting part. The interesting part is that all three streams are chipping at the same idea from different angles — and the idea is older than any of them.
It’s the team metaphor. If a process wouldn’t work for a small human team, it won’t work for a small agent fleet either. Watch it do three different jobs.
- In Vasko OS it decides the org chart. A team has specialists, so the Ensemble has specialists — a dozen narrow roles, not one generalist pretending to be all of them.
- In FortyTwo it decides the hiring. Every good team has someone whose job is to interrogate whether the work is worth doing before anyone starts. That role is a chief of staff, so the system has one.
- In Hermes it decides the infrastructure. A staffer doesn’t be their own filing cabinet — they work against systems of record. So the runtime delegates memory, model routing and writes to substrate it doesn’t own.
Same sentence, three scales. That isn’t a pattern I noticed afterwards — it’s the actual design method. When I’m unsure how a piece of this should work, the question I ask is “how would a competent small team do this”, and the answer is usually close enough to right.
This is where the difference between a startup and a scale-up shows up structurally. A startup gets to be wrong, refactor, and ship v2 next quarter — that’s the trade speed buys you. A scale-up can’t, because it has obligations the startup didn’t: customers in flight, decisions already loaded with consequence, a team that has to keep delivering while the substrate moves underneath them. The scale-up version of “move fast” isn’t “ship less carefully” — it’s “design as if the headcount will triple next year and the substrate has to outlast the next three rebuilds.” Every component of this stack is built with that posture. Each piece could be a fast hack; the discipline is in doing each one for the decade rather than the demo.
It’s also where two decades of doing this for other people start to matter. Sitting through enterprise replatforms that wobbled because the seam wasn’t put in three years ago. Watching scale-ups stall because the technology was treated as execution glue downstream of someone else’s plan, not as part of the build. Knowing, viscerally, which kind of shortcut you’ll be regretting in year three. That experience doesn’t show up in any single diagram in this article. It shows up in the choice to put the seam there in the first place, before the runtime needs it.
There’s a second through-line, and it’s the one VAS-869 dragged into the light.
Being more careful is the thing that already failed.
Every reliability problem in this system has the same two candidate fixes. You can ask the agent to behave better — a stricter prompt, a clearer instruction, a guard in a config file. Or you can change the structure so the failure is no longer possible — take the write tool away, hold the memory outside the session, version the config so drift has nowhere to hide. The first kind of fix feels faster. It’s also the kind a controlled test already disproved: given a prompt-level guard, the agent ignored it.
Hermes is structural enforcement applied to a runtime. n8n is structural enforcement applied to writes. FortyTwo is structural enforcement applied to ideas — you don’t ask yourself to be more rigorous about scoping, you route every idea through an adversarial pipeline that won’t converge until the rigour is actually there. Vasko OS states the whole thing as law in its single design rule.
And here’s the part I find genuinely reassuring, because it means the principle is load-bearing and not just a slogan. While I was building the Hermes migration, the migration kept catching itself.
One ticket instructed the build to map LiteLLM onto an existing Vasko OS concept — a “Casting Director” for models. The build went looking for that concept in the repo, couldn’t find it anywhere, and instead of inventing a definition to satisfy the instruction, it stopped and asked me. The concept genuinely didn’t exist yet — so the decision record now introduces it honestly, as new, rather than pretending it was always there. Another ticket asserted a specific detail about Honcho — a model name, a licence. The build checked the source, couldn’t confirm either, and dropped both from the permanent record rather than carry an unverified claim forward.
That is VAS-869 run in reverse. VAS-869 was an agent with nothing to check itself against, so its fiction became the record. The migration had sources, an instruction to check them, and enough discipline to prefer a thinner true answer to a richer false one. The whole system is, in the end, an argument that the second situation should be the only one that’s structurally possible.
Where this is heading
For Hermes, “done enough” is unglamorous and well-defined. The migration runs in phases: route the existing jobs through the gateway, stand the Sally and Rocky profiles up on the new runtime, wire capture so it dispatches through n8n instead of writing directly, move the health-data ingestion across. Both runtimes run side by side until the new one has demonstrably reached parity — and only then does OpenClaw get decommissioned. No big-bang cutover. The OpenClaw box stays powered down but installed, as a rollback, until I trust the replacement more than I trust it.
For Vasko OS, “done enough” is the North Star turning from a vision document into a living spec — each agent specified, each substrate component designed, the escalation rules written down precisely enough to build against. That’s the queued phase, loosely gated on Hermes.
For FortyTwo, “done enough” is the quietest of the three and the one I’m least sure of. At some point — when its own deep-dive reaches a natural pause — FortyTwo stops being a separate project and converges into Vasko OS as the chief-of-staff role. The working split is that FortyTwo keeps owning specification and Vasko OS owns execution and validation. The reason it’s separate now isn’t architectural confusion. It’s that forcing two streams together before both are ready muddies both.
“Done enough” for the personal stack isn’t the real end goal, though, and it’s worth being explicit about what is.
The end goal is the pattern, not the personal productivity. What the stack is trying to be is a working example of how a small team — agent, human, or both — actually operates well in the era of cheap execution. Specialists, not generalists. Adversarial spec engines, not optimistic ones. Runtime/substrate seams in the right places from the start. Writes that go through systems of record. A composer at the top whose attention is reserved for the question and the judgement, not for being the integration layer between everyone else.
That pattern is what a scale-up actually needs. Startups can be one strong opinion in a room. Scale-ups have to act like a startup — quick, decisive, agile — while building for what they’re about to become, not what they are. They can’t afford the rebuilds. The decisions have to be right and fast. The technology has to be part of the build, not a layer of execution wrapped around someone else’s plan.
When this stack works the way it’s meant to, it stops being mine. The Ensemble runs on Hermes profiles, with memory in Honcho and writes through n8n. FortyTwo feeds it specs with locked acceptance criteria. Vasko OS holds whoever’s at the helm — me first, then anyone else who shows up — accountable to the milestones. Adding a collaborator becomes a matter of granting them a surface on the same substrate, not rebuilding the system for two. That’s the version of done I’m building toward.
The honest read
The temptation with a piece like this is to make three half-built things sound like one finished one. So, plainly.
Hermes is four proposed decision records and one installed gateway. The runtime isn’t running yet. The migration is blocked on its first real step, and clearing that block is mine to do. The substrate-to-runtime contract — the actual interface by which Hermes loads memory and writes its audit records — is still marked “to be determined” in its own decision record. Honcho means a slice of my personal life-stream data will live on someone else’s managed cloud; I’ve accepted that consciously, but it’s a trade-off, not a free lunch. And I’ve proposed swapping one runtime for one runtime plus three substrate services — which is three new things that can be down at 2 a.m.
FortyTwo has never run a real idea end to end. Vasko OS is a dozen agents that, for now, mostly exist as paragraphs.
What I’m least sure of isn’t any single stream — it’s the convergence. It’s easy to draw the layered diagram. It’s harder to know whether FortyTwo’s idea of a finished “spec” and Vasko OS’s idea of an “executable milestone” will meet cleanly at the seam, or whether I’ll discover, when I get there, that they were subtly different shapes the whole time. I won’t know until I run an idea through the full chain, and I can’t do that until Hermes is real.
The thing I’m not unsure about is the direction. The bottleneck has moved upstream. Generation has been commoditised. The discipline of getting the question right, and then designing seams that hold under load, is what compounds. Three projects, pulled at honestly, turned out to share one metaphor, one design rule, and one failure mode they’re all built to make impossible. That isn’t three side projects competing for my evenings. It’s one system with three construction sites, built with the bias of someone who has spent enough years in scale-ups to know which decisions you regret in year three.
Next month’s note will most likely be about whichever of the three breaks first. That’s usually how it goes.