AI Can Build the Tower. It Cannot Yet Pour the Foundation.

AI-generated software needs deeper foundations, not prettier demos

Jun 08, 2026

In 2009, a shimmering obelisk of blue glass opened in the heart of San Francisco. It was called the Millennium Tower. It was fifty-eight stories of absolute perfection. If you walked into the penthouse suites, you would find Italian marble countertops, sweeping views of the bay, and flawlessly designed layouts. The architects had created a masterpiece of modern luxury. It was the most prestigious address in the city.

But by 2016, the building had a secret. It was sinking. It had already tilted fourteen inches to the northwest.

The problem was not the marble. The problem was not the glass. The problem was the ground. The builders had anchored the foundation in compacted sand, stopping just short of the deep, stable bedrock. They built a breathtaking interface, but they underestimated the unyielding physics of the earth beneath it.

This stark mismatch between imagination and physics is not just an architectural cautionary tale. It mirrors exactly what is happening in software development today. Imagine hiring a visionary architect who can instantly sketch a breathtaking, 100 story skyscraper with flawless facades and perfect flow. The design dazzles. Yet when the architect is asked to lay a concrete foundation, the dream crumbles. They have no grasp of soil density. They do not understand load bearing calculations. They do not know how to keep a structure from cracking under thousands of tons of steel. Today, the allure of dazzling user interfaces is masking a catastrophic engineering gap.

Vibe Coding Made Us Complacent

AI has made it possible to describe software instead of writing it. This is the promise behind what Andrej Karpathy called “vibe coding”: a mode of development where the user talks to the model, describes the desired behavior, adjusts the result, and increasingly forgets that code exists at all. For prototypes, landing pages, dashboards, internal tools, and lightweight experiments, this is extraordinary. It collapses the distance between idea and artifact. A founder without a technical cofounder can build a product sketch. A marketer can build a workflow. A domain expert can turn operational knowledge into software. That is real progress, and it should not be dismissed merely because it makes traditional engineers uncomfortable.

But it has also created a dangerous illusion: because AI can generate the shape of an application, we assume it can generate the substance of a production system. It usually cannot. The problem is not that the code is ugly, although it often is. The problem is that the code lacks operational discipline. It does not reliably handle persistence, concurrency, recovery, isolation, security, or scale. It often produces software that looks correct under demo conditions but fails under real-world conditions. And real-world conditions are not edge cases. They are the operating environment.

This is the distinction the market is only beginning to understand. AI is remarkably good at producing the visible surface area of software: screens, forms, flows, prompts, dashboards, and increasingly complex business logic. But production software is not primarily a collection of visible surfaces. It is a system of promises about what will happen when things go wrong. It promises that money will not disappear, that records will not contradict each other, that permissions will be enforced, that workflows can resume after failure, and that the state of the business will remain coherent even when the machinery beneath it is under stress.

The ACID Gap

The most important failures in software are often invisible until they become expensive. Consider a scheduling app. Two users try to reserve the same session at nearly the same moment. The AI-generated code checks availability, sees the slot is open, and writes both bookings. The interface looked fine. The logic seemed obvious. The test cases may even have passed. But without proper transaction isolation, the system has created a hidden contradiction. The bug is not visible at the moment of creation. It emerges later, when two customers arrive for the same slot and the business has to apologize for a failure that should never have been possible.

Or consider a payment flow. Stripe successfully charges a customer, but the application fails before recording the order. Now the customer has paid, the business has no order record, and support has to reconstruct the truth from logs, receipts, and angry emails. In a prototype, this seems like an implementation detail. In a real business, it is a trust-destroying event. The software has not merely crashed. It has created two conflicting versions of reality: one in the payment processor and another in the application database.

The same problem appears in long-running workflows. A user begins a multi-step onboarding process, completes three stages, uploads documents, triggers notifications, and then the server restarts. Without a durable workflow model, the process does not resume precisely where it left off. It may restart from the beginning, skip a step, duplicate an email, or simply vanish. These failures are not exotic engineering concerns reserved for hyperscale infrastructure teams. They are the ordinary physics of distributed systems. Any application that handles money, identity, workflow, coordination, inventory, scheduling, or customer records eventually encounters them.

Databases have a term for the guarantees required to survive this world: ACID. Atomicity. Consistency. Isolation. Durability. These properties are not decorative. They are the load-bearing beams of business software. Atomicity ensures that a transaction happens completely or not at all. Consistency ensures the system moves from one valid state to another. Isolation ensures that concurrent actions do not corrupt one another. Durability ensures that once the system says something happened, it will not forget. Large language models do not inherently provide these guarantees. They can produce code that mentions transactions. They can generate plausible retry logic. They can create something that appears to handle failure. But they do not, by themselves, enforce correctness. They predict code. They do not guarantee behavior.

That is the ACID Gap. AI is increasingly good at generating intent. It remains weak at enforcing invariants. This distinction matters because the next era of software will be defined by a growing separation between the person who understands the desired business behavior and the system that must ensure the behavior remains correct under stress. In the old world, professional engineers mediated that gap. In the new world, agents will increasingly mediate it. But unless the substrate changes, the gap does not disappear. It merely becomes harder to see.

The Illusion of the Perfect Prompt

The obvious response is to ask for better code. Write a better prompt. Add stricter instructions. Use a linter. Run a code review agent. Generate tests. Ask the model to be more careful. All of that helps, and none of it should be dismissed. Better prompting can improve structure. Better tests can catch obvious defects. Better code review can identify suspicious patterns. But none of these solves the underlying problem, because the failure mode is not merely poor syntax or missing boilerplate. The failure mode is that probabilistic code generation is being asked to produce deterministic runtime guarantees.

That is the wrong layer of abstraction. A prompt can request safety. A runtime can enforce it. A prompt can ask the model to remember idempotency keys, transaction boundaries, retry semantics, workflow durability, API contracts, authorization checks, and state recovery. A runtime can make many of those properties default, visible, testable, and mechanically enforced. The difference is not cosmetic. It is the difference between hoping every generated building design includes the right foundation calculations and requiring every building to be constructed on a foundation system that already knows how to carry load.

In production software, correctness cannot depend on whether the model happened to remember every invisible engineering requirement every time it generated code. That approach may work for demos and toy applications, but it becomes intolerable when software touches customer money, regulated data, enterprise workflows, or operational systems. The more powerful the coding agent becomes, the more dangerous this gap becomes, because the agent can now generate complexity faster than humans can inspect it. The bottleneck moves from writing software to trusting software.

This is why “just make the model better” is an incomplete answer. Better models will absolutely produce better code. They will catch more mistakes, infer more intent, and learn more of the implicit craft knowledge that senior engineers carry in their heads. But even a perfect model remains probabilistic. It can still omit a crucial guardrail. It can still create an attractive abstraction that leaks under pressure. It can still generate code that looks locally reasonable while failing globally. In distributed systems, guessing is not a foundation. Guessing is sand.

Add this section after “The Illusion of the Perfect Prompt” and before “The Deterministic Substrate.” It bridges the argument from “better prompts won’t solve this” to “we need runtime guarantees.”

The False Comfort of Fast Fixes

Mitchell Hashimoto recently described the anxiety many experienced infrastructure engineers feel when they look at the current AI coding wave.

His concern was not that AI is useless, or that agents will fail to write meaningful software. His concern was more specific and more damning: entire companies seem to be convincing themselves that resilience no longer matters because repair is becoming cheap. In his framing, the industry is replaying an old argument from the cloud infrastructure era, when teams debated the relative importance of mean time between failure and mean time to recovery. The mature lesson from that period was not that failure prevention no longer mattered. It was that resilient systems require both: you reduce unnecessary failure where possible, and you design recovery paths for the failures that inevitably remain.

The dangerous new belief is that AI collapses this distinction. If agents can detect bugs, write patches, generate tests, and deploy fixes faster than humans, then perhaps software does not need to be quite so carefully designed in the first place. Perhaps it is acceptable to ship more broken systems because the system will repair itself quickly. This is the “MTTR is all you need” fantasy applied not merely to infrastructure operations, but to the entire software development lifecycle. It turns speed of recovery into an excuse for architectural negligence. Bugs become tolerable because agents will fix them. Broken workflows become acceptable because agents will regenerate them. Missing invariants become someone else’s problem because the machine can always patch the code later.

But infrastructure already taught us the flaw in that logic. You can automate yourself into a catastrophe machine. A system can look healthier by local metrics while becoming more dangerous globally. Bug reports can decline while latent risk increases. Test coverage can rise while semantic understanding falls. Deployment frequency can improve while architecture decays. The organization sees reassuring dashboards: more tests, faster fixes, fewer reported issues, shorter recovery times. Yet beneath those metrics, the actual system becomes harder to reason about. The code changes too quickly. The causal chains become too long. The generated fixes accumulate. The architecture loses its shape, not through one dramatic failure, but through a thousand locally reasonable patches.

This is the risk of agent-written software. If the model is treated as both author and repair crew, organizations may lose the discipline of designing systems that should not be allowed to enter invalid states in the first place. The agent fixes the symptom, but the substrate still permits the pathology. It patches the double-booking bug after the customer support ticket arrives, but it does not necessarily introduce a durable model of serialized scheduling. It retries the failed payment workflow, but it does not necessarily establish a coherent transactional boundary between the payment processor and the order database. It adds tests around the observed failure, but it does not necessarily discover the class of failures that has not yet surfaced.

Hashimoto’s warning matters because it reframes the AI coding debate in operational terms. The issue is not whether agents will get better at writing code. They will. The issue is whether organizations will mistake faster code generation and faster bug repair for systemic reliability. In the same way cloud automation made it possible to destroy infrastructure at unprecedented speed, AI coding agents make it possible to mutate software systems faster than human understanding can keep up. The danger is not merely bad code. The danger is incomprehensible code that continuously appears to be improving.

That is why the answer cannot be “better agents” alone. A better agent may reduce the frequency of obvious mistakes, but it does not eliminate the need for architecture that constrains what mistakes are possible. In high-stakes systems, the goal is not simply to recover after corruption. The goal is to make certain forms of corruption impossible. You do not want an agent to notice later that two users booked the same scarce resource. You want the runtime to prevent that state from ever existing. You do not want an agent to reconstruct a failed payment flow from logs. You want the system to preserve a durable, auditable workflow from the beginning.

This is the deeper lesson from the MTTR debate. Recovery is essential, but recovery is not a substitute for invariants. Repair is essential, but repair is not a substitute for correctness. AI may dramatically improve our ability to respond to failure, but it does not absolve us from the responsibility to design systems that fail safely. In fact, the more powerful the agents become, the more important those constraints become. Without them, software development becomes a high-speed loop of generation, breakage, patching, and drift. With them, agents can operate inside a system that makes the dangerous states unreachable.

The future of AI-written software therefore depends on resisting the MTTR trap. We should welcome agents that repair, refactor, test, and extend systems. But we should not confuse repair velocity with engineering discipline. The right question is not “How quickly can the agent fix the bug?” The right question is “Why was the system allowed to express that bug as a valid state at all?” That is the question that leads away from prompt engineering and toward substrate design.

The Deterministic Substrate

The next major shift in software will not be better AI-generated code alone. It will be better substrates for AI-generated code to run on. A deterministic substrate is the foundation layer beneath agent-written software. It provides the guarantees that probabilistic generation cannot be trusted to recreate from scratch every time. It makes concurrent operations safe by default. It gives workflows durable state. It turns retries into a first-class primitive. It enforces clean boundaries between services. It gives agents a legible model of application state. It ensures that the system can recover from failure without inventing a new reality.

In other words, it pours the concrete. The substrate does not replace the agent. It makes the agent useful in production. It lets the agent express business intent without forcing it to rediscover the hard-won lessons of distributed systems engineering on every generation. It changes the job of the platform from hosting code to enforcing invariants. That is a profound shift. In the pre-agent world, frameworks mostly helped human developers move faster. In the agentic world, frameworks must also help non-human developers avoid creating unsafe systems at machine speed.

This matters because the builder population is about to expand dramatically. The next wave of software will not be written only by professional engineers. It will be generated by agents acting on behalf of founders, operators, lawyers, teachers, doctors, analysts, and domain experts who know exactly what they want the software to do but do not know how to make it safe under production conditions. That is not a flaw in the movement. It is the point of the movement. The promise of AI-assisted creation is that expertise in a domain can become executable without requiring every domain expert to become a distributed systems engineer.

But democratization without reliability creates a new class of fragile systems. It means more apps, more automations, more integrations, more business logic, and more invisible failure. The answer is not to slow down creation. The answer is to move more correctness into the substrate. The platform has to absorb the operational burden that the new builder population cannot reasonably be expected to carry. If agents are going to write software, the runtime has to become much more opinionated about what safe software is allowed to do.

The Disposable App

Once that happens, the definition of software begins to change. Today, we think of applications as durable objects. They are designed, built, launched, maintained, and distributed. They live in app stores, SaaS dashboards, enterprise procurement systems, and long-term roadmaps. But in an agentic world, many applications may become temporary. They may be generated for a particular purpose, a particular user, a particular workflow, or even a particular moment. The economic logic of software shifts when the cost of creation falls toward zero and the runtime can provide safety by default.

You need a tool to reconcile a family reunion budget across three currencies. Your AI builds it. You use it for a week. Then it disappears. A sales team needs a custom workflow for one strategic account. The agent creates it, connects it to the right systems, enforces the right permissions, and retires it when the deal closes. A legal team needs a one-off diligence tracker for a specific transaction. It is generated, used, audited, and discarded. In each case, the application does not need to become a company, a category, or a permanent product. It simply needs to do a job safely.

This future only works if disposable does not mean dangerous. Generated software must still respect identity, permissions, data integrity, recovery, auditability, and transactional correctness. The app may be temporary. The guarantees cannot be. In fact, the more disposable software becomes, the more important the substrate becomes, because there will be less time for human review, less tolerance for manual hardening, and less institutional knowledge around each generated artifact. The foundation must be reliable precisely because the structures above it will appear and disappear so quickly.

That is the deeper implication of agent-written software. The opportunity is not merely that we will build the same applications faster. It is that we will build new kinds of applications: narrower, more contextual, more personal, more workflow-specific, and more ephemeral. But the only way to unlock that world safely is to separate intent generation from correctness enforcement. The agent can create the shape. The substrate must enforce the physics.

First Pour the Concrete

The Millennium Tower did not fail because it lacked imagination. It failed because imagination was allowed to outrun foundation. That is the risk now facing software. We are entering an era in which AI can generate endless towers of glass: interfaces, workflows, dashboards, forms, automations, and agents that look complete the moment they appear. The surface area of software creation is becoming magical. But software does not stand because it looks good in a demo. It stands because the underlying system respects the hard physics of computation: state, failure, concurrency, recovery, and trust.

The future of agent-written software will not belong merely to the best prompt, the prettiest interface, or the fastest demo. It will belong to the systems that make generated software safe by construction. As software creation moves from engineers to agents acting on behalf of everyone, the winning platforms will be the ones that treat correctness not as an optional discipline but as an architectural primitive. They will not ask every builder to understand the physics. They will make the physics unavoidable.

Before we build higher, we have to build deeper.

Beeps & Breakthroughs

Discussion about this post

Ready for more?