The 2026 Anxiety of AI Investors: When Models Devour Everything, What Remains of Startup Moats?

6 hours ago
2026 Investor Edition of AI Panic: Just give all the money to Anthropic and Nvidia, then go home and sleep.

Author: Sarah Guo

Translated by: Deep Tide TechFlow

Editor's note (Deep Tide): As large models begin to outscore humans on every leaderboard, investors are sliding into a kind of despair: aside from Anthropic and Nvidia, what else is worth investing in? In this piece, a top Silicon Valley investor uses data and examples to argue that the real moat is not on the leaderboards. It hides in places that benchmarks cannot measure.

By mid-2026, AI investors are in a state of despair: nothing is worth investing in anymore, so we should just put all our money into Anthropic and Nvidia and go home.

I have never felt this way. I am already convinced that the models are several versions smarter than me, and I am happy to buy into Anthropic and Nvidia at market price. All my smartest friends are quite certain that self-improvement will succeed soon—but I still do not feel this despair.

This despair is not foolish. The logic goes like this: if models keep getting better at everything, then every company built upon them is just a thin layer, waiting to be absorbed. The only surviving value is computing power and frontier weights.

Take software, the case the despair camp leans on most. When Devin launched in 2024 it solved only about 13% of tasks on a standard software benchmark, and it was largely ignored. A year and a half later, the best agents score above 80% and are doing real work at Goldman Sachs and inside the U.S. Army. Almost everyone drew the same mistaken lesson: models have eaten software engineering. But as models devour the most easily measured parts of software engineering, we are rediscovering what many teams have long known: engineering has always resisted measurement, and the most easily measured parts may not be the only important ones.

Mert Demirer of MIT and his collaborators have finally put numbers on this: across more than 100,000 developers, the latest coding agents increased the amount of code written by about 180%, while the amount of code actually shipped rose by about 30%. Writing code has become cheaper. The rest still has to be handled by humans, and it matters a great deal. The net impact, of course, is still astonishing.
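The gap between those two figures can be made concrete with a little arithmetic. The sketch below uses only the two percentages quoted above and assumes, purely for illustration, that both are measured against the same pre-agent baseline. The function name is hypothetical.

```python
def shipped_share_vs_baseline(written_increase_pct: float,
                              shipped_increase_pct: float) -> float:
    """Ratio of (shipped / written) after agents to the same ratio before.

    Assumption: both increases are relative to the same pre-agent baseline.
    """
    written_multiplier = 1 + written_increase_pct / 100  # +180% -> 2.8x
    shipped_multiplier = 1 + shipped_increase_pct / 100  # +30%  -> 1.3x
    return shipped_multiplier / written_multiplier


ratio = shipped_share_vs_baseline(180, 30)
print(f"Share of written code that ships, relative to baseline: {ratio:.0%}")
# prints "Share of written code that ships, relative to baseline: 46%"
```

Under that assumption, each line written is a bit less than half as likely to ship as before, which is one way to read the claim that writing got cheaper while the rest of the work did not.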

Benchmarks cover what you can measure, and what you can measure is what you can train against. That is why coding agents matured first: compilers are free verifiers, test suites are free verifiers, and when answers check themselves for free, you can keep refining against the checks until you beat them. But tests have never told you whether a change is right for a ten-year-old codebase with three undocumented modules and a deployment pipeline held together by a cron job that no one wants to admit to having written.

That kind of correctness cannot be read off a leaderboard; in fact, it cannot be read off anything. You learn whether a system that complex works by running it long enough in the real world, and smarter models do not make the world run faster. No one runs unit tests on something the size of Google and then believes the green check; you believe it because it has withstood years of real load. That kind of correctness is not only private, it is also slow, a moat that capital cannot compress. Even the optimists acknowledge that the clock cannot be skipped: Noam Brown, a pioneer of OpenAI's reasoning models, recently wrote that the only reliable way to assess an agent over a one-year horizon might be... to run it for a year.

As Gabe Pereyra said, true automation is not just about models getting better. It is about products, models, workflows, and companies moving together, and three out of those four move at the speed of an organization.

The parts that move at organizational speed are the ones benchmarks cannot touch: changing how a skeptical partner handles matters, holding a team together through a rebuild. That is why, when we hire CEOs, the ability to handle people counts at least as much as analytical ability, and smarter models do not change that weighting. Feedback is murky, the time spans run to years, and trust belongs to a person. Every company I know has all of its engineers using cutting-edge coding models, but none of them has changed its engineering organization at anything close to that speed. Adoption took a quarter, and what a miraculous quarter of token growth it was! The rebuilding is taking years.

What is visible is what is leaving. Valuable work is structurally invisible: anything you can put on a leaderboard, you can train against, so anything measurable is already on the path to commoditization. This process takes time and will never be complete, but the direction never reverses. In the words of my friend Matt MacInnis at Rippling: tokens spent answering general questions are worth almost nothing, because any model can answer them, whereas tokens spent reasoning over your company’s data are worth much more, because they do what you really want, not just what seems reasonable.

Visible work is consumed from two directions. From below comes task saturation: once a job can be checked cheaply, buyers stop asking which model did it and start asking how much it cost, and the work falls to the cheapest open-source or distilled model of the week. Wherever those cheaper models can do the job, margins ultimately decide. From above, labs are trying to get models to consume their own scaffolding. Retrieval, routing between cheap and expensive calls, tool use, and even reasoning strategies, all the mechanisms that once wrapped models, are being pulled into the weights until the wrapper is just the model. This is how the frontier absorbs what sits around it. Margin pressure also cuts in the opposite direction: a general agent must be prepared for anything, which is expensive, whereas a focused application can tune a workflow until it runs on a small fraction of the token spend, and unlike the labs selling those tokens, it keeps the margin.

So we can ask two questions about any type of work. Is its correctness private and expensive to establish, the kind of truth that exists only inside someone's data? Is it isolated, locked inside a system you cannot access? Set these against how saturated the task is, and you get a 2x2 matrix. Saturated work with public answers is commodity tokens, which open-source models own. Public frontier work, where the coding benchmarks live, is where labs win, because when evaluation is free, owning it means little. The prize is in the last corner, the non-trainable one: frontier work whose correctness exists only in a private domain. You can see it in the inference clouds that host AI-native pioneers, where the vast majority of tokens are generated by custom models, not general open-source models.
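The 2x2 can be written down as a small lookup. This is only a sketch of one reading of the paragraph above: it collapses the two questions (private correctness, isolated systems) into a single boolean, treats the lab corner as public frontier work, and marks the fourth corner as unspecified because the text does not characterize it. All names are hypothetical.

```python
from enum import Enum


class Quadrant(Enum):
    COMMODITY = "saturated + public: commodity tokens, owned by open-source models"
    LABS_WIN = "frontier + public: benchmark-driven work where labs win"
    NON_TRAINABLE = "frontier + private: the non-trainable corner, the prize"
    UNSPECIFIED = "saturated + private: not characterized in the text"


def classify(correctness_is_private: bool, task_is_saturated: bool) -> Quadrant:
    """Map a type of work onto the 2x2 described in the text."""
    if correctness_is_private:
        return Quadrant.UNSPECIFIED if task_is_saturated else Quadrant.NON_TRAINABLE
    return Quadrant.COMMODITY if task_is_saturated else Quadrant.LABS_WIN


# Illustrative inputs only; where a real task falls is a judgment call.
print(classify(correctness_is_private=False, task_is_saturated=False).name)  # LABS_WIN
print(classify(correctness_is_private=True, task_is_saturated=False).name)   # NON_TRAINABLE
```

The point of the exercise is that only one of the four cells depends on access rather than on model quality.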

The walls around that last corner vary in height. A single developer's toy codebase is portable and standardized, so the climb is short. A bank's production systems are neither; you will not gain root access just because you score 2% higher on SWE-Bench Verified.

Capabilities have eaten many things, but better models do not turn private ground truth into public ground truth. They do not hold licenses, do not sign for liabilities, and do not own a company's documents. When an answer is wrong, they cannot be the party that gets sued. Intelligence is not the bottleneck here. Permission is, and so is responsibility. You can imagine a model much smarter than anyone, but it still has to be let in, and someone still has to sign for what it does.

That door has a lock and a bolt. The lock is the environment: you can only verify that an AI has done something useful after it has been trusted inside the system, after the security reviews, the integration, and the contract under which someone signs for your results. The bolt is the user. Most doctors in the U.S. now open OpenEvidence daily, and no amount of computing power can buy that. Labs could train a perfect medical model tomorrow and still fail to penetrate doctors' habits, or the decision-making processes of UC San Francisco, because trust is built slowly, through relationships and with the user's consent. Gradient descent cannot substitute for it.

This, too, is work. An application earns its place in the non-trainable corner by doing inconspicuous work: arranging the company's private reality so the model can act on it, giving the model tools to act with, and working alongside clients to change the reality their employees live in. A company that does this translation is hard to replicate, and the translation never ends. Integration and maintenance last as long as the engagements and relationships involved, and they are won by teams that place domain-specialized engineers and tools next to the client.

For example, at a top white-shoe law firm, the M&A practice alone handles nearly a thousand transactions a year. For confidentiality and many other reasons, you cannot let hundreds of associates each download client files to their desktops and ask a general agent to sift through them. Even if you could, what you learned would be fragmented, one associate's corrections at a time, with no view of how the whole transaction flows. The important signals live at the transaction level, and transactions have a shape: for M&A, it is confidentiality agreements, term sheets, due diligence, purchase agreements, ancillary documents, and closing checklists; for IP litigation, it is motions, evidence disclosures, prior art, and more motions. Each practice area has its own shape, and neither lawyers nor tools are interchangeable across them. The real problem a law firm solves sits a level above all this: running every practice area in parallel, the way top partners run hundreds of matters at once while bringing in new matters and training associates. Transforming a firm like that is not a single task you can write an evaluation for. It requires an operator working with data analytics, toward extremely vague goals, with incomplete feedback and long timelines, all in a changing environment.

Unfortunately, invisible value is also hard to sell, for the same reason it is hard to commoditize: a company cannot judge from the outside whether AI will transform its operations, any more than a benchmark can. So the strongest companies stop trying to prove it from the outside and instead go in and price the outcome. Sierra charges clients when its agent solves their problem and does not charge when it hands the problem to a human, so pricing becomes an evaluation, one that only works if Sierra has a definition of "solved." Cognition's Devin takes a similar approach in software with "performance guarantees," which can only deliver results inside systems you have been trusted to access.
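Outcome-based pricing of this kind can be sketched in a few lines. The example below is an illustration only, not Sierra's actual billing logic: the data model, the field names, and the price are all hypothetical, and the hard part (deciding what counts as resolved) is reduced to a boolean that someone inside the domain has to define.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    conversation_id: str
    resolved_by_agent: bool  # True if the agent solved it; False if handed to a human


def invoice_total(conversations: list[Conversation],
                  price_per_resolution: float) -> float:
    """Bill only for conversations the agent resolved; escalations cost nothing."""
    resolved = sum(1 for c in conversations if c.resolved_by_agent)
    return resolved * price_per_resolution


# Hypothetical month: two resolutions and one escalation at 1.50 per resolution.
month = [
    Conversation("c1", resolved_by_agent=True),
    Conversation("c2", resolved_by_agent=False),
    Conversation("c3", resolved_by_agent=True),
]
print(invoice_total(month, price_per_resolution=1.50))  # prints 3.0
```

The invoice is only as meaningful as the `resolved_by_agent` flag, which is why the definition of "solved" carries the weight.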

Even token serving, which everyone loves to call a pure commodity, does not behave like one. The best AI-native companies concentrate their serving on one or two providers (Baseten or Fireworks), because the per-token price commoditizes as expected while reliability under real traffic and guaranteed access to scarce compute do not. Where you serve is a different choice from which models you use. Price is the only part of inference that behaves like a commodity.

One frequently raised objection is that the lab is your supplier, so why would it not run its first-party products at a loss to squeeze you, or revoke your API access and take the market itself? This is the true version of the despair, and it only holds if the model layer is a single-player game. Clearly it is not. It looks more like a death race among three and a half players, with a group of international players six months behind in training, all trying to scale alliances to five times last year's size. Customers want competition between suppliers, and labs want market share more than they want any application to die.

You can see this in the markets where labs compete directly. In consumer chat, the best model has never simply won. ChatGPT has held its lead through years of real competition; the share it is losing now is flowing to Gemini, which draws on the power of Android and search, not just a better model. Anthropic, which prediction markets (and internet sentiment) currently rate as having the best model, is barely a factor in consumer chat but has built its business in enterprise and coding. If a better model cannot steal a competitor's users in the most central application, it will not integrate its way into hospital records or bank liabilities. Today's public choices are not just based on coding. If the frontier remains crowded, its upper tier will be valuable.

If work cannot be rated from the outside, someone inside must decide what even counts as a good answer, and that decision is the whole game. Enough of these decisions, written down, become a benchmark. Harvey published one for law; Sierra published one for voice agents. By becoming the product already in use in a domain, you win the right to define what good means for that domain, and these companies won that right the hard way, through real adoption.

The evaluations that determine real money are private and vary by firm: what that firm, on that matter, will accept as good work. Any public benchmark is far from complete, because the depth of the law makes public testing pale in comparison. OpenEvidence is working out what a safe clinical answer looks like. These are not really measurements; they are judgments about what is true and what is good, written down until they become the standard against which everyone else is measured. Foundation-model labs, however smart, cannot write them, because that kind of authority exists only inside the domain. The authority tends to land where it already sits. Senior lawyers write the legal benchmarks. Defining a safe clinical answer falls to doctors. And "solved" means whatever a company that already has the customers says it means.

The absorption frontier keeps rising because we keep learning to measure more work, and what is measurable gets consumed. The non-trainable ground shrinks under any feet standing on it, so you cannot find a defensible spot and then rest. You keep moving toward whatever still cannot be scored; you keep re-securing your position. On a narrow task, with your private data and your own evaluations, you can train to the frontier and beat general models where it matters, and that specialized model becomes part of the moat. Competing on general models, by contrast, is a capital war; you will lose to whoever has the most compute, and this is the trap for companies with shallow access and visible tasks. The trap promises survival by out-training the frontier across general tasks on the very day the winners look most determined by data center scale; the usual outcome is not an independent champion but a sale to someone with abundant compute.

All of this is defense. The harder part is offense: choosing what to build first. This is what I spent a year looking for, and I may have found it three times. The model does not help here. It will do anything you point it at, but it cannot tell you what is worth pointing it at; you cannot benchmark that, so you cannot train for it. This is also why incumbents will not take everything: they hold the territory they have, and the next thing comes from those who discover new uses before the rest of us. Perhaps intent is an input scarcer than compute.

The despair is half right. Thin layers are indeed being absorbed, and much of what looks like a company today is thin packaging. It is wrong about what remains. The mechanisms are clear; the destination is not. I would bet on the direction: intelligence keeps getting cheaper, and value keeps sliding toward the few places the models cannot reach. The non-trainable corner holds historically contingent value. So enter one, do the unnoticed translation, and start writing down what good means there, because someone will. This year's most-cited benchmark score is a map of territory that is about to become worthless, and a notice that the say-so over what counts as good there is about to be lost.

