Original Title: The Untrainable
Original Author: Sarah Guo, Conviction
Original Translation: Peggy, BlockBeats
Editor's Note: As AI capabilities continue to leap, a new pessimistic judgment is emerging in investment circles: if models keep getting stronger, every application company will eventually be swallowed by the model and compute layers (Anthropic, OpenAI, Nvidia), leaving only cutting-edge models, computing power, and a few pieces of infrastructure in the market. But Sarah Guo believes this judgment is only half correct. Those "thin wrappers" (simple shell applications built on models) will indeed be absorbed; any task that can be benchmarked, trained on public data, and verified at low cost will gradually become commoditized.
The real question is: After AI devours everything trainable, what remains untrainable?
The answer in this article is the value that exists within real organizations and cannot be easily replicated from the outside: proprietary data, complex workflows, user trust, system permissions, industry judgments, compliance responsibilities, and experiences accumulated over long operations. Models can be smarter but cannot automatically enter a bank's production system; they can generate medical answers but cannot directly gain the trust of doctors and the decision-making processes of hospitals; they can write legal texts but cannot take on the responsibilities of senior lawyers or define what constitutes qualified legal work out of thin air.
Thus, the truly moat-holding AI companies in the future will not simply be smarter than general models, but will delve deeply into a specific industry to complete the tough but critical "translation" work: organizing the client's proprietary realities, tools, processes, and standards into a system actionable by models, and gradually writing down the definition of "what constitutes a good outcome" through long-term service. The stronger AI becomes, the more it will devalue measurable and replicable tasks; it will also highlight those "untrainable things" that are imbued with history, relationships, permissions, and professional judgments. This is the real value that may still be preserved after the models consume everything.
Here is the original text:
By mid-2026, investor versions of "AI madness" convey a sense of despair that nothing is worth investing in anymore: it seems we should just invest all our money in Anthropic and Nvidia and then go home to sleep. But I have never felt this way. For several minor versions now, I have been convinced that the models are smarter than I am; if I could buy Anthropic and Nvidia at market price, I would gladly do so; and my smartest friends are equally convinced that the models' self-improvement will soon be effective — yet I still do not feel this despair.
This despair is not foolish. Its logic is as follows: if models continue to become stronger in everything, then all companies built on models are merely thin shells waiting to be absorbed by models; the only value that can ultimately be preserved will be computing power and the weights of cutting-edge models.
Take software, the case this despair leans on most heavily. When Devin was released in 2024, it could solve just 13% of the tasks in a standard software benchmark, so the market largely undervalued it. A year and a half later, the strongest Agents scored above 80% and began handling real work inside Goldman Sachs and the U.S. Army. Almost everyone drew the same erroneous conclusion: the model has swallowed software engineering.
However, after the model devoured the most easily measurable parts of software engineering, we are also rediscovering something many teams have long known: engineering has always resisted measurement, and the most easily measurable parts are not necessarily the only important parts.
MIT's Mert Demirer and his collaborators finally quantified this: among over 100,000 developers, the latest generation coding Agent has increased code writing volume by about 180%, but the actual code delivered and deployed has only increased by about 30%. Writing code has become cheaper, but the remaining steps still need human involvement, and those steps are very important. Of course, the overall net impact is still astounding.
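The gap between those two figures can be made concrete with a little arithmetic. The sketch below assumes that "increased by about 180%" means output rose to roughly 2.8 times its earlier level (and likewise 1.3 times for deployed code); the variable names and the index of 100 are illustrative choices for this example, not part of the study.

```python
# Illustrative arithmetic only: index both quantities at 100 before Agents.
written_growth_pct = 180    # code written, per the figure cited above
delivered_growth_pct = 30   # code actually delivered and deployed

written_index = 100 + written_growth_pct      # 280
delivered_index = 100 + delivered_growth_pct  # 130

# How much deployed code each unit of written code now yields,
# relative to before (1.0 would mean the ratio is unchanged).
ratio = delivered_index / written_index
print(f"{ratio:.2f}")
```

Under that reading, each unit of written code turns into less than half as much shipped code as before, which is one way to see that the steps after writing have become the constraint.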
A benchmark is something you can measure, and anything that can be measured can be used for training. That is why coding Agents matured first: compilers serve as free verifiers, and so do test suites. When answers can be checked automatically at almost zero cost, you can keep optimizing against that checking signal until you break through it.
However, passing tests never means that the change is correct for a code base that has been running for ten years. That module may exist for three undocumented reasons; a deployment pipeline may be barely maintained by a cron job that no one is willing to admit writing.
This correctness cannot be read from the leaderboard, nor can it be truly gleaned from anything directly. You can only let such a complex system run long enough in the real world to know whether it truly works. And smarter models do not make the real world run faster. No one would completely trust a vast system like Google just because it passed unit tests and showed green checkmarks. You trust it because it has withstood years of real load.
This correctness is not only proprietary, but also a slowly formed moat that cannot be directly compressed in time by capital. Even optimists admit that this clock cannot skip ahead. Noam Brown, a pioneer in OpenAI's reasoning models, recently wrote: the only reliable way to assess an Agent's performance over a one-year cycle might be to let it run for a year.
As Gabe Pereyra stated, true automation is not just about models getting stronger. It is about products, models, workflows, and company organizations changing together, and three of these four move at the pace of the organization.
Getting people to move is a part that no benchmark touches: persuading a skeptical partner to change her way of doing things, keeping a team cohesive during a rebuilding process. This is also why we value a CEO's people skills at least as much as analytical skills when hiring. Smarter models do not change that weighting.
The feedback here is ambiguous, the time span is measured in years, and trust belongs to a specific individual. Every company I know has already let every engineer use cutting-edge coding models, but no company's engineering organization has changed at a pace close to that of model advancements. Adopting tools took just one quarter, which was such a magical quarter of token growth! But real rebuilding takes years.
Measurable work is leaving. Truly valuable work is structurally unreadable: anything you can put on a leaderboard can be trained on; thus, anything measurable is already heading towards commoditization. This process takes time and will never be fully completed, but the direction will never reverse.
My friend Matt MacInnis at Rippling puts it in monetary terms: a token that only answers a general question is worth almost nothing, because any model can answer it; a token that reasons over your company's data is worth much more, because it does what you actually want instead of just generating a plausible-sounding answer.
Readable work will be consumed from two directions.
From below, tasks will saturate: once a job can be checked at low cost, buyers stop caring which model did it and start asking how much it costs. The work then falls to the cheapest open-source or distilled model of the week. Wherever margins can matter, they eventually will.
From above, labs are attempting to make models swallow their own scaffolding. Retrieval, routing between cheap calls and expensive calls, tool usage, even reasoning strategies—all the devices that used to be wrapped outside the model are being pulled into the model weights until the "shell" itself becomes the model. This is the boundary of absorption.
Profit pressure will also play a role from another direction: a general Agent must be ready to handle everything at any time, which is very costly; while a focused application can optimize a workflow to the extreme, consuming only a small portion of tokens. Moreover, unlike labs selling these tokens, application companies can keep the middle margin.
Therefore, we can ask two questions of any kind of work. First, is its correctness proprietary and costly, a truth that exists only within a particular company's data? Second, is it isolated within a system that outsiders cannot enter? When we combine these questions with the level of task saturation, we arrive at a 2×2 matrix.
Tasks that are saturated and have publicly available answers are the domain of commoditized tokens, which open-source models will occupy. Cutting-edge work with publicly available answers, such as coding benchmark tasks, is where labs will win: when evaluation is free, merely holding the evaluation confers no advantage.
The real prize lies in the last corner, which is the "untrainable" corner: cutting-edge work, but its correctness only exists in proprietary environments. You can see this in the inference clouds serving AI-native pioneers: the vast majority of tokens are generated by custom models, not by general open-source models.
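The 2×2 described above can be written down as a small lookup. This is only an illustrative sketch: the function name `quadrant` and the label strings are assumptions made for this example, and since the article describes only three of the four corners, the fourth is left explicitly unspecified.

```python
def quadrant(saturated: bool, proprietary: bool) -> str:
    """Place a kind of work on the 2x2 (labels are illustrative, not the author's)."""
    if not proprietary:
        if saturated:
            # Saturated task, publicly available answers.
            return "commodity tokens: open-source models"
        # Cutting-edge task, publicly available answers.
        return "public frontier: labs win"
    if not saturated:
        # Cutting-edge task, correctness lives in a proprietary environment.
        return "untrainable corner: application companies"
    # Saturated task with proprietary correctness: the article does not discuss it.
    return "not described in the article"


print(quadrant(saturated=False, proprietary=True))
```

The point of the sketch is that improving the model moves work along the saturation axis only; nothing in the lookup lets a smarter model flip `proprietary` from true to false.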
The wall around this last corner varies in height. A developer's toy codebase is transferable and standardized, so getting into it is not difficult. But a bank's production system is neither transferable nor standardized. You won't gain root access just by being 2% smarter on SWE-Bench Verified.
Capable models will consume many things, but better models will not turn private real standards into public ones. They do not hold licenses, will not sign for liabilities, nor own the company's documents; when answers go wrong, they cannot be a party that is sued. The bottleneck here is not intelligence, but permissions and responsibility. You can imagine a model smarter than anyone, but it still must be allowed to enter and there still must be someone putting their name on what it has done.
That door has a lock and a bolt.
The lock is the environment: only after gaining trust within a system, passing a security review, completing integration, and signing a contract with responsibility for results can you verify whether AI has really done something useful.
The bolt is the user. Today, most doctors in the U.S. open OpenEvidence daily, a habit that no amount of computing power can buy. A lab could train a perfect medical model tomorrow, but it still could not enter a doctor's working habits or UCSF's decision-making processes. Trust is built gradually through relationships and user acceptance; gradient descent cannot shortcut it.
This is also the work of application companies. An application can occupy a position in the "untrainable" corner through those rather unglamorous tasks: organizing a company's proprietary realities so that models can act upon them; handing action tools to models; and changing, with clients, how their workforces actually operate.
A company capable of completing this "translation" is difficult to replicate, and the translation never ends: integration and maintenance continue for as long as the customer relationship does. The teams that win this game place domain-specialized engineers and tools right beside their clients.
For example, in a top-tier established law firm, just in merger and acquisition services, there are nearly a thousand transactions each year. You cannot let several hundred paralegals separately download client files to their desktops and then hand them over to a general Agent for reading. Confidentiality alone does not allow for this, not to mention there are a dozen other issues. Even if it could be done this way, all you learn is fragments: one assistant corrects a little at a time, and no one can see how an entire transaction flows.
The truly important signals exist at the transaction level. A transaction has its own shape: for M&A, there are NDAs, term sheets, due diligence, purchase agreements, ancillary documents, and closing checklists; for intellectual property litigation, there are motions, evidence disclosures, prior art, and more motions. Each business area has its own structure, and lawyers and tools cannot be swapped freely between them.
Moreover, the real problem the law firm needs to solve lies at a higher level: how to run every business area simultaneously, the way top partners manage hundreds of matters in parallel while bringing in new cases and training junior lawyers. Transforming such a firm is not a single problem you can write a test task for. It requires someone who approaches it like playing "data baseball": intermediate objectives are extremely vague, feedback is incomplete, cycles are very long, and the environment itself does not stay still.
Unfortunately, unreadable value is also difficult to sell, for the same reason it is difficult to commoditize: a company cannot judge from the outside whether AI will transform its operations the way benchmark results suggest. The strongest companies will therefore stop trying to prove themselves from the outside; they will get inside the client first and then price on results.
Sierra only charges when its Agent solves the client's problem; if the issue is handed over to humans, it does not charge. Thus, the price itself becomes the assessment mechanism. And this works because Sierra holds the definition of "solved." Cognition’s Devin has done the same thing in the software realm by introducing "performance guarantees." Only when you are trusted to enter a system can you qualify to provide such guarantees for results.
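Outcome-based pricing of this kind is simple to state in code. The sketch below is a hypothetical illustration, not Sierra's actual billing logic: the `Conversation` type, the `invoice_cents` function, and the price of 150 cents per resolution are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversation:
    resolved_by_agent: bool  # False means the issue was handed over to a human


def invoice_cents(conversations, price_per_resolution_cents):
    """Bill only the conversations the Agent resolved on its own."""
    billable = sum(1 for c in conversations if c.resolved_by_agent)
    return billable * price_per_resolution_cents


# Three conversations, one of which was escalated to a human (hypothetical data).
month = [Conversation(True), Conversation(False), Conversation(True)]
total = invoice_cents(month, price_per_resolution_cents=150)
print(total)  # 300
```

Everything interesting is hidden in who sets `resolved_by_agent`. The arithmetic is trivial; the vendor's leverage comes from being trusted to define that flag.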
Even at the level of providing token services — which everyone likes to call pure commodities — its performance is not commodity-like. The best AI-native companies will centralize their services with one or two suppliers, like Baseten or Fireworks. Because while the cost of each token is heading towards commoditization on schedule, the reliability under real traffic and the stable acquisition of scarce computing power will not become commoditized. Where inference services are provided and which models are used are two different choices. The only part of inference that is truly commodity-like is the price.
A common rebuttal is: your labs are your suppliers, why wouldn’t they dump their own first-party products below cost and crush you? Or directly revoke your API access and take the market themselves? This is the true version of that sense of despair. But it only holds when the model layer is a single-player game.
Clearly, that is not the case. The model layer is more like a death race between three and a half players, alongside a batch of international players lagging in training by about six months and a development alliance that is five times the size of last year’s. Clients want competition among their suppliers, while labs want market share more than wanting to kill any specific application.
You can see this in markets where labs compete directly. In consumer chat scenarios, the best models have never simply won the entire market. ChatGPT has maintained its lead through years of real competition; the share it is now losing is flowing to Gemini, and the reason is the distribution power of Android and search, not a better model. Anthropic is currently believed to have the best model, judging by prediction markets and general online sentiment, but it is hardly a major player in consumer chat and is building its business in enterprise and coding scenarios.
If a better model cannot take users from its competitors in the core applications, it will not easily digest a hospital's medical record system or a bank's responsibility framework through integration. Today, the criteria by which the public chooses products are not just about coding capabilities. If the cutting-edge model layer is still crowded, then the application layer above it will hold value.
If a job cannot be scored from the outside, then someone inside must decide what constitutes a good answer. That decision is the entire game itself. If enough of these decisions are written down, they become benchmark tests. Harvey released benchmark tests in the legal domain, and Sierra released benchmarks for voice Agents. Your right to define what "good" means within a domain comes from that domain already using you. And these companies won this right through the hard struggles of real adoption processes.
The assessments that truly determine the flow of money are proprietary and formed company by company: what this company will accept as good work on such matters. And this is far from complete, as the depth of law far exceeds any public test. OpenEvidence is solidifying what constitutes safe clinical answers.
All of this is not really "measurement" in the traditional sense, but about what is real and what good judgments are. These judgments are documented until they become the standards that everyone else must accept for assessment. No matter how smart the foundational model labs become, they cannot create these standards out of thin air, for that status only exists within the domain.
This authority often falls where it originally exists. Senior lawyers write legal benchmarks. Doctors define what safe clinical answers are. What "solved" means is determined by the company that already has client relationships.
The boundary of absorption will continue to rise, as we will continuously learn to measure more jobs, and what can be measured will be consumed. The ground of the untrainable will keep shrinking under the feet of those standing on it, so you cannot find a defensible position and stay there. You must keep moving toward areas that cannot yet be scored, and keep reassessing and re-underwriting risk.
On a narrow task, with your proprietary data and your own assessment system, you can train to a cutting-edge level and defeat general models in critical scenarios; this specialized model will become part of the moat. On the other hand, if you are competing on the capacity of general models, that is a war of capital, and you will lose to those with more computing power. This is also the trap that companies with only shallow access permissions and highly readable tasks are most prone to falling into.
When a company decides that, to survive, it must train on general tasks and exceed the capabilities of cutting-edge models, the outcome often seems predetermined by the size of the data center. The usual result is not an independent champion but a sale to a player with ample computing power.
All of the above is defense. The more challenging task is offense: first determining what to build. This is what I have been searching for over the past year, and I have probably only found it three times. Models cannot help with this. They will do what you point them at; but they cannot tell you what is worth pointing at. You cannot establish benchmark tests for this, so you cannot train it.
This is also why the existing giants will not take everything: they will protect the territory they already hold, while the next thing will come from someone who finds a use for it ahead of others. Perhaps intent is a more scarce input than computing power.
This sense of despair is half correct. The thin outer shell is indeed being absorbed, and many things that look like companies today are indeed merely thin shells. But its judgment on "what remains after absorption" is wrong. The mechanism is clear, but the endpoint is not.
I am willing to wager on this direction: intelligence will continue to become cheaper, while value will continue to slide towards areas unreachable by a few models. What is untrainable is value laden with history.
So, enter one of these fields to do those rather unglamorous translation tasks, and then start writing down the definition of "good" there. Because someone will always do this. The most frequently cited benchmark score this year is really two things: a map that will soon lose all value, and a notice to certain people that they are about to lose the right to define what "good" is.