
Author|Frank Fu @ IOSG
The revenue hole David Cahn identified in 2023 was never filled on the training side. It is being filled on the inference side, and the market has only begun to price that in over the past few weeks. Once Nvidia reorganized its financial reporting around "service tokens" and Cerebras went public 20 times oversubscribed, the bottleneck debate was over. The real question is the next one: when inference becomes the scarce resource, where in the compute stack does value settle?
Follow the GPU: From a $200 Billion Question to a $600 Billion Question
In 2023, Sequoia's David Cahn raised the question hanging over the entire AI buildout, known as "AI's $200B Question." For every dollar spent on a GPU, roughly another dollar must be spent to power it in a data center. Add the margin that the GPU's end users must themselves earn, and each year's GPU CapEx implies that these chips must eventually generate about $200 billion in revenue to pay back the capital. Even under very generous assumptions about AI revenue, he found a hole of more than $125 billion between what was being invested and what end customers were actually paying. The concern was straightforward: GPUs were being built out ahead of real demand.
A year later, the gap had not narrowed; it had widened. In his 2024 sequel, Cahn restated it as "AI's $600B Question" as hyperscalers' CapEx ballooned. The bearish logic converges on a familiar shape: overbuilding leads to oversupply, and oversupply burns capital.
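The arithmetic behind the $200 billion figure can be sketched in a few lines. This is an illustration only: the $50 billion annual GPU spend, the 50% end-user margin, and the $75 billion of assumed AI revenue are inputs chosen here to reproduce the totals quoted above, and none of them is stated in this article.

```python
# Illustrative reconstruction of the "$200 billion" arithmetic.
# All three inputs below are assumptions, not figures from this article.
GPU_SPEND_B = 50.0          # assumed annual GPU spend, in $ billions
END_USER_MARGIN = 0.50      # assumed margin the GPU's end user must earn
ASSUMED_AI_REVENUE_B = 75.0 # assumed "generous" AI revenue, in $ billions

# From the text: about $1 of data center cost for every $1 of GPU.
DATA_CENTER_MULTIPLIER = 2.0


def required_revenue_b(gpu_spend_b: float, dc_multiplier: float, margin: float) -> float:
    """Revenue needed so that total cost is covered at the given margin."""
    total_cost_b = gpu_spend_b * dc_multiplier
    return total_cost_b / (1.0 - margin)


required_b = required_revenue_b(GPU_SPEND_B, DATA_CENTER_MULTIPLIER, END_USER_MARGIN)
gap_b = required_b - ASSUMED_AI_REVENUE_B

print(f"required revenue: ${required_b:.0f}B")
print(f"revenue gap:      ${gap_b:.0f}B")
```

Under these assumptions the required revenue comes to $200 billion and the gap to $125 billion, which matches the numbers in the paragraph above. Changing any one input shows how sensitive the hole is to the margin assumption in particular.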
Both articles ask the same question: who will fill this hole? The answer never appeared on the training side of the ledger. It is showing up on the inference side, and the market has only begun to price it in over the past few weeks.
Cerebras IPO and Inference Squeeze
Cerebras went public on Thursday. The IPO was about 20 times oversubscribed, and the shares traded at nearly double the final, marked-up offer price set on Wednesday. The demand did not come from a bet on the "next Nvidia killer," but from something more straightforward: the market is beginning to realize that the real bottleneck in AI is inference, not training.
Cerebras's core competency is a chip architecture that makes inference extremely fast. Not training, but inference. That is what excites Wall Street. The inference market is recurring and grows with usage. Every time Claude answers a question, every time an agent performs a task, compute is consumed. Training happens once; inference never stops.
J.P. Morgan estimates that the inference market is 10 to 50 times the size of the training market. Once machines begin executing tasks assigned by other machines, that is, once agentic scaling takes hold, inference demand no longer scales with the number of users but with compute itself.
Nvidia Redraws the Landscape: Inference Becomes Headline
If Cerebras represents the market's awakening, Nvidia's latest quarterly report is confirmation from the top of the supply chain. On the latest earnings call, Jensen Huang made the subtext explicit: AI demand is growing parabolically. The reason is simple: agentic AI has arrived. Mainstream AI has moved from one-shot inference to reasoning and is now in the agent phase, where it can invoke tools and orchestrate tasks on its own. Huang stated, "Tokens are now profitable." In the AI era, compute equals revenue and profit.
This reshapes the entire industry. Training is a one-time cost of building a model; inference is the recurring cost of running it, and today’s bottleneck lies in inference, not training.
Nvidia has built this judgment into its reporting framework. It now discloses two platforms instead of one: Data Center and Edge Computing. Data Center (about $75 billion this quarter, +92% year-on-year) is further split into Hyperscale (about $38 billion, +12% quarter-on-quarter) and ACIE, i.e., AI clouds, industrial, and enterprise (about $37 billion, +31% quarter-on-quarter). The brand-new line is Edge Computing: $6.4 billion, +29% year-on-year, covering the endpoints where agentic AI and physical AI actually run, such as PCs, workstations, AI-RAN base stations, robots, and cars.
Edge currently accounts for less than 8% of total revenue, but Nvidia has elevated it to sit alongside Data Center as the "second platform." The signal is that inference is splitting into two fronts: cloud inference in the data center and endpoint inference at the edge, because AI must see, move, and act in the physical world. The roadmap follows the same logic: Vera Rubin arrives starting in the third quarter, with inference throughput up to 35 times that of Blackwell. Huang also put forward a brand-new total addressable market (TAM) of $200 billion for the Vera CPU, which is designed for agentic workloads. Every leading model company is expected to be fully on it from day one.
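The segment figures quoted above can be cross-checked with simple arithmetic. This is a minimal sketch that assumes the two disclosed platforms (Data Center and Edge Computing) together make up total revenue; the article does not say whether other revenue lines exist, so treat the resulting share as approximate.

```python
# Segment revenue in $ billions, as quoted in the text (rounded figures).
segments_b = {
    "hyperscale": 38.0,
    "acie": 37.0,
    "edge": 6.4,
}

# Data Center is the sum of its two sub-segments.
data_center_b = segments_b["hyperscale"] + segments_b["acie"]

# Assumption: total revenue is just Data Center plus Edge Computing.
total_b = data_center_b + segments_b["edge"]

edge_share = segments_b["edge"] / total_b
acie_share_of_data_center = segments_b["acie"] / data_center_b

print(f"data center: ${data_center_b:.1f}B")
print(f"edge share of total: {edge_share:.1%}")
print(f"ACIE share of data center: {acie_share_of_data_center:.1%}")
```

On these rounded inputs the sub-segments sum to the stated $75 billion, and Edge comes out just under 8% of the total, consistent with the claim in the text.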
When the most valuable company on Earth reorganizes its financial disclosures around "service tokens," the debate over the bottleneck is already settled. The rest of this article discusses who captures value when inference, rather than training, becomes the scarce resource.
First, a note on scope. Of these two fronts, this article covers cloud inference, i.e., GPUs rented from data centers to serve tokens over APIs. Endpoint inference runs on local chips inside the device itself (Nvidia's Jetson, RTX, Drive, AI-RAN) and bypasses the GPU rental and aggregation stack entirely. Treat it here as an amplifier of the overall inference economy that supports the bottleneck argument, not as the market where Hyperbolic and Venice operate; both sit entirely on the cloud side.
The Squeeze Has Arrived
Anthropic is the canary in the coal mine. Usage has far outrun provisioned capacity, and complaints that Claude has been "lobotomized" have flooded the internet: throttled replies, slower inference, and compressed context windows. The fix is raw compute: in May 2026, Anthropic took over the entire Colossus 1 data center from SpaceX, with more than 220,000 Nvidia GPUs and over 300 megawatts, and dedicated it to inference, not training.
This capacity unlocked a series of limit changes, each of which is a signal. On May 6, Anthropic doubled Claude Code's five-hour limit, removed peak-hour limits, and significantly raised the API rate limits for Opus. On May 13, it raised Claude Code's weekly limit by another 50% (through July 13). Then, effective June 15, it did the opposite of generous: it moved agentic and programmatic usage (Agent SDK, headless mode claude -p, CI pipelines) out of flat subscriptions and into a separately metered credit pool (ranging from $20 to $200 per month, billed at API rates). That last move crystallizes the whole argument: agents consume inference at a rate far beyond what flat subscriptions were designed for, so their usage must be priced at its true recurring cost.
Training is a one-time capital expenditure. Inference is a recurring operational cost, compounding with each new user and each new agent.
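A rough sketch shows why flat pricing breaks down once agents enter the picture. Every number below is hypothetical: the blended rate of $10 per million tokens and the two usage profiles are invented for illustration and are not Anthropic's actual prices or usage data. Only the $200 monthly figure echoes the top of the range mentioned above.

```python
# Hypothetical inputs for illustration only.
FLAT_PRICE_USD = 200.0           # monthly flat subscription
RATE_USD_PER_M_TOKENS = 10.0     # hypothetical blended API rate per million tokens


def breakeven_tokens_m(flat_price_usd: float, rate_usd_per_m: float) -> float:
    """Millions of tokens per month that a flat fee covers at the metered rate."""
    return flat_price_usd / rate_usd_per_m


def metered_cost_usd(tokens_m: float, rate_usd_per_m: float) -> float:
    """What the same usage would cost if billed per token."""
    return tokens_m * rate_usd_per_m


# Hypothetical monthly usage, in millions of tokens.
usage_profiles_m = {"interactive chat user": 5.0, "always-on coding agent": 400.0}

breakeven_m = breakeven_tokens_m(FLAT_PRICE_USD, RATE_USD_PER_M_TOKENS)
print(f"flat fee covers about {breakeven_m:.0f}M tokens per month")

for name, tokens_m in usage_profiles_m.items():
    cost = metered_cost_usd(tokens_m, RATE_USD_PER_M_TOKENS)
    verdict = "covered" if tokens_m <= breakeven_m else "served at a loss"
    print(f"{name}: ${cost:,.0f} at metered rates ({verdict} under the flat fee)")
```

With these made-up inputs the flat fee covers 20 million tokens, the chat user costs $50 to serve, and the agent costs $4,000. The exact figures do not matter; the point is that recurring cost scales linearly with tokens while a flat fee does not.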
The Stack: Six Layers, One Bottleneck
Every AI application sits on a supply chain that starts with TSMC's fabs and ends at API endpoints:


Most companies only own one layer. Nvidia owns the silicon, CoreWeave owns the bare metal, Together AI owns inference optimization, and OpenRouter owns model API routing.
Only one company is the exception.
Hyperbolic: The Only Company Across Three Layers
Hyperbolic launched its on-demand GPU marketplace in June 2025. Within the first few months it passed 200,000 developers, with adopters including leading AI labs, search engines, and large consumer platforms.
What is interesting is its architecture.
Hyperbolic does not own a single GPU. Every card comes from neoclouds and data centers, including CoreWeave, Lambda Labs, Nebius, and smaller operators with idle capacity. This sounds like a weakness, but in reality it is a moat.
By sitting between GPU supply and GPU demand, Hyperbolic sees real-time data that others cannot. It knows who is buying which GPU, at what price, and when. It spotted the oversupply before it became public and saw the demand surge before it hit the market.
Today, the moat is this multi-cloud aggregation itself. Hyperbolic stitches fragmented capacity from dozens of independent clouds and data centers into one standardized pool, so developers can rent the cheapest available GPU anywhere without negotiating with each operator or juggling a pile of accounts. The more clouds it connects, the deeper the liquidity and the richer the pricing data. Beyond that, the team is exploring how to use this data to model GPU price curves and, eventually, to commit its own capital to smooth out supply and demand, acting as a market maker in physical compute. That goal is still at an early stage; for now, it is the aggregation layer that is delivering compounding benefits.
This is the flywheel:
Connect more clouds → more aggregated supply
More supply → deeper market and real-time pricing data
Better data → smarter routing now and, in the long run, pricing models
Better liquidity and prices → more developers → more clouds want to connect
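The aggregation and routing step at the heart of this flywheel can be sketched as a lookup over pooled offers. This is a minimal illustration and not Hyperbolic's actual system: the Offer type, the cheapest_offer function, the provider names, and the prices are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Offer:
    """One provider's listing in the pooled market (hypothetical schema)."""
    provider: str
    gpu_model: str
    usd_per_gpu_hour: float
    gpus_available: int


def cheapest_offer(
    offers: Sequence[Offer], gpu_model: str, gpus_needed: int
) -> Optional[Offer]:
    """Return the lowest-priced offer that can fill the whole request, if any."""
    candidates = [
        o for o in offers
        if o.gpu_model == gpu_model and o.gpus_available >= gpus_needed
    ]
    return min(candidates, key=lambda o: o.usd_per_gpu_hour, default=None)


# Hypothetical pooled supply from three made-up providers.
pool = [
    Offer("cloud-a", "H100", 2.40, 64),
    Offer("cloud-b", "H100", 1.95, 4),
    Offer("cloud-c", "H100", 2.10, 16),
    Offer("cloud-c", "A100", 1.10, 32),
]

best = cheapest_offer(pool, "H100", gpus_needed=8)
print(best)
```

Here the lowest listed H100 price belongs to cloud-b, but it only has 4 cards, so a request for 8 routes to cloud-c instead. A real router would also have to weigh factors this sketch ignores, such as splitting a job across providers, region, interconnect, and reliability.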
No other company is attempting this. Hyperbolic is the only one simultaneously spanning the GPU rental layer, deployment layer, and model API layer.
Venice as the Mirror
Venice is the clearest expression of the inference economy at the application layer and a useful contrast to Hyperbolic's position. It is a privacy-first inference application: an OpenAI-compatible API combined with consumer subscriptions (Free / Pro / Pro+ / Max), routing requests to about 75 models. Roughly two-thirds of these are open-source or self-hosted models (Llama, Mistral, Qwen, DeepSeek), while the rest are anonymized passthroughs to closed-source frontier models. The key point is that Venice itself owns no meaningful compute. It rents GPUs from undisclosed partners and confidential computing suppliers (NEAR AI Cloud, Phala) and pays frontier labs for the passthroughs, so its true cost of revenue is inference compute, not SaaS hosting.
What Venice truly sells is privacy. "Privatization" here does not mean turning public compute into private property; it means adding a layer of assurance on top of commoditized inference: no data retention, no training on user data, anonymized requests, and some workloads running in a TEE so that even the operator cannot see the plaintext. The underlying compute is a commodity; the markup is the privacy wrapper. This assurance is also tiered rather than uniform. For models running on GPUs that Venice controls or in a TEE, it can approach end-to-end confidential computing. For anonymized passthroughs to closed-source models such as Claude or GPT, privacy only strips identity, and the frontier lab still processes the raw prompt. The strongest privacy therefore covers only the open-source portion, while the frontier portion is "anonymous" rather than "truly confidential." Venice's gross profit = subscription price - inference costs paid to its suppliers. The amount it can charge above the bare API price rests almost entirely on this privacy premium, which is also why its margins are thin and constrained by frontier passthrough pricing.
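The margin identity above is easy to make concrete. The sketch below uses invented numbers throughout: the subscription price, the inference cost, and the bare API price are hypothetical values, not Venice's actual figures, and the function names are for illustration only.

```python
def gross_profit_usd(subscription_price: float, inference_cost: float) -> float:
    """Gross profit = subscription price minus inference cost paid to suppliers."""
    return subscription_price - inference_cost


def gross_margin(subscription_price: float, inference_cost: float) -> float:
    """Gross profit as a fraction of the subscription price."""
    return gross_profit_usd(subscription_price, inference_cost) / subscription_price


def privacy_premium_usd(subscription_price: float, bare_api_price: float) -> float:
    """What the user pays above the price of the same usage on a bare API."""
    return subscription_price - bare_api_price


# Hypothetical monthly figures for one subscriber, in USD.
price = 20.0           # what the subscriber pays
inference_cost = 15.0  # what is paid out for GPUs and frontier passthrough
bare_api_price = 16.0  # what the same usage would cost on a bare API

print(f"gross profit:    ${gross_profit_usd(price, inference_cost):.2f}")
print(f"gross margin:    {gross_margin(price, inference_cost):.0%}")
print(f"privacy premium: ${privacy_premium_usd(price, bare_api_price):.2f}")
```

In this example, a rise in passthrough pricing that lifts inference cost from $15 to $18 would cut gross margin from 25% to 10% unless the subscription price rises too, which is the constraint described in the paragraph above.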
The token design packages this inference demand. Venice runs on two tokens: VVV (for staking and platform access) and DIEM, an inference credit, with each DIEM roughly equivalent to $1 of compute per day. Paid subscriptions trigger a programmatic buyback and burn of VVV (approximately $2 / $5 / $10 for Pro / Pro+ / Max, respectively), and emissions are scheduled to decline: 6M per month → 5M → 4M VVV, dropping to 3M on July 1. The buyback is real, but it is discretionary and modest: about $103,000 was burned in each of April and May, with June climbing slowly toward about $110,000, far below the $200,000-per-month mark.
The fundamentals are healthier than the headlines. The widely circulated figure of "$70 million ARR" can almost certainly be traced to subscription renewals being mistakenly counted as net new customers; the defensible observable range is closer to $6 million to $15 million ARR. Beneath that, the traction is real: about 136,000 token holder addresses, approximately 9.9 million website visits per month (about 330,000 per day), and new Pro subscriptions running at around 1,400 per day. This is a real business, but a low-margin one, with economics constrained by the compute it purchases.
This is precisely why Hyperbolic occupies the layer above it. If Venice is a gas station, Hyperbolic is a refinery. Venice buys compute from the same constrained supply that everyone relies on; Hyperbolic aggregates and standardizes that fragmented supply and sells it to Venice and every player like it. As inference demand grows, value accrues not only to the applications that consume compute but also to the layers that aggregate and route it and capture the cost of revenue those applications pay.
Why This Matters Now
Nvidia has reorganized its financial reporting around "service tokens." Cerebras's IPO shows that the market now understands inference to be the bottleneck. Anthropic's frantic search for capacity shows that the problem is real. Agentic and physical AI will magnify demand by several orders of magnitude, across both cloud and edge.
This also closes the loop on the "$600B Question" from the other side. Cahn's bearish logic of overbuilding followed by oversupply may ultimately prove right. But oversupply is precisely the ideal market condition for asset-light aggregators: when GPU prices fall and supply fragments across dozens of clouds, the player that owns no hardware but routes every workload to the cheapest available card earns the spread, while operators holding depreciating GPUs absorb the losses. Hyperbolic is long oversupply, not short it.
The company that ultimately prevails will not be the one with the most GPUs, but the one that can tell you which GPUs are available where and at what price, and route every workload to where it can run at the lowest cost.
Hyperbolic is building that company. It owns no GPUs, is pure software, spans three layers, and is shaping up to become the ultimate aggregation layer for inference compute.