Author: Frank Fu, IOSG
The hole David Cahn identified in 2023 was never filled on the training side. It was filled on the inference side, and the market has only begun pricing it in over the past few weeks. With Nvidia reorganizing its financial reporting around "service tokens" and Cerebras going public 20 times oversubscribed, the bottleneck debate is over, and the real question has become the next one: when inference becomes a scarce resource, at which layer of the computing power stack will value be anchored?
I. Following the GPU: From the $200 Billion Problem to the $600 Billion Problem
In 2023, Sequoia's David Cahn raised the question that hangs over the entire AI buildout: the "$200 billion problem." For every dollar spent on GPUs, roughly another dollar is needed to power them in data centers, so each year's GPU CapEx implies that these chips must generate approximately $200 billion in revenue to recoup the capital. Even under extremely generous assumptions about AI revenue, he still found a hole of more than $125 billion between what was being invested and what end customers were actually paying. The concern was straightforward: GPUs were being built out ahead of actual demand.
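The arithmetic behind the $200 billion figure can be sketched in a few lines. This is a rough reconstruction, not Cahn's own model: the $50 billion GPU spend, the 50% end-user gross margin, and the $75 billion revenue figure are assumptions chosen so that the numbers reproduce the $200 billion requirement and the roughly $125 billion hole quoted above. Only the "one extra dollar per GPU dollar" multiplier comes from the text.

```python
def required_revenue(gpu_capex_bn: float,
                     datacenter_multiplier: float = 2.0,
                     end_user_gross_margin: float = 0.5) -> float:
    """Revenue (in $bn) needed to pay back one year of GPU spend.

    datacenter_multiplier: each GPU dollar needs about one more dollar
        to run it in a data center (from the article), hence 2.0.
    end_user_gross_margin: ASSUMPTION, not stated in the article.
    """
    total_datacenter_cost = gpu_capex_bn * datacenter_multiplier
    return total_datacenter_cost / (1.0 - end_user_gross_margin)


gpu_capex_bn = 50.0           # ASSUMPTION: annual GPU spend, $bn
assumed_ai_revenue_bn = 75.0  # ASSUMPTION: generous AI revenue estimate, $bn

needed_bn = required_revenue(gpu_capex_bn)
gap_bn = needed_bn - assumed_ai_revenue_bn

print(f"required revenue: ${needed_bn:.0f}bn, hole: ${gap_bn:.0f}bn")
```

Under these inputs the required revenue comes out at $200 billion and the hole at $125 billion. Changing either assumed input moves both numbers, which is why the size of the hole is best read as an order of magnitude.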
A year later, the gap had not narrowed; it had widened. In his 2024 sequel, Cahn recast the issue as the "$600 billion problem," as the CapEx of the hyperscalers kept expanding. The bearish logic converged into a familiar shape: overbuilding leads to oversupply, and the excess burns capital.
Both articles were essentially asking the same question: who will fill this hole? The answer never appeared on the training side of the ledger. It appeared on the inference side, and the market has only begun pricing it in over the past few weeks.
II. Cerebras IPO and Inference Squeeze
Cerebras went public on Thursday. The IPO was 20 times oversubscribed and priced at nearly double Wednesday's final mark. The demand did not come from a bet on the "next Nvidia killer," but from a simpler realization: the market has begun to acknowledge that the real bottleneck in AI is inference, not training.
Cerebras's strength is a chip architecture built for very fast inference. Not training: inference. That is what excites Wall Street. The inference market is recurring and expands with usage. Each time Claude answers a question or an agent executes a task, it consumes computing power. Training happens once, while inference never stops.
J.P. Morgan estimates the inference market to be 10 to 50 times the size of the training market. Once machines start executing tasks assigned by other machines, as agentic workloads spread, inference demand no longer scales with the number of users; it scales with computing power itself.
III. Nvidia Redrawing the Landscape: Inference Takes Center Stage
If Cerebras represents the market's awakening, Nvidia's latest quarterly report is the confirmation from the top of the industry chain. On the latest earnings call, Jensen Huang made the unspoken truth explicit: AI demand is growing parabolically. The reason is simple: agentic AI has arrived. Mainstream AI has moved from one-shot inference to reasoning, and then to the agent phase, where it can call tools and orchestrate tasks. Huang stated, "Tokens are now profitable." In the age of AI, computing power equates to revenue and profit.
This reshapes the entire industry. Training is a one-time cost for building a model, while inference is an ongoing operational cost, and now the bottleneck lies in inference, not training.
Nvidia has integrated this judgment into its financial report metrics. It now discloses figures for two platforms instead of one: Data Center and Edge Computing. Data Center (approximately $75 billion this quarter, +92% year-on-year) is further divided into Hyperscale (approximately $38 billion, +12% sequentially) and ACIE, which stands for AI Cloud, Industrial, and Enterprise (approximately $37 billion, +31% sequentially). A new line is Edge Computing: $6.4 billion, +29% year-on-year, covering endpoints where agentic AI and physical AI really operate, such as PCs, workstations, AI-RAN base stations, robots, and cars.
Edge still accounts for less than 8% of total revenue, but Nvidia has elevated it to stand alongside Data Center as the "second platform." This signals that inference is splitting into two fronts: cloud inference in the data center and endpoint inference on the edge, where AI needs to see, move, and act in the physical world. The roadmap follows the same logic: Vera Rubin, shipping in the third quarter, can deliver inference throughput up to 35 times that of Blackwell; Huang also laid out a brand-new $200 billion TAM for the Vera CPU, which is designed for agentic workloads. Every frontier model company is expected to move to it fully on day one.
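As a quick sanity check on the "less than 8%" claim, the segment figures quoted above can be plugged into a short calculation. One assumption is needed: that total revenue is simply Data Center plus Edge Computing, since those are the only two platforms the article lists.

```python
# Segment figures as quoted in the article, in $bn (all approximate).
data_center_bn = 75.0
hyperscale_bn = 38.0
acie_bn = 37.0
edge_bn = 6.4

# The two Data Center sub-segments should add up to the Data Center total.
assert abs((hyperscale_bn + acie_bn) - data_center_bn) < 1e-9

# ASSUMPTION: total revenue = Data Center + Edge (no other segments).
total_bn = data_center_bn + edge_bn
edge_share = edge_bn / total_bn

print(f"Edge share of total revenue: {edge_share:.1%}")
```

With these inputs Edge comes to a little under 8% of the total, consistent with the figure in the text.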
With the most valuable company on the planet reorganizing its financial disclosures around "service tokens," the bottleneck debate has already been settled. The remainder of this article discusses who captures value when inference (and not training) becomes a scarce resource.
A note on scope. Of these two fronts, this article covers cloud inference: leasing data center GPUs to provide API token services. Endpoint inference runs on local chips inside the devices (Nvidia's Jetson, RTX, Drive, AI-RAN) and bypasses the GPU leasing and aggregation stack entirely. Treat it here as an amplifier of the overall inference economy that supports the bottleneck argument, not as part of the market served by Hyperbolic and Venice, which sit entirely on the cloud side.
IV. The Squeeze Has Arrived
Anthropic is the canary in the coal mine. Usage has far exceeded provisioned capacity, and complaints about Claude being "dumbed down" have flooded the internet: throttled responses, slower inference, and compressed context windows. The fix is, bluntly, more computing power: in May 2026, Anthropic took over the entire Colossus 1 data center from SpaceX, with over 220,000 Nvidia GPUs and 300+ megawatts, and dedicated it to inference rather than training.
Unlocking this capacity set off a series of limit changes, each one a signal. On May 6, Anthropic doubled the five-hour limit for Claude Code, removed peak-time throttling, and significantly raised the API rate limits for Opus. On May 13, it increased the weekly limit for Claude Code by 50% (until July 13). Then, starting June 15, it did the opposite of "generous": it split agentic and programmatic use (Agent SDK, headless mode claude -p, CI pipelines) out of the flat subscription into a separately metered credit pool (ranging from $20 to $200 per month, billed at API prices). That last step distills the whole argument into a single action: agents consume inference far faster than a flat subscription was designed to absorb, so that usage has to be priced at its actual "recurring cost."
Training is a one-time capital expenditure. Inference is a recurring operational cost, compounding with each new user and each new agent.
V. The Stack: Six Layers, One Bottleneck
Every AI application resides on a supply chain that begins at TSMC’s fabrication plant and ends at the API endpoint:


Most companies only own one of the layers. Nvidia owns the silicon, CoreWeave owns the bare metal, Together AI owns inference optimization, and OpenRouter owns model API routing.
There is only one exception.
VI. Hyperbolic: The Only Company Spanning Three Layers
Hyperbolic launched its on-demand GPU market in June 2025. Within its first few months, it passed 200,000 developers, with users spanning frontier AI labs, search engines, and large consumer platforms.
What makes it interesting is its architecture, which is unique.
Hyperbolic does not own any GPUs. Every card comes from neoclouds and data centers, including CoreWeave, Lambda Labs, Nebius, and smaller operators with idle capacity. This may sound like a weakness, but it is actually a strong moat.
By sitting between GPU suppliers and consumers, Hyperbolic can see real-time data that others cannot. It knows who is buying which GPUs at what price and when. It sees oversupply before it becomes public and can gauge rising demand before it hits the market.
Today, the moat itself is this multi-cloud aggregation. Hyperbolic stitches fragmented capacity from dozens of independent clouds and data centers into a single standardized pool, so developers can rent the cheapest available GPU anywhere without negotiating with each operator or managing a multitude of accounts. The more clouds it connects, the deeper the liquidity and the richer the pricing data. The team is also exploring how to model GPU price curves with this data and, eventually, to deploy its own capital to smooth supply and demand, acting as a market maker for physical computing power. That goal is still at an early stage; what is actually compounding today is the aggregation layer.
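To make "rent the cheapest available GPU anywhere" concrete, here is a minimal sketch of the selection step an aggregator might run over a pooled order book. It is an illustration only, not Hyperbolic's actual system: the Offer type, the cheapest_offer function, the provider labels, and every price and quantity are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    provider: str          # hypothetical provider label
    gpu_model: str         # GPU type being offered
    price_per_hour: float  # USD per GPU-hour (made-up numbers)
    available_gpus: int    # cards currently free at this provider


def cheapest_offer(offers: list[Offer], gpu_model: str,
                   gpus_needed: int) -> Optional[Offer]:
    """Return the lowest-priced offer that can fill the whole request.

    Returns None when no single provider has enough free cards.
    """
    candidates = [
        o for o in offers
        if o.gpu_model == gpu_model and o.available_gpus >= gpus_needed
    ]
    return min(candidates, key=lambda o: o.price_per_hour, default=None)


# Hypothetical pooled supply from three independent clouds.
offers = [
    Offer("cloud-a", "H100", 2.40, 8),
    Offer("cloud-b", "H100", 1.95, 4),
    Offer("cloud-c", "H100", 2.10, 16),
    Offer("cloud-c", "A100", 1.10, 32),
]

best = cheapest_offer(offers, "H100", gpus_needed=8)
print(best)
```

In this example the cheapest H100 listing (cloud-b) cannot fill an 8-GPU request, so the job goes to cloud-c. A real router would presumably also weigh interconnect, region, reliability, and the option of splitting a job across providers, none of which this sketch attempts.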
This is the flywheel:
Access more clouds → More aggregated supply
More supply → Deeper market and real-time pricing data
Better data → Smarter routing now, long-term pricing models
Better liquidity and pricing → More developers → More clouds want to connect
No other company is attempting this. Hyperbolic is the only company that spans the GPU leasing, deployment, and model API layers at the same time.
VII. Venice: The Mirror
Venice is the clearest manifestation of the inference economy at the application layer and serves as a useful contrast to Hyperbolic's position. It is a privacy-first inference application: a set of OpenAI-compatible APIs combined with a consumer-facing subscription (Free / Pro / Pro+ / Max), routing requests to about 75 models. Two-thirds of these are open-source or self-hosted models (Llama, Mistral, Qwen, DeepSeek), and the rest are anonymized passthrough to frontier proprietary models. The key point is that Venice owns no meaningful computing power itself. It rents from undisclosed GPU partners and confidential computing providers (NEAR AI Cloud, Phala) and pays frontier labs for passthrough, so its true cost of revenue is inference computing power rather than SaaS hosting.
What Venice is truly selling is privacy. The "privatization" here is not about turning public computing power into private property, but about wrapping commoditized inference in a layer of guarantees: no data retention, no training on user data, anonymized requests, and some workloads even running in a TEE, where operators cannot see plaintext. The underlying computing power is a commodity; the markup is this layer of privacy packaging. That guarantee is also tiered rather than uniform. For open-source models running on self-controlled or TEE GPUs, near end-to-end confidential computing is achievable. For anonymized passthrough to closed-source models like Claude and GPT, privacy only strips the user's identity, while the frontier labs still process the original prompts. The strongest privacy therefore covers only the open-source part, while the frontier-model part is "anonymous" rather than "truly confidential." Venice's gross profit = subscription price - inference cost paid to its compute suppliers. It can charge more than the raw API price almost entirely because of this privacy premium, which is also why it operates on thin margins and is exposed to the pricing of frontier passthrough.
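The gross profit identity above is simple enough to write out, and doing so shows how exposed the margin is to what suppliers charge. The sketch below is illustrative only: the function names are mine, and the $18 subscription price and the three inference cost levels are hypothetical, not Venice's actual figures.

```python
def gross_profit(subscription_price: float, inference_cost: float) -> float:
    """Gross profit per subscriber-month: price charged minus inference bought."""
    return subscription_price - inference_cost


def gross_margin(subscription_price: float, inference_cost: float) -> float:
    """Gross profit as a fraction of the subscription price."""
    return gross_profit(subscription_price, inference_cost) / subscription_price


# HYPOTHETICAL figures, chosen only to illustrate the sensitivity.
SUBSCRIPTION_PRICE = 18.0            # USD per subscriber per month
INFERENCE_COSTS = (12.0, 14.0, 16.0)  # USD of inference bought per subscriber

margins = [gross_margin(SUBSCRIPTION_PRICE, cost) for cost in INFERENCE_COSTS]
for cost, margin in zip(INFERENCE_COSTS, margins):
    print(f"inference cost ${cost:.0f} -> gross margin {margin:.0%}")
```

In this toy case, a $4 increase in supplier pricing cuts the margin from about a third to about a ninth, because the subscription price is fixed while the cost of revenue is set by someone else.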
The token design wraps around this inference demand. Venice runs on two tokens: VVV (for staking and platform access) and DIEM, an inference credit in which each DIEM corresponds to approximately $1 of computing power per day. Paid subscriptions trigger a programmatic buyback and burn of VVV (approximately $2 / $5 / $10 for Pro / Pro+ / Max), and emissions step down on a fixed schedule: from 6M → 5M → 4M VVV per month, moving to 3M on July 1. The buyback is real, but discretionary and still small: about $103,000 was burned in each of April and May, with June edging toward about $110,000, far below the $200,000-per-month mark.
The fundamentals are healthier than the headlines. The publicly circulated figure of "$70 million ARR" is almost certainly the product of misreading subscription renewals as net new customer acquisition; a defensible, observable range is closer to $6 million to $15 million ARR. Beneath that headline figure, the traction is genuine: approximately 136,000 unique wallet addresses, about 9.9 million website visits per month (around 330,000 per day), and new Pro subscriptions hovering around 1,400 per day. This is a real business, but a thin-margin one, whose economics are constrained by the computing power it purchases.
This is precisely why Hyperbolic is positioned one layer upstream of it. If Venice is the gas station, Hyperbolic is the oil refinery. Venice purchases computing power from the same limited supply that everyone relies on; Hyperbolic aggregates and standardizes that fragmented supply and resells it to Venice and every player like it. As inference demand grows, value accrues not only to the applications that consume computing power but also to the layer that aggregates and routes it, capturing the cost of revenue that these applications pay.
VIII. Why This Matters at This Moment
Nvidia has reorganized its finances around "service tokens." Cerebras's IPO proves that the market has understood inference as the bottleneck. Anthropic's frantic quest for capacity demonstrates this is a real issue. Agentic and physical AI will amplify demand by several orders of magnitude, spanning both cloud and edge.
This also closes the loop on the "$600 billion problem" from the other side. Cahn's bearish logic, that overbuilding leads to oversupply, may yet be validated with a lag. But oversupply is precisely the best market for asset-light aggregators: when GPU prices decline and supply is fragmented across dozens of clouds, the player that owns no hardware and routes every workload to the cheapest available card captures the price differences, while operators holding depreciating GPUs absorb the losses. Hyperbolic is long oversupply, not short.
In the end, the company that prevails will not be the one with the most GPUs, but rather the one that can tell you which GPUs are available where, at what price, and route every workload to where it can be run at the lowest cost.
Hyperbolic is building such a company. It does not own GPUs, operates purely in software, and spans three layers, with the aim of becoming the ultimate aggregation layer for inference computing power.