Data is an asset: DataFi is opening up a new blue ocean.

Original Title: "Data as an Asset: DataFi is Opening a New Blue Ocean"

Original Author: anci_hu49074, Core Contributor at Biteye

"We are in an era where the world is competing to build the best foundational models. While computing power and model architecture are important, the real moat is the training data."

- Sandeep Chinchali, Chief AI Officer at Story

Discussing the potential of the AI Data track starting from Scale AI

If we are to pick the biggest piece of gossip in the AI circle this month, it has to be Meta flexing its cash, with Zuckerberg recruiting talent everywhere to assemble a star-studded Meta AI team composed largely of Chinese researchers. Leading it is 28-year-old Alexandr Wang, founder of Scale AI, a company now valued at $29 billion whose clients include the U.S. military as well as rival AI giants such as OpenAI, Anthropic, and Meta itself, all of which rely on Scale AI for data services. Scale AI's core business is providing large volumes of accurately labeled data.

Why can Scale AI stand out among many unicorns?

The reason lies in its early recognition of the importance of data in the AI industry.

Computing power, models, and data are the three pillars of AI models. If we compare a large model to a person, then the model is the body, computing power is the food, and data is knowledge/information.

In the years since the rise of LLMs, the industry's focus has shifted from models to computing power. Most models have settled on the transformer as their framework, with occasional innovations such as MoE or MoRe. Major players either build their own super clusters to secure computing power or sign long-term agreements with strong cloud providers such as AWS. Once basic computing needs are met, the importance of data gradually comes to the fore.

Unlike traditional B2B big-data companies such as Palantir, well known in the secondary market, Scale AI, as its name suggests, is dedicated to building a solid data foundation for AI models. Its business goes beyond mining existing data: it also looks toward long-term data generation, assembling AI-trainer teams of experts from different fields to supply higher-quality training data for model training.

If you are skeptical about this business, let's first take a look at how models are trained.

Model training consists of two parts—pre-training and fine-tuning.

The pre-training part is somewhat like the process of a human baby gradually learning to speak. We typically need to feed the AI model a large amount of text, code, and other information obtained from web crawlers. The model learns to speak human language (academically referred to as natural language) through self-learning, acquiring basic communication skills.

The fine-tuning part is similar to going to school, where there are usually clear rights and wrongs, answers, and directions. Schools cultivate students into different talents based on their positioning. We also train the model using some pre-processed, targeted datasets to equip it with the capabilities we expect.
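The two phases above can be sketched with toy data-preparation code. This is a minimal illustration, not any lab's actual pipeline; word-level `split()` tokens stand in for real tokenizers, and the example strings are invented:

```python
# Pre-training: the model self-supervises by predicting the next token
# from raw, barely processed web text (the "learning to speak" phase).
def next_token_pairs(text):
    tokens = text.split()  # toy word-level tokenizer (assumption)
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

raw = "data is the new moat for large models"
pretrain_examples = next_token_pairs(raw)
# e.g. (["data"], "is"), (["data", "is"], "the"), ...

# Fine-tuning: curated (instruction, ideal answer) pairs with "clear
# rights and wrongs", typically written or verified by human labelers.
finetune_examples = [
    {"instruction": "Classify the sentiment: 'great product'",
     "answer": "positive"},
]
```

The contrast is the point: pre-training data needs volume, while fine-tuning data needs human judgment baked into every example, which is exactly the labor the AI data track sells.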

At this point, you may have realized that the data we need is also divided into two parts.

· One part of the data does not require much processing; it just needs to be abundant. It is usually sourced from large UGC platforms such as Reddit, Twitter, and GitHub, from public literature databases, and from private corporate databases.

· The other part, like professional textbooks, requires careful design and selection to ensure it can cultivate specific excellent qualities in the model. This requires necessary data cleaning, filtering, labeling, and human feedback.

These two kinds of datasets constitute the main body of the AI data track. Do not underestimate these seemingly low-tech datasets: the mainstream view is that as the returns from scaling computing power diminish, data will become the most important pillar by which large-model vendors maintain a competitive edge.

As model capabilities continue to improve, various more refined and specialized training data will become key influencing variables for model capabilities. If we further compare model training to the cultivation of martial arts masters, then high-quality datasets are the best martial arts manuals (to complete this analogy, we could also say that computing power is the elixir, and the model is the innate talent).

From a vertical perspective, AI Data is also a long-term track with snowballing capabilities. As early work accumulates, data assets will also possess compounding abilities, becoming more valuable over time.

Web3 DataFi: The Chosen Land for AI Data

Compared with the remote labeling workforce of hundreds of thousands that Scale AI has built in places like the Philippines and Venezuela, Web3 has a natural advantage in the AI data field, giving rise to the new term DataFi.

In an ideal scenario, the advantages of Web3 DataFi are as follows:

1. Data sovereignty, security, and privacy guaranteed by smart contracts

At a stage where existing public data is close to being exhausted, further mining undisclosed and even private data is an important direction for expanding data sources. This poses a question of trust: do you sell your data under a centralized company's contract, or do you go on-chain, retaining control over your data IP while smart contracts make clear who uses your data, when, and for what purpose?

Additionally, for sensitive information, techniques such as zero-knowledge proofs (zk) and trusted execution environments (TEE) can ensure that your private data is processed only by machines that keep it confidential, without being leaked.
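The access-control idea above can be simulated in a few lines. This is a toy Python sketch, not a real smart contract or any project's actual design; the names (`DataAccessLedger`, `grant`, `access`) are hypothetical:

```python
import time

class DataAccessLedger:
    """Toy simulation of on-chain data sovereignty: the owner grants
    usage rights, and every access is appended to a log, so the owner
    can see who used the data, when, and for what purpose."""

    def __init__(self, owner):
        self.owner = owner
        self.grants = {}  # consumer -> permitted purpose
        self.log = []     # append-only access record

    def grant(self, caller, consumer, purpose):
        assert caller == self.owner, "only the data owner can grant access"
        self.grants[consumer] = purpose

    def access(self, consumer, purpose):
        if self.grants.get(consumer) != purpose:
            raise PermissionError("no grant for this consumer/purpose")
        self.log.append((consumer, purpose, time.time()))
        # Raw data never leaves custody; only a handle is returned
        # (in practice this would be ciphertext or a TEE session).
        return "ciphertext-or-tee-handle"

# The owner grants a model lab one specific purpose; the use is logged.
ledger = DataAccessLedger(owner="alice")
ledger.grant("alice", "model_lab", "fine-tuning")
ledger.access("model_lab", "fine-tuning")
```

The design choice to illustrate is that permissions are purpose-scoped: the same consumer asking to use the data for an ungranted purpose (say, resale) is rejected, which is what a contract-enforced grant buys you over a paper agreement.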

2. Natural geographical arbitrage advantage: A free distributed architecture that attracts the most suitable labor force

Perhaps it is time to challenge traditional labor production relationships. Instead of searching for low-cost labor worldwide like Scale AI, we can leverage the distributed characteristics of blockchain and use publicly transparent incentive measures guaranteed by smart contracts to allow dispersed labor around the world to participate in data contribution.

For labor-intensive tasks such as data labeling and model evaluation, the Web3 DataFi approach brings greater participant diversity than the centralized data-factory model, which matters in the long run for avoiding data bias.

3. Clear incentive and settlement advantages of blockchain

How to avoid tragedies like the "Jiangnan Leather Factory" (a Chinese meme for a boss who absconds without paying wages)? Naturally, by replacing the darker side of human nature with smart contracts whose incentives are clearly priced.

Against the inevitable backdrop of de-globalization, how can low-cost geographical arbitrage continue? Opening companies around the world has clearly become harder, so why not bypass the barriers of the old world and embrace on-chain settlement?
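The settlement guarantee described in this section can be sketched as an escrow: the buyer locks funds up front, and each verified task releases a fixed payout, so workers need not trust the employer to pay. This is a hypothetical Python simulation, not any project's actual contract:

```python
class TaskEscrow:
    """Toy sketch of smart-contract settlement for labeling work
    (hypothetical design): funds are deposited before work starts,
    and payouts are released automatically per verified task."""

    def __init__(self, deposit, price_per_task):
        assert deposit >= price_per_task > 0
        self.balance = deposit        # funds locked by the data buyer
        self.price = price_per_task
        self.paid = {}                # worker -> total earned

    def settle(self, worker, task_verified):
        # Unverified work earns nothing; an underfunded escrow pays nothing,
        # which the worker can check before accepting the task.
        if not task_verified or self.balance < self.price:
            return 0
        self.balance -= self.price
        self.paid[worker] = self.paid.get(worker, 0) + self.price
        return self.price

# A buyer locks 100 units; one verified task pays out, one rejected
# task does not.
escrow = TaskEscrow(deposit=100, price_per_task=10)
escrow.settle("worker_1", task_verified=True)
escrow.settle("worker_1", task_verified=False)
```

The point of the sketch is that payment logic is inspectable before any labor is performed, which is the "clear incentive and settlement" advantage the section claims.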

4. Facilitating the construction of a more efficient and open "one-stop" data market

"Middlemen taking a cut" is a perpetual pain for both sides of supply and demand. Rather than letting a centralized data company act as middleman, we can build an on-chain platform, like a public marketplace such as Taobao, that connects data supply and demand more transparently and efficiently.

As the on-chain AI ecosystem develops, the demand for on-chain data will become more vigorous, segmented, and diverse. Only a decentralized market can efficiently digest this demand and transform it into ecological prosperity.

For retail investors, DataFi is also the decentralized-AI track most accessible to ordinary participants.

Although AI tools have lowered the learning threshold to some extent, and the original intention of decentralized AI is to break the giants' current monopoly on the AI business, it must be acknowledged that many current projects offer little for retail participants without technical backgrounds: joining a decentralized computing network often requires expensive upfront hardware, and the technical barriers of model marketplaces easily deter ordinary users.

By contrast with the scarce openings the AI revolution has offered ordinary users, Web3 lets you participate without signing a contract with a data sweatshop: just log into your wallet and complete various simple tasks, such as providing data, labeling and evaluating model outputs using human intuition, making simple creations with AI tools, or trading data. For seasoned participants, the difficulty is essentially zero.

Potential Projects in Web3 DataFi

Where the money flows, the trend follows. In the Web2 world, Scale AI received a $14.3 billion investment from Meta and Palantir's stock soared more than 5x in a year; in Web3, the DataFi track has likewise performed well in fundraising. Below is a brief introduction to these projects.

Sahara AI, @SaharaLabsAI, raised $49 million

Sahara AI's ultimate goal is to create a decentralized AI super infrastructure and trading market, with the first trial segment being AI Data. Its DSP (Data Services Platform) public beta version will launch on July 22, allowing users to earn token rewards by contributing data and participating in data labeling tasks.

Link: app.saharaai.com

Yupp, @yupp_ai, raised $33 million

Yupp is a feedback platform for AI models that primarily collects user feedback on model outputs. The current main task is for users to compare outputs from different models for the same prompt and select the one they believe is better. Completing tasks earns Yupp points, which can later be exchanged for USDC and other fiat-backed stablecoins.

Link: https://yupp.ai/

Vana, @vana, raised $23 million

Vana focuses on transforming users' personal data (such as social media activity, browsing history, etc.) into monetizable digital assets. Users can authorize the upload of personal data to the corresponding DataDAOs' Data Liquidity Pools (DLP), where this data will be aggregated for tasks such as participating in AI model training, and users will receive corresponding token rewards.

Link: https://www.vana.org/collectives

Chainbase, @ChainbaseHQ, raised $16.5 million

Chainbase's business focuses on on-chain data, currently covering over 200 blockchains, transforming on-chain activities into structured, verifiable, and monetizable data assets for dApp developers. Chainbase primarily acquires data through multi-chain indexing and processes it using its Manuscript system and Theia AI model, with limited participation opportunities for ordinary users.

Sapien, @JoinSapien, raised $15.5 million

Sapien aims to massively convert human knowledge into high-quality AI training data, allowing anyone to perform data labeling tasks on the platform, ensuring data quality through peer verification. Users are also encouraged to build long-term reputations or make commitments through staking to earn more rewards.

Link: https://earn.sapien.io/#hiw
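Peer verification of labels, as described above, usually reduces to some form of weighted voting. The sketch below is a toy Python illustration of one plausible mechanism (stake-weighted majority); it is an assumption for illustration, not Sapien's actual algorithm:

```python
from collections import defaultdict

def stake_weighted_label(votes):
    """Toy peer-verification sketch (hypothetical mechanism): each
    reviewer's vote counts in proportion to the stake they risk, and
    the winning label is the one with the most staked weight behind it.
    `votes` is a list of (label, stake) pairs."""
    weight = defaultdict(float)
    for label, stake in votes:
        weight[label] += stake
    return max(weight, key=weight.get)

# Three reviewers check one labeled image; the high-stake honest
# majority outweighs a single low-effort vote.
consensus = stake_weighted_label([("cat", 50), ("cat", 30), ("dog", 10)])
```

Tying votes to stake is one way to make the "bad data drives out good data" problem expensive for opportunists: a reviewer who consistently loses votes also loses the stake or reputation behind them.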

Prisma X, @PrismaXai, raised $11 million

Prisma X aims to create an open coordination layer for robots, where physical-world data collection is key. The project is in its early stages; according to a recently released white paper, participation may involve investing in robots or teleoperating robots to collect data. Currently, a quiz activity based on the white paper is open for earning points.

Link: https://app.prismax.ai/whitepaper

Masa, @getmasafi, raised $8.9 million

Masa is one of the leading subnet projects in the Bittensor ecosystem, currently operating Data Subnet 42 and Agent Subnet 59. The data subnet is dedicated to providing real-time access to data, primarily through miners crawling real-time data from X/Twitter using TEE hardware, making participation difficult and costly for ordinary users.

Irys, @irys_xyz, raised $8.7 million

Irys focuses on programmable data storage and computation, aiming to provide efficient, low-cost solutions for AI, decentralized applications (dApps), and other data-intensive applications. Currently, there are limited opportunities for ordinary users to participate in data contributions, but there are multiple activities available during the current testnet phase.

Link: https://bitomokx.irys.xyz/

ORO, @getoro_xyz, raised $6 million

ORO aims to empower ordinary people to contribute to AI. Supported methods include: 1. Linking personal accounts to contribute personal data, including social accounts, health data, e-commerce, and finance accounts; 2. Completing data tasks. The testnet is now live and open for participation.

Link: app.getoro.xyz

Gata, @Gata_xyz, raised $4 million

Positioned as a decentralized data layer, Gata has launched three key products for participation: 1. Data Agent: a series of AI agents that automatically run and process data as soon as users open a webpage; 2. AI-in-one Chat: a mechanism similar to Yupp's model evaluation for earning rewards; 3. GPT-to-Earn: a browser plugin that collects users' conversation data on ChatGPT.

Link: https://app.gata.xyz/dataAgent

https://chromewebstore.google.com/detail/hhibbomloleicghkgmldapmghagagfao?utm_source=item-share-cb

How to view these current projects?

Currently, the barriers to entry for these projects are generally not high, but it must be acknowledged that once user and ecosystem stickiness accumulates, platform advantages build up quickly. Early efforts should therefore focus on incentives and user experience; only by attracting enough users can this big-data business succeed.

However, as labor-intensive projects, these data platforms must also consider how to manage labor and ensure the quality of data output while attracting human resources. After all, a common issue with many Web3 projects is that most users on the platform are merely opportunists—sacrificing quality for short-term benefits. If they become the main users of the platform, it will inevitably lead to bad data driving out good data, ultimately compromising data quality and failing to attract buyers. Currently, we see projects like Sahara and Sapien emphasizing data quality and striving to establish long-term healthy cooperative relationships with the labor on their platforms.

Additionally, insufficient transparency is another issue facing current on-chain projects. Indeed, the impossible triangle of blockchain has led many projects to adopt a "centralization driving decentralization" approach during their startup phases. However, more and more on-chain projects give the impression of being "old Web2 projects dressed in Web3 skin"—with very few publicly traceable on-chain data and even less visible long-term commitment to openness and transparency in their roadmaps. This is undoubtedly toxic for the long-term healthy development of Web3 DataFi, and we look forward to more projects maintaining their original intentions and accelerating their steps toward openness and transparency.

Finally, the path to mass adoption of DataFi can be viewed in two parts: one part is attracting enough toC participants to join this network, forming a new force in data collection/generation projects and consumers of the AI economy, creating an ecological closed loop; the other part is gaining recognition from currently mainstream toB companies, as they are the main source of large data contracts in the short term due to their financial strength. In this regard, we see that Sahara AI, Vana, and others have made good progress.

Conclusion

In a deterministic sense, DataFi is about nurturing machine intelligence with human intelligence over the long term while ensuring that the labor of human intelligence is rewarded through smart contracts, ultimately allowing humans to enjoy the benefits of machine intelligence.

If you are anxious about the uncertainties of the AI era, and if you still hold blockchain ideals amidst the fluctuations of the crypto world, then following the footsteps of capital giants and joining DataFi is undoubtedly a good choice to ride the wave.

This article is from a submission and does not represent the views of BlockBeats.

Disclaimer: This article represents the author's personal views only and does not reflect the position or views of this platform. It is for information sharing only and does not constitute investment advice to anyone. Any dispute between users and the author is unrelated to this platform. If any article or image on this page involves infringement, please send proof of rights and identity to support@aicoin.com, and the platform's staff will verify it.
