Who is the best at using Claude Code? The answer might not be programmers.

CN
PANews
Follow
2 hours ago

Author: Anthropoic

Translator: Peggy

Editor's Note: This report is based on approximately 400,000 Claude Code sessions and discusses how AI programming tools are changing the relationship between humans and code.

The core finding of the article is: in agent programming, humans primarily decide "what to do", while Claude is mainly responsible for "how to do it." Users take on most of the planning decisions, while Claude handles the majority of the execution work. In other words, AI is taking over the implementation aspects such as writing code, changing files, running commands, and debugging, but goal setting and result evaluation still rely on humans.

More importantly, the effectiveness of using Claude Code does not solely depend on whether the user is a programmer. The report shows that among users in non-technical professions such as law, finance, management, and research, the success rate in generating code-related tasks is nearly on par with software engineers. What truly affects the outcome is whether users understand the problem they are trying to solve.

This means that AI programming reduces the barrier to implementation, rather than the barrier to judgment. In the future, those who understand business, scenarios, and can clearly articulate requirements and evaluate results may leverage AI better than those who merely know how to write code. AI will not automatically replace domain knowledge; instead, it will amplify the value of domain knowledge.

Here is the original text:

Key Findings

Building on existing research, we proposed a framework for studying interactive agent programming. This framework is based on a privacy-protected analysis of approximately 400,000 Claude Code sessions from October 2025 to April 2026, assessing task composition, human-AI collaboration methods, and task success rates.

In a typical session, humans are responsible for most planning decisions, i.e., deciding "what to do"; Claude is responsible for most execution decisions, i.e., deciding "how to accomplish it." The stronger the user’s expertise in a certain domain, the greater the amount of work triggered by each command to be completed by Claude. In coding tasks, the average success rate of various major occupational groups—namely whether they accomplish what the user originally intended, with verifiable evidence such as passing tests or submitting code—is almost on par with software engineers.

The stronger the user's domain expertise, the more likely the session will end successfully. However, the gap between intermediate users and expert users is not significant. Over the seven months we observed, the proportion of sessions for debugging nearly halved, and the usage trend shifted towards more end-to-end agent applications: deploying and running code, analyzing data, and drafting non-code documents.

During these seven months, the value of typical tasks increased across nearly all job types. We estimated the task value by comparing it with information posted for freelance jobs, and the results indicated an average increase of about 25%.

Introduction

Agent programming is rapidly rising. Since the end of 2025, the proportion of coding agent activities in GitHub projects has more than doubled, with Claude Code users now averaging 20 hours of tool use per week. Can individuals without formal programming experience successfully direct an agent to complete complex technical work? How will the rapid adoption and capability enhancement of these tools impact broader knowledge work? We cannot yet provide a complete answer, but we can see some early signals from the usage data of Claude Code.

This report is based on a privacy-protected analysis of approximately 235,000 users and about 400,000 interactive sessions from October 2025 to April 2026, providing evidence of the actual use of Claude Code. It continues our earlier research on the autonomy metrics in Claude Code sessions and how Claude Code is changing the internal workings of Anthropic. This article will propose a framework for describing the use of interactive AI programming assistants: what work people are doing, who is doing the work, and whether the work is successful. We focus on how users utilize Claude Code through command-line interface (CLI), Claude.ai, or Claude Code desktop application. By tracking how agent programming usage changes as model capabilities improve, we can better understand the impact of these tools on the programming professional and knowledge worker labor market.

What is happening with Claude Code may herald the future direction of knowledge work: agents are gradually embedding themselves in non-coding tasks. We find that Claude is handling more complex and valuable tasks. At the same time, there remains a clear division of labor in agent programming: humans decide what to build, while the agents decide how to build it.

We also see evidence that the real enhancement in tool usage effectiveness comes from domain expertise, rather than programming proficiency. In particular, domain experts tend to succeed more easily and recover more readily from mistakes and misunderstandings. However, the gap between experts and intermediate users is not large. This suggests that as long as one has sufficient proficiency in a certain domain, they can use such tools almost as effectively as deep experts.

These findings allow us to observe potential shifts in the labor market. In our data, success depends on whether a person understands the problem they are trying to solve rather than their programming training. If these patterns hold true across the entire economy, it suggests that while agent programming tools may be absorbing some implementation-focused work, they are simultaneously rewarding those who truly understand the problems they are solving in their work. Coding agents are not replacing domain expertise. Instead, the more understanding a worker brings to the agent, the more high-quality work the agent can accomplish.

Division of Labor

What People Do with Claude Code

To understand how people use Claude Code, we categorized each session into one of nine work modes, identifying the single activity that best describes the session's goal. Four of these modes directly involve coding or maintenance: building new things, fixing broken things, testing code, and orchestrating other agents or automated pipelines. Another category involves operating software, including deploying, configuring, running pipelines, and monitoring systems. Two more categories pertain to figuring out "what to do": understanding how an existing system works and planning changes before making modifications. Finally, the last two categories are unrelated to code or where code is just an auxiliary part of the final product: analyzing data and communicating through presentations and other text-based documents.

About 56% of sessions involve writing code (25%), fixing code (26%), or testing and orchestrating code (5%). Operating software accounts for 17%, planning or exploration accounts for 14%, and analyzing or writing text accounts for 13% (see Figure 1).

Figure 1: Nine Work Modes. Each interactive session is categorized according to the single work mode that best describes its goal.

We first had the model read session records and classify each session accordingly; then, we used our privacy-protected analysis tool to cross-validate the classification results with telemetry data automatically recorded for each session, including whether lines of code were added or deleted. There is a high degree of consistency between the two sources. For example, in sessions marked by our classifier as creating or modifying code, over 90% also show code changes in the telemetry data. Details are in the appendix.

Who Makes Decisions

How autonomous is Claude Code? Capability assessments show its upper limits are already quite high and still rising. For instance, in benchmark tests such as the METR duration assessments, cutting-edge models can now autonomously complete software tasks that would originally take humans hours, overcoming obstacles in the process. But what does this look like in actual use? Here, we focus on how much guiding work humans and Claude each undertake in real sessions.

We studied this issue from two perspectives. First, we looked at the extent to which people delegate decision-making to Claude; second, we observed how much action they allocate to Claude. To understand the degree of decision delegation in a session, we constructed a privacy-protected decision attribution classifier based on the session content. We asked the classifier to list all meaningful decisions made during the session and classify them as planning decisions and execution decisions. Planning decisions include what to do, which methods to adopt, and what counts as completion; execution decisions include which files to modify, what code to write, in which language to write, and which commands to run. The classifier then attributes each decision to Claude or the user, generating two numbers for each session: the proportion of planning decisions made by the user and the proportion of execution decisions made by the user.

On average, humans make about 70% of planning decisions but only 20% of execution decisions (see Figure 2). In actual use, agent programming illustrates a clear division of labor: humans decide what to build, while agents decide how to build it.

To understand the extent of action delegation in a session, we looked not at the content but at the structure of the interactions. Claude Code sessions consist of back-and-forth interactions between Claude and the user: the user sends prompts, and Claude takes action; the user then sends the next prompt, and so on. In a typical session, these rounds number about four. In our historical data from October to April, for every prompt sent by the user, it triggers Claude to execute about 10 actions on average, sometimes exceeding 100 actions. In each round, Claude reads files, edits code, runs commands, and produces an average output of 2400 words.

How much work Claude completes between two user checks largely depends on who is making decisions. When the user retains control over the execution process—specifically when the user makes more than 80% of execution decisions—Claude executes fewer actions, around 8 per round. However, when Claude has planning control—when Claude makes more than 80% of planning decisions—the number of actions it undertakes is highest, around 16.

Figure 2: Proportions of Claude in Planning and Execution Decisions. This figure shows the distribution of proportions across different sessions, attributing planning decisions (what to do) and execution decisions (how to do) to Claude versus the user. In a typical session, users make about 70% of planning decisions, while Claude makes about 80% of execution decisions.

Expertise Level

Based on each session record, Claude assesses the user's apparent level of expertise on the task using a five-level scale, from novice to expert. The expertise level classifier focuses on three signals: the precision of the instructions provided by the user, what the user asks Claude to validate, and whether the user is more frequently correcting Claude or vice versa. It is important to note that this level of expertise is completely distinct from position or general capability and is crucially task-specific. A seasoned engineer encountering a Rust issue for the first time may still be a novice on Rust tasks. An accountant who has never used Python can become an expert on the task if they accurately inform Claude of specific reconciliation rules that must be followed in a Python script and identify edge cases mishandled at month-end close.

The table below shows how we define each level of expertise in the classifier and provides example requests from the publicly available coding agent session dataset SWE-chat. Conversations classified as "novice" offer vague instructions with no specific domain knowledge conveyed; those classified as "expert" relay a deep understanding of the codebase and technical environment.

Table 1: Expertise Level Classifier. Examples are rewritten, anonymized, and compressed from actual sessions labeled by our classifier. Many examples are derived from the publicly available coding agent sessions dataset SWE-chat.

We quantified the relationship between expertise level and the output and activity levels generated by Claude for each user prompt. In typical novice sessions, each prompt triggers Claude to execute about 5 actions and produce around 600 words; in expert sessions, the action chain length exceeds that of novices by more than double, approximately 12 actions, with an output reaching about 3200 words, which is five times that of novices (see Figure 3). This gap between novices and experts appears across all types of work and every task value range.

These metrics complement our previous research on the autonomy of Claude Code. Previous studies tracked the duration of agent operation and how often users autonomously approved its actions. In contrast, our decision attribution metrics capture who is making substantive decisions throughout the entire session, while the output volume and number of actions triggered by each prompt measure how much autonomous activity Claude can generate in response to human instruction.

Figure 3: Claude completes more work per prompt for more specialized users. The higher the level of expertise, the more actions (left bar chart) and greater text output (right bar chart) are generated by Claude per prompt. The boxes indicate the interquartile range, split at the median. The whiskers represent the 5th to 95th percentiles. The white dot indicates the geometric mean. Both upward trends are statistically significant (p < 0.001), and each step difference between adjacent expertise levels is also statistically significant. Controlling for work modes, task values, months, occupations, and model series while clustering errors by user, this trend remains significant: for each level increase in expertise, the number of actions increases by 9%, and output volume increases by 13%.

Who is Using Claude Code and What They Do with It

Users

To understand who is doing this work, we infer each user's profession based on session records and map them to one of 23 major categories in the Standard Occupational Classification (SOC) system of the U.S. Bureau of Labor Statistics. The classifier is tasked to make judgments solely based on the following signals: the project context loaded by the agent at the beginning of the session, file names and structure, materials or products referenced by the user, such as legal documents, clinical data, financial reports, course materials, etc., and the vocabulary used by the user. The classifier is explicitly instructed not to consider "writing code" itself as evidence of a user being in a programming profession. Only when there are clear signals indicating that software or data work is the user's profession will the session be classified under coding-related SOC categories, namely "Computer and Mathematical Occupations." If a lawyer builds a script to automatically check whether certain clauses are missing from a set of contracts, then even if the session is focused primarily on writing software, it will still be categorized under the legal profession. If there are no signals regarding the user's profession, the session will not be classified.

We were able to infer professions in about 70% of the sessions. Among these classifiable sessions, "Computer and Mathematical Occupations" is the largest group, which is not surprising as this category covers most software-related jobs. Following that are Business and Financial Operations, Arts, Design and Media, Management, as well as Life Sciences, Physical Sciences, and Social Sciences. The fastest-growing non-software occupational group in our sample is Management, Sales, and Legal professions.

Work

From October 2025 to April 2026, the composition of work completed using Claude Code underwent significant changes. The most notable change was the proportion of sessions aimed at fixing broken code, which dropped from 33% to 19% (see Figure 4). In its place, there was more work centered around coding. The proportion of software operation increased from 14% to 21%. Writing and data analysis approximately doubled, rising from around 10% to about 20%.

The value of the tasks themselves is also rising. We estimated the economic value of each session by approximating the cost of similar work in the freelance market and calibrating it with real publicly available job datasets. According to this metric, the estimated value of an average session increased by 27% during the period from October to April. This increase was observed across various work types. The values of tasks related to building, operating, and fixing were estimated to have grown by approximately 43%, 34%, and 32%, respectively. These price estimates are somewhat rough; hence, we mainly use them to compare trends over time between different tasks rather than as direct dollar-value readings. Details on how the task value estimator was constructed can be found in the appendix.

Figure 4: Changes in the Composition and Value of Work Done Using Claude Code from October 2025 to April 2026. This figure shows the proportions of various work modes within sessions over a seven-month window. The proportion of sessions aimed at fixing broken code dropped from 33% to 19%, while the proportions for software operation, data analysis, and document writing increased.

Success Depends on What Users Bring

Estimating task value is one way to understand how Claude Code helps people accomplish work. Another angle is to observe how many sessions are successful and which session characteristics are associated with success. Among all success indicators, we see a clear pattern: the higher the level of expertise exhibited by users in a session, the more likely the session is to succeed. Most of the improvements are concentrated at the lower end of the expertise spectrum, meaning the gap from novice to intermediate users is greater than from intermediate to expert users.

Before analyzing the characteristics of successful sessions, we need to accurately specify how success is measured. We cannot observe users' real-world outcomes, nor can we directly ask if they accomplished what they intended with Claude. Thus, we rely on two complementary, session record-based measuring methods. The first is "determining success," where a classifier reads complete session records to ascertain whether the user achieved their originally set goal, with options including success, partial success, failure, or no clear goal. Subsequently, two supporting classifiers evaluate the evidential strength of this judgment to determine "verified success." The success signal classifier looks for verifiable evidence of success, particularly including git activities matching the work, such as submissions and pull requests, passing test suites, and explicit user acknowledgment. It scores sessions on a scale from "no signal" to "weak signal" (1 point) to "multiple hard signals" (5 points). Another parallel failure signal classifier scores evidence of failure, including errors, test failures, repeated attempts at the same task, and user objections to outputs. Verified success requires both conditions to be met: the session is deemed successful, and at least one hard verifiable success signal exists. The following analysis focuses on the degree of success or failure within sessions, thus we exclude sessions identified by the success result classifier as "no clear goal," accounting for about 7.7% of the full sample.

Returns on Expertise Level

So, which sessions are most likely to succeed? The results show that the expertise level ratings aforementioned have a significant impact on session success.

One might worry that expertise level isn’t the true driving factor. Perhaps experts simply choose different tasks or possess other differences. In this section, we partially address this concern by comparing sessions of the same work type, same estimated value, same month, same subject, and drawn from the same broad occupational group, examining how differing user expertise levels affect outcomes.

Table 2: Definitions of Success and Failure Derived by Classifier. Examples are drawn from actual sessions in the public interactive coding agent dataset SWE-chat, rewritten and summarized, and labeled by our classifier.

Among all success indicators, the higher the level of expertise exhibited by users in a session, the more likely the session is to succeed. Sessions rated as novices achieved a verified success rate of 15% on our strictest metric "verified success", achieving at least partial success in 77%. Conversely, sessions rated as intermediate or above saw verified success rates ranging from 28% to 33%, with partial success rates ranging from 91% to 92% (see Figure 5).

In each indicator, most gains arise from the uplift from novice to intermediate; the slope slows down from intermediate to expert. For details on the regression analysis behind Figure 5, see the appendix.

Figure 5: Expertise Level and Session Outcomes. This figure displays session results according to users’ expertise level ratings, from novice to expert across five levels. The left figure includes all sessions. The middle and right figures are limited to sessions encountering issues, specifically those with failure signal counts above 3, demonstrating different ratios of achieving defined success and failure in those sessions. Each point represents an adjusted ratio. We estimate differences between various expertise levels by only comparing sessions that share the same work mode, same task value range, same month, same task subject, and same user type, i.e., whether they belong to software-related professions. Relevant details of the regression are in the appendix. The whiskers indicate the confidence intervals for the sample mean, most of which are too small to be visible in the figure. These figures exclude sessions classified as "no clear goal" by the success result classifier.

In sessions that encountered challenges, a similar gradient can also be observed. When failure signals are recorded as verified evidence of failure, we consider that session to have "encountered problems." This might include errors arising, test failures, multiple attempts to complete the same task, or the user expressing frustration and dissatisfaction. In these problematic sessions, controlling for all the aforementioned variables, the verified success ratio rises from 4% in novice sessions to 15% in expert sessions (see Figure 5). When using more lenient success metrics, we find that the proportion of at least partial success is 60% among novice users, and 80% to 81% among intermediate to expert users.

We also tracked another reverse relationship, namely that between expertise level and various failure indicators. It is noteworthy that in this analysis, sessions classified as failures are those that did not achieve even partial success. If a problematic session is classified as a failure and has no lines of code written, we consider it abandoned. Among sessions that appear to be from novice users, 19% ultimately get abandoned; for other user groups, this ratio is between 5% and 7%. In other words, the least experienced users are more likely to abandon their efforts when facing difficulties in reaching their objectives. Part of the value of professional capability seems to lie in being able to steer the agent back in the right direction.

Profession May Be Less Important Than Expertise Level

Users in software-related professions have a verified success rate of about 30% across all sessions, while users in other professions have a success rate of about 26%. In sessions where code was generated—specifically those where at least one line of code was added or modified—these numbers are 34% and 29%, respectively (see Figure 6). If we employ a more lenient definition of success, this gap between software-related and other professions narrows even further. In coding sessions, both user categories achieve at least partial success ratios of 89% and 88%, respectively. A five-percentage-point difference is not significant and remained neither larger nor smaller over the seven months, even though success rates for both groups have been increasing. Among the ten largest occupational groups in our dataset involved in coding, each has a success rate within seven percentage points of software engineers. The management occupation group has the highest verified success rate, slightly exceeding that of software engineering professions. The higher verified success rate of managers may reflect that management skills are transferable to the task of directing agents. However, this may partly stem from our measurement method: verification relies somewhat on explicit acknowledgment from users within the session, and managers might be more accustomed to expressing satisfaction when achieving desired results.

Figure 6: Coding Session Success Rates and Verified Success Rates by Inferred Occupation. This figure shows the strict success definition ratios in sessions where at least one line of code is added or modified, categorized by the inferred user occupation, including defined success and verified success. The figure displays the ten largest occupational groups. Each group's success rate variance from software/mathematics users—in the SOC classification of Computer and Mathematical Occupations—is within seven percentage points. Error bars indicate 95% confidence intervals calculated based on different accounts.

Outlook

The results of this report outline an emerging picture: agent programming is amplifying certain knowledge and skills while replacing others. In sessions where code is generated, success rates across various professions are not significantly lower than those in software-related jobs. It appears that coding agents are making a programming background less crucial for successfully completing programming tasks.

At the same time, successful sessions are more likely to demonstrate domain expertise. Sessions rated as experts have verified success rates more than double that of novice sessions. When sessions encounter challenges, novices abandon their efforts at rates several times higher than other users. The collaborative nature of this environment brings greater clarity to this picture: domain experts are able to guide Claude to accomplish more work with each instruction. Therefore, the ability to steer Claude towards success largely stems from mastery of a domain rather than proficiency in coding. Individuals who possess this mastery in any field can now tackle technical work that was previously beyond their reach. Conversely, those who lack such professional understanding, even when using the same tools, achieve much less. Moreover, the benefits primarily arise from competence rather than mastery. Having an actionable understanding of a domain is already enough to gain most of the rewards; deep specialization on top of that will only offer marginal additional advantages.

These findings are still preliminary. Like most of our research, we are unable to measure real-world outcomes, such as whether code written in a session was later utilized or discarded, or whether it resulted in economically valuable outcomes. Additionally, the non-interactive usage excluded from this report constitutes a significant portion of overall activity. Developing a framework capable of measuring such usage will be one of our future priorities. Furthermore, all our classifications of sessions rely on the model’s reading of the session records. In the appendix, we show that the classifier remains consistent with independent telemetry data in expected directions and aligns with strong reference model judgments in most sessions. However, validation of classifiers remains challenging on a large scale; Claude Code sessions themselves add complexity as they can be lengthy and intricate, making it difficult to establish human annotations as genuine baselines.

As models, users, and the division of labor between them continue to evolve, the picture presented in this report will also be updated. We hope these metrics will help us track significant shifts currently taking place. For instance, if the returns driven by expertise level begin to decline, this would indicate that models are starting to provide users with critical judgments that previously depended on their input, and the benefits of these tools would also extend from domain experts to a broader population. If the proportion of non-software profession users successfully completing coding sessions continues to rise, it may suggest that software production is becoming part of regular work across various fields, rather than the product of a single profession. Such transformations would reshape who can benefit from agent programming and to what extent, impacting the most valued skills in the labor market.

免责声明:本文章仅代表作者个人观点,不代表本平台的立场和观点。本文章仅供信息分享,不构成对任何人的任何投资建议。用户与作者之间的任何争议,与本平台无关。如网页中刊载的文章或图片涉及侵权,请提供相关的权利证明和身份证明发送邮件到support@aicoin.com,本平台相关工作人员将会进行核查。

Share To
APP

X

Telegram

Facebook

Reddit

CopyLink