Anthropic used 16 Claude agents to write a C compiler in Rust and successfully compiled the Linux kernel


Anthropic has documented an experiment that pushes autonomous programming with AI to a new limit: 16 instances of Claude building in parallel, from scratch and in Rust, a C compiler capable of compiling the Linux 6.9 kernel. The result, achieved after nearly 2,000 sessions and spending close to USD $20,000, not only puts on the table a new form of engineering with "teams of agents", but also uncomfortable questions about the quality, technical limits, and security of software produced without constant human supervision.


- 16 Anthropic Claude agents were tasked with writing a C compiler in Rust, and Anthropic reports the result can build Linux 6.9 for x86, ARM, and RISC-V.

- The project consumed about 2,000 Claude Code sessions and cost about USD $20,000, including 2 billion input tokens and 140 million output tokens.

- The test yielded lessons on testing, CI, and parallelism, but also exposed limitations: partial reliance on GCC, no native assembler or linker, and code quality short of an expert's.


An autonomous programming experiment aimed at "software at scale"

Anthropic published a technical report signed by Nicholas Carlini, a researcher on the safeguards team, describing a supervision-and-execution approach for language models called "agent teams". The central idea is to run multiple instances of Claude in parallel, working on a shared codebase, with no active human intervention during day-to-day execution.

To put the approach to the test, Carlini tasked 16 agents with building a C compiler in Rust, from scratch, with a goal that works as a stress test: the result had to compile the Linux kernel. According to the report, after nearly 2,000 Claude Code sessions and a total cost of close to USD $20,000, the team produced a 100,000-line compiler that builds Linux 6.9 on x86, ARM, and RISC-V.

The text emphasizes that the compiler is an interesting artifact in its own right, but the purpose of the experiment is to derive lessons about designing harnesses for long-running autonomous agents. The focus falls in particular on how to write tests that keep agents on track, how to structure the work so progress can be parallelized, and at what points the practical limits of this scheme appear.

For technology, AI, and market audiences, the story strikes a chord: if software development moves from an artisanal process to agent-driven "industrial capacity", the marginal cost of producing software can drop. That promise is attractive to startups and corporations, but it also magnifies the impact of errors, regressions, or vulnerabilities when no one reviews each change manually.

How Anthropic made "long-running Claudes" possible with a simple loop

According to Carlini, the starting point is that existing scaffolding such as Claude Code often assumes a human operator is available. In long tasks, the model may solve one part but then stop and wait for a new instruction, a clarification, or a status update.

To support autonomous progress, the authors designed a harness that places Claude in a loop. When a session ends, the system immediately starts the next one. The report includes a sample script that runs continuously and redirects its output to a log file, along with an explicit warning: it must be run in a container, not on a real machine.
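The report's actual script is not reproduced here, but the restart loop it describes can be sketched in Python. The function name, log format, and `max_sessions` parameter are illustrative, not from the report; the report's version loops forever.

```python
import datetime
import subprocess


def run_sessions(cmd, logfile, max_sessions=None):
    """Restart the agent command as soon as the previous session exits.

    With max_sessions=None this loops forever, as the harness described
    in the report does; a finite value is useful for testing. All output
    is appended to a log file rather than a terminal, since no human is
    watching the session.
    """
    count = 0
    while max_sessions is None or count < max_sessions:
        with open(logfile, "a") as log:
            log.write(f"--- session {count} at {datetime.datetime.now()} ---\n")
            log.flush()  # keep the marker ordered before the child's output
            subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
        count += 1
    return count
```

As the report warns, anything like this should run inside a container: the looped agent executes arbitrary commands with no human in the way.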

He explains that the agent prompts ask Claude to break the problem into smaller parts, track its actions, decide its next move, and continue "until it's perfect." Still, the text acknowledges a typical automation failure: in one case, Claude executed a command that killed its own process, breaking the loop.

Anecdotes aside, this approach exposes a typical strain of artificial intelligence applied to engineering: if a human is removed from the path, any operational error can have a cumulative effect. In a corporate environment, these risks are usually mitigated with minimal permissions, ephemeral environments, and auditing, measures the author himself points to when recommending the use of containers.

Parallelism with Docker and Git: Coordination without orchestrators

The report argues that running multiple instances in parallel attacks two weaknesses of a single agent: one session can only do one thing at a time, and specialization becomes difficult. With multiple agents, some can fix bugs while others do housekeeping, improve performance, or take on specialized tasks.

The implementation described is intentionally "rudimentary". An empty Git repository is created and, for each agent, a Docker container is started with the repository as upstream. Each agent clones a local copy, works on it, and pushes its changes upstream when the job is done.

To prevent two agents from solving the same problem at the same time, the harness uses a locking mechanism based on text files inside a tasks directory. An agent claims a task by writing a file, and if another agent tries to claim the same one, syncing with Git forces it to choose a different task. The agent then integrates the changes, resolves merge conflicts, pushes, and removes its locks.
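The claim step of that protocol can be sketched locally. In the report's setup the conflict is detected through Git (two agents pushing the same lock file produce a conflict on sync); this simplified sketch uses atomic exclusive file creation to get the same "first writer wins" behavior on one machine. Function names and the lock-file naming are assumptions for illustration.

```python
import os


def claim_task(task, tasks_dir, agent_id):
    """Try to claim a task by creating its lock file.

    Mode "x" makes the create atomic: if another agent already wrote the
    lock, open() raises FileExistsError and the claim fails, so the agent
    must pick a different task. In the report's distributed version, the
    same effect comes from a rejected Git push of the lock file.
    """
    os.makedirs(tasks_dir, exist_ok=True)
    lock_path = os.path.join(tasks_dir, f"{task}.lock")
    try:
        with open(lock_path, "x") as f:
            f.write(agent_id)
        return True
    except FileExistsError:
        return False


def release_task(task, tasks_dir):
    # Called after the agent has merged, resolved conflicts, and pushed.
    os.remove(os.path.join(tasks_dir, f"{task}.lock"))
```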

The most striking part of the description is what is missing from the system: no agent orchestration, no extra communication channel between agents, and no persistent high-level plan. The author explains that he lets each Claude decide what to do, and in many cases the model takes on "the next most obvious problem", recording failed and pending attempts.

The main lesson: extremely high-quality tests, or the agent optimizes the wrong thing

For Carlini, an infinite loop only works if the agent knows how to act. So much of the effort went into the environment: tests, scenarios, feedback, and mechanisms to guide Claude without constant human supervision.

An important recommendation in the report is to write "near-perfect" tests. If the verifier is poorly designed, the model will solve the wrong problem, and do so efficiently. Improving the test harness included finding high-quality compiler test suites, building checkers and scripts to compile open source packages, and studying failure modes to create new test targets.

Near the end of the project, the report says, Claude began breaking existing functionality every time it implemented a new compiler phase. To contain this pattern, the author set up continuous integration and imposed stricter rules, intended to prevent new commits from regressing behavior already achieved.
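Read as a continuous-integration gate, the rule reduces to something very small: run the regression suite before accepting a commit, and reject the commit on any failure. A minimal sketch under that reading, with an illustrative function name:

```python
import subprocess


def gate_commit(test_cmd):
    """Run the regression suite; only a clean exit lets the commit land.

    This is the "do not regress behavior already achieved" rule in its
    simplest form: new work is rejected unless every previously passing
    test still passes (exit status 0).
    """
    result = subprocess.run(test_cmd)
    return result.returncode == 0
```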

This point resonates beyond the compiler case. In any agent system, the tests become the "contract" that governs behavior. For the software market, this translates into a direct consequence: investment shifts from writing code to designing the checks, tests, and quality gates that prevent wrong decisions.

Designing around model limitations: context and "time blindness"

Another lesson from the report is summed up as "putting yourself in Claude's shoes". Each agent starts in a fresh container with no context, and the author notes that the model needs time to orient itself, especially in large projects. To reduce that friction, he included instructions to maintain extensive README files and progress notes, with frequent updates.

The document also lists limitations that must be "designed around". One is contamination of the context window: the harness should not print thousands of useless bytes. According to Carlini, it is advisable to display only a few lines and save the rest to files, in easy-to-process formats, for example marking failures with the word "ERROR" on the same line so that tools like grep can locate them.
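That convention is easy to adopt in any harness. A sketch of what such a summarizing logger might look like (the function names and line format are illustrative, not the report's code):

```python
def log_result(name, passed, detail, log_path):
    """Append one line per test, with failures marked ERROR.

    The agent's context window only ever needs a few summary lines; the
    full detail lives in the log file, where `grep ERROR` pulls out
    exactly the failing cases because the keyword sits on the same line
    as the test name and detail.
    """
    status = "PASS" if passed else "ERROR"
    with open(log_path, "a") as f:
        f.write(f"{status} {name}: {detail}\n")


def failing_tests(log_path):
    # The grep equivalent: return only the lines flagged ERROR.
    with open(log_path) as f:
        return [line.strip() for line in f if line.startswith("ERROR")]
```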

Another limitation is "time blindness". The report says Claude cannot tell how much time has passed and, left alone, can run tests for hours instead of making progress. As a mitigation, the harness runs the full test suite only infrequently and, by default, executes a deterministic subsample of 1% or 10% of the tests, enough to cover the files without flooding the context.
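One way to make such a subsample deterministic is to bucket each test by a hash of its path instead of drawing random numbers, so every run selects the same subset. This hashing scheme is an assumption for illustration; the report does not specify how its sampling is implemented.

```python
import hashlib


def in_subsample(test_path, percent):
    """Deterministically decide whether a test belongs to the subsample.

    Hashing the path gives a stable pseudo-random bucket in [0, 100):
    the same tests are chosen on every run, so failures reproduce,
    while raising `percent` smoothly widens coverage toward a full run.
    """
    digest = hashlib.sha256(test_path.encode()).hexdigest()
    return int(digest, 16) % 100 < percent
```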

In product-design terms, these workarounds show that autonomy does not depend on the model alone. It depends on the feedback interface, on how status is summarized, and on how work is prioritized. That is an important caveat for companies adopting agents: without good signal design, AI will waste computation on low-value activity.

When the Linux kernel defeats parallelism: GCC as an oracle

Parallelism worked well while many independent tests were failing: each agent could chase a different error. According to the report, once the agents reached a 99% pass rate on the compiler test suites, they moved on to building small and medium-sized open source projects such as SQLite, Redis, libjpeg, QuickJS, and Lua.

The Linux kernel, however, posed a different problem. Compiling Linux is a single, huge task, and the report describes how multiple agents would hit the same bug, fix it, and then overwrite each other's changes. In that scenario, having 16 agents was no help, because they were all working on the same job.

The solution was to use GCC as an online "oracle" compiler. The author built a new harness that compiled a random majority of the kernel with GCC and the rest with the Claude-built compiler. If the resulting kernel works, the bug is not in the files Claude's compiler handled.

This approach let the agents attribute errors to specific files and work on them in parallel. The text adds that it was still necessary to apply delta debugging to find pairs of files that failed together but worked separately. It is a concrete example of how parallelism also demands problem-partitioning techniques.
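The random split the oracle harness performs can be sketched as follows. The function name, compiler labels, and seeding are illustrative assumptions; the report does not publish this code.

```python
import random


def assign_compilers(c_files, claude_fraction, seed):
    """Randomly split kernel sources between the two compilers.

    Files assigned to the Claude-built compiler are the suspects; the
    rest go to GCC, the trusted oracle. If the resulting kernel works,
    the bug is not in the suspect set; repeating the split with fresh
    seeds rapidly narrows each bug down to specific files.
    """
    rng = random.Random(seed)  # seeded so a failing split is reproducible
    return {
        f: ("claude-cc" if rng.random() < claude_fraction else "gcc")
        for f in c_files
    }
```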

Results and numbers: 2,000 sessions, 2 billion tokens, a 100,000-line compiler

Anthropic presents the project as a measure of capability. Carlini says he used the compiler as a benchmark throughout the Claude 4 series. The initial plan was ambitious: a freestanding, GCC-compatible compiler capable of compiling the Linux kernel, with multi-backend support and an SSA-based IR to enable optimization passes.

In its evaluation, the report indicates that over nearly 2,000 sessions across two weeks, Opus 4.6 consumed 2 billion input tokens and generated 140 million output tokens, at a total cost of "just under" USD $20,000. The author describes it as an extremely expensive project compared with commercial plans, but also a fraction of what it would cost to develop alone or with a team.

The text states that this was a "clean room" implementation: Claude had no Internet access during development, and the compiler relies only on the Rust standard library. Besides building a bootable Linux 6.9 on x86, ARM, and RISC-V, the compiler can compile QEMU, FFmpeg, SQLite, PostgreSQL, and Redis, and achieves a 99% pass rate on the GCC torture tests and most other compiler test suites it was run against.

As a cultural nod to developers, the report notes that the compiler also builds and runs Doom. The detail works as shorthand: it does not prove standards compliance, but it suggests a practical level of integration that normally requires many pieces of the build and runtime ecosystem to be working.

Acknowledged limitations: partial GCC dependency and uneven code quality

The report is clear about the limitations. One is the lack of the 16-bit x86 support required to boot Linux properly, so the system falls back to GCC at that stage. The article states that the compiler does have its own x86_32 and x86_64 backends.

It is also acknowledged that the compiler has no assembler or linker of its own. According to Carlini, these were parts Claude eventually began to implement, but they remained buggy. Even the demo video, the author notes, was produced with GCC's assembler and linker.

Another limitation is incomplete coverage: the compiler can build many projects, but not all. The report makes clear that it is not a complete replacement for a real compiler.

Finally, the author rates the quality of the Rust code as "reasonable", but far from what an expert programmer would produce. He also argues that the compiler has reached roughly the limit of Opus's capabilities: new features or fixes often break existing functionality, a pattern that forced stricter testing and integration discipline.

A hard case: why 16-bit x86 became a bottleneck

The report gives a specific example of a limitation it could not engineer around: generating the 16-bit x86 code needed to boot in real mode.

The technical discussion adds that the compiler can produce 16-bit x86 directly using the 66/67 opcode prefixes, but the resulting code is over 60 KB. Since that exceeds the 32 KB limit Linux allows at that boot stage, it cannot be used there.

As a result, the system "cheats" and calls GCC for this phase. The text clarifies that this applies only to x86; for ARM or RISC-V, the compiler handles everything on its own.

This kind of limitation illustrates a common situation with deployed agents: it is not enough for the output to be functionally correct in isolation. It must also satisfy global constraints such as maximum size and target conventions. There, agents may need more guidance, better specifications, or purpose-built instructions.

Looking ahead: ambition, productivity, and a safety warning

Carlini closes with an evolutionary reading of coding with language models. He describes a sequence: first autocomplete in the IDE, then completing whole functions from docstrings, then pair programming with agents like Claude Code. In his vision, agent teams aspire to implement entire projects autonomously, allowing users to be more ambitious.

However, the report notes that fully autonomous development carries real risks. When a human sits with the model, they can keep quality consistent and catch errors in real time. In autonomous systems, by contrast, it is easy to accept that the tests pass and the work is done, even if only "barely", according to the text.

The author ties this concern to his previous experience in penetration testing and exploiting vulnerabilities in large companies' products. He finds unsettling the idea of using software that no one has personally verified. The remark is not a rejection of the technology, but a reminder that automation amplifies both the good and the bad.

Anthropic also says the compiler's source code is available and invites readers to download it, read it, and try it out. The author maintains that the best way to understand what models can do is to push them to their limits and see where they start to break. In the "coming days" he will keep asking Claude to make changes addressing the remaining limitations, framing it as a live project and an ongoing lab.
