An Effective Harness for Long-Running Agents
TL;DR
- What is the "effective harness for long-running agents"?
It's a harness for getting AI agents to work across many sessions, using an initializer agent and a coding agent to keep progress consistent between context windows.
- Why doesn't compaction alone solve the problem of long-running agents?
Because even with compaction, the agent may try to do everything at once, lose context mid-implementation, and then start the next session without a clear understanding of what has already been done.
- How does the proposed solution help the agent keep working in subsequent sessions?
It creates an initial environment with init.sh, claude-progress.txt, an initial commit, and a detailed feature list, so that each new session knows the exact state of the project and advances incrementally.
We prompt coding agents to edit this file only by changing the status of a passes field, and we use strongly-worded instructions like “It is unacceptable to remove or edit tests because this could lead to missing or buggy functionality.” After some experimentation, we landed on using JSON for this, as the model is less likely to inappropriately change or overwrite JSON files compared to Markdown files.
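As a sketch of what such a feature list might look like: the file name feature_list.json and the passes field appear elsewhere in this post, but the other field names and the guard function below are illustrative assumptions, not the exact format used.

```python
import json

# Hypothetical feature-list format: each entry is an end-to-end feature
# description plus a "passes" flag. Agents are instructed to only flip
# "passes", never to add, remove, or reword entries.
features = [
    {"id": 1, "priority": "high",
     "description": "User can send a chat message and receive a response",
     "passes": False},
    {"id": 2, "priority": "medium",
     "description": "User can switch between light and dark themes",
     "passes": False},
]

with open("feature_list.json", "w") as f:
    json.dump(features, f, indent=2)

def mark_passing(path: str, feature_id: int) -> None:
    """Flip the `passes` field of one feature; never touch anything else."""
    with open(path) as f:
        data = json.load(f)
    for feature in data:
        if feature["id"] == feature_id:
            feature["passes"] = True
    with open(path, "w") as f:
        json.dump(data, f, indent=2)

mark_passing("feature_list.json", 1)
```

Structured JSON like this leaves the model a much narrower edit surface than free-form Markdown, which is part of why it is harder to corrupt accidentally.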
Incremental progress
Given this initial environment scaffolding, the next iteration of the coding agent was then asked to work on only one feature at a time. This incremental approach turned out to be critical to addressing the agent’s tendency to do too much at once.
Once working incrementally, it’s still essential that the model leaves the environment in a clean state after making a code change. In our experiments, we found that the best way to elicit this behavior was to ask the model to commit its progress to git with descriptive commit messages and to write summaries of its progress in a progress file. This allowed the model to use git to revert bad code changes and recover working states of the code base.
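The end-of-session behavior described above can be sketched as a small helper; the progress-note format is an assumption, while the claude-progress.txt file name and the git workflow come from the post.

```python
import datetime
import pathlib
import subprocess

def end_session(summary: str, repo: str = ".") -> None:
    """Leave the environment in a clean state: append a note to the
    progress file, then commit everything with a descriptive message.
    (Sketch; the exact note format is an assumption.)"""
    note = f"{datetime.date.today()}: {summary}\n"
    progress = pathlib.Path(repo) / "claude-progress.txt"
    with progress.open("a") as f:
        f.write(note)
    subprocess.run(["git", "-C", repo, "add", "-A"], check=True)
    subprocess.run(["git", "-C", repo, "commit", "-m", summary], check=True)
```

Because every session ends with a commit, a later session can always `git revert` or check out the last known-good state instead of debugging from scratch.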
These approaches also increased efficiency, as they eliminated the need for an agent to have to guess at what had happened and spend its time trying to get the basic app working again.
Testing
One final major failure mode that we observed was Claude’s tendency to mark a feature as complete without proper testing. Absent explicit prompting, Claude tended to make code changes, and even do testing with unit tests or curl commands against a development server, but would fail to recognize that the feature didn’t work end-to-end.
In the case of building a web app, Claude mostly did well at verifying features end-to-end once explicitly prompted to use browser automation tools and do all testing as a human user would.
Screenshots taken by Claude through the Puppeteer MCP server as it tested the claude.ai clone.
Providing Claude with these kinds of testing tools dramatically improved performance, as the agent was able to identify and fix bugs that weren’t obvious from the code alone.
Some issues remain, such as limitations in Claude’s vision and in browser automation tools that make it difficult to identify every kind of bug. For example, Claude can’t see browser-native alert modals through the Puppeteer MCP, and features relying on these modals tended to be buggier as a result.
Getting up to speed
With all of the above in place, every coding agent is prompted to run through a series of steps to get its bearings, some quite basic but still helpful:
1. Run pwd to see the directory you’re working in. You’ll only be able to edit files in this directory.
2. Read the git logs and progress files to get up to speed on what was recently worked on.
3. Read the features list file and choose the highest-priority feature that’s not yet done to work on.
It also helps to ask the initializer agent to write an init.sh script that can run the development server, and to have the coding agent run through a basic end-to-end test before implementing a new feature. This saves Claude some tokens in every session, since it doesn’t have to figure out how to run and test the code.
In the case of the claude.ai clone, this meant that the agent always started the local development server and used the Puppeteer MCP to start a new chat, send a message, and receive a response. This ensured that Claude could quickly identify if the app had been left in a broken state, and immediately fix any existing bugs. If the agent had instead started implementing a new feature, it would likely make the problem worse.
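A session-start smoke check along these lines might look like the following sketch; the port, URL, and the contents of init.sh are assumptions for illustration.

```python
import subprocess
import time
import urllib.request

def smoke_test(url: str = "http://localhost:3000", timeout: float = 30.0) -> bool:
    """Launch the dev server via init.sh, then poll until the app
    answers with HTTP 200. The URL and timeout are illustrative
    assumptions, not values from this post."""
    subprocess.Popen(["bash", "init.sh"])  # init.sh is expected to start the server
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            time.sleep(0.5)  # server not up yet; retry
    return False
```

If this check fails, the agent fixes the existing breakage first rather than stacking a new feature on top of a broken app.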
Given all this, a typical session starts off with the following assistant messages:
[Assistant] I'll start by getting my bearings and understanding the current state of the project.
[Tool Use] <bash - pwd>
[Tool Use] <read - claude-progress.txt>
[Tool Use] <read - feature_list.json>
[Assistant] Let me check the git log to see recent work.
[Tool Use] <bash - git log --oneline -20>
[Assistant] Now let me check if there's an init.sh script to restart the servers.
<Starts the development server>
[Assistant] Excellent! Now let me navigate to the application and verify that some fundamental features are still working.
<Tests basic functionality>
[Assistant] Based on my verification testing, I can see that the fundamental functionality is working well. The core chat features, theme switching, conversation loading, and error handling are all functioning correctly. Now let me review the tests.json file more comprehensively to understand what needs to be implemented next.
<Starts work on a new feature>
Agent failure modes and solutions
Problem | Initializer Agent Behavior | Coding Agent Behavior
--- | --- | ---
Claude declares victory on the entire project too early. | Set up a feature list file: based on the input spec, set up a structured JSON file with a list of end-to-end feature descriptions. | Read the feature list file at the beginning of a session. Choose a single feature to start working on.
Claude leaves the environment in a state with bugs or undocumented progress. | Write an initial git repo and progress notes file. | Start the session by reading the progress notes file and git commit logs, and run a basic test on the development server to catch any undocumented bugs. End the session by writing a git commit and progress update.
Claude marks features as done prematurely. | Set up a feature list file. | Self-verify all features. Only mark features as “passing” after careful testing.
Claude has to spend time figuring out how to run the app. | Write an init.sh script that can run the development server. | Start the session by reading init.sh.
Summarizing four common failure modes and solutions in long-running AI agents.
Future work
This research demonstrates one possible set of solutions in a long-running agent harness to enable the model to make incremental progress across many context windows. However, there remain open questions.
Most notably, it’s still unclear whether a single, general-purpose coding agent performs best across contexts, or if better performance can be achieved through a multi-agent architecture. It seems reasonable that specialized agents like a testing agent, a quality assurance agent, or a code cleanup agent, could do an even better job at sub-tasks across the software development lifecycle.
Additionally, this demo is optimized for full-stack web app development. A future direction is to generalize these findings to other fields. It’s likely that some or all of these lessons can be applied to the types of long-running agentic tasks required in, for example, scientific research or financial modeling.
Acknowledgements
Written by Justin Young. Special thanks to David Hershey, Prithvi Rajasakeran, Jeremy Hadfield, Naia Bouscal, Michael Tingley, Jesse Mu, Jake Eaton, Marius Buleandara, Maggie Vo, Pedram Navid, Nadine Yasser, and Alex Notov for their contributions.
This work reflects the collective efforts of several teams across Anthropic who made it possible for Claude to safely do long-horizon autonomous software engineering, especially the code RL & Claude Code teams. Interested candidates who would like to contribute are welcome to apply at anthropic.com/careers.
Footnotes
1. We refer to these as separate agents in this context only because they have different initial user prompts. The system prompt, set of tools, and overall agent harness was otherwise identical.
Summary
The text explains that more advanced AI agents need to be able to work on long tasks spanning several context sessions, but this is difficult because each new session starts with no memory of what came before. Even with mechanisms like compaction, an agent may try to do everything at once, lose context mid-implementation, or, conversely, believe the work was finished too early. To address this in the Claude Agent SDK, the authors propose a two-stage approach: an initializer agent, used only in the first session, which prepares the environment with essential files such as a startup script, a progress log, and an initial commit; and a coding agent, used in subsequent sessions, which advances incrementally and leaves clear artifacts for the next round. The central idea is to create a structure that makes it possible to quickly understand the state of the project when resuming work, combining git history with a progress file. In addition, the initializer creates a detailed feature list, initially marked as pending, to keep the agent from trying to deliver everything at once or declaring the project finished ahead of time.