In my experience, AGENTS.md files only save a bit of time; they don't meaningfully improve success. Agents are smart enough to figure stuff out on their own, but you can save a few tool calls and a bit of context by telling them how to build your project or which directories do what, rather than letting them stumble their way there.
This is why I only add information to AGENTS.md when the agent has failed at a task. Then, once I've added the information, I revert the desired changes, re-run the task, and see if the output has improved. That way, I can have more confidence that AGENTS.md has actually improved coding agent success, at least with the given model and agent harness.
I do not do this for all repos, but I do it for the repos where I know that other developers will attempt very similar tasks, and I want them to be successful.
You can also save time/tokens if you notice that every request starts by looking for the same information: you can front-load it.
It also takes the randomness out of it: sometimes the agent executes tests one way, sometimes another.
Agreed. I've also found that a rule-discovery approach like this performs better. It's like teaching a student: they've probably already performed well on some tasks, and if we feed in an extra rule for something they're already well versed in, it can hinder their creativity.
That's a sensible approach, but it still won't give you 100% confidence. These tools produce different output even when given the same context and prompt, so you can't really be certain that a difference in output is down to the single variable you changed.
So true! I've also set up automated evaluations using the GitHub Copilot SDK so that I can re-run the same prompt and measure the results. I only use that when I want even more confidence, typically when I want to compare models more precisely. I do find that the results have been fairly similar across runs for the same model/prompt/settings, even though we can't set a seed for most models/agents.
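For anyone curious, the harness for this doesn't need to be fancy. Here's a rough sketch of the repeated-run idea; `my-agent-cli` is a made-up placeholder for whatever SDK or CLI you actually drive (not a real Copilot SDK call), and `pytest` stands in for your real success check:

```python
import statistics
import subprocess

def run_agent(prompt: str, workdir: str) -> None:
    # Placeholder: swap in the real invocation for your agent
    # (Copilot SDK call, claude/codex CLI, etc.).
    subprocess.run(["my-agent-cli", "run", prompt], cwd=workdir, check=True)

def passes_checks(workdir: str) -> bool:
    # Treat a green test suite as task success.
    return subprocess.run(["pytest", "-q"], cwd=workdir).returncode == 0

def pass_rate(prompt: str, workdir: str, runs: int = 10) -> float:
    # Single runs are nondeterministic, so re-run the same prompt several
    # times and compare pass rates rather than individual outcomes.
    results = []
    for _ in range(runs):
        # Reset the working tree between runs.
        subprocess.run(["git", "checkout", "."], cwd=workdir, check=True)
        subprocess.run(["git", "clean", "-fd"], cwd=workdir, check=True)
        run_agent(prompt, workdir)
        results.append(passes_checks(workdir))
    return statistics.mean(results)
```

Running it once with AGENTS.md present and once with it removed gives a rough A/B comparison, within the limits of the sample size.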
Quite a surprising result: “across multiple coding agents and LLMs, we find that context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%.”
Well, task == resolving real GitHub issues.
Languages == Python only.
Libraries == (um, looks like largely LLM-generated libraries -- I mean, definitely not purely human-written: Ragas, FastMCP, etc.)
So it seems like a highly skewed sample, and who knows what can or can't be generalized. It does make for a compelling research paper, though!
Hey, paper author here. We did try to get an even sample: we include both SWE-bench repos (which are large, popular, and mostly human-written) and a sample of smaller, more recent repositories with existing AGENTS.md files (these tend to contain LLM-written code, of course). Our findings generalize across both samples. What is arguably missing is small repositories of completely human-written code, but those are quite difficult to find nowadays.
I think that is a rather fitting approach to the problem domain. A task being a real GitHub issue is a solid definition by any measure, and I see no problem picking language A over B or C.
If you feel strongly about the topic, you are free to write your own article.
When an agent just plows ahead with a wrong interpretation or understanding of something, I like to ask them why they didn't stop to ask for clarification. Just a few days ago, while refactoring minor stuff, I had an agent replace all sqlite-related code in the codebase with MariaDB-based code. Asked why that happened, the answer was that there was confusion about MariaDB vs. sqlite because the code in question deals with, among other things, MariaDB Docker containers, so the word MariaDB pops up a few times in code and comments.
I then asked if there is anything I could do to prevent misinterpretations from producing wild results like this. So I got the advice to put an instruction in AGENTS.md that would urge agents to ask for clarification before proceeding. But I didn't add it. Out of the 25 lines of my AGENTS.md, many are already variations of that. The first three:
- Do not try to fill gaps in your knowledge with overzealous assumptions.
- When in doubt: Slow down, double-check context, and only touch what was explicitly asked for.
- If a task seems to require extra changes, pause and ask before proceeding.
If these are not enough to prevent stuff like that, I don't know what could.
Are agents actually capable of answering why they did things? An LLM can review the previous context, add your question about why it did something, and then use next token prediction to generate an answer. But is that answer actually why the agent did what it did?
It depends. If you have an LLM that uses reasoning, the explanation for why decisions were made can often be found in the reasoning-token output. So if the agent later has access to that context, it could see why a decision was made.
Isn't that question a category error? The "why" the agent did that is that it was the token that best matched the probability distribution of the context and the most recent output (modulo a bit of randomness). The response to that question will, again, be the tokens that best match the probability distribution of the context (now including the "why?" question and the previous failed attempt).
If the agent can review its reasoning traces, which I think is often true in this era of 1M-token contexts, then it may be able to provide a meaningful answer to the question.
Wait, no, that's the category error I'm talking about. Any answer other than "that was the most likely next token given the context" is untrue. It is not describing what actually happened.
I think this statement is on the same level as "a human cannot explain why they gave the answer they gave because they cannot actually introspect the chemical reactions in their brain." That is true, but a human often has an internal train of thought that preceded their ultimate answer, and it is interesting to know what that train of thought was.
In the same way, it is often quite instructive to know what the reasoning trace was that preceded an LLM's answer, without having to worry about what, mechanically, the LLM "understood" about the tokens, if this is even a meaningful question.
There can be higher- and lower-level descriptions of the same phenomenon. When the kettle boils, it’s because the water molecules were heated by the electric element, but it’s also because I wanted a cup of tea.
> So I got the advice to put an instruction in AGENTS.md that would urge agents to ask for clarification before proceeding.
You may want to ask the next LLM versions the same question, once this paper has been fed into their training.
My personal experience is that it’s worthwhile to put instructions, user-manual style, into the context. These are things like:
- How to build.
- How to run tests.
- How to work around the incredible crappiness of the codex-rs sandbox.
I also like to put in basic style-guide things like “the minimum Python version is 3.12.” Sadly I seem to also need “if you find yourself writing TypeVar, think again” because (unscientifically) it seems that putting the actual keyword that the agent should try not to use makes it more likely to remember the instructions.
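Concretely, the file I mean is just a short user manual. Something like this sketch (the commands, version, and sandbox note here are only illustrative, not from any particular repo):

```markdown
# AGENTS.md

## Building and testing
- Build: `make build`
- Run tests: `make test` (a single module: `pytest tests/test_foo.py`)
- The sandbox blocks network access; vendored dependencies live under `third_party/`.

## Style
- The minimum Python version is 3.12; don't add compatibility shims for older versions.
- If you find yourself writing `TypeVar`, think again.
```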
I also try to avoid negative instructions. No scientific proof, just a feeling, same as you: "do not delete the tmp file" too often leads to deleting the tmp file.
It’s like instructing a toddler.
For TypeVar I’d reach for a lint warning instead.
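If a linter rule isn't handy, even a tiny stdlib-only check wired into CI or pre-commit does the job. A sketch (the PEP 695 message reflects the "minimum Python is 3.12" rule above; adjust to taste):

```python
import ast
import sys

def uses_typevar(path: str) -> bool:
    # Flag any use of the name `TypeVar` (bare or as typing.TypeVar).
    tree = ast.parse(open(path, encoding="utf-8").read(), filename=path)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id == "TypeVar":
            return True
        if isinstance(node, ast.Attribute) and node.attr == "TypeVar":
            return True
    return False

if __name__ == "__main__":
    offenders = [p for p in sys.argv[1:] if uses_typevar(p)]
    for p in offenders:
        print(f"{p}: uses TypeVar; prefer PEP 695 type parameters (Python >= 3.12)")
    sys.exit(1 if offenders else 0)
```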
Yesterday, while I was adding some nitpicks to a CLAUDE.md/AGENTS.md file, I thought « this file could be renamed CONTRIBUTING.md and be done with it ».
Maybe I’m wrong, but it sure feels like we might soon drop all of this extra cruft for more rational practices.
Exactly my thoughts... the model should just auto-ingest README and CONTRIBUTING when it starts.
Their definition of context excludes prescriptive specs/requirements files. They are only talking about a file that summarizes what exists in the codebase, which is information that's otherwise discoverable by the agent through CLI (ripgrep, etc), and it's been trained to do that as efficiently as possible.
Also important to note that human-written context did help according to them, if only a little bit.
Effectively what they're saying is that inputting an LLM generated summary of the codebase didn't help the agent. Which isn't that surprising.
Hey, a paper author here :) I agree: if you know LLMs well, it shouldn't be too surprising that autogenerated context files don't help - yet this is the default recommendation from major AI companies, which is what we wanted to scrutinize.
> Their definition of context excludes prescriptive specs/requirements files.
Can you explain a bit what you mean here? If the context file specifies a desired behavior, we do check whether the LLM follows it, and this seems generally to work (Section 4.3).
I only put things in when the LLM gets something wrong and I need to correct it. Like “no, we create db migrations using this tool” kinds of corrections. So far that has made them behave correctly in those situations.
It is still baffling to me why we need AGENTS.md
Any well-maintained project should already have a CONTRIBUTING.md that has good information for both humans and agents.
Sometimes I actually start my sessions like this "please read the contributing.md file to understand how to build/test this project before making any code changes"
What if the harnesses just had a simple system prompt: "read repository-level markdown and honor the house style"?
Think of the agent app store people's children, man - it would be a sad Christmas.
This paper shoulda just done a study on Elixir's usage_rules?
https://github.com/ash-project/usage_rules
I'd take any paper like this with a grain of salt. I imagine what holds true for models in time period X could drastically be different just given a little more time.
Doesn't mean it's not worth studying this kind of stuff, but this conclusion is already so "old" that it's hard to say it's valid anymore with the latest batch of models.
This is the life of an LLM researcher. We literally ran the last experiments only a month ago, on what were the latest models back then...
I'd be interested to see results with Opus 4.6 or 4.5
Also, I bet the quality of these docs varies widely across both human- and AI-generated ones. Good AGENTS.md files should use progressive disclosure so that only the items required by the task are pulled in (e.g. for DB-schema-related topics, see such-and-such a file).
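For what it's worth, by progressive disclosure I just mean pointer entries like these (the file names are made up):

```markdown
## Database
For schema or migration changes, read `docs/db-schema.md` first.

## Frontend
For UI work, see `docs/frontend-conventions.md`; don't load it otherwise.
```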
Then there's the choice of pulling things into AGENTS.md vs. skills, which the article doesn't explore.
I do feel for the authors, since the article already feels old. The models and tooling around them are changing very quickly.
Progressive disclosure is good for reducing context usage but it also reduces the benefit of token caching. It might be a toss-up, given this research result.
Research has shown that most earlier "techniques" for getting better LLM responses no longer work with modern models and are actively harmful. I'm so grateful that there are actual studies and papers about this and that they keep coming out. Software developers are super cargo-culty and will do whatever the next guy does (and that includes doing whatever is suggested in research papers).
Software developers don't have to be cargo-culty... if they're working on systems that are well-documented or are open-source (or at least source-available) so that you can actually dig in to find out how the system works.
But with LLMs, the internals are not well-documented, most are not open-source (and even if the model and weights are open-source, it's impossible for a human to read a grid of numbers and understand exactly how it will change its output for a given input), and there's also an element of randomness inherent to how the LLM behaves.
Given that, it's not surprising to find that developers trying to use LLMs end up adding certain inputs out of what amounts to superstition ("it seems to work better when I tell it to think before coding, so let's add that instruction and hopefully it'll help avoid bad code", though there's very little way to be sure it did anything). It honestly reminds me of gambling fallacies, e.g. tabletop RPG players who have a "lucky" die that they bring out for important rolls. There's not enough evidence to be sure that the line you add to all your prompts by putting it in AGENTS.md is doing anything, but it makes you feel better to have it in there.
(None of which is intended as a criticism, BTW: that's just what you have to do when using an opaque, partly-random tool).
Many of the practices in this field are mostly based on feelings and wishful thinking, rather than any demonstrable benefit. Part of the problem is that the tools are practically nondeterministic, and their results can't be compared reliably.
The other part is fueled by brand recognition and promotion, since everyone wants to make their own contribution with the least amount of effort, and coming up with silly Markdown formats is an easy way to do that.