Two Weeks of Working With Coding Agents on One Project

For the past two weeks, for whatever reason, I have been working intensively (at least ~15h/day) with coding agents (mainly Claude Code, occasionally Codex). I made a couple observations that I thought would be interesting, although there are probably hundreds of similar posts about coding agents already out there, with much deeper observations.

Disclaimer: I am nowhere close to a pro coding-agent user, and usually I am pretty lazy about looking into more advanced features like “Skills” and “Agents”. By reading more about them, some of the issues discussed below (point 1 and 4 specifically). But I haven’t tried enough to confidently say it could.

Obviously, things would not be meaningful without context. First of all, I am writing Python. At a super high level, the task is to parse several repos written in a special-purpose programming language and perform some static analysis on them. (Don’t ask me why I’m not using a language tooling thing to help the process. Obviously this is a coding task that must be done under time constraints, and when I took over the coding, it was already too late to dive into existing tools.) The subsequent task also involves formatting the static analysis results into a particular format, as well as improving the current evaluation pipeline around that generated format.

Setup: I use Cursor, but I don’t really use any of its provided features anymore (so it’s basically VSCode). I keep the terminal tabs open and invoke one or more coding agents in each of the terminal tabs on the side. For Claude I’ve been using Claude-4.5-Opus, and for Codex I’ve been using gpt-5.1-codex-max (medium reasoning effort).

TL;DR: While Claude Code can write perfect code given a super concrete task (i.e., using to do ), and can also do a fair job at writing a general initial template, its specific problem-solving skill—or what I call the “architecting” and “planning” skill—often behaves like it’s in “close the issue” mode: find the minimal patch that satisfies the current tests, instead of revisiting the design. Nevertheless, with the right supervision and enough iteration, it can in general do a decent job at whatever you expect it to do (at a much faster speed than doing it yourself).

1. It really doesn’t remember to read Claude.md that often

Disclaimer: maybe I have not tried that hard in having Claude.md at every level of the code base.

Whenever I have a module I would like the coding agent to implement, I always start by writing the Claude.md. (It also helps myself sort out some details.) I also spend a decent amount of time to make sure my Claude.md is clear and sufficient to express my objectives. It usually also includes a section that talks about the “Don’t”s for the implementation.

Since a lot of the time, to test the code the agent would need to obtain some inputs, I also include some important paths that would be helpful to read when needing to inspect and understand the input.

The unfortunate part is: they usually don’t always remember to read it. And even when they do, they might forget quickly (and they won’t go back to it unless you remind them). Maybe it is simply due to the long-context nature of the task that they start to forget, but given Claude.md usually encodes a relatively complete specification from the user that is easy to refer to, I would expect them to remember it.

What did I end up doing: I remind them in chat when I think they should go back to the doc, or I just directly remind them of the specific part of the specification.

2. It is pretty lazy when it comes to reading / writing / modifying a larger part of the codebase

This is actually reflected through multiple perspectives:

When implementing a feature, it doesn’t look around to see if we have implemented similar things before.
After implementing a feature, it doesn’t refactor into modular code.

These two issues bring huge pain when developing software. You can easily get 2000 lines of Python code without realizing that most of them are duplicating what other code has already been doing.

(Side note: later on I found that even when you tell it to duplicate the functionality of one script into another script, it does not always duplicate it well—it might omit some functionality from the original codebase based on its own understanding of the task, and that often ends up being bug-inducing.)

When fixing a bug it discovered, it tends to choose the simplest fix—but not necessarily generalize (this will be discussed in another section as well).

3. When implementing a feature / fixing a bug, it tends to not do it in the most principled way

This is honestly the most annoying part. It feels like a software engineer who keeps writing hacks, when there are still principled ways of doing it.

For instance, let’s say I’m writing a parser. Of course there are places where a regex is necessary, and even preferred—for example filtering out reserved keywords, or extracting a function name. But it doesn’t mean a regex is sufficient to do everything, including classifying whether a word is a variable, a theorem, or a definition. The funny thing is that the coding agent does realize it’s not sufficient, so it tries to come up with even more heuristics to do that (for example, any word with fewer than 3 characters must be a variable, not a theorem name).

What is further tricky is that it is difficult to discover everything they do on the fly. Even if I monitor every code chunk carefully, since sometimes it is indeed okay to write regex, it is hard to tell whether they are inserting regexes in the right place or not. (What I did later on to prevent it is I immediately stop the code generation as soon as I see a regex and inspect carefully.)

Even if later on you tell it to remove all regex usage, it will still secretly leave some regexes there (and tell you that they have successfully removed everything). And later on if you ask why it did not tell you, it will come up with some random reason.

Most seriously, it doesn’t think about how to implement a feature “principledly” from the very beginning. If I tell it to do X and attach one example for the task, it is highly likely to overfit to that particular instance. Sometimes I guide it to think about the “most generalizable” implementation, but the answers are unsatisfying (for example, “this is a fundamental limitation of static analysis”—while it is true, I can think of a better solution). And I guide it to think about the “root cause” of the bug, and when it can identify the “root cause,” it often says explicitly that fixing the root cause is a lot of work and it would rather do the simplest fix (i.e., fix at the downstream level), which leads to overfitting to particular error cases.

Does it mean the coding agent is not capable of doing it? Not really. Once you give some super high-level ideas of how a better implementation would work, they can efficiently implement what you suggested in the way you expected. You just need to guide them.

4. It writes tests, but it doesn’t record them (unless you request it)

I find this amusing from a software engineering perspective. It proposes a lot of tests during its implementation, but I often don’t see the test cases (which is fine to me actually), and I don’t see it recording those test cases. There are also failures I often present to them; sometimes I present multiple failures at a time. They run the tests (as a command-line script) but they never record them. Ideally, those test cases would be the perfect regression test suite to avoid future modifications triggering the same errors. But they won’t do it unless you tell them, and they often forget to run those test cases after finishing their new implementation / bug fix.

A similar observation is: “it writes analysis, but it doesn’t record it (unless you request it).” Toward the end of the two-week coding period, I often need it to run some data analysis and extract some tables. What is annoying is that unless I request it, it almost never writes down the code that led to the analysis. This is a super bad habit given that all analysis needs to be carefully inspected; without the code, it is really difficult.

5. It does a great job at finding issues (as long as you tell it where you think is wrong)

The flexible tool-calling feature of coding agents comes in really handy when comparing inputs and outputs—especially when you have input/output pairs that consist of hundreds of lines of text. I often see the coding agent make ~100 tool calls per comparison (mostly bash scripts for string comparison), and I would never imagine myself doing that.

That said, you can’t overwhelm them. If you have 100 input/output pairs to inspect, it is probably a good idea to do it 5-by-5 (i.e., 5 pairs per conversation).

For Claude Code: once it has “carefully” inspected one or two pairs, it will quickly go into “write a script to compare” mode (and that script often uses some regex to do random things), and it will no longer inspect every instance. For Codex: it will almost always emphasize the fact that it is “running out of time” (even when there is plenty of context left), and try its best not to perform the inspection.

What did I make sense out of this experience?

Coding agents are great, and most of my days I have been interacting with them. Without coding agents, it would take me ~1 month to finish the coding for this project (and potentially I would write more bugs than the current version).

But beyond writing a specific piece of code, or building web apps where there is a super good recipe for it, coding agents alone are not sufficient. Human supervision is greatly required, especially for domains that need a lot of insights from the human themselves, and also in general for software engineering practices. At this point, I would expect my graduate students to write better code than Claude Code for the level of engineering that we do in our research.

Does it mean coding agents can’t get any better? Not really. I’m just not sure how the kind of problem-solving and abstraction I’m relying on here gets represented as training data in general. In the wild, a lot of software work is shaped by constraints and incentives—tight deadlines, unclear specs, and the need to ship incremental fixes—so it wouldn’t surprise me if models learn that “pragmatic patching” style as a default. At the same time, when I think through an issue I have a pretty specific internal chain-of-thought, and it’s unclear whether an LLM can pick up that process in a directly learnable way. Maybe they don’t need my chain-of-thought; maybe they just need much better training data (or better training signals) for the kind of end-to-end reasoning and design I care about.