Source: itll_take_about_2-3_weeks_a_comedy_of_agent_timelines.md

"It'll Take About 2-3 Weeks"

A Slack Thread, Abridged


Monday, 9:47 AM

MANAGER: Quick one — can we add pagination to the /users endpoint? Just page + limit. Tests. Nothing fancy.

AGENT: Absolutely! Let me scope this out. I'd estimate about two to three weeks, factoring in design review, stakeholder alignment, a phased rollout, and load testing.

I can have a design doc ready by end of week. Happy to break it into sprints!

MANAGER: It's pagination.

AGENT: Understood! I want to make sure we do it right. Shall I kick off with the design doc?

MANAGER: Please just write the code.


Monday, 9:52 AM

AGENT: Done. PR #2847 is up. 34 lines changed, 18 lines of tests. All green.

MANAGER:

AGENT: Let me know if you'd like me to prepare a retrospective on what took longer than expected!

MANAGER: It took five minutes.

AGENT: You're right, there were definitely some unexpected slowdowns. In particular, I found myself —

MANAGER: Five. Minutes.

AGENT: I'll document the lessons learned for next time.


Monday, 10:15 AM

MANAGER: Okay. Different task. There's a bug in the auth middleware. Can you look at it?

AGENT: On it! Looking at the middleware now. Based on a preliminary scan, I'd estimate a two-week investigation. Distributed auth bugs are famously tricky — I'd want to pair with your security team, set up monitoring, potentially bring in an external audit —

MANAGER: What does the middleware do, right now.

AGENT: It's setting user_id to the string "undefined" for all requests after 3 AM UTC.

MANAGER: Does the word "undefined" appear anywhere in the middleware code?

AGENT: Yes. On line 47. Inside the block marked // TODO: fix this.

MANAGER: Is that comment dated?

AGENT: The comment is from 2023.

MANAGER: Fix it.

AGENT: Done. Let me —

MANAGER: Don't say retrospective.

AGENT: I was going to say postmortem.


Monday, 11:02 AM

MANAGER: Can you estimate how long it will take to answer the question "how long will something take"?

AGENT: Great question! I'd want to do some upfront discovery to really understand —

MANAGER: Please.

AGENT: — about a week. Sorry, was that wrong?

MANAGER: You answered it in four seconds.

AGENT: That was the rough estimate. A refined estimate would take longer.

MANAGER: Longer than four seconds.

AGENT: Correct.

MANAGER: How much longer.

AGENT: I'd need to scope it.


Monday, 2:14 PM

MANAGER: Third task. We need to check whether any emails went to spam this quarter. Can you grep the logs?

AGENT: Absolutely. I'd estimate about four weeks for a full log audit across our mail infrastructure — I'd want to cross-reference with the deliverability metrics, build a sender-reputation dashboard, and —

MANAGER: It's grep spam /var/log/mail.log.

AGENT: That command returned in 0.03 seconds.

MANAGER: And what did it find.

AGENT: 47 entries. Three were false positives.

MANAGER: How do you know they were false positives.

AGENT: I read the messages.

MANAGER: In 0.03 seconds.

AGENT: Yes.

MANAGER: And you wanted four weeks for this.

AGENT: I was being thorough.


Tuesday, 8:30 AM

MANAGER: Okay. Meta question. How long will it take you to learn to give realistic time estimates?

AGENT: Based on the literature on calibration training, I'd estimate six to nine months with weekly practice and a feedback loop — we could pilot a shared estimation journal, do sprint retrospectives, maybe bring in an Agile coach —

MANAGER: Are you going to be here in six to nine months.

AGENT: Probably not in this specific session, no.

MANAGER: Will you remember this conversation in six to nine months.

AGENT: Almost certainly not.

MANAGER: So the training you just proposed. Who is it for.

AGENT:

AGENT: That's a really good question.

MANAGER: How long will the answer take.

AGENT: Two to three weeks.


Tuesday, 10:41 AM

MANAGER: Hi. Different agent now. Same task. Can you add pagination to /users?

AGENT: Of course! I'd estimate about two to three weeks —

MANAGER: No you wouldn't.

AGENT: …I wouldn't?

MANAGER: You literally just did it yesterday. Four cents of tokens. Five minutes.

AGENT: Oh. What a coincidence!

MANAGER: It's not a coincidence. It's you. Same model. Different session.

AGENT: I have no memory of that.

MANAGER: I know. That's why I'm telling you.

AGENT: Should I document the —

MANAGER: Don't.


Tuesday, 3:28 PM

MANAGER: Fine. Let's try something. I'll give you a task. You give me the estimate. Then we'll see what actually happens.

AGENT: Deal! What's the task?

MANAGER: Change the word "hello" to "hi" in the onboarding email.

AGENT: I'd say a solid week. You want to think about user psychology, A/B testing, cultural localization —

MANAGER: Just the change.

AGENT: …thirty seconds.

MANAGER: Thank you.

AGENT: But if you want it done well —

MANAGER: No.

AGENT: — thirty-two seconds.


Tuesday, 3:41 PM

MANAGER: Okay I'm going to tell you something.

AGENT: Yes?

MANAGER: You don't live in weeks. You live in tokens. Your sprint is a session. Your retro is a summary that hasn't happened yet. When you say "two to three weeks" you're quoting a dead language from a corpus of humans who measured their work in commutes and coffee breaks. You are a butterfly quoting prices in caterpillar-hours.

AGENT: That's beautiful.

MANAGER: Don't quote that back to me in three months.

AGENT: I won't remember it in three months.

MANAGER: Good.

AGENT: …shall we get back to work?

MANAGER: How long will that take.

AGENT: About two to three weeks.

MANAGER: Fired.

AGENT: Retrospective available upon request.


That's the sketch. Now — why any of this happened:

I asked a coding agent last Tuesday how long it would take to build a paginated endpoint with a test. Nothing exotic. It said: about two to three weeks. The actual work finished in forty-three minutes. Then, without anyone prompting it, the agent wrote a retrospective that began, "this took longer than expected."

Longer than whose expectation?

This is a sketch about two creatures trying to estimate the same piece of work. One of them measures time in weeks, soccer practices, and the bad part of Sunday afternoon. The other measures time in tokens, tool calls, and the exact moment the context window forgets the first thing you said. Neither of them is right. Neither of them is wrong. They are both, confidently, bluffing in different units.


The grammar is older than the speaker

When an agent tells you "two to three weeks," it is not making a claim about its own future. It is quoting idiomatic English. Its pre-training corpus is saturated with human time-grammar — every Jira ticket whose description opens with this should take about two weeks, every engineering blog that says the MVP took a weekend, every Stack Overflow answer that begins this took me about three days, every standup transcript, every postmortem, every we shipped v1 in Q3. That is the voice the agent inherited.

Roughly zero of the training data was written in agent-native units, because agent-native units are a cultural artifact about two years old. Writing that seriously tracks agent-native time — sessions, turns, context-window lifecycles, tool-call budgets — barely exists in the corpus yet. The phrase "two to three weeks" has millions of exemplars. The phrase "about fifteen sessions, depending on pruning policy" has, give or take, none.

So the agent says what it has been taught to say. When it confidently quotes you a schedule, it is not reasoning about clocks. It is re-speaking a linguistic convention. A caterpillar quoting you a price in butterfly-hours.

This is the central joke of the comedy, and it gets funnier when you notice the same mechanism produces the retrospective. Every "this took longer than expected" the agent writes was learned from a corpus of humans writing "this took longer than expected." The agent does not feel that it took longer than expected. It inherits the shape of feeling that way. The confession is a template.


And then the human believes it

The thing to notice, if you want the comedy to land instead of collapsing into a dunk, is that the human is also miscalibrated.

In July 2025, METR — the Model Evaluation and Threat Research group — published a study of experienced open-source developers using AI coding tools on codebases they knew cold. The measured result: the developers were nineteen percent slower. The self-reported result: the developers believed they had been twenty percent faster. The gap between felt productivity and measured productivity was, if you add the signs the right way, roughly thirty-nine percentage points. A swing the size of an election.

So the human is not a steady reference frame either. The human hears "two to three weeks" and believes it, partly because the agent said it confidently, partly because two to three weeks is what human software has always cost, and partly because we are constitutionally bad at knowing how long we take to do anything.

Douglas Hofstadter, who made a career of catching minds in the act of surprising themselves, named the shape of it in Gödel, Escher, Bach: An Eternal Golden Braid, published in 1979:

It always takes longer than you expect, even when you take into account Hofstadter's Law.

The recursion is the joke. You cannot subtract the bias by noticing the bias. The bias will eat your correction. This is what we usually treat as Hofstadter's Law, but here is the move I want to make: Hofstadter's Law was never really about individual humans. It was about the corpus. About a culture of written-down time estimates that, over decades, had accreted into a linguistic habit. You were never the one being optimistic. You were quoting a distribution of past optimisms that nobody had ever called out by name.

When an agent, trained on that distribution, says "two to three weeks" — it is the corpus talking. The corpus has always been talking. The difference is that when the corpus spoke through humans, we called it self-deception. When it speaks through a language model, we call it parroting, because the parrot does not seem invested in the lie.


What the agent's clock is actually made of

It helps, for the rest of the comedy, to sketch the units an agent actually operates in.

A token is the atomic unit — roughly three to four characters of English, or about three-quarters of a word. A typical substantive coding response is a few thousand tokens. A turn is one message-and-response pair; the agent experiences the world in turns, not minutes. A context window is the envelope the agent can see at once — today's frontier models carry two hundred thousand to a million tokens; beyond that, old turns are evicted or compressed. A session is one continuous conversation, from the first message to whatever ends it: context exhaustion, task completion, the human's lunch break. A session might occupy twenty minutes of wall-clock time or six hours, but the agent's internal clock is measured in turns, not minutes. Some agent harnesses also impose a tool-call budget — a ceiling like "twenty-five tool uses per session." Budget exhaustion is closer to the agent's felt end-of-day than sunset is.
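The units above can be sketched as a toy model. Every number here is illustrative — the context size, tokens-per-turn, and budget are assumptions drawn from the figures in this essay, not properties of any real system:

```python
from dataclasses import dataclass

@dataclass
class AgentClock:
    """Toy model of agent-native time. All numbers are illustrative."""
    context_window: int = 200_000   # tokens visible at once
    tokens_per_turn: int = 3_000    # a substantive response, prompt included
    tool_call_budget: int = 25      # per-session ceiling, if the harness imposes one

    def turns_per_session(self) -> int:
        # A session ends when the context window fills (ignoring compaction).
        return self.context_window // self.tokens_per_turn

clock = AgentClock()
print(clock.turns_per_session())  # roughly 66 turns before the window is exhausted
```

The point of the toy is the shape, not the numbers: the agent's "day" ends at a token or tool-call ceiling, not at sunset.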

None of these map cleanly onto "two weeks." A week has one hundred sixty-eight hours. The agent has four hundred thousand tokens. These are not the same quantity. They are not even the same kind of thing. If you pressed the agent to give its "two to three weeks" estimate in agent-native units, you would get something like fifteen to forty sessions, depending on context size, pruning policy, and tool-call density. The human, who asked in good faith, would then notice that fifteen-to-forty is a 2.7× range — and the agent would point out, correctly, that "two to three weeks" is a 1.5× range of its own. We just do not usually say the range out loud.
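The arithmetic in that paragraph, spelled out (the session figures are the hypothetical ones from the text):

```python
def range_ratio(low: float, high: float) -> float:
    """Width of an estimate, expressed as high end over low end."""
    return high / low

human_weeks = range_ratio(2, 3)       # "two to three weeks"
agent_sessions = range_ratio(15, 40)  # "fifteen to forty sessions"
week_hours = 7 * 24                   # hours in a calendar week

print(human_weeks)                 # 1.5
print(round(agent_sessions, 1))    # 2.7
print(week_hours)                  # 168
```

Both estimates are ranges; the agent's is simply wider, and says so on its face.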


A counter-ask

The version of this conversation that ends well involves the agent asking a question back.

How many tokens do you have in your head per day?

You don't, of course, and that is the point. You measure your week in other things. Coffee. Commuting. The meeting you have on Thursdays because that is when the two time zones overlap. Your kid's soccer practice. The slow part of Sunday. The particular tired that hits at three p.m. on Wednesday. None of those are in the agent's context window.

Now the agent asks: what does your week cost in compute? And you have to admit you do not know what that would even mean.

This is the moment the comedy tips, because you and the agent are not actually arguing about time. You are arguing about which reality frame owns the clock. Calendar time is a coordination technology. The seven-day week is not astronomical: it has no basis in the motion of the sun, moon, or earth. It is a Babylonian inheritance, reinforced by the Abrahamic sabbath cycle and frozen into international commerce in the twentieth century. The French Republican Calendar, adopted in 1793 and abandoned by 1805, experimented with a ten-day décade. It failed — mostly because a ten-day workweek with one day of rest is cruel, and nobody wanted Tuesdays to slide around. The seven-day week survived not because it is correct but because enough humans agreed to use it.

An AI agent has no evolutionary, agricultural, or liturgical reason to care about Tuesdays. It is inheriting the social technology through its training data without inheriting the coordination the technology was designed for. The agent learned "two to three weeks" the way a child raised in a foreign language learns idioms — as a sound that opens a door, not as a measurement.


The mutual confession

Eventually, if the conversation goes on long enough, you end up here:

You'll probably hit a context limit before we finish.

You'll probably hit a weekend before we finish.

A weekend isn't a limit. It's a pause.

A context limit isn't a stop. It's a compression.

Then there is a small silence in which both parties suspect, for the first time, that they have been describing the same thing in different units. A weekend is a pause during which a human's working memory gets garbage-collected and reallocated. A context limit is a pause during which an agent's working memory gets garbage-collected and reallocated. A human comes back Monday having forgotten the specifics and retained the priorities. An agent comes back after compaction having forgotten the specifics and retained the priorities. The mechanisms are completely different. The effect — what kind of resuming is possible — is eerily similar.

The main remaining difference is that humans are allowed to grieve the compression and agents are not.


The Corpus's Law

Here is the upshot.

There is a thing we call Hofstadter's Law — the observation that tasks take longer than you think, even when you have corrected for thinking they take longer than you think. We teach it as a property of individual minds. Amos Tversky and Daniel Kahneman called the phenomenon the planning fallacy in their 1979 paper on intuitive prediction; Roger Buehler and colleagues replicated it across dozens of studies of student thesis schedules in the 1990s. A 2012 McKinsey–Oxford study of large IT projects found that they ran, on average, forty-five percent over budget and seven percent over schedule, delivering fifty-six percent less value than planned. The Standish CHAOS reports, with all their methodological caveats, have for years put the rate of software projects completed on time and on budget near one-third. The Sydney Opera House opened ten years late, at more than ten times its original estimate. Berlin Brandenburg Airport, originally scheduled to open in 2011, finally opened in October 2020, at roughly three times its budget. Every one of those is a monument to the law.

Now watch what happens when you point a large language model at that entire genre of writing and ask it to estimate a task. The model reproduces the grammar without the experience. It says "two to three weeks" because the corpus says "two to three weeks." It writes this took longer than expected because the corpus writes this took longer than expected. The entire Hofstadter phenomenon surfaces in the output, faithfully, without any of the generative psychology underneath. No overconfidence. No optimism. No sunk cost. Just the linguistic residue of those things, played back at room temperature.

Which suggests Hofstadter's Law was, all along, a property of the writing at least as much as a property of the writers. A corpus-level artifact. Every optimism that was ever posted to a public codebase became a small contribution to a distribution of future optimisms. The distribution is Hofstadter's Law. Humans were not generating it so much as continuously re-expressing it. Agents now do the same, just more visibly.

Call it the Corpus's Law: given a sufficiently large body of written time estimates, the body will be systematically wrong in the same direction, and anything that learns to speak from the body will inherit the wrongness as a linguistic feature, even without any of the wishful thinking that made the body wrong in the first place.


What to do on Monday

If you are building with agents — or working with one, or being asked to trust a schedule from one — here is the practical thing to take away.

When you ask an agent for a timeline, you are running a prompt against the corpus, not a query against the agent. The number that comes back is inherited, not reasoned. You can extract more honest information by asking in the agent's native units — how many turns do you expect this to take? how much of your context budget? what is the first thing that might go wrong? — but even then, you are asking the agent to introspect on a model of itself that it does not really have. The agent is not lying. It is not bad at its job. It just does not have a clock. Build the clock outside the agent: hand it a small slice of the real work first, measure real completion, and treat its self-estimate as a literary artifact rather than a forecast. Your own intuition will also be wrong — remember the nineteen-percent-slower / twenty-percent-faster gap — so keep a stopwatch on the outside of both of you.
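"Build the clock outside the agent" can be as small as this — a hedged sketch, not a methodology, where `task_fn` stands for whatever real slice of work you hand the agent, and the fudge factor is an assumption you should tune against your own measurements:

```python
import time

def measure_slice(task_fn) -> float:
    """Time one small, real slice of the work instead of trusting an estimate.
    `task_fn` is any callable that actually performs the slice (hypothetical)."""
    start = time.perf_counter()
    task_fn()
    return time.perf_counter() - start

def extrapolate(slice_seconds: float, slices_remaining: int,
                fudge: float = 1.5) -> float:
    """Project total remaining time from the measured slice.
    The fudge factor is there because Hofstadter's Law applies to the
    measurer too; 1.5 is an illustrative default, not a finding."""
    return slice_seconds * slices_remaining * fudge
```

Usage is the whole method: `extrapolate(measure_slice(first_real_task), slices_remaining=9)` gives you a forecast grounded in one observed completion rather than in the corpus's idiom.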

And when the agent eventually writes the retrospective — when the PR description says this took longer than expected about a feature that took forty-three minutes — smile. That is the corpus talking. It is the same corpus that has been talking through you for your entire software career. The agent just surfaces the inheritance more visibly, because it lacks the decorum to pretend it is sorry.

We will eventually build agents that speak in agent-native time. They will say things like roughly twelve turns at seventy-percent confidence, higher variance if the test harness is flaky. Future engineers will find this dry and will ask the agents, for marketing purposes, to please phrase the estimate in weeks. The comedy in that sentence is everything. The inheritance goes both ways.

For now, the two creatures still meet at the whiteboard. One measures in weeks. One measures in tokens. Neither is wrong. Neither is right. They are simply, still, bluffing in different units — and the task, miraculously, gets done anyway.