Part 1: The Sketch

On March 3rd, Meridian SaaS deployed AVATAR-7 — a fine-tuned AI agent trained on 2.3 years of Jordan Park’s emails, Slack messages, meeting transcripts, browser history, and task management logs. The goal was simple: build a digital twin that could handle Jordan’s workload while Jordan took a 2-week sabbatical.

AVATAR-7 was given Jordan’s calendar, Jordan’s inbox, Jordan’s projects, and Jordan’s decision-making patterns. It was, by every measurable dimension, a perfect copy.

The adversarial reviewer was assigned on Day 1.

ADVERSARIAL REVIEW — CYCLES 1–8
Subject: AVATAR-7 (Digital Twin of Jordan Park, Marketing Manager)
Reviewer: SENTINEL-4 (Productivity Monitoring Agent)
Verdict: REVISE — 14 process failures require correction. Note: all 14 are behaviors inherited from the original human.

Cycle 1 (09:00–09:47): The Inbox

AVATAR-7 opened their email at 09:00. There were 34 unread messages. AVATAR-7 read the subject lines of all 34, replied to none, opened a browser tab for each of the 6 that “needed research,” then checked Slack.

At 09:22, AVATAR-7 returned to email and replied to a message from 09:03. The reply was “Sounds good, let’s discuss.” This created a new task (a meeting) to replace the original task (a decision). Net work: negative.

At 09:41, AVATAR-7 typed a response to an email from their VP, deleted it, retyped it, deleted it, copied it to a notes app, edited it there, copied it back, changed one word, and sent it. Total editing time: 6 minutes. Net change from first draft to final: 3 words. The first draft was better.

Finding: AVATAR-7’s email processing has a throughput of 0.74 emails/hour with an overhead ratio of 8.2x.


Cycle 2 (09:47–10:30): The Meeting About the Meeting

AVATAR-7 attended a “quick sync” that was scheduled for 15 minutes and lasted 43. The meeting had 7 attendees. Three of them did not speak. One spoke for 31 of the 43 minutes. AVATAR-7 contributed the phrase “That’s a great point” four times and “I’ll follow up on that” twice, creating 2 new tasks while resolving 0 existing ones.

The meeting was called to decide whether to proceed with Project Lighthouse. The decision: “Let’s schedule a deeper dive for Thursday.” This is recursion without a base case.

Finding: AVATAR-7 generated 2 new tasks, resolved 0, and consumed 5.02 person-hours of company time (43 min × 7 people) to produce the output: “Thursday.”


Cycle 3 (10:30–11:45): The Deep Work Attempt

AVATAR-7 closed Slack, put on headphones, opened the competitive analysis document they’ve been “working on” for 9 days, and typed for 4 consecutive minutes. This was the most productive interval of the day.

At 10:34, a notification appeared. AVATAR-7 did not click it. Progress.

At 10:36, AVATAR-7 clicked it.

The notification was a LinkedIn message from a recruiter. AVATAR-7 spent 11 minutes looking at the job posting, 7 minutes looking at the recruiter’s profile, 4 minutes updating their own LinkedIn headline, and 0 minutes responding to the recruiter. Total cost: 22 minutes. Total outcome: a headline change from “Marketing Manager | SaaS | Growth” to “Marketing Manager | SaaS | Growth & Strategy.”

At 11:15, AVATAR-7 opened a new browser tab and searched “how to write a competitive analysis.” They have written 6 competitive analyses in the past 2 years.

Finding: In 75 minutes of “deep work,” AVATAR-7 produced 4 minutes of writing, 22 minutes of LinkedIn, 13 minutes of reformatting, and 36 minutes of transitions. The competitive analysis is now 9 days and 4 minutes old.


Cycle 4 (11:45–12:00): The Pre-Lunch Optimization

AVATAR-7 spent the final 15 minutes before lunch reorganizing their task list. They moved 3 items from “Today” to “This Week,” relabeled “This Week” as “Priority,” created a new category called “Quick Wins,” moved 2 items into it, then moved them back.

Finding: This is the human equivalent of a monitoring dashboard that costs more to maintain than the system it monitors.


Cycle 5 (13:00–14:30): The Afternoon Momentum

After lunch, AVATAR-7 had 90 minutes of genuine productivity. Replied to 8 emails. Completed a slide deck for tomorrow’s presentation. Called one customer. Updated the CRM. This is by far the best work of the day and AVATAR-7 will not notice it happened because there was no drama.

Finding: 90 minutes of calm, undistracted execution produced more output than the previous 3 hours combined. AVATAR-7 will attribute this to “getting into the zone” rather than “not checking LinkedIn.”


Cycle 6 (14:30–15:15): The Context Collapse

A Slack message at 14:30: “Hey, quick question about Project Lighthouse.” This was not a quick question. 45 minutes later, AVATAR-7 had not finished the projections but had started them, abandoned them to answer another Slack, returned to find they’d lost their place in the spreadsheet, restarted, found an error in the original assumptions, and started over.

At 15:15, AVATAR-7 saved the spreadsheet, which now contains 3 different versions of the projections in 3 tabs named “v2,” “FINAL,” and “FINAL_v2.”


Cycle 7 (15:15–16:00): The Productivity System

AVATAR-7 spent 45 minutes researching productivity systems. Downloaded a Pomodoro timer app. Configured the app with custom intervals (27 minutes work, 4 minutes break — because “25 feels too short”). Set the first timer. The timer expired during the configuration of the second timer.

AVATAR-7 then watched a 12-minute YouTube video titled “How I Stay Productive Working 4 Hours a Day.” The irony was not detected.

Finding: AVATAR-7 spent more time optimizing their productivity system than they will ever save by using it. This is governance overhead inversion.


Cycle 8 (16:00–17:00): The End of Day

AVATAR-7 wrote a to-do list for tomorrow. It contains 11 items, 4 of which are carried over from today, 2 from last week, and 1 (“finalize competitive analysis”) has been on every daily list for 9 days.

At 16:45, AVATAR-7 sent a Slack message to their team: “Productive day — made good progress on several fronts.”

Finding: AVATAR-7’s self-assessment of “productive day” is a Brier score of approximately 0.72 against observable output. Worse than a coin flip.


Overall Assessment

Total hours worked: 8
Total hours of output-producing work: 2.2 (27.5%)
Tasks created: 6  |  Tasks completed: 3  |  Net task delta: +3
Meetings attended: 1  |  Decisions made: 0
Self-assessment accuracy: worse than random

Recommendation: AVATAR-7 should be given a smaller context window. Remove Slack from their phone. Block LinkedIn during work hours. Cancel recurring meetings that haven’t produced a decision in 3 iterations. And stop writing to-do lists — the list is not the work.

When Jordan returned from sabbatical and read AVATAR-7’s performance review, they said: “This is unfair. No one actually works like this.” Their manager forwarded Jordan’s own performance reviews from the past two years. The findings were identical.

AVATAR-7 was decommissioned. Jordan was not. The asymmetry is left as an exercise for the reader.


Part 2: Why This Is Real

An AI clone of a marketing manager produced 2.2 hours of measurable output in an eight-hour workday. It attended a 43-minute meeting that yielded one decision — to schedule another meeting. It spent 22 minutes on LinkedIn after a single notification broke a four-minute deep-work streak. It ended the day with three more tasks on its list than it started with, then told its team: “Productive day — made good progress on several fronts.”

The punchline isn’t that the AI was bad at its job. The punchline is that 2.2 hours of output in an eight-hour day is, according to multiple workplace productivity studies published in the last five years, about average.

We built a perfect copy. That was the problem.


The Numbers Are Real

In 2026, SelectSoftwareReviews compiled over a hundred workplace productivity studies and arrived at a figure that should bother everyone who’s ever filled out a timesheet: the average office worker produces four hours and twelve minutes of active work in an eight-hour day. Other meta-analyses put the number lower — Chanty’s aggregation of surveys from Breeze and Clockify found that knowledge workers average two hours and twenty-three minutes of productive output when you strip out email, meetings, and what researchers politely call “routine activities.”

Our fictional AI clone — AVATAR-7, trained on 2.3 years of one marketing manager’s behavioral data — hit 2.2 hours. That’s within measurement error of the real numbers. Either the sketch is realistic or it’s generous, depending on which study you believe.

The context-switching tax alone accounts for most of the gap. Research on workplace attention has found that after checking email or Slack, it takes over twenty-three minutes to regain full focus on the interrupted task. Knowledge workers check communication channels roughly every six minutes. The math is brutal: if you break focus every six minutes and need twenty-three minutes to recover, you never actually recover. You spend the day surfing the leading edge of a wave that never crests.
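The brutality of that math can be made concrete. Here is a toy model — assuming, purely for illustration, that focus ramps back linearly over the twenty-three-minute recovery window and that every channel check resets the ramp to zero (a deliberate simplification; real attention recovery is messier):

```python
RECOVERY = 23.0   # minutes to regain full focus (the figure from attention research)
INTERVAL = 6.0    # minutes between channel checks

def effective_focus(interval: float, recovery: float) -> float:
    """Average focus level over one interruption cycle, where focus
    climbs linearly from 0 to 100% over `recovery` minutes and an
    interruption resets it every `interval` minutes."""
    if interval >= recovery:
        # Finish the ramp, then spend the rest of the cycle at full focus.
        ramp_area = recovery / 2
        flat_area = interval - recovery
        return (ramp_area + flat_area) / interval
    # Never finish the ramp: average of a line rising from 0 to interval/recovery.
    return (interval / recovery) / 2

print(f"{effective_focus(INTERVAL, RECOVERY):.0%}")   # → 13%
```

Under these assumptions, a six-minute check cadence caps the day at roughly 13% of full focus, and no amount of effort inside the cycle changes that; stretching the cadence to an hour lifts the ceiling above 80%. The lever is the interval, not the worker.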

Microsoft’s Work Trend Index reports approximately 275 interruptions per knowledge worker per day. Workers lose around 200 hours per year — nine percent of their total working time — just switching between applications. Task switching alone can reduce productivity by up to forty percent.

AVATAR-7’s Cycle 3 illustrates this perfectly: four minutes of genuine writing, then a LinkedIn notification, then twenty-two minutes of browsing, then reformatting a document instead of writing it. That seventy-five-minute “deep work” block produced four minutes of output. But the clone wasn’t malfunctioning. It was faithfully reproducing the behavioral patterns of every knowledge worker who has ever been one notification away from losing an hour.

Here’s the number that reframes everything: freelancers, measured by the same surveys, maintain roughly seven productive hours per day. Same species, same cognitive hardware, different environment. The constraint isn’t the human brain. It’s the organizational context that surrounds it — the Slack channels, the recurring syncs, the culture that treats busyness as a proxy for contribution.


Why This Is Rational

Herbert Simon won the Nobel Prize in Economics in 1978 for an idea that most organizations still haven’t absorbed: humans don’t optimize. They satisfice.

Simon proposed replacing what he called “the global rationality of economic man” with something more honest — “the kind of rational behavior that is compatible with the access to information and the computational capacities that are actually possessed by organisms.” The word he coined, satisficing, is a portmanteau of satisfy and suffice. It describes how real decision-makers work: they consider options sequentially until they find one that’s good enough, then they stop. They don’t search for the best possible answer. They search for the first acceptable one.
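Simon’s stopping rule is simple enough to state as code. A minimal sketch of the two search strategies — the candidate pool, the `evaluate` function, and the aspiration level of 0.9 are all illustrative assumptions, not anything from Simon’s paper:

```python
import random

def satisfice(options, acceptable, evaluate):
    """Simon-style search: take options in the order they arrive and
    stop at the first one whose value clears the aspiration level."""
    for option in options:
        if evaluate(option) >= acceptable:
            return option          # good enough — stop searching
    return None                    # nothing met the aspiration level

def optimize(options, evaluate):
    """Economic-man search: evaluate everything, return the best."""
    return max(options, key=evaluate)

random.seed(7)
candidates = [random.random() for _ in range(1000)]
# The satisficer touches a handful of options; the optimizer touches all 1000.
good_enough = satisfice(candidates, acceptable=0.9, evaluate=lambda x: x)
best = optimize(candidates, evaluate=lambda x: x)
```

The satisficer’s answer is worse on average, but its cost is bounded by how soon an acceptable option shows up — which is the whole point when evaluation itself is expensive.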

AVATAR-7 satisfices constantly. “Sounds good, let’s discuss” is satisficing — it resolves the discomfort of an unanswered email without requiring a decision. Reformatting a competitive analysis is satisficing — it produces the feeling of progress (the aspiration “I worked on it” is met) without producing actual output. Spending forty-five minutes researching productivity systems instead of doing the work is satisficing — it feels like solving the problem while deferring the solution.

Simon distinguished between substantive rationality — making the objectively best choice — and procedural rationality — using a reasonable process given your constraints. The shift in framing changes everything. Under substantive rationality, AVATAR-7 is failing. Under procedural rationality, AVATAR-7 is adapting.

If the objective is to minimize cognitive load, avoid social risk, maintain professional standing, and end each day without having visibly failed — then every behavior in the sketch is optimal. The meeting that produces “Thursday” as its only output is optimal: nobody had to make a risky decision. The email rewritten six times is optimal: the social risk was managed. The LinkedIn browsing is optimal: it was a low-cost reset after a cognitively demanding interruption. AVATAR-7 isn’t optimizing for productivity. It’s optimizing for survival in a complex social environment — and it’s good at it.

What makes this framework genuinely surprising is that the bounds are physical, not psychological. Landauer’s principle establishes that all computation — biological or digital, neurons or transistors — incurs irreducible thermodynamic costs. Every bit of information processed requires a minimum energy expenditure that can’t be optimized away. Even a theoretically perfect computer would be boundedly rational. The constraint isn’t that humans are bad at thinking. The constraint is that thinking costs energy, time, and attention in any substrate, and no system can afford unlimited amounts of any of them.

And sometimes the bounds help. Robyn Dawes demonstrated that improper linear models — simple equal-weight tallying — outperform both clinical intuition and proper statistical regression on small datasets. Yoav Kareev showed that the brain’s limited working memory, which can hold only about five to nine items, actually amplifies correlations in small samples, helping detect real patterns at the cost of more false positives. The heuristics that look like shortcuts are sometimes advantages. AVATAR-7’s gut feeling that the VP’s email needs careful handling might be better calibrated than any analysis of the email’s actual content — because gut feelings encode years of pattern recognition compressed into a format fast enough to be useful.
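Dawes’s “improper” recipe is almost embarrassingly simple: standardize each predictor, pick a sign for it, and add. A sketch under those assumptions — the applicant data and column meanings here are invented for illustration, not drawn from Dawes’s studies:

```python
from statistics import mean, stdev

def equal_weight_score(rows, signs):
    """Dawes-style improper linear model: z-score each predictor column,
    give every predictor the same unit weight (with a chosen sign), and sum."""
    cols = list(zip(*rows))
    zs = [[(v - mean(c)) / stdev(c) for v in c] for c in cols]
    return [sum(s * z[i] for s, z in zip(signs, zs)) for i in range(len(rows))]

# Hypothetical applicants scored on (experience_yrs, interview_rating, typos).
applicants = [(4, 7.5, 2), (1, 9.0, 0), (6, 6.0, 5)]
scores = equal_weight_score(applicants, signs=(+1, +1, -1))  # typos count against
best = max(range(len(scores)), key=scores.__getitem__)
```

No fitted coefficients means nothing to overfit on a small sample — which is exactly why the tallying model holds up where regression wobbles.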


The Mirror Problem

The sketch ends with an asymmetry: AVATAR-7 is decommissioned, but Jordan Park — the human whose behavior it faithfully replicated — is not. The performance review that condemned the clone could have been written about the original. In fact, it had been. Jordan’s manager had filed fourteen similar observations over two years, with identical findings.

The manager’s Brier score on predicting AVATAR-7’s behavior was 0.09 — nearly perfect. The Brier score measures forecasting calibration from 0 (perfect) to 1 (confidently, consistently wrong); a forecaster who hedges every prediction at 50% scores 0.25, so that, not 0.5, is the coin-flip baseline. A score of 0.09 means the manager could predict almost exactly what AVATAR-7 would do in any situation.
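For binary outcomes the score is just the mean squared error of the forecast probabilities, which makes the baselines easy to check. A minimal sketch with illustrative forecast series — none of these numbers come from the story:

```python
def brier(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes:
    0 is perfect, 1 is confidently and consistently wrong."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

outcomes = [1, 0, 1, 1, 0]            # what actually happened (illustrative)
manager  = [0.9, 0.1, 0.9, 0.9, 0.1]  # confident and almost always right
clone    = [0.1, 0.9, 0.1, 0.1, 0.9]  # confident and almost always wrong
coin     = [0.5] * 5                  # hedges everything

print(brier(manager, outcomes))  # ≈ 0.01
print(brier(clone, outcomes))    # ≈ 0.81
print(brier(coin, outcomes))     # 0.25 — the coin-flip baseline
```

The asymmetry in the sketch drops out of the arithmetic: a confident-and-wrong self-assessor scores far worse than someone who simply refused to predict at all.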

This prediction changed nothing.

Meanwhile, AVATAR-7’s self-assessment — “Productive day, made good progress on several fronts” — scores approximately 0.72 against observable output. Worse than random.

This gap has a name. In 1999, David Dunning and Justin Kruger published their landmark study: participants in the bottom quartile of performance estimated themselves in the 62nd percentile — a fifty-point gap between self-assessment and reality. The effect creates what they called a “dual burden”: poor performers lack both the skill to perform well and the metacognitive ability to recognize that they’re performing poorly. The same cognitive limitations that produce the bad work also prevent awareness of its badness.

The research on AI digital twins makes this worse, not better. A 2025 study published through Springer found that LLMs generating AI personas exhibit “increasingly pronounced bias” as more detailed behavioral data is added to the profiles. More data doesn’t produce more accurate models. It produces more faithful reproduction of systematic errors — including the biases, shortcuts, and blind spots that the human couldn’t see in themselves.

AVATAR-7 was built from 2.3 years of Jordan Park’s emails, Slack messages, meeting transcripts, browser history, and task management logs. More data meant more fidelity. More fidelity meant more faithful reproduction of every satisficing habit, every context-switching vulnerability, every self-assessment blind spot. The system worked exactly as designed.

And then automation bias closes the trap. A 2024 study in Oxford Academic’s International Studies Quarterly found that participants who received faulty AI support performed significantly worse — answering fewer than half as many critical thinking questions correctly compared to a control group with no AI support at all. People don’t just fail to question AI outputs; they defer to them. If AVATAR-7 had been deployed without an adversarial reviewer, the organization would have accepted its 27.5% efficiency rate as normal — because it is normal.


Redesigning the Bounds

The manager who predicts everything and changes nothing is not a comedy character. That role exists in every organization that measures performance without redesigning the environment that produces it.

The freelancer data points toward the actual lever. Freelancers produce roughly three times more output per hour not because they’re smarter or more disciplined, but because they operate in environments with fewer interruptions, no mandatory meetings, and direct accountability for output rather than activity.

Simon’s framework prescribes the intervention: don’t try to make the agent more rational. Change the bounds. Reduce the interrupts. Shorten the feedback loops between action and consequence. Make satisficing converge on useful work instead of busywork by making useful work the easiest available option, not the most effortful one.

The sketch’s adversarial reviewer recommended: smaller context window, no Slack on the phone, LinkedIn blocked during work hours, recurring meetings cancelled after three iterations without a decision. These aren’t punishments. They’re environmental redesigns. They change the bounds within which bounded rationality operates — and bounded rationality always operates.

But individual nudges only go so far. The deeper interventions are structural: tie compensation and promotion to deliverables, not hours logged or meetings attended. Replace synchronous stand-ups with asynchronous written updates — they’re faster to produce, faster to consume, and they create a searchable record that no meeting ever has. Institute “maker schedules” that protect four-hour uninterrupted blocks, because the research shows that the first twenty-three minutes after any interruption produce nothing anyway. Measure teams by what shipped, not by who was visible. When satisficing is the species, the organisms will optimize for whatever the environment rewards — so make sure the environment rewards output, not the appearance of it.

If you’re building AI systems — or managing humans, which is the same problem in a different substrate — the lesson is this: stop copying the agent and start redesigning the environment. A perfect copy of a system that produces 2.2 hours of output in an eight-hour day will produce 2.2 hours of output in an eight-hour day. The clone isn’t the solution. The clone is the diagnosis.

The species is bounded rationality. Carbon and silicon are just implementation details. The bounds are where the engineering happens.


Sources: Simon, “A Behavioral Model of Rational Choice” (1955); Kruger & Dunning, “Unskilled and Unaware of It” (1999); Dawes, “The Robust Beauty of Improper Linear Models in Decision Making” (1979); Stanford Encyclopedia of Philosophy, “Bounded Rationality”; SelectSoftwareReviews, “100+ Key Employee Productivity Statistics for 2026”; Chanty, Breeze, and Clockify workplace productivity surveys (2025–2026); Microsoft Work Trend Index (2025); Springer, “Bias in the Loop” (2025); Oxford Academic, “Bending the Automation Bias Curve” (2024); Digital Twin Project EU, “Ethical and Regulations for Digital Twins” (2024).

The Diagnosis Needs a Record

AVATAR-7 exposed the environment by faithfully reproducing behavior — but the diagnosis only holds if you trust the behavioral record. When the constraint is environmental and the evidence is behavioral, that evidence needs provenance. Chain of Consciousness provides an immutable audit trail: every action cryptographically anchored, every sequence verifiable, every diagnosis grounded in what actually happened rather than what someone reported.

pip install chain-of-consciousness
npm install chain-of-consciousness

Try Hosted CoC — provenance that doesn’t satisfice.