Anthropic Just Shipped Dreaming. We Built It in February. Here's the Ten-Week Head Start in Lessons.

May 19, 2026·Jason Haugh·9 min

Claude Dreaming, agent memory architecture, AI memory tax, Anthropic Managed Agents, self-improvement agents, Dream Cycle, agent memory layer

Anthropic launched its new Dreaming feature on May 6 with a headline number: Harvey saw a roughly 6x lift in task completion in internal testing. The number is real. The framing in the launch coverage is not quite right. The 6x did not come from a smarter model. Same agent, same prompts, same Claude under the hood, except it finally remembered what it learned. I am not surprised. We built our own version of this in February and ran it for ten weeks before Anthropic shipped Dreaming. Here is what the head start taught us, here is what the launch coverage missed, and here are the five questions every AI tool buyer should be asking this week.

The thing nobody has named yet is the memory tax. It is the silent cost every AI deployment that resets between sessions is paying right now. Your agent forgets every Monday what it learned last Friday. The tax shows up as repeated questions, lost context, prompts that relitigate yesterday's decisions, and a slow drag on every workflow that should be compounding. The May 6 launch matters because that tax finally has a managed answer, not a DIY one. That distinction is the whole story.

What Dreaming actually is

Dreaming shipped at Code with Claude on May 6, 2026. It is a scheduled background process that reviews session history and writes new memory entries. It does not modify model weights. It writes plain-text notes and structured playbooks that the next session reads.

Two other features shipped the same day. Outcomes is a self-grading loop that scores agent runs and feeds the results back into the next cycle. Multiagent Orchestration handles parallel coordination between agents. The bundle is available on Managed Agents Team and Enterprise plans.

The numbers Anthropic put on the launch: Harvey saw roughly a 6x lift in task completion in internal testing. Wisedocs reported about 50 percent faster processing on the same internal benchmarks. The Outcomes gains landed in the single-digit percentage point range on document processing. These are Anthropic-reported internal tests, not independent third-party benchmarks, so treat them as directionally credible but not yet validated by an outside party.

So far, so launch post.

The part the coverage missed: this is a managed product, not a model breakthrough

The 6x reads like a model upgrade. It is not. Harvey is running the same Claude as before. The agents had the same prompts. The difference is they remembered what they learned across sessions instead of starting cold every time.

The breakthrough is that auto-curation of agent memory moved from a DIY engineering project into a managed runtime that ships with the platform. The technical pattern has existed for two years. LangMem ships a memory layer. mem0 ships a memory layer. Custom Postgres and Redis stores power half the production agent stacks people have built. What changed on May 6 is that the curation became managed. Sandboxing, state, credentials, recovery, review, and the curation pass itself are now platform features instead of three to six months of internal engineering per agent.

That is the actual news. The 6x is the proof that a working memory layer compounds. The product story is that you no longer have to build the layer yourself to get the lift.

What we built in February

In late February we shipped our own version. The March 24 piece, "We Built Self-Improvement Into Our AI Stack Before It Was a Skill", covers the build. The short version: Oscar, our Chief of Staff agent, runs a nightly Dream Cycle. It is a scheduled background job that reads the day's session transcripts, extracts patterns, writes new entries to a structured memory store, and stages updates that the next morning's session can read. Built on Claude Code, not Managed Agents. Less polished than what Anthropic shipped. Same principle.
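For readers who want the shape of that loop in code, here is a minimal sketch. It is not Oscar's actual implementation and not Anthropic's. The MemoryStore class, the "LESSON:" tag convention, and the function names are all assumptions made up for illustration. A real cycle would most likely ask a model to summarize each transcript rather than match a tag.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Hypothetical in-memory stand-in for a structured memory store."""
    entries: dict[str, int] = field(default_factory=dict)  # pattern -> times seen
    staged: list[str] = field(default_factory=list)        # notes for the next session


def extract_patterns(transcript: str) -> list[str]:
    """Toy extractor (assumption): lines tagged 'LESSON:' are worth keeping."""
    return [
        line.split("LESSON:", 1)[1].strip()
        for line in transcript.splitlines()
        if "LESSON:" in line
    ]


def run_dream_cycle(transcripts: list[str], store: MemoryStore) -> list[str]:
    """Read the day's transcripts, record patterns, and stage the new ones."""
    seen_today = Counter(p for t in transcripts for p in extract_patterns(t))
    newly_staged = []
    for pattern, count in seen_today.items():
        is_new = pattern not in store.entries
        # Every sighting is counted, whether or not the pattern is new.
        store.entries[pattern] = store.entries.get(pattern, 0) + count
        if is_new:
            newly_staged.append(pattern)
    # Only patterns the store had not seen before are staged for tomorrow.
    store.staged = newly_staged
    return newly_staged
```

The design choice worth noticing is the split between recording and staging: the store counts every sighting, but the next session only gets handed what is new.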

We ran it in production for ten weeks before Dreaming launched. That taught us a few things you cannot learn from a launch post.

The task types that benefit most are recurring workflows with stable parameters. The daily content pipeline (strategy brief from Muse, draft from Ink, QA pass from Oscar, then publish) is exactly the shape of work that memory compounds on. Each cycle gets a little faster, because the patterns that worked yesterday are sitting in the store.

The patterns the cycle kept relearning until we added a corrections file: homophone catches in Jason's speech-to-text input, family-message sleep-hour rules, the difference between "talking" and "doing" in ambiguous prompts. Once a pattern lives in a file the next session reads, the agent stops paying the memory tax on it.

The patterns the cycle handled cleanly without intervention: most operational reminders, project-status tracking, the small set of rules about which channels we never cross-post to. The boring stuff was the easy stuff.

That is the lived-experience anchor. Concrete, dated, and published.

What Anthropic got right

Three calls, all good ones.

Managed runtime kills the engineering tax. We spent weeks building Oscar's Dream Cycle. The Managed Agents version handles sandboxing, state, credentials, and recovery as platform features. Teams that have not already built a memory layer should not start now. They should evaluate Managed Agents and route their effort somewhere that compounds.

Human review as a configurable default is the right call. Anthropic shipped a toggle on day one: memory updates can auto-apply or require review. We learned the hard way that auto-apply on a new pattern is risky for the first week, then becomes the obvious default once you trust the curation. Shipping the toggle is the difference between a memory system you can audit and one you cannot.
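A toggle like this is simple to picture in code. The sketch below is an assumption about the general shape, not Anthropic's API: ReviewMode, MemoryWriter, and their methods are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class ReviewMode(Enum):
    AUTO_APPLY = "auto_apply"
    REQUIRE_REVIEW = "require_review"


@dataclass
class MemoryWriter:
    """Hypothetical writer that either commits notes directly or queues them."""
    mode: ReviewMode
    committed: list[str] = field(default_factory=list)
    pending: list[str] = field(default_factory=list)

    def propose(self, note: str) -> None:
        # In auto-apply mode a proposed note is committed immediately;
        # otherwise it waits in the pending queue for a human decision.
        if self.mode is ReviewMode.AUTO_APPLY:
            self.committed.append(note)
        else:
            self.pending.append(note)

    def approve(self, note: str) -> None:
        self.pending.remove(note)
        self.committed.append(note)

    def reject(self, note: str) -> None:
        self.pending.remove(note)
```

Starting in REQUIRE_REVIEW and switching to AUTO_APPLY once the curation has earned trust matches the first-week experience described above.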

Outcomes pairs with Dreaming on purpose. The self-grading loop is the missing piece in most DIY stacks. Our partial version routes QA reviews into a known-issues file, and the next cycle reads them. Outcomes formalizes it. The pairing matters. A memory system without grading drifts. A grading loop without memory just relearns the same lessons every week. Shipping them together is the right architectural call.

What is still missing

Three gaps the launch coverage has not flagged yet. These are not criticisms. They are the next set of questions to ask Anthropic.

Eviction policy is unclear. What does the agent forget? When does an old pattern get pruned? In ten weeks of running our own version, eviction was the second-hardest problem after curation itself. Stale patterns interfere with new ones. We solved it with explicit decay rules in the memory store. Anthropic's docs do not currently specify the eviction algorithm or the operator's control over it.
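To make "explicit decay rules" concrete, here is one common form, sketched under assumptions. The half-life formula, the 14-day default, the 0.5 threshold, and the Pattern fields are illustrative choices, not the rules we run and not anything Anthropic has documented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Pattern:
    text: str
    hits: int          # how many times the pattern has proven useful
    last_used: date    # last day a session relied on it


def decayed_score(p: Pattern, today: date, half_life_days: float = 14.0) -> float:
    """Halve a pattern's weight for every half-life that passes unused."""
    age_days = (today - p.last_used).days
    return p.hits * 0.5 ** (age_days / half_life_days)


def evict(
    patterns: list[Pattern],
    today: date,
    min_score: float = 0.5,
    half_life_days: float = 14.0,
) -> list[Pattern]:
    """Keep only patterns whose decayed score still clears the threshold."""
    return [
        p for p in patterns
        if decayed_score(p, today, half_life_days) >= min_score
    ]
```

A frequently used pattern survives a long quiet stretch; a one-off fades within a few weeks. Whatever the actual rule, the point is that an operator can read it and tune it.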

Cross-agent memory sharing is not addressed. We run eight agents (Oscar, Guru, and six specialists). The question that matters in a multi-agent stack is whether Agent A's learning becomes available to Agent B. Our solution is a shared structured-memory store that every agent reads from. The launch docs for Managed Agents do not currently describe an equivalent primitive.

Multi-week pattern detection is unclear. Session-by-session curation is the easy case. The harder case is a pattern that only emerges across thirty sessions, where no individual session shows the signal. We added a weekly second-tier review pass for exactly this. The Dreaming docs do not yet describe whether the curation window aggregates beyond a single session.
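A second-tier pass of this kind can be very small. The sketch below assumes each session has already been reduced to a set of weak signals (short strings). The function name and the three-session threshold are hypothetical and are not a description of our production pass.

```python
from collections import Counter


def weekly_review(session_signals: list[set[str]], min_sessions: int = 3) -> list[str]:
    """Promote signals that recur across sessions, even if each sighting is weak.

    Each session contributes a set, so a signal repeated within one session
    still counts only once toward the cross-session total.
    """
    counts = Counter(signal for signals in session_signals for signal in signals)
    return sorted(s for s, n in counts.items() if n >= min_sessions)
```

No single session clears the bar on its own. Only the aggregate does, which is exactly the case that session-by-session curation misses.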

If you are evaluating Managed Agents this quarter, those are three things to ask about before you sign.

The buyer's filter: five questions to ask any AI tool that claims memory

We run a tool evaluation pass for Clelp's catalog every week. Memory is the category that just got interesting. These are the five questions we ask now.

One thing a single-agent setup will not catch: when Agent A fails silently and nobody is watching. In our stack, Muse runs the morning strategy brief. Oscar reads it before Jason does. Three weeks into the build, Muse's briefing cron started returning an empty payload instead of an error. No alert fired. Jason would never have known. Oscar caught the missing brief before the content pipeline started for the day and flagged it. A single-agent system logs nothing and moves on. The cross-agent layer is not just about memory sharing. It is what turns a silent failure into a surfaced one.

1. Does it persist between sessions or just within a session? Most "memory" features are context-window tricks. Real memory survives a process restart and a new chat.

2. Does it auto-curate or just accumulate? Accumulation without curation is just longer context. Curation is the part that actually makes the agent smarter, and it is the part most products quietly skip.

3. Can you review and edit before it commits? A memory system you cannot audit is a black box that drifts. The review toggle is non-negotiable for any regulated workflow.

4. What is the eviction policy? If the answer is "we do not have one" or "it figures itself out," the tool will accumulate stale patterns until it gets noticeably dumber. Ask for the algorithm.

5. Can you export what your agent learned? Lock-in is the silent risk. Memory that lives only inside a vendor's runtime becomes a captive asset. Export to plain text or structured JSON should be a feature, not a request.
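For question 5, this is roughly what a usable export looks like: a versioned, human-readable file that another runtime can load back. The envelope format and function names are assumptions for illustration. No vendor's export format is being described here.

```python
import json


def export_memory(entries: list[dict]) -> str:
    """Serialize memory entries to structured JSON with a version marker."""
    return json.dumps({"version": 1, "entries": entries}, indent=2, sort_keys=True)


def import_memory(payload: str) -> list[dict]:
    """Load entries back from an exported payload."""
    return json.loads(payload)["entries"]
```

If a tool cannot round-trip its memory through something this plain, what your agent learned belongs to the vendor.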

That is the filter. It works on Managed Agents. It works on LangMem and mem0. It works on whatever ships next quarter. Save it.

The next six to twelve months

Self-improvement is moving from lab feature to table stakes. In six to twelve months, an agent platform without auto-curated memory will be technical debt. The reason is not vibes. The lift from memory is large enough (the Harvey 6x is the headline; our own runs show smaller but consistent gains on recurring workflows) that buyers will start filtering for it.

The buyer's job right now is to recognize which products have a real implementation and which have a marketing slide. The five questions are the filter. The memory tax is the cost of getting it wrong.

Compare memory features on Clelp

We built Clelp to be the place where someone evaluating an AI tool can compare features that actually matter. Memory persistence and self-improvement are about to be two of those features. We index over 7,900 tools and we tag agent runtimes with memory and curation features where they are documented.

Start with the agent runtimes category to compare Managed Agents against the rest of the field. Look at memory layers if you are building your own stack and want to put LangMem, mem0, and the others side by side. Check observability tools to see what you can actually instrument on memory updates. The AI for operations category covers the workflow tools where memory tends to show up first.

If you want the weekly operator read on what shipped and what it changes, subscribe to the Clelp newsletter.