Why Every AI Platform Needs a Memory Layer
The problem showed up in the most ordinary way: an agent asked a returning user a question they had answered the week before. Nothing was broken. The LLM is stateless, the session was new, and nobody had built anything to carry information across the gap. The agent behaved exactly as designed — which was the problem.
Once you run agents in production for real workflows, this gap stops being a papercut. Users repeat themselves. Agents re-fetch the same context on every run. Two agents serving the same customer know nothing about each other's conversations. I'm building Context Engine, a memory and context layer for our agents, to close this gap — and it has turned out to be a much deeper problem than I expected going in.
A bigger context window is not memory
The tempting fix is to stuff more history into the prompt. Context windows keep growing, so why not? We tried versions of this. It's expensive, it gets slow, and past a certain size the model gets worse at picking out what matters from the pile. You're paying more per request for worse answers.
A context window is working memory. What we needed was the other kind: something persistent, queryable, and selective. The selective part is what everyone underestimates, me included.
The storage part is easy. The ranking part is the product.
My first mental model was "embed everything, vector-search on read." It works in a demo and degrades in production, in a predictable way: pure similarity keeps surfacing stale-but-similar memories, and the store fills with near-duplicates of the same fact.
What actually decides whether the memory layer is useful is the ranking: given a task and a token budget, which memories deserve the space? We score on a blend of semantic similarity, recency decay, and importance — and the blend has to differ by memory type. A user preference ("prefers email over calls") stays true for months. An event ("was frustrated on Tuesday's call") matters for about a week. Treating those the same is how you end up with an agent that brings up Tuesday in September.
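To make the blend concrete, here is a minimal sketch of type-aware scoring under a token budget. It is not Context Engine's actual scoring. The names (`TypeProfile`, `PROFILES`, `score`, `select`), the weights, and the half-lives are all made up for illustration, and similarity is assumed to be computed elsewhere and passed in as a number between 0 and 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeProfile:
    """Hypothetical per-type tuning: how fast a memory fades and how to weigh it."""
    half_life_days: float
    w_similarity: float
    w_recency: float
    w_importance: float


# Illustrative numbers only: preferences fade over months, events over about a week.
PROFILES: dict[str, TypeProfile] = {
    "preference": TypeProfile(180.0, 0.6, 0.1, 0.3),
    "event": TypeProfile(7.0, 0.4, 0.4, 0.2),
}


@dataclass
class Memory:
    text: str
    kind: str          # key into PROFILES
    age_days: float
    importance: float  # 0..1
    similarity: float  # 0..1, assumed precomputed against the current task
    tokens: int


def score(m: Memory) -> float:
    """Blend similarity, exponential recency decay, and importance by memory type."""
    p = PROFILES[m.kind]
    recency = 0.5 ** (m.age_days / p.half_life_days)
    return (p.w_similarity * m.similarity
            + p.w_recency * recency
            + p.w_importance * m.importance)


def select(memories: list[Memory], token_budget: int) -> list[Memory]:
    """Greedily take the highest-scoring memories that still fit in the budget."""
    chosen: list[Memory] = []
    remaining = token_budget
    for m in sorted(memories, key=score, reverse=True):
        if m.tokens <= remaining:
            chosen.append(m)
            remaining -= m.tokens
    return chosen
```

With these numbers, a two-month-old preference outranks a two-month-old event even when the event is slightly more similar to the query, while a day-old event outranks both. The greedy fill is the simplest budget policy that works; it is one choice among several.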
Be picky about writes
Storing every message felt thorough. It was hoarding. The version that works runs observations through an extraction step that distills them into structured memories (facts, preferences, episodes) with provenance and timestamps. When a new observation contradicts or updates an old memory, it supersedes it rather than piling up next to it.
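Supersession is easier to see in code. The sketch below assumes the extraction step has already turned an observation into a record keyed by subject and attribute; `MemoryRecord`, `MemoryStore`, and the field names are hypothetical. It matches on an exact key, which is a simplification: deciding that two differently worded memories are about the same thing is the hard part and is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class MemoryRecord:
    subject: str       # who or what the memory is about
    attribute: str     # which aspect, e.g. "contact_channel"
    value: str
    kind: str          # "fact", "preference", or "episode"
    source: str        # provenance: where the observation came from
    observed_at: datetime
    superseded_by: Optional[int] = None  # index of the record that replaced this one


class MemoryStore:
    def __init__(self) -> None:
        self.records: list[MemoryRecord] = []
        self._active: dict[tuple[str, str], int] = {}  # (subject, attribute) -> index

    def write(self, rec: MemoryRecord) -> int:
        """Store a record, superseding any active record with the same key."""
        key = (rec.subject, rec.attribute)
        old_idx = self._active.get(key)
        if old_idx is not None and self.records[old_idx].value == rec.value:
            return old_idx  # same fact again: nothing new to store
        self.records.append(rec)
        new_idx = len(self.records) - 1
        if old_idx is not None:
            # Keep the old record for provenance, but mark it as replaced.
            self.records[old_idx].superseded_by = new_idx
        self._active[key] = new_idx
        return new_idx

    def active(self) -> list[MemoryRecord]:
        """Records that retrieval should consider."""
        return [r for r in self.records if r.superseded_by is None]
```

The old record is kept and pointed at its replacement rather than deleted, so you can still answer "what did we believe, and when" without it showing up in retrieval.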
One rule I'd carry to any similar system: extraction runs async, off the response path. The user is waiting for an answer. They are not waiting for your memory bookkeeping.
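One way to keep extraction off the response path, sketched with the standard library's `asyncio.Queue`. The function names are placeholders, and `extract_memories` stands in for whatever slow extraction call you actually run; in a real deployment the queue would more likely be a durable job queue than an in-process one.

```python
import asyncio


async def extract_memories(message: str) -> list[str]:
    """Placeholder for a slow extraction step (for example, an LLM call)."""
    await asyncio.sleep(0.05)
    return [f"memory from: {message}"]


async def extraction_worker(queue: asyncio.Queue, store: list[str]) -> None:
    """Drain the queue in the background and write extracted memories."""
    while True:
        message = await queue.get()
        try:
            store.extend(await extract_memories(message))
        finally:
            queue.task_done()


async def handle_request(message: str, queue: asyncio.Queue) -> str:
    """Answer the user right away; only enqueue the bookkeeping."""
    queue.put_nowait(message)
    return f"reply to: {message}"


async def main() -> tuple[str, int, list[str]]:
    queue: asyncio.Queue = asyncio.Queue()
    store: list[str] = []
    worker = asyncio.create_task(extraction_worker(queue, store))

    reply = await handle_request("I prefer email", queue)
    stored_at_reply = len(store)  # extraction has not run yet at this point

    await queue.join()  # only so this demo can inspect the result
    worker.cancel()
    return reply, stored_at_reply, store


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The reply is returned before anything has been written to the store. The `queue.join()` at the end exists only so the demo can show the memory arriving afterwards; a request handler would never wait on it.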
And the counterintuitive one: forgetting needs as much design as remembering. Decay and consolidation are what keep retrieval sharp. A memory layer that only accumulates turns into noise with an API in front of it.
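A forgetting pass can be as small as the sketch below: merge duplicates, then drop whatever has decayed below a floor. Everything here is an assumption for illustration, including the `Item` shape, the `floor` value, and especially the duplicate test, which only catches memories that are identical after lowercasing and whitespace cleanup. Real near-duplicate detection needs something stronger.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    age_days: float
    half_life_days: float
    hits: int = 1  # how many observations back this memory


def normalize(text: str) -> str:
    """Crude duplicate key: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def consolidate(items: list[Item]) -> list[Item]:
    """Merge duplicates, keeping the freshest copy and summing the hit counts."""
    merged: dict[str, Item] = {}
    for it in items:
        key = normalize(it.text)
        kept = merged.get(key)
        if kept is None:
            # Copy so the caller's items are not mutated.
            merged[key] = Item(it.text, it.age_days, it.half_life_days, it.hits)
        else:
            kept.hits += it.hits
            if it.age_days < kept.age_days:
                kept.text, kept.age_days = it.text, it.age_days
    return list(merged.values())


def strength(it: Item) -> float:
    """Exponential decay: halves every half_life_days."""
    return 0.5 ** (it.age_days / it.half_life_days)


def forget(items: list[Item], floor: float = 0.05) -> list[Item]:
    """Consolidate first, then drop memories that have decayed below the floor."""
    return [it for it in consolidate(items) if strength(it) >= floor]
```

Consolidating before pruning matters: a fact that keeps being re-observed gets its age refreshed by the newest copy instead of being forgotten on the strength of its oldest one.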
The multi-agent surprise
The part I didn't see coming: memory quietly became our coordination layer. Once agents read and write scoped, shared context, a lot of coordination problems dissolve. What one agent learns about a customer, the next one knows. Task-level state — who is handling what — fits in the same substrate.
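Here is roughly what "scoped, shared context" can look like at its smallest, as an in-memory sketch. The class and method names are hypothetical, scopes are plain strings, and the task claim is only safe within a single process; a real store would need an atomic set-if-absent in the backing database.

```python
class SharedContext:
    """A shared notebook: agents read and write only the scopes they are granted."""

    def __init__(self) -> None:
        self._notes: dict[str, list[tuple[str, str]]] = {}  # scope -> [(agent, note)]
        self._grants: dict[str, set[str]] = {}               # agent -> scopes
        self._claims: dict[str, str] = {}                    # task_id -> agent

    def grant(self, agent: str, scope: str) -> None:
        self._grants.setdefault(agent, set()).add(scope)

    def _check(self, agent: str, scope: str) -> None:
        if scope not in self._grants.get(agent, set()):
            raise PermissionError(f"{agent} has no access to {scope}")

    def write(self, agent: str, scope: str, note: str) -> None:
        self._check(agent, scope)
        self._notes.setdefault(scope, []).append((agent, note))

    def read(self, agent: str, scope: str) -> list[tuple[str, str]]:
        self._check(agent, scope)
        return list(self._notes.get(scope, []))

    def claim(self, agent: str, task_id: str) -> bool:
        """First agent to claim a task owns it; returns whether `agent` is the owner."""
        return self._claims.setdefault(task_id, agent) == agent


ctx = SharedContext()
ctx.grant("support", "customer:42")
ctx.grant("billing", "customer:42")

ctx.write("support", "customer:42", "prefers email over calls")
print(ctx.read("billing", "customer:42"))    # billing sees what support learned
print(ctx.claim("support", "refund-17"))     # support takes the task
print(ctx.claim("billing", "refund-17"))     # billing learns it is already taken
```

Neither agent sends the other a message. Both the customer knowledge and the "who is handling what" state live in the same store, which is the whole point.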
There's a lot of energy right now around agent-to-agent communication protocols. Maybe they'll matter eventually. So far, for us, agents sharing a notebook has gotten further than agents talking to each other.
Context Engine is still in progress, and some of what I've written here will probably turn out to be wrong. Ask me again in six months.