
Conclusions without the path that produced them

Post 2 of 3 on the architectural limits of current AI systems. The first post argued that LLMs can't modify the axioms of their conceptual space. This one is about a more mundane but equally structural problem. They can't remember how they got anywhere.


Every memory system attached to an LLM right now preserves the same wrong thing. RAG, MemGPT [1], Auto Dream, the long-term memory features the major labs have started shipping. All of them preserve conclusions.

What they don't preserve is the trajectory that produced the conclusion. The dead ends. The half-formed intuitions. The moment you noticed an anomaly and filed it away. The reason you ruled out the obvious approach three weeks ago. That stuff isn't in the chat history because the chat history isn't structured to hold it. It isn't in the embedding store because semantic similarity is the wrong distance metric for it.

This sounds like a quibble about UX. It's the reason no current system can do sustained research.

The Hamming argument

In You and Your Research [2], Hamming describes the working pattern he watched in Shannon, Shockley, and the other great scientists he studied closely over decades at Bell Labs. They kept ten to twenty important problems live in mind at all times. When a new technique or observation appeared, they immediately matched it against that working set. The match was fast and unconscious. The prepared mind isn't searching; it's filtering. Most things bounced off. Occasionally one snapped into a problem that had been waiting for it, and the scientist dropped everything and rode the connection.

The whole pattern depends on persistent problem state. Not the problem statement. Anyone can re-read that. The state of attempted attacks, the texture of where each one broke down, the exact shape of the gap between what's known and what's needed. This is what the prepared mind is prepared with.

Hamming was specific about how this state got built. The scientists he admired weren't just smart. They were committed. They worked on a small set of problems for years, kept their subconscious starved of distractions so it would keep grinding on the open questions overnight, and stayed with the work long past the point where most people would have moved on. The ambient problem state wasn't a side effect of their work. It was the substrate that made the work possible. And Hamming kept describing this substrate in emotional terms. Commitment. Drive. Courage. Ambiguity tolerance. He believed machines could in principle do everything, but he never reconciled that belief with the fact that every actual great scientist he listed was running on emotional fuel.

Current LLMs have nothing of the kind. Every session is a fresh start. The "memory" that's been bolted on remembers facts ("the user works on X", "the model name is Y") but not reasoning trajectories ("we tried this approach, it failed because of Z, and the failure suggested W might be the actual blocker"). When a new observation comes in, there's no working set to match it against. There's a context window, which gets cleared between sessions, and a vector store, which retrieves by the wrong similarity metric.

Why semantic retrieval is the wrong primitive

The standard fix is RAG. Embed everything the model has ever said, retrieve by cosine similarity to the current query, paste the top-k into context. This works well for factoid retrieval, where you have a question, an answer exists somewhere in your corpus, and surface similarity between the question and the answer is high. "What did we decide about the database schema?" The answer probably contains the words "database" and "schema" and is therefore retrievable.

It fails on reasoning trajectories because the right thing to retrieve is rarely surface-similar to the current query.

If I'm stuck on a sub-problem in distributed consensus, the relevant memory might be a half-finished thought from three weeks ago about why a particular failure mode in a database paper looked structurally identical to one I was hitting then. The query and the relevant memory share almost no surface vocabulary. They share structural position in the reasoning graph. Both are cases where a coordination assumption was masking an underlying ordering problem. Cosine similarity over text embeddings cannot see this. The embedding for "leader election under partial synchrony" and the embedding for "two-phase commit timeout interactions with backoff" are far apart in vector space, even though the reasoning about both can hit the same wall for the same structural reason.
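The contrast between the factoid case and the trajectory case can be made concrete with a toy retriever. This is a minimal sketch, not a real RAG stack: it assumes bag-of-words vectors as a crude stand-in for learned embeddings, and the stored memories are hypothetical. A real embedding model captures more than word overlap, so the scores would not be exactly zero, but the ranking signal is still what the text says, not where it sits in the reasoning.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, corpus: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Rank stored memories by similarity to the query and keep the best k."""
    q = bow(query)
    scored = [(cosine(q, bow(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical memory store: one factoid and one reasoning fragment.
memories = [
    "We decided the database schema keeps one table per tenant.",
    "Two-phase commit timeout interactions with backoff hid an ordering bug.",
]

# Factoid query: shares "database" and "schema" with the first memory.
factoid_hits = top_k("What did we decide about the database schema?", memories, k=1)

# Trajectory query: structurally related to the second memory, no shared words.
stuck_hits = top_k("leader election under partial synchrony", memories, k=2)
```

The factoid query lands on the schema decision. The consensus query scores every memory the same, so the ordering-bug fragment it actually needs has no way to surface.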

Zhao et al.'s AMA-Bench [3], a recent benchmark on long-horizon memory for agentic applications, makes this empirical. The authors find that existing memory systems underperform specifically because similarity-based retrieval is lossy and lacks the causality and objective information that agent trajectories actually carry. The signal you need to retrieve on isn't what was said but where it sits in the causal structure of the reasoning. Embeddings discard that structure at the outset: they compute a similarity metric over the surface form of language, which is the layer Hamming's prepared mind explicitly filters past.

This isn't a niche failure mode. It's the failure mode that determines whether a system can ever do research that takes longer than a single session.

What the right primitive looks like

Two pieces, treated as a single architecture.

Start with the trajectory and result store. Not a transcript. A structured record of what hypothesis was being entertained, what action was taken, what came back, what got updated as a result, what new unknowns surfaced. Each entry is a node in a reasoning graph with explicit edges. Provenance is mandatory. Every claim points back to the observation or inference that produced it. Storage is cheap. The discipline is in the schema.

The schema needs at minimum: hypothesis being tested, action taken, result observed, status update on prior beliefs, new unknowns generated. Optionally: which prior trajectory entry this one descended from, which axiom of the current understanding it bears on, what would constitute a refutation. The format isn't the hard part. The hard part is committing to writing down the path, not just the conclusion, every time the model takes an action.
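One way to pin the schema down is a plain record type. This is a sketch under assumptions: the class name, field names, and the validation rule are all hypothetical choices, not a published format. The required fields follow the minimum list above, and the optional ones follow the "optionally" list.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrajectoryEntry:
    """One node in the reasoning graph (hypothetical field names)."""

    entry_id: str
    hypothesis: str            # hypothesis being tested
    action: str                # action taken
    result: str                # result observed
    belief_updates: list[str]  # status updates on prior beliefs
    new_unknowns: list[str] = field(default_factory=list)
    # Optional links and context.
    parent_id: Optional[str] = None       # entry this one descended from
    bears_on_axiom: Optional[str] = None  # axiom of current understanding it touches
    refutation: Optional[str] = None      # what would count as a refutation

    def __post_init__(self) -> None:
        # The path is the point: refuse entries that record only a conclusion.
        for name in ("hypothesis", "action", "result"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")


# Hypothetical entry in the spirit of "we tried this, it failed because of Z".
entry = TrajectoryEntry(
    entry_id="t2",
    hypothesis="Stalls are caused by leader election timeouts",
    action="Pinned a fixed leader and reran the workload",
    result="Stalls persisted with no elections taking place",
    belief_updates=["leader election is not the blocker"],
    new_unknowns=["is message ordering the real problem?"],
    parent_id="t1",
    refutation="stalls disappear once elections are removed",
)
```

The validation is deliberately blunt. It only enforces that the three path fields are filled in, which is the discipline the schema exists to impose.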

The other piece is a state-of-understanding document, always loaded. Not retrieved on demand. Resident. The model starts every session with the current best summary of where the project stands: what's known, what's hypothesized, what's been tried and failed, what's currently the most promising thread. This is the analog of the ten-to-twenty problems Hamming's scientists kept in mind. It is small enough to fit in context. It is updated at the end of each session by re-summarizing the trajectory store.

The "always loaded" part matters. The state-of-understanding document doesn't sit in a vector database waiting to be retrieved. It is the first thing in the model's context every time it boots up, the same way the prepared mind doesn't have to search for its open problems before recognizing them.
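A sketch of what "resident, not retrieved" could mean in code. Everything here is an assumption for illustration: the header strings, the status labels, the character budget, and the rule-based `refresh_state`, which stands in for the re-summarization a real system would presumably ask the model itself to do.

```python
STATE_HEADER = "## Current state of understanding"
SESSION_HEADER = "## This session"


def refresh_state(entries: list[dict]) -> str:
    """End-of-session rebuild of the resident document from the trajectory store."""
    sections: dict[str, list[str]] = {"Known": [], "Tried and failed": [], "Open": []}
    for e in entries:
        if e["status"] == "confirmed":
            sections["Known"].append(e["hypothesis"])
        elif e["status"] == "refuted":
            # Keep the reason, not just the verdict.
            sections["Tried and failed"].append(
                f'{e["hypothesis"]} (because: {e["result"]})'
            )
        else:
            sections["Open"].append(e["hypothesis"])
    lines: list[str] = []
    for title, items in sections.items():
        lines.append(f"{title}:")
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


def build_context(state_doc: str, session_input: str, max_chars: int = 8000) -> str:
    """Put the resident document first, unconditionally; nothing is retrieved."""
    if len(state_doc) > max_chars:
        raise ValueError("state document outgrew its budget; re-summarize before loading")
    return f"{STATE_HEADER}\n{state_doc}\n\n{SESSION_HEADER}\n{session_input}"


# Hypothetical trajectory entries from earlier sessions.
entries = [
    {"hypothesis": "Stalls come from leader election", "status": "refuted",
     "result": "stalls persist with a fixed leader"},
    {"hypothesis": "Message ordering is the real blocker", "status": "open",
     "result": ""},
]
state = refresh_state(entries)
context = build_context(state, "New observation: retries arrive out of order.")
```

The size check is the part worth keeping from this sketch. If the document no longer fits, the answer is to re-summarize, not to fall back to retrieving pieces of it.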

Retrieval becomes a backstop, not the main mechanism. When the resident document references something the session needs to expand on, the trajectory store is searched, but searched by graph position, not by embedding similarity. "Show me everything downstream of this hypothesis." "Show me other times we hit a failure mode of this structural type." "Show me every dead end we ruled out and why." These are graph queries, not nearest-neighbor lookups.
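The three example queries can be sketched over a tiny in-memory store. The entries, the `failure_type` labels, and the function names are hypothetical, and the sketch assumes the structural label was assigned when the entry was written, which is the genuinely hard part and is not solved here.

```python
from collections import deque

# Hypothetical trajectory store: id -> entry, with an explicit parent edge.
store = {
    "t1": {"parent": None, "status": "open", "failure_type": None,
           "note": "consensus stalls under partial synchrony"},
    "t2": {"parent": "t1", "status": "dead_end",
           "failure_type": "ordering_masked_by_coordination",
           "note": "tuned election timeouts; stalls persisted"},
    "t3": {"parent": "t2", "status": "open", "failure_type": None,
           "note": "check message ordering directly"},
    "t4": {"parent": None, "status": "dead_end",
           "failure_type": "ordering_masked_by_coordination",
           "note": "2PC timeout and backoff tuning did not help"},
}


def downstream(store: dict, root_id: str) -> list[str]:
    """Everything that descends from root_id, breadth-first."""
    children: dict = {}
    for entry_id, entry in store.items():
        children.setdefault(entry["parent"], []).append(entry_id)
    found, queue = [], deque(children.get(root_id, []))
    while queue:
        entry_id = queue.popleft()
        found.append(entry_id)
        queue.extend(children.get(entry_id, []))
    return found


def same_failure_type(store: dict, failure_type: str) -> list[str]:
    """Other times a failure of this structural type was hit."""
    return [eid for eid, e in store.items() if e["failure_type"] == failure_type]


def dead_ends(store: dict) -> list[tuple[str, str]]:
    """Every ruled-out path, paired with the recorded reason."""
    return [(eid, e["note"]) for eid, e in store.items() if e["status"] == "dead_end"]
```

Note that `t2` and `t4` describe different systems in different words. They come back together only because the query runs on the structural label, not on the text of the note.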

What current systems do that almost works

Anthropic's Auto Dream is the closest thing to a real implementation of any of this [4]. Claude Code consolidates session memory between runs in a pass explicitly named after REM sleep, in four phases: orient (read existing memory), gather signal (search transcripts for corrections, patterns, decisions), consolidate (merge, prune stale, resolve contradictions), and prune and index. It's a real step. It's also a v0.

What Auto Dream consolidates is facts. Decisions made, conventions adopted, corrections issued. It does not capture trajectories. It does not preserve the texture of why a decision was made or what alternatives were considered and ruled out. It merges and prunes; it doesn't graph and link.

The full version would extend Auto Dream with three things: trajectory preservation alongside fact consolidation, occasional stochastic recombination during the consolidation phase (so unrelated memories occasionally get linked and tested for novel resonance), and direction evaluation (does the overall research trajectory look like it's converging, or does it need a reframe). The first one is the most important. Without trajectory preservation, you're consolidating the destination and losing the map.

This isn't an architectural revolution. It's a discipline about what to write down. The reason it isn't standard yet is that the field anchored on RAG early, and RAG made the wrong primitive feel like the natural one. It isn't. It's the primitive that gives you a librarian when you need a collaborator.

The collaborator remembers the path. The librarian remembers the conclusion. Different jobs.


Next post: why intuition is a cache and incubation isn't a luxury, and what current "always-on" inference gets wrong about the compute regime real cognition runs on.


References

[1] Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as Operating Systems. https://arxiv.org/abs/2310.08560

[2] Hamming, R. W. (1986). You and Your Research. Bell Communications Research Colloquium Seminar transcript. https://www.cs.virginia.edu/~robins/YouAndYourResearch.html

[3] Zhao, Y., Yuan, B., Huang, J., Yuan, H., Yu, Z., Xu, H., Hu, L., Shankarampeta, A., Huang, Z., Ni, W., Tian, Y., & Zhao, J. (2026). AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications. https://arxiv.org/abs/2602.22769

[4] Anthropic. (2026). Long-running Claude for scientific computing. Anthropic Science Blog. (Discusses the Ralph Loop pattern and Auto Dream-style consolidation for long-running agent workflows.) https://www.anthropic.com/research/introducing-anthropic-science