LLMs Don't Have Memory. So How Do They Remember?
You ask ChatGPT or Claude a question. It answers perfectly, referencing context from three messages back. You think: "it remembered." But if you open a new browser tab and start a fresh conversation, all of that is gone. The model does not know you. It has no recollection of anything you discussed. You are a stranger.
This is not a bug. It is not an oversight. It is the fundamental architecture of how Large Language Models work. LLMs are stateless functions. They take tokens in. They produce tokens out. Between one invocation and the next, they retain nothing. Given the same input, a call today behaves exactly like the first call ever made to that model.
So why does it feel like the model is remembering? How does your AI coding assistant know your preferred language? How does a customer support bot remember your ticket number from last week? The answer is not memory inside the model; it is a carefully engineered illusion of memory built around the model.
🧠 TL;DR
- LLMs are stateless. Every call starts fresh: the model itself stores nothing between conversations. What looks like memory is always injected into the context window.
- There are 4 types of memory: Parametric (weights), In-context (active window), Episodic (external history), and Semantic (vector knowledge base).
- The context window is the model's only true working memory. Everything else is just pre-loading the right information before the model reads it.
- At 100K context length, external memory becomes more cost-efficient than long-context models after roughly 10 conversation turns.
- Agentic memory, where the agent decides what to store, recall, and forget, is the current frontier. A-MEM (NeurIPS 2025) showed state-of-the-art results using Zettelkasten-style interconnected memory networks.
The Goldfish Problem: Why LLMs Forget by Design
Imagine you have a brilliant consultant. They know everything about software engineering: 20 years of accumulated knowledge stored in their head. But they have a condition: they cannot form new memories. Every meeting starts from scratch. They bring their expertise, but not their history with you.
That is exactly how an LLM works. The "20 years of knowledge" is parametric memory, baked into the model's weights during training. That knowledge is permanent and enormous. But anything that happens during a conversation? Gone the moment the session ends. Not stored. Not referenced. Mathematically unreachable.
The model's weights encode a vast compression of the training data. That knowledge does not disappear between calls. What disappears is everything that happened after training โ including your entire conversation history with that model. It is the difference between long-term memory and working memory, and LLMs only natively have the former.
⚠️ Common misconception: "The model remembers what I said earlier in this conversation." It does not. It can read what you said earlier because earlier messages are included in the current context window. The moment that window closes, none of it persists. There is no background process saving your conversation to the model's weights.
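A toy sketch makes the point concrete. Here a plain Python function stands in for the model (no real LLM involved): statelessness means the output depends only on the context passed in, so "remembering" is entirely the caller's job.

```python
# Toy illustration, NOT a real LLM: a stateless model is just a pure
# function of its input tokens. Nothing persists between calls.
def toy_model(context: str) -> str:
    # The "model" can only answer from what is visible in `context`
    if "my name is Ada" in context:
        return "Hello, Ada!"
    return "I don't know your name."

# Call 1: the context contains the user's name
print(toy_model("User: my name is Ada"))              # Hello, Ada!

# Call 2: a fresh call with no history injected. There is no trace
# of call 1 anywhere inside the function.
print(toy_model("User: what's my name?"))             # I don't know your name.

# "Memory" = the caller re-injecting history into the context:
history = "User: my name is Ada\nUser: what's my name?"
print(toy_model(history))                             # Hello, Ada!
```

Every memory technique in the rest of this article is a variation on that last line: deciding what history to re-inject.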
The 4 Types of Memory in Modern AI Systems
Before we look at how memory is engineered around LLMs, we need a shared vocabulary. There are 4 distinct memory types in modern AI architectures:
Type 1 – Parametric Memory (Weights): Knowledge encoded into the model's billions of parameters during training. Includes language, reasoning, facts, code, and common sense. Fixed after training: you cannot update it at runtime without fine-tuning. Analogous to a person's education and expertise.
Type 2 – In-Context Memory (Context Window): The active working memory. Every token currently visible to the model: system prompt, conversation history, retrieved documents, tool outputs. Finite, expensive, and temporary. Cleared at session end. Like your current train of thought.
Type 3 – Episodic Memory (Conversation History): Past conversations stored externally, in a database, vector store, or summary cache. Retrieved and injected into the context window before the model reads them. The model does not "remember" them; it reads them fresh each time. Like a diary you read before each meeting.
Type 4 – Semantic Memory (External Knowledge): Domain knowledge, documents, and facts stored in vector databases or knowledge graphs. Retrieved via similarity search or graph traversal. The foundation of RAG systems. Completely external to the model's weights. Like a library your assistant can search.
Most real AI applications use all 4 types simultaneously. The art is deciding which memory type to use for which problem.
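As a rough sketch of how the types combine (store names and contents are invented for illustration), here is a prompt assembled from the three injectable memory types. Parametric memory needs no injection because it already lives in the weights.

```python
# Hypothetical stores; contents are illustrative only.
episodic_store = {"user_42": ["Last week you filed ticket #1234 about login errors."]}
semantic_store = ["Password resets are handled at /account/reset."]

def assemble_context(user_id: str, query: str) -> str:
    """Build a prompt from the three injectable memory types.
    (Type 1, parametric memory, lives in the weights and needs
    no injection.)"""
    parts = []
    # Type 3: episodic memory - past conversations, read back in
    parts += episodic_store.get(user_id, [])
    # Type 4: semantic memory - external knowledge; real systems use
    # vector search, this sketch fakes retrieval with keyword overlap
    parts += [doc for doc in semantic_store
              if set(query.lower().split()) & set(doc.lower().split())]
    # Type 2: in-context memory - the live query itself
    parts.append(f"User: {query}")
    return "\n".join(parts)

print(assemble_context("user_42", "how do I reset my password"))
```

The model sees one flat string; the layering exists only in the infrastructure that built it.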
The Context Window: The Only Memory That Is Real
Here is the one fact that makes all AI memory engineering make sense: the context window is the only thing the model can actually see. Everything else (conversation history, retrieved documents, user preferences, external databases) only matters if it gets injected into the context window before the model generates a response.
This is not a limitation being worked around. It is the mechanism. Every "memory" system for LLMs is fundamentally a system for deciding: what do we put into this context window right now, for this specific query?
Every token in the context window costs money at inference time, and the budget is finite. A 200K-token context window sounds enormous, but fill it with irrelevant conversation history and the model's effective reasoning quality drops sharply. The research term for this is "context poisoning": irrelevant or stale information in the context actively degrades response quality.
💡 Context window sizes in 2026: Claude supports 200K tokens. Gemini 2.5 Pro reaches 2M tokens. GPT-4.5 provides 256K tokens. These numbers are real, but filling them entirely is rarely optimal. Retrieval-augmented context, where only the relevant subset of a large document set is injected, consistently outperforms "stuff everything in" approaches.
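The "relevant subset" idea reduces to a budgeting problem. Here is a minimal sketch, assuming a crude word-count tokenizer and pre-computed relevance scores (a real system would use a proper tokenizer and embedding similarity):

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: ~1 token per word
    return len(text.split())

def pack_context(candidates: list[tuple[float, str]], budget: int) -> list[str]:
    """Greedy packer: take snippets in descending relevance until the
    token budget is spent. Scores here are given; real systems compute
    them with embeddings."""
    chosen, used = [], 0
    for score, snippet in sorted(candidates, reverse=True):
        cost = count_tokens(snippet)
        if used + cost <= budget:
            chosen.append(snippet)
            used += cost
    return chosen

candidates = [
    (0.9, "User prefers Python for backend services."),
    (0.2, "Unrelated chat about the weather last Tuesday, long and stale."),
    (0.8, "Open ticket: login errors after password reset."),
]
print(pack_context(candidates, budget=15))
```

The low-relevance weather snippet is dropped even though it would fit a larger budget: that is context poisoning being filtered out before the model ever sees it.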
How AI Memory Is Actually Engineered
If the model has no memory, and everything must go through the context window, then memory engineering is the discipline of deciding what gets injected into that window, when, and how much of it. There are 3 principal techniques.
Technique 1 – Full History Injection: The simplest approach: append every previous message to the current context. The model reads the entire conversation every time. This works perfectly up to the context limit, then fails catastrophically when the conversation grows too long. Cost grows linearly with every turn.
from anthropic import Anthropic

client = Anthropic()

conversation_history = []  # grows with every turn

def chat(user_message: str) -> str:
    conversation_history.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=conversation_history  # ALL history, every time
    )
    assistant_message = response.content[0].text
    conversation_history.append({"role": "assistant", "content": assistant_message})
    return assistant_message

# Works great for short sessions.
# Fails when conversation_history exceeds the context window.
# Cost grows linearly with every turn.

Technique 2 – Summarization + Rolling Window: When history gets too long, summarize it. Keep the last N messages verbatim. Compress everything older into a summary. Inject both into the next context. Cost is now bounded: O(window) instead of O(total turns).
from anthropic import Anthropic

client = Anthropic()

WINDOW_SIZE = 20  # keep last 20 messages verbatim

def summarize_old_messages(messages: list) -> str:
    """Use Claude to compress old conversation into a summary."""
    summary_prompt = "Summarize the following conversation concisely. Preserve key facts, decisions, and user preferences:\n"
    for m in messages:
        summary_prompt += f"{m['role'].upper()}: {m['content']}\n"
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=512,
        messages=[{"role": "user", "content": summary_prompt}]
    )
    return response.content[0].text

def build_context(full_history: list) -> tuple[str, list]:
    if len(full_history) <= WINDOW_SIZE:
        return "", full_history
    old = full_history[:-WINDOW_SIZE]
    recent = full_history[-WINDOW_SIZE:]
    summary = summarize_old_messages(old)
    return summary, recent

# The system prompt gets the summary injected:
# "Summary of previous conversation: [summary]"
# Then recent messages follow as normal.

Technique 3 – External Memory + Semantic Retrieval: The most powerful approach: store memories externally in a vector database. When a new query arrives, embed it and retrieve only the most relevant memories: specific facts, past decisions, and user preferences semantically related to the current question.
from anthropic import Anthropic
import chromadb
import hashlib

client = Anthropic()
db = chromadb.Client()
memory_collection = db.get_or_create_collection("user_memories")

def save_memory(user_id: str, content: str, mem_type: str):
    """Persist a memory to the vector store after a conversation turn."""
    # Stable content hash; Python's built-in hash() is salted per
    # process, which would produce different ids across restarts
    content_id = hashlib.sha256(content.encode()).hexdigest()[:16]
    memory_collection.add(
        documents=[content],
        metadatas=[{"user_id": user_id, "type": mem_type}],
        ids=[f"{user_id}_{content_id}"]
    )

def retrieve_memories(user_id: str, query: str, k: int = 5) -> list[str]:
    """Retrieve the k most relevant memories for this query."""
    results = memory_collection.query(
        query_texts=[query],
        n_results=k,
        where={"user_id": user_id}
    )
    return results["documents"][0]

def chat_with_memory(user_id: str, user_message: str) -> str:
    relevant_memories = retrieve_memories(user_id, user_message)
    memory_block = "\n".join(f"- {m}" for m in relevant_memories)
    system = f"""You are a helpful assistant with memory of past interactions.
Relevant memories about this user:
{memory_block}
Use these memories naturally to personalize your response."""
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": user_message}]
    )
    result = response.content[0].text
    save_memory(user_id, f"User asked: {user_message}", "episodic")
    return result

# Now the system remembers across sessions.
# The model never stored anything; the database did.

💡 Key Insight: The model did not learn anything. It just read its notes before the meeting. The illusion of persistent memory is entirely constructed by the infrastructure around the model.
Letta: Treating Memory Like an Operating System
The Letta framework takes the most rigorous engineering approach to LLM memory. Rather than treating memory as a conversational feature, Letta treats it like a computer's memory hierarchy, managed exactly as an OS manages RAM versus disk storage, and lets the agent autonomously move information between the layers.
__LLM_LETTA_LAYERS__
What makes Letta powerful is that the agent manages its own memory hierarchy the same way an OS manages virtual memory. It reads a key user preference during conversation → stores it to Core Memory. The conversation grows too long → older exchanges get archived to Recall Memory. A new piece of domain knowledge is shared → the agent explicitly writes it to Archival Memory.
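A simplified sketch of that hierarchy, with invented class and method names (this is not Letta's actual API): core memory has a hard size limit, and overflow is paged out to recall storage the way an OS pages RAM to disk.

```python
class TieredMemory:
    """Minimal sketch of an OS-style memory hierarchy, inspired by
    Letta's Core/Recall/Archival layers (NOT Letta's actual API)."""

    def __init__(self, core_limit: int = 3):
        self.core_limit = core_limit
        self.core = []      # small; always injected into the context window
        self.recall = []    # paged-out conversation history, searchable
        self.archival = []  # explicitly written long-term knowledge

    def store_core(self, fact: str):
        """Write to core; overflow pages out to recall, like RAM to disk."""
        self.core.append(fact)
        while len(self.core) > self.core_limit:
            self.recall.append(self.core.pop(0))  # evict oldest

    def archive(self, knowledge: str):
        self.archival.append(knowledge)

    def search_recall(self, keyword: str) -> list[str]:
        # Real systems use embeddings; keyword match keeps the sketch simple
        return [m for m in self.recall if keyword.lower() in m.lower()]

mem = TieredMemory(core_limit=2)
for fact in ["Name: Ada", "Prefers Python", "Deploys on Fridays"]:
    mem.store_core(fact)
print(mem.core)                   # ['Prefers Python', 'Deploys on Fridays']
print(mem.search_recall("name"))  # ['Name: Ada']
```

The key design choice is that eviction is automatic but retrieval is explicit: the agent has to ask for paged-out facts, just as a Letta agent calls a recall-search tool.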
Agentic Memory: The Model Decides What to Remember
The latest frontier is moving beyond pre-programmed memory rules into truly agentic memory: systems where the LLM agent itself decides what deserves to be remembered, how memories should be linked to each other, and when a stored memory should be updated or discarded.
A-MEM, published at NeurIPS 2025, introduced one of the most sophisticated approaches. It applies Zettelkasten principles, the note-taking method researchers use to build interconnected knowledge networks, to agent memory. When a new memory is added, the system generates a comprehensive note with contextual descriptions, keywords, and tags. It then finds links to existing memories and updates them accordingly.
from agentic_memory.memory_system import AgenticMemorySystem

# Initialize A-MEM with embedding model + LLM backend
memory = AgenticMemorySystem(
    model_name="all-MiniLM-L6-v2",
    llm_backend="openai",
    llm_model="gpt-4o-mini"
)

# Agent adds a memory; the system automatically:
# 1. Generates contextual description
# 2. Extracts keywords and tags
# 3. Finds related existing memories
# 4. Creates bidirectional links between them
memory_id = memory.add_note(
    content="User prefers Python over JavaScript for backend work",
    tags=["preference", "python", "backend"],
    category="user_preference"
)

# Later: semantic search retrieves related memories
results = memory.search_agentic("what language should I use for this API?", k=3)
# Returns: Python preference + any related language/tooling memories
# The graph traversal finds connections vector search alone might miss

The key advance is memory evolution. When a new memory contradicts an existing one (say, the user switched from Python to Rust for performance reasons), the system can update the older memory's contextual representation rather than just appending a contradictory entry.
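A bare-bones sketch of that evolution idea. A-MEM uses an LLM to detect contradictions and rewrite notes; here a shared key stands in for contradiction detection, so the function names and logic are illustrative only.

```python
def evolve(memories: dict[str, str], key: str, new_value: str) -> dict[str, str]:
    """Sketch of memory evolution: a keyed fact is updated in place,
    folding the old value into the note's history, instead of appending
    a second, contradictory entry. (A-MEM detects contradictions with
    an LLM; the shared key stands in for that here.)"""
    if key in memories:
        memories[key] = f"{new_value} (previously: {memories[key]})"
    else:
        memories[key] = new_value
    return memories

m = {"backend_language": "Prefers Python"}
evolve(m, "backend_language", "Prefers Rust for performance")
print(m["backend_language"])
# Prefers Rust for performance (previously: Prefers Python)
```

The payoff is that retrieval returns one coherent, current fact with its history attached, rather than two memories that flatly disagree.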
Long-Context vs External Memory: The Cost Crossover
With Gemini 2.5 Pro offering 2 million token context windows, a reasonable question is: why bother with external memory at all? A systematic analysis published in early 2026 compared long-context models against fact-based external memory systems. The result: long-context models have advantages in factual recall, but the cost structure is brutally unfavorable.
At 100K context length, an external memory system is more cost-efficient than a long-context LLM after approximately 10 turns. Cost grows linearly with every turn for long-context approaches versus bounded cost with retrieval. Long-context is great for one-shot deep document analysis. External memory is correct for any system where users have ongoing conversations over days or weeks.
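The crossover is easy to see with back-of-the-envelope arithmetic. The numbers below are illustrative, not the paper's exact figures: assume 500 new tokens per turn, and a retrieval system that injects a fixed 2,000-token memory block instead of the full history.

```python
def full_history_cost(turns: int, tokens_per_turn: int = 500) -> int:
    """Tokens processed when every turn re-sends the whole history:
    turn t costs t * tokens_per_turn, so per-turn cost grows linearly
    and the total grows quadratically."""
    return sum(t * tokens_per_turn for t in range(1, turns + 1))

def retrieval_cost(turns: int, retrieved_tokens: int = 2000,
                   tokens_per_turn: int = 500) -> int:
    """Tokens processed when each turn injects a bounded retrieved
    context instead: constant cost per turn, linear total."""
    return turns * (retrieved_tokens + tokens_per_turn)

# First turn at which retrieval becomes strictly cheaper
# (with these illustrative parameters)
crossover = next(t for t in range(1, 100)
                 if retrieval_cost(t) < full_history_cost(t))
print(crossover)  # 10
```

With these assumed parameters the crossover lands at turn 10; the exact turn shifts with token counts, but the linear-versus-quadratic shape guarantees a crossover always exists.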
💡 The practical rule: Use the context window for the current task's immediate context. Use external memory for everything that needs to persist across sessions. Never use the context window as a substitute for a memory architecture. Context is scarce; external storage is not.
Memory Approaches โ Full Comparison
__LLM_MEMORY_TABLE__
The entire memory industry around LLMs exists because of one architectural fact: the context window is the model's only sense organ. Memory architecture is competitive advantage disguised as plumbing.
What You Now Know
LLMs are stateless. They know what they were trained on, and they can read what you put in their context window; that is everything. There is no background process forming memories. There is no learning happening as you chat.
What creates the appearance of memory is a system built around the model: a context window that gets carefully assembled before each generation call, an external store that holds conversation history and user facts, a retrieval mechanism that selects what is most relevant to the current query, and a summarization layer that compresses what does not fit.
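That assembly step can be sketched in a few lines. The field names loosely follow the shape of the Anthropic Messages API, but the helper itself is hypothetical: it just shows the four ingredients being laid out before a single call.

```python
def assemble_call(summary: str, memories: list[str],
                  recent: list[dict], user_message: str) -> dict:
    """Combine the ingredients into one request payload: a rolling
    summary, retrieved memories, the recent verbatim window, and the
    new message. (Illustrative helper; field names loosely follow the
    Anthropic Messages API.)"""
    system = (
        "You are a helpful assistant.\n"
        f"Summary of earlier conversation: {summary}\n"
        "Relevant memories:\n" + "\n".join(f"- {m}" for m in memories)
    )
    return {
        "system": system,
        "messages": recent + [{"role": "user", "content": user_message}],
    }

call = assemble_call(
    summary="User is building a Flask API.",
    memories=["Prefers Python", "Deploys on Fridays"],
    recent=[{"role": "assistant", "content": "Explained how to add routes."}],
    user_message="How do I add auth?",
)
print(call["system"])
```

Everything the model will appear to "remember" is visible in that one payload; nothing else reaches it.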
In 2026, the frontier is agentic memory: systems where the agent autonomously manages its own memory hierarchy, decides what is worth storing, how memories should be linked, and when outdated memories should be revised. A-MEM showed this works at NeurIPS 2025. Letta showed the OS-inspired architecture is production-viable. The gap between a stateless chatbot and a genuinely persistent AI assistant is now primarily an engineering gap, not a model capability gap. That is good news for builders.