The context window

Every model has a working memory — a finite amount of text it can see at one time. This is the context window, and most of how a model behaves in a real conversation is downstream of it.

See it fill up

A small version of the mechanism, with messages of realistic shape but a deliberately tiny window. Add messages; watch what happens when the window fills.

Context window · 126 / 600 tokens

system

You are a helpful assistant for a customer support team.

24t

user

A customer was double-charged on their last invoice. How do I handle this?

38t

assistant

Start by confirming the duplicate charge in your billing system, then issue a credit to the customer with a brief explanation…

64t

Add messages. When the window fills up, watch what happens.

What “context” includes

When you talk to ChatGPT, the model on each turn is given: your latest message, your earlier messages in this chat, its own earlier replies, any documents you have uploaded or pasted, and a hidden block of instructions from the product (the “system prompt”) that tells it how to behave.

All of that gets stacked into one prompt. That stack has to fit in the context window.

If it fits, the model sees all of it. If it doesn’t, the product behind ChatGPT — or Claude, or your custom app — has to make a call: which parts to keep, which to drop, which to compress.

Sizes today

Context windows used to be small. A few pages of text.

Today’s frontier models can hold a small book — anywhere from a hundred thousand to a few million words depending on the model. Some experimental ones go further.

In practical numbers:

A short PDF, around 5,000 words. Fits easily.
A long 50-page report, around 25,000 words. Fits in most frontier models.
A small codebase, around 200,000 words. Fits in some.
A year of email, around 2 million words. Doesn’t fit anywhere.

The window keeps growing. But “fits” is not the same as “works well.”

Why bigger isn’t always better

Two things happen as you push more text into the window.

Cost goes up. Most APIs charge per token. A 100,000-token prompt costs roughly 50x what a 2,000-token one does. For a chat product, that’s the difference between cents and dollars per response.

Quality often goes down. Counterintuitively, models often perform worse on long contexts than short ones. The relevant detail gets lost in the noise. This is being actively researched and steadily improved — but right now, stuffing everything into context is not a substitute for being thoughtful about what you put in.

A useful rule of thumb: a tight, relevant prompt with 2,000 words beats a sprawling, dumped-everything prompt with 200,000. Almost always.

When context fills up

Long conversations eventually exceed the window. Different products handle this differently.

ChatGPT and Claude.ai mostly drop the oldest messages. The model literally stops seeing the beginning of the conversation. It is not “forgetting” — that text was simply not in front of it on this turn.

Some products summarize the dropped portion. You get a rolling memory, but a lossy one.

Either way: if you’ve been chatting for two hours and the model suddenly seems to misremember what you discussed earlier — it isn’t bugged. The earlier context fell out of the window.

Fresh chats start blank

A new conversation has none of the previous conversation in it. No memory of yesterday, no memory of last hour, no memory of you.

Some products — ChatGPT, Claude.ai, Gemini — have started to add a separate “memory” layer that stores facts about you between chats. This is bolted on top of the model: a small file of notes that the product injects into each new chat’s context. The model itself still has no memory; the product is doing the remembering.

The same trick is what lets any custom AI system feel like it has continuity. What gets remembered, what gets dropped, what gets summarised — those are explicit design decisions made by whoever built the product. Different choices produce very different experiences from the same underlying model.

Where this lands in real systems

In any AI product built around a long-running conversation — an internal assistant, a support bot, a sales-research tool — most of the engineering work is context management, not model choice. Which messages stay, which get summarised, which get dropped, when a fresh window starts: those decisions shape the experience more than which underlying model is wired up underneath.

In any AI product that needs to “remember” things across sessions, the memory is a feature someone explicitly built. The model has no persistence of its own; what looks like memory is the product writing notes about the user and re-injecting them into each new conversation’s context.

And in any system where context is filled by simply pasting in everything that might be relevant — full knowledge bases, entire email histories, the whole codebase — the trade-off is steep. The cost per response rises sharply, the latency climbs, and the model’s accuracy on the actually-relevant parts often degrades because the signal is diluted. A 2,000-word prompt curated to the question almost always beats a 200,000-word prompt that contains everything. Curation is the engineering.