Understanding Context Windows and Token Limits: Why AI 'Forgets' Earlier Instructions

A developer I was helping debug a long-running support chatbot kept reporting that the AI “randomly forgot” a formatting rule he had established early in very long conversations, assuming this was inconsistent or careless model behavior, until we worked through what was actually happening — his conversation had grown long enough that the original instruction had genuinely fallen outside what the model could still see, not because the model chose to ignore it, but because it was no longer present in what got sent for that specific request.

What a Context Window Actually Is

A context window is the maximum amount of text, measured in tokens rather than words or characters, that a model can process at once for a given request. This includes everything — the system prompt, the entire visible conversation history, and the current message — combined into a single total that cannot exceed the model’s specific limit.

This distinction matters because the model is not maintaining some separate persistent memory of your entire conversation history across requests the way a human conversation partner might. Each request is effectively reprocessed from whatever text actually fits within the context window for that specific call, meaning anything that falls outside this window genuinely is not available to the model for that particular response, regardless of whether it was discussed earlier in the conversation.

Why Tokens Are Not the Same as Words

This is a common source of miscalculation. A token is roughly three-quarters of a word on average for English text, but this varies considerably — common short words are often a single token, while less common words, unusual formatting, or non-English text can require multiple tokens per word. Code, with its punctuation-heavy syntax, also typically consumes more tokens per visible character than plain prose does.

Worth checking directly if you are managing context length carefully: Many providers offer a tokenizer tool or library that shows the actual token count for a given piece of text, which is considerably more reliable than estimating based on word or character count alone, particularly for technical content or non-English text where the word-to-token ratio differs meaningfully from typical English prose.

Why “Forgetting” Is Usually Truncation, Not a Mistake

In the chatbot case, the conversation had grown long enough over many exchanges that, combined with the system prompt and ongoing message history, the total token count exceeded what could fit in a single request. Depending on how the integration handles this, the oldest parts of the conversation — including that original formatting instruction — were the parts most likely to have been dropped to keep the total within the limit.

How to confirm this is the cause: If a previously consistent instruction or established fact suddenly stops being followed specifically in longer conversations, while remaining reliable in shorter ones, length-related truncation is a strong candidate, particularly if you can identify roughly how long the conversation needs to get before the issue appears.

Why Placement Within the Window Also Matters

Beyond simply fitting within the limit, instructions placed in a persistent system prompt (covered in our system prompts guide) are generally far more resistant to this kind of effective “forgetting” than instructions stated once in an early user message, since well-designed integrations typically prioritize keeping the system prompt intact even when trimming older conversation history to fit within the context window.

This is a genuinely practical reason, beyond general best practice, to place persistent behavioral rules in an actual system prompt rather than only stating them once early in a long conversation — a system prompt is considerably less likely to be the part that gets trimmed when a conversation’s total length needs to be reduced to fit.

Practical Strategies for Working Within Context Limits

Periodically restate critical instructions, particularly in very long conversations, rather than assuming a rule established many exchanges ago is still definitely present in what the model currently sees.

Summarize and compress earlier context when a conversation grows long, replacing extensive earlier exchanges with a concise summary of the relevant decisions or established facts, preserving the important content while using meaningfully fewer tokens than the original full exchanges.

Move genuinely persistent rules into the system prompt rather than relying on them being remembered from an earlier point in the conversation history, for the placement-priority reason discussed above.

Break large tasks into smaller, separately-scoped requests when working with very large source material (a long document, a large codebase), rather than attempting to fit an entire large input alongside extensive conversation history within a single request.

A Quick Reference for Diagnosing This Issue

Symptom	Likely Cause
Instruction followed reliably in short conversations, inconsistently in long ones	Context window truncation of earlier history
Model seems to “forget” a fact stated many exchanges ago	That exchange has likely fallen outside the current window
Issue appears consistently after a similar conversation length each time	Worth checking actual token count against the model’s specific limit
Behavioral rule followed consistently regardless of conversation length	Rule is likely correctly placed in the system prompt

What Fixed the Chatbot’s Inconsistent Formatting

Once we moved the formatting rule into the actual system prompt and added a periodic restatement of key context for unusually long conversations, the inconsistency the developer had been troubleshooting resolved, not because the model had started behaving differently, but because the instruction was now considerably less likely to fall outside what the model could actually see for any given response.

This experience is a useful reminder that an AI model is not failing to recall something it has access to — it is working entirely from whatever text is actually present in that specific request, and understanding this distinction reframes “the model forgot” into a considerably more solvable problem of managing what stays within the window.

Are you seeing an instruction or fact get inconsistently followed specifically in longer conversations? Describe roughly how long the conversation gets before the issue appears, and I can help you think through whether context length is the actual cause.

🔗 Recommended Reading