Tokens and Context Windows

Tokens are the currency LLMs think in. Context windows are their working memory. Here's what both mean for you in practice.

March 30, 2026 · 5 min read

Before writing better prompts, you need to understand the medium: LLMs don't read words — they process tokens. And they don't have memory — they have a context window.

What Is a Token?

A token is roughly 0.75 words on average, but that average hides a lot of variation:

  • Punctuation, spaces, and code may tokenize differently than plain prose
  • Capitalization matters: JavaScript and javascript tokenize into different token sequences
  • Code tends to use more tokens per character than natural language

LLMs use token IDs, not raw text. When you write a prompt, your words are first converted to token IDs, processed, then converted back to text for the response. You'll never need to count tokens manually for normal prompting — but understanding that this conversion exists helps explain why models can behave unexpectedly with unusual capitalization, symbols, or niche jargon.

Token estimator: ~1,000 tokens ≈ 750 words. For quick estimates, divide your word count by 0.75 (roughly 1.33 tokens per word).
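That rule of thumb is easy to script. A minimal sketch using the ~0.75 words-per-token heuristic from above (an estimate, not a real tokenizer):

```python
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate: ~0.75 words per token, so tokens ≈ words / 0.75."""
    word_count = len(text.split())
    return round(word_count / words_per_token)

# A 1,000-word document comes out to roughly 1,333 tokens.
doc = " ".join(["word"] * 1000)
print(estimate_tokens(doc))  # → 1333
```

For exact counts you'd use the model's actual tokenizer, but for budgeting a prompt this heuristic is usually close enough.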

What Is a Context Window?

An LLM has no persistent memory. It doesn't "remember" you from last week, or even from the start of your chat session — unless that history is re-sent with every message.

The context window is the maximum number of tokens a model can process at once. This window contains:

  1. The system message (set by the provider — you usually can't see it)
  2. Your entire conversation history (every message, both directions)
  3. Any attached files or code you've pasted in

Every time you send a message, the full conversation history goes along with it. That's how the model appears to "remember" earlier turns.

    You:   "What color is the sky?"
    Model: "Blue."
    You:   "What about at sunset?"

What the model actually receives on the second turn: [Your message 1] + [Model response 1] + [Your message 2]
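In API terms, the "memory" is just the client resending a growing list. A minimal sketch of that pattern (the role/content message shape mirrors common chat APIs; no real model is called here):

```python
history = []

def send(user_message: str, fake_reply: str) -> list:
    """Append the user turn, 'call the model' with the FULL history, append the reply."""
    history.append({"role": "user", "content": user_message})
    # A real client would pass the entire `history` list to the model here.
    history.append({"role": "assistant", "content": fake_reply})
    return history

send("What color is the sky?", "Blue.")
send("What about at sunset?", "Orange and pink.")

# The second request carried all four turns, not just the latest question.
print(len(history))  # → 4
```

Every turn pays for every previous turn in tokens, which is why long chats get expensive and eventually hit the window limit.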

What Happens When You Hit the Limit

When the cumulative tokens in the conversation exceed the context window, the oldest content drops off silently. No warning. No error message.

This is dangerous when you've front-loaded critical instructions early in a conversation:

    You (message 1): "Never add extra features I didn't ask for."
    ... [800 more tokens of conversation] ...
    You (later): "Add a save button."
    Model: "I added save, export, search, and a favorites tab!"

The original constraint was gone. The model didn't disobey — it literally couldn't see it anymore.
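Silent truncation is simple enough to simulate. A sketch, assuming a toy budget where each message costs its word count in tokens:

```python
def fit_to_window(messages: list, max_tokens: int) -> list:
    """Drop the OLDEST messages until the rest fit. No warning, no error."""
    kept = list(messages)
    while kept and sum(len(m.split()) for m in kept) > max_tokens:
        kept.pop(0)  # the earliest message silently disappears
    return kept

convo = ["Never add extra features I didn't ask for.",
         "long filler " * 40,          # many tokens of intervening chat
         "Add a save button."]
window = fit_to_window(convo, max_tokens=60)
print(convo[0] in window)  # → False: the constraint fell out of the window
```

Real providers manage truncation in more sophisticated ways, but the effect on your early instructions is the same: past a certain point, they simply aren't in the input anymore.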

What to do

  • Repeat important constraints periodically in long sessions
  • Start a new chat when you notice quality degrading — key context has likely been lost
  • Summarize before switching: Ask the model to summarize the conversation, paste the summary into a new chat
  • Be selective with context: Don't paste your entire codebase when you only need two files
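The "repeat important constraints" advice can even be automated if you control the client. A sketch with an illustrative pinning helper (names and message shape are assumptions, not a specific API):

```python
PINNED = "Constraint: never add features that weren't requested."

def build_request(history: list, user_message: str) -> list:
    """Re-attach pinned constraints to every request so they can't scroll away."""
    return history + [
        {"role": "user", "content": f"{PINNED}\n\n{user_message}"}
    ]

request = build_request([], "Add a save button.")
print(PINNED in request[-1]["content"])  # → True
```

Because the constraint rides along with the newest message, it sits at the end of the context, where the model attends most reliably.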

The System Message

Every AI application (Claude.ai, GitHub Copilot, Cursor, ChatGPT) has an invisible system message that sits before your conversation. You can't see it, but it:

  • Defines how the model behaves ("you are a helpful coding assistant")
  • Takes up part of your context window
  • Never drops off — it's always present

This is why the same model (Claude Sonnet 4.6) behaves differently in Claude chat vs. Copilot. Different tools set different system messages. It's also why you should never put API keys, sensitive business logic, or confidential information in a system message — they can be extracted through prompt injection.
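When you call a model through an API rather than a chat app, you set the system message yourself. A sketch of the common shape (this mirrors the message format used by most chat-completion APIs; no provider is actually called):

```python
messages = [
    # The system message: always first, always present, always costing tokens.
    {"role": "system",
     "content": "You are a terse code reviewer. Reply in bullet points."},
    {"role": "user", "content": "Review this function for bugs."},
]

# Different products ship different system messages to the SAME model,
# which is why the same model behaves differently across tools.
print(messages[0]["role"])  # → system
```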

Practical Context Limits

Modern models have large windows — often 100k to 1 million tokens. For most conversations, you'll never hit the limit. But context problems happen sooner than you'd think because:

  • The system message is already consuming some tokens
  • In coding tools (Copilot, Cursor), attached files compound quickly
  • Referencing @codebase in a monorepo can immediately fill a context window

Rule of thumb: Provide the minimal context needed to get a good output. If you only need a test file and a frontend component, add just those — not the entire repo.

The "Lost in the Middle" Effect

Even when content is technically inside the context window, LLMs don't attend to everything equally. Research from the paper "Lost in the Middle: How Language Models Use Long Contexts" found that models perform worse when the relevant information is buried in the middle of a long context — sometimes worse than if they'd been given no context at all.

| Position of relevant info | Model accuracy |
| --- | --- |
| Beginning of context | High |
| End of context | High |
| Middle of context | Lower than baseline |

This isn't a bug — it mirrors human psychology. We have a primacy bias (remembering the start of a list) and a recency bias (remembering the end). LLMs, trained on human-generated data, exhibit the same pattern.

Practical implications

  • Put critical instructions at the start of your prompt
  • Put supporting details at the end
  • If a conversation runs long, re-state the important stuff — don't assume it's still being "heard"
  • For any task with a clear single answer, shorter context often beats longer context
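A simple way to act on the first two points is to sandwich long supporting material between the instruction and a restatement of it. A sketch:

```python
def sandwich_prompt(instruction: str, supporting_material: str) -> str:
    """Put the critical instruction at the start AND the end,
    leaving only supporting detail in the attention-weak middle."""
    return (f"{instruction}\n\n"
            f"--- context ---\n{supporting_material}\n--- end context ---\n\n"
            f"Reminder: {instruction}")

p = sandwich_prompt("Summarize in exactly three bullets.", "...long document...")
print(p.startswith("Summarize") and p.endswith("three bullets."))  # → True
```

The duplication costs a few tokens, which is usually a good trade when the context in the middle runs to thousands of them.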

The continue Pattern

If a model cuts off mid-response (output truncated), just type continue and press send. Most models will pick up where they left off. Some providers show a "Continue" button directly in the UI.


Quick reference

| Concept | What it means | What to do |
| --- | --- | --- |
| Token | ~0.75 words; the unit LLMs process | Estimate tokens at ~1.33× word count |
| Context window | Max tokens the model holds at once | Keep prompts focused; don't paste whole repos |
| Context drop-off | Oldest content disappears silently when the limit is hit | Re-state critical info; start new chats |
| System message | Invisible behavior config; takes context space | Never put secrets in it |
| Lost in the middle | Model under-attends to mid-context content | Put key info at the start and end |
