What Is Prompt Engineering?
What prompt engineering actually is, how LLMs work as token predictors, and why the same prompt can give 10 different answers.
"Prompt engineering is the process of writing effective instructions for a model such that it consistently generates content that meets your requirements." — OpenAI
The word consistently is doing a lot of work in that sentence. LLMs are nondeterministic — the same prompt can produce 10 different outputs. Prompt engineering is the science (and some art) of reducing that randomness to get reliable, high-quality results.
The Myth
Prompt engineering is not magic. You can't make LLMs deterministic. They have real limitations — training cutoff dates, context window caps, a tendency to hallucinate. What prompting techniques do is give you measurable, repeatable improvements within those limits.
It's also the most accessible AI tool you have. No RAG setup, no fine-tuning, no extra infrastructure — just the words you write. Good prompting gets you 70–80% of the way there.
How LLMs Actually Work
An LLM is a pattern predictor that generates one token at a time.
When you type something, it doesn't "think" about the full answer then write it out. It predicts the next most likely token, then the next, then the next — never planning ahead.
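Next-token generation can be sketched with a toy model. The `NEXT` table below is a made-up stand-in for the real thing (which scores ~100,000 possible tokens against thousands of tokens of context), but the loop shape is the same: look at what came before, score the candidates, emit one token, repeat.

```python
# A toy "language model": for each context word, scores for the next word.
# Purely illustrative -- NEXT and its scores are invented for this sketch.
NEXT = {
    "the": {"sky": 0.6, "cat": 0.4},
    "sky": {"is": 0.9, "was": 0.1},
    "is": {"blue": 0.7, "clear": 0.3},
}

def generate(start, steps=3):
    out = [start]
    for _ in range(steps):
        options = NEXT.get(out[-1])
        if not options:
            break
        # Greedy decoding: always take the single most likely next token.
        # The model never "plans" the whole sentence; it only extends it.
        out.append(max(options, key=options.get))
    return " ".join(out)

print(generate("the"))  # -> "the sky is blue"
```

Note that nothing in the loop looks ahead: each token is chosen using only the tokens already emitted.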
Think of it like your phone's autocomplete, but trained on terabytes of human text instead of just your message history. The key breakthrough was the transformer architecture from the 2017 paper "Attention Is All You Need" (Google), which gave models the ability to pay attention to thousands of tokens at once instead of just 5–10 words.
The scaling laws that followed were remarkable: making models bigger and training them on more data and compute produced steady, predictable gains in capability. Alongside that growth, context windows expanded from ~4,000 tokens a few years ago to over 1 million today.
Why They're Nondeterministic
When predicting the next token, the model doesn't always pick the #1 most likely option. Sometimes it picks #2, #3, or lower — by design. This is what makes responses feel natural rather than robotic.
Ask "what color is the sky?" and you'll likely get blue — but you might get a paragraph about how sunsets are orange and pink. Ask the same question 10 times, get 10 different answers.
This has practical implications:
- A prompt that works today may produce something different tomorrow
- Two people using the same model with the same prompt may get different results
- "It worked once" is not the same as "it works reliably"
What LLMs Are NOT
Not a calculator. When you ask an LLM what 2 + 2 is, it's not computing — it's predicting that 4 is the most likely next token after that sequence. For simple math this works fine. For complex reasoning chains ("if Sally has 5 red apples and John has 2 green ones, how many non-red apples are there?"), the model can fail because it's not doing arithmetic — it's completing a pattern.
Not random. Ask an LLM to "flip a coin" and you won't get a true 50/50 split. Its training data makes certain outputs more likely than others, so one result tends to dominate.
Not capable of reasoning while silent. A model only "thinks" while it's generating output. This is why techniques like chain-of-thought prompting (covered later) work so well — they force the model to write out its reasoning, which actually improves accuracy.
Tokens vs Words
LLMs operate on tokens, not words. A token is roughly 0.75 words on average — but not always. Code, punctuation, and capitalization all affect tokenization differently.
Fun fact: JavaScript (capital J, capital S) tokenizes differently from javascript. This usually doesn't matter in practice, but it's useful to know that the model perceives these as distinct.
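For back-of-envelope budgeting you don't need a real tokenizer. A common rule of thumb for English text is roughly 4 characters per token; the helper below is a crude heuristic sketch, not an actual tokenizer (real ones, like OpenAI's tiktoken, split text by learned byte-pair rules).

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token of English text.

    A heuristic stand-in for a real tokenizer -- good enough for
    budgeting a prompt, wrong for exact counts (code, punctuation,
    and non-English text tokenize very differently).
    """
    return max(len(text) // 4, 1)

print(estimate_tokens("Write a summary of this article."))
```

Use it to sanity-check whether a prompt plus pasted material will fit a context window, not to bill by the token.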
The Context Window
LLMs have no persistent memory. The way they "remember" a conversation is by receiving the entire conversation history as input on every message — your inputs, its outputs, everything.
The context window is the maximum number of tokens a model can hold at once. When you hit the limit, the oldest context drops off silently — no warning, no error.
You: Please don't add extra features.
...
[1000 tokens later — that instruction dropped off]
Model: I added search, export, and a favorites tab!

Critical instructions belong at the beginning of your prompt (and are worth repeating if a conversation runs long). More on this in the context placement post.
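That silent truncation can be sketched as a simple loop. Everything here is illustrative: `count_tokens` is a crude character-based stand-in for a real tokenizer, and the message strings are hypothetical.

```python
def fit_context(messages, max_tokens, count_tokens):
    """Drop the oldest messages until the history fits the window."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        # The earliest message -- and any instruction inside it --
        # vanishes with no warning and no error.
        kept.pop(0)
    return kept

history = [
    "Please don't add extra features.",  # the critical instruction
    "long message " * 50,                # filler that eats the budget
    "Add a search box.",
]
trimmed = fit_context(history, max_tokens=60, count_tokens=lambda m: len(m) // 4)
print(trimmed)  # the instruction is gone: ["Add a search box."]
```

Real chat APIs don't all truncate in exactly this way, but the effect is the same: once the budget is exceeded, early context stops influencing the output.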
Practical Starting Points
- Use full sentences, not Google-style keyword searches. These models work better with natural language.
- Expect variation. Your result and your colleague's result won't be identical even if you use the same model.
- If performance starts degrading mid-conversation, start a fresh chat — important context has likely dropped off or been "lost in the middle."
- You can always ask the model to summarize the conversation before starting fresh: "Summarize what we've been building so I can paste it into a new chat."