Chain-of-Thought Prompting
Five words -- "Let's think step-by-step" -- took model accuracy from 17.7% to 78.7% on an arithmetic benchmark. Per character written, chain-of-thought may be the highest-impact prompting technique.
Add five words to a prompt that involves multi-step reasoning and you'll often get meaningfully better results:
"Let's think step-by-step."
That's chain-of-thought (CoT) prompting in its simplest form. It asks the model to show its reasoning process, and the act of reasoning out loud can dramatically improve accuracy on multi-step problems.
The Research
The paper "Large Language Models are Zero-Shot Reasoners" tested what happens when you add "Let's think step-by-step" to prompts, specifically on arithmetic and logical reasoning tasks where LLMs traditionally struggle.
Results on the MultiArith arithmetic benchmark:
| Prompt type | Accuracy |
|---|---|
| Standard zero-shot | 17.7% |
| Zero-shot + "Let's think step-by-step" | 78.7% |
A 61-percentage-point jump from five words. No examples, no complex setup -- just telling the model to reason out loud before answering.
The study also tested misleading and irrelevant phrases like "Don't think. Just feel." and "Abrakadabra!" Those scored close to the zero-shot baseline, far below the instructive phrasings. The takeaway: the gain comes from prompting the model to actually produce reasoning tokens, and "Let's think step-by-step" was the best-performing phrase tested.
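In the paper, zero-shot CoT runs as two model calls: one to elicit the reasoning, and a second that feeds the reasoning back and asks for the final answer. The sketch below approximates that flow. The `complete` callable and `fake_model` are hypothetical stand-ins for a real model call, and the exact prompt layout is an assumption rather than a copy of the authors' code.

```python
from typing import Callable, Tuple

REASONING_TRIGGER = "Let's think step by step."
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def zero_shot_cot(question: str, complete: Callable[[str], str]) -> Tuple[str, str]:
    """Run two-stage zero-shot CoT with any text-completion callable.

    `complete` is a hypothetical stand-in for a model call: it takes a
    prompt string and returns the model's continuation.
    """
    # Stage 1: elicit the reasoning.
    reasoning_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = complete(reasoning_prompt)
    # Stage 2: feed the reasoning back and ask for the final answer only.
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    answer = complete(answer_prompt).strip()
    return reasoning, answer


def fake_model(prompt: str) -> str:
    """Canned replies so the sketch runs without any API access."""
    if prompt.endswith(ANSWER_TRIGGER):
        return " 84."
    return "Each pack has 12 pens. 7 packs is 7 * 12 = 84 pens."


reasoning, answer = zero_shot_cot("How many pens are in 7 packs of 12?", fake_model)
print(reasoning)
print(answer)
```

To try it against a real model, pass any function that maps a prompt string to a completion string in place of `fake_model`.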
Why It Works
Remember: LLMs only "think" while they're generating output. They don't reason silently and then write the answer. They produce the answer token by token.
Without chain-of-thought, the model goes directly from your question to a final answer -- no intermediate reasoning. For simple tasks, this is fine. For multi-step problems, it fails:
Question: John has 5 apples. He gives Mary 2 red ones and keeps the green ones.
Sally has 3 apples, all green. How many green apples total?
Zero-shot model: 6 (a bare answer -- there is no way to tell whether the model tracked the colors or just pattern-matched)
With chain-of-thought, the model is forced to work through the problem step by step in the output before concluding:
Let's think step-by-step.
John has 5 apples total and gives away 2 red ones, so he has 3 left.
If the ones he gave were red, his remaining apples are not red.
The problem says he "keeps the green ones" -- so his remaining 3 are green.
Sally has 3 apples, all green.
Total green apples: John's 3 + Sally's 3 = 6.
Answer: 6
The answer is 6 either way, but the path to get there is the important part. With the steps written out, each one can be checked, and reasoning out loud catches mistakes that jumping to a conclusion would miss.
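The same state tracking can be written out as a few lines of Python. This is only an illustration of what "tracking state" means for the apple problem: each variable mirrors one intermediate fact the reasoning has to hold on to. The variable names are made up for this sketch.

```python
# Each variable mirrors one line of written-out reasoning.
john_total = 5
john_gave_red = 2
john_green = john_total - john_gave_red  # he "keeps the green ones"
sally_green = 3

steps = [
    f"John keeps {john_total} - {john_gave_red} = {john_green} green apples.",
    f"Sally has {sally_green} green apples.",
    f"Total green: {john_green} + {sally_green} = {john_green + sally_green}.",
]
for step in steps:
    print(step)

total_green = john_green + sally_green
```

Every intermediate value is visible, so a wrong step (for example, counting the red apples as green) would show up in the printed lines instead of hiding inside a single number.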
Why LLMs Struggle with Math and Logic
LLMs are pattern predictors, not calculators. When you ask "what is 847 × 23?", the model isn't multiplying -- it's predicting the most likely tokens after that sequence. Simple arithmetic often works because facts like 2 + 2 = 4 appear constantly in training data. Complex multi-step arithmetic fails because the model can't track state across calculation steps without writing them down.
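To make "writing the steps down" concrete, here is a small sketch that multiplies 847 by 23 the way a step-by-step answer would: one partial product per digit, with a running total. The function name `multiply_with_steps` is invented for this example, and it assumes a non-negative integer multiplier.

```python
from typing import List, Tuple


def multiply_with_steps(a: int, b: int) -> Tuple[int, List[str]]:
    """Multiply a by b (b >= 0), recording one partial product per digit of b."""
    steps: List[str] = []
    total = 0
    # Walk the digits of b from least to most significant.
    for position, digit_char in enumerate(reversed(str(b))):
        place_value = int(digit_char) * 10 ** position
        partial = a * place_value
        total += partial
        steps.append(f"{a} x {place_value} = {partial} (running total {total})")
    return total, steps


product, steps = multiply_with_steps(847, 23)
for line in steps:
    print(line)
print("Answer:", product)  # 19481
```

The two printed steps (847 x 3, then 847 x 20) are the intermediate state that a model skipping straight to the answer never writes down.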
Similarly, "flip a coin" in an LLM doesn't give you a 50/50 result. The model predicts the most likely token -- and if its training data has "heads" appearing more often in similar contexts, you'll get heads more often. LLMs are not random number generators.
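A toy comparison shows the difference. The probabilities in `token_probs` are invented for illustration (real models have their own distributions), and the two helper functions are simplified stand-ins for how a decoder might pick the next token.

```python
import random
from collections import Counter

# Hypothetical next-token probabilities; real models will differ.
token_probs = {"heads": 0.6, "tails": 0.4}


def greedy_flip(probs):
    """Greedy decoding: always return the most likely token."""
    return max(probs, key=probs.get)


def sampled_flip(probs, rng):
    """Sampling: draw a token in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


rng = random.Random(0)  # seeded so the run is repeatable
greedy = Counter(greedy_flip(token_probs) for _ in range(1000))
sampled = Counter(sampled_flip(token_probs, rng) for _ in range(1000))
fair = Counter(rng.choice(["heads", "tails"]) for _ in range(1000))

print("greedy :", dict(greedy))
print("sampled:", dict(sampled))
print("fair   :", dict(fair))
```

Greedy decoding returns "heads" every time, and sampling inherits whatever skew the distribution has. Only the last line, which uses an actual random number generator, behaves like a coin.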
Chain-of-thought prompting compensates for this by forcing the model to externalize its reasoning, making each step explicit and checkable.
Example 1: Diagnosing a Plant Problem
Without CoT:
My plant's leaves are turning yellow. What's wrong?
→ "Could be overwatering, underwatering, or nutrient deficiency."
With CoT:
My plant's leaves are turning yellow. Diagnose this step-by-step.
Output:
Step 1 -- Check watering issues.
- Overwatering (most common): Are leaves yellow and wilting? Is the soil constantly wet? Does the pot have drainage holes? Overwatering suffocates roots, preventing nutrient absorption.
- Underwatering: Are leaves dry and crispy? Is the soil dried out?
Step 2 -- Examine light conditions...
Step 3 -- Assess nutrient deficiency...
Step 4 -- Look for other clues...
Quick diagnostic questions: [list]
The second response is actually useful: it walks through a diagnostic process you can follow. The first is too generic to act on.
Example 2: Building Complex Features Step-by-Step
Chain-of-thought isn't just for problem diagnosis. When building complex features, explicitly structuring the steps gets better results:
Let's build a complete export/import system for a Prompt Library, step-by-step.
Step 1: Analyze what data needs to be exported
- All prompts with their metadata, ratings, and notes
Step 2: Design the export JSON schema
- Include version number for future compatibility
- Add export timestamp and statistics
- Provide complete prompts array
Step 3: Create the export function
- Gather all data from localStorage
- Validate data integrity
- Create a Blob and trigger download with timestamp in filename
Step 4: Create the import function
- Read the uploaded JSON file
- Validate structure and version number
- Check for duplicate IDs
- Merge or replace based on user choice
Step 5: Add error recovery
- Back up existing data before import
- Roll back on failure
- Provide detailed error messages
Implement this system. Let's think step-by-step.
Listing the steps and then adding "Let's think step-by-step" may look redundant to a human reader, but models tend to respond better to explicit instruction. The phrase primes the model to keep its reasoning chain going throughout the implementation.
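For a sense of what the five steps in that prompt translate to, here is a compact Python sketch. The prompt itself targets a browser app (localStorage, Blob downloads); this version uses a plain dict as a stand-in for storage, and the field names (`version`, `exportedAt`, `stats`, `prompts`) and `SCHEMA_VERSION` are assumptions made for the example, not a schema from the source.

```python
import copy
import json
from datetime import datetime, timezone

SCHEMA_VERSION = 1  # hypothetical version number


def export_library(store: dict) -> str:
    """Steps 1-3: serialize all prompts with a version, timestamp, and count."""
    payload = {
        "version": SCHEMA_VERSION,
        "exportedAt": datetime.now(timezone.utc).isoformat(),
        "stats": {"promptCount": len(store["prompts"])},
        "prompts": store["prompts"],
    }
    return json.dumps(payload, indent=2)


def import_library(store: dict, raw: str, mode: str = "merge") -> None:
    """Steps 4-5: validate an export, merge or replace, roll back on failure."""
    backup = copy.deepcopy(store["prompts"])  # back up before touching anything
    try:
        data = json.loads(raw)
        if data.get("version") != SCHEMA_VERSION:
            raise ValueError(f"unsupported version: {data.get('version')!r}")
        incoming = data["prompts"]
        ids = [p["id"] for p in incoming]
        if len(ids) != len(set(ids)):
            raise ValueError("duplicate prompt IDs in import file")
        if mode == "replace":
            store["prompts"] = incoming
        else:
            # Merge: incoming prompts overwrite existing ones with the same ID.
            merged = {p["id"]: p for p in store["prompts"]}
            merged.update({p["id"]: p for p in incoming})
            store["prompts"] = list(merged.values())
    except (ValueError, KeyError, TypeError, AttributeError) as exc:
        store["prompts"] = backup  # roll back to the pre-import state
        raise ValueError(f"import failed, existing data restored: {exc}") from exc


store = {"prompts": [{"id": "a", "text": "Summarize this.", "rating": 4}]}
exported = export_library(store)
import_library(store, exported)  # re-importing the same export merges cleanly
print(len(store["prompts"]))
```

Each function maps onto the numbered steps in the prompt, which is the point of spelling them out: the structure of the request becomes the structure of the code.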
The Scale Effect
The accuracy gains from chain-of-thought grow with model size. In the paper's experiments, zero-shot and zero-shot CoT performed similarly on smaller models, and the gap became dramatic on larger ones.
If models keep trending larger, the payoff from chain-of-thought prompting should hold or grow, which makes it a technique worth learning now.
Variations That Work
The paper tested multiple CoT templates, and the instructive ones all beat the zero-shot baseline. Phrasings in this style that are worth keeping on hand:
| Template | Works well for |
|---|---|
| "Let's think step-by-step" | General purpose -- use this by default |
| "First, ... Next, ... Finally, ..." | Sequential processes |
| "Let's think about this logically" | Reasoning and analysis |
| "Let's solve this by splitting it into steps" | Multi-part problems |
| "Diagnose this step-by-step" | Debugging, troubleshooting |
The specific words matter less than the fact that you're asking for explicit reasoning. But "Let's think step-by-step" is the most researched and most reliable default.
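One way to keep these phrasings handy is a small lookup keyed by task type. The keys, the `with_template` helper, and the fallback behavior are all choices made for this sketch; only the phrases themselves come from the table.

```python
COT_TEMPLATES = {
    "general": "Let's think step-by-step.",
    "logic": "Let's think about this logically.",
    "multi_part": "Let's solve this by splitting it into steps.",
    "debugging": "Diagnose this step-by-step.",
}


def with_template(prompt: str, task_type: str = "general") -> str:
    """Append the template for task_type, falling back to the general default."""
    trigger = COT_TEMPLATES.get(task_type, COT_TEMPLATES["general"])
    return f"{prompt.rstrip()}\n\n{trigger}"


print(with_template("My build fails only on CI, not locally.", "debugging"))
```

Unknown task types fall back to "Let's think step-by-step", which matches the advice to use it as the default.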
Practical Application
Add it to prompts where:
- The answer involves multiple steps
- You're debugging something
- You're doing any math or logic
- You want to understand why the model reached a conclusion
- The output keeps being wrong and you don't know why
It works in code generation too. When a model is generating a complex algorithm, "let's think step-by-step" often produces better code and inline explanations of the reasoning.
Further Reading and Watching
- Video: A Survey of Techniques for Maximizing LLM Performance -- OpenAI
- Paper: Large Language Models are Zero-Shot Reasoners -- Kojima et al., 2022