Choosing the Right Model
Different models have different strengths, costs, and deprecation timelines. Picking the right model for the job is a core prompt engineering skill.
Most people find one model they like and use it for everything. This is understandable — once you've gotten comfortable with how a model talks, switching feels like learning a new tool. But in professional contexts, model selection is a real engineering decision.
Models Are Different
All major LLMs are powerful. But they differ in:
| Factor | What it affects |
|---|---|
| Speed | How long you wait for a response |
| Cost | How much an API call costs (matters for apps running thousands of calls/day) |
| Capability | Complex reasoning, code quality, creative writing |
| Context window | How much conversation/code it can hold |
| Training cutoff | How current its knowledge is |
| Strengths | Some excel at code, others at analysis or creative writing |
The Cost Consideration
When building AI applications, model cost scales directly with usage. A real example:
A fintech app used Haiku (smaller, faster, cheaper) to classify financial transactions, running thousands of classifications per day. When Sonnet 4 was released, it cost roughly 80% more per call.
The question wasn't "which model is smarter?" It was "does Sonnet 4 classify transactions measurably better than Haiku for this specific task?" For a classification task where Haiku was already 99% accurate, the answer was no. Haiku stayed.
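Back-of-envelope math makes the tradeoff concrete. The prices and token counts below are made-up illustrative figures, not real pricing; plug in your provider's actual numbers:

```javascript
// Hypothetical per-call token usage and per-million-token prices.
// These are illustrative only -- check your provider's pricing page.
const CALLS_PER_DAY = 5000;
const TOKENS_PER_CALL = { input: 400, output: 20 }; // short classification

const PRICING = {
  "small-model": { input: 0.25, output: 1.25 },
  "large-model": { input: 3.0, output: 15.0 },
};

// Daily cost in USD for one model tier.
function dailyCost(model) {
  const p = PRICING[model];
  const inputCost = ((CALLS_PER_DAY * TOKENS_PER_CALL.input) / 1e6) * p.input;
  const outputCost = ((CALLS_PER_DAY * TOKENS_PER_CALL.output) / 1e6) * p.output;
  return inputCost + outputCost;
}

console.log(dailyCost("small-model")); // cheap tier
console.log(dailyCost("large-model")); // capable tier -- ~12x more here
```

With these example numbers the capable tier costs about twelve times more per day for the same workload, which is exactly the kind of gap that justifies benchmarking before upgrading.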
Choosing the most capable model by default isn't engineering — it's just spending more money.
Deprecation is Real
Models get deprecated. Haiku could be gone tomorrow. GPT-4 is already deprecated. This has a practical implication:
When you build an AI application, write it so you can swap the model without rewriting everything.
```javascript
// Flexible: model name is a config, not hardcoded
const MODEL = process.env.AI_MODEL ?? "claude-sonnet-4-6";

const response = await anthropic.messages.create({
  model: MODEL,
  messages: [...]
});
```

This way, when your model gets deprecated (and it will), you change one line and test with the new model.
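Going one step further, you can centralize the whole request shape, not just the model name. A minimal sketch following the Anthropic Messages API request format; `buildRequest` is a hypothetical helper name, not part of any SDK:

```javascript
// Hypothetical helper: centralize model choice and request shape so a
// deprecation means editing one module, not every call site.
const DEFAULT_MODEL = process.env.AI_MODEL ?? "claude-sonnet-4-6";

function buildRequest(messages, { model = DEFAULT_MODEL, maxTokens = 1024 } = {}) {
  return { model, max_tokens: maxTokens, messages };
}

// Every call site goes through the helper:
// const response = await anthropic.messages.create(
//   buildRequest([{ role: "user", content: "Classify: ..." }])
// );
```

Now swapping models, or even testing two models side by side, is a parameter change rather than a refactor.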
Try Different Providers
There's a well-documented pattern in consumer behavior: brands people adopt in early adulthood often keep them as customers for life. The same is true for AI tools — most people stick with their first comfortable model forever.
Push against this. Each provider and model has genuine strengths:
- Claude (Anthropic): Strong at nuanced reasoning, following complex instructions, long context
- GPT series (OpenAI): Widely used, strong code generation, great ecosystem
- Gemini (Google): Strong multimodal capabilities, improving rapidly
- Haiku / smaller models: Fast and cheap for high-volume, lower-complexity tasks
What's worth trying:
- Use a different model for a week and see how it handles your specific tasks
- When cost matters, benchmark smaller models against larger ones for your specific use case
- For AI applications, A/B test model outputs before committing
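One way to run that kind of benchmark is a tiny accuracy harness. Everything below is a hypothetical sketch: `classify` stands in for a call to whichever model you're evaluating, and the sample labels are invented:

```javascript
// Sketch of a benchmark harness: given a labeled sample and a classify
// function (wrapping whichever model/API you use), report accuracy.
function accuracy(examples, classify) {
  let correct = 0;
  for (const { input, label } of examples) {
    if (classify(input) === label) correct++;
  }
  return correct / examples.length;
}

// Toy labeled sample; in practice use real examples from your domain.
const sample = [
  { input: "Wire transfer fee $25", label: "fees" },
  { input: "Starbucks #1234", label: "dining" },
];

// Stand-in for a model call -- swap in each candidate model here.
const naive = (text) => (text.includes("fee") ? "fees" : "dining");
console.log(accuracy(sample, naive)); // 1 on this toy sample
```

Run the same labeled set against each candidate model, then weigh the accuracy difference against the cost difference before committing.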
Same Model, Different Behavior
The same model (e.g., Claude Sonnet 4.6) will behave differently in:
- Claude.ai chat
- GitHub Copilot
- Cursor
- Your own API call
Why? The system message. Each tool sets a different invisible system prompt that shapes how the model responds. This isn't a bug — it's by design.
It also means you can't directly compare "Claude in chat" vs. "Claude in Copilot" — you're comparing Claude with different system messages.
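In your own API call, you set that system message yourself. A sketch following the Anthropic Messages API request shape (the `withSystem` helper is hypothetical):

```javascript
// Illustrative only: in a direct API call, *you* control the system
// prompt that chat UIs and IDE tools set invisibly.
function withSystem(systemPrompt, userText) {
  return {
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: systemPrompt, // the "invisible" part you now control
    messages: [{ role: "user", content: userText }],
  };
}

// Same model, same user message, two different behaviors:
const reviewer = withSystem("You are a terse senior code reviewer.", "Review this diff...");
const tutor = withSystem("You are a patient tutor. Explain step by step.", "Review this diff...");
```

This is why "the same model" can feel like two different tools depending on where you use it.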
Which Model to Use for What
A rough guide for everyday decisions:
Complex architecture decisions, detailed code review:
→ Use the most capable model available (Sonnet 4.6, GPT-5, etc.)
Routine code generation, documentation:
→ Mid-tier (Claude Haiku, GPT-4o mini)
High-volume classification, extraction, summaries:
→ Smallest model that meets your accuracy bar (often Haiku-class)
Creative writing, brainstorming:
→ Try different models — creative quality varies noticeably

The Prompting Technique Connection
Your prompting technique should also change based on model size:
- Smaller models respond better to simpler, very explicit prompts. Few-shot examples become more important to guide behavior.
- Larger models can handle more ambiguity and implicit context. A well-structured zero-shot prompt often works.
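The contrast can be sketched as two prompt strings for the same classification task; categories and examples are invented for illustration:

```javascript
// For a large model: a short zero-shot prompt often suffices.
const zeroShot = "Classify this transaction into a category: 'Uber 7.50'";

// For a small model: explicit instructions plus few-shot examples.
const fewShot = [
  "Classify the transaction into exactly one of: transport, dining, fees.",
  "Reply with only the category word.",
  "",
  "Transaction: 'Shell Gas Station 40.00' -> transport",
  "Transaction: 'Chipotle 12.80' -> dining",
  "Transaction: 'Overdraft charge 35.00' -> fees",
  "",
  "Transaction: 'Uber 7.50' ->",
].join("\n");
```

The few-shot version spends more input tokens, but for small models those examples do the work that a larger model's general capability would otherwise cover.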
Chain-of-thought prompting (coming up) shows dramatically better results on larger models than smaller ones. Keep this in mind when debugging unexpected outputs from a smaller/cheaper model.
Practical Checklist
When choosing a model for a new project:
- What's the task? (reasoning, code, classification, creative)
- How often will this run? (once vs. thousands/day)
- What's the acceptable error rate?
- What's my token budget per call?
- Can I swap the model if this one gets deprecated?
- Have I actually benchmarked this on real examples?
Don't commit to a model because it's the most popular or the one you used last time. Commit because you've tested it for your specific task.