Choosing the Right Model

Different models have different strengths, costs, and deprecation timelines. Picking the right model for the job is a core prompt engineering skill.

March 30, 2026 · 4 min read

Most people find one model they like and use it for everything. This is understandable — once you've gotten comfortable with how a model talks, switching feels like learning a new tool. But in professional contexts, model selection is a real engineering decision.

Models Are Different

All major LLMs are powerful. But they differ in:

  • Speed: How long you wait for a response
  • Cost: How much an API call costs (matters for apps running thousands of calls/day)
  • Capability: Complex reasoning, code quality, creative writing
  • Context window: How much conversation/code it can hold
  • Training cutoff: How current its knowledge is
  • Strengths: Some excel at code, others at analysis or creative writing
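
To make the cost factor concrete, here is a back-of-envelope sketch. The per-token prices below are made-up placeholders, not any provider's real rate card:

```typescript
// Back-of-envelope cost model: price per call, then per month.
// Prices are illustrative placeholders; check your provider's rate card.
interface ModelPricing {
  inputPerMTok: number;  // USD per million input tokens
  outputPerMTok: number; // USD per million output tokens
}

function costPerCall(p: ModelPricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * p.inputPerMTok +
         (outputTokens / 1_000_000) * p.outputPerMTok;
}

// A classification call: short prompt, tiny response.
const small: ModelPricing = { inputPerMTok: 1, outputPerMTok: 5 };   // hypothetical
const large: ModelPricing = { inputPerMTok: 3, outputPerMTok: 15 };  // hypothetical

const callsPerDay = 10_000;
const monthlySmall = costPerCall(small, 500, 20) * callsPerDay * 30;
const monthlyLarge = costPerCall(large, 500, 20) * callsPerDay * 30;
console.log(`small: $${monthlySmall.toFixed(2)}/month, large: $${monthlyLarge.toFixed(2)}/month`);
```

Even with toy numbers, the point holds: at thousands of calls per day, small per-call differences compound into real monthly spend.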

The Cost Consideration

When building AI applications, model cost scales directly with usage. A real example:

An AI fintech app was using Haiku (smaller, faster, cheaper) to classify financial transactions. When Sonnet 4 was released, it cost roughly 80% more than Haiku, and the app was running thousands of classifications per day.

The question wasn't "which model is smarter?" It was "does Sonnet 4 classify transactions measurably better than Haiku for this specific task?" For a classification task where Haiku was already 99% accurate, the answer was no. Haiku stayed.
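
That kind of decision can be settled with a small benchmark rather than intuition. A minimal sketch, using stand-in labels and predictions in place of real eval data:

```typescript
// Decide whether a pricier model earns its keep on a labeled sample.
// Labels and predictions below are stand-ins for real eval data.
function accuracy(predictions: string[], labels: string[]): number {
  const hits = predictions.filter((p, i) => p === labels[i]).length;
  return hits / labels.length;
}

const labels     = ["food", "rent", "food", "travel", "rent"];
const cheapModel = ["food", "rent", "food", "travel", "food"]; // 4/5 correct
const pricyModel = ["food", "rent", "food", "travel", "rent"]; // 5/5 correct

const gain = accuracy(pricyModel, labels) - accuracy(cheapModel, labels);
// Upgrade only if the gain clears a bar you set in advance, e.g. 5 points.
const shouldUpgrade = gain >= 0.05;
```

The decision rule matters as much as the numbers: set the accuracy bar before you run the comparison, not after.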

Choosing the most capable model by default isn't engineering — it's just spending more money.

Deprecation Is Real

Models get deprecated. Haiku could be gone tomorrow. GPT-4 is already deprecated. This has a practical implication:

When you build an AI application, write it so you can swap the model without rewriting everything.

TypeScript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Flexible: model name is a config, not hardcoded
const MODEL = process.env.AI_MODEL ?? "claude-sonnet-4-6";

const response = await anthropic.messages.create({
  model: MODEL,
  messages: [...],
});

This way, when your model gets deprecated (and it will), you change one line and test with the new model.
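
One way to take this further (a sketch; the task names and default model IDs below are placeholders, so check your provider's current model list) is a per-task alias map, so each kind of call can be retargeted independently:

```typescript
// One place to change when a model is retired. Task-level aliases decouple
// app code from provider model IDs; the IDs below are example placeholders.
const MODELS = {
  classify: process.env.CLASSIFY_MODEL ?? "claude-haiku-4-5",
  review:   process.env.REVIEW_MODEL   ?? "claude-sonnet-4-6",
} as const;

function modelFor(task: keyof typeof MODELS): string {
  return MODELS[task];
}

// Call sites name the task, not the model:
// anthropic.messages.create({ model: modelFor("classify"), ... })
```

When the cheap classifier model is deprecated, you update one entry (or one environment variable) without touching the code-review path.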

Try Different Providers

The psychological pattern is familiar from consumer brands: whatever you settle on between ages 18 and 24, you tend to stay loyal to for life. The same is true for AI tools: most people stick with their first comfortable model forever.

Push against this. Each provider and model has genuine strengths:

  • Claude (Anthropic): Strong at nuanced reasoning, following complex instructions, long context
  • GPT series (OpenAI): Widely used, strong code generation, great ecosystem
  • Gemini (Google): Strong multimodal capabilities, improving rapidly
  • Haiku / smaller models: Fast and cheap for high-volume, lower-complexity tasks

What's worth trying:

  • Use a different model for a week and see how it handles your specific tasks
  • When cost matters, benchmark smaller models against larger ones for your specific use case
  • For AI applications, A/B test model outputs before committing
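
The last point can start as a simple tally over collected outputs. A toy harness (the judge here prefers shorter output purely for illustration; real judges are task-specific, e.g. exact-match labels, rubric scoring, or human review):

```typescript
// Tiny A/B tally: feed the same prompts to two models, then count which
// output a judge prefers. Outputs here are canned stand-ins, not API calls.
type Judge = (a: string, b: string) => "A" | "B" | "tie";

function abTest(outputsA: string[], outputsB: string[], judge: Judge) {
  const tally = { A: 0, B: 0, tie: 0 };
  outputsA.forEach((a, i) => { tally[judge(a, outputsB[i])] += 1; });
  return tally;
}

// Stand-in judge: shorter output wins. Swap in your real quality metric.
const shorterWins: Judge = (a, b) =>
  a.length < b.length ? "A" : b.length < a.length ? "B" : "tie";

const result = abTest(
  ["brief answer", "short"],
  ["a much longer and wordier answer", "short"],
  shorterWins,
); // → { A: 1, B: 0, tie: 1 }
```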

Same Model, Different Behavior

The same model (e.g., Claude Sonnet 4.6) will behave differently in:

  • Claude.ai chat
  • GitHub Copilot
  • Cursor
  • Your own API call

Why? The system message. Each tool sets a different invisible system prompt that shapes how the model responds. This isn't a bug — it's by design.

It also means you can't directly compare "Claude in chat" vs. "Claude in Copilot" — you're comparing Claude with different system messages.
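
You can see the mechanism directly in the API: the system prompt travels with every request. A small sketch (the `buildRequest` helper and both system prompts are invented for illustration):

```typescript
// Same model, same user message, two different system prompts:
// the request bodies differ, so the responses will read differently.
function buildRequest(system: string, userMessage: string) {
  return {
    model: "claude-sonnet-4-6",
    system,
    messages: [{ role: "user" as const, content: userMessage }],
  };
}

const chatStyle = buildRequest(
  "You are a helpful, conversational assistant.",
  "Explain closures.",
);
const editorStyle = buildRequest(
  "You are a terse code-review bot. Answer in bullet points.",
  "Explain closures.",
);
// Identical model and question; only the invisible framing changed.
```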

Which Model to Use for What

A rough guide for everyday decisions:

Plain text
Complex architecture decisions, detailed code review:
  → Use the most capable model available (Sonnet 4.6, GPT-5, etc.)

Routine code generation, documentation:
  → Mid-tier (Claude Haiku, GPT-4o mini)

High-volume classification, extraction, summaries:
  → Smallest model that meets your accuracy bar (often Haiku-class)

Creative writing, brainstorming:
  → Try different models — creative quality varies noticeably

The Prompting Technique Connection

Your prompting technique should also change based on model size:

  • Smaller models respond better to simpler, very explicit prompts. Few-shot examples become more important to guide behavior.
  • Larger models can handle more ambiguity and implicit context. A well-structured zero-shot prompt often works.
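
For the smaller-model case, a few-shot prompt can be assembled mechanically: explicit instructions up front, worked examples, then the new input. A sketch (the classification task, examples, and labels are illustrative):

```typescript
// Build a few-shot classification prompt for a smaller model.
interface Shot { input: string; label: string }

function fewShotPrompt(instruction: string, shots: Shot[], input: string): string {
  const examples = shots
    .map(s => `Input: ${s.input}\nLabel: ${s.label}`)
    .join("\n\n");
  // End with "Label:" so the model completes with just the answer.
  return `${instruction}\n\n${examples}\n\nInput: ${input}\nLabel:`;
}

const prompt = fewShotPrompt(
  "Classify the transaction into exactly one of: food, rent, travel.",
  [
    { input: "UBER TRIP 8123", label: "travel" },
    { input: "WHOLE FOODS #44", label: "food" },
  ],
  "DELTA AIR 0552",
);
```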

Chain-of-thought prompting (coming up) shows dramatically better results on larger models than smaller ones. Keep this in mind when debugging unexpected outputs from a smaller/cheaper model.

Practical Checklist

When choosing a model for a new project:

  • What's the task? (reasoning, code, classification, creative)
  • How often will this run? (once vs. thousands/day)
  • What's the acceptable error rate?
  • What's my token budget per call?
  • Can I swap the model if this one gets deprecated?
  • Have I actually benchmarked this on real examples?

Don't commit to a model because it's the most popular or the one you used last time. Commit because you've tested it for your specific task.
