Which LLM model are you using? <-- Wrong question!
By far the most common question I get when discussing a project involving LLMs is which model I used.
In my opinion, it reveals a deep misunderstanding of what LLMs are and how to work with them¹.
Choosing the right model is part of a much larger set of questions that is nowadays called context engineering. I like Anthropic's definition:
> Context engineering is the process of considering the holistic state available to the LLM at any given time and what potential behaviors that state might yield.
Context engineering is about asking: given this input and context, can I reliably trust the model to produce the output I need? It is about designing a system where you understand the relationship between your inputs and the model's capabilities well enough to trust the outputs.
Here are the actual questions we should ask ourselves (a minimal sketch for each follows the list):
- What's your prompt structure? LLMs themselves are very good at improving prompts.
- Are there ambiguities? At the very least, copy-paste your complete prompt and just ask another LLM to highlight ambiguities.
- How are you handling context window limits?
- What's your evaluation/testing strategy? You used to conduct evaluations when using traditional ML algorithms, right? Well, it's the same with LLMs. You can't skip that part.
- How are you managing state/memory?
- What tools/functions are you providing?
- How are you handling errors/retries? Without a strategy here, your system will fail in production no matter which model you choose.
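To make these concrete, here are minimal sketches for each question. They all assume a hypothetical `call_llm(prompt: str) -> str` wrapper around whatever provider you use; the task, names, and thresholds are illustrative, not prescriptive. First, prompt structure and the ambiguity check:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your provider's chat API; swap in a real call."""
    raise NotImplementedError("plug in your provider here")

# A structured prompt: role, task, and output constraints are all explicit,
# which makes the prompt itself reviewable and testable.
PROMPT_TEMPLATE = """\
Role: You are a support-ticket triage assistant.
Task: Classify the ticket below into exactly one of: billing, bug, feature_request.
Constraints: Answer with the category name only, lowercase, no punctuation.

Ticket:
{ticket}
"""

# The ambiguity check: have a second model audit the prompt before shipping it.
AUDIT_PROMPT = (
    "Here is a prompt I intend to send to an LLM. List every ambiguity, "
    "missing constraint, or underspecified edge case you can find:\n\n"
    + PROMPT_TEMPLATE
)
# report = call_llm(AUDIT_PROMPT)
```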
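For context window limits, the simplest honest answer is a token budget you enforce yourself. This sketch uses a crude four-characters-per-token estimate; in practice, count with your provider's tokenizer:

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic (~4 characters per token); replace with a real tokenizer.
    return max(1, len(text) // 4)

def trim_history(messages: list[str], budget: int = 8_000) -> list[str]:
    """Keep the newest messages that fit within the token budget."""
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```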
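For evaluation, even a tiny regression suite beats nothing. Reusing `call_llm` and `PROMPT_TEMPLATE` from the first sketch, with made-up cases:

```python
# Fixed, labeled cases, re-run on every prompt or model change.
CASES = [
    ("I was charged twice for my subscription", "billing"),
    ("The app crashes when I upload a PNG", "bug"),
    ("Please add a dark mode", "feature_request"),
]

def run_eval() -> float:
    passed = sum(
        call_llm(PROMPT_TEMPLATE.format(ticket=ticket)).strip().lower() == expected
        for ticket, expected in CASES
    )
    return passed / len(CASES)

# Gate deployment on the pass rate, e.g.:
# assert run_eval() >= 0.95, "prompt or model change regressed the evals"
```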
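For state and memory, one common pattern (among many) is to keep recent turns verbatim and fold older turns into a running summary:

```python
class SummaryMemory:
    """Keep the last few turns verbatim; compress older ones into a summary."""

    def __init__(self, keep_last: int = 6):
        self.keep_last = keep_last
        self.summary = ""
        self.turns: list[str] = []

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        if len(self.turns) > self.keep_last:
            oldest = self.turns.pop(0)
            self.summary = call_llm(
                "Update this running summary with the new turn.\n"
                f"Summary: {self.summary}\nNew turn: {oldest}"
            )

    def context(self) -> str:
        return f"Summary so far: {self.summary}\n" + "\n".join(self.turns)
```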
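For tools, the context-engineering work is mostly in the description: the model only calls a tool correctly when its purpose and parameters are unambiguous. A definition in the JSON-schema style most providers accept (the exact envelope varies by provider):

```python
GET_ORDER_STATUS = {
    "name": "get_order_status",
    "description": (
        "Look up the shipping status of an order by its ID. "
        "Use only when the user provides an explicit order ID."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "The order identifier, e.g. 'ORD-12345'.",
            },
        },
        "required": ["order_id"],
    },
}
```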
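And for errors and retries, a sketch of bounded retries with backoff, where a malformed answer is treated like a transient failure:

```python
import random
import time

def call_with_retries(prompt: str, is_valid, max_attempts: int = 3) -> str:
    for attempt in range(max_attempts):
        try:
            answer = call_llm(prompt)
            if is_valid(answer):
                return answer
        except Exception:
            pass  # provider or network errors are retried too
        time.sleep(2 ** attempt + random.random())  # backoff with jitter
    raise RuntimeError(f"no valid answer after {max_attempts} attempts")

# answer = call_with_retries(
#     PROMPT_TEMPLATE.format(ticket="..."),
#     is_valid=lambda a: a in {"billing", "bug", "feature_request"},
# )
```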
Context engineering goes far beyond providing the right textual context to the model. Again, it is about deeply understanding the problem you are trying to solve, and designing the prompt and the system around it so that the LLM operates at a level where you can trust it (that is, where you can be confident the model will perform appropriately).
Next time someone shows you their LLM project, skip the model question. Instead, ask about context engineering (in particular, evaluation strategies). That's where the real engineering work lives.
---
1. It's as if picking the right (best) LLM would just fix everything, like picking the right ML algorithm would.