The real bottleneck isn't training. It's knowing what good looks like.
The bottleneck people see first
When teams think about improving model behavior, the first image that comes to mind is usually infrastructure. GPUs, training runs, cloud costs, provisioning delays. Those are real barriers. Access to compute is genuinely uneven, and the cost of running fine-tuning or evaluation jobs can be hard to predict. It makes sense that teams focus there.
But for most teams working on real products, the confusion starts earlier. Before the question of how to run a training job, there is a more fundamental question that often goes unanswered: what does a better model actually look like for this product?
Why so much improvement work feels random
Most teams start improving their models by changing things. They adjust prompts, swap in a different model, tweak retrieval parameters, restructure context. These are all legitimate moves. Prompting is a real tool. Retrieval design matters. None of this is wrong.
The problem is iterating without a target. When there is no stable definition of what success looks like, changes are easy to make but hard to judge. An output might feel better to the person who made the change, but that feeling is difficult to verify and nearly impossible to repeat. Improvement without a target becomes guesswork. Guesswork compounds. Each change introduces new uncertainty without resolving the old kind.
Model quality problems often hide specification problems. The output looks wrong, but the real issue is that no one has written down what right looks like.
Knowing what good looks like is the real work
Before model improvement can become systematic, a team needs to define a few things clearly: what the desired behavior looks like, what the common failure modes are, what tradeoffs are acceptable, and what they are actually optimizing for. Together, these form a stable definition of success.
This is closer to product work than ML work. It does not require a background in machine learning. It requires the same kind of thinking that goes into defining acceptance criteria, writing test cases, or deciding what a feature is supposed to do. The discipline is familiar. It just has not been applied consistently to model behavior.
Once a team has that definition, improvement becomes a different kind of activity. Changes can be evaluated against something real. Progress can be measured. Regressions become visible. The work stops feeling like guessing.
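One way to make that definition concrete is to write it down as checkable criteria. The sketch below is a hypothetical example for an imagined support-ticket summarizer; the criteria names, thresholds, and banned phrases are illustrative, not a prescribed schema.

```python
# Hypothetical sketch: a "definition of good" for a support-ticket
# summarizer, written as explicit, checkable criteria rather than
# held informally. All names and thresholds are illustrative.

from dataclasses import dataclass


@dataclass
class SuccessCriteria:
    max_words: int = 80                            # desired behavior: concise
    must_mention: tuple = ("issue", "resolution")  # required content
    banned_phrases: tuple = ("as an AI",)          # known failure mode


def evaluate(output: str, criteria: SuccessCriteria) -> dict:
    """Score one output against the written-down definition of success."""
    lowered = output.lower()
    return {
        "concise": len(output.split()) <= criteria.max_words,
        "complete": all(k in lowered for k in criteria.must_mention),
        "no_known_failures": not any(
            p.lower() in lowered for p in criteria.banned_phrases
        ),
    }


checks = evaluate(
    "Issue: login loop after password reset. Resolution: cleared stale session.",
    SuccessCriteria(),
)
print(checks)  # each criterion passes or fails explicitly, per output
```

Once the criteria live in code like this, every prompt tweak or model swap can be judged against the same fixed target, and a regression shows up as a criterion flipping from pass to fail.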
Evaluation as a routing mechanism
Evaluation is not just for scoring final output. It is also a routing mechanism for deciding where to intervene, and that distinction matters a great deal in practice.
When a model behaves badly, the cause could be in several different places: the prompt design, the context being passed in, the retrieval setup, the tools available to the model, the output constraints, the model itself, or the overall workflow structure. Each of these has a different fix. Prompt issues call for prompt work. Retrieval issues call for retrieval work. Only some problems call for fine-tuning or distillation.
This is what intervention routing means: using evaluation not just to measure quality, but to identify which layer of the system is responsible for the observed failure. Without that, teams often reach for the most visible lever (usually the prompt or the model) when the actual problem is somewhere else entirely.
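The routing idea can be sketched as a simple lookup from failure category to owning layer. The categories and layer names below are invented for illustration; real taxonomies would be product-specific.

```python
# Hypothetical sketch of intervention routing: evaluation tags each
# failure with a category, and a routing table maps that category to
# the system layer that owns the fix. Names are illustrative.

ROUTES = {
    "missing_context":     "retrieval",           # answer exists, wasn't fetched
    "ignored_instruction": "prompt",              # context fine, instruction unclear
    "malformed_output":    "output_constraints",  # schema or format enforcement
    "wrong_tool_call":     "tools",               # tool definitions or descriptions
    "capability_gap":      "model",               # only here consider fine-tuning
}


def route(failure_category: str) -> str:
    """Return the system layer that should receive the intervention."""
    return ROUTES.get(failure_category, "needs_triage")


print(route("missing_context"))  # retrieval work, not a prompt rewrite
print(route("capability_gap"))   # model work is the last resort, not the first
```

The point of the table is that the most visible lever is no longer the default: a failure tagged `missing_context` routes to retrieval work even though rewriting the prompt would have been the easier reflex.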
In real products, eval is downstream-aware
In a research context, "better" usually means higher scores on a benchmark. In a product context, it means something more specific and more varied. The evaluation target should reflect product success, not just the quality of the model in isolation.
Depending on the product, what better means might be: more consistent outputs across similar inputs, acceptable latency under real load, predictable cost per request, safer failure behavior when the model is uncertain, better fit with the surrounding workflow, less operational burden on the team, or higher trust from the people using it. These are product requirements. They belong in the evaluation target just as much as output quality does.
A model that scores well on an abstract quality metric but introduces unpredictable latency or breaks downstream parsing is not an improvement. Evaluation that ignores those dimensions is incomplete.
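A downstream-aware check can fold those dimensions into the same pass/fail judgment as output quality. The sketch below is hypothetical: the expected JSON shape, the `answer` key, and the latency budget are illustrative assumptions, and `fake_model` stands in for a real model call.

```python
# Hypothetical sketch of a downstream-aware check: an output only counts
# as an improvement if it also parses cleanly and arrives within budget.
# The schema ("answer" key) and the 2-second budget are illustrative.

import json
import time

LATENCY_BUDGET_S = 2.0


def downstream_aware_eval(call_model, request: str) -> dict:
    start = time.perf_counter()
    raw = call_model(request)
    latency = time.perf_counter() - start

    try:
        parsed = json.loads(raw)  # would this break downstream parsing?
        parses = isinstance(parsed, dict) and "answer" in parsed
    except json.JSONDecodeError:
        parses = False

    return {
        "parses": parses,
        "within_latency_budget": latency <= LATENCY_BUDGET_S,
    }


def fake_model(request: str) -> str:
    """Stand-in for a real model call, used only for illustration."""
    return '{"answer": "reset the session token"}'


print(downstream_aware_eval(fake_model, "how do I fix the login loop?"))
```

A candidate model that writes beautiful prose but fails either check here is, by this definition, not an improvement.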
"Good" varies by product. But it still has to be made explicit.
There is no universal standard for model behavior. What counts as good in a customer support tool is different from what counts as good in a code assistant or a document summarizer. This is not a problem to solve. It is just the nature of building products.
But variation across products does not make the definition optional. If what good looks like stays implicit (shared informally, assumed rather than written down), then model improvement stays fragile. Changes get made based on intuition. Regressions go unnoticed. The team loses confidence in its own progress.
The goal is not to define good for every team. It is to make the act of defining it feel like a normal part of building, something that happens early, gets revisited as the product evolves, and informs every decision about where to invest in improvement.
Why this matters for developers
One reason model improvement still feels like ML work is that the surrounding practices (defining success, structuring feedback loops, deciding where to intervene) are still wrapped in ML assumptions. The tooling assumes ML expertise. The vocabulary assumes familiarity with training pipelines. The workflows assume a specialist is involved.
That distance is what needs to shrink. Developers who already work with LLMs, who understand prompting, structured outputs, and retrieval, are not missing the technical foundation. What they are often missing is a clear path from "this output is wrong" to "here is what I should change and why."
Model improvement should feel like a natural extension of product development, not a detour into a specialist domain. The practices that make it systematic (defining success, evaluating against it, routing interventions to the right layer) are not exotic. They are just not yet treated as defaults.
The hardest part of improving a model in a real product is often not executing a training run. It is defining what improvement means well enough to guide real decisions: about what to measure, where to look, and which lever to pull. If LLMs are now a normal part of software, then model improvement cannot remain a side activity reserved for specialists. Making it accessible starts with a simpler discipline than most people expect: being explicit about what success actually looks like.