LLM Evals Are Just Integration Tests
LLM evals are having a moment.
There are dashboards, frameworks, benchmarks, blog posts, and an entire cottage industry forming around the idea that evaluating LLM systems is a fundamentally new problem.
I don’t think it is. (At least, it hasn’t been in my experience.)
In fact, the more time I spend working with LLMs in production, the more convinced I am that LLM evals are just integration tests — with a few uncomfortable twists.
Once you see that, a lot of the confusion clears up.
Why LLM Evals Feel Hard
LLMs make people uncomfortable because they break a few assumptions we’ve leaned on for years:
- Outputs aren’t deterministic
- “Correct” is often subjective
- Behavior changes as models drift
- Failures are embarrassing, not catastrophic (wrong answers, awkward responses, trust erosion)
So instead of asking:
“Did this function return the right value?”
We ask:
“Does this response feel acceptable?”
That shift makes people reach for new tools, new abstractions, and new terminology.
But under the hood, we’re still testing systems, not models.
What You’re Actually Evaluating
Very few production failures are caused by the model itself.
They usually come from:
- Prompt changes
- Retrieval failures
- Bad context
- Tool misuse
- Missing guardrails
- Silent latency regressions
In other words: integration bugs.
The model just happens to be the loudest component when something goes wrong.
This is why “benchmarking the model” rarely tells you what you want to know. Your users aren’t interacting with a model — they’re interacting with a pipeline.
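To make that concrete, here is a minimal sketch of what testing the pipeline rather than the model can look like. Everything here is hypothetical: `retrieve`, `generate`, and the check names are stand-ins for illustration, not any particular framework's API.

```python
import time

def retrieve(query: str) -> list[str]:
    # Stand-in for a retrieval stage (vector store, search index, etc.).
    corpus = {"refund policy": ["Refunds are issued within 14 days."]}
    return corpus.get(query, [])

def generate(query: str, context: list[str]) -> str:
    # Stand-in for the model call; a real one would be non-deterministic.
    return context[0] if context else "I don't know."

def run_pipeline(query: str) -> dict:
    start = time.monotonic()
    context = retrieve(query)
    answer = generate(query, context)
    return {"context": context, "answer": answer,
            "latency_s": time.monotonic() - start}

def eval_pipeline(query: str) -> dict:
    # Each check targets a pipeline stage, not the model in isolation:
    # retrieval failures, ungrounded answers, and latency regressions
    # are exactly the integration bugs listed above.
    result = run_pipeline(query)
    result["checks"] = {
        "retrieval_nonempty": len(result["context"]) > 0,
        "grounded_answer": result["answer"] in result["context"]
                           or result["answer"] == "I don't know.",
        "latency_ok": result["latency_s"] < 2.0,
    }
    return result
```

Note that a "model score" never appears: the checks that catch real production failures are about what went into and around the model call.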
The Mental Model That Actually Helped Me
I’ve gone looking for a “proper” eval solution more than once.
Across a few teams, I’ve tried to formalize this with real tooling — things like DeepEval, which I liked because it felt familiar and test-like, and could plug into platforms like MLflow, Datadog, or LangSmith for tracking.
Each time, I ended up in roughly the same place: useful signal, real insights — and tests that were fundamentally slower, more expensive, and more situational than anything I’d ever want running on every change.
That tension is what finally forced a rethink, and the breakthrough for me was simple:
Stop treating evals as a new discipline.
Start treating them like integration tests for probabilistic systems.
That framing immediately answers a few questions:
- Why they’re slow
- Why they’re expensive
- Why they’re never complete
- Why they don’t belong entirely in CI
It also resets expectations.
Integration tests don’t prove correctness.
They reduce the chance of embarrassing failure.
That’s the real job of LLM evals too.
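The "integration tests for probabilistic systems" framing also suggests a concrete test shape: run the same case several times and assert on a pass rate, not on exact output. A minimal sketch, with a hypothetical stand-in for the model call; the prompt, responses, and threshold are all invented for illustration.

```python
import random

def model_call(prompt: str, rng: random.Random) -> str:
    # Hypothetical stand-in for a non-deterministic LLM call.
    roll = rng.random()
    if roll < 0.85:
        return "The answer is 4."
    if roll < 0.95:
        return "Four."
    return "I'm not sure."

def acceptable(response: str) -> bool:
    # A tolerant check: many phrasings count as "correct",
    # because "correct" is subjective at the edges.
    text = response.lower()
    return "4" in text or "four" in text

def pass_rate(prompt: str, n: int = 50, seed: int = 0) -> float:
    # Fixed seed keeps this sketch reproducible; a real eval
    # would sample live, non-deterministic outputs.
    rng = random.Random(seed)
    hits = sum(acceptable(model_call(prompt, rng)) for _ in range(n))
    return hits / n

# Like any integration test, this reduces the chance of
# embarrassing failure; it does not prove correctness.
assert pass_rate("What is 2 + 2?") >= 0.8
```

The threshold assertion is the whole point: it makes explicit that the test will never be complete or deterministic, which is also why it is slow, costs real money per run, and belongs in a scheduled suite rather than on every commit.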
What This Changes
Once you internalize this:
- You stop chasing perfect scores
- You stop expecting determinism
- You stop over-investing in tooling early
- You start collecting examples, not just metrics
Most importantly, you stop asking:
“Is the model good?”
And start asking:
“Is this system safer than it was yesterday?”
That’s a question engineering teams actually know how to answer.
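Answering that question can be as simple as comparing pass rates between eval runs, with a tolerance for the noise a probabilistic system produces. A hypothetical sketch; the tolerance value is an assumption, not a recommendation.

```python
def is_regression(today: float, yesterday: float,
                  tolerance: float = 0.05) -> bool:
    # Hypothetical gate: flag only drops larger than the
    # run-to-run noise you expect from a probabilistic system.
    return today < yesterday - tolerance

assert not is_regression(today=0.91, yesterday=0.93)  # noise, not a regression
assert is_regression(today=0.70, yesterday=0.93)      # real drop worth blocking on
```

This is deliberately relative: it compares the system to its own recent past instead of chasing an absolute score.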
Final Thought
LLMs didn’t invalidate software testing.
They just removed our illusion of certainty.
Evals aren’t magic.
They’re integration tests — messier, fuzzier, and more honest.
Treat them that way, and they become useful instead of overwhelming.