Your LLM Problem Isn't Tooling. It's Effort
LLM Eval Frameworks Are Not What You Need
I spend my time solving problems with AI and LLMs, both at Stackfix as a software/AI engineer and at the AI companies I started before that.
Everyone wants great output. But getting the right eval framework or LLM infrastructure is not the solution.
The Problem With Specialized LLM Tools
I see this with companies like Atla.
But I question the value of what Atla and others are building (there are lots, e.g., Confident, Athina, Capitol, Openlayer, Langfuse, Lunary).
To get outstanding results from LLMs, you need to focus on your specific business problem.
Getting more tools won't solve this for you.
And particularly not specialized LLMs built to evaluate other LLMs: how useful can a specialized judge model be when measured against Gemini 2.5 or future models?
For LLM-as-a-judge, you could just use the latest LLM. Or an LLM that's good, very fast and cheap; I've added Gemini 2 Flash Lite for us.
I don't think this addresses the main problem.
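To show how little machinery LLM-as-a-judge needs, here is a minimal sketch. Everything in it is an assumption for illustration: the prompt wording, the PASS/FAIL protocol, and `call_model`, which stands in for whatever client you already use to call a general-purpose model.

```python
from typing import Callable

# Hypothetical prompt; adapt the wording and criterion to your own problem.
JUDGE_PROMPT = """You are grading an answer.
Question: {question}
Answer: {answer}
Criterion: {criterion}
Reply with exactly PASS or FAIL."""


def judge(
    call_model: Callable[[str], str],
    question: str,
    answer: str,
    criterion: str,
) -> bool:
    """Ask a general-purpose model whether `answer` meets `criterion`.

    `call_model` takes a prompt string and returns the model's text reply.
    """
    prompt = JUDGE_PROMPT.format(
        question=question, answer=answer, criterion=criterion
    )
    verdict = call_model(prompt).strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        # Fail loudly instead of guessing what the judge meant.
        raise ValueError(f"Unexpected judge verdict: {verdict!r}")
    return verdict == "PASS"
```

Swapping in a newer or cheaper judge model means changing only the function you pass as `call_model`.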
The Real Solution: Doing the Work, Not More Tools
Writing great tests and specific evals is the solution.
As engineers, we often fall into the trap of thinking that we need more tools to help us do the work. Actually, we don't need more tools; we just need to do the work.
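As a rough illustration of what a specific eval can look like, here is a sketch for a made-up task: extracting a plan's price from vendor text as JSON. The task, the field names (`plan`, `price_usd`), and the checks are all hypothetical; the point is that the checks encode what correct means for one business problem, using nothing but the standard library.

```python
import json


def check_pricing_extraction(output: str, source_text: str) -> list[str]:
    """Return a list of failures; an empty list means the output passed."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    failures = []
    # Hypothetical required fields for this made-up task.
    for field in ("plan", "price_usd"):
        if field not in data:
            failures.append(f"missing field: {field}")

    # Crude hallucination check: the extracted price must literally
    # appear somewhere in the source text.
    price = data.get("price_usd")
    if price is not None and str(price) not in source_text:
        failures.append(f"price {price} does not appear in the source")
    return failures


source = "The Pro plan costs $49 per month."
print(check_pricing_extraction('{"plan": "Pro", "price_usd": 49}', source))
print(check_pricing_extraction('{"plan": "Pro", "price_usd": 59}', source))
```

Run a function like this over a few dozen real examples from your own product and you have an eval suite. The hard part is deciding which checks matter, and no tool makes that decision for you.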
When Specialized Infrastructure Might Make Sense
I think the latest companies building LLM or eval infrastructure are not very useful for most practitioners.
But they might be useful at massive scale. A bit like having a logging or observability service (this is how Arize AI describe themselves).
At larger scale you'll probably want some sort of dedicated provider. But this seems like something to add after you've done the real work of building whatever evals fit your use case. Not from the start.
The Unavoidable Work
Fundamentally, an LLM or eval provider is not going to do the work for you.
You need to do the work yourself.
Actually figuring out the custom evals that you need for your specific business solution is the real LLM work.
This is quite complex, and you can't outsource it to new tools.