AI bots are sprouting up everywhere like weeds right now. All well and good – but how good are the answers of my own implementation, really? How does quality move when I tweak the prompt or the retrieval? And how do I improve it on purpose? The classic tools for judging any of that are text metrics like BLEU or ROUGE.
In the end, though, BLEU and ROUGE only count word overlap with a reference text. For open-ended answers that says little about actual quality – two perfectly correct answers can be worded completely differently. Humans judge quality reliably but don't scale to tens of thousands of requests. LLM as a Judge sits exactly in between: fast like a metric, close to human judgment in substance.
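To make the overlap problem concrete, here is a minimal sketch. It uses a crude set-based unigram F1 as a stand-in for BLEU/ROUGE (the real metrics are more elaborate, but share the basic idea of counting overlap). The function name `unigram_f1` and the example sentences are made up for illustration.

```python
import re

def unigram_f1(candidate: str, reference: str) -> float:
    """Crude word-overlap score (set-based unigram F1), a stand-in for BLEU/ROUGE."""
    cand = set(re.findall(r"\w+", candidate.lower()))
    ref = set(re.findall(r"\w+", reference.lower()))
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "You can cancel the subscription at any time in your account settings."
# Correct answer, completely different wording.
paraphrase = "Ending your plan is possible whenever you like via the profile page."
# Wrong answer, almost identical wording.
wrong = "You can not cancel the subscription at any time in your account settings."

print(round(unigram_f1(paraphrase, reference), 2))  # 0.25
print(round(unigram_f1(wrong, reference), 2))       # 0.96
```

The correct paraphrase scores far below the factually wrong sentence, which is exactly the blind spot described above.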
The idea: a language model rates the outputs of another (or the same) model against criteria you define in the prompt. It doesn't write the answer itself; it only judges: "Is this answer good, and why?"
Why it works at all: judging is simply the easier job. When generating, a model has to juggle correctness, tone, the instructions, the context and admitting uncertainty all at once. Hand it a finished answer and a narrow question ("is this backed by the context?") instead, and the task becomes focused – the judgments come out far more stable.
How reliable is it? The most-cited validation is Zheng et al. 2023 (MT-Bench and Chatbot Arena): strong judges like GPT-4 reach roughly 80% agreement with human preferences, which is about the level at which two humans agree with each other.
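If you want to run the same kind of check on your own judge, "agreement" is simply the share of cases where judge and human pick the same label. A minimal sketch follows; the function name `agreement_rate` and the ten labels are toy data invented for illustration, not numbers from the paper.

```python
def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Share of items on which the judge and the human chose the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("Both label lists must have the same length.")
    if not judge_labels:
        raise ValueError("Need at least one labeled item.")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Toy preference labels for ten answer pairs ("A", "B" or "tie").
judge = ["A", "B", "A", "A", "tie", "B", "A", "B", "B", "A"]
human = ["A", "B", "B", "A", "tie", "B", "A", "A", "B", "A"]

print(agreement_rate(judge, human))  # 0.8
```

A small hand-labeled sample like this is a cheap way to see how your judge behaves on your own data rather than relying on published figures alone.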
⚠️ But: that 80% holds for preference and chat comparisons. On individual hard dimensions – factual faithfulness above all – the correlation with experts is often only moderate. An LLM judge is no substitute for an expert review when content is safety- or fact-critical.
Depending on the goal, you land on one of four patterns in practice. Two hand out absolute scores, one compares, and one makes a binary decision.
Direct scoring: the judge grades a single output on a scale (1–5 or 1–10) against a rubric, with a written justification on request. It's the most common entry point. The catch: models calibrate their scale differently, so a "4" doesn't mean the same thing in every run.
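A minimal sketch of direct scoring, under a few assumptions: the rubric text, the JSON reply format and all function names are invented for illustration, and `call_judge` is a hypothetical callable that sends a prompt to whatever model you use and returns its text reply. A stub takes its place here so the example runs without an API.

```python
import json
from typing import Callable

# Hypothetical rubric; adapt the criteria to your own use case.
RUBRIC = (
    "5 = fully correct and complete\n"
    "3 = mostly correct, minor gaps\n"
    "1 = wrong or off-topic"
)

def build_scoring_prompt(question: str, answer: str, rubric: str = RUBRIC) -> str:
    """Assemble the judge prompt: rubric, question, answer, required reply format."""
    return (
        "You are grading an answer. Do not answer the question yourself.\n\n"
        f"Rubric (1-5):\n{rubric}\n\n"
        f"Question:\n{question}\n\n"
        f"Answer to grade:\n{answer}\n\n"
        'Reply with JSON only: {"score": <integer 1-5>, "reason": "<one sentence>"}'
    )

def parse_score(raw: str, low: int = 1, high: int = 5) -> dict:
    """Parse the judge reply and reject anything outside the scale."""
    data = json.loads(raw)
    score = data["score"]
    if type(score) is not int or not low <= score <= high:
        raise ValueError(f"Score outside {low}-{high}: {score!r}")
    return {"score": score, "reason": str(data.get("reason", ""))}

def score_answer(question: str, answer: str, call_judge: Callable[[str], str]) -> dict:
    return parse_score(call_judge(build_scoring_prompt(question, answer)))

def fake_judge(prompt: str) -> str:
    # Stub standing in for a real model call (assumption for this example).
    return '{"score": 4, "reason": "Correct but omits the refund deadline."}'

result = score_answer("How do refunds work?", "Contact support for a refund.", fake_judge)
print(result["score"], "-", result["reason"])
```

Validating the reply strictly is deliberate: a real model may wrap the JSON in extra text or return an out-of-range number, and it's better to fail loudly than to log a bogus score.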
Pairwise comparison: two candidate answers go side by side and the judge picks the better one. The calibration problem disappears because the judge only decides relatively instead of assigning an absolute number. It's the classic setup for testing prompt variants or model versions against each other. It costs twice the API calls and gives no absolute quality level.
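A minimal sketch of a pairwise comparison. One addition that goes beyond the description above: the sketch asks the judge twice with the answer order swapped and only counts a win when both runs agree, a common precaution against judges that favor whichever answer comes first. The prompt wording, the function names and the `call_judge` callable are assumptions for illustration; toy stubs replace the real model.

```python
from typing import Callable

def build_pairwise_prompt(question: str, first: str, second: str) -> str:
    return (
        f"Question:\n{question}\n\n"
        f"[Answer 1]\n{first}\n\n"
        f"[Answer 2]\n{second}\n\n"
        "Which answer is better? Reply with exactly 1 or 2."
    )

def compare(question: str, answer_a: str, answer_b: str,
            call_judge: Callable[[str], str]) -> str:
    """Return 'A', 'B' or 'tie'. A win only counts if it survives the order swap."""
    forward = call_judge(build_pairwise_prompt(question, answer_a, answer_b)).strip()
    backward = call_judge(build_pairwise_prompt(question, answer_b, answer_a)).strip()
    # In the forward run slot 1 holds A; in the backward run slot 1 holds B.
    forward_winner = {"1": "A", "2": "B"}.get(forward)
    backward_winner = {"1": "B", "2": "A"}.get(backward)
    if forward_winner is not None and forward_winner == backward_winner:
        return forward_winner
    return "tie"

def content_judge(prompt: str) -> str:
    # Toy stub: prefers whichever answer mentions "30 days".
    first_part = prompt.split("[Answer 2]")[0]
    return "1" if "30 days" in first_part else "2"

def first_slot_judge(prompt: str) -> str:
    # Toy stub with pure position bias: always picks the first answer.
    return "1"

question = "What is the return policy?"
a = "Returns are accepted within 30 days."
b = "Please contact support."

print(compare(question, a, b, content_judge))     # A
print(compare(question, a, b, first_slot_judge))  # tie
```

The second stub shows why the swap is worth it: a judge that just picks the first slot produces contradictory verdicts, which the sketch turns into a tie instead of a fake win.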
Reference-based scoring: the judge gets the output plus a gold-standard answer and checks how close the output is to it – essentially smart fuzzy matching. Strong for factual Q&A and structured outputs, provided you already have labeled ground-truth data. But it punishes valid rewordings that sound different from the reference. (Logically it's the same pointwise scoring as the direct variant, just with a reference in the prompt.)
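Since the only real difference to direct scoring is the reference in the prompt, a sketch of that prompt is enough to show the pattern. The wording and the function name `build_reference_prompt` are assumptions for illustration. The line about wording is an attempt to soften the rewording penalty; whether it actually helps with your judge is something you'd have to measure.

```python
def build_reference_prompt(question: str, answer: str, reference: str) -> str:
    """Pointwise scoring prompt with a gold-standard answer included."""
    return (
        "You are grading an answer against a reference. "
        "Do not answer the question yourself.\n\n"
        f"Question:\n{question}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Answer to grade:\n{answer}\n\n"
        "Score 1-5 for how well the answer matches the reference in substance. "
        "Different wording is fine as long as the facts agree.\n"
        'Reply with JSON only: {"score": <integer 1-5>, "reason": "<one sentence>"}'
    )

prompt = build_reference_prompt(
    question="What is the capital of France?",
    answer="It's Paris.",
    reference="Paris",
)
print(prompt)
```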
Binary classification: the verdict is reduced to pass / fail on one concrete property: grounded in context yes/no, contains personal data yes/no, tone ok yes/no. These checks run faster, cost less per evaluation and give more consistent results than numeric scales. In return they lose nuance on edge cases.
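A minimal sketch of a binary groundedness check. The prompt wording, the PASS/FAIL reply convention and the function names are assumptions for illustration; `call_judge` is again a hypothetical callable wrapping your model, replaced here by a stub.

```python
from typing import Callable

def build_grounding_prompt(context: str, answer: str) -> str:
    return (
        f"Context:\n{context}\n\n"
        f"Answer:\n{answer}\n\n"
        "Is every claim in the answer backed by the context? "
        "Reply with exactly PASS or FAIL."
    )

def parse_verdict(raw: str) -> bool:
    """Map the judge reply to True/False; anything else is an error, not a guess."""
    verdict = raw.strip().upper()
    if verdict == "PASS":
        return True
    if verdict == "FAIL":
        return False
    raise ValueError(f"Unexpected verdict: {raw!r}")

def is_grounded(context: str, answer: str, call_judge: Callable[[str], str]) -> bool:
    return parse_verdict(call_judge(build_grounding_prompt(context, answer)))

def stub_judge(prompt: str) -> str:
    # Stub standing in for a real model call; note the sloppy formatting.
    return " pass\n"

print(is_grounded("Shipping takes 3 days.", "Delivery takes three days.", stub_judge))  # True
```

Because the output is a plain boolean, it drops straight into alerting or a pass rate over a batch, which is a large part of the appeal of this pattern.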
| Approach | When to use | Strength | Watch out for |
|---|---|---|---|
| Direct scoring | Ongoing quality monitoring, trends over time | Easy to trend, works on single outputs | Calibration varies by model |
| Pairwise comparison | A/B tests, ranking prompt and model variants | More reliable ranking than absolute scores | Double the cost, no absolute level |
| Reference-based | Factual Q&A, structured outputs with ground truth | Clear reference makes the verdict straightforward | Needs labeled data, punishes alternative phrasings |
| Binary classification | Safety, hallucination and compliance checks | Low ambiguity, easy to alert on | Loses edge cases |
Rule of thumb: no ground truth at hand? Direct or pairwise. Ranking models and prompts? Pairwise. Safety and hallucinations? Binary. Factual Q&A with labeled data? Reference-based.
For RAG it makes sense to split quality into a retriever side and a generator side. Four core metrics have become standard – frameworks like RAGAS or DeepEval compute each of them internally via an LLM judge:
It gets interesting in combination:
Faithfulness and relevancy are RAG-typical, but a judge can be pointed at any property you can describe cleanly in a rubric. Three show up almost every time:
The known weaknesses are all documented – and each has a countermeasure:
And a few things that hold up in production:
H@ppy H@cking ⚖️🤖