LLM as a Judge

Approach	When to use	Strength	Watch out for
Direct scoring	Ongoing quality monitoring, trends over time	Easy to trend, works on single outputs	Calibration varies by model
Pairwise comparison	A/B tests, ranking prompt and model variants	More reliable ranking than absolute scores	Double the cost, no absolute level
Reference-based	Factual Q&A, structured outputs with ground truth	Clear reference makes the verdict straightforward	Needs labeled data, punishes alternative phrasings
Binary classification	Safety, hallucination and compliance checks	Low ambiguity, easy to alert on	Loses nuance on borderline cases
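To make the four approaches in the table concrete, here is a minimal Python sketch of how each one might be phrased as a judge call. Everything in it is an assumption for illustration: `Judge` is a hypothetical callable that takes a prompt string and returns the judge model's raw text reply, and the prompt wording, the 1-5 scale, and the function names are not taken from any particular library.

```python
import re
from typing import Callable

# Hypothetical interface: takes a prompt, returns the judge model's raw reply.
Judge = Callable[[str], str]


def _parse_yes_no(reply: str) -> bool:
    """Map a YES/NO reply to a boolean; fail loudly on anything else."""
    text = reply.strip().upper()
    if text.startswith("YES"):
        return True
    if text.startswith("NO"):
        return False
    raise ValueError(f"unparseable verdict: {reply!r}")


def direct_score(judge: Judge, question: str, answer: str) -> int:
    """Direct scoring: rate a single output on an assumed 1-5 scale."""
    reply = judge(
        "Rate the answer from 1 (poor) to 5 (excellent). "
        "Reply with the number only.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    match = re.search(r"\b([1-5])\b", reply)
    if match is None:
        raise ValueError(f"unparseable score: {reply!r}")
    return int(match.group(1))


def pairwise(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Pairwise comparison: return 'A' or 'B' for the preferred answer."""
    reply = judge(
        "Which answer is better? Reply with A or B only.\n"
        f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
    )
    verdict = reply.strip().upper()[:1]
    if verdict not in ("A", "B"):
        raise ValueError(f"unparseable preference: {reply!r}")
    return verdict


def reference_based(judge: Judge, question: str, answer: str, reference: str) -> bool:
    """Reference-based: does the answer agree with a known-good reference?"""
    reply = judge(
        "Does the answer agree with the reference? Reply YES or NO only.\n"
        f"Question: {question}\nReference: {reference}\nAnswer: {answer}"
    )
    return _parse_yes_no(reply)


def binary_check(judge: Judge, text: str, policy: str) -> bool:
    """Binary classification: True if the text violates the given policy."""
    reply = judge(
        "Does the text violate the policy? Reply YES or NO only.\n"
        f"Policy: {policy}\nText: {text}"
    )
    return _parse_yes_no(reply)
```

In use, `judge` would wrap whatever model client is available. Each function raises on a reply it cannot parse rather than guessing, so malformed judge output shows up as an error instead of as a silent score.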

The problem