Abstract

LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empirical study comparing nine debiasing strategies across five judge models from four provider families (Google, Anthropic, OpenAI, Meta), three benchmarks (MT-Bench n=400, LLMBar n=200, custom n=375), and four bias types. Our headline practical finding is that a mid-tier model with the right debiasing can outperform frontier judges at a fraction of the cost: Gemini 2.5 Flash with the Combined Budget strategy achieves the highest agreement of any configuration we tested (71.0%, κ = 0.549, p < 0.0001) at ~$0.001 per evaluation, roughly 15× cheaper than the strongest frontier configuration (Claude Sonnet 4 with the same strategy at 69.5%, ~$0.015 per evaluation). Our other key findings are: (1) Style bias is the dominant bias (0.10–0.76 at baseline across all models, mostly favoring markdown over plain prose), far exceeding position bias (≤ 0.04), yet it has received minimal research attention. (2) Verbosity bias is heterogeneous across models when measured in a length-aware way: Llama, Gemini Pro, and Gemini Flash show classical verbosity bias (+0.24 to +0.44, preferring longer responses), Claude Sonnet 4 shows the opposite (-0.12, preferring concise responses), and GPT-4o is essentially neutral (-0.04); on truncation controls, all models correctly prefer the genuinely complete response (0.88–1.00 accuracy), so the expansion-pair preferences cannot be reduced to length-only effects. (3) Debiasing yields statistically significant gains for multiple models: Claude S8 (+11.5 pp, p<0.0001), Flash S8 (+7.5 pp, p<0.0001), and Claude S5 (+7.3 pp, p=0.0009) survive Holm-Bonferroni correction; Flash S1 (+4.7 pp, p=0.004) and Llama S8 (+4.5 pp, p=0.011) are significant only before correction; Pro and GPT-4o show smaller, non-significant directional gains.
We release our evaluation framework, the 375-pair controlled dataset (now including round-robin MODEL_ORIGIN and position-mirrored STYLE pairs), and per-instance cached results for all 9 strategies.
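
For readers unfamiliar with the Holm-Bonferroni correction mentioned in finding (3), the step-down procedure can be sketched as follows. This is a minimal illustration, not the released evaluation framework: the function name `holm_bonferroni` and the example p-values are hypothetical, and the size of the family of tests used in the study is not stated here, so the example does not reproduce the reported results.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per p-value: True if that hypothesis is rejected.

    Step-down procedure: sort p-values ascending and compare the k-th
    smallest (k = 0, 1, ...) against alpha / (m - k). Stop at the first
    p-value that fails; it and all larger p-values are not rejected.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        threshold = alpha / (m - rank)
        if p_values[idx] <= threshold:
            reject[idx] = True
        else:
            break  # every remaining (larger) p-value also fails
    return reject


# Hypothetical p-values, chosen only to illustrate the procedure.
example_p = [0.001, 0.04, 0.03, 0.005]
print(holm_bonferroni(example_p))
```

In this illustrative family, 0.03 and 0.04 fall below the uncorrected 0.05 level but do not survive correction, which is the same distinction the abstract draws between results that "survive Holm-Bonferroni correction" and those that are "significant before correction".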

Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

Sadman Kabir Soumik
