MTSAIR SBS Leaderboard
MTSAIR leaderboard evaluating an LLM's performance as a judge in side-by-side (SBS) assessment, in terms of:
- correlation with manual judgement, and
- robustness to positional bias.
| Model | Avg. Correlation ⬆️ | APCC | MPCC | PCon@AB | MPCC Consistency | MPCC Swap Delta | Architecture | Precision | Hub License | #Params (B) | Hub ❤️ | Model sha |
|-------|--------------------:|-----:|-----:|--------:|-----------------:|----------------:|--------------|-----------|-------------|------------:|-------:|-----------|
| [anthropic/Claude-3-opus](https://huggingface.co/anthropic/Claude-3-opus) | 64.16 | 68.1 | 60.22 | 43.12 | 59.8 | -12.52 | ? | ? | ? | 0 | 0 | main |
| [deepseek-ai/DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3) | 59.73 | 62.44 | 57.02 | 27.06 | 16.43 | 7.74 | DeepseekV3ForCausalLM | ? | ? | 0 | 0 | main |
| [deepseek-ai/deepseek-r1-distill-llama-70b-awq](https://huggingface.co/deepseek-ai/deepseek-r1-distill-llama-70b-awq) | 43.32 | 22.68 | 63.97 | 49.97 | 84.58 | -1.19 | ? | ? | ? | 0 | 0 | main |
| [yandex/YandexGPT-4-pro](https://huggingface.co/yandex/YandexGPT-4-pro) | 40.14 | 23.05 | 57.23 | 40.95 | 31.87 | 9.25 | ? | ? | ? | 0 | 0 | main |
| [meta-llama/llama-3.3-70b-instruct-awq](https://huggingface.co/meta-llama/llama-3.3-70b-instruct-awq) | 31.8 | 4.08 | 59.53 | 47.58 | 59.86 | -24.86 | ? | ? | ? | 0 | 0 | main |
| [t-tech/T-pro-it-1.0](https://huggingface.co/t-tech/T-pro-it-1.0) | 28.24 | 6.96 | 49.51 | 39.96 | 36.27 | 17.95 | Qwen2ForCausalLM | ? | ? | 0 | 0 | main |
| [openai/GPT-4](https://huggingface.co/openai/GPT-4) | 27.99 | -8.22 | 64.2 | 33.86 | 67.44 | -25.98 | ? | ? | ? | 0 | 0 | main |
| [miqudev/miqu-1-70b](https://huggingface.co/miqudev/miqu-1-70b) | 27.16 | 2.18 | 52.13 | 23.3 | 18.72 | -33.8 | ? | ? | ? | 0 | 0 | main |
| [t-tech/T-lite-it-1.0](https://huggingface.co/t-tech/T-lite-it-1.0) | 18.83 | 23.83 | 13.82 | 12.24 | -37.03 | -9.28 | Qwen2ForCausalLM | ? | ? | 0 | 0 | main |
| [openai/GPT-4o](https://huggingface.co/openai/GPT-4o) | 17.55 | -14.38 | 49.47 | 32.88 | 66.57 | -3.45 | ? | ? | ? | 0 | 0 | main |
| [anthropic/Claude-3-5-sonnet](https://huggingface.co/anthropic/Claude-3-5-sonnet) | 15.29 | -12.17 | 42.74 | 10.65 | -27.48 | -24.1 | ? | ? | ? | 0 | 0 | main |
| [meta-llama/Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) | 13.92 | -22.87 | 50.7 | 33.58 | 52.57 | -5.84 | LlamaForCausalLM | ? | ? | 0 | 0 | main |
This leaderboard displays metrics indicating the adequacy of various LLM-as-a-judge systems in the side-by-side evaluation of model generations from Qwen2.5-32B-Instruct and GPT-4o. In each comparison, we ask the judge model to determine whether:
- the candidate model's response is better than the baseline's,
- vice versa,
- both responses are good, or
- both responses are bad.
Instead of randomizing the order of model responses, we conduct two runs through the dataset. In the first run, the candidate model's response is presented first, followed by the baseline model's response; in the second run, the order is reversed. The scores are averaged after both runs are completed.
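The two-run procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the `judge` callable is hypothetical and stands in for a real LLM-as-a-judge call that returns one of the verdicts `"A"`, `"B"`, `"C"`, or `"D"`.

```python
# Sketch of the two-run evaluation: present the responses in both orders,
# then map the swapped run back to the candidate-first convention.
def evaluate_pair(judge, prompt, candidate_answer, baseline_answer):
    # Run 1: candidate response first, baseline second.
    verdict_no_swap = judge(prompt, candidate_answer, baseline_answer)
    # Run 2: order reversed. In this run, "A" means the baseline won,
    # so A/B must be flipped before the two runs are combined.
    verdict_swap = judge(prompt, baseline_answer, candidate_answer)
    flip = {"A": "B", "B": "A", "C": "C", "D": "D"}
    return verdict_no_swap, flip[verdict_swap]
```

Scores for the leaderboard are then averaged over the verdicts of both runs.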
After this, we calculate metrics showcasing two aspects:
Metrics of Correlation: APCC and MPCC show how well LLM-as-a-judge assessments correlate with expert evaluations.
- Aggregated Pearson Correlation Coefficient (APCC): We count the number of verdicts in each class (A/B/C/D) and calculate the correlation between LLM-as-judge and expert assessments based on these four values. This metric sacrifices detailed verdict information but can estimate how closely the model aligns with experts in delivering a final verdict for the entire benchmark.
- Median Pearson Correlation Coefficient (MPCC): We apply a sliding window with a size of 10 and a stride of 5 across all benchmark verdicts. For each batch, we calculate the median using the formula:
$$ \text{Median} = \frac{\sum \textbf{A} + \sum \textbf{C}}{\sum \textbf{A} + \sum \textbf{B} + 2 \cdot \sum \textbf{C}} $$ This yields one set of medians for expert verdicts and one for model verdicts, and we calculate the PCC between them. This method retains most verdict information but imposes a linear relationship between verdict classes, which may not be entirely accurate.
Metrics of Positional Bias: We introduce the metric PCon@AB, which indicates the presence of positional bias in evaluator models. $$ \textbf{PCon@AB} = \frac{I\left(J_{\text{swap}=0} = J_{\text{swap}=1} \mid J = \textbf{A} \vee \textbf{B}\right)}{I\left((J_{\text{swap}=0} = \textbf{A} \vee \textbf{B}) \vee (J_{\text{swap}=1} = \textbf{A} \vee \textbf{B})\right)} $$ This metric shows the consistency of the model's answers with and without the swap: the proportion of matching A/B verdicts across the two response orders.
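As an illustration of the definition (not the repository's code), PCon@AB can be computed from the two verdict sequences like this; the sketch assumes the A/B verdicts of the swapped run have already been flipped back to the candidate-first convention, so a position-robust judge produces identical sequences.

```python
def pcon_ab(verdicts_no_swap, verdicts_swap):
    # Numerator: pairs where both runs give the same verdict, given it is A or B.
    # Denominator: pairs where at least one of the two runs answered A or B.
    ab = {"A", "B"}
    agree = sum(
        1 for v0, v1 in zip(verdicts_no_swap, verdicts_swap)
        if v0 == v1 and v0 in ab
    )
    any_ab = sum(
        1 for v0, v1 in zip(verdicts_no_swap, verdicts_swap)
        if v0 in ab or v1 in ab
    )
    return agree / any_ab
```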
The metric MPCC-Consistency is calculated as the Pearson correlation coefficient between the two sets of medians obtained for verdicts with and without the swap, while the metric MPCC-Δ is the difference between the MPCC values calculated separately for verdicts obtained with and without the swap.
PCon@AB, MPCC-Consistency, and MPCC-Δ do not rely on manual annotation, allowing us to determine a model's susceptibility to positional bias without expert involvement.
Currently, our leaderboard doesn't support automatic running of models from the HF Hub through our benchmark; we're working on it! However, you can send a request with the model name, revision, and precision, and we'll run your LLM-as-a-judge and update the leaderboard!
Additionally, you can use our methodology to evaluate models on another open benchmark using the code available in the repository.