Hello, I wanted to ask if there is any known regression in the correlation between Chatbot Arena Elo scores and AlpacaEval length-controlled (LC) win rates (previously reported as 0.98). I am using the AlpacaEval benchmark for a project, and any comments on this would be helpful.

Here are some troubling instances:

Chatbot Arena: https://arena.lmsys.org/
AlpacaEval leaderboard: https://tatsu-lab.github.io/alpaca_eval/

Scores (as of 24 June, 16:00 UTC)

The two columns tell different stories about the models' abilities and the performance gaps between them. Can we still use AlpacaEval as a proxy for Chatbot Arena Elo?

I understand that differences can arise because Chatbot Arena uses whatever sampling configuration each model is served with at inference time, whereas these evaluations are run with fixed hyperparameter settings, but is there any other reason?
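For concreteness, here is a minimal sketch of how the rank correlation can be recomputed; the model names and scores below are placeholders, not the real leaderboard numbers:

```python
# Minimal sketch: Spearman rank correlation between the two leaderboards.
# All models/scores are placeholders -- substitute the current numbers
# from both sites for the same set of models.
from scipy.stats import spearmanr

models = ["model_a", "model_b", "model_c", "model_d"]  # hypothetical names
arena_elo = [1250, 1210, 1180, 1150]           # Chatbot Arena Elo (placeholder)
alpaca_lc = [55.0, 50.2, 46.1, 40.3]           # AlpacaEval LC win rate (placeholder)

rho, pval = spearmanr(arena_elo, alpaca_lc)
print(f"Spearman rho = {rho:.3f} (p = {pval:.3g})")
```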
Thanks in advance!
Hi @Varun221! The only model that seems to differ is LLaMA-3-70B-Instruct vs. Gemini-Pro (Bard), and that is because the Gemini entry in AlpacaEval is Gemini 1.0, whereas in Chatbot Arena I believe it's 1.5 (?).

Note that the absolute numbers are not directly comparable, since they are different metrics (win rate vs. Elo). You can also derive Elo scores from win rates, but I would suggest using AlpacaEval only to compare the ranking of models (which is what the 0.98 Spearman correlation measures), rather than the absolute scores.
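For concreteness, here is a rough sketch of that win-rate-to-Elo conversion, under the standard Elo (logistic / Bradley–Terry) assumption. Note that AlpacaEval win rates are measured against a fixed reference model, so the implied gap is relative to that reference:

```python
# Rough sketch: Elo gap implied by a pairwise win rate, assuming the
# standard Elo expected-score model E = 1 / (1 + 10 ** (-delta / 400)).
import math

def winrate_to_elo_gap(win_rate: float) -> float:
    """Elo point difference implied by a win rate in (0, 1)."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))

# Example: a 55% win rate vs. the reference corresponds to ~ +35 Elo points.
print(winrate_to_elo_gap(0.55))
```

This is also why the absolute columns look so different: the two scales are related by a nonlinear transform and a different reference point, so only the ordering is expected to match.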