
Discrepancy between alpaca leaderboard and Chatbot arena ELO #352

Closed
Varun221 opened this issue Jun 24, 2024 · 1 comment

Varun221 commented Jun 24, 2024

Hello, I wanted to ask whether there is any known regression in the correlation between Chatbot Arena ELO scores and AlpacaEval length-controlled win rates (previously reported at 0.98). I am using AlpacaEval for a project, and any comments on this would be helpful.

Here are some troubling instances:
Chatbot Arena: https://arena.lmsys.org/
Leaderboard: https://tatsu-lab.github.io/alpaca_eval/
Scores (As of 24 June, 16:00 UTC)

| Model | AlpacaEval LC win rate | Chatbot Arena ELO |
| --- | --- | --- |
| GPT-4o | 57.5% | 1287 |
| Yi-Large | 51.9% | 1240 |
| LLaMA-3-70B-Instruct | 34.4% | 1207 |
| Gemini-Pro (Bard) | 24.4% | 1208 |
| Qwen-1.5-14B-chat | 23.9% | 1109 |
| LLaMA-2-70B-chat | 14.7% | 1093 |

The two columns tell different stories about the models' abilities and the performance gaps between them. Can we still use AlpacaEval as a proxy for Chatbot Arena ELO?
I understand that differences can arise because Chatbot Arena runs inference with whatever sampling configuration each model uses, whereas these evals are performed with fixed hyperparameter settings, but is there any other reason? A quick rank-correlation check on just these six rows is sketched below.
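
For reference, here is a minimal sketch of that check: Spearman rank correlation on only the six models listed in the table (the numbers are the snapshot from 24 June, 16:00 UTC, and SciPy is assumed to be available):

```python
# Sketch: Spearman rank correlation on the six models from the table above.
# Values are the 24 June 16:00 UTC snapshot; scipy is assumed available.
from scipy.stats import spearmanr

alpaca_lc_win_rate = [57.5, 51.9, 34.4, 24.4, 23.9, 14.7]   # %
chatbot_arena_elo = [1287, 1240, 1207, 1208, 1109, 1093]

rho, p_value = spearmanr(alpaca_lc_win_rate, chatbot_arena_elo)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
# Only the LLaMA-3-70B-Instruct / Gemini-Pro pair swaps order, so rho is
# about 0.94 on this small subset.
```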

Thanks in advance!

@YannDubs (Collaborator)

Hi @Varun221! The only pair that looks out of order is LLaMA-3-70B-Instruct vs Gemini-Pro (Bard), and that is because the Gemini entry in AlpacaEval is Gemini 1.0, whereas on Chatbot Arena I believe it's 1.5 (?).

Note that the absolute differences are not directly comparable as those are different metrics (win rate vs ELO). You can also get ELO scores from win rates but I would suggest only using AlpacaEval for comparing the ranking of models (which is what the 0.98 Spearman correlation measures) rather than the absolute scores.
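
As a rough illustration of the win-rate-to-ELO point, a head-to-head win rate p implies a rating gap of 400 * log10(p / (1 - p)) under the standard logistic Elo relationship. This is only a generic sketch, not the exact conversion AlpacaEval or Chatbot Arena uses, and the gap is relative to the AlpacaEval reference model rather than the Arena anchor, so the absolute values will not line up with Arena ELO:

```python
import math

def implied_elo_gap(win_rate: float) -> float:
    """Elo-style rating gap implied by a head-to-head win rate (0 < win_rate < 1),
    using the standard logistic relationship: gap = 400 * log10(p / (1 - p))."""
    return 400 * math.log10(win_rate / (1 - win_rate))

# E.g. a 57.5% LC win rate against the reference model corresponds to a gap of
# roughly +52 Elo points vs that reference, while 14.7% corresponds to roughly
# -305 points; only the relative ordering is comparable to Arena rankings.
print(implied_elo_gap(0.575))   # ~ +52.5
print(implied_elo_gap(0.147))   # ~ -305.4
```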
