Hello, I wanted to ask if there is any known regression in the correlation between Chatbot Arena Elo scores and AlpacaEval length-controlled (LC) win rates (previously reported as 0.98). I am using the AlpacaEval benchmark for a project, and any comments on this would be helpful.

Here are some troubling instances:

Chatbot Arena: https://arena.lmsys.org/
AlpacaEval leaderboard: https://tatsu-lab.github.io/alpaca_eval/

Scores (as of 24 June, 16:00 UTC)

The two columns tell different stories about the models' abilities and the performance gaps between them. Can we still use AlpacaEval as a proxy for Chatbot Arena Elo?

I understand that differences can arise because Chatbot Arena uses whatever sampling configuration each model is served with at inference time, whereas these evaluations are run with fixed hyperparameter settings, but is there any other reason?
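For concreteness, here is a minimal sketch of how the rank correlation can be recomputed; the model names and scores below are placeholders, not the real leaderboard numbers:

```python
# Minimal sketch: Spearman rank correlation between the two leaderboards.
# All models/scores are placeholders -- substitute the current numbers
# from both sites for the same set of models.
from scipy.stats import spearmanr

models = ["model_a", "model_b", "model_c", "model_d"]  # hypothetical names
arena_elo = [1250, 1210, 1180, 1150]           # Chatbot Arena Elo (placeholder)
alpaca_lc = [55.0, 50.2, 46.1, 40.3]           # AlpacaEval LC win rate (placeholder)

rho, pval = spearmanr(arena_elo, alpaca_lc)
print(f"Spearman rho = {rho:.3f} (p = {pval:.3g})")
```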
Thanks in advance!
Hi @Varun221! The only model that seems to differ is LLaMA-3-70B-Instruct vs. Gemini-Pro (Bard), and that is because the Gemini entry in AlpacaEval is Gemini 1.0, whereas in Chatbot Arena I believe it's 1.5 (?).

Note that the absolute numbers are not directly comparable, since they are different metrics (win rate vs. Elo). You can also derive Elo scores from win rates, but I would suggest using AlpacaEval only to compare the ranking of models (which is what the 0.98 Spearman correlation measures), rather than the absolute scores.
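For concreteness, here is a rough sketch of that win-rate-to-Elo conversion, under the standard Elo (logistic / Bradley–Terry) assumption. Note that AlpacaEval win rates are measured against a fixed reference model, so the implied gap is relative to that reference:

```python
# Rough sketch: Elo gap implied by a pairwise win rate, assuming the
# standard Elo expected-score model E = 1 / (1 + 10 ** (-delta / 400)).
import math

def winrate_to_elo_gap(win_rate: float) -> float:
    """Elo point difference implied by a win rate in (0, 1)."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))

# Example: a 55% win rate vs. the reference corresponds to ~ +35 Elo points.
print(winrate_to_elo_gap(0.55))
```

This is also why the absolute columns look so different: the two scales are related by a nonlinear transform and a different reference point, so only the ordering is expected to match.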