mmlu benchmark cannot be reproduced with the current code #217
Comments
Our test was deployed in bf16 and run with simple-evals, with no extra system prompt added; the complete model was loaded in BF16 on a single GPU. Could you try setting sequence-length longer? It looks like the answers were cut off before finishing.
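The truncation concern above matters because simple-evals scores by regex-matching an "Answer: X" line in the completion; if generation stops before the model emits it, the question scores zero. Below is a simplified sketch of that extraction (the exact pattern used in simple-evals may differ; this is an illustration, not the upstream code):

```python
import re

# Simplified answer extraction in the style of simple-evals' MMLU grader:
# look for a line like "Answer: C" and compare the letter to the gold label.
# If the completion was truncated before "Answer:", the score is 0, which
# can drag the aggregate mmlu number far below the reported one.
ANSWER_RE = re.compile(r"Answer\s*:\s*([A-D])", re.IGNORECASE)

def score(completion: str, gold: str) -> float:
    m = ANSWER_RE.search(completion)
    return 1.0 if m and m.group(1).upper() == gold.upper() else 0.0

print(score("...reasoning...\nAnswer: C", "C"))   # 1.0
print(score("...reasoning cut off mid-sent", "C"))  # 0.0
```

A batch of truncated completions therefore looks like a capability gap when it is really a `max_tokens`/sequence-length setting.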
A100, BF16, seqlen=8192, mmlu=68.24. `inputs = tokenizer.apply_chat_template([{"role": "user", "content": query}],`
Take the HF results as the reference. The server does not tokenize because the input is passed straight to vLLM; use the HF calling style, see the trans_cli_demo file.
`params_dict = {`
repetition_penalty is 1.
So you mean your reported benchmark was run on transformers and I should refer to the trans_cli_demo file. Do you have test results for vLLM? I would like to see whether I can match them. With repetition_penalty=1 and seqlen=8192 I got mmlu=0.7124, fairly close to your reported benchmark. Also, what is a suitable value for `"top_p": top_p,`? server.py has `"top_k": -1,`, while HF gives `gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}`.
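A small sketch of the HF-side settings quoted here (only the kwargs actually mentioned in this thread; the real demo script may set more), showing why `top_k=1` makes `do_sample=True` effectively greedy, and the single-user-turn message format passed to `apply_chat_template`:

```python
# HF-style generation kwargs quoted above. top_k=1 keeps only the single
# most likely token before sampling, so generation is effectively greedy
# decoding even though do_sample=True.
hf_gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}

# One user turn per MMLU question, no system prompt (per the comments above);
# this list is what gets passed to tokenizer.apply_chat_template(...).
def build_messages(query: str) -> list:
    return [{"role": "user", "content": query}]

print(build_messages("Question text here"))
```

This is why `top_k=-1` in server.py (no top-k filtering) and `top_k=1` in the HF kwargs can produce visibly different scores.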
top_k is 1 and top_p is 0.8. Do not put any content into tools; it affects how the prompt is constructed.
```python
{'n': 1, 'best_of': 1, 'presence_penalty': 1.0, 'frequency_penalty': 0.0,
 'temperature': 0.6, 'top_p': 0.8, 'top_k': 1, 'repetition_penalty': 1.0,
 'use_beam_search': False, 'length_penalty': 1, 'early_stopping': False,
 'stop_token_ids': [151329, 151336, 151338], 'ignore_eos': False,
 'max_tokens': 2500, 'logprobs': None, 'prompt_logprobs': None,
 'skip_special_tokens': True}
```

With this configuration, based on your information, I got mmlu=0.7232, close to the reported 72.4. Is this small accuracy difference normal? Is there anything else here I should adjust? I set tools=[].
This accuracy is acceptable. As long as tools carries no content, our prompt construction never involves function calls, so it will not affect the score.
One more question: openai's simple-evals only takes 2500 samples from the mmlu.csv it provides. Was your reported evaluation done the same way? My test results are now inconsistent from run to run.
Do you mean this: https://github.com/openai/simple-evals/blob/294cb1fb18f7aed4e21dc567350b0761a9e6f699/mmlu_eval.py ? simple-evals has its own prompt, and I did not add any other prompt text. How large is the scoring variance here?
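On the run-to-run inconsistency: subset selection itself should not be the cause if the harness draws its samples with a fixed-seed RNG (which I believe the linked mmlu_eval.py does; check the file for the exact call). A minimal sketch of that assumption:

```python
import random

# Sketch of fixed-seed subsampling (an assumption about how simple-evals
# picks its num_examples subset; verify against mmlu_eval.py). With a fixed
# seed the same rows are drawn every run, so score differences across runs
# would have to come from decoding, not from dataset selection.
def pick_subset(rows: list, k: int, seed: int = 0) -> list:
    return random.Random(seed).sample(rows, k)

rows = list(range(14042))  # the full MMLU test split has about 14k questions
a = pick_subset(rows, 2500)
b = pick_subset(rows, 2500)
print(a == b)  # True: identical subset on every run
```

If the subset is fixed, the remaining variance points at sampling during generation, which is why pinning `top_k=1` (greedy) matters for reproducibility.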
Yes. Are you referring to
System Info / 系統信息
vLLM Version: 0.5.0.post1
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu11==2.19.3
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pytorch-triton-rocm==2.2.0
[pip3] sentence-transformers==3.0.1
[pip3] torch==2.3.0
[pip3] torchaudio==2.2.1+cu118
[pip3] torchvision==0.18.0
[pip3] transformers==4.40.0
[pip3] triton==2.3.0
[conda] intel-extension-for-pytorch 2.2.0 pypi_0 pypi
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-nccl-cu11 2.19.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] pytorch-triton-rocm 2.2.0 pypi_0 pypi
[conda] sentence-transformers 3.0.1 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] torchaudio 2.2.1+cu118 pypi_0 pypi
[conda] torchvision 0.18.0 pypi_0 pypi
[conda] transformers 4.40.0 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
python==3.10
Tesla-V100
Who can help? / 谁可以帮助到您?
No response
Information / 问题信息
Reproduction / 复现过程
Following the latest code, openai's https://github.com/openai/simple-evals, and https://github.com/THUDM/GLM-4/blob/main/basic_demo/README.md:
mmlu is only 45.4, versus the reported GLM-4-9B-Chat=72.4, a very large gap. Due to hardware limits, this was on a single GPU with model_dtype=fp16.
Tested on an A100 with 4 GPUs: mmlu=45.7, model_dtype=bf16.
Expected behavior / 期待表现
Please share the details needed to reproduce the reported accuracy, or provide the reproduction code directly.