
MMLU benchmark cannot be reproduced with the current code #217

Open
2 tasks done
chunniunai220ml opened this issue Jun 20, 2024 · 12 comments
@chunniunai220ml

chunniunai220ml commented Jun 20, 2024

System Info

vLLM Version: 0.5.0.post1
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu11==2.19.3
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pytorch-triton-rocm==2.2.0
[pip3] sentence-transformers==3.0.1
[pip3] torch==2.3.0
[pip3] torchaudio==2.2.1+cu118
[pip3] torchvision==0.18.0
[pip3] transformers==4.40.0
[pip3] triton==2.3.0
[conda] intel-extension-for-pytorch 2.2.0 pypi_0 pypi
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-nccl-cu11 2.19.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] pytorch-triton-rocm 2.2.0 pypi_0 pypi
[conda] sentence-transformers 3.0.1 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] torchaudio 2.2.1+cu118 pypi_0 pypi
[conda] torchvision 0.18.0 pypi_0 pypi
[conda] transformers 4.40.0 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi

python==3.10
Tesla-V100

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts and tasks

Reproduction

[screenshot]

Following the latest code, together with OpenAI's https://github.com/openai/simple-evals and https://github.com/THUDM/GLM-4/blob/main/basic_demo/README.md,
I only get MMLU = 45.4, far below the reported GLM-4-9B-Chat score of 72.4. Due to hardware limits this was run on a single GPU with model_dtype=fp16.

Tested on A100 with 4 GPUs: MMLU = 45.7, model_dtype=bf16.

Expected behavior

Please share the details needed to reproduce the reported accuracy, or provide reproduction code directly.

@zRzRzRzRzRzRzR
Collaborator

Our tests were run with a BF16 deployment and evaluated with simple-evals, without adding any extra system prompt.

Could you try loading the complete model in BF16 on a single GPU and setting a longer sequence length? This output looks like it was cut off before the answer finished.
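For illustration, a minimal sketch of what a single-GPU BF16 setup with a longer context could look like using vLLM's offline API (the model id, the 8192 length, and the sampling values here are assumptions, not our exact configuration):

from vllm import LLM, SamplingParams

llm = LLM(
    model="THUDM/glm-4-9b-chat",   # assumption: HF repo id for GLM-4-9B-Chat
    dtype="bfloat16",
    max_model_len=8192,            # longer sequence length, per the suggestion above
    tensor_parallel_size=1,        # single GPU
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.6, top_p=0.8, top_k=1, max_tokens=2500)
# prompts passed to generate() should already contain the GLM-4 chat template
outputs = llm.generate(["<templated prompt>"], params)
print(outputs[0].outputs[0].text)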

@chunniunai220ml
Author

chunniunai220ml commented Jun 20, 2024

A100, BF16, seqlen=8192, mmlu=68.24
The text produced by process_message on the server side is formatted differently from HF. Does that matter, and which one should be treated as the reference?

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)
gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
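For context, a hedged sketch of how that snippet would sit in a full transformers-based call (the model path, device handling, and example query are assumptions on my side):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/glm-4-9b-chat"  # assumption: HF repo id
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
).eval()

query = "Which of the following is the capital of France? A. Paris B. Lyon C. Nice D. Lille"  # placeholder question
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))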

@zRzRzRzRzRzRzR
Collaborator

Use HF as the reference. The server does not tokenize because the text has to be passed to vLLM; for the HF-style call, see the trans_cli_demo file.

@chunniunai220ml
Author

params_dict = {
    "n": 1,
    "best_of": 1,
    "presence_penalty": 1.0,
    "frequency_penalty": 0.0,
    "temperature": temperature,
    "top_p": top_p,
    "top_k": -1,
    "repetition_penalty": repetition_penalty,
    "use_beam_search": False,
    "length_penalty": 1,
    "early_stopping": False,
    "stop_token_ids": [151329, 151336, 151338],
    "ignore_eos": False,
    "max_tokens": max_new_tokens,
    "logprobs": None,
    "prompt_logprobs": None,
    "skip_special_tokens": True,
}

Can the score be reproduced with all of these server parameters left unchanged? For the client requests I followed request.py; is there anything to watch out for in the client parameters it sends?

@zRzRzRzRzRzRzR
Collaborator

repetition_penalty is 1.
The others should not need changes. Note that our server is vLLM-based, but the benchmark scores we report were measured with transformers.
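As a rough illustration, a dict like the one above maps directly onto vLLM's SamplingParams once repetition_penalty is set to 1 (the temperature, top_p, and top_k values below are placeholders, not confirmed settings):

from vllm import SamplingParams

params_dict = {
    "n": 1,
    "best_of": 1,
    "presence_penalty": 1.0,
    "frequency_penalty": 0.0,
    "temperature": 0.6,         # assumption: placeholder value
    "top_p": 0.8,               # assumption: placeholder value
    "top_k": 1,                 # assumption: placeholder value
    "repetition_penalty": 1.0,  # per the reply above
    "stop_token_ids": [151329, 151336, 151338],
    "ignore_eos": False,
    "max_tokens": 2500,
    "skip_special_tokens": True,
}
sampling_params = SamplingParams(**params_dict)  # every key above is a valid SamplingParams field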

@chunniunai220ml
Author

chunniunai220ml commented Jun 20, 2024

So you mean that your reported benchmark is transformers-based and I should follow the trans_cli_demo file. Do you have vLLM test results as well? I would like to see whether I can match them. With repetition_penalty=1 and seqlen=8192 I got MMLU = 0.7124, which is fairly close to your reported benchmark.
When printing process_message I noticed that tools adds extra information; does this affect the benchmark? I copied request.py, where tools is:
self.tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "format": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit to use. Infer this from the users location.",
                    },
                },
                "required": ["location", "format"],
            },
        },
    },
]

还有"top_p": top_p,,设置多少合适? server.py里面"top_k": -1,, hf给的gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}

@zRzRzRzRzRzRzR
Collaborator

top_k is 1 and top_p is 0.8. Do not put anything into tools; it affects how the prompt is built.
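A minimal sketch of a client request following this advice, with no tools field at all (the endpoint URL, served model name, and whether the server accepts top_k in the request body are assumptions):

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumption: local demo server address
    json={
        "model": "glm-4",                          # assumption: served model name
        "messages": [{"role": "user", "content": "<MMLU question>"}],
        "temperature": 0.6,
        "top_p": 0.8,
        "top_k": 1,         # assumption: passed through to SamplingParams by the server
        "max_tokens": 2500,
        # no "tools" key at all, so prompt construction never involves function calls
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])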

@chunniunai220ml
Author

{
    'n': 1,
    'best_of': 1,
    'presence_penalty': 1.0,
    'frequency_penalty': 0.0,
    'temperature': 0.6,
    'top_p': 0.8,
    'top_k': 1,
    'repetition_penalty': 1.0,
    'use_beam_search': False,
    'length_penalty': 1,
    'early_stopping': False,
    'stop_token_ids': [151329, 151336, 151338],
    'ignore_eos': False,
    'max_tokens': 2500,
    'logprobs': None,
    'prompt_logprobs': None,
    'skip_special_tokens': True,
}

With that information, this configuration gives MMLU = 0.7232, close to the reported 72.4. Is a difference of this size normal, or is there anything else worth adjusting? tools is now [].

@zRzRzRzRzRzRzR
Collaborator

That accuracy is acceptable. As long as tools carries no content, our prompt construction never involves function calls, so it will not affect the benchmark score.

@chunniunai220ml
Author

chunniunai220ml commented Jun 24, 2024

One more question: OpenAI's simple-evals only takes 2500 samples from the mmlu.csv it provides. Was your reported evaluation done the same way? My test results are now inconsistent between runs.
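For reference, a rough sketch of how simple-evals appears to draw that 2500-question subset (approximating mmlu_eval.py; the csv path and column layout are assumptions):

import random
import pandas

# assumption: local copy of the mmlu.csv that simple-evals uses
df = pandas.read_csv("mmlu.csv")
examples = [row.to_dict() for _, row in df.iterrows()]
# a fixed seed keeps the 2500-question subset identical across runs
examples = random.Random(0).sample(examples, 2500)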

@zRzRzRzRzRzRzR
Collaborator

zRzRzRzRzRzRzR commented Jun 26, 2024

Do you mean this? https://github.com/openai/simple-evals/blob/294cb1fb18f7aed4e21dc567350b0761a9e6f699/mmlu_eval.py

simple-evals comes with its own prompt; we did not add any other prompt text. How large is the run-to-run variance you are seeing?

@chunniunai220ml
Author

https://github.com/openai/simple-evals/blob/294cb1fb18f7aed4e21dc567350b0761a9e6f699/mmlu_eval.py

Yes. Do you mean this line?

prompt_messages = [sampler._pack_message(content=format_multichoice_question(row), role="user")]

I have checked it; the input is fine and no extra prompt is added:

def _pack_message(self, role: str, content: Any):
    return {"role": str(role), "content": content}
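Putting it together, a hedged sketch of the per-question flow in that mmlu_eval.py (the answer-extraction regex and the "Answer" column name are approximations; format_multichoice_question and the sampler come from simple-evals itself):

import re

ANSWER_PATTERN_MULTICHOICE = r"(?i)Answer\s*:\s*([A-D])"  # assumption: simplified pattern

def score_row(sampler, row):
    # build the single user message, exactly as quoted above
    prompt_messages = [
        sampler._pack_message(content=format_multichoice_question(row), role="user")
    ]
    response_text = sampler(prompt_messages)  # one chat completion per question
    match = re.search(ANSWER_PATTERN_MULTICHOICE, response_text)
    extracted = match.group(1).upper() if match else None
    return 1.0 if extracted == row["Answer"] else 0.0  # "Answer" column of mmlu.csv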
