
Add interfaces to the pipeline to obtain logits and ppl #1652

Merged (25 commits into InternLM:main, Jun 25, 2024)

Conversation

@irexyc (Collaborator) commented May 24, 2024

Motivation

Add interfaces to the pipeline to obtain logits and ppl

Use cases (Optional)

from lmdeploy import pipeline
import numpy as np
import torch
from lmdeploy.vl import load_image

pipe = pipeline('/nvme/shared/llava-v1.5-7b/')
im = load_image('tiger.jpeg')

# prepare_inputs applies the chat template and vision preprocessing, and
# returns input_ids plus the image embeddings and their token ranges
out = pipe.prepare_inputs([('hello', im)])
logits = pipe.get_logits(out['input_ids'], out['input_embeddings'],
                         out['input_embedding_ranges'])
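
For ppl on a plain text prompt, a minimal counterpart sketch (assuming the model directory ships a Hugging Face tokenizer; get_ppl takes plain token ids):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('/nvme/shared/llava-v1.5-7b/')
input_ids = tok.encode('hello')
ppl = pipe.get_ppl(input_ids)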

@irexyc irexyc added the WIP label May 24, 2024
@lvhan028 (Collaborator)

Could you update both the LLM pipeline and VLM pipeline user guides by adding examples of calculating logits and ppl, respectively?
Regarding the example in the description, I suggest using internlm2-7b and xcomposer2-7b as the candidate models.
As for the tokenizer, how about using AutoTokenizer instead?
Users already know AutoTokenizer well, so there is no need to introduce our own tokenizer.
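
A sketch of what the LLM guide example might look like (the model id and prompt are placeholders; get_logits and get_ppl take plain token ids, as in the later examples in this PR):

from transformers import AutoTokenizer
from lmdeploy import pipeline

model_path = 'internlm/internlm2-7b'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
pipe = pipeline(model_path)

input_ids = tokenizer.encode('Hello, how are you?')

# logits
logits = pipe.get_logits(input_ids)

# ppl
ppl = pipe.get_ppl(input_ids)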

@lvhan028 lvhan028 added the enhancement New feature or request label May 24, 2024
return event_loop


class LogitsMixin:
Collaborator

Do we have strong reasons to make a new class?
Is there any concern if we put the following APIs into AsyncEngine?

Collaborator Author

The motivation is to keep the AsyncEngine class clean. If it is not necessary, I can move these functions into AsyncEngine.

Collaborator

I understand your concern, but in my opinion the motivation is not strong.
The "Mixin" pattern is often used to achieve some of the benefits of multiple inheritance while avoiding the complexities and potential issues associated with it.
But in our case, there is no multiple inheritance.
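
A minimal illustration of the pattern under discussion (the bodies here are placeholders, not the PR's code):

class LogitsMixin:
    """Adds logits/ppl helpers to whichever engine class inherits it."""

    def get_logits(self, input_ids):
        # placeholder: a real implementation would call into the inference engine
        raise NotImplementedError


class AsyncEngine(LogitsMixin):
    # a single base class, i.e. no multiple inheritance is involved
    pass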

Collaborator

@lzhangzz @AllentDan any comments?

Collaborator

@irexyc If you insist on keeping LogitsMixin, I suggest naming the file logits_mixin.py instead of utils.py.

@grimoire (Collaborator) left a comment

LGTM

@irexyc (Collaborator, Author) commented Jun 6, 2024

There may be some bugs:

  1. 2024-06-06 06:30:36,737 - lmdeploy - ERROR - Engine loop failed with error: CUDA error: an illegal memory access was encountered
from lmdeploy.turbomind import TurboMind
from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig
pipe = pipeline('/nvme/shared/vicuna-7b-v1.5/', log_level='INFO', backend_config=PytorchEngineConfig(session_len=33000))

g = pipe.engine.create_instance()
g.decode([[100] * 10000], sequence_end=False)
g.decode([[100] * 10000], sequence_start=False)
  2. nan logits (both pytorch and turbomind backend)
from lmdeploy.turbomind import TurboMind
from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig
pipe = pipeline('/nvme/shared/vicuna-7b-v1.5/', log_level='INFO', backend_config=TurbomindEngineConfig(session_len=33000))
# pipe = pipeline('/nvme/shared/vicuna-7b-v1.5/', log_level='INFO', backend_config=PytorchEngineConfig(session_len=33000))
g = pipe.engine.create_instance()
g.decode([list(range(9700))])

tensor([[[-5.2604,  4.2658,  6.0810,  ..., -0.2098, -1.4354, -0.3216],
         [-7.8710,  2.2585,  3.1645,  ..., -2.7675, -5.0839, -3.0246],
         [-2.0298,  9.4127,  2.9417,  ..., -0.4258, -1.0950, -0.9109],
         ...,
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan]]],
       device='cuda:0')

@grimoire (Collaborator) commented Jun 6, 2024

The first error is caused by

block_offsets = self.block_offsets[:, :block_end]

[:, :block_end] should be removed.

@RunningLeon (Collaborator)

@irexyc Could you fix the conflict with the main branch?

Conflicts:
	lmdeploy/pytorch/engine/engine_instance.py
(float*)allocator_->malloc(sizeof(float) * model_->vocab_size_padded_ * max_context_token_num_);
const auto tp = model_->tensor_para_.world_size_;
context_logits_buf_ = (float*)allocator_->malloc(sizeof(float) * model_->vocab_size_padded_ * num_token);
const auto tp = model_->tensor_para_.world_size_;
Collaborator

In the current implementation, these buffers are only allocated once; num_token is just for this single iteration.

@lvhan028 (Collaborator)

(quoting the bug report above: the CUDA illegal-memory-access error and the nan logits with both backends)

Does this issue still exist?

"""Helper class to calculate logits and ppl."""

def prepare_inputs(self, prompts: Union[PromptType, List[PromptType]]):
if hasattr(self, '_convert_prompts'):
@lvhan028 (Collaborator) Jun 24, 2024

So, we always apply the chat template for VLM models, but we don't do it for LLMs, right?
If that is the case, I suggest we use AutoTokenizer.apply_chat_template as the example in pipeline.md, so that we don't have to explain to users whether we apply the chat template.

for prompt in prompts:
out = _get_event_loop().run_until_complete(
self._get_prompt_input(prompt,
do_preprocess=True,
Collaborator

We'd better not hardcode do_preprocess for LLMs

@@ -289,7 +289,7 @@ def split(self, split_size: int, block_size: int):
if overlap:
block_end += 1

block_offsets = self.block_offsets[:, :block_end]
block_offsets = self.block_offsets
Collaborator

@RunningLeon Is this OK?

Collaborator

Right. There is a bug here, as mentioned in #1652 (comment).

logits = pipe.get_logits(input_ids)

# ppl
ppl = pipe.get_ppl(input_ids)
Collaborator

Just an interesting result: ppl differs by around 4% between the pytorch and turbomind backends for this example.

        Turbomind    PyTorch
ppl     5.5916224    5.3524413
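
For reference, a common way to derive a perplexity number from logits with plain PyTorch, independent of either backend's kernels (a sketch; the shift and shape conventions are assumptions, not necessarily what get_ppl does internally):

import torch
import torch.nn.functional as F

def ppl_from_logits(logits: torch.Tensor, input_ids: torch.Tensor) -> float:
    # logits: [seq_len, vocab_size], input_ids: [seq_len]
    # position i predicts token i + 1, so shift by one
    shift_logits = logits[:-1].float()
    shift_labels = input_ids[1:]
    loss = F.cross_entropy(shift_logits, shift_labels)  # mean NLL in nats
    return torch.exp(loss).item()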

Collaborator

It's possible. They use different CUDA kernels.

Collaborator

import torch
import fire

def main(model_path, backend='turbomind'):
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    messages = [
        {"role": "user", "content": "Hello, how are you?"},
    ]
    inputs = tokenizer.apply_chat_template(messages, return_tensors='pt', return_dict=True)
    input_ids = inputs["input_ids"][0].tolist()
    if backend == 'turbomind':
        from lmdeploy import pipeline, TurbomindEngineConfig
        pipe = pipeline(model_path, backend_config=TurbomindEngineConfig(session_len=33000))
        ppl = pipe.get_ppl(input_ids)
        print(ppl)
    elif backend == 'pytorch':
        from lmdeploy import pipeline, PytorchEngineConfig
        pipe = pipeline(model_path, backend_config=PytorchEngineConfig(session_len=33000))
        ppl = pipe.get_ppl(input_ids)
        print(ppl)
    elif backend == 'transformers':
        # from transformers.models.llama import LlamaForCausalLM
        # model = LlamaForCausalLM.from_pretrained(
        from transformers import AutoModelForCausalLM
        model = AutoModelForCausalLM.from_pretrained(
            model_path, 
            attn_implementation='flash_attention_2', 
            torch_dtype=torch.float16,
            trust_remote_code=True)
        model.to("cuda")
        inputs.to(model.device)
        with torch.no_grad():
            # with labels provided, outputs.loss is the mean next-token
            # cross-entropy over the prompt
            outputs = model(
                **inputs,
                use_cache=False,
                labels=inputs["input_ids"],
            )
            logits = outputs.logits.squeeze(0)
            print(outputs.loss)


if __name__ == "__main__":
    fire.Fire(main)

turbomind: 5.5916224
pytorch: 5.365595
transformers: 5.57444953918457

@lvhan028 lvhan028 merged commit c59a704 into InternLM:main Jun 25, 2024
9 checks passed