
[Feature] Medusa weights conversion #1180

Closed

zhyncs opened this issue Feb 22, 2024 · 2 comments

zhyncs commented Feb 22, 2024

Motivation

In order to support FasterDecoding/Medusa for LMDeploy, we may need

1. Medusa weights conversion
2. Medusa weights loading
3. Porting the FasterDecoding/Medusa heads code with LMDeploy components and utilities
4. Porting generate_candidates and evaluate_posterior
5. Integrating with LlamaBatch

Before the Chinese New Year, @lzhangzz and I briefly discussed the definitions of weights conversion, loading, head porting, and integration with LlamaBatch.

Using FasterDecoding/medusa-vicuna-13b-v1.3 as an example, here are the details of the Medusa weights conversion:

The keys for the FasterDecoding/medusa-vicuna-13b-v1.3 weights are as follows:

['0.0.linear.weight', '0.0.linear.bias', '0.1.weight', '1.0.linear.weight', '1.0.linear.bias', '1.1.weight', '2.0.linear.weight', '2.0.linear.bias', '2.1.weight', '3.0.linear.weight', '3.0.linear.bias', '3.1.weight', '4.0.linear.weight', '4.0.linear.bias', '4.1.weight']

In brief, the keys follow this pattern:

{medusa_head}.{medusa_layer}.linear.weight
{medusa_head}.{medusa_layer}.linear.bias
{medusa_head}.{medusa_num_layers}.weight

In this example, medusa_num_heads is 5 and medusa_num_layers is 1.
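
To make the layout concrete, here is a minimal sketch (not the LMDeploy converter itself) that infers medusa_num_heads and medusa_num_layers from the checkpoint keys. The checkpoint filename medusa_lm_head.pt is an assumption about the downloaded heads.

import torch

# Assumed filename for the downloaded Medusa heads checkpoint.
state_dict = torch.load("medusa_lm_head.pt", map_location="cpu")

heads, layers = set(), set()
for key in state_dict:
    fields = key.split(".")         # "0.0.linear.weight" -> ["0", "0", "linear", "weight"]
    heads.add(int(fields[0]))       # {medusa_head}
    if fields[2] == "linear":       # skip the final "{medusa_head}.{medusa_num_layers}.weight" projection
        layers.add(int(fields[1]))  # {medusa_layer}

medusa_num_heads = len(heads)       # 5 for medusa-vicuna-13b-v1.3
medusa_num_layers = len(layers)     # 1 for medusa-vicuna-13b-v1.3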
To distinguish them from the base model weights and to support tensor parallelism, the naming convention is modified as follows when saving:

medusa.{medusa_head}.{medusa_layer}.linear.{rank}.weight
medusa.{medusa_head}.{medusa_layer}.linear.{rank}.bias
medusa.{medusa_head}.{medusa_num_layers}.{rank}.weight

They are also saved in the workspace/triton_models/weights directory.
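
As a rough illustration of that renaming (the actual converter lives in the llama_medusa.py / fp_medusa.py files listed below), here is a minimal sketch that shards each tensor across tp ranks and writes one file per shard into the weights directory. The split axis (dim 0, the output dimension), the raw-binary on-disk format, and the convert_medusa_weights helper name are all assumptions for illustration, not LMDeploy APIs.

import os
import torch

def convert_medusa_weights(state_dict, out_dir, tp=2):
    """Hypothetical helper: rename Medusa head weights and shard them across tp ranks."""
    os.makedirs(out_dir, exist_ok=True)
    for key, tensor in state_dict.items():
        tensor = tensor.half()  # fp16 for the POC
        fields = key.split(".")
        for rank, shard in enumerate(tensor.chunk(tp, dim=0)):
            # "0.0.linear.weight" -> "medusa.0.0.linear.{rank}.weight"
            # "0.1.weight"        -> "medusa.0.1.{rank}.weight"
            new_key = ".".join(["medusa", *fields[:-1], str(rank), fields[-1]])
            shard.contiguous().cpu().numpy().tofile(os.path.join(out_dir, new_key))

# e.g. convert_medusa_weights(state_dict, "workspace/triton_models/weights", tp=2)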

The overall code implementation is located at

lmdeploy/lmdeploy/turbomind/deploy/source_model/llama_medusa.py
lmdeploy/lmdeploy/turbomind/deploy/target_model/fp_medusa.py

In the current version, to complete the proof of concept (POC), we will initially implement fp16 for LlamaForCausalLM. Subsequently, we will expand to other types such as fp32, bf16, int8, and so on.

@irexyc @grimoire @lzhangzz @lvhan028 Do you have any suggestions? Thanks.

In addition to weight conversion, we will separately raise issues to detail the subsequent steps. Stay tuned.


zhyncs commented Feb 29, 2024

For more detailed specific progress, please refer to #1213.

zhyncs commented Mar 1, 2024

Refer to #1231; just closing this.

zhyncs closed this as completed Mar 1, 2024