Motivation
In order to support FasterDecoding/Medusa for LMDeploy, we may need
1. Medusa weights conversion
2. Medusa weights loading
3. Porting FasterDecoding/Medusa Heads code with LMDeploy components and utilities
4. Porting `generate_candidates` and `evaluate_posterior`
5. Integrating with LlamaBatch
Before the Chinese New Year, @lzhangzz and I briefly discussed the definitions of weights conversion, loading, head porting, and integration with LlamaBatch.
Using FasterDecoding/medusa-vicuna-13b-v1.3 as an example, here are the details of the Medusa weights conversion:
The keys for the FasterDecoding/medusa-vicuna-13b-v1.3 weights are as follows:
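The exact listing is not reproduced here, but the keys can be inspected with a minimal sketch like the one below; the checkpoint filename `medusa_lm_head.pt` is an assumption based on the upstream FasterDecoding/Medusa release layout.

```python
# Sketch: list the Medusa head weight keys and shapes.
# Assumption: the heads are stored in medusa_lm_head.pt, as in the
# FasterDecoding/Medusa release; adjust the path to your local copy.
import torch

state_dict = torch.load("medusa-vicuna-13b-v1.3/medusa_lm_head.pt", map_location="cpu")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))

# With medusa_num_heads = 5 and medusa_num_layers = 1, the keys typically follow
# "<head>.<layer>.linear.weight" / "<head>.<layer>.linear.bias" for each ResBlock
# and "<head>.<medusa_num_layers>.weight" for each head's output projection.
```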
In brief, in this example `medusa_num_heads` is 5 and `medusa_num_layers` is 1. To distinguish these weights from those of the base model and to support tensor parallelism, the naming convention will be modified when saving, as follows:
They are also saved in the `workspace/triton_models/weights` directory. The overall code implementation is located at
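Putting the pieces above together, a hypothetical conversion sketch might look like the following; the `medusa.` prefix, the per-rank file suffix, and the raw fp16 file format are placeholder assumptions for illustration, not the convention referred to above.

```python
# Sketch: rename Medusa head weights and split them for tensor parallelism.
# The "medusa." prefix, the ".{rank}.weight" suffix and the raw fp16 file
# format are hypothetical placeholders; the real convention is whatever the
# LMDeploy converter defines, not this snippet.
import os
import torch

def convert_medusa_heads(state_dict, out_dir, tp=2):
    os.makedirs(out_dir, exist_ok=True)
    for name, tensor in state_dict.items():
        tensor = tensor.half()  # fp16 only, matching the POC scope
        new_name = f"medusa.{name}"  # distinguish from base-model weights
        if name.endswith(".weight") and tensor.dim() == 2:
            # Split 2-D projection weights along the output dimension.
            chunks = torch.chunk(tensor, tp, dim=0)
        else:
            # Replicate biases / 1-D tensors on every rank.
            chunks = [tensor] * tp
        for rank, chunk in enumerate(chunks):
            chunk.contiguous().numpy().tofile(
                os.path.join(out_dir, f"{new_name}.{rank}.weight"))

heads = torch.load("medusa-vicuna-13b-v1.3/medusa_lm_head.pt", map_location="cpu")
convert_medusa_heads(heads, "workspace/triton_models/weights", tp=2)
```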
In the current version, to complete the proof of concept (POC), we will first implement fp16 for LlamaForCausalLM, and subsequently expand to other types such as fp32, bf16, and int8.
@irexyc @grimoire @lzhangzz @lvhan028 Do you have any suggestions? Thanks.
In addition to weight conversion, we will open separate issues detailing the subsequent steps. Stay tuned.
Related resources
No response
Additional context
No response