🔥 News:
- Upgraded Python version to 3.12
- Added the HUGGING_FACE_HUB_TOKEN variable for private models on HuggingFace
- New architecture for more flexibility!
vLLM is one of the state-of-the-art libraries for deploying a Large Language Model (LLM) and its API with high generation performance. However, vLLM does not currently support embeddings models on the /v1/embeddings endpoint, although it can be used to deploy an API following OpenAI conventions (see this discussion).
This repository makes it easy to add the /v1/embeddings endpoint by deploying an embedding model with HuggingFace Text Embeddings Inference (TEI), and serves everything on a single port. The aim of this repository is to provide a complete API that is very light and easy to use and maintain!
The API offers the following OpenAI endpoints:
- /health
- /v1/models
- /v1/chat/completions
- /v1/completions
- /v1/embeddings
You can access the other vLLM API endpoints:
- /vllm/health
- /vllm/version
And the HuggingFace Text Embeddings Inference API endpoints:
- /tei/decode
- /tei/embed
- /tei/health
- /tei/embed_all
- /tei/embed_sparse
- /tei/embeddings
- /tei/info
- /tei/metrics
- /tei/predict
- /tei/rerank
- /tei/tokenize
- /tei/vertex
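To illustrate the single-port layout described above, here is a minimal routing sketch (an illustration only, not the repository's actual code): paths starting with /vllm/ or /tei/ are forwarded to the corresponding backend with the prefix stripped, and everything else goes to the OpenAI-compatible layer.

```python
# Sketch of the single-port prefix routing; backend names are illustrative.

def route(path: str) -> tuple[str, str]:
    """Map an incoming request path to (backend, upstream_path)."""
    if path.startswith("/vllm/"):
        return "vllm", path[len("/vllm"):]   # e.g. /vllm/health -> /health
    if path.startswith("/tei/"):
        return "tei", path[len("/tei"):]     # e.g. /tei/embed -> /embed
    return "api", path                       # OpenAI-compatible endpoints

print(route("/tei/embed"))        # ('tei', '/embed')
print(route("/vllm/version"))     # ('vllm', '/version')
print(route("/v1/embeddings"))    # ('api', '/v1/embeddings')
```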
Currently, this architecture supports almost all LLMs and embeddings models. The response of the /v1/models endpoint adds a new "type" key, which takes the value "text-generation" or "text-embeddings-inference" depending on the nature of the model (language or embeddings). These values correspond to the labels given to models on HuggingFace. Example:
{
  "object": "list",
  "data": [
    {
      "model": < language model >,
      "type": "text-generation",
      ...
    },
    {
      "model": < embeddings model >,
      "type": "text-embeddings-inference",
      ...
    }
  ]
}
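A client can use this "type" key to pick the right model for each task. A minimal sketch, where the response dict mirrors the example above and the model IDs are placeholders:

```python
# Sketch: split a /v1/models response by the added "type" key.
# The model IDs below are placeholders, not real repositories.

models_response = {
    "object": "list",
    "data": [
        {"model": "my-org/my-llm", "type": "text-generation"},
        {"model": "my-org/my-embeddings", "type": "text-embeddings-inference"},
    ],
}

def models_of_type(response: dict, model_type: str) -> list[str]:
    """Return the model IDs whose "type" matches model_type."""
    return [m["model"] for m in response["data"] if m["type"] == model_type]

print(models_of_type(models_response, "text-generation"))
# ['my-org/my-llm']
```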
- First, configure a .env file or modify the .env.example file in this repository. For more information about the configuration, please refer to the configuration section.
- Then, run the containers with Docker Compose:
docker compose --env-file .env.example up --detach
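Once the containers are up, all endpoints are reachable on a single port (8080 is assumed here; adjust it to your compose file). The following sketch only builds an authenticated OpenAI-style request with the standard library, without sending it; the model ID is the embeddings model used in the test command at the end of this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: the compose file exposes port 8080
API_KEY = "my-secret-key"           # the API_KEY configured in your .env file

def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Prepare an authenticated POST request for an OpenAI-style endpoint."""
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )

req = build_request("/v1/embeddings", {"model": "intfloat/e5-small", "input": "hello"})
print(req.full_url)  # http://localhost:8080/v1/embeddings
# urllib.request.urlopen(req) would send it once the containers are running.
```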
| Variable | Description |
|---|---|
| EMBEDDINGS_HF_REPO_ID | HuggingFace repository ID of the embeddings model. Please refer to the HuggingFace Text Embeddings Inference documentation to find supported models. |
| LLM_HF_REPO_ID | HuggingFace repository ID of the LLM. Please refer to the vLLM documentation to find supported models. |
| TEI_ARGS | Arguments for Text Embeddings Inference (format: `--arg1 --arg2`). Please refer to the HuggingFace Text Embeddings Inference documentation for more information. |
| VLLM_ARGS | Arguments for vLLM (format: `--arg1 --arg2`). Please refer to the vLLM documentation for more information. |
| HUGGING_FACE_HUB_TOKEN | HuggingFace API token for private models on the HuggingFace Hub. |
| API_KEY | API key to protect your API. |
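For orientation, here is a hypothetical .env sketch combining the variables above. The model IDs are the ones used in the test command at the end of this README; the argument values, token, and key are placeholders to adapt to your deployment.

```shell
# Hypothetical .env sketch; adapt every value to your own deployment.
LLM_HF_REPO_ID=TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ
EMBEDDINGS_HF_REPO_ID=intfloat/e5-small
VLLM_ARGS=--max-model-len 4096
TEI_ARGS=--max-client-batch-size 32
HUGGING_FACE_HUB_TOKEN=<your HuggingFace token, for private models>
API_KEY=<your API key>
```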
You can use the deployed API with LangChain to create embedding vectors for your vector store. For example:
from langchain_community.embeddings import HuggingFaceHubEmbeddings

embeddings = HuggingFaceHubEmbeddings(model="http://localhost:8080/tei/")
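The returned vectors can then be compared for similarity search in your vector store. A minimal cosine-similarity sketch in pure Python — the vectors here are dummy values standing in for real embed_query results:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# With a running API: v1 = embeddings.embed_query("a cat"), etc.
v1, v2 = [1.0, 0.0], [1.0, 1.0]  # dummy vectors for illustration
print(round(cosine_similarity(v1, v2), 3))  # 0.707
```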
You can test the deployment with the provided test script:
python tests.py --llm-hf-repo-id TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ --embeddings-hf-repo-id intfloat/e5-small --debug