Deploy a light, full OpenAI API for production with vLLM, supporting /v1/embeddings with all embedding models.

leoguillaume/vLLMembeddings


vLLM + embeddings

Deploy a full OpenAI API with vLLM that supports all embedding models

🔥 News:

  • Upgraded the Python version to 3.12
  • Added the HUGGING_FACE_HUB_TOKEN variable for private models on HuggingFace
  • New architecture for more flexibility!

vLLM is one of the state-of-the-art libraries for deploying a Large Language Model (LLM) behind an API with high generation performance. However, while vLLM can be used to deploy an API following OpenAI conventions, it does not currently support all embedding models on the /v1/embeddings endpoint (see this discussion).

This repository makes it easy to add the /v1/embeddings endpoint by deploying an embedding model with HuggingFace Text Embeddings Inference (TEI), and serves everything on a single port. The aim of this repository is to provide a complete API that is very light and easy to use and maintain!

How does it work?

The API offers the following OpenAI endpoints:

  • /health
  • /v1/models
  • /v1/chat/completions
  • /v1/completions
  • /v1/embeddings

You can access the other vLLM API endpoints:

  • /vllm/health
  • /vllm/version

And the HuggingFace Text Embeddings Inference API endpoints:

  • /tei/decode
  • /tei/embed
  • /tei/health
  • /tei/embed_all
  • /tei/embed_sparse
  • /tei/embeddings
  • /tei/info
  • /tei/metrics
  • /tei/predict
  • /tei/rerank
  • /tei/tokenize
  • /tei/vertex
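Since everything is served on a single port, a client only needs one base URL, and the path prefix decides which backend answers. Below is a minimal sketch of building an OpenAI-style /v1/embeddings request by hand; the host, port, API key, and model name are illustrative assumptions, not values fixed by this repository:

```python
import json

BASE_URL = "http://localhost:8080"  # assumption: where the stack is exposed
API_KEY = "my-secret-key"           # assumption: the API_KEY set in your .env

def embeddings_request(texts, model):
    """Build the URL, headers, and body of an OpenAI-style /v1/embeddings call."""
    url = f"{BASE_URL}/v1/embeddings"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "input": texts})
    return url, headers, body

url, headers, body = embeddings_request(["hello world"], "intfloat/e5-small")
# Send it with any HTTP client, e.g. requests.post(url, headers=headers, data=body).
# The /vllm/... and /tei/... endpoints are reached the same way, on the same port.
```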

⚠️ You can't access the vLLM and TEI API Swagger UIs through this deployment. The Swagger UI of TEI is available in the official documentation.

Models

Currently, this architecture supports almost all LLMs and embedding models. The response of the /v1/models endpoint adds a new "type" key, which takes the value "text-generation" or "text-embeddings-inference" depending on the nature of the model (language or embedding). These values correspond to the labels given to models on HuggingFace. Example:

{
    "object": "list", 
    "data": [
        {
            "model": < language model >,
            "type": "text-generation",
            ...
        },
        {
            "model": < embeddings model >,
            "type": "text-embeddings-inference",
            ...
        }
    ]
}
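A client can use this extra "type" key to pick the right model for each task. A small sketch (the model names are illustrative):

```python
# Sketch: selecting models by "type" from a /v1/models response.
models_response = {
    "object": "list",
    "data": [
        {"model": "TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ", "type": "text-generation"},
        {"model": "intfloat/e5-small", "type": "text-embeddings-inference"},
    ],
}

def models_of_type(response, model_type):
    """Return the names of models whose "type" matches model_type."""
    return [entry["model"] for entry in response["data"] if entry["type"] == model_type]

llms = models_of_type(models_response, "text-generation")
embedders = models_of_type(models_response, "text-embeddings-inference")
```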

Installation

  • First, configure a .env file or modify the .env.example file in this repository. For more information about the configuration, please refer to the configuration section.

  • Then, run the containers with Docker Compose:

    docker compose --env-file .env.example up --detach

Configuration

| Variable | Description |
|----------|-------------|
| EMBEDDINGS_HF_REPO_ID | HuggingFace repository ID of the embedding model. Please refer to the HuggingFace Text Embeddings Inference documentation to find supported models. |
| LLM_HF_REPO_ID | HuggingFace repository ID of the LLM. Please refer to the vLLM documentation to find supported models. |
| TEI_ARGS | Arguments for Text Embeddings Inference (format: --arg1 --arg2). Please refer to the HuggingFace Text Embeddings Inference documentation for more information. |
| VLLM_ARGS | Arguments for vLLM (format: --arg1 --arg2). Please refer to the vLLM documentation for more information. |
| HUGGING_FACE_HUB_TOKEN | HuggingFace API token for private models on the HuggingFace Hub. |
| API_KEY | API key to protect your API. |
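Putting it together, a .env might look like the fragment below. All values are illustrative; in particular, the extra arguments are just examples of each engine's flags, not required settings:

```shell
# .env — illustrative values only
LLM_HF_REPO_ID=TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ
EMBEDDINGS_HF_REPO_ID=intfloat/e5-small
VLLM_ARGS=--max-model-len 4096
TEI_ARGS=--max-client-batch-size 32
HUGGING_FACE_HUB_TOKEN=hf_xxx
API_KEY=my-secret-key
```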

🦜 LangChain integration

You can use the deployed API with LangChain to create embedding vectors for your vector store. For example:

from langchain_community.embeddings import HuggingFaceHubEmbeddings

embeddings = HuggingFaceHubEmbeddings(model="http://localhost:8080/tei/")

🔦 Tests

python tests.py --llm-hf-repo-id TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ --embeddings-hf-repo-id intfloat/e5-small --debug
