
Ask Poddy

A screenshot of the Ask Poddy web app showing a chat between the user and the AI

Ask Poddy (named after "Poddy", the RunPod bot on Discord) is a user-friendly RAG (Retrieval-Augmented Generation) web application designed to showcase how easy it is to set up OpenAI-compatible APIs using open-source models running serverless on RunPod. Built with Next.js, React, Tailwind, the Vercel AI SDK, and LangChain, it uses Meta-Llama-3-8B-Instruct as the LLM and multilingual-e5-large-instruct for text embeddings.

This tutorial guides you through deploying Ask Poddy in your own environment so that it can answer RunPod-related questions effectively by leveraging the open-source workers worker-vllm and worker-infinity-embedding.





Concept

Ask Poddy is designed to demonstrate the integration of serverless OpenAI-compatible APIs with open-source models. The application runs locally (but it could also be deployed into the cloud), while the computational heavy lifting is handled by serverless endpoints on RunPod. This architecture allows seamless use of existing OpenAI-compatible tools and frameworks without needing to develop custom APIs.
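
Concretely, any OpenAI-compatible client can be pointed straight at a RunPod serverless endpoint. Here is a minimal TypeScript sketch using the openai npm package; the base URL assumes RunPod's OpenAI-compatible route (https://api.runpod.ai/v2/&lt;endpoint_id&gt;/openai/v1), and the API key and endpoint ID are the ones you create later in this tutorial:

import OpenAI from "openai";

// Point the stock OpenAI client at the RunPod serverless endpoint instead of api.openai.com.
const openai = new OpenAI({
  apiKey: process.env.RUNPOD_API_KEY,
  baseURL: `https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID_VLLM}/openai/v1`,
});

const completion = await openai.chat.completions.create({
  model: "meta-llama/Meta-Llama-3-8B-Instruct",
  messages: [{ role: "user", content: "What is RunPod?" }],
});

console.log(completion.choices[0].message.content);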

Here's how RAG works in Ask Poddy (a rough code sketch follows the steps below):

Diagram showing how the RAG process works

  1. User: Asks a question.
  2. Vector Store: The question is sent to LangChain, which uses the worker-infinity-embedding endpoint to convert the question into an embedding using the multilingual-e5-large-instruct model.
  3. Vector Store: Performs a similarity search to find relevant documents based on the question.
  4. AI SDK: The retrieved documents and the user's question are sent to the worker-vllm endpoint.
  5. worker-vllm: Generates an answer using the Meta-Llama-3-8B-Instruct model.
  6. User: Receives the answer.
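
In code, this round trip can be sketched roughly as follows. This is an illustration rather than the exact Ask Poddy source; the ./vectorstore directory, the prompt wording, and the OpenAI-compatible base URLs are assumptions:

import OpenAI from "openai";
import { OpenAIEmbeddings } from "@langchain/openai";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";

// Steps 2-3: embed the question via the worker-infinity-embedding endpoint and
// retrieve similar chunks from the local vector store.
const embeddings = new OpenAIEmbeddings({
  model: "intfloat/multilingual-e5-large-instruct",
  apiKey: process.env.RUNPOD_API_KEY,
  configuration: {
    baseURL: `https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID_EMBEDDING}/openai/v1`,
  },
});
const vectorStore = await HNSWLib.load("./vectorstore", embeddings);
const question = "How do I create a serverless endpoint?";
const documents = await vectorStore.similaritySearch(question, 4);

// Steps 4-5: send the retrieved context plus the question to the worker-vllm endpoint.
const llm = new OpenAI({
  apiKey: process.env.RUNPOD_API_KEY,
  baseURL: `https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID_VLLM}/openai/v1`,
});
const answer = await llm.chat.completions.create({
  model: "meta-llama/Meta-Llama-3-8B-Instruct",
  messages: [
    {
      role: "system",
      content: `Answer the question using this context:\n${documents.map((d) => d.pageContent).join("\n---\n")}`,
    },
    { role: "user", content: question },
  ],
});
console.log(answer.choices[0].message.content);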

Tip

You can choose any of the models supported by vLLM.



Tutorial: Setting Up "Ask Poddy" in Your Environment

Prerequisites

You need a RunPod account with an API key, Node.js with npm, and git installed locally.

1. Clone the Repository

  1. Clone the Ask Poddy repository and go into the cloned directory:
git clone https://github.com/blib-la/ask-poddy.git
cd ask-poddy
  2. Clone the RunPod docs repository into ask-poddy/data/runpod-docs:
git clone https://github.com/runpod/docs.git ./data/runpod-docs

Note

The RunPod docs repository contains the RunPod documentation that Ask Poddy will use to answer questions.

  3. Copy the img folder from ./data/runpod-docs/static/img to ./public.

Note

This makes it possible for Ask Poddy to include images from the RunPod documentation.


2. Install Dependencies

Navigate to the ask-poddy directory and install the dependencies:

npm install

3. Set Up RunPod Serverless Endpoints

3.1 Network Volumes

  1. Create two network volumes with 15 GB of storage each in the data center where you will deploy the serverless endpoints:
    • Volume for embeddings: infinity_embeddings
    • Volume for LLM: vllm_llama3

Note

Using network volumes ensures that the models and embeddings are stored persistently, allowing for faster subsequent requests as the data does not need to be downloaded or recreated each time.

3.2 Worker-vLLM Endpoint

  1. Follow the guide for setting up the vLLM endpoint, but use the meta-llama/Meta-Llama-3-8B-Instruct model instead of the one mentioned in the guide, and select the network volume vllm_llama3 when creating the endpoint.

Tip

This endpoint is powered by worker-vllm.

3.3 Worker-Infinity-Embedding Endpoint

  1. Create a new template.
  2. Use the Docker image runpod/worker-infinity-embedding:stable-cuda12.1.0 from worker-infinity-embedding and set the environment variable MODEL_NAMES to intfloat/multilingual-e5-large-instruct.
  3. Create a serverless endpoint and make sure to select the network volume infinity_embeddings (a quick test call is sketched below).
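
Once the endpoint is deployed, it can be exercised with any OpenAI-compatible client. A quick smoke test in TypeScript, assuming the worker exposes the same /openai/v1 route as the vLLM worker and using the API key and endpoint ID from step 4:

import OpenAI from "openai";

// Request one embedding from the worker-infinity-embedding endpoint.
const client = new OpenAI({
  apiKey: process.env.RUNPOD_API_KEY,
  baseURL: `https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID_EMBEDDING}/openai/v1`,
});

const response = await client.embeddings.create({
  model: "intfloat/multilingual-e5-large-instruct",
  input: "What is a network volume?",
});

console.log(response.data[0].embedding.length); // dimensionality of the returned vector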

4. Configure Environment Variables

  1. Generate your RunPod API key
  2. Find the endpoint IDs underneath the deployed serverless endpoints.

Screenshot showing the ID of the worker underneath the title

  3. Create your .env.local based on .env.local.example, or create the file with the following variables:
RUNPOD_API_KEY=your_runpod_api_key
RUNPOD_ENDPOINT_ID_VLLM=your_vllm_endpoint_id
RUNPOD_ENDPOINT_ID_EMBEDDING=your_embedding_endpoint_id
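
These three values are all the app needs to reach both workers. Assuming RunPod's OpenAI-compatible routing, the endpoint IDs typically translate into base URLs like this (a hypothetical helper for illustration, not part of the repository):

// Hypothetical helper: build the OpenAI-compatible base URL for a RunPod serverless endpoint.
const runpodBaseUrl = (endpointId: string) =>
  `https://api.runpod.ai/v2/${endpointId}/openai/v1`;

const vllmBaseUrl = runpodBaseUrl(process.env.RUNPOD_ENDPOINT_ID_VLLM ?? "");
const embeddingBaseUrl = runpodBaseUrl(process.env.RUNPOD_ENDPOINT_ID_EMBEDDING ?? "");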

5. Populate the Vector Store

To populate the vector store, run the following command:

npm run populate

Note

The first run will take some time as the worker downloads the embeddings model (multilingual-e5-large-instruct). Subsequent requests will use the downloaded model stored in the network volume.

This command reads all markdown documents from the ask-poddy/data/runpod-docs/ folder, creates embeddings using the embedding endpoint running on RunPod, and stores these embeddings in the local vector store (a sketch of such a script follows the steps below):

Diagram showing how the vector store gets populated with documents

  1. Documents: The markdown documents from the ask-poddy/data/runpod-docs/ folder are read by LangChain.
  2. Chunks: LangChain converts the documents into smaller chunks, which are then sent to the worker-infinity-embedding endpoint.
  3. worker-infinity-embedding: Receives chunks, generates embeddings using the multilingual-e5-large-instruct model, and sends them back.
  4. Vector Store: LangChain saves these embeddings in the local vector store (HNSWlib).
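
A hedged sketch of what such a populate script can look like with LangChain JS; the chunk sizes, file extensions, and the ./vectorstore output directory are assumptions, not the exact script from this repository:

import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";

// 1. Read the markdown documents from the cloned RunPod docs.
const loader = new DirectoryLoader("./data/runpod-docs", {
  ".md": (path) => new TextLoader(path),
});
const documents = await loader.load();

// 2. Split the documents into smaller chunks.
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000, chunkOverlap: 100 });
const chunks = await splitter.splitDocuments(documents);

// 3. Embed the chunks via the worker-infinity-embedding endpoint.
const embeddings = new OpenAIEmbeddings({
  model: "intfloat/multilingual-e5-large-instruct",
  apiKey: process.env.RUNPOD_API_KEY,
  configuration: {
    baseURL: `https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID_EMBEDDING}/openai/v1`,
  },
});

// 4. Store the embeddings in a local HNSWLib vector store.
const vectorStore = await HNSWLib.fromDocuments(chunks, embeddings);
await vectorStore.save("./vectorstore");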

Tip

A vector store is a database that stores embeddings (vector representations of text) to enable efficient similarity search. This is crucial for the RAG process as it allows the system to quickly retrieve relevant documents based on the user's question.


6. Start the Local Web Server

  1. Start the local web server:
npm run dev
  2. Open http://localhost:3000 to access the UI.

7. Ask Poddy

Now that everything is running, you can ask RunPod-related questions, such as:

  • What is RunPod?
  • How do I create a serverless endpoint?
  • What are the benefits of using a network volume?
  • How can I become a host for the community cloud?
  • Can RunPod help my startup to get going?

Note

The first run will take some time as the worker downloads the LLM (Meta-Llama-3-8B-Instruct). Subsequent requests will use the downloaded model stored in the network volume.
