
Releases: runpod-workers/worker-vllm

1.0.1

13 Jun 17:52

Hotfix adding backwards compatibility for the deprecated max_context_len_to_capture engine argument.

1.0.0

12 Jun 19:29

Worker vLLM 1.0.0 - What's Changed

  • vLLM version 0.3.3 -> 0.4.2
  • Various improvements and bug fixes

0.3.2

12 Mar 23:11

Worker vLLM 0.3.2 - What's Changed

  • vLLM version 0.3.2 -> 0.3.3
    • StarCoder2 support
    • Performance optimization for Gemma
    • 2/3/8-bit GPTQ support
    • Integrate Marlin Kernels for Int4 GPTQ inference
    • Performance optimization for MoE kernel
  • Updated and refactored base image, sampling parameters, etc.
  • Various bug fixes

0.3.1

29 Feb 08:00

Bug Fixes

  • Loading an already-downloaded model while Hugging Face is down
  • Building the image on a machine without a GPU
  • Model and tokenizer revision names

0.3.0

24 Feb 03:37

Worker vLLM 0.3.0 - What's New since 0.2.3

  • 🚀 Full OpenAI Compatibility 🚀

    You may now use your deployment with any OpenAI codebase by changing only 3 lines in total (see the sketch after this list). The supported routes are Chat Completions, Completions, and Models, with both streaming and non-streaming.

  • Dynamic Batch Size - time-to-first-token as fast as with no batching, while maintaining the throughput of batched token streaming for the rest of the request.

  • vLLM 0.2.7 -> 0.3.2

    • Gemma, DeepSeek MoE and OLMo support.
    • FP8 KV Cache support
    • New supported parameters
    • We're working on adding support for Multi-LoRA ⚙️
  • Support for a wide range of new settings for your endpoint.
  • Fixed tensor parallelism, baking the model into the image, and other bugs.
  • Refactors and general improvements.
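
As a concrete illustration of the "3 lines" claim above, here is a minimal sketch of pointing an existing OpenAI client at the worker's OpenAI-compatible route. The base URL format, endpoint ID, API key, and model name are placeholders/assumptions; substitute your own values.

```python
from openai import OpenAI

# Sketch only: <endpoint_id>, <runpod_api_key>, and <deployed_model_name> are
# placeholders, and the base URL format is an assumption - check the README
# for your endpoint's exact OpenAI-compatible URL.
client = OpenAI(
    api_key="<runpod_api_key>",                                   # line 1: your RunPod API key
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",  # line 2: your endpoint's OpenAI route
)

response = client.chat.completions.create(
    model="<deployed_model_name>",                                # line 3: the model served by the worker
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```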

Full Changelog: 0.2.3...0.3.0

0.2.3

10 Feb 04:14

Worker vLLM 0.2.3 - What's Changed

Various bug fixes


0.2.2

31 Jan 05:35

Worker vLLM 0.2.2 - What's New

  • Custom Chat Templates: you can now specify a Jinja chat template with an environment variable (see the sketch after this list).
  • Custom Tokenizer support.
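
For illustration, a minimal sketch of what such a Jinja chat template could look like. The template content is an example, the variable name CUSTOM_CHAT_TEMPLATE is an assumption, and in practice the template string is supplied through the endpoint's environment variables rather than from Python; the snippet below just previews how a template formats a conversation.

```python
from jinja2 import Template

# Assumed variable name; the value is a Jinja template applied to the chat messages.
# CUSTOM_CHAT_TEMPLATE = chat_template
chat_template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
    "assistant:"
)

# Preview how the template renders a conversation before deploying it.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is vLLM?"},
]
print(Template(chat_template).render(messages=messages))
```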

Fixes:

  • Tensor Parallel/Multi-GPU Deployment
  • Baking the model into the image. Previously, the worker would re-download the model on every start, ignoring the baked-in model.
  • Crashes due to MAX_PARALLEL_LOADING_WORKERS

0.2.1

26 Jan 04:35

Worker vLLM 0.2.1 - What's New

  • Added OpenAI Chat Completions-formatted output for non-streaming use (previously only supported for streaming).

0.2.0

26 Jan 04:26

Worker vLLM 0.2.0 - What's New

  • You no longer need a Linux-based machine or NVIDIA GPUs to build the worker.
  • Over 3x smaller Docker image.
  • OpenAI Chat Completion output format (optional to use).
  • Fast image build times.
  • Docker Secrets-protected Hugging Face token support for building the image with a model baked in, without exposing your token.
  • Support for the n and best_of sampling parameters, which let you generate multiple responses from a single prompt (see the sketch after this list).
  • New environment variables for various configuration options.
  • vLLM version: 0.2.7
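
To illustrate the n and best_of parameters, here is a hedged sketch of a request payload. The endpoint URL, API key, and exact input schema are placeholders/assumptions; the idea is simply that the sampling parameters are passed alongside the prompt.

```python
import requests

# Sketch only: <endpoint_id> and <runpod_api_key> are placeholders, and the
# input layout (prompt + sampling_params) is assumed - check the README for
# the exact schema your worker version expects.
payload = {
    "input": {
        "prompt": "List three uses for a Raspberry Pi.",
        "sampling_params": {
            "n": 3,        # return three completions for this prompt
            "best_of": 5,  # sample five candidates, keep the three best (must be >= n)
            "max_tokens": 128,
            "temperature": 0.8,
        },
    }
}

resp = requests.post(
    "https://api.runpod.ai/v2/<endpoint_id>/runsync",
    headers={"Authorization": "Bearer <runpod_api_key>"},
    json=payload,
    timeout=120,
)
print(resp.json())
```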

0.1.0

17 Jan 00:51

What's Changed


Full Changelog: https://github.com/runpod-workers/worker-vllm/commits/0.1.0