
Releases: runpod-workers/worker-vllm

1.0.1

13 Jun 17:52

Hotfix adding backwards compatibility for the deprecated max_context_len_to_capture engine argument.

1.0.0

12 Jun 19:29

Worker vLLM 1.0.0 - What's Changed

  • vLLM version 0.3.3 -> 0.4.2
  • Various improvements and bug fixes

0.3.2

12 Mar 23:11

Worker vLLM 0.3.2 - What's Changed

  • vLLM version 0.3.2 -> 0.3.3
    • StarCoder2 support
    • Performance optimization for Gemma
    • 2/3/8-bit GPTQ support
    • Integrate Marlin Kernels for Int4 GPTQ inference
    • Performance optimization for MoE kernel
  • Updated and refactored base image, sampling parameters, etc.
  • Various bug fixes

0.3.1

29 Feb 08:00

Bug Fixes

  • Loading an already-downloaded model while Hugging Face is down
  • Building the image on a machine without a GPU
  • Model and tokenizer revision names

0.3.0

24 Feb 03:37

Worker vLLM 0.3.0 - What's New since 0.2.3

  • 🚀 Full OpenAI Compatibility 🚀

    You may now use your deployment with any OpenAI codebase by changing only 3 lines in total (see the sketch after this list). The supported routes are Chat Completions, Completions, and Models, with both streaming and non-streaming.

  • Dynamic Batch Size - time-to-first-token as fast as with no batching, while maintaining the throughput of batched token streaming for the rest of the request.

  • vLLM 0.2.7 -> 0.3.2

    • Gemma, DeepSeek MoE and OLMo support.
    • FP8 KV Cache support
    • New supported parameters
    • We're working on adding support for Multi-LoRA ⚙️
  • Support for a wide range of new settings for your endpoint.
  • Fixed tensor parallelism, baking the model into the image, and other bugs.
  • Refactors and general improvements.
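
As a concrete illustration of the "3 lines" claim above, here is a minimal sketch of pointing an existing OpenAI client at the worker's OpenAI-compatible route. The base URL format, endpoint ID, API key, and model name are placeholders/assumptions; substitute your own values.

```python
from openai import OpenAI

# Sketch only: <endpoint_id>, <runpod_api_key>, and <deployed_model_name> are
# placeholders, and the base URL format is an assumption - check the README
# for your endpoint's exact OpenAI-compatible URL.
client = OpenAI(
    api_key="<runpod_api_key>",                                   # line 1: your RunPod API key
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",  # line 2: your endpoint's OpenAI route
)

response = client.chat.completions.create(
    model="<deployed_model_name>",                                # line 3: the model served by the worker
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```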

Full Changelog: 0.2.3...0.3.0

0.2.3

10 Feb 04:14

Worker vLLM 0.2.3 - What's Changed

Various bug fixes


0.2.2

31 Jan 05:35

Worker vLLM 0.2.2 - What's New

  • Custom Chat Templates: you can now specify a Jinja chat template with an environment variable (see the sketch after this list).
  • Custom Tokenizer support.
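
For illustration, a minimal sketch of what such a Jinja chat template could look like. The template content is an example, the variable name CUSTOM_CHAT_TEMPLATE is an assumption, and in practice the template string is supplied through the endpoint's environment variables rather than from Python; the snippet below just previews how a template formats a conversation.

```python
from jinja2 import Template

# Assumed variable name; the value is a Jinja template applied to the chat messages.
# CUSTOM_CHAT_TEMPLATE = chat_template
chat_template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
    "assistant:"
)

# Preview how the template renders a conversation before deploying it.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is vLLM?"},
]
print(Template(chat_template).render(messages=messages))
```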

Fixes:

  • Tensor Parallel/Multi-GPU Deployment
  • Baking the model into the image. Previously, the worker would re-download the model on every start, ignoring the baked-in model.
  • Crashes due to MAX_PARALLEL_LOADING_WORKERS

0.2.1

26 Jan 04:35

Worker vLLM 0.2.1 - What's New

  • Added OpenAI Chat Completions-formatted output for non-streaming use (previously only supported for streaming).

0.2.0

26 Jan 04:26

Worker vLLM 0.2.0 - What's New

  • You no longer need a Linux-based machine or NVIDIA GPUs to build the worker.
  • Over 3x smaller Docker image.
  • OpenAI Chat Completion output format (optional to use).
  • Fast image build times.
  • Docker Secrets-protected Hugging Face token support for building the image with a model baked in, without exposing your token.
  • Support for the n and best_of sampling parameters, which let you generate multiple responses from a single prompt (see the sketch after this list).
  • New environment variables for various configuration options.
  • vLLM version: 0.2.7
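
To illustrate the n and best_of parameters, here is a hedged sketch of a request payload. The endpoint URL, API key, and exact input schema are placeholders/assumptions; the idea is simply that the sampling parameters are passed alongside the prompt.

```python
import requests

# Sketch only: <endpoint_id> and <runpod_api_key> are placeholders, and the
# input layout (prompt + sampling_params) is assumed - check the README for
# the exact schema your worker version expects.
payload = {
    "input": {
        "prompt": "List three uses for a Raspberry Pi.",
        "sampling_params": {
            "n": 3,        # return three completions for this prompt
            "best_of": 5,  # sample five candidates, keep the three best (must be >= n)
            "max_tokens": 128,
            "temperature": 0.8,
        },
    }
}

resp = requests.post(
    "https://api.runpod.ai/v2/<endpoint_id>/runsync",
    headers={"Authorization": "Bearer <runpod_api_key>"},
    json=payload,
    timeout=120,
)
print(resp.json())
```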

0.1.0

17 Jan 00:51

What's Changed


Full Changelog: https://github.com/runpod-workers/worker-vllm/commits/0.1.0