[Feature]: Dynamic tpm quota (multiple projects) #4124
Comments
For example, you'd have to define how long an app is considered "active" after its last call, so you can accurately compute the number of "simultaneous" apps.
Can we do this similar to latency ttl? Check the number of active apps -> divide the available tpm capacity by the number of active apps -> return that as the available tpm for that call.
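A minimal sketch of that division step, just to make the arithmetic concrete. The function name `available_tpm_per_app` and the choice to give a lone caller the full capacity are assumptions for illustration, not anything that exists in the project.

```python
def available_tpm_per_app(total_tpm: int, active_apps: int) -> int:
    """Split a model's TPM capacity evenly across the currently active apps.

    Hypothetical helper: name and signature are assumptions for this sketch.
    """
    # If no app has been seen yet, treat the caller as the only active app
    # so it receives the full capacity (also avoids dividing by zero).
    return total_tpm // max(active_apps, 1)


# Example: 100k TPM shared by 4 active apps.
share = available_tpm_per_app(100_000, 4)
```

Integer division is used here so the per-app shares never add up to more than the total. Any remainder simply goes unused.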
this probably wouldn't be in router.py as it doesn't really have anything to do with the load balancing between models
would probably be a call hook on the proxy similar to parallel_request_limiter.py
Basic test cases:
For v0, considering tracking 'active apps' by model with ttl = 60s
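One way the per-model tracking with a 60s ttl could look, as a rough in-memory sketch. The class `ActiveAppTracker`, its method names, and the injectable `clock` are all hypothetical. A real proxy hook would more likely use a shared cache than a local dict.

```python
import time
from collections import defaultdict


class ActiveAppTracker:
    """Hypothetical in-memory tracker of 'active apps' per model (not a project API)."""

    def __init__(self, ttl_seconds: float = 60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the ttl can be tested without sleeping
        self._last_seen = defaultdict(dict)  # model -> {app_id: time of last call}

    def record_call(self, model: str, app_id: str) -> None:
        # Each call refreshes the app's last-seen timestamp for that model.
        self._last_seen[model][app_id] = self.clock()

    def active_count(self, model: str) -> int:
        # An app counts as active if its last call was less than ttl seconds ago.
        now = self.clock()
        apps = self._last_seen[model]
        expired = [app for app, seen in apps.items() if now - seen >= self.ttl]
        for app in expired:
            del apps[app]
        return len(apps)


# Usage with a fake clock: two apps call the same model 30 seconds apart.
now = [0.0]
tracker = ActiveAppTracker(ttl_seconds=60, clock=lambda: now[0])
tracker.record_call("gpt-4", "app-a")
now[0] = 30.0
tracker.record_call("gpt-4", "app-b")
```

The count returned by `active_count` would then be the divisor for the model's tpm capacity on each call.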
E2E case - "I am more concerned about apps gobbling up all the quota"
base case:
advanced case:
The Feature
Allow setting a dynamic_quote: true flag for a dynamic tpm budget dependent on how many simultaneous apps there are
Motivation, pitch
prevent apps from exhausting tpm quota
Twitter / LinkedIn details
cc: @jeromeroussin