More accurate time logging for ImageEncoder and fix concurrent image processing corruption #1765
Conversation
@@ -104,23 +98,22 @@ def forward(self, inputs: List[Image]):
        """Model forward."""
        time_start = time.perf_counter()
        outputs = self.model.forward(inputs)
        if isinstance(outputs[0], torch.Tensor):
Do we still need to convert to CPU?
This modification mainly addresses #1759.
Will this slow down inference performance?
Tested with llava-v1.5-7b and profile_restful_api.py (PR 1662).
Moving the tensor to CPU doesn't affect performance, but using asyncio.Task may affect it.
# with thread
number of prompt tokens: 248339
number of completion tokens: 240582
token throughput (completion token): 825.260 token/s
token throughput (prompt + completion token): 1677.128 token/s
RPS (request per second): 3.430 req/s
RPM (request per minute): 205.816 req/min
# with asyncio.Task
number of prompt tokens: 248339
number of completion tokens: 240582
token throughput (completion token): 808.719 token/s
token throughput (prompt + completion token): 1643.513 token/s
RPS (request per second): 3.362 req/s
RPM (request per minute): 201.691 req/min
If GPU utilization is high, you can add
await asyncio.get_event_loop().run_in_executor(None,
                                               self.stream.synchronize)
after forward so other coroutines can use the CPU during the forward pass.
According to my tests, wrapping the whole forward call is better than synchronizing the stream afterward:
outputs = await asyncio.get_event_loop().run_in_executor(
None, self.forward, inputs)
# token throughput (completion token): 826.370 token/s
outputs = self.model.forward(inputs)
if isinstance(outputs[0], torch.Tensor):
stream = torch.cuda.current_stream(outputs[0].device)
await asyncio.get_event_loop().run_in_executor(
None, stream.synchronize)
# token throughput (completion token): 815.130 token/s
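The pattern being compared above can be sketched without a GPU: offload the blocking call (here, `model.forward` plus stream synchronization) to the default thread-pool executor via `run_in_executor`, so other coroutines keep running while the forward executes. In this minimal sketch, `blocking_forward` and `heartbeat` are hypothetical stand-ins, with `time.sleep` substituting for GPU work:

```python
import asyncio
import time


def blocking_forward(inputs):
    # Stand-in for model.forward + stream.synchronize, which block
    # the calling thread until the (GPU) work finishes.
    time.sleep(0.2)
    return [x * 2 for x in inputs]


async def encode(inputs):
    # Offload the blocking call to the default thread pool so the
    # event loop stays free for other coroutines.
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(None, blocking_forward, inputs)


async def heartbeat(ticks):
    # Keeps running concurrently while the forward is offloaded.
    for _ in range(4):
        await asyncio.sleep(0.05)
        ticks.append(time.perf_counter())


async def main():
    ticks = []
    outputs, _ = await asyncio.gather(encode([1, 2, 3]), heartbeat(ticks))
    return outputs, ticks


outputs, ticks = asyncio.run(main())
print(outputs)  # [2, 4, 6] — heartbeat ticked 4 times during the forward
```

Without `run_in_executor`, the blocking call would stall the event loop and `heartbeat` could not make progress until the forward returned, which matches the throughput difference observed above.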
LGTM
Motivation
#1759