[Bug]: Connection reset by peer code 14. #2243

dasantosa · 2024-01-05T11:58:13Z

OpenVINO Version

2023.0

Operating System

Ubuntu 20.04 (LTS)

Device used for inference

CPU

Framework

PyTorch

Model used

No response

Issue description

I'm using gRPC to make requests to the predict service. When I run it on a local machine I have no problems, but when I deploy it on AWS, I sometimes get a "Connection reset by peer" error. It doesn't follow a pattern; it happens randomly, and I need to reopen the channel.
I am using the Python API, creating the connection and calling the predict endpoint this way:

# Open an insecure channel and create the prediction stub
self.channel = grpc.insecure_channel('{host}:{port}'.format(host=self.host, port=self.port), options=self.options)
stub = prediction_service_pb2_grpc.PredictionServiceStub(self.channel)

# Build the request and call Predict with a 30-second timeout
request = predict_pb2.PredictRequest()
request.model_spec.name = self.model_name
request.model_spec.signature_name = self.signature_name
result = stub.Predict(request, timeout=30, wait_for_ready=True)
return result

I have the following configuration:

grpc.max_message_length = 100 * 1024 * 1024
grpc.max_receive_message_length = 128 * 1024 * 1024
grpc.enable_http_proxy = 0
grpc.keepalive_time_ms = 2147483647
grpc.max_connection_idle_ms = 2147483647
grpc.max_connection_age_ms = 2147483647
grpc.max_connection_age_grace_ms = 2147483647
grpc.client_idle_timeout_ms = 2147483647
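
For reference, the settings listed above can be written as the sequence of (name, value) pairs that `grpc.insecure_channel` accepts through its `options` argument. This is only a sketch of the reported configuration, not a recommended one; the helper name `build_channel_options` and the `host:9000` target in the comment are hypothetical. Note that 2147483647 ms is roughly 24.8 days, so with these values the time-based settings would not be expected to trigger during a test that runs for a few hours.

```python
# Sketch: the reported settings expressed as gRPC channel options.
# Channel options are passed as a sequence of (name, value) pairs.
INT32_MAX = 2**31 - 1  # 2147483647, the value used in the report


def build_channel_options():
    """Return the reported settings as a list of (name, value) tuples.

    Hypothetical helper: the option names and values are copied from
    the report as-is, without checking which ones gRPC acts on.
    """
    return [
        ("grpc.max_message_length", 100 * 1024 * 1024),
        ("grpc.max_receive_message_length", 128 * 1024 * 1024),
        ("grpc.enable_http_proxy", 0),
        ("grpc.keepalive_time_ms", INT32_MAX),
        ("grpc.max_connection_idle_ms", INT32_MAX),
        ("grpc.max_connection_age_ms", INT32_MAX),
        ("grpc.max_connection_age_grace_ms", INT32_MAX),
        ("grpc.client_idle_timeout_ms", INT32_MAX),
    ]


options = build_channel_options()

# Hypothetical usage (requires the grpcio package):
#   import grpc
#   channel = grpc.insecure_channel("host:9000", options=options)
```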

Step-by-step reproduction

No response

Relevant log output

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Connection reset by peer"
	debug_error_string = "{"created":"@1699622642.536271816","description":"Error received from peer ipv4:x.x.x.x:9000","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"Connection reset by peer","grpc_status":14}"

Issue submission checklist

I'm reporting an issue. It's not a question.
I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
There is reproducer code and related data files such as images, videos, models, etc.


mlukasze · 2024-01-05T12:45:53Z

isn't it a problem of OVMS?
@dtrawins fyi

atobiszei · 2024-01-17T09:00:20Z

@dasantosa Could you share OVMS logs with log_level DEBUG? Did you try using the OVMS client as an alternative?

dasantosa · 2024-01-19T12:15:06Z

@atobiszei thanks for your response! I tried the OVMS client and the error occurs in the same way. That was the reason I implemented my own version, which is exactly the same as the OVMS client but with some additional gRPC features. On the server side, with log_level DEBUG, nothing was logged when the error occurred, so I can't attach any information about it...

However, I added some extra code. I tried to check the channel state before sending the request, but it doesn't work as I expected:

state = self.channel._channel.check_connectivity_state(True)

# 0 = IDLE, 2 = READY; any other state triggers a reconnect
if state != 0 and state != 2:
    self.channel.close()
    self.reinitchannel_and_checkserverconectivity()

And the libraries that I use with their versions:

grpcio==1.59.3
grpclib==0.4.6
protobuf==3.19.0
requests~=2.31.0
numpy==1.19.5
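
The state check in this comment compares raw integers. A minimal sketch that names them is shown below, assuming the gRPC core ordering of connectivity states (IDLE=0, CONNECTING=1, READY=2, TRANSIENT_FAILURE=3, SHUTDOWN=4); the `ConnectivityState` enum and `needs_reconnect` helper are hypothetical names introduced for illustration. One possible reason the check does not help is that a channel may still report IDLE or READY locally after the remote side or a device in between has dropped the connection, so the reset only surfaces on the next request.

```python
from enum import IntEnum


class ConnectivityState(IntEnum):
    """Integer codes assumed to match gRPC core's connectivity states."""
    IDLE = 0
    CONNECTING = 1
    READY = 2
    TRANSIENT_FAILURE = 3
    SHUTDOWN = 4


def needs_reconnect(state):
    """Mirror the check from the comment: reconnect unless IDLE or READY."""
    healthy = (ConnectivityState.IDLE, ConnectivityState.READY)
    return ConnectivityState(state) not in healthy
```

With this helper, the condition `state != 0 and state != 2` reads as `needs_reconnect(state)`.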

mzegla · 2024-01-22T15:07:19Z

I don't think it's something to fix on the client side. My guess would be networking, especially since you say it always works when you deploy locally and the issue only occurs on AWS.

It doesn't follow a sequence, that is, it happens randomly and I need to reopen the channel.

When you deploy to AWS, is it always okay at the beginning (for the first few requests) and then it stops working, or is it completely random?
When you encounter that error, do you do something on the deployment side (on AWS), or just reconnect the client?

dasantosa · 2024-01-22T15:42:27Z

Exactly, when I deploy it on AWS it works fine. I tested it by sending a request with the same image for a few hours, with a random delay of 1 to 5 minutes between requests. Sometimes it fails and I have to handle the 503 exception. When the exception occurs, I just close the channel and reopen it, and it works fine again until the next exception.
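
The close-and-reopen workaround described above could be wrapped in a small retry helper. The following is a sketch under stated assumptions, not code from the reporter or from the OVMS client: `ReconnectingCaller`, `open_channel`, `do_call`, and `is_retryable` are hypothetical names, and the channel object is only assumed to have a `close()` method. With grpcio, `is_retryable` would typically test for `grpc.RpcError` with `code() == grpc.StatusCode.UNAVAILABLE`, as noted in the comments.

```python
import time


class ReconnectingCaller:
    """Retry a call after closing and reopening the channel (sketch)."""

    def __init__(self, open_channel, is_retryable, max_attempts=3, backoff_s=0.0):
        self._open_channel = open_channel  # callable returning a new channel
        self._is_retryable = is_retryable  # callable: exception -> bool
        self._max_attempts = max_attempts
        self._backoff_s = backoff_s
        self.channel = open_channel()

    def call(self, do_call):
        """Run do_call(channel); on a retryable error, reopen and retry."""
        for attempt in range(1, self._max_attempts + 1):
            try:
                return do_call(self.channel)
            except Exception as exc:
                # Give up on the last attempt or on non-retryable errors.
                if attempt == self._max_attempts or not self._is_retryable(exc):
                    raise
                self.channel.close()
                time.sleep(self._backoff_s * attempt)  # linear backoff
                self.channel = self._open_channel()


# Hypothetical usage with grpcio (not executed here):
#   import grpc
#   caller = ReconnectingCaller(
#       open_channel=lambda: grpc.insecure_channel("host:9000"),
#       is_retryable=lambda e: isinstance(e, grpc.RpcError)
#           and e.code() == grpc.StatusCode.UNAVAILABLE,
#   )
#   result = caller.call(lambda ch: make_stub(ch).Predict(request, timeout=30))
```

This only hides the symptom on the client side; it does not explain why the connection is reset in the AWS deployment.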

dasantosa added the bug label Jan 5, 2024

ilya-lavrenov transferred this issue from openvinotoolkit/openvino Jan 8, 2024
