[Bug]: Connection reset by peer code 14. #2243

Open
3 tasks done
dasantosa opened this issue Jan 5, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@dasantosa

OpenVINO Version

2023.0

Operating System

Ubuntu 20.04 (LTS)

Device used for inference

CPU

Framework

PyTorch

Model used

No response

Issue description

I'm using gRPC to send requests to the predict service. When I run it on a local machine I have no problems, but when I deploy it on AWS, I sometimes get a "Connection reset by peer" error. It doesn't follow a pattern: it happens randomly, and I need to reopen the channel.
I am using a Python API, creating the connection and calling the predict endpoint this way:

# Open the channel and create the stub from it.
self.channel = grpc.insecure_channel(f'{self.host}:{self.port}', options=self.options)
stub = prediction_service_pb2_grpc.PredictionServiceStub(self.channel)

# Build the request and call Predict with a 30-second timeout.
request = predict_pb2.PredictRequest()
request.model_spec.name = self.model_name
request.model_spec.signature_name = self.signature_name
result = stub.Predict(request, timeout=30, wait_for_ready=True)
return result

I have the following configuration:

grpc.max_message_length = 100 * 1024 * 1024
grpc.max_receive_message_length = 128 * 1024 * 1024
grpc.enable_http_proxy = 0
grpc.keepalive_time_ms = 2147483647
grpc.max_connection_idle_ms = 2147483647
grpc.max_connection_age_ms = 2147483647
grpc.max_connection_age_grace_ms = 2147483647
grpc.client_idle_timeout_ms = 2147483647
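
For reference, the Python gRPC API takes channel arguments as a list of (name, value) tuples. The sketch below shows one way the settings above could be packed into that form. The helper names `build_channel_options` and `open_channel` are hypothetical and not from the original code; the keys and values are copied from the list above unchanged. Note that 2147483647 is INT_MAX, which for `grpc.keepalive_time_ms` means, as far as I know, that keepalive pings are effectively disabled.

```python
from typing import List, Tuple

INT_MAX = 2147483647  # the value used for most settings in the issue


def build_channel_options() -> List[Tuple[str, int]]:
    """Return the settings from the issue as (name, value) pairs.

    Hypothetical helper: keys and values are copied verbatim from the issue.
    """
    return [
        ("grpc.max_message_length", 100 * 1024 * 1024),
        ("grpc.max_receive_message_length", 128 * 1024 * 1024),
        ("grpc.enable_http_proxy", 0),
        ("grpc.keepalive_time_ms", INT_MAX),
        ("grpc.max_connection_idle_ms", INT_MAX),
        ("grpc.max_connection_age_ms", INT_MAX),
        ("grpc.max_connection_age_grace_ms", INT_MAX),
        ("grpc.client_idle_timeout_ms", INT_MAX),
    ]


def open_channel(host: str, port: int):
    """Open an insecure channel with the options above (hypothetical helper)."""
    import grpc  # imported lazily so the option list can be built without grpcio

    return grpc.insecure_channel(f"{host}:{port}", options=build_channel_options())
```

Keeping the options in one function makes it easy to log exactly what the client was started with when comparing the local and AWS deployments.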

Step-by-step reproduction

No response

Relevant log output

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Connection reset by peer"
	debug_error_string = "{"created":"@1699622642.536271816","description":"Error received from peer ipv4:x.x.x.x:9000","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"Connection reset by peer","grpc_status":14}"

Issue submission checklist

  • I'm reporting an issue. It's not a question.
  • I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
  • There is reproducer code and related data files such as images, videos, models, etc.
@dasantosa dasantosa added the bug Something isn't working label Jan 5, 2024
@mlukasze

mlukasze commented Jan 5, 2024

isn't it a problem of OVMS?
@dtrawins fyi

@ilya-lavrenov ilya-lavrenov transferred this issue from openvinotoolkit/openvino Jan 8, 2024
@atobiszei
Collaborator

atobiszei commented Jan 17, 2024

@dasantosa Could you share the OVMS logs with log_level DEBUG? Did you try using the OVMS client as an alternative?

@dasantosa
Author

dasantosa commented Jan 19, 2024

@atobiszei thanks for your response! I tried the OVMS client and the error occurs in the same way. That was the reason I implemented my own version, which is exactly the same as the OVMS client but with some additional gRPC features. On the server side, with log_level DEBUG, nothing was logged when the error occurred, so I can't attach any information about it...

However, I added some additional code. I tried to check the channel status before sending the request, but it doesn't work as I expected:

state = self.channel._channel.check_connectivity_state(True)

if state != 0 and state != 2:
    self.channel.close()
    self.reinitchannel_and_checkserverconectivity()
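
To make the integer comparison above easier to read, here is a minimal sketch that names the states. It assumes the integers returned by the private `check_connectivity_state` call follow gRPC core's connectivity-state ordering (IDLE, CONNECTING, READY, TRANSIENT_FAILURE, SHUTDOWN as 0 to 4); `ConnectivityState` and `needs_reopen` are hypothetical names introduced only for illustration.

```python
from enum import IntEnum


class ConnectivityState(IntEnum):
    """Assumed mirror of gRPC core's connectivity-state integers."""

    IDLE = 0
    CONNECTING = 1
    READY = 2
    TRANSIENT_FAILURE = 3
    SHUTDOWN = 4


def needs_reopen(state: int) -> bool:
    """Same test as `state != 0 and state != 2`, written with named states."""
    return ConnectivityState(state) not in (
        ConnectivityState.IDLE,
        ConnectivityState.READY,
    )
```

A check like this only reflects what the client library currently believes about the connection. If something between the client and the server drops the connection silently, the state may still read READY until the next request is sent, which could explain why the check does not catch the error.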

And the libraries that I use with their versions:

grpcio==1.59.3
grpclib==0.4.6
protobuf==3.19.0
requests~=2.31.0
numpy==1.19.5

@mzegla
Collaborator

mzegla commented Jan 22, 2024

I don't think it's something to fix on the client side. My guess would be networking, especially since you say it always works when you deploy locally and the issue occurs only on AWS.

It doesn't follow a pattern: it happens randomly, and I need to reopen the channel.

When you deploy to AWS, is it always okay at the beginning (for the first few requests) and then it stops working, or is it completely random?
When you encounter that error, do you do something on the deployment side (on AWS), or do you just reconnect the client?

@dasantosa
Author

Exactly: when I deploy it on AWS it works fine. I tested it by sending a request with the same image for a few hours, with a random delay of 1 to 5 minutes between requests. Sometimes it fails and I have to handle the 503 exception. When the exception occurs, I just close the channel and reopen it, and it works fine again until the next exception.
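
The close-and-reopen workaround described above could be wrapped in a small retry helper. The following is only a sketch under stated assumptions, not the OVMS client's behavior: `call_with_reconnect` and `is_unavailable` are hypothetical names, the retry count and backoff are arbitrary, and the error check relies on the raised exception exposing `code()` the way `grpc.RpcError` objects from stub calls do.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def is_unavailable(exc: Exception) -> bool:
    """True if exc looks like a gRPC error with status UNAVAILABLE (code 14)."""
    code = getattr(exc, "code", None)
    if not callable(code):
        return False
    return getattr(code(), "name", None) == "UNAVAILABLE"


def call_with_reconnect(
    call: Callable[[], T],
    reconnect: Callable[[], None],
    max_attempts: int = 3,
    backoff_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on UNAVAILABLE, reconnect, back off, and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            # Re-raise anything that is not UNAVAILABLE, or the final failure.
            if attempt == max_attempts or not is_unavailable(exc):
                raise
            reconnect()  # e.g. close the old channel and build a new stub
            sleep(backoff_s * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError("max_attempts must be at least 1")
```

In the client from this issue, `call` would be a function that builds the stub from the current channel and invokes `Predict`, and `reconnect` would be the existing method that closes and reopens the channel. The stub has to be recreated after reconnecting, since a stub stays bound to the channel it was created from.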
