storcon: reduce "connection refused" period during upgrades (storcon deployments cause cplane operation failures (connection refused\nrequest must not be retried)) #8034

Open
problame opened this issue Jun 12, 2024 · 3 comments
Labels: c/storage/controller, c/storage/pageserver, t/bug, triaged

problame commented Jun 12, 2024

Context: https://neondb.slack.com/archives/C06K38EB05D/p1718209960490099?thread_ts=1718184799.253779&cid=C06K38EB05D

Problem

In prodlike cloudbench, we observed that after a storcon deployment, cplane can still get connection refused errors when it tries to talk to storcon as late as 44s (!) after storcon logs that it is up again.

Analysis

@ololobus :

Networking in k8s may take some time to roll out, and storcon has only one pod
The same goes for the LB / ingress discovering targets
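
One way to quantify the window described above is to poll the service address until a TCP connect succeeds. The following is a minimal sketch, not the tooling used in cloudbench; the function name `wait_until_accepting` and its parameters are hypothetical.

```python
import socket
import time


def wait_until_accepting(host, port, timeout_s=120.0, interval_s=0.5):
    """Poll host:port until a TCP connect succeeds.

    Returns the number of seconds waited. Raises TimeoutError if no
    connection could be established within timeout_s.
    (Hypothetical helper for illustration only.)
    """
    start = time.monotonic()
    while True:
        try:
            # create_connection performs the TCP handshake. A refused
            # connection raises ConnectionRefusedError, an OSError subclass.
            with socket.create_connection((host, port), timeout=interval_s):
                return time.monotonic() - start
        except OSError:
            if time.monotonic() - start >= timeout_s:
                raise TimeoutError(
                    f"{host}:{port} not accepting connections after {timeout_s}s"
                )
            time.sleep(interval_s)
```

Run from a pod in the same cluster right after storcon logs that it is up, a call such as `wait_until_accepting("neon-storage-controller.neon-storage-controller.svc.cluster.local", 50051)` would report how long the service address kept refusing connections.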

Impact

When the cplane client makes a POST request and gets connection refused, it does not retry the request, because it does not assume idempotency.

Example cplane log message

{"level":"ERR","ts":"2024-06-11T21:06:56.135Z","logger":"publicapiv2","message":"incoming request finished with internal error","http_meth":"POST","http_path":"/api/v2/projects/broad-boat-65064583/branches","route":"CreateProjectBranch","request_id":"70326701-9c98-4a9a-8fcd-aea2770ec8ed","trace_id":"T9WXJzpEJmibet3VyxrRjh","project_id":"broad-boat-65064583","account_id":"3eeaaef0-50fa-4074-8ed7-0a20f097d9fb","ingress_duration_ms":7277,"status":500,"account_id":"3eeaaef0-50fa-4074-8ed7-0a20f097d9fb","error":"could not create project-branch: Get \"http://neon-storage-controller.neon-storage-controller.svc.cluster.local:50051/v1/tenant/50f4379c3e4849ff7025fa4c14dced53/timeline/b7fd1f7c95d710c166c93d1ed0871324\": dial tcp 172.20.8.98:50051: connect: connection refused\nrequest must not be retried"}

Related

  • neondatabase/cloud#13222
    • cplane is tweaking retries there. Possibly this specific branch creation API is idempotent and hence safe to retry
    • but @ololobus's argument:
      • retrying for >=1m in the HTTP request handler doesn't look like a feasible approach, so having a zero-downtime rollout for storcon is likely a must-have for GA

problame added the t/bug, c/storage/pageserver, and c/storage/controller labels on Jun 12, 2024

jcsp commented Jun 13, 2024

Is this distinct from #7797 ?

problame commented

#7797 mentions 503, so storcon was running.

What we observed here was connection refused, i.e., the client was not even able to establish a TCP connection.

A (very) narrow-minded solution to #7797 may not address the connection refused issue.

But yeah, in spirit this is a dupe of #7797

jcsp commented Jun 13, 2024

POST is idempotent as long as it includes a timeline ID -- @Bodobolero, until we make the controller more seamlessly available during restarts (in Q3), can you make your client retry past this class of error?
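
A minimal sketch of such a client-side retry, assuming the request carries a timeline ID so that replaying it is safe. The names `is_retryable` and `call_with_retry` and all default values are hypothetical, not cplane's actual client code. The loop treats both connection refused and HTTP 503 as transient, and it is bounded by an overall deadline so a request handler does not retry indefinitely.

```python
import time
import urllib.error


def is_retryable(exc):
    """Return True for errors that suggest storcon is restarting."""
    # HTTPError must be checked first because it subclasses URLError.
    if isinstance(exc, urllib.error.HTTPError):
        return exc.code == 503
    # urllib wraps connect failures in URLError; the cause is in .reason.
    if isinstance(exc, urllib.error.URLError):
        return isinstance(exc.reason, ConnectionRefusedError)
    return isinstance(exc, ConnectionRefusedError)


def call_with_retry(send, deadline_s=10.0, base_delay_s=0.1,
                    max_delay_s=2.0, sleep=time.sleep):
    """Call send() until it succeeds or the deadline passes.

    Assumes send() is idempotent (e.g. it includes a timeline ID).
    Non-retryable errors are re-raised immediately; once the deadline
    has passed, the last retryable error is re-raised.
    """
    start = time.monotonic()
    delay = base_delay_s
    while True:
        try:
            return send()
        except Exception as exc:
            remaining = deadline_s - (time.monotonic() - start)
            if not is_retryable(exc) or remaining <= 0:
                raise
            # Exponential backoff, capped and never sleeping past the deadline.
            sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay_s)
```

Here `send` would be a zero-argument callable that issues the request, for example a `lambda` wrapping `urllib.request.urlopen`. The deadline keeps the retry budget well below the request timeout of the calling handler, which is the concern raised in the issue description about retrying for a minute or more.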

jcsp added the triaged label on Jun 13, 2024
jcsp changed the title from "storcon: storcon deployments cause cplane operation failures (connection refused\nrequest must not be retried)" to "storcon: reduce "connection refused" period during upgrades (storcon deployments cause cplane operation failures (connection refused\nrequest must not be retried))" on Jun 17, 2024