Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: service stop gets called multiple times in error cases and results in timeouts #7732

Open
sipsma opened this issue Jun 24, 2024 · 1 comment
Assignees
Labels
area/engine About dagger core engine kind/bug Something isn't working

Comments

@sipsma
Copy link
Contributor

sipsma commented Jun 24, 2024

Context: I was working on the final PR for passing sockets as args and had a bug that caused the Tunnel in c2h.go to return an error. That was just a bug in the WIP code, but the failure mode I saw was very weird and repros on main: the session close timed out.

After questioning my fundamental sanity for a while, I finally realized that for some reason the service stop callback got called multiple times in this error case, which is bad because it does a select on an exited channel that will already have been read from on the second call:

This is definitely wrong. But this is a tangent on a tangent and not blocking my efforts atm so I'm just opening this issue so I don't forget to go back and look more and fix it. It's one of those things that might be simple (just save the exited error in a var rather than using a chan) or might be a deeper problem with more wide ranging effects (is it supposed to be okay to call stop multiple times in the first place? etc.). I'm just not sure until I look more.

cc @vito @jedevc, just an FYI in case this rings any bells


To easily repro on main you can apply this diff that simulates an error in the tunneling:

diff --git a/core/c2h.go b/core/c2h.go
index dd6284dbc..1ff842e4a 100644
--- a/core/c2h.go
+++ b/core/c2h.go
@@ -92,11 +92,15 @@ func (d *c2hTunnel) Tunnel(ctx context.Context) (rerr error) {
                                }
 
                                proxyConnPool.Go(func(ctx context.Context) error {
-                                       err := sshforward.Copy(ctx, downstreamConn, upstreamClient, upstreamClient.CloseSend)
-                                       if err != nil {
-                                               connSlog.Error("failed to copy data", "error", err)
-                                       }
-                                       return err
+                                       _ = upstreamClient
+                                       /*
+                                               err := sshforward.Copy(ctx, downstreamConn, upstreamClient, upstreamClient.CloseSend)
+                                               if err != nil {
+                                                       connSlog.Error("failed to copy data", "error", err)
+                                               }
+                                               return err
+                                       */
+                                       return fmt.Errorf("OH NO!")
                                })
                        }
                })
diff --git a/core/service.go b/core/service.go
index 69c8c753c..704519390 100644
--- a/core/service.go
+++ b/core/service.go
@@ -631,6 +631,8 @@ func (svc *Service) startReverseTunnel(ctx context.Context, id *call.ID) (runnin
                        return nil, fmt.Errorf("health check errored: %w", err)
                }
 
+               var stopCount int
+
                return &RunningService{
                        Service: svc,
                        Key: ServiceKey{
@@ -640,6 +642,9 @@ func (svc *Service) startReverseTunnel(ctx context.Context, id *call.ID) (runnin
                        Host:  fullHost,
                        Ports: checkPorts,
                        Stop: func(context.Context, bool) error {
+                               stopCount++
+                               fmt.Printf("STOP COUNT: %d\n", stopCount)
+
                                defer span.End()
                                stop()
                                waitCtx, waitCancel := context.WithTimeout(context.WithoutCancel(svcCtx), 10*time.Second)

You'll see in engine logs that the stop count goes to 2.

@sipsma sipsma added kind/bug Something isn't working area/engine About dagger core engine labels Jun 24, 2024
@sipsma sipsma self-assigned this Jun 24, 2024
@vito
Copy link
Contributor

vito commented Jun 24, 2024

It looks like it's at expected that Stop can be called multiple times:

dagger/core/services.go

Lines 362 to 365 in 36c3d8b

err := running.Stop(ctx2, false)
if context.Cause(ctx2) == cause {
// service didn't terminate within timeout, so force it to stop
err = running.Stop(ctx, true)

So I think we just need to tweak it to do the same var exitErr error + close(exited) pattern that regular container services do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/engine About dagger core engine kind/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants