CSI volume bugs in Docker Swarm #47974

Open
beornf opened this issue Jun 13, 2024 · 5 comments
Labels
area/swarm area/volumes kind/bug status/0-triage version/24.0

Comments

@beornf

beornf commented Jun 13, 2024

Description

I have found a few critical bugs while testing the CSI volumes feature. The first bug results in an infinite loop that causes dockerd logs to fill up the local disk:

Jun 13 02:15:26 dev dockerd[3050504]: time="2024-06-13T02:15:20.752699431-04:00" level=info msg="attempting to publish volume" attempt=13703446 module=node/agent/csi volume.id=u4cw74pjuiujgqm3ujj5ui7b6

This is due to the following code for calculating a timer in the volume queue, where the bit-shifting calculation can overflow:

func (timerSource) NewTimer(attempt uint) vqTimer {
	var waitFor time.Duration
	if attempt == 0 {
		waitFor = 0
	} else {
		// bit-shifting the base retry interval will raise it by 2 to the power
		// of attempt. this is an easy way to do an exponent solely with
		// integers
		waitFor = baseRetryInterval << attempt
		if waitFor > maxRetryInterval {
			waitFor = maxRetryInterval
		}
	}
	return timer{Timer: time.NewTimer(waitFor)}
}

For example, evaluating the waitFor calculation for higher attempt values in gomacro gives:

gomacro> baseRetryInterval << 8
25.6s
gomacro> baseRetryInterval << 16
1h49m13.6s
gomacro> baseRetryInterval << 32
119304h38m49.6s
gomacro> baseRetryInterval << 64
0s

To fix this, we could check against a maximum number of attempts, such as 16 or 32, or even calculate the exact attempt at which the result exceeds maxRetryInterval.
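For instance, capping the shift amount would look roughly like this (a sketch against the function above, not the actual swarmkit code; the cap of 31 is an arbitrary safe example, since a 100ms base shifted by 31 still fits comfortably in an int64):

	// cap the shift so baseRetryInterval << attempt can never overflow int64;
	// 31 is an example bound, chosen so the shifted value already far exceeds
	// any sensible maxRetryInterval
	if attempt > 31 {
		attempt = 31
	}
	waitFor = baseRetryInterval << attempt
	if waitFor > maxRetryInterval {
		waitFor = maxRetryInterval
	}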

The second bug relates to early returns in the CSI node plugin code that cause subsequent logic to be skipped:

	// we must unpublish before we unstage. verify here that the volume is not
	// published.
	if v, ok := np.volumeMap[req.ID]; ok {
		if v.isPublished {
			return status.Errorf(codes.FailedPrecondition, "Volume %s is not unpublished", req.ID)
		}
		return nil
	}

In addition, there is this early return in the same file:

	if v, ok := np.volumeMap[req.ID]; ok {
		v.publishedPath = ""
		v.isPublished = false
		return nil
	}

I have already opened a PR for the second bug: moby/swarmkit#3178.
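The gist of the fix (a sketch of the intent only, not necessarily the exact diff in that PR) is to return early only in the error case, so that the unstage and unpublish logic that follows these checks still runs:

	// sketch: only the "still published" error case returns early, so the
	// NodeUnstageVolume call further down in the function is not skipped
	if v, ok := np.volumeMap[req.ID]; ok && v.isPublished {
		return status.Errorf(codes.FailedPrecondition, "Volume %s is not unpublished", req.ID)
	}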

Reproduce

  1. Create a CSI volume with the command:
docker volume create -d my-csi-plugin:latest --required-bytes 1G --scope single --sharing onewriter --type mount test
  2. Create a service with the command:
docker service create --name csi-test --mount type=cluster,src=test,dst=/data alpine:latest sleep infinity
  3. Remove the service:
docker service rm csi-test
  4. Create a service again with the same command:
docker service create --name csi-test --mount type=cluster,src=test,dst=/data alpine:latest sleep infinity

Expected behavior

The CSI volume bugs should be fixed so that Swarm handles CSI volume drivers gracefully, even when errors occur.

docker version

Client: Docker Engine - Community
 Version:           24.0.7
 API version:       1.43
 Go version:        go1.20.10
 Git commit:        afdd53b
 Built:             Thu Oct 26 09:08:17 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.7
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.10
  Git commit:       311b9ff
  Built:            Thu Oct 26 09:08:17 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.25
  GitCommit:        d8f198a4ed8892c764191ef7b3b06d8a2eeb5c7f
 runc:
  Version:          1.1.10
  GitCommit:        v1.1.10-0-g18a0cb0
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client: Docker Engine - Community
 Version:    24.0.7
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.14.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.27.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 7
  Running: 6
  Paused: 0
  Stopped: 1
 Images: 8
 Server Version: 24.0.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: error
  NodeID:
  Error: write /var/lib/docker/swarm/worker/tasks.db: no space left on device
  Is Manager: false
  Node Address: 1.2.3.4
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: d8f198a4ed8892c764191ef7b3b06d8a2eeb5c7f
 runc version: v1.1.10-0-g18a0cb0
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.10.0-29-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 1
 Total Memory: 958.6MiB
 Name: dev
 ID: f91bed3a-e670-47f2-a2da-529346c426d6
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional Info

The CSI volume plugin supports staging, and the node agent calls NodeUnpublishVolume and NodeUnstageVolume.

@beornf added the kind/bug and status/0-triage labels on Jun 13, 2024
@thaJeztah
Member

cc @dperny @corhere

@beornf
Author

beornf commented Jun 14, 2024

A possible solution to the bit-shifting bug above, without repeated calculations or magic values, is below:

		waitFor = baseRetryInterval << attempt
		if waitFor > maxRetryInterval || waitFor < baseRetryInterval {
			waitFor = maxRetryInterval
		}

In this solution, waitFor is bounded to the interval baseRetryInterval <= waitFor <= maxRetryInterval. Overflowed int64 values are always negative or zero, so the second condition also catches the overflow case.
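For reference, a minimal standalone sketch of this bounded backoff, assuming the 100ms base interval implied by the gomacro output above and a purely hypothetical 5-minute maxRetryInterval (the real swarmkit constant may differ):

package main

import (
	"fmt"
	"time"
)

const (
	baseRetryInterval = 100 * time.Millisecond // assumed from the gomacro output above
	maxRetryInterval  = 5 * time.Minute        // hypothetical cap, for illustration only
)

func boundedWait(attempt uint) time.Duration {
	if attempt == 0 {
		return 0
	}
	waitFor := baseRetryInterval << attempt
	// overflowed values wrap to zero or negative, so this also bounds overflow
	if waitFor > maxRetryInterval || waitFor < baseRetryInterval {
		waitFor = maxRetryInterval
	}
	return waitFor
}

func main() {
	// attempt=13703446 mirrors the runaway attempt count from the log above
	for _, attempt := range []uint{0, 1, 8, 16, 32, 64, 13703446} {
		fmt.Printf("attempt=%d waitFor=%s\n", attempt, boundedWait(attempt))
	}
}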

@dperny
Contributor

dperny commented Jun 17, 2024

Thank you for debugging this. This is a nightmare bug, I don't know that I would have ever caught it.

@beornf
Author

beornf commented Jun 20, 2024

I have found a third bug that is due to missing internal state during volume publishing and unpublishing, when the controller client is unset:

// PublishVolume calls ControllerPublishVolume to publish the given Volume to
// the Node with the given swarmkit ID. It returns a map, which is the
// PublishContext for this Volume on this Node.
func (p *plugin) PublishVolume(ctx context.Context, v *api.Volume, nodeID string) (map[string]string, error) {
	if !p.publisher {
		return nil, nil
	}
	csiNodeID := p.swarmToCSI[nodeID]
	if csiNodeID == "" {
		log.L.Errorf("CSI node ID not found for given Swarm node ID. Plugin: %s , Swarm node ID: %s", p.name, nodeID)
		return nil, status.Error(codes.FailedPrecondition, "CSI node ID not found for given Swarm node ID")
	}

	req := p.makeControllerPublishVolumeRequest(v, nodeID)
	c, err := p.Client(ctx)
	if err != nil {
		return nil, err
	}
	resp, err := c.ControllerPublishVolume(ctx, req)

	if err != nil {
		return nil, err
	}
	return resp.PublishContext, nil
}

// UnpublishVolume calls ControllerUnpublishVolume to unpublish the given
// Volume from the Node with the given swarmkit ID. It returns an error if the
// unpublish does not succeed
func (p *plugin) UnpublishVolume(ctx context.Context, v *api.Volume, nodeID string) error {
	if !p.publisher {
		return nil
	}

	req := p.makeControllerUnpublishVolumeRequest(v, nodeID)
	c, err := p.Client(ctx)
	if err != nil {
		return err
	}

	// response of the RPC intentionally left blank
	_, err = c.ControllerUnpublishVolume(ctx, req)
	return err
}

On the first call to both of these methods, the value of p.publisher will always be false. To fix this, it is necessary to move the p.Client(ctx) call to the start of each method, which ensures the internal state is set correctly.
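For example, PublishVolume could establish the client before the publisher guard, roughly like this (a sketch of the reordering only, not the exact upstream change):

func (p *plugin) PublishVolume(ctx context.Context, v *api.Volume, nodeID string) (map[string]string, error) {
	// establish the controller client first; this call populates internal
	// state such as p.publisher, so the guard below sees the correct value
	c, err := p.Client(ctx)
	if err != nil {
		return nil, err
	}
	if !p.publisher {
		return nil, nil
	}
	csiNodeID := p.swarmToCSI[nodeID]
	if csiNodeID == "" {
		log.L.Errorf("CSI node ID not found for given Swarm node ID. Plugin: %s , Swarm node ID: %s", p.name, nodeID)
		return nil, status.Error(codes.FailedPrecondition, "CSI node ID not found for given Swarm node ID")
	}

	req := p.makeControllerPublishVolumeRequest(v, nodeID)
	resp, err := c.ControllerPublishVolume(ctx, req)
	if err != nil {
		return nil, err
	}
	return resp.PublishContext, nil
}

UnpublishVolume would be reordered in the same way, calling p.Client(ctx) before checking p.publisher.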

@beornf
Author

beornf commented Jun 24, 2024

I have found a fourth bug that is due to a race condition and is a little harder to reason about. It happens when the csi-test service is running with the test volume published, and the docker daemon on the node is restarted with systemctl restart docker. The volume is then unpublished by the swarm leader via the freeVolumes method, but it is never republished, and the service waits for the volume indefinitely.

I was able to determine how this happens by logging changes in the task scheduler's internal state, and I found the specific area of concern in the scheduler event loop:

	schedule := func() {
		if len(s.pendingPreassignedTasks) > 0 {
			s.processPreassignedTasks(ctx)
		}
		if tickRequired {
			s.tick(ctx)
			tickRequired = false
		}
	}

I believe s.tick should run before s.processPreassignedTasks: with the current order, s.processPreassignedTasks unpublishes the volume because s.tick has not yet scheduled the replacement task. By switching the order as below, I was able to fix this race condition every time:

	schedule := func() {
		if tickRequired {
			s.tick(ctx)
			tickRequired = false
		}
		if len(s.pendingPreassignedTasks) > 0 {
			s.processPreassignedTasks(ctx)
		}
	}

This also makes intuitive sense: the current batch of tasks should be committed as soon as possible in the event loop.
