[BUG] Redis hangs during upgrade from v7.0.10 to v7.2.4 #13306

Open
zygisa opened this issue May 30, 2024 · 0 comments
zygisa commented May 30, 2024

Describe the bug

Redis instances randomly hang and become unresponsive after the upgrade from v7.0.10 to v7.2.4.

To reproduce

The following upgrade process was used when we noticed the issue:

  1. Set up a Redis cluster running v7.0.10 - in our case, we provisioned a cluster with 9 masters and 2 replicas per master
  2. Install the Redis v7.2.4 RPM package
  3. Restart all the replicas in a rolling fashion. All replicas are now running v7.2.4, while all the masters are still on v7.0.10
  4. Fail over the masters and restart the old master instances in a rolling fashion. At this point the master role moves to the instances running v7.2.4, and some of these new masters become unresponsive (a rough sketch of this step follows the list).
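
For context, this is roughly what step 4 looks like for a single master. It is a minimal sketch rather than our exact tooling: the host names and the systemd service name are placeholders, and port 6552 is taken from the logs below.

# Promote one of the v7.2.4 replicas of the master being upgraded:
redis-cli -h replica-host -p 6552 CLUSTER FAILOVER

# Wait until the promoted node reports role:master before moving on:
redis-cli -h replica-host -p 6552 INFO replication | grep role

# Restart the old master (now a replica) so it comes back on the v7.2.4 binary:
ssh old-master-host sudo systemctl restart redis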

This does not happen every time, and not all Redis processes hang after the failover; we haven't established any pattern. When a process hangs, it does not respond to any redis-cli commands (for example PING or INFO), and you can't set or get any keys. Running strace on the Redis instance produces no output, so the process appears to be completely stuck. You need to SIGKILL the process to stop it.
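
For reference, these are roughly the checks we run against a stuck instance. A minimal sketch, assuming the port (6552) and PID (2258807) seen in the logs below; the gdb backtrace is an extra step that may help pinpoint where the process is blocked, not something from the report above.

# Liveness check with a hard timeout; a hung instance never replies:
timeout 5 redis-cli -p 6552 PING
# Attach strace to the server process; on a hung instance this prints no syscalls:
sudo strace -p 2258807
# Optional: dump a backtrace of all threads to see where the process is blocked:
sudo gdb -p 2258807 -batch -ex 'thread apply all bt'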
The logs of the Redis process that hangs:

--> Failover happens
2258807:S 30 May 2024 11:00:15.191 * Manual failover user request accepted.
2258807:S 30 May 2024 11:00:15.192 * Received replication offset for paused master manual failover: 9037
2258807:S 30 May 2024 11:00:15.192 * All master replication stream processed, manual failover can start.
2258807:S 30 May 2024 11:00:15.192 * Start of election delayed for 0 milliseconds (rank #0, offset 9037).
2258807:S 30 May 2024 11:00:15.192 * Starting a failover election for epoch 12.
2258807:S 30 May 2024 11:00:15.193 * Failover election won: I'm the new master.
2258807:S 30 May 2024 11:00:15.193 * configEpoch set to 12 after successful failover
2258807:M 30 May 2024 11:00:15.193 * Connection with master lost.
2258807:M 30 May 2024 11:00:15.193 * Caching the disconnected master state.
2258807:M 30 May 2024 11:00:15.193 * Discarding previously cached master state.
2258807:M 30 May 2024 11:00:15.193 * Setting secondary replication ID to c35c915f9851c84d47057806cfba70f81ee97138, valid up to offset: 9038. New replication ID is b7c7225ef698b5f8da7d8960fde1025435dc3d7c
2258807:M 30 May 2024 11:00:15.196 * Replica 10.37.28.117:6552 asks for synchronization
2258807:M 30 May 2024 11:00:15.196 * Partial resynchronization request from 10.37.28.117:6552 accepted. Sending 0 bytes of backlog starting from offset 9038.
2258807:M 30 May 2024 11:00:15.199 * Replica 10.37.19.113:6552 asks for synchronization
2258807:M 30 May 2024 11:00:15.199 * Partial resynchronization request from 10.37.19.113:6552 accepted. Sending 0 bytes of backlog starting from offset 9038.
2258807:M 30 May 2024 11:00:15.844 * Connection with replica 10.37.28.117:6552 lost.
2258807:M 30 May 2024 11:00:16.447 * Replica 10.37.28.117:6552 asks for synchronization
2258807:M 30 May 2024 11:00:16.447 * Partial resynchronization request from 10.37.28.117:6552 accepted. Sending 0 bytes of backlog starting from offset 9038.
2258807:M 30 May 2024 11:00:22.548 * Failover auth granted to bd552e6f185d8040f02e739286b83d9f39cb795b () for epoch 13
2258807:M 30 May 2024 11:00:23.736 * Manual failover requested by replica c49eed33ca2420fa1634f701df494872d9d278fe ().
2258807:M 30 May 2024 11:00:23.737 * Failover auth granted to c49eed33ca2420fa1634f701df494872d9d278fe () for epoch 14
2258807:M 30 May 2024 11:00:23.738 * Connection with replica 10.37.28.117:6552 lost.
2258807:M 30 May 2024 11:00:23.739 * Configuration change detected. Reconfiguring myself as a replica of c49eed33ca2420fa1634f701df494872d9d278fe ()
2258807:S 30 May 2024 11:00:23.739 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
2258807:S 30 May 2024 11:00:23.739 * Connecting to MASTER 10.37.28.117:6552
2258807:S 30 May 2024 11:00:23.739 * MASTER <-> REPLICA sync started
2258807:S 30 May 2024 11:00:23.741 * Connection with replica 10.37.19.113:6552 lost.
2258807:S 30 May 2024 11:00:23.741 * Non blocking connect for SYNC fired the event.
2258807:S 30 May 2024 11:00:23.741 * Master replied to PING, replication can continue...
2258807:S 30 May 2024 11:00:23.741 * Trying a partial resynchronization (request b7c7225ef698b5f8da7d8960fde1025435dc3d7c:10221).
2258807:S 30 May 2024 11:00:23.741 * Successful partial resynchronization with master.
2258807:S 30 May 2024 11:00:23.741 * Master replication ID changed to 4f857c5e8a419baa6d70bfcd5c571c3c29113857
2258807:S 30 May 2024 11:00:23.741 * MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.
--> Redis process hangs

Expected behavior

The upgrade succeeds without any issues.
