Fix race in replicationLagModule of go/vt/throttle #16078
base: main
Conversation
Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

- General
- Tests
- Documentation
- New flags
- If a workflow is added or modified:
- Backward compatibility
Force-pushed from e0ced54 to 1455402
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main   #16078   +/-   ##
==========================================
+ Coverage   68.22%   68.24%   +0.02%
==========================================
  Files        1543     1543
  Lines      197586   197611      +25
==========================================
+ Hits       134805   134864      +59
+ Misses      62781    62747      -34
```

☔ View full report in Codecov by Sentry.
```diff
@@ -76,9 +82,35 @@ func (c *replicationLagCache) add(r replicationLagRecord) {
 	entry.add(r)
 }
 
+// maxLag returns the maximum replication lag for the entries in cache.
+func (c *replicationLagCache) maxLag() (maxLag uint32) {
```
Moved this mostly-unchanged from throttler.go (`.MaxLag(...)`) so that it takes this struct's own `mu sync.Mutex` instead of holding a lock from a different struct. `.MaxLag()` now calls this func.
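The pattern described here, where the cache guards its own entries with its own mutex instead of relying on a lock in another struct, might be sketched like this. This is a simplified stand-in (`lagCache`, `add`, plain `uint32` lag values), not the actual Vitess types:

```go
package main

import (
	"fmt"
	"sync"
)

// lagCache is a simplified stand-in for replicationLagCache: it owns
// its mutex, so every method that touches entries can lock it itself.
type lagCache struct {
	mu      sync.Mutex
	entries map[string]uint32 // tablet key -> last seen lag seconds
}

// add records the latest lag for a tablet under the cache's own lock.
func (c *lagCache) add(key string, lag uint32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = lag
}

// maxLag returns the maximum lag across all entries, holding the
// cache's own lock for the whole iteration so concurrent add/delete
// calls cannot race with the map walk.
func (c *lagCache) maxLag() (maxLag uint32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, lag := range c.entries {
		if lag > maxLag {
			maxLag = lag
		}
	}
	return maxLag
}

func main() {
	c := &lagCache{entries: map[string]uint32{}}
	c.add("cell-0000000101", 3)
	c.add("cell-0000000102", 9)
	fmt.Println(c.maxLag())
}
```

The key point is that callers never iterate the map directly; they go through a method that acquires the same mutex the writers use.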
```diff
 }
 
-	return maxLag
+	return cache.maxLag()
```
Moved into replicationLagCache.

I think I figured out why we only see this on v15 + v16-v19.
Signed-off-by: Tim Vaillancourt <[email protected]>

Force-pushed from d3db9fd to f53e089

Squashed commits to make this cherry-pick-able
FWIW, we do squash and merge by default, so you can always cherry-pick the squashed commit that is made to the upstream branch.
Right 👍. Here my goal is to cherry-pick this to |
```diff
@@ -30,6 +31,8 @@ type replicationLagCache struct {
 	// The map key is replicationLagRecord.LegacyTabletStats.Key.
 	entries map[string]*replicationLagHistory
 
+	mu sync.Mutex
```
I wonder how important it is that these two stay in perfect sync?

```go
slowReplicas              map[string]bool
ignoredSlowReplicasInARow map[string]bool
```

I'm wondering if we couldn't make these atomics instead:

```go
slowReplicas              atomic.Pointer[map[string]bool]
ignoredSlowReplicasInARow atomic.Pointer[map[string]bool]
```

Do we have any idea about the perf impact of the mutex? I would think it could potentially become pretty hot, but maybe not. If all we care about is eliminating the race conditions, however, then using atomics instead is an option too.
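The copy-on-write variant suggested here could look roughly like the sketch below. The names (`slowSet`, `markSlow`, `isSlow`) are illustrative, not Vitess APIs, and it assumes writes are serialized by a single goroutine; with multiple concurrent writers, the read-copy-store sequence could lose updates:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// slowSet keeps an immutable snapshot of the slow-replica set behind
// an atomic.Pointer. Readers never take a lock: they load the current
// map pointer and read from a map that is never mutated after publish.
type slowSet struct {
	p atomic.Pointer[map[string]bool]
}

func newSlowSet() *slowSet {
	s := &slowSet{}
	empty := map[string]bool{}
	s.p.Store(&empty)
	return s
}

// markSlow replaces the map wholesale (copy-on-write). Assumes a
// single writer goroutine; concurrent writers would need a mutex or a
// CompareAndSwap retry loop to avoid losing each other's updates.
func (s *slowSet) markSlow(key string) {
	old := *s.p.Load()
	next := make(map[string]bool, len(old)+1)
	for k, v := range old {
		next[k] = v
	}
	next[key] = true
	s.p.Store(&next)
}

// isSlow is safe to call from any goroutine without locking.
func (s *slowSet) isSlow(key string) bool {
	return (*s.p.Load())[key]
}

func main() {
	s := newSlowSet()
	s.markSlow("cell-0000000101")
	fmt.Println(s.isSlow("cell-0000000101"), s.isSlow("cell-0000000102"))
}
```

This trades an allocation and a full map copy per write for lock-free reads, which is a reasonable trade when writes are rare relative to reads.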
And for the iteration, we don't need to iterate over the map. We can instead iterate over a slice of the keys. That slice would be our own (a slice is really metadata) with a backing array that uses the same immutable strings that are used for the map's keys:

```go
// MaxLag returns the max of all the last replication lag values seen across all tablets of
// the provided type, excluding ignored tablets.
func (t *Throttler) MaxLag(tabletType topodata.TabletType) uint32 {
	cache := t.maxReplicationLagModule.lagCacheByType(tabletType)
	var maxLag uint32
	for _, key := range maps.Keys(cache.entries) {
```
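A hedged completion of this key-snapshot idea is sketched below, with a hand-rolled key copy in place of `maps.Keys` (the `golang.org/x/exp/maps` version returns a slice, while the Go 1.23 stdlib version returns an iterator) and simplified stand-in types rather than the real `Throttler`/`replicationLagCache`:

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a simplified stand-in for replicationLagCache.
type cache struct {
	mu      sync.Mutex
	entries map[string]uint32
}

// keys snapshots the map's keys under the lock. The slice header is
// ours; the backing array reuses the same immutable strings the map
// uses as keys, so no per-key string copies are made.
func (c *cache) keys() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	ks := make([]string, 0, len(c.entries))
	for k := range c.entries {
		ks = append(ks, k)
	}
	return ks
}

// get reads one entry under the lock; a key deleted since the snapshot
// simply reads back as the zero value.
func (c *cache) get(k string) uint32 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.entries[k]
}

// maxLag walks the snapshot slice, never iterating the live map, so it
// cannot race with concurrent map updates or deletes.
func maxLag(c *cache) (m uint32) {
	for _, k := range c.keys() {
		if v := c.get(k); v > m {
			m = v
		}
	}
	return m
}

func main() {
	c := &cache{entries: map[string]uint32{"a": 2, "b": 8}}
	fmt.Println(maxLag(c))
}
```

The snapshot is slightly stale by construction, which matches the observation later in the thread that these values don't need to be perfectly in sync.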
I played around with it for a bit: https://gist.github.com/mattlord/54fcc8de7b98e23292d69258e29d26f7

The new unit test passes each time, but I'm not 100% sure it's fully working as it should:

```
❯ go test -race -count 1 -timeout 30s -run ^TestThrottlerMaxLag$ vitess.io/vitess/go/vt/throttler
ok  	vitess.io/vitess/go/vt/throttler	2.627s
```

The mutex may in fact be the way to go, certainly if the perf is adequate for you (I think Slack is the biggest user of the txthrottler today). We could probably use a RWMutex instead, but that may not make a meaningful difference here either (I'm not even sure how hot this mutex may get).

Please let me know how your testing goes! If it goes well for you, then that's certainly a good signal. I can come back to this quickly at any time. Thanks!
@mattlord this would work just as well with `atomic.Pointer`s, yes; the values don't need to be perfectly in sync. I'll make this change because it at least partitions whatever locking `atomic` does behind the scenes 👍
> Do we have any idea about perf impact with the mutex?

@mattlord we don't have these numbers. I can write a benchmark to measure the new vs. old code, but I don't think it's possible to keep the racing code, so I've deferred this for now.
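For what it's worth, a throwaway micro-benchmark along these lines could compare the `Mutex`-guarded read path with the `RWMutex` variant mentioned earlier. Everything below is a stand-in sketch, not the Vitess types, and it measures only the hot read loop rather than the full throttler:

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

var (
	mu      sync.Mutex
	rw      sync.RWMutex
	entries = map[string]uint32{"t1": 1, "t2": 7, "t3": 3}
)

// maxLagMutex walks the map under an exclusive lock.
func maxLagMutex() (m uint32) {
	mu.Lock()
	defer mu.Unlock()
	for _, v := range entries {
		if v > m {
			m = v
		}
	}
	return m
}

// maxLagRWMutex walks the map under a shared read lock, so concurrent
// readers don't serialize against each other.
func maxLagRWMutex() (m uint32) {
	rw.RLock()
	defer rw.RUnlock()
	for _, v := range entries {
		if v > m {
			m = v
		}
	}
	return m
}

func BenchmarkMaxLagMutex(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			_ = maxLagMutex()
		}
	})
}

func BenchmarkMaxLagRWMutex(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			_ = maxLagRWMutex()
		}
	})
}

func main() {
	// testing.Benchmark lets this run via `go run` without a _test.go file.
	fmt.Println("mutex   :", testing.Benchmark(BenchmarkMaxLagMutex))
	fmt.Println("rwmutex :", testing.Benchmark(BenchmarkMaxLagRWMutex))
}
```

`RunParallel` exercises the contended case, which is where the two lock types would actually diverge.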
```go
			throttler.MaxLag(tabletType)
		}
	}
}(&wg, ctx, throttler, tabletType)
```
FWIW, you don't have to pass these in as function parameters. The goroutine has access to the variables from the calling goroutine's stack. Although if you want to support older Go versions than is used on main (which is on 1.22.4), specifically 1.21 and older, then you can/should pass in the loop variable tabletType.
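A minimal sketch of the pass-the-loop-variable pattern being discussed, which is safe on Go 1.21 and older where every iteration shares a single loop variable (this is illustrative stand-in code, not the PR's test):

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// launch starts one goroutine per tablet type, passing the loop
// variable in as a parameter. The parameter shadows the loop variable,
// so each goroutine sees its own copy even on Go <= 1.21, where all
// iterations of the loop share one variable.
func launch(types []string) []string {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		seen []string
	)
	for _, tt := range types {
		wg.Add(1)
		go func(tt string) { // per-goroutine copy of tt
			defer wg.Done()
			mu.Lock()
			seen = append(seen, tt)
			mu.Unlock()
		}(tt)
	}
	wg.Wait()
	sort.Strings(seen) // deterministic order for display
	return seen
}

func main() {
	fmt.Println(launch([]string{"PRIMARY", "REPLICA", "RDONLY"}))
}
```

From Go 1.22 on, `for` loops create a fresh variable per iteration, so closing over `tt` directly is also safe there.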
Description

This PR addresses a race discovered in replicationLagModule of go/vt/throttle. The TL;DR is that there are concurrent update/delete()s and concurrent reads/iterations of a map without locking.

This race is causing vttablets with --enable-tx-throttler to crash after 40-100 minutes of running Vitess v15 w/ txthrottler backports from v16-v19.

This output includes the TestThrottlerMaxLag test I added in my first commit to reproduce the race: b83f057

Related Issue(s)

Resolves #16102

Checklist

Deployment Notes