
Fix race in replicationLagModule of go/vt/throttler #16078

Open · wants to merge 1 commit into main

Conversation

timvaillancourt
Contributor

@timvaillancourt commented Jun 6, 2024

Description

This PR addresses a race discovered in the replicationLagModule of go/vt/throttler. The TL;DR: the module updates/deletes entries in a map concurrently with reads/iterations of that same map, without locking.

This race is causing vttablets with --enable-tx-throttler to crash after 40-100 minutes of running Vitess v15 with txthrottler backports from v16-v19:

fatal error: concurrent map iteration and map write
goroutine 455 [running]:
vitess.io/vitess/go/vt/throttler.(*Throttler).MaxLag(0xc000bb4728?, 0xb94630?)
        vitess.io/vitess/go/vt/throttler/throttler.go:236 +0x9c
vitess.io/vitess/go/vt/vttablet/tabletserver/txthrottler.(*txThrottlerStateImpl).updateMaxLag(0xc0009f85a0)
        vitess.io/vitess/go/vt/vttablet/tabletserver/txthrottler/tx_throttler.go:399 +0x1b4
created by vitess.io/vitess/go/vt/vttablet/tabletserver/txthrottler.newTxThrottlerState in goroutine 99
        vitess.io/vitess/go/vt/vttablet/tabletserver/txthrottler/tx_throttler.go:321 +0x4f6

This output includes the TestThrottlerMaxLag test I added in my first commit to reproduce the race: b83f057

Related Issue(s)

Resolves #16102

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • New or modified tests pass consistently locally and on CI
  • Documentation was added or is not required

Deployment Notes

Contributor

vitess-bot bot commented Jun 6, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes); new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test; enhancements and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator.
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot added the NeedsBackportReason, NeedsDescriptionUpdate, NeedsIssue and NeedsWebsiteDocsUpdate labels on Jun 6, 2024
@github-actions bot added this to the v21.0.0 milestone on Jun 6, 2024
@timvaillancourt added the Type: Bug, Type: Regression and Component: VTTablet labels and removed the NeedsDescriptionUpdate, NeedsWebsiteDocsUpdate and NeedsBackportReason labels on Jun 7, 2024

codecov bot commented Jun 11, 2024

Codecov Report

Attention: Patch coverage is 88.46154% with 3 lines in your changes missing coverage. Please review.

Project coverage is 68.24%. Comparing base (2531cd0) to head (f53e089).
Report is 1 commit behind head on main.

Files                                     Patch %   Lines
go/vt/throttler/replication_lag_cache.go  91.30%    2 Missing ⚠️
go/vt/throttler/throttler.go              66.66%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16078      +/-   ##
==========================================
+ Coverage   68.22%   68.24%   +0.02%     
==========================================
  Files        1543     1543              
  Lines      197586   197611      +25     
==========================================
+ Hits       134805   134864      +59     
+ Misses      62781    62747      -34     


@timvaillancourt marked this pull request as ready for review June 11, 2024 00:44
@timvaillancourt removed the NeedsIssue and Type: Regression labels on Jun 11, 2024
@@ -76,9 +82,35 @@ func (c *replicationLagCache) add(r replicationLagRecord) {
entry.add(r)
}

// maxLag returns the maximum replication lag for the entries in cache.
func (c *replicationLagCache) maxLag() (maxLag uint32) {
Contributor Author

@timvaillancourt Jun 11, 2024


Moved this, mostly unchanged, from throttler.go (.MaxLag(...)) so that we no longer hold the mu sync.Mutex of a different struct.

.MaxLag() now calls this func

}

-	return maxLag
+	return cache.maxLag()
Contributor Author


Moved into replicationLagCache

@timvaillancourt
Contributor Author

timvaillancourt commented Jun 11, 2024

I think I figured out why we only see this on v15 + v16-v19 txthrottler backports:

This txthrottler regression increased the rate of healthcheck streams, and thus the rate of .add() calls to replicationLagCache

@timvaillancourt
Contributor Author

Squashed commits to make this cherry-pick-able

@mattlord
Contributor

mattlord commented Jun 11, 2024

Squashed commits to make this cherry-pick-able

FWIW, we do squash and merge by default so you can always cherry-pick the squashed commit that is made to the upstream branch.

@mattlord self-requested a review June 11, 2024 15:41
@timvaillancourt
Contributor Author

Squashed commits to make this cherry-pick-able

FWIW, we do squash and merge by default so you can always cherry-pick the squashed commit that is made to the upstream branch.

Right 👍. My goal here is to cherry-pick this to slackhq pre-review/merge, so a squash makes that just one commit

@@ -30,6 +31,8 @@ type replicationLagCache struct {
// The map key is replicationLagRecord.LegacyTabletStats.Key.
entries map[string]*replicationLagHistory

mu sync.Mutex
Contributor

@mattlord Jun 11, 2024


I wonder how important it is that these two stay in perfect sync?

	slowReplicas map[string]bool
	ignoredSlowReplicasInARow map[string]bool

I'm wondering if we couldn't make these atomics instead:

slowReplicas atomic.Pointer[map[string]bool]
ignoredSlowReplicasInARow atomic.Pointer[map[string]bool]

Do we have any idea about perf impact with the mutex? I would think that could potentially become pretty hot, but maybe not. If all we care about is eliminating the race conditions, however, then using atomics instead is an option too.

Contributor

@mattlord Jun 11, 2024


And for the iteration, we don't need to iterate over the map. We can instead iterate over a slice of the keys. That slice would be our own (a slice is really metadata) with a backing array that is using the same immutable strings that are used for the map's keys:

// MaxLag returns the max of all the last replication lag values seen across all tablets of
// the provided type, excluding ignored tablets.
func (t *Throttler) MaxLag(tabletType topodata.TabletType) uint32 {
	cache := t.maxReplicationLagModule.lagCacheByType(tabletType)

	var maxLag uint32

	for _, key := range maps.Keys(cache.entries) {

Contributor

@mattlord Jun 12, 2024


I played around with it for a bit: https://gist.github.com/mattlord/54fcc8de7b98e23292d69258e29d26f7

The new unit test passes each time but I'm not 100% sure it's fully working as it should:

❯ go test -race -count 1 -timeout 30s -run ^TestThrottlerMaxLag$ vitess.io/vitess/go/vt/throttler
ok  	vitess.io/vitess/go/vt/throttler	2.627s

The mutex may in fact be the way to go, certainly if the perf is adequate for you (I think Slack is the biggest user of the txthrottler today). We could probably use an RWMutex instead, but that may not make a meaningful difference here either (I'm not even sure how hot this mutex may get).

Please let me know how your testing goes! If it goes well for you then that's certainly a good signal. I can come back to this quickly at any time. Thanks!

Contributor Author


@mattlord this would work just as well with atomic.Pointers, yes; the values don't need to be perfectly in sync

I'll make this change because it at least partitions whatever locking atomic does behind the scenes 👍

Contributor Author

@timvaillancourt Jun 12, 2024


Do we have any idea about perf impact with the mutex?

@mattlord we don't have these numbers. I can write a benchmark to measure the new vs. old code, but I don't think it's possible to keep the racing code, so I've deferred this for now

throttler.MaxLag(tabletType)
}
}
}(&wg, ctx, throttler, tabletType)
Contributor

@mattlord Jun 12, 2024


FWIW, you don't have to pass these in as function parameters. The goroutine has access to the variables from the calling goroutine's stack. Although if you want to support older Go versions than is used on main (which is on 1.22.4), specifically 1.21 and older, then you can/should pass in the loop variable tabletType.

Development

Successfully merging this pull request may close these issues.

Bug Report: --enable-tx-throttler can lead to crash due to race in go/vt/throttle