
etcd worker tick stuck up to 2 minutes periodically when cdc can't connect to Kafka server #11340

Open
asddongmen opened this issue Jun 24, 2024 · 3 comments · May be fixed by #11339
Assignees: asddongmen
Labels: affects-6.5, affects-7.1, affects-7.5, affects-8.1, area/ticdc (Issues or PRs related to TiCDC), found/customer (Customers have encountered this bug), severity/major (This is a major bug), type/bug (This is a bug)

Comments

@asddongmen (Contributor)

What did you do?

  1. Create a changefeed that sinks to Kafka.
  2. Inject a network partition between CDC and Kafka.

What did you expect to see?

The changefeed stalls, but the processor tick is not blocked.

What did you see instead?

[2024/06/21 19:03:54.617 +08:00] [WARN] [etcd_worker.go:293] ["EtcdWorker reactor tick took too long"] [duration=50.822188739s] [role=processor]
[2024/06/21 20:04:52.975 +08:00] [WARN] [processor.go:540] ["processor tick took too long"] [namespace=default] [changefeed=kafka-task] [capture=4556c672-6e93-4054-a705-c598fcf64b84] [duration=2m1.979526934s]


Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

Upstream TiKV version (execute tikv-server --version):

(paste TiKV version here)

TiCDC version (execute cdc version):

master
@asddongmen asddongmen added type/bug This is a bug. area/ticdc Issues or PRs related to TiCDC. labels Jun 24, 2024
@github-actions github-actions bot added this to Need Triage in Question and Bug Reports Jun 24, 2024
@asddongmen asddongmen added the severity/major This is a major bug. label Jun 24, 2024
@asddongmen asddongmen self-assigned this Jun 24, 2024
@asddongmen asddongmen changed the title from "processor tick stuck up to 2 minutes periodically when cdc can't connect to Kafka server" to "etcd worker tick stuck up to 2 minutes periodically when cdc can't connect to Kafka server" Jun 24, 2024
@asddongmen (Contributor, Author)

The severity of this issue is set to major because, when a CDC instance hosts multiple changefeeds, the delay of the other changefeeds is also affected, resulting in a periodic lag of about 2 minutes.

The root cause of this issue is that when a Kafka sink encounters an error, it retries inside the sinkManager, which calls initSinkFactory to attempt to reconnect to Kafka:

sinkFactoryErrors, sinkFactoryVersion := m.initSinkFactory()

Internally, this function holds m.sinkFactory.Lock() until it returns. When Kafka cannot be reached, the call blocks for about 2 minutes, until the underlying client fails with "kafka: client has run out of available brokers to talk to: dial tcp 10.99.219.92:9092: i/o timeout", and only then exits.
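To make the locking pattern concrete, here is a minimal, self-contained Go sketch. The SinkManager field layout, the dialKafka helper, and the two-minute delay are assumptions chosen to mirror the description and logs above, not the actual TiCDC implementation.

```go
// Minimal sketch only: a simplified stand-in for TiCDC's sinkManager,
// showing how initSinkFactory holds the lock for the whole reconnect attempt.
package sinkmanager

import (
	"errors"
	"sync"
	"time"
)

type SinkManager struct {
	sinkFactory struct {
		sync.Mutex
		version uint64
	}
}

// dialKafka is a hypothetical helper standing in for building the Kafka
// sink factory. Under a network partition the client only gives up after
// its dial/retry budget is exhausted, roughly two minutes in the logs above.
func dialKafka() error {
	time.Sleep(2 * time.Minute)
	return errors.New("kafka: client has run out of available brokers to talk to")
}

func (m *SinkManager) initSinkFactory() error {
	m.sinkFactory.Lock()
	defer m.sinkFactory.Unlock() // the lock is held until the function exits

	// While dialKafka blocks, every other caller that needs
	// m.sinkFactory.Lock() (including needsStuckCheck, see below) waits here.
	if err := dialKafka(); err != nil {
		return err
	}
	m.sinkFactory.version++
	return nil
}
```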

Meanwhile, in another goroutine, the processor calls GetTableSpanStatus on each Tick to collect table progress information and advance the overall task progress:

func (p *processor) GetTableSpanStatus(span tablepb.Span, collectStat bool) tablepb.TableStatus {

When the changefeed's downstream is Kafka, GetTableSpanStatus internally calls the sinkManager's needsStuckCheck function:

func (m *SinkManager) needsStuckCheck() bool {

This function tries to acquire m.sinkFactory.Lock(); since that lock is already held by initSinkFactory, the processor Tick blocks inside needsStuckCheck until the lock is released.
Therefore, a relatively simple solution is to stop needsStuckCheck from acquiring m.sinkFactory.Lock(), so that no blocking can occur; a sketch of that direction follows.
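Below is a sketch of that direction, not the actual patch in #11339: whatever needsStuckCheck previously read under m.sinkFactory.Lock() is cached in a field that can be read without the lock. The atomic.Bool field and its name are assumptions of this sketch, reusing the same simplified shape as the sketch above.

```go
// Sketch of the proposed direction only; the actual change in #11339 may
// differ. The value needsStuckCheck needs is cached in an atomically
// readable field instead of being read under the sinkFactory lock.
package sinkmanagerfixed

import (
	"errors"
	"sync"
	"sync/atomic"
	"time"
)

type SinkManager struct {
	sinkFactory struct {
		sync.Mutex
		version uint64
	}
	stuckCheckEnabled atomic.Bool // hypothetical cached flag
}

// dialKafka again stands in for the slow Kafka reconnect (see the previous sketch).
func dialKafka() error {
	time.Sleep(2 * time.Minute)
	return errors.New("kafka: client has run out of available brokers to talk to")
}

// initSinkFactory still holds the lock across the slow reconnect, but it
// refreshes the cached flag so readers never need the lock.
func (m *SinkManager) initSinkFactory() error {
	m.sinkFactory.Lock()
	defer m.sinkFactory.Unlock()
	if err := dialKafka(); err != nil {
		return err
	}
	m.sinkFactory.version++
	m.stuckCheckEnabled.Store(true) // illustrative value only
	return nil
}

// needsStuckCheck no longer acquires m.sinkFactory.Lock(), so a processor
// Tick calling it is never blocked behind a stuck reconnect attempt.
func (m *SinkManager) needsStuckCheck() bool {
	return m.stuckCheckEnabled.Load()
}
```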

@asddongmen (Contributor, Author)

Based on advice from @hicqu: once #9852 is merged, the needsStuckCheck mechanism can be removed entirely, which would also resolve this issue.

@seiya-annie
/found customer

@ti-chi-bot ti-chi-bot bot added the found/customer Customers have encountered this bug. label Jun 26, 2024