
[bug]: Potential Deadlock in HodlInvoice logic #8803

Open · ziggie1984 opened this issue Jun 2, 2024 · 11 comments · May be fixed by #8827

Labels: bug Unintended code behaviour · hodl-invoices · P1 MUST be fixed or reviewed
Milestone: v0.18.1

Comments
@ziggie1984 (Collaborator)

Version: LND 17.3

Received a goroutine dump where the invoice system was in a deadlock. I could not reproduce it locally and could not come up with a reasonable explanation for why the hodlqueue.ChanIn() channel is not available.

Let's first look at the goroutine dump:

goroutine 42960673 [select, 246 minutes]: 1 times: [[42960673]
github.com/lightningnetwork/lnd/invoices.(*InvoiceRegistry).notifyHodlSubscribers(0xc000364240, {0x19b18e0, 0xc073120620})
        github.com/lightningnetwork/lnd/invoices/invoiceregistry.go:1714 +0x271
github.com/lightningnetwork/lnd/invoices.(*InvoiceRegistry).cancelInvoiceImpl(0xc000364240, {0x71, 0xc8, 0x51, 0xbd, 0x11, 0x3a, 0x5e, 0x63, 0xbc, ...}, ...)
        github.com/lightningnetwork/lnd/invoices/invoiceregistry.go:1371 +0x5ac
github.com/lightningnetwork/lnd/invoices.(*InvoiceRegistry).CancelInvoice(...)
        github.com/lightningnetwork/lnd/invoices/invoiceregistry.go:1293
github.com/lightningnetwork/lnd/lnrpc/invoicesrpc.(*Server).CancelInvoice(0xc0073d8370, {0x280?, 0xc01daf56f8?}, 0x1189079?)
        github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/invoices_server.go:309 +0x96
github.com/lightningnetwork/lnd/lnrpc/invoicesrpc._Invoices_CancelInvoice_Handler.func1({0x19c15f8?, 0xc0576bc420?}, {0x154f260?, 0xc02c290c40?})
        github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/invoices_grpc.pb.go:207 +0xcb
...

This is the call that initiates the deadlock: notifyHodlSubscribers cannot write into the hodlqueue channel of the specific channel link, so it holds the hodlSubscriptionsMux forever.

Moreover, cancelInvoiceImpl locks the InvoiceRegistry mutex as well, so both mutexes are held.

for subscriber := range subscribers {
	select {
	case subscriber <- htlcResolution:
	case <-i.quit:
		return
	}

	delete(
		i.hodlReverseSubscriptions[subscriber],
		htlcResolution.CircuitKey(),
	)
}
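
For illustration, a minimal self-contained sketch of the failure pattern (hypothetical names, not lnd's actual types): a notifier that holds a mutex while sending on a channel whose reader has already exited blocks forever, and every later caller that needs the mutex blocks behind it.

package main

import (
	"fmt"
	"sync"
	"time"
)

// notify mimics notifyHodlSubscribers: it holds a mutex while writing to a
// subscriber channel. If the subscriber's reader goroutine has exited, the
// send blocks forever and the mutex is never released.
func notify(mu *sync.Mutex, subscriber chan int, quit chan struct{}) {
	mu.Lock()
	defer mu.Unlock()

	select {
	case subscriber <- 1: // Blocks: nobody reads subscriber anymore.
	case <-quit: // Only a registry shutdown could unblock us.
	}
}

func main() {
	var mu sync.Mutex
	subscriber := make(chan int) // Unbuffered, like the hodl queue input.
	quit := make(chan struct{})  // Never closed in this example.

	// The "link" that was supposed to drain subscriber has already
	// stopped, so no reader exists.
	go notify(&mu, subscriber, quit)

	time.Sleep(100 * time.Millisecond)

	// Any other caller that needs the mutex now blocks as well.
	done := make(chan struct{})
	go func() {
		mu.Lock()
		defer mu.Unlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("acquired mutex")
	case <-time.After(time.Second):
		fmt.Println("deadlocked: mutex held by blocked notifier")
	}
}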

So somehow the hodlqueue channel is never read, which can only happen if the concurrent queue was stopped; otherwise the incoming channel is always read in a loop:

lnd/queue/queue.go

Lines 62 to 97 in 613bfc0

for {
	nextElement := cq.overflow.Front()
	if nextElement == nil {
		// Overflow queue is empty so incoming items can be pushed
		// directly to the output channel. If output channel is full
		// though, push to overflow.
		select {
		case item, ok := <-cq.chanIn:
			if !ok {
				break readLoop
			}

			select {
			case cq.chanOut <- item:
				// Optimistically push directly to chanOut
			default:
				cq.overflow.PushBack(item)
			}

		case <-cq.quit:
			return
		}
	} else {
		// Overflow queue is not empty, so any new items get pushed to
		// the back to preserve order.
		select {
		case item, ok := <-cq.chanIn:
			if !ok {
				break readLoop
			}
			cq.overflow.PushBack(item)

		case cq.chanOut <- nextElement.Value:
			cq.overflow.Remove(nextElement)

		case <-cq.quit:
			return
		}
	}
}

So I am waiting for additional goroutine dumps from the noderunner to verify this behaviour and investigate further.

@ziggie1984 ziggie1984 added bug Unintended code behaviour needs triage labels Jun 2, 2024
@ziggie1984 (Collaborator, Author)

Goroutine Dump:
lnd_issue_8803_goroutine_dump_1.txt

@Roasbeef (Member) commented Jun 3, 2024

Is this present in 0.18?

@ziggie1984 (Collaborator, Author)

This node runner still runs 17.3 and will update to 18.0 soon, so we don't know yet, but he will provide the necessary information.

@ziggie1984 (Collaborator, Author)

Analysing further, I think we have a race condition which we need to guard against:

In the peerTerminationWatcher we remove the link, hence stopping the channel link (RemoveLink), without making sure that it's not in the process of processing an HTLC (a hodl HTLC, for example):

lnd/server.go

Lines 4081 to 4083 in a832371

for _, link := range links {
	s.htlcSwitch.RemoveLink(link.ChanID())
}

Before doing that, we should probably make sure that the htlcManager is shut down correctly.

ec55831#diff-9926ec0b86d09ad93deb44807ccee9a52ed42535007f96ae497478c07b76843aL1449-L1466

The above code changed with Release 18, so this bug might not be part of the code base anymore.

@ziggie1984 (Collaborator, Author) commented Jun 5, 2024

Though I don't see that we make sure our htlcManager is not in the process of handling a message (and hence accepting a new HTLC) when shutting down the link, which switches off the ConcurrentQueue.

I think we need to make sure the htlcManager for a channel link is in a clean state before stopping the channel link.

Maybe we need to signal the quit channel before stopping the hodlqueue (and wait until the htlcManager goroutine has stopped) so we don't run into this race condition.

lnd/htlcswitch/link.go

Lines 1443 to 1445 in a832371

case <-l.quit:
	return
}

lnd/htlcswitch/link.go

Lines 572 to 574 in a832371

l.hodlQueue.Stop()
close(l.quit)

Otherwise there might be a race condition where we try to process an HTLC, the peer suddenly disconnects, and we stop the hodlqueue, never again reading from hodlqueue.ChanIn().

what do you think @ProofOfKeags ?

We might potentially reintroduce the l.shutdownRequest channel to trigger a clean shutdown of the htlcManager?
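
To make the suggested ordering concrete, here is a rough, self-contained sketch (toy types, not the actual link code) of signalling quit first, waiting for the htlcManager goroutine to exit, and only then tearing down the hodl queue:

package main

import (
	"fmt"
	"sync"
	"time"
)

// linkSketch is a toy stand-in for the channel link; the field names echo
// the real ones (quit, wg, hodlQueue) but the type itself is hypothetical.
type linkSketch struct {
	quit      chan struct{}
	wg        sync.WaitGroup
	hodlQueue chan interface{} // stand-in for the ConcurrentQueue input
}

// htlcManager is the long-running goroutine that drains the hodl queue.
func (l *linkSketch) htlcManager() {
	defer l.wg.Done()
	for {
		select {
		case res := <-l.hodlQueue:
			fmt.Println("processed hodl resolution:", res)
		case <-l.quit:
			return
		}
	}
}

// stop shows the ordering suggested above: signal quit, wait for the
// htlcManager to exit, and only then dispose of the queue, so no goroutine
// is left that could hand its input channel to the invoice registry.
func (l *linkSketch) stop() {
	close(l.quit)
	l.wg.Wait()
	// Safe to stop the hodl queue here: nothing can write to it anymore.
}

func main() {
	l := &linkSketch{
		quit:      make(chan struct{}),
		hodlQueue: make(chan interface{}, 1),
	}
	l.wg.Add(1)
	go l.htlcManager()

	l.hodlQueue <- "resolution"
	time.Sleep(50 * time.Millisecond)
	l.stop()
	fmt.Println("link stopped cleanly")
}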

@saubyk saubyk added the P2 should be fixed if one has time label Jun 5, 2024
@ProofOfKeags (Collaborator)

Oof. Yeah if there is any necessary cleanup that the channelLink has to perform then we should be doing that in the post loop section of the htlcManager thread. I would need to take a closer look at how the hodlQueue works. I'm somewhat concerned that we do direct channel exposure via NotifyExitHopHtlc.

From what I can tell though, I can see a bit of an architectural crack wherein we leak out the hodlQueue go-channel all the way into the invoice registry. The problem with that is that this is essentially a dangling reference that we cannot detect has been destroyed in that context.

I think the "correct" fix of this will have to involve either preventing that go-channel from leaking directly, or also attaching with it some mechanism for being able to tell when the resource dies (a copy of the link's quit channel).
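
As a rough sketch of the second option (hypothetical names, not the registry's actual types), each hodl subscription could carry a copy of the link's quit channel so the notifier can skip a dead link instead of blocking on it:

package main

import "fmt"

// hodlSubscriberSketch is a hypothetical subscription record: the link's
// hodl queue input plus a copy of the link's quit channel.
type hodlSubscriberSketch struct {
	resolutions chan interface{} // stand-in for hodlQueue.ChanIn()
	linkQuit    chan struct{}    // closed when the link shuts down
}

// notifySketch delivers a resolution to each subscriber, but gives up on a
// subscriber whose link has already quit instead of blocking forever.
func notifySketch(subs []hodlSubscriberSketch, res interface{},
	registryQuit chan struct{}) {

	for _, sub := range subs {
		select {
		case sub.resolutions <- res:
			fmt.Println("delivered to live link")
		case <-sub.linkQuit:
			fmt.Println("link is gone, skipping subscriber")
		case <-registryQuit:
			return
		}
	}
}

func main() {
	dead := hodlSubscriberSketch{
		resolutions: make(chan interface{}), // nobody reads this
		linkQuit:    make(chan struct{}),
	}
	close(dead.linkQuit) // the link has already shut down

	notifySketch([]hodlSubscriberSketch{dead}, "htlc resolution",
		make(chan struct{}))
}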

@saubyk saubyk added P1 MUST be fixed or reviewed and removed P2 should be fixed if one has time labels Jun 6, 2024
@saubyk saubyk added this to the v0.18.1 milestone Jun 6, 2024
@ziggie1984 (Collaborator, Author)

Problem confirmed with LND 18.0, working on a fix.
lnd_issue_8803_goroutine_dump_4_lnd_18.txt

@ziggie1984 (Collaborator, Author)

I think the "correct" fix of this will have to involve either preventing that go-channel from leaking directly, or also attaching with it some mechanism for being able to tell when the resource dies (a copy of the link's quit channel).

Thank you for the ideas. I will try the first design, just adding the link's quit channel to the subscriber, but maybe we need to really isolate the concurrent queue behind Queue and Dequeue methods which guard reads and writes against the above behaviour, though I think that might be more work because we use the concurrent queue not just for hodl invoices. Let's discuss based on the first draft.

@ProofOfKeags (Collaborator)

maybe we need to really isolate the concurrent queue behind Queue and Dequeue methods which guard reads and writes against the above behaviour, though I think that might be more work because we use the concurrent queue not just for hodl invoices. Let's discuss based on the first draft.

Big Concept ACK here. I think the point is that reading from a pipe whose writers have exited, or writing to a pipe whose readers have exited, should (ideally) not deadlock but instead cause some sort of "exception" on the side that's still live. If you can solve this at the level of the concurrent queue itself, that would be a huge win.
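
A minimal sketch of what such a guarded queue API could look like (hypothetical, not lnd's actual queue package): Enqueue and Dequeue select against the queue's quit channel and return an error once the queue has stopped, so the live side sees a failure instead of a deadlock.

package main

import (
	"errors"
	"fmt"
)

// ErrQueueStopped is returned instead of blocking once the queue has shut
// down. Hypothetical API.
var ErrQueueStopped = errors.New("queue stopped")

// guardedQueue wraps a channel with Enqueue/Dequeue methods that fail fast
// when the queue is no longer live, instead of letting callers block on a
// channel that will never be drained.
type guardedQueue struct {
	items chan interface{}
	quit  chan struct{}
}

func newGuardedQueue(size int) *guardedQueue {
	return &guardedQueue{
		items: make(chan interface{}, size),
		quit:  make(chan struct{}),
	}
}

func (q *guardedQueue) Enqueue(item interface{}) error {
	select {
	case q.items <- item:
		return nil
	case <-q.quit:
		return ErrQueueStopped
	}
}

func (q *guardedQueue) Dequeue() (interface{}, error) {
	select {
	case item := <-q.items:
		return item, nil
	case <-q.quit:
		return nil, ErrQueueStopped
	}
}

func (q *guardedQueue) Stop() { close(q.quit) }

func main() {
	q := newGuardedQueue(0) // unbuffered, like a hodl queue with no reader
	q.Stop()

	// Writing after shutdown surfaces an error on the live side instead
	// of deadlocking the caller.
	if err := q.Enqueue("htlc resolution"); err != nil {
		fmt.Println("enqueue failed:", err)
	}
}

A production version would need more care around the race between Stop and a concurrent Enqueue, but the point stands: the failure surfaces as an error on the caller's side rather than a blocked goroutine.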

@ziggie1984 ziggie1984 linked a pull request Jun 11, 2024 that will close this issue
@Roasbeef (Member)

Also see this issue, back when I fixed a concurrency issue here, dug deeper, and saw many more: #8023

@ProofOfKeags (Collaborator) commented Jun 14, 2024

In the process of investigating some of the changes I've been wanting to make, it occurs to me that the existence of the hodlQueue within the link itself is another symptom of a fundamental architectural crack I noticed a few months ago. I think patching over it to hunt down this deadlock is fine, but I want to take a moment to further educate on the issue.

The following is my opinion but it will be presented as fact so as to make as direct of a case as possible.


The main problem that sits at the bottom of many of the issues we are encountering is an improper division of responsibility between the link and the switch.

Right now we do exit hop processing in the link itself as opposed to the switch. This is a mistake for a number of reasons. First and foremost it means we have to duplicate traffic controls between the links and the switch. Further it requires invoice registry references in every channel link which will cause synchronization issues to proliferate.

Instead what we should be doing is as soon as we have received a revocation from our counterparty we take the batch of remote updates, package them up and hand them to the switch. From there the switch can make the determination as to what to do with it. If we are the recipient and it is not a hodl invoice we would immediately drop an htlc packet back into the link and we are on our merry way. If it is a hodl invoice, the switch can await the confirming signal from the invoice registry before doing so.

There are many more problems that this improper division of responsibility creates but I will save those for a different forum. Changing this architectural issue is a considerable effort and not one we should undertake to solve immediate production issues so in my view, we should fix it in whatever way can work.
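
For concreteness, a very rough sketch of the flow described above, with entirely hypothetical types and names (none of this exists in lnd):

package main

import "fmt"

// Hypothetical types sketching the proposed division of responsibility.
type remoteUpdate struct {
	exitHop bool
	hodl    bool
}

type switchSketch struct{}

func (s *switchSketch) forward(u remoteUpdate)    { fmt.Println("forward to outgoing link") }
func (s *switchSketch) settleBack(u remoteUpdate) { fmt.Println("settle, drop packet back to link") }
func (s *switchSketch) awaitRegistry(u remoteUpdate) {
	fmt.Println("wait for invoice registry signal, then resolve")
}

// processRemoteUpdates: after a revocation, the link hands the batch of
// remote updates to the switch, and the switch (not the link) decides how
// each update is handled.
func (s *switchSketch) processRemoteUpdates(batch []remoteUpdate) {
	for _, u := range batch {
		switch {
		case !u.exitHop:
			s.forward(u)
		case !u.hodl:
			s.settleBack(u)
		default:
			s.awaitRegistry(u)
		}
	}
}

func main() {
	var s switchSketch
	s.processRemoteUpdates([]remoteUpdate{
		{exitHop: false},
		{exitHop: true, hodl: false},
		{exitHop: true, hodl: true},
	})
}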

I include this detail because I will be making an ongoing case that we need to change the responsibility division between the link and the switch and I keep finding odd side-effects of the current architecture and want to document them.

CC @Roasbeef @yyforyongyu @saubyk
