
FRAME_TYPE_STREAMS_BLOCKED ignored? #1917

Open
larseggert opened this issue May 30, 2024 · 11 comments · May be fixed by #1929

Comments

@larseggert (Collaborator) commented May 30, 2024

This code

```rust
Frame::StreamsBlocked { .. } => {
    stats.streams_blocked += 1;
    // We send an update every time we retire a stream. There is no need to
    // trigger flow updates here.
}
```

says that there is no need to trigger sending a FRAME_TYPE_MAX_STREAMS when we get a FRAME_TYPE_STREAMS_BLOCKED. I wonder whether that is actually correct in the case where the FRAME_TYPE_MAX_STREAMS triggered by retiring a stream was lost.
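One way to make the handler robust against a lost MAX_STREAMS would be to re-queue the frame whenever the peer's STREAMS_BLOCKED shows it has not seen our latest limit. This is a minimal sketch with a hypothetical `LocalStreamLimit` type, not neqo's actual API:

```rust
// Hypothetical sketch: on STREAMS_BLOCKED, re-arm a MAX_STREAMS frame
// if the peer's reported limit is below the limit we have granted,
// which suggests our previous MAX_STREAMS frame was lost.
struct LocalStreamLimit {
    max_streams: u64,    // limit we have granted to the peer
    frame_pending: bool, // whether a MAX_STREAMS frame is queued for sending
}

impl LocalStreamLimit {
    fn handle_streams_blocked(&mut self, peer_limit: u64) {
        if peer_limit < self.max_streams {
            // The peer is blocked at a stale limit: queue a fresh update.
            self.frame_pending = true;
        }
    }
}

fn main() {
    let mut limit = LocalStreamLimit { max_streams: 32, frame_pending: false };
    // Peer still believes the limit is 16 -> re-send MAX_STREAMS.
    limit.handle_streams_blocked(16);
    assert!(limit.frame_pending);

    // Peer already knows the current limit -> nothing to re-send.
    let mut current = LocalStreamLimit { max_streams: 32, frame_pending: false };
    current.handle_streams_blocked(32);
    assert!(!current.frame_pending);
    println!("ok");
}
```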

The RPS bench failure for #1903 seems to stem from the client timing out while waiting for FRAME_TYPE_MAX_STREAMS updates that end up in PMTUD probes which are repeatedly dropped. That idle timeout sometimes happens before we would retransmit the FRAME_TYPE_MAX_STREAMS.

@larseggert (Collaborator, Author) commented May 31, 2024

I'm attaching a client and server log where this seems to happen, and a pcap and TLS key log file: Archive.zip

The server receives a StreamsBlocked in pn=29 from the client. The server then sends a MaxStreams in pn=28 which ends up in a PMTU probe that is too big and hence is lost. The server gets another StreamsBlocked from the client in pn=31, but ignores it. There is a PING-ACK exchange, but that doesn't make the server retransmit the MaxStreams and hence the client hits the idle timeout.

@KershawChang (Collaborator)

> I'm attaching a client and server log where this seems to happen, and a pcap and TLS key log file: Archive.zip
>
> The server receives a StreamsBlocked in pn=29 from the client. The server then sends a MaxStreams in pn=28 which ends up in a PMTU probe that is too big and hence is lost. The server gets another StreamsBlocked from the client in pn=31, but ignores it. There is a PING-ACK exchange, but that doesn't make the server retransmit the MaxStreams and hence the client hits the idle timeout.

I think the reason the server doesn't retransmit the MaxStreams frame is that the previous call to process returns Output::None. At this point, the server has no timeout timer armed and is just waiting for the client to send something. However, the client is blocked because it's waiting for the MaxStreams frame. In the end, the client gives up due to the idle timeout.

@larseggert (Collaborator, Author) commented Jun 17, 2024

Why is the call to process returning Output::None?

I looked into this a bit, and it seems like the send profile limit is 255, i.e., `SendProfile::ack_only()` is true, so we allow no frames other than ACKs to be sent.

@KershawChang (Collaborator)

> Why is the call to process returning Output::None?
>
> I looked into this a bit, and it seems like the send profile limit is 255, i.e., `SendProfile::ack_only()` is true, so we allow no frames other than ACKs to be sent.

Because `sender.cwnd_avail()` is 0 at that point.

This is the lost packet that contains the MaxStreams frame:

```
0s376ms DEBUG Sending PMTUD probe of size 16383, count 3
0s377ms DEBUG [Server 98ed348f599f66344de7878d9c] pn=28 type=Short pri-path: 127.0.0.1:4433->127.0.0.1:52160 IpTos(Cs0, Ect0) len 16355
  TX -> Ping
  TX -> MaxStreams { stream_type: BiDi, maximum_streams: 32 }
  TX -> Padding { len: 16334 }
0s377ms DEBUG [LossRecovery] packet ap-28 sent
```

Then, `process()` returns `Output::None` because `cwnd_avail()` is 0:

```
sender.cwnd_avail()=0
0s391ms DEBUG [Server 98ed348f599f66344de7878d9c] output_path send_profile SendProfile { limit: 255, pto: None, probe: (), paced: false }
0s391ms DEBUG Building Short dcid
0s391ms DEBUG TX blocked, profile=SendProfile { limit: 255, pto: None, probe: (), paced: false }
0s391ms DEBUG [Server 98ed348f599f66344de7878d9c] delay duration 531.65µs
```

In the end, the server stalls.
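The stall described above can be modeled roughly like this. The types and the 255-byte threshold are assumptions taken from the log, not neqo's real `SendProfile`: when the available congestion window is zero, only an ACK-sized budget is granted, so a MaxStreams frame cannot be (re)transmitted.

```rust
// Rough sketch (hypothetical types, not neqo's implementation) of why a
// zero congestion window yields an ACK-only send profile.
struct SendProfile {
    limit: usize,
}

impl SendProfile {
    // Assumed ACK-only budget, matching the `limit: 255` seen in the log.
    const ACK_ONLY_LIMIT: usize = 255;

    fn new(cwnd_avail: usize, mtu: usize) -> Self {
        if cwnd_avail == 0 {
            // No window left: permit only a minimal ACK-sized packet.
            SendProfile { limit: Self::ACK_ONLY_LIMIT }
        } else {
            SendProfile { limit: cwnd_avail.min(mtu) }
        }
    }

    fn ack_only(&self) -> bool {
        self.limit <= Self::ACK_ONLY_LIMIT
    }
}

fn main() {
    // cwnd exhausted -> ACK-only, so MaxStreams cannot be sent.
    assert!(SendProfile::new(0, 1500).ack_only());
    // cwnd available -> normal sending is allowed.
    assert!(!SendProfile::new(12000, 1500).ack_only());
    println!("ok");
}
```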

@larseggert (Collaborator, Author)

I'm probably missing something, but shouldn't the server avoid stalling here and instead regain some cwnd after a PTO?

@KershawChang (Collaborator)

OK, I can confirm that the `Output::None` is from here.

I think this is a bug on the server side because the underlying connection actually returns a delay, but the server code decides to return `Output::None`.
I'll keep digging.

@KershawChang (Collaborator)

I think the problem is in this function:

```rust
pub fn process(&mut self, dgram: Option<&Datagram>, now: Instant) -> Output {
    if self.wake_at.map_or(false, |c| c <= now) {
        self.wake_at = None;
    }

    dgram
        .and_then(|d| self.process_input(d, now))
        .or_else(|| self.process_next_output(now))
        .map(|d| {
            qtrace!([self], "Send packet: {:?}", d);
            Output::Datagram(d)
        })
        .or_else(|| self.wake_at.take().map(|c| Output::Callback(c - now)))
        .unwrap_or_else(|| {
            qtrace!([self], "Go dormant");
            Output::None
        })
}
```

This is what happened:

1. `neqo_transport::server::Process` is called by `neqo_http3::server::Process` here.
2. `self.wake_at` is set to the next delay.
3. `self.wake_at` is taken and set to `None` at this line.
4. `neqo_transport::server::Process` is called by `neqo_http3::server::Process` again here.
5. `neqo_transport::server::Process` returns `Output::None` because `self.wake_at` is `None`.
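The failure mode in the steps above can be reproduced with a tiny model (hypothetical types, not neqo's real `Server`): `take()` consumes the pending deadline, so a second `process` call with nothing to send goes dormant even though a callback is still owed. Reading the deadline without clearing it keeps the call idempotent:

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum Output {
    Callback(Duration),
    None,
}

struct Server {
    wake_at: Option<Instant>,
}

impl Server {
    // Buggy variant: `take()` consumes the deadline, so only the first
    // call reports the callback.
    fn process_buggy(&mut self, now: Instant) -> Output {
        self.wake_at
            .take()
            .map(|c| Output::Callback(c - now))
            .unwrap_or(Output::None)
    }

    // Fixed variant: read the deadline without clearing it, so repeated
    // calls keep returning the callback until it actually fires.
    fn process_fixed(&mut self, now: Instant) -> Output {
        self.wake_at
            .map(|c| Output::Callback(c - now))
            .unwrap_or(Output::None)
    }
}

fn main() {
    let now = Instant::now();
    let deadline = now + Duration::from_millis(10);

    let mut buggy = Server { wake_at: Some(deadline) };
    assert!(matches!(buggy.process_buggy(now), Output::Callback(_)));
    // Second call: the deadline was consumed, so the server goes dormant
    // even though the connection still wants a callback.
    assert_eq!(buggy.process_buggy(now), Output::None);

    let mut fixed = Server { wake_at: Some(deadline) };
    assert!(matches!(fixed.process_fixed(now), Output::Callback(_)));
    assert!(matches!(fixed.process_fixed(now), Output::Callback(_)));
    println!("ok");
}
```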

@larseggert (Collaborator, Author)

Any idea for a fix?

@mxinden (Collaborator) commented Jun 17, 2024

Sharing some preliminary thoughts. Taking a deeper look now.

I see how #1926 fixes the issue. That said, if a connection returns an `Output::Callback` and later wants to invalidate that timeout by returning `Output::None`, the previous `Callback` would persist until its timeout value is in the past.

> neqo_transport::server::Process returns Output::None because self.wake_at is None.

Shouldn't the `Connection` of the server ideally again return `Output::Callback`? In other words, shouldn't `Connection::process_output` be idempotent in the sense that it always returns `Output::Callback` when it needs to be called back?

@KershawChang (Collaborator)

It seems that `Connection::process_output` is not called again because `self.waiting` is empty.

@mxinden (Collaborator) commented Jun 17, 2024

> It seems that `Connection::process_output` is not called again because `self.waiting` is empty.

Great catch.

As discussed in the call just now, I will test out an alternative to #1926 using your finding. Will hopefully submit a pull request in a bit.
