Feature #4362

Feature #1624: Design and Implement Congestion Control

Congestion detection by observing socket queues

Added by Klaus Schneider almost 2 years ago. Updated over 1 year ago.

Status: Closed
Priority: High
Assignee: Eric Newberry
Category: Faces
Target version: v0.7
% Done: 100%

Description

Routers should be able to detect congestion by measuring their queue (either backlog size, or queuing delay).

This congestion measurement should then be used to signal consumers by putting congestion marks into packets (#3797).

In our Hackathon project we figured out that we can use the ioctl command TIOCOUTQ to retrieve the send buffer backlog size of all three socket types: TCP, UDP, and Unix sockets.

The current design includes:

  • Check if the buffer occupancy is above THRESHOLD = MAX_BUF_SIZE / 3 (defaults to 200 KB / 3)
  • Use a default INTERVAL as a rough estimate of the RTT (default: 100 ms)
  • If the buffer stays above THRESHOLD for at least one INTERVAL -> insert the first congestion mark into a packet.
  • As long as the buffer stays above THRESHOLD, mark subsequent packets at a decreasing interval, as described in the CoDel paper (start at INTERVAL, then shrink it in proportion to 1/sqrt(drops)).
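
The steps above can be sketched as follows. This is a minimal illustration of the described logic, not NFD code: the class and member names are invented, and time is passed in explicitly (in milliseconds) to keep the logic easy to test.

```cpp
#include <cmath>
#include <cstdint>

// Minimal sketch of the marking logic described above. In NFD, `nowMs`
// would come from a steady clock and `queueBytes` from the socket query.
class CongestionDetector
{
public:
  static constexpr uint64_t THRESHOLD = 200000 / 3; // bytes, MAX_BUF_SIZE / 3
  static constexpr uint64_t INTERVAL_MS = 100;      // rough RTT estimate

  // Returns true if the packet being sent now should carry a congestion mark.
  bool
  shouldMark(uint64_t queueBytes, uint64_t nowMs)
  {
    if (queueBytes <= THRESHOLD) {
      // Queue drained below the threshold: leave the marking state.
      m_firstAboveTimeMs = 0;
      m_nextMarkTimeMs = 0;
      m_drops = 0;
      return false;
    }
    if (m_firstAboveTimeMs == 0) {
      // Queue just crossed the threshold; start counting one INTERVAL.
      m_firstAboveTimeMs = nowMs;
      return false;
    }
    if (m_drops == 0) {
      // Mark the first packet once the queue has been above the
      // threshold for at least one full INTERVAL.
      if (nowMs - m_firstAboveTimeMs >= INTERVAL_MS) {
        m_drops = 1;
        m_nextMarkTimeMs = nowMs + nextIntervalMs();
        return true;
      }
      return false;
    }
    // In the marking state: mark at a decreasing interval,
    // INTERVAL / sqrt(drops), as in CoDel.
    if (nowMs >= m_nextMarkTimeMs) {
      ++m_drops;
      m_nextMarkTimeMs += nextIntervalMs();
      return true;
    }
    return false;
  }

private:
  uint64_t
  nextIntervalMs() const
  {
    return static_cast<uint64_t>(INTERVAL_MS / std::sqrt(static_cast<double>(m_drops)));
  }

  uint64_t m_firstAboveTimeMs = 0;
  uint64_t m_nextMarkTimeMs = 0;
  uint64_t m_drops = 0;
};
```
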
Attachment: retreat_cc_final_2017.pdf (249 KB), added by Klaus Schneider, 12/18/2017 02:45 PM

Related issues

Related to NFD - Bug #4407: Large queuing delay in stream-based faces (Closed)

History

#1 Updated by Klaus Schneider almost 2 years ago

  • Description updated (diff)

#2 Updated by Klaus Schneider almost 2 years ago

  • Related to Feature #1624: Design and Implement Congestion Control added

#3 Updated by Klaus Schneider almost 2 years ago

Can someone make this a subtask of #1624?

I can't figure out how to do that.

#4 Updated by Klaus Schneider almost 2 years ago

  • Parent task set to #1624

Never mind, I figured it out :)

#5 Updated by Davide Pesavento almost 2 years ago

  • Tracker changed from Task to Feature
  • Subject changed from Congestion Detection by observing router queues. to Congestion Detection by observing router queues

#6 Updated by Klaus Schneider over 1 year ago

We made some progress on this task on the 5th NDN Hackathon: https://github.com/5th-ndn-hackathon/congestion-control

I attached our slides.

#7 Updated by Klaus Schneider over 1 year ago

  • % Done changed from 0 to 10

#8 Updated by Klaus Schneider over 1 year ago

  • Assignee set to Eric Newberry

#9 Updated by Klaus Schneider over 1 year ago

  • Description updated (diff)

I've updated our current version of the design.

It is very simple, and can later be replaced by a more sophisticated AQM scheme like CoDel.

#10 Updated by Klaus Schneider over 1 year ago

  • Related to Bug #4407: Large queuing delay in stream-based faces added

#11 Updated by Klaus Schneider over 1 year ago

This should also fix #4407.

#12 Updated by Davide Pesavento over 1 year ago

Klaus Schneider wrote:

In our Hackathon project we figured out that we can use the ioctl command TIOCOUTQ to retrieve the send buffer backlog size of all three socket types: TCP, UDP, and Unix sockets.

When is the ioctl called?

#13 Updated by Davide Pesavento over 1 year ago

  • Subject changed from Congestion Detection by observing router queues to Congestion detection by observing socket queues
  • Category changed from Forwarding to Faces
  • Target version set to v0.7

#14 Updated by Klaus Schneider over 1 year ago

Davide Pesavento wrote:

Klaus Schneider wrote:

In our Hackathon project we figured out that we can use the ioctl command TIOCOUTQ to retrieve the send buffer backlog size of all three socket types: TCP, UDP, and Unix sockets.

When is the ioctl called?

The code is here: https://github.com/5th-ndn-hackathon/congestion-control/blob/master/daemon/face/stream-transport.hpp

doSend(Transport::Packet&& packet) calls checkCongestionLevel(packet);

The question of where it should be called is still up for debate.

#15 Updated by Davide Pesavento over 1 year ago

I don't disagree in principle with placing it there. But an ioctl is a system call. What's the performance impact of adding an extra system call for each sent packet?

#16 Updated by Klaus Schneider over 1 year ago

No idea. Do you think the impact is significant compared to all the other NFD overhead?

#17 Updated by Davide Pesavento over 1 year ago

I did a very rough test: that ioctl seems to take less than 1 µs on average, so no need to worry about it.
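
For reference, a micro-benchmark along these lines can be written in a few lines. This is a sketch of the methodology, not the exact test that was run; the function name is invented, and the absolute number depends entirely on the machine.

```cpp
#include <chrono>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

// Average the cost of the queue-length ioctl over many iterations.
// Returns the mean nanoseconds per call, or -1.0 if no socket could be made.
double
averageIoctlNanos(int iterations)
{
  int fd = socket(AF_INET, SOCK_DGRAM, 0);
  if (fd < 0)
    return -1.0;

  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    int queued = 0;
    ioctl(fd, TIOCOUTQ, &queued); // ignore errors; we only time the call
  }
  auto t1 = std::chrono::steady_clock::now();
  close(fd);

  auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
  return ns.count() / static_cast<double>(iterations);
}
```
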

#18 Updated by Klaus Schneider over 1 year ago

  • Description updated (diff)

I added the part about the marking interval to the design. It shouldn't be too complicated: you only need to keep a timestamp "firstTimeAboveLimit" and a counter of drops since the queue went above the limit.

See also https://github.com/schneiderklaus/ns-3-dev/blob/ndnSIM-v2/src/internet/model/codel-queue2.cc

#19 Updated by Eric Newberry over 1 year ago

  • Status changed from New to In Progress

#20 Updated by Eric Newberry over 1 year ago

  • Status changed from In Progress to Code review
  • % Done changed from 10 to 70

#21 Updated by Eric Newberry over 1 year ago

  • % Done changed from 70 to 100

#22 Updated by Eric Newberry over 1 year ago

For reliable transports (like TCP), do we want the queue length to be the total number of bytes not yet sent plus the number of unacked bytes or do we just want to use the total number of unsent bytes? I know how to do the first on both Linux and macOS (the mechanism is different) but I am unsure if the second is possible on macOS.

#23 Updated by Klaus Schneider over 1 year ago

I think we want only the number of packets not sent.

The number of unacked packets (= inflight packets) depends on the propagation delay, which we don't want to mix with the queuing delay.

For example, if we set a threshold equivalent of 5ms total delay (not sent + not acked), then tunnels with 10ms propagation delay would always cause congestion marks.

Thus, it's clear that we want the threshold to only apply to queuing delay (not sent).

Also, isn't it easier to implement "not sent" for both? According to https://android.googlesource.com/kernel/msm.git/+/android-6.0.1_r0.21/include/linux/sockios.h

  • TIOCOUTQ output queue size (not sent + not acked) is Linux-specific.
  • SIOCOUTQNSD output queue size (not sent only) is platform independent.

#24 Updated by Davide Pesavento over 1 year ago

Klaus Schneider wrote:

Also, isn't it easier to implement "not sent" for both? According to https://android.googlesource.com/kernel/msm.git/+/android-6.0.1_r0.21/include/linux/sockios.h

  • TIOCOUTQ output queue size (not sent + not acked) is Linux-specific.
  • SIOCOUTQNSD output queue size (not sent only) is platform independent.

Neither are platform independent. Where did you read that SIOCOUTQNSD is?

TIOCOUTQ is usable only on TTYs on macOS, but there's a viable alternative: the SO_NWRITE socket option. There is no BSD/macOS counterpart to SIOCOUTQNSD as far as I know.

#25 Updated by Klaus Schneider over 1 year ago

Davide Pesavento wrote:

Klaus Schneider wrote:

Also, isn't it easier to implement "not sent" for both? According to https://android.googlesource.com/kernel/msm.git/+/android-6.0.1_r0.21/include/linux/sockios.h

  • TIOCOUTQ output queue size (not sent + not acked) is Linux-specific.
  • SIOCOUTQNSD output queue size (not sent only) is platform independent.

Neither are platform independent. Where did you read that SIOCOUTQNSD is?

Well, TIOCOUTQ was listed under "Linux-specific", so I assumed the rest would be platform-independent.

But you're right, SIOCOUTQNSD isn't platform independent either.

TIOCOUTQ is usable only on TTYs on macOS, but there's a viable alternative: the SO_NWRITE socket option. There is no BSD/macOS counterpart to SIOCOUTQNSD as far as I know.

Yes, "SO_NWRITE" sounds like what we are looking for.

#26 Updated by Eric Newberry over 1 year ago

Klaus Schneider wrote:

Davide Pesavento wrote:

Klaus Schneider wrote:

Also, isn't it easier to implement "not sent" for both? According to https://android.googlesource.com/kernel/msm.git/+/android-6.0.1_r0.21/include/linux/sockios.h

  • TIOCOUTQ output queue size (not sent + not acked) is Linux-specific.
  • SIOCOUTQNSD output queue size (not sent only) is platform independent.

Neither are platform independent. Where did you read that SIOCOUTQNSD is?

Well, TIOCOUTQ was listed under "Linux-specific", so I assumed the rest would be platform-independent.

But you're right, SIOCOUTQNSD isn't platform independent either.

TIOCOUTQ is usable only on TTYs on macOS, but there's a viable alternative: the SO_NWRITE socket option. There is no BSD/macOS counterpart to SIOCOUTQNSD as far as I know.

Yes, "SO_NWRITE" sounds like what we are looking for.

Linux actually doesn't support SO_NWRITE, so we have separate implementations for Linux and macOS: Linux uses SIOCOUTQ via ioctl and macOS uses SO_NWRITE via getsockopt.
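
Sketched out, the two paths might look like this. The function name and error convention are mine; the Linux/macOS split follows the comment above, and (per note-27) SO_NWRITE counts unsent plus unacked bytes, not unsent only.

```cpp
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

#ifdef __linux__
#include <linux/sockios.h> // SIOCOUTQ
#endif

// Illustrative per-platform query of a socket's send queue occupancy:
// SIOCOUTQ via ioctl on Linux, SO_NWRITE via getsockopt on macOS.
// Returns -1 if the value cannot be obtained.
int
getSendQueueBytes(int fd)
{
#if defined(__linux__)
  int queued = 0;
  if (ioctl(fd, SIOCOUTQ, &queued) < 0)
    return -1;
  return queued;
#elif defined(__APPLE__)
  int queued = 0;
  socklen_t len = sizeof(queued);
  // Caveat: SO_NWRITE includes in-flight (unacked) bytes on TCP sockets.
  if (getsockopt(fd, SOL_SOCKET, SO_NWRITE, &queued, &len) < 0)
    return -1;
  return queued;
#else
  (void)fd;
  return -1;
#endif
}
```
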

#27 Updated by Davide Pesavento over 1 year ago

Klaus Schneider wrote:

Yes, "SO_NWRITE" sounds like what we are looking for.

SO_NWRITE is the # of unsent + unacked bytes. So, not good if we want unsent only per note-23.

#28 Updated by Klaus Schneider over 1 year ago

Davide Pesavento wrote:

Klaus Schneider wrote:

Yes, "SO_NWRITE" sounds like what we are looking for.

SO_NWRITE is the # of unsent + unacked bytes. So, not good if we want unsent only per note-23.

Alright, I got it.

Is there a way to count only the unacked bytes? (let's call it X)

Then subtract SO_NWRITE - X?

#29 Updated by Eric Newberry over 1 year ago

Klaus Schneider wrote:

Davide Pesavento wrote:

Klaus Schneider wrote:

Yes, "SO_NWRITE" sounds like what we are looking for.

SO_NWRITE is the # of unsent + unacked bytes. So, not good if we want unsent only per note-23.

Alright, I got it.

Is there a way to count only the unacked bytes? (let's call it X)

Then subtract SO_NWRITE - X?

Not sure if there is, but I would worry about the values changing in the time between the two calls used to obtain them.

#30 Updated by Klaus Schneider over 1 year ago

Well, one fall-back option would be to just use m_sendQueue for macOS + TCP.

I think this solution is fine, since TCP tunnels are less important than UDP tunnels anyway.

#31 Updated by Klaus Schneider over 1 year ago

I wonder whether SO_NWRITE will work as desired in macOS + Unix Domain socket?

#32 Updated by Davide Pesavento over 1 year ago

Klaus Schneider wrote:

Is there a way to count only the unacked bytes? (let's call it X)

Then subtract SO_NWRITE - X?

In theory, there's the TCP_INFO socket option. But:

  1. it's marked as PRIVATE, which means it's not exposed to user-space, but I think we can ignore that.
  2. it's also completely undocumented (I had to look at the xnu kernel source).
  3. the struct tcp_info it returns is huge, copying all of it to user-space for every packet sent is much more expensive than SO_NWRITE. But since we're not marking every packet anyway, maybe we can call getSendQueueLength only when we would mark the packet currently being sent (if the buffer was congested). Would that work?

The advantage of this approach is that struct tcp_info contains both fields:

        u_int32_t       tcpi_snd_sbbytes;       /* bytes in snd buffer including data inflight */
        ...
        u_int64_t       tcpi_txunacked __attribute__((aligned(8)));    /* current number of bytes not acknowledged */

so there shouldn't be any race conditions/inconsistent values.

#33 Updated by Klaus Schneider over 1 year ago

Davide Pesavento wrote:

Klaus Schneider wrote:

Is there a way to count only the unacked bytes? (let's call it X)

Then subtract SO_NWRITE - X?

In theory, there's the TCP_INFO socket option. But:

  1. it's marked as PRIVATE, which means it's not exposed to user-space, but I think we can ignore that.
  2. it's also completely undocumented (I had to look at the xnu kernel source).
  3. the struct tcp_info it returns is huge, copying all of it to user-space for every packet sent is much more expensive than SO_NWRITE. But since we're not marking every packet anyway, maybe we can call getSendQueueLength only when we would mark the packet currently being sent (if the buffer was congested). Would that work?

Well that would work after you entered the congested state (where the queue is above target). But how often would you check the queue size before the first packet is marked?

Also, even in the congested state, your solution would miss the case where the queue runs empty before the next packet is expected to be marked. This is fine in our current code, but would not allow us to extend it to proper CoDel, where there is a minimum interval (100ms) that the queue has to be above target (5ms) to enter the congested state.

I think since TCP tunnels are not that important, we can use the simpler solution of just using m_sendQueue.

The advantage of this approach is that struct tcp_info contains both fields:

        u_int32_t       tcpi_snd_sbbytes;       /* bytes in snd buffer including data inflight */
        ...
        u_int64_t       tcpi_txunacked __attribute__((aligned(8)));    /* current number of bytes not acknowledged */

Why is the not-acked number a uint64 while snd_sbbytes is uint32? Shouldn't snd_sbbytes always be the same or larger?

so there shouldn't be any race conditions/inconsistent values.

#34 Updated by Davide Pesavento over 1 year ago

Klaus Schneider wrote:

I think since TCP tunnels are not that important, we can use the simpler solution of just using m_sendQueue.

Fair enough.

        u_int32_t       tcpi_snd_sbbytes;       /* bytes in snd buffer including data inflight */
        ...
        u_int64_t       tcpi_txunacked __attribute__((aligned(8)));    /* current number of bytes not acknowledged */

Why is the not-acked number a uint64 while snd_sbbytes is uint32? Shouldn't snd_sbbytes always be the same or larger?

I had the same question, no idea. A 32-bit quantity would be more than enough for both values, I guess the second one was extended due to alignment requirements...

#35 Updated by Davide Pesavento over 1 year ago

Klaus Schneider wrote:

I wonder whether SO_NWRITE will work as desired in macOS + Unix Domain socket?

I believe it should. There are no ACKs in Unix sockets, data is written directly to the memory buffer from which the peer is reading. So my guess is that SO_NWRITE simply returns the # of bytes in that buffer.

#36 Updated by Klaus Schneider over 1 year ago

Davide Pesavento wrote:

Klaus Schneider wrote:

I wonder whether SO_NWRITE will work as desired in macOS + Unix Domain socket?

I believe it should. There are no ACKs in Unix sockets, data is written directly to the memory buffer from which the peer is reading. So my guess is that SO_NWRITE simply returns the # of bytes in that buffer.

Sounds good. Let's do that.

So in summary we use:

Linux

  • UDP: TIOCOUTQ
  • TCP: SIOCOUTQNSD + m_sendQueue
  • Unix: SIOCOUTQNSD + m_sendQueue

macOS

  • UDP: SO_NWRITE
  • TCP: m_sendQueue
  • Unix: SO_NWRITE + m_sendQueue

Is that correct?

#37 Updated by Davide Pesavento over 1 year ago

Mostly correct. SIOCOUTQNSD makes sense for TCP sockets only. For Unix sockets, we must use SIOCOUTQ (+ m_sendQueue).

#38 Updated by Eric Newberry over 1 year ago

Since WebSocket runs over TCP, should we also attempt to detect congestion in WebSocketTransport? I found out how to get the underlying Asio socket through websocketpp. However, WebSocketTransport doesn't contain a send queue (like TcpTransport), so macOS would not be able to detect congestion using the mechanisms outlined above.

#39 Updated by Junxiao Shi over 1 year ago

It makes no sense to detect congestion on Unix stream and WebSockets, which exclusively connect to end applications.
Unix streams are local IPC channels. If an app can’t read from a socket fast enough, it can detect the situation by itself (there’s almost no idle cycles between select calls), and a congestion mark does not help it more.
WebSockets are proxied through nginx in a typical deployment, and congestion happens between nginx and browser, not between NFD and nginx. Therefore, detecting congestion on NFD side won’t work. Also, NDN-JS does not recognize congestion marks (and I don’t see an open issue requesting that) so this won’t help applications.

#40 Updated by Davide Pesavento over 1 year ago

Junxiao Shi wrote:

Unix streams are local IPC channels. If an app can’t read from a socket fast enough, it can detect the situation by itself (there’s almost no idle cycles between select calls),

That condition doesn't imply a congested state by itself. And in any case, why would you want to do this in applications? Let's not make writing apps harder than it already is.

and a congestion mark does not help it more.

#4407 seems to prove you wrong. Marking packets is helpful to signal the consumer to slow down.

WebSockets are proxied through nginx in a typical deployment, and congestion happens between nginx and browser, not between NFD and nginx.

[citation needed]

Also, NDN-JS does not recognize congestion marks (and I don’t see an open issue requesting that) so this won’t help applications.

That's a completely orthogonal issue. Libraries will add support over time, as with any other new feature.

#41 Updated by Junxiao Shi over 1 year ago

WebSockets are proxied through nginx in a typical deployment, and congestion happens between nginx and browser, not between NFD and nginx.

[citation needed]

As described in https://lists.named-data.net/mailman/private/operators/2016-June/001062.html, modern browsers cannot use non-TLS WebSockets except in very limited circumstances, so the testbed uses nginx as a proxy.
The connection between NFD and nginx is local so it cannot be congested.

#42 Updated by Eric Newberry over 1 year ago

For transports that add the size of m_sendQueue onto the send queue length obtained from the socket, if there is an error obtaining the queue length from the socket, do we want to just return the size of m_sendQueue? Or do we want to return an error code (QUEUE_ERROR)?

#43 Updated by Davide Pesavento over 1 year ago

Eric Newberry wrote:

For transports that add the size of m_sendQueue onto the send queue length obtained from the socket, if there is an error obtaining the queue length from the socket, do we want to just return the size of m_sendQueue? Or do we want to return an error code (QUEUE_ERROR)?

IMO, return at least the size of m_sendQueue, which is always available in a stream transport. Some info is better than none.
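
That fallback could look like the following sketch, where `bufferedBytes` stands in for m_sendQueue.size() and the function name is invented for illustration.

```cpp
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

#ifdef __linux__
#include <linux/sockios.h> // SIOCOUTQ
#endif

// Sketch of the fallback discussed above: if querying the socket fails,
// still report the bytes buffered in the transport's own send queue,
// rather than propagating an error code.
ssize_t
getSendQueueLength(int fd, size_t bufferedBytes)
{
  int queued = 0;
#ifdef __linux__
  if (ioctl(fd, SIOCOUTQ, &queued) < 0) {
    queued = 0; // socket query failed: some info is better than none
  }
#else
  (void)fd; // platform-specific query omitted in this sketch
#endif
  return static_cast<ssize_t>(queued) + static_cast<ssize_t>(bufferedBytes);
}
```
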

#44 Updated by Klaus Schneider over 1 year ago

Davide Pesavento wrote:

Eric Newberry wrote:

For transports that add the size of m_sendQueue onto the send queue length obtained from the socket, if there is an error obtaining the queue length from the socket, do we want to just return the size of m_sendQueue? Or do we want to return an error code (QUEUE_ERROR)?

IMO, return at least the size of m_sendQueue, which is always available in a stream transport. Some info is better than none.

Agreed. I don't think returning an error code would be very useful.

I also agree fully with Davide's note-40.

Regarding Websockets: Do you have any concrete design for MacOS? If not, I think it's fine to leave it out (+ document it).

#45 Updated by Eric Newberry over 1 year ago

Klaus Schneider wrote:

Davide Pesavento wrote:

Eric Newberry wrote:

For transports that add the size of m_sendQueue onto the send queue length obtained from the socket, if there is an error obtaining the queue length from the socket, do we want to just return the size of m_sendQueue? Or do we want to return an error code (QUEUE_ERROR)?

IMO, return at least the size of m_sendQueue, which is always available in a stream transport. Some info is better than none.

Agreed. I don't think returning an error code would be very useful.

I also agree fully with Davide's note-40.

Regarding Websockets: Do you have any concrete design for MacOS? If not, I think it's fine to leave it out (+ document it).

Patchset 14 should implement everything according to note 40.

I believe WebSocket would use the same information as TCP, without m_sendQueue, meaning that it would have absolutely no send queue length information on macOS.

#46 Updated by Davide Pesavento over 1 year ago

Klaus Schneider wrote:

Regarding Websockets: Do you have any concrete design for MacOS? If not, I think it's fine to leave it out (+ document it).

It would be exactly the same as TCP since the WebSocket protocol is layered over a TCP connection. But let's leave WebSocketTransport support out for now, we can add it later.

#47 Updated by Klaus Schneider over 1 year ago

Davide Pesavento wrote:

Klaus Schneider wrote:

Regarding Websockets: Do you have any concrete design for MacOS? If not, I think it's fine to leave it out (+ document it).

It would be exactly the same as TCP since the WebSocket protocol is layered over a TCP connection. But let's leave WebSocketTransport support out for now, we can add it later.

Sounds good.

#48 Updated by Eric Newberry over 1 year ago

Do we plan to attempt an implementation of queue length retrieval for Ethernet faces in the future?

Also, we should create another issue to allow congestion marking to be enabled and disabled from NFD management. This would include design work.

#49 Updated by Davide Pesavento over 1 year ago

Eric Newberry wrote:

Do we plan to attempt an implementation of queue length retrieval for Ethernet faces in the future?

I hope so, but does that impact change 4411?

#50 Updated by Eric Newberry over 1 year ago

Davide Pesavento wrote:

Eric Newberry wrote:

Do we plan to attempt an implementation of queue length retrieval for Ethernet faces in the future?

I hope so, but does that impact change 4411?

No, but it definitely relates to this issue, since the description mentions Ethernet links.

#51 Updated by Davide Pesavento over 1 year ago

Eric Newberry wrote:

No, but it definitely relates to this issue, since the description mentions Ethernet links.

"Ethernet links" doesn't imply ethernet faces are being used. Although I'm not sure what "When NFD is running natively" means TBH.
Let's keep this issue focused on socket-based transports only, like the title and the rest of the description say. Ethernet transports differ substantially from the rest, and the detection mechanism(s) will probably be very different, assuming it's even feasible to do it. Use a separate issue for them.

#52 Updated by Klaus Schneider over 1 year ago

Eric Newberry wrote:

Davide Pesavento wrote:

Eric Newberry wrote:

Do we plan to attempt an implementation of queue length retrieval for Ethernet faces in the future?

I hope so, but does that impact change 4411?

No, but it definitely relates to this issue, since the description mentions Ethernet links.

Looks like the description needs an update. Let's exclude Ethernet links from this issue.

Also, we should create another issue to allow congestion marking to be enabled and disabled from NFD management. This would include design work.

Which design work are you thinking about?

#53 Updated by Klaus Schneider over 1 year ago

  • Description updated (diff)

#54 Updated by Eric Newberry over 1 year ago

Klaus Schneider wrote:

Also, we should create another issue to allow congestion marking to be enabled and disabled from NFD management. This would include design work.

Which design work are you thinking about?

Just a design of how it would be enabled from management (probably a bit in the Flags and Mask fields), plus adding any other necessary options. It should be pretty straightforward, nothing too complex.
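
Such a Flags/Mask check could look roughly like this. The bit position and function names are invented for illustration, following the convention used for flags like LpReliability's; the actual bit would be assigned in the face management protocol.

```cpp
#include <cstdint>

// Illustrative only: the real bit position would be assigned in the
// management protocol spec, alongside the existing face flags.
constexpr int BIT_CONGESTION_MARKING_ENABLED = 2;

// Apply a faces/update request: only the bits covered by Mask are changed.
constexpr uint64_t
applyFlags(uint64_t current, uint64_t flags, uint64_t mask)
{
  return (current & ~mask) | (flags & mask);
}

constexpr bool
isCongestionMarkingEnabled(uint64_t flags)
{
  return (flags & (uint64_t(1) << BIT_CONGESTION_MARKING_ENABLED)) != 0;
}
```
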

#55 Updated by Klaus Schneider over 1 year ago

  • Description updated (diff)

#56 Updated by Eric Newberry over 1 year ago

Does "drops" in the description refer to the number of packets marked in the incident of congestion or the total number of packets processed in the incident of congestion?

#57 Updated by Klaus Schneider over 1 year ago

Eric Newberry wrote:

Does "drops" in the description refer to the number of packets marked in the incident of congestion or the total number of packets processed in the incident of congestion?

"drops" (better called "marks") means the number of packets marked after entering the marking state (i.e. the queue being above target).

Once the queue goes below target, drops is reset to 0.

In general, any reference to "drops" in the CoDel RFC will translate to "marked packets" in our design.

#58 Updated by Eric Newberry over 1 year ago

Klaus Schneider wrote:

Eric Newberry wrote:

Does "drops" in the description refer to the number of packets marked in the incident of congestion or the total number of packets processed in the incident of congestion?

"drops" (better called "marks") means the number of packets marked after entering the marking state (i.e. the queue being above target).

Once the queue goes below target, drops is reset to 0.

Previous patchsets used the second, incorrect definition, but the latest patchset (18) should use the correct one stated above.

#59 Updated by Klaus Schneider over 1 year ago

Eric Newberry wrote:

Previous patchsets used the second, incorrect definition, but the latest patchset (18) should use the correct one stated above.

Sounds good.

#60 Updated by Klaus Schneider over 1 year ago

  • % Done changed from 100 to 90

I ran a simple test on my local machine (100MB file) and it failed:

ndncatchunks -v /test > /dev/null

All segments have been received.
Time elapsed: 4003.19 milliseconds
Total # of segments received: 23676
Total size: 104174kB
Goodput: 208.181317 Mbit/s
Total # of packet loss events: 1
Total # of retransmitted segments: 1292
Total # of received congestion marks: 0
RTT min/avg/max = 0.875/131/260 ms

Catchunks doesn't receive any congestion marks.

In contrast, our hackathon code works:

ndncatchunks -v /test > /dev/null

All segments have been received.
Time elapsed: 1701.96 milliseconds
Total # of segments received: 23676
Total size: 104174kB
Goodput: 489.664074 Mbit/s
Total # of packet loss events: 0
Total # of retransmitted segments: 0
Total # of received congestion marks: 1046
RTT min/avg/max = 0.388/2.34/20.4 ms

Edit: removed the misleading "packet loss rate", which is currently a bug in ndncatchunks.

#61 Updated by Eric Newberry over 1 year ago

Klaus Schneider wrote:

I ran a simple test on my local machine (100MB file) and it failed:

ndncatchunks -v /test > /dev/null

All segments have been received.
Time elapsed: 4003.19 milliseconds
Total # of segments received: 23676
Total size: 104174kB
Goodput: 208.181317 Mbit/s
Total # of packet loss events: 1
Packet loss rate: 4.22369e-05
Total # of retransmitted segments: 1292
Total # of received congestion marks: 0
RTT min/avg/max = 0.875/131/260 ms

Catchunks doesn't receive any congestion marks.

In contrast, our hackathon code works:

ndncatchunks -v /test > /dev/null

All segments have been received.
Time elapsed: 1701.96 milliseconds
Total # of segments received: 23676
Total size: 104174kB
Goodput: 489.664074 Mbit/s
Total # of packet loss events: 0
Packet loss rate: 0
Total # of retransmitted segments: 0
Total # of received congestion marks: 1046
RTT min/avg/max = 0.388/2.34/20.4 ms

I don't think we can expect any system to work 100% perfectly 100% of the time. The loss could also have been caused by external factors. Did you run multiple trials?

#62 Updated by Klaus Schneider over 1 year ago

Eric Newberry wrote:

I don't think we can expect any system to work 100% perfectly 100% of the time. The loss could also have been caused by external factors. Did you run multiple trials?

Currently, it works 0% of the time.

The queue is clearly above the limit (avg. RTT of 131ms), yet there are no congestion marks.

Edit: Currently the catchunk output of "packet loss event" and "packet loss rate" is misleading. Look at "# of retransmitted segments" to see the actual packet loss rate.

#63 Updated by Klaus Schneider over 1 year ago

Okay, I figured out that I forgot to enable the congestion detection in the code (we should be able to do this in a config file, rather than by recompiling).

Now it works quite well, but needs some tuning for links with low delay (here less than 1ms).

ndncatchunks -v /test > /dev/null

All segments have been received.
Time elapsed: 2304.11 milliseconds
Total # of segments received: 23676
Total size: 104174kB
Goodput: 361.697410 Mbit/s
Total # of retransmitted segments: 970
Total # of received congestion marks: 22
RTT min/avg/max = 0.489/18.9/246 ms

The problem can be fixed by changing the CoDel interval from 100ms to 10ms:

All segments have been received.
Time elapsed: 1640.65 milliseconds
Total # of segments received: 23676
Total size: 104174kB
Goodput: 507.962230 Mbit/s
Total # of retransmitted segments: 0
Total # of received congestion marks: 45
RTT min/avg/max = 0.499/2.74/25.3 ms

Or by increasing the consumer RTO:

ndncatchunks -v --aimd-rto-min 500 /test > /dev/null

All segments have been received.
Time elapsed: 1856.24 milliseconds
Total # of segments received: 23676
Total size: 104174kB
Goodput: 448.965759 Mbit/s
Total # of retransmitted segments: 0
Total # of received congestion marks: 15
RTT min/avg/max = 0.658/25.2/233 ms

Also note that (as expected) we get far fewer congestion marks here than in the hackathon code.

I still have to do the evaluation with higher RTTs and TCP/UDP tunnels.

#64 Updated by Eric Newberry over 1 year ago

Klaus Schneider wrote:

Okay, I figured out that I forgot to enable the congestion detection in the code (we should be able to do this in a config file, rather than by recompiling).

Whether to allow congestion marking on a face or not should be set through the management system (and therefore via nfdc). It should be implemented in a similar manner to LpReliability.

The problem can be fixed by changing the CoDel interval from 100ms to 10ms:

I assume we're not going to change this value in the current commit. We can allow it to be changed through the management system.

#65 Updated by Klaus Schneider over 1 year ago

Eric Newberry wrote:

Klaus Schneider wrote:

Okay, I figured out that I forgot to enable the congestion detection in the code (we should be able to do this in a config file, rather than by recompiling).

Whether to allow congestion marking on a face or not should be set through the management system (and therefore via nfdc). It should be implemented in a similar manner to LpReliability.

Sounds good. Are we doing that in next commit?

The problem can be fixed by changing the CoDel interval from 100ms to 10ms:

I assume we're not going to change this value in the current commit. We can allow it to be changed through the management system.

Yes, the current default of 100ms is fine. But it should be reasonably easy to change.

#66 Updated by Eric Newberry over 1 year ago

Klaus Schneider wrote:

Eric Newberry wrote:

Klaus Schneider wrote:

Okay, I figured out that I forgot to enable the congestion detection in the code (we should be able to do this in a config file, rather than by recompiling).

Whether to allow congestion marking on a face or not should be set through the management system (and therefore via nfdc). It should be implemented in a similar manner to LpReliability.

Sounds good. Are we doing that in next commit?

Yes

The problem can be fixed by changing the CoDel interval from 100ms to 10ms:

I assume we're not going to change this value in the current commit. We can allow it to be changed through the management system.

Yes, the current default of 100ms is fine. But it should be reasonably easy to change.

We can probably just have this (from the user's perspective) as a command-line argument to nfdc.

#67 Updated by Davide Pesavento over 1 year ago

Eric Newberry wrote:

Klaus Schneider wrote:

The problem can be fixed by changing the CoDel interval from 100ms to 10ms:

I assume we're not going to change this value in the current commit. We can allow it to be changed through the management system.

Yes, we can do that, but the problem is that no one is actually going to change it, especially not for local (unix stream) faces.

#68 Updated by Eric Newberry over 1 year ago

Davide Pesavento wrote:

Eric Newberry wrote:

Klaus Schneider wrote:

The problem can be fixed by changing the CoDel interval from 100ms to 10ms:

I assume we're not going to change this value in the current commit. We can allow it to be changed through the management system.

Yes, we can do that, but the problem is that no one is actually going to change it, especially not for local (unix stream) faces.

We could have the management system set a different default for local faces?

#69 Updated by Davide Pesavento over 1 year ago

Eric Newberry wrote:

We could have the management system set a different default for local faces?

Please elaborate. Currently, management has no power over Unix faces, and more generally, management is not involved in the creation of on-demand faces.

#70 Updated by Eric Newberry over 1 year ago

Davide Pesavento wrote:

Eric Newberry wrote:

We could have the management system set a different default for local faces?

Please elaborate. Currently, management has no power over Unix faces, and more generally, management is not involved in the creation of on-demand faces.

What about local TCP faces? Are they ever utilized and are they also created on-demand?

#71 Updated by Davide Pesavento over 1 year ago

Eric Newberry wrote:

Davide Pesavento wrote:

Eric Newberry wrote:

We could have the management system set a different default for local faces?

Please elaborate. Currently, management has no power over Unix faces, and more generally, management is not involved in the creation of on-demand faces.

What about local TCP faces? Are they ever utilized and are they also created on-demand?

Same thing. Whenever you have an application on the other side, the face is on-demand. Local TCP faces are also rarely used, compared to Unix faces.

#72 Updated by Klaus Schneider over 1 year ago

I did some testing over UDP on my virtual machine (both NFDs have congestion detection enabled). I also added 50ms link delay.

First case: Producer runs on laptop, consumer runs on VM:

All segments have been received.
Time elapsed: 29136.8 milliseconds
Total # of segments received: 23676
Total size: 104174kB
Goodput: 28.602673 Mbit/s
Total # of retransmitted segments: 338
Total # of received congestion marks: 0
RTT min/avg/max = 43.954/55.612/110.566 ms

Second case: Consumer runs on laptop, producer runs on VM:

All segments have been received.
Time elapsed: 13365.2 milliseconds
Total # of segments received: 23676
Total size: 104174kB
Goodput: 62.355268 Mbit/s
Total # of retransmitted segments: 842
Total # of received congestion marks: 0
RTT min/avg/max = 51/101/325 ms

As you can see, there are no congestion marks. Increasing the consumer RTO didn't help.

#73 Updated by Klaus Schneider over 1 year ago

The log files look like this (grep "cong").

Congestion detection is clearly enabled, but it never marks a packet (no occurrence of "LpPacket was marked as congested").

1515462666.399212 TRACE: [GenericLinkService] [id=267,local=unix:///run/nfd.sock,remote=fd://29] Send queue length: 91392 - congestion threshold: 65536
1515462666.399356 TRACE: [GenericLinkService] [id=267,local=unix:///run/nfd.sock,remote=fd://29] Send queue length: 96768 - congestion threshold: 65536
1515462666.399499 TRACE: [GenericLinkService] [id=267,local=unix:///run/nfd.sock,remote=fd://29] Send queue length: 102144 - congestion threshold: 65536
1515462673.287010 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 57344 - congestion threshold: 53248
1515462681.545601 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 57344 - congestion threshold: 53248
1515462681.546727 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 65536 - congestion threshold: 53248
1515462681.548310 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 73728 - congestion threshold: 53248
1515462681.943691 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 57344 - congestion threshold: 53248
1515462681.944187 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 65536 - congestion threshold: 53248
1515462681.944718 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 73728 - congestion threshold: 53248
1515462681.945164 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 81920 - congestion threshold: 53248
1515462681.945894 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 90112 - congestion threshold: 53248
1515462681.946476 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 98304 - congestion threshold: 53248
1515462681.947019 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 106496 - congestion threshold: 53248
1515462681.947638 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 114688 - congestion threshold: 53248
1515462681.948119 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 122880 - congestion threshold: 53248
1515462681.948629 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 131072 - congestion threshold: 53248
1515462681.949120 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 139264 - congestion threshold: 53248
1515462682.917569 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 57344 - congestion threshold: 53248
1515462682.945887 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 57344 - congestion threshold: 53248
1515462682.946526 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 65536 - congestion threshold: 53248
1515462685.098668 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 57344 - congestion threshold: 53248
1515462685.099018 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 65536 - congestion threshold: 53248
1515462685.099470 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 73728 - congestion threshold: 53248
1515462686.671630 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.1.107:6363,remote=udp4://192.168.1.104:6363] Send queue length: 57344 - congestion threshold: 53248

#74 Updated by Klaus Schneider over 1 year ago

As written in Gerrit, it would be good to have better logging output.

#75 Updated by Klaus Schneider over 1 year ago

Regarding the UDP test:

  1. I found that the queue length on the actual UDP face never exceeds the threshold:

1515480726.669771 TRACE: [GenericLinkService] [id=260,local=udp4://10.0.2.15:6363,remote=udp4://10.134.223.61:6363] Send queue length: 768 - send queue capacity: 106496 - congestion threshold: 53248
1515480726.673690 TRACE: [GenericLinkService] [id=260,local=udp4://10.0.2.15:6363,remote=udp4://10.134.223.61:6363] Send queue length: 768 - send queue capacity: 106496 - congestion threshold: 53248
1515480726.673833 TRACE: [GenericLinkService] [id=260,local=udp4://10.0.2.15:6363,remote=udp4://10.134.223.61:6363] Send queue length: 1536 - send queue capacity: 106496 - congestion threshold: 53248
1515480726.674149 TRACE: [GenericLinkService] [id=260,local=udp4://10.0.2.15:6363,remote=udp4://10.134.223.61:6363] Send queue length: 2304 - send queue capacity: 106496 - congestion threshold: 53248
1515480726.674294 TRACE: [GenericLinkService] [id=260,local=udp4://10.0.2.15:6363,remote=udp4://10.134.223.61:6363] Send queue length: 3072 - send queue capacity: 106496 - congestion threshold: 53248

Instead, the congestion occurs at the unix socket.

However, for the unix socket, the queue length quite often drops below the threshold inexplicably:

1515480710.620540 TRACE: [GenericLinkService] [id=262,local=unix:///run/nfd.sock,remote=fd://27] Send queue length: 53760 - send queue capacity: -1 - congestion threshold: 65536
1515480710.620611 TRACE: [GenericLinkService] [id=262,local=unix:///run/nfd.sock,remote=fd://27] Send queue length: 59136 - send queue capacity: -1 - congestion threshold: 65536
1515480710.620692 TRACE: [GenericLinkService] [id=262,local=unix:///run/nfd.sock,remote=fd://27] Send queue length: 64512 - send queue capacity: -1 - congestion threshold: 65536
1515480710.620766 TRACE: [GenericLinkService] [id=262,local=unix:///run/nfd.sock,remote=fd://27] Send queue length: 69888 - send queue capacity: -1 - congestion threshold: 65536
1515480710.620843 TRACE: [GenericLinkService] [id=262,local=unix:///run/nfd.sock,remote=fd://27] Send queue length: 75264 - send queue capacity: -1 - congestion threshold: 65536
1515480710.627133 DEBUG: [GenericLinkService] [id=262,local=unix:///run/nfd.sock,remote=fd://27] Send queue length dropped below congestion threshold
1515480710.627363 TRACE: [GenericLinkService] [id=262,local=unix:///run/nfd.sock,remote=fd://27] Send queue length: 5376 - send queue capacity: -1 - congestion threshold: 65536
1515480710.627483 TRACE: [GenericLinkService] [id=262,local=unix:///run/nfd.sock,remote=fd://27] Send queue length: 10752 - send queue capacity: -1 - congestion threshold: 65536

I wonder whether this is caused by packet loss/buffer overflow?

Another problem is that we currently can't determine the queue capacity for unix sockets (result = -1).

#76 Updated by Eric Newberry over 1 year ago

Klaus Schneider wrote:

Another problem is that we currently can't determine the queue capacity for unix sockets (result = -1).

Can you test again with the latest patchset (25)? It should implement queue capacity detection for Unix sockets.

#77 Updated by Klaus Schneider over 1 year ago

Eric Newberry wrote:

Klaus Schneider wrote:

Another problem is that we currently can't determine the queue capacity for unix sockets (result = -1).

Can you test again with the latest patchset (25)? It should implement queue capacity detection for Unix sockets.

I think Davide made a good point: The unix socket also uses m_sendQueue() and thus can't drop any packets.

Is that true? If yes, the earlier design (returning -1) was fine.

But then the question remains: why does the queue suddenly become empty?

#78 Updated by Davide Pesavento over 1 year ago

Klaus Schneider wrote:

I wonder whether this is caused by packet loss/buffer overflow?

No. Unix sockets don't lose packets.

The unix socket also uses m_sendQueue() and thus can't drop any packets.
Is that true? If yes, the earlier design (returning -1) was fine.

Yes, that's correct.

#79 Updated by Klaus Schneider over 1 year ago

Here is the output of the consumer (60ms link propagation delay):

All segments have been received.
Time elapsed: 36545.2 milliseconds
Total # of segments received: 23676
Total size: 104174kB
Goodput: 22.804349 Mbit/s
Total # of packet loss events: 21
Packet loss rate: 0.000886974
Total # of retransmitted segments: 206
Total # of received congestion marks: 0
RTT min/avg/max = 60.673/68.721/132.907 ms

There are no timeouts caused by exceeding the RTO (200ms), yet 206 segments were retransmitted.

So where do these packet drops come from, if the UDP tunnel never exceeds the limit, and both unix sockets and NFD are perfectly reliable (i.e. introduce delay, but don't drop packets)?

#80 Updated by Klaus Schneider over 1 year ago

Another weird outcome: The UDP tunnel send queue size seems to be counting # of packets rather than bytes:

1515482564.798918 TRACE: [GenericLinkService] [id=260,local=udp4://10.0.2.15:6363,remote=udp4://10.134.223.61:6363] Send queue length: 18 - send queue capacity: 106496 - congestion threshold: 53248
1515482564.798998 TRACE: [GenericLinkService] [id=260,local=udp4://10.0.2.15:6363,remote=udp4://10.134.223.61:6363] Send queue length: 19 - send queue capacity: 106496 - congestion threshold: 53248
1515482564.799072 TRACE: [GenericLinkService] [id=260,local=udp4://10.0.2.15:6363,remote=udp4://10.134.223.61:6363] Send queue length: 20 - send queue capacity: 106496 - congestion threshold: 53248
1515482564.799144 TRACE: [GenericLinkService] [id=260,local=udp4://10.0.2.15:6363,remote=udp4://10.134.223.61:6363] Send queue length: 21 - send queue capacity: 106496 - congestion threshold: 53248
1515482564.799255 TRACE: [GenericLinkService] [id=260,local=udp4://10.0.2.15:6363,remote=udp4://10.134.223.61:6363] Send queue length: 22 - send queue capacity: 106496 - congestion threshold: 53248
1515482564.799968 TRACE: [GenericLinkService] [id=260,local=udp4://10.0.2.15:6363,remote=udp4://10.134.223.61:6363] Send queue length: 23 - send queue capacity: 106496 - congestion threshold: 53248
1515482564.800050 TRACE: [GenericLinkService] [id=260,local=udp4://10.0.2.15:6363,remote=udp4://10.134.223.61:6363] Send queue length: 24 - send queue capacity: 106496 - congestion threshold: 53248
1515482564.800190 TRACE: [GenericLinkService] [id=260,local=udp4://10.0.2.15:6363,remote=udp4://10.134.223.61:6363] Send queue length: 25 - send queue capacity: 106496 - congestion threshold: 53248
1515482564.800975 TRACE: [GenericLinkService] [id=260,local=udp4://10.0.2.15:6363,remote=udp4://10.134.223.61:6363] Send queue length: 26 - send queue capacity: 106496 - congestion threshold: 53248

Or how else do you explain that the "send queue length" is incremented in steps of 1?

#81 Updated by Davide Pesavento over 1 year ago

Klaus Schneider wrote:

So where do these packet drops come from, if the UDP tunnel never exceeds the limit, and both unix sockets and NFD are perfectly reliable (i.e. introduce delay, but don't drop packets)?

How do you know the udp socket isn't dropping packets?

#82 Updated by Klaus Schneider over 1 year ago

Davide Pesavento wrote:

Klaus Schneider wrote:

So where do these packet drops come from, if the UDP tunnel never exceeds the limit, and both unix sockets and NFD are perfectly reliable (i.e. introduce delay, but don't drop packets)?

How do you know the udp socket isn't dropping packets?

If it was, we would first see its queue go above the threshold. But it's never even close to the threshold (assuming that our queue size detection works correctly).

I have another theory: Maybe some process clears the buffer in batches, which empties it periodically?

That would explain why the queue suddenly decreases so much:

1515485721.065771 TRACE: [GenericLinkService] [id=262,local=unix:///run/nfd.sock,remote=fd://27] Send queue length: 129024 - congestion threshold: 53248 - capacity: 106496
1515485721.065844 TRACE: [GenericLinkService] [id=262,local=unix:///run/nfd.sock,remote=fd://27] Send queue length: 134400 - congestion threshold: 53248 - capacity: 106496
1515485721.065943 TRACE: [GenericLinkService] [id=262,local=unix:///run/nfd.sock,remote=fd://27] Send queue length: 139776 - congestion threshold: 53248 - capacity: 106496
1515485721.075803 DEBUG: [GenericLinkService] [id=262,local=unix:///run/nfd.sock,remote=fd://27] Send queue length dropped below congestion threshold
1515485721.668688 TRACE: [GenericLinkService] [id=262,local=unix:///run/nfd.sock,remote=fd://27] Send queue length: 5376 - congestion threshold: 53248 - capacity: 106496
1515485721.845179 TRACE: [GenericLinkService] [id=262,local=unix:///run/nfd.sock,remote=fd://27] Send queue length: 5376 - congestion threshold: 53248 - capacity: 106496

But it still doesn't explain the packet loss.

#83 Updated by Klaus Schneider over 1 year ago

I'm still wondering what's up with the non-byte queue lengths:

1515486333.962175 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.56.101:6363,remote=udp4://192.168.56.1:6363] Send queue length: 20 - congestion threshold: 53248 - capacity: 106496
1515486333.962446 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.56.101:6363,remote=udp4://192.168.56.1:6363] Send queue length: 20 - congestion threshold: 53248 - capacity: 106496
1515486333.969706 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.56.101:6363,remote=udp4://192.168.56.1:6363] Send queue length: 12 - congestion threshold: 53248 - capacity: 106496
1515486333.970008 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.56.101:6363,remote=udp4://192.168.56.1:6363] Send queue length: 13 - congestion threshold: 53248 - capacity: 106496
1515486333.978128 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.56.101:6363,remote=udp4://192.168.56.1:6363] Send queue length: 8 - congestion threshold: 53248 - capacity: 106496
1515486333.978493 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.56.101:6363,remote=udp4://192.168.56.1:6363] Send queue length: 9 - congestion threshold: 53248 - capacity: 106496

#84 Updated by Davide Pesavento over 1 year ago

Klaus Schneider wrote:

Davide Pesavento wrote:

Klaus Schneider wrote:

So where do these packet drops come from, if the UDP tunnel never exceeds the limit, and both unix sockets and NFD are perfectly reliable (i.e. introduce delay, but don't drop packets)?

How do you know the udp socket isn't dropping packets?

If it was, we would first see its queue go above the threshold. But it's never even close to the threshold (assuming that our queue size detection works correctly).

I wouldn't make that assumption, seeing that the queue length values don't make much sense.

I'm still wondering what's up with the non-byte queue lengths:

I don't know what's happening in your test, but the queue length is in bytes.

#85 Updated by Klaus Schneider over 1 year ago

Davide Pesavento wrote:

Klaus Schneider wrote:

Davide Pesavento wrote:

Klaus Schneider wrote:

So where do these packet drops come from, if the UDP tunnel never exceeds the limit, and both unix sockets and NFD are perfectly reliable (i.e. introduce delay, but don't drop packets)?

How do you know the udp socket isn't dropping packets?

If it was, we would first see its queue go above the threshold. But it's never even close to the threshold (assuming that our queue size detection works correctly).

I wouldn't make that assumption, seeing that the queue length values don't make much sense.

So how can we figure out if the UDP tunnel drops something?

I'm still wondering what's up with the non-byte queue lengths:

I don't know what's happening in your test, but the queue length is in bytes.

So why does SIOCOUTQ return values that differ by 1? Are there 1-byte packets in the buffer?

Can you (or someone) try to run the code in UDP tunnels, and see if you can replicate my results?

#86 Updated by Klaus Schneider over 1 year ago

I just figured out a way in which the UDP tunnel could drop packets without us noticing (http://developerweb.net/viewtopic.php?id=6488): if the receive queue at the other endpoint is overflowing!

Can we also monitor the receive queue of each UDP face and print out trace output?

The command would be ioctl(SIOCINQ)

#87 Updated by Klaus Schneider over 1 year ago

Actually, the numbers make perfect sense if the UDP queue length is somehow mistakenly in #packets rather than bytes:

1515543750.854293 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.56.101:6363,remote=udp4://192.168.56.1:6363] Send queue length: 136 - congestion threshold: 53248 - capacity: 106496
1515543750.854851 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.56.101:6363,remote=udp4://192.168.56.1:6363] Send queue length: 136 - congestion threshold: 53248 - capacity: 106496
1515543750.855346 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.56.101:6363,remote=udp4://192.168.56.1:6363] Send queue length: 136 - congestion threshold: 53248 - capacity: 106496
1515543750.856916 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.56.101:6363,remote=udp4://192.168.56.1:6363] Send queue length: 132 - congestion threshold: 53248 - capacity: 106496
1515543750.857660 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.56.101:6363,remote=udp4://192.168.56.1:6363] Send queue length: 132 - congestion threshold: 53248 - capacity: 106496
1515543750.858332 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.56.101:6363,remote=udp4://192.168.56.1:6363] Send queue length: 136 - congestion threshold: 53248 - capacity: 106496
1515543750.859473 TRACE: [GenericLinkService] [id=260,local=udp4://192.168.56.101:6363,remote=udp4://192.168.56.1:6363] Send queue length: 132 - congestion threshold: 53248 - capacity: 106496

The udp buffer capacity is 106,496 bytes.

The highest observed queue length is 136 × 768 (packet size) = 104,448 bytes, right below the capacity.

#88 Updated by Davide Pesavento over 1 year ago

Klaus Schneider wrote:

So how can we figure out if the UDP tunnel drops something?

I'm not sure; I don't think there's a way to know that for a specific socket. You can try "nstat -az | grep UdpSndbufErrors" for a system-wide counter, but other types of errors may be conflated in the same metric.

But let's take a step back: why do you care if the socket dropped a packet?

The command would be ioctl(SIOCINQ)

No, SIOCINQ returns the size of the first datagram in the receive queue.

Actually, the numbers make perfect sense if the UDP queue length is somehow mistakenly in #packets rather than bytes:

Where do you see these weird numbers? On macOS?

The highest observed queue length is 136 × 768 (packet size) = 104,448 bytes, right below the capacity.

Are you sure the packet is 768 bytes? Are you using chunks? You must be doing something unusual if the packets have that size (too big for an Interest, too small for a Data).

#89 Updated by Klaus Schneider over 1 year ago

Davide Pesavento wrote:

Klaus Schneider wrote:

So how can we figure out if the UDP tunnel drops something?

I'm not sure; I don't think there's a way to know that for a specific socket. You can try "nstat -az | grep UdpSndbufErrors" for a system-wide counter, but other types of errors may be conflated in the same metric.

But let's take a step back: why do you care if the socket dropped a packet?

Well there is packet loss somewhere, and I'm trying to figure out where it occurs.

The command would be ioctl(SIOCINQ)

No, SIOCINQ returns the size of the first datagram in the receive queue.

Fair enough. It seems to work on every platform but Linux: https://stackoverflow.com/questions/9278189/how-do-i-get-amount-of-queued-data-for-udp-socket

Actually, the numbers make perfect sense if the UDP queue length is somehow mistakenly in #packets rather than bytes:

Where do you see these weird numbers? On macOS?

Ubuntu 16.04 VM.

The highest observed queue length is 136 × 768 (packet size) = 104,448 bytes, right below the capacity.

Are you sure the packet is 768 bytes? Are you using chunks? You must be doing something unusual if the packets have that size (too big for an Interest, too small for a Data).

Not sure, but 768 bytes was the "step size" in the unix queue output.

#90 Updated by Klaus Schneider over 1 year ago

I ran netstat -s -su and got a number of packet receive errors and RcvbufErrors:

Before catchunks run:

Udp:
291724 packets received
435 packet receive errors
293536 packets sent
RcvbufErrors: 435

After catchunks run:

Udp:
315400 packets received
544 packet receive errors
317322 packets sent
RcvbufErrors: 544

So definitely, the UDP sockets are dropping something.

#91 Updated by Klaus Schneider over 1 year ago

And indeed it's a receive buffer overflow. You can see it when running:

watch -n .1 ss -nump

On both ends, the Send-Q is almost empty, while the Recv-Q is overflowing.

#92 Updated by Klaus Schneider over 1 year ago

Note that the problem can be mitigated by some combination of:

  1. Reducing the default marking interval, that is, the time the queue has to stay above the limit before the first packet is marked (here: to 5ms)
  2. Increasing the UDP receive buffer size (here to around 10x the normal, 2000KB).
  3. Increasing the consumer min-rto (here to 600ms)

All segments have been received.
Time elapsed: 13476.7 milliseconds
Total # of segments received: 23676
Total size: 104174kB
Goodput: 61.839497 Mbit/s
Total # of packet loss events: 0
Packet loss rate: 0
Total # of retransmitted segments: 0
Total # of received congestion marks: 2
RTT min/avg/max = 41.858/140.903/348.999 ms

#93 Updated by Eric Newberry over 1 year ago

Klaus Schneider wrote:

  1. Reducing the default marking interval, that is, the time the queue has to stay above the limit before the first packet is marked (here: to 5ms)

We could make this a separate interval in GenericLinkService::Options from baseCongestionMarkingInterval.

#94 Updated by Klaus Schneider over 1 year ago

Eric Newberry wrote:

Klaus Schneider wrote:

  1. Reducing the default marking interval, that is, the time the queue has to stay above the limit before the first packet is marked (here: to 5ms)

We could make this a separate interval in GenericLinkService::Options from baseCongestionMarkingInterval.

Actually, I thought about the following design:

  • Ignore the interval when first marking the packet. Just mark the first packet once queue > threshold.
  • Then, wait at least "INTERVAL" before marking the next packet. If the queue stays above the threshold, reduce the interval, as it is done currently.

#95 Updated by Eric Newberry over 1 year ago

Klaus Schneider wrote:

Eric Newberry wrote:

Klaus Schneider wrote:

  1. Reducing the default marking interval, that is, the time the queue has to stay above the limit before the first packet is marked (here: to 5ms)

We could make this a separate interval in GenericLinkService::Options from baseCongestionMarkingInterval.

Actually, I thought about the following design:

  • Ignore the interval when first marking the packet. Just mark the first packet once queue > threshold.
  • Then, wait at least "INTERVAL" before marking the next packet. If the queue stays above the threshold, reduce the interval, as it is done currently.

I believe this is actually more or less the design I had in a previous patchset. I can't remember which number at the moment.

#96 Updated by Klaus Schneider over 1 year ago

Yeah, I know. Let's go back to that and try it out.

#97 Updated by Eric Newberry over 1 year ago

The design changes in note 94 have been implemented in patchset 29.

#98 Updated by Davide Pesavento over 1 year ago

Klaus Schneider wrote:

  • Ignore the interval when first marking the packet. Just mark the first packet once queue > threshold.
  • Then, wait at least "INTERVAL" before marking the next packet. If the queue stays above the threshold, reduce the interval, as it is done currently.

The second point is unclear. If the queue was above threshold, then goes below, and then again above it before one interval has passed since the previous mark, should we mark or not? If yes, a queue that oscillates around the threshold value would mark too many packets I think.

#99 Updated by Klaus Schneider over 1 year ago

Davide Pesavento wrote:

Klaus Schneider wrote:

  • Ignore the interval when first marking the packet. Just mark the first packet once queue > threshold.
  • Then, wait at least "INTERVAL" before marking the next packet. If the queue stays above the threshold, reduce the interval, as it is done currently.

The second point is unclear. If the queue was above threshold, then goes below, and then again above it before one interval has passed since the previous mark, should we mark or not? If yes, a queue that oscillates around the threshold value would mark too many packets I think.

I would say "no", don't mark the packet.

#100 Updated by Eric Newberry over 1 year ago

Davide Pesavento wrote:

Klaus Schneider wrote:

  • Ignore the interval when first marking the packet. Just mark the first packet once queue > threshold.
  • Then, wait at least "INTERVAL" before marking the next packet. If the queue stays above the threshold, reduce the interval, as it is done currently.

The second point is unclear. If the queue was above threshold, then goes below, and then again above it before one interval has passed since the previous mark, should we mark or not? If yes, a queue that oscillates around the threshold value would mark too many packets I think.

How would we track and prevent this?

#101 Updated by Eric Newberry over 1 year ago

The changes in notes 98 and 99 should be implemented in patchset 31.

#102 Updated by Klaus Schneider over 1 year ago

I just figured out that enabling TRACE logging (and redirecting it to a file) completely destroyed the performance of my earlier measurements.

Here is one run over UDP (link RTT 20ms) with the TRACE logging:

All segments have been received.
Time elapsed: 13284.3 milliseconds
Total # of segments received: 23676
Total size: 104174kB
Goodput: 62.734813 Mbit/s
Total # of retransmitted segments: 953
Total # of received congestion marks: 1
RTT min/avg/max = 21.155/104.836/454.905 ms

Here is the exact same configuration without trace logging:

All segments have been received.
Time elapsed: 4716.32 milliseconds
Total # of segments received: 23676
Total size: 104174kB
Goodput: 176.703426 Mbit/s
Total # of retransmitted segments: 0
Total # of received congestion marks: 4
RTT min/avg/max = 20.311/43.109/137.979 ms

We get 3x higher throughput, less than half the avg. delay, and the retransmissions completely disappear.

Looks like I need to do the testing again, without trace logging.

#103 Updated by Klaus Schneider over 1 year ago

So with the current settings, I get pretty good throughput on my local machine:

klaus@Latitude-E7470:~/backup_ccn/NFD$ ndncatchunks /local > /dev/null

All segments have been received.
Time elapsed: 1034.94 milliseconds
Total # of segments received: 23676
Total size: 104174kB
Goodput: 805.251379 Mbit/s
Total # of retransmitted segments: 0
Total # of received congestion marks: 6
RTT min/avg/max = 0.931/3.331/9.483 ms

And for the UDP tunnel (20ms RTT), the results are okay, even though one might want to fine-tune the marking a bit to get rid of the retransmissions:

klaus@Latitude-E7470:~/backup_ccn/NFD$ ndncatchunks /udp > /dev/null

All segments have been received.
Time elapsed: 8509.57 milliseconds
Total # of segments received: 23676
Total size: 104174kB
Goodput: 97.935614 Mbit/s
Total # of retransmitted segments: 43
Total # of received congestion marks: 9
RTT min/avg/max = 20.246/22.962/59.581 ms

Thus, after a couple of small changes (see Gerrit), I'm fine with merging the current code.

#104 Updated by Davide Pesavento over 1 year ago

Klaus Schneider wrote:

I just figured out that enabling TRACE logging (and redirecting it to a file) completely destroyed the performance of my earlier measurements.

Well, yes, that's expected. Benchmarks should be run in release mode and with minimal logging. In particular, never enable TRACE globally for every module if you're doing something time-sensitive.

#105 Updated by Klaus Schneider over 1 year ago

Davide Pesavento wrote:

Klaus Schneider wrote:

I just figured out that enabling TRACE logging (and redirecting it to a file) completely destroyed the performance of my earlier measurements.

Well, yes, that's expected. Benchmarks should be run in release mode and with minimal logging. In particular, never enable TRACE globally for every module if you're doing something time-sensitive.

I'll consider that next time :)

#106 Updated by Klaus Schneider over 1 year ago

Just for completeness, here's the result for a TCP face (same scenario as above; RTT of 20ms):

All segments have been received.
Time elapsed: 21059.5 milliseconds
Total # of segments received: 23676
Total size: 104174kB
Goodput: 39.573042 Mbit/s
Total # of retransmitted segments: 0
Total # of received congestion marks: 19
RTT min/avg/max = 21.920/39.302/65.801 ms

#107 Updated by Eric Newberry over 1 year ago

Change 4411 has been merged. Are we ready to close this issue and move on to the NFD management component of this system (allowing us to enable and disable congestion marking on a face and set any applicable parameters)?

#108 Updated by Davide Pesavento over 1 year ago

I believe so, unless Klaus wanted to do more as part of this task. Please open separate issues for 1) integration with NFD management and nfdc, and 2) EthernetTransport support.

#109 Updated by Klaus Schneider over 1 year ago

  • Status changed from Code review to Closed
  • % Done changed from 90 to 100

Sure, we can put up separate issues.

#110 Updated by Klaus Schneider over 1 year ago

@Eric: Can you create the issue for the NFD Management support?

#111 Updated by Eric Newberry over 1 year ago

Klaus Schneider wrote:

@Eric: Can you create the issue for the NFD Management support?

An issue for this has been created as #4465.
