Project

General

Profile

Bug #2441

EthernetTransport drops frames under heavy load

Added by susmit shannigrahi over 4 years ago. Updated about 1 year ago.

Status:
New
Priority:
Normal
Category:
Faces
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Confirming fragment drop on Ethernet face with NDNLP.
I have added print statements to the partial message store (attached).
When I try to send/receive large packets (2x5MB for this specific test), I see fragment losses.
Also, the sender side crashes after sending the fragments.

Producer:

ndnputbigchunk /test 5000000 2

Consumer:

nfd-start
nfdc add-nexthop /test 257
ndncatbigchunks /test -c 2 -p 2

Error:

***** Gap detected ***** Current fragment: 2843Last fragment:2839 Total loss169
1422479959.704064 TRACE: [EthernetFace] [id:257,endpoint:em4] Received: 1500 bytes from c8:1f:66:e6:7e:3a
***** Gap detected ***** Current fragment: 2848Last fragment:2843 Total loss173
1422479959.704110 TRACE: [EthernetFace] [id:257,endpoint:em4] Received: 1500 bytes from c8:1f:66:e6:7e:3a

ndnputbigchunk.cpp (4.34 KB) ndnputbigchunk.cpp Producer susmit shannigrahi, 01/28/2015 03:45 PM
ndncatbigchunks.cpp (5.63 KB) ndncatbigchunks.cpp Consumer susmit shannigrahi, 01/28/2015 03:45 PM
ndnlp-partial-message-store.cpp (3.56 KB) ndnlp-partial-message-store.cpp Changed to log gaps in fragment sequence susmit shannigrahi, 01/28/2015 03:48 PM
tlv.hpp (14.2 KB) tlv.hpp Increase MAX_NDN_PACKET_SIZE to 10MB susmit shannigrahi, 01/28/2015 03:50 PM
sender_trace.txt (927 Bytes) sender_trace.txt GDB trace from producer susmit shannigrahi, 01/28/2015 04:36 PM
nfd_trace.txt (5.04 KB) nfd_trace.txt GDB trace from NFD susmit shannigrahi, 01/28/2015 04:36 PM
nfd_producer_crashlog.txt (426 KB) nfd_producer_crashlog.txt Producer side log of crashed NFD susmit shannigrahi, 01/29/2015 10:25 AM
loss.txt (37.1 KB) loss.txt susmit shannigrahi, 01/29/2015 01:31 PM

History

#1 Updated by Junxiao Shi over 4 years ago

Please attach pcap captures from send machine and receive machine.

Please run sender in GDB, and obtain a stacktrace after it crashes.

#2 Updated by Davide Pesavento over 4 years ago

  • Category set to Faces
  • Start date deleted (01/28/2015)

What's the MTU of the interface? Also please attach a pcap trace of the packet exchange.

#3 Updated by susmit shannigrahi over 4 years ago

The MTU was 1500 bytes.The receiver.cap and sender.cap are below. Can not attach them due to size constrain.

http://www.cs.colostate.edu/~susmit/ndn/sender.cap

http://www.cs.colostate.edu/~susmit/ndn/receiver.cap

Error in sender: sender_trace.txt

Error in NFD: nfd_trace.txt

#4 Updated by Junxiao Shi over 4 years ago

pcap traces show that all 4016 fragments are successfully sent, and all of them are received.

One possibility is that NFD EthernetFace cannot process fragments fast enough, so that some packets are dropped by the kernel.

#5 Updated by Davide Pesavento over 4 years ago

Junxiao Shi wrote:

One possibility is that NFD EthernetFace cannot process fragments fast enough, so that some packets are dropped by the kernel.

Yep, I had the same thought. There's limited buffer space in the TPACKET implementation. If the application is unable to read the fragments fast enough, eventually the buffer will fill up and the kernel will start dropping incoming packets.

#6 Updated by Junxiao Shi over 4 years ago

There's limited buffer space in the TPACKET implementation. If the application is unable to read the fragments fast enough, eventually the buffer will fill up and the kernel will start dropping incoming packets.

Is there a way to confirm whether this situation is happening, and to what extent?

Also, is the sender crash caused by writing packets too fast and filling up some buffer?

#7 Updated by Davide Pesavento over 4 years ago

Junxiao Shi wrote:

Is there a way to confirm whether this situation is happening, and to what extent?

Yes, pcap_stats(3).

Also, is the sender crash caused by writing packets too fast and filling up some buffer?

I doubt it. In fact the stack trace shows that the crash is happening during a handleRead call... there are some too low values for this in the topmost 3 frames, suggesting a null pointer somewhere.

#8 Updated by Davide Pesavento over 4 years ago

susmit, what does NFD print before crashing? Please attach a full log file of a crashing scenario.

#9 Updated by Davide Pesavento over 4 years ago

You can also try increasing the ring buffer size (the libpcap default is 2M on Linux):

pcap_set_buffer_size(m_pcap.get(), 10*1024*1024); // e.g. 10 MB

Note that this call must happen before pcap_activate or it will fail.

#10 Updated by Davide Pesavento over 4 years ago

Apply this patch http://gerrit.named-data.net/1692 and see if it reports any packet drop.

#11 Updated by Davide Pesavento over 4 years ago

Davide Pesavento wrote:

Apply this patch http://gerrit.named-data.net/1692 and see if it reports any packet drop.

Make sure you build NFD in debug mode (./waf configure --debug)

#12 Updated by susmit shannigrahi over 4 years ago

what does NFD print before crashing? Please attach a full log file of a crashing scenario.

Attaching the log for crashing NFD at the producer side.

#13 Updated by Davide Pesavento over 4 years ago

This doesn't make any sense. The stack trace shows that the handleRead callback is invoked, and execution arrives at least as far as the async_read_some call before crashing. Thus there are two possibilities:

  1. We received something (a frame) and processed it in processIncomingPacket
  2. pcap_next_ex returned 0 (read timeout)

In both cases one or more log messages should have been printed, but I see nothing in the log file, so... what the hell?

Another thing:

1422555794.215244 INFO: [FaceTable] Removed face id=258 remote=ether://[01:00:5e:00:17:aa] local=dev://eno4

what happened here?

#14 Updated by Davide Pesavento over 4 years ago

Davide Pesavento wrote:

1422555794.215244 INFO: [FaceTable] Removed face id=258 remote=ether://[01:00:5e:00:17:aa] local=dev://eno4

what happened here?

ah, it turns out that the failure reason passed to Face::fail is never logged anywhere. I'll submit a patch shortly.

#15 Updated by susmit shannigrahi over 4 years ago

Apply this patch http://gerrit.named-data.net/1692 and see if it reports any packet drop.

Make sure you build NFD in debug mode (./waf configure --debug)

Davide, something is printing the packets when I compile with --debug. I see a bunch of binary output on the terminal (Mostly As).

#16 Updated by susmit shannigrahi over 4 years ago

Attaching a partial loss log. The pcap/kernel does drop packets.

#17 Updated by Davide Pesavento over 4 years ago

Well there's not much we can do then... NFD is single-threaded. Try increasing the buffer size as I said in note-9, and see at what point the packet loss stops (or is significantly reduced).

#18 Updated by susmit shannigrahi over 4 years ago

The loss is not reduced even after increasing the said buffer.

#19 Updated by Davide Pesavento over 4 years ago

What size did you set it to? If you're sending two 5 MB packets, I expect you won't see any drops with a 20 MB buffer.

How many fragments (ethernet frames) are being sent per second?

#20 Updated by Davide Pesavento over 4 years ago

  • Subject changed from NDNLP on Ethernet face drops fragments to EthernetFace drops frames under heavy load
  • Priority changed from Normal to Low
  • Target version set to Unsupported

#21 Updated by Davide Pesavento over 4 years ago

Is the crash still reproducible with the latest NFD?

#22 Updated by Junxiao Shi about 2 years ago

  • Status changed from New to Closed

I'll assume this issue has been solved because the reporter did not post a new log against the updated code.

#23 Updated by Davide Pesavento about 2 years ago

  • Subject changed from EthernetFace drops frames under heavy load to EthernetTransport drops frames under heavy load
  • Assignee set to Davide Pesavento
  • Priority changed from Low to Normal
  • Target version changed from Unsupported to v0.6

Actually, while working on the unicast ethernet feature, I noticed that the number of slots in the TPACKET ring buffer is way too small, which can cause frame drops if NFD is unable to handle incoming frames quickly enough.

This needs to be investigated. Please reopen.

#24 Updated by Junxiao Shi about 2 years ago

  • Status changed from Closed to New

#25 Updated by Davide Pesavento over 1 year ago

  • Target version deleted (v0.6)

#26 Updated by Davide Pesavento about 1 year ago

  • Description updated (diff)

#27 Updated by Davide Pesavento about 1 year ago

Davide Pesavento wrote:

the number of slots in the TPACKET ring buffer is way too small, which can cause frame drops if NFD is unable to handle incoming frames quickly enough.

This situation can be improved and packet loss alleviated by reducing the snaplen and/or increasing the buffer size.

Also available in: Atom PDF