Bug #5079

Packet retransmission issue because of LpReliability

Added by Chavoosh Ghasemi 16 days ago. Updated 10 days ago.

Status: Code review
Priority: Normal
Category: Faces
% Done: 100%

Description

There is evidence that LpReliability retransmits a packet multiple times before either the timeout at the sender (i.e., 200 ms) expires or the sender receives an Ack for a greater sequence number.
This problem has been seen between two testbed hubs where both sides have LpReliability on. It is worth mentioning that this problem is not deterministic and happens only for some Interests.

To show this issue, a client ndnpinged the Washington hub (wu) through the Arizona hub (hobo). Attached is the tcpdump pcap file that was collected on hobo.
Upon receiving Interest /ndn/edu/wustl/ping/1761450254637779364 from the client, hobo sends the Interest to wu. However, before the timeout expires or any Ack is received, hobo retransmits the same Interest 3 times. In return, wu sends back 4 Data packets and 12 Nacks.

Here is the output from pcap, particularly for /ndn/edu/wustl/ping/1761450254637779364:

No.  Time       Source           Destination      Protocol   Length  Info                                              delta
915  61.626371  10.132.245.39    128.196.203.36   UDP (NDN)  103     Interest /ndn/edu/wustl/ping/1761450254637779364  0.042070
916  61.626475  128.196.203.36   128.252.153.194  UDP (NDN)  119     Interest /ndn/edu/wustl/ping/1761450254637779364  0.000104
918  61.717688  128.196.203.36   128.252.153.194  UDP (NDN)  119     Interest /ndn/edu/wustl/ping/1761450254637779364  0.039450
928  61.808919  128.196.203.36   128.252.153.194  UDP (NDN)  119     Interest /ndn/edu/wustl/ping/1761450254637779364  0.049886
929  61.900236  128.196.203.36   128.252.153.194  UDP (NDN)  119     Interest /ndn/edu/wustl/ping/1761450254637779364  0.091317
930  62.067296  128.252.153.194  128.196.203.36   UDP (NDN)  165     Data /ndn/edu/wustl/ping/1761450254637779364      0.167060
931  62.067405  128.196.203.36   10.132.245.39    UDP (NDN)  137     Data /ndn/edu/wustl/ping/1761450254637779364      0.000109
933  62.100172  128.252.153.194  128.196.203.36   UDP (NDN)  140     Nack /ndn/edu/wustl/ping/1761450254637779364      0.027834
935  62.255842  128.252.153.194  128.196.203.36   UDP (NDN)  140     Nack /ndn/edu/wustl/ping/1761450254637779364      0.150629
937  62.282512  128.252.153.194  128.196.203.36   UDP (NDN)  153     Data /ndn/edu/wustl/ping/1761450254637779364      0.021628
938  62.284511  128.252.153.194  128.196.203.36   UDP (NDN)  140     Nack /ndn/edu/wustl/ping/1761450254637779364      0.001999
940  62.313064  128.252.153.194  128.196.203.36   UDP (NDN)  128     Nack /ndn/edu/wustl/ping/1761450254637779364      0.025509
942  62.458480  128.252.153.194  128.196.203.36   UDP (NDN)  128     Nack /ndn/edu/wustl/ping/1761450254637779364      0.140376
943  62.458967  128.252.153.194  128.196.203.36   UDP (NDN)  153     Data /ndn/edu/wustl/ping/1761450254637779364      0.000487
947  62.486449  128.252.153.194  128.196.203.36   UDP (NDN)  128     Nack /ndn/edu/wustl/ping/1761450254637779364      0.000866
949  62.508351  128.252.153.194  128.196.203.36   UDP (NDN)  128     Nack /ndn/edu/wustl/ping/1761450254637779364      0.016863
952  62.581047  128.252.153.194  128.196.203.36   UDP (NDN)  128     Nack /ndn/edu/wustl/ping/1761450254637779364      0.004247
954  62.630670  128.252.153.194  128.196.203.36   UDP (NDN)  153     Data /ndn/edu/wustl/ping/1761450254637779364      0.044582
956  62.651119  128.252.153.194  128.196.203.36   UDP (NDN)  128     Nack /ndn/edu/wustl/ping/1761450254637779364      0.015408
957  62.652113  128.252.153.194  128.196.203.36   UDP (NDN)  128     Nack /ndn/edu/wustl/ping/1761450254637779364      0.000994
960  62.678285  128.252.153.194  128.196.203.36   UDP (NDN)  128     Nack /ndn/edu/wustl/ping/1761450254637779364      0.010253
964  62.711556  128.252.153.194  128.196.203.36   UDP (NDN)  128     Nack /ndn/edu/wustl/ping/1761450254637779364      0.028228
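
To make the timing in the capture easier to read, the short sketch below recomputes the spacing between the four Interests that hobo sent to wu (frames 916, 918, 928, 929) and the wait until the first Data (frame 930). The timestamps are copied from the Time column above; the variable names are only illustrative. The gaps come out at roughly 91 ms each, well under the 200 ms timeout mentioned in the description.

```python
# Timestamps (seconds) of the Interests hobo sent to wu: frames 916, 918, 928, 929.
tx_times = [61.626475, 61.717688, 61.808919, 61.900236]
# Timestamp (seconds) of the first Data received from wu: frame 930.
first_data_time = 62.067296

# Spacing between consecutive transmissions of the same Interest, in milliseconds.
gaps_ms = [round((b - a) * 1000, 3) for a, b in zip(tx_times, tx_times[1:])]
# Time from the first transmission until the first Data arrived, in milliseconds.
wait_for_data_ms = round((first_data_time - tx_times[0]) * 1000, 3)

print("gaps between transmissions (ms):", gaps_ms)
print("first transmission to first Data (ms):", wait_for_data_ms)
```

Note that the delta column in the capture is relative to the previously captured frame, not to the previous copy of this Interest, which is why the spacing is recomputed from the Time column here.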

Files

hobo-wu-lp-reliability-2.tcpdump (211 KB), Chavoosh Ghasemi, 02/03/2020 09:59 AM

Related issues

Related to NFD - Feature #5080: Add logging to LpReliability (Closed)

History

#1

Updated by Eric Newberry 16 days ago

  • Assignee set to Eric Newberry

I think one of the first things we should do is add logging to LpReliability to help us trace down this issue.

#2

Updated by Eric Newberry 16 days ago

#3

Updated by Davide Pesavento 15 days ago

  • Category set to Faces
  • Target version set to v0.8
  • Start date deleted (02/03/2020)

#4

Updated by Eric Newberry 14 days ago

With the version of NFD currently deployed on the testbed (0.6.6), LpReliability makes use of nfd::RttEstimator, which has a minimum RTO of 1ms. As of 0.7.0, LpReliability relies upon ndn::util::RttEstimator from ndn-cxx to compute its estimated RTOs, which features the minimum RTO of 200ms expected in the issue description. Therefore, the scenario described in the issue description seems to be the expected behavior in 0.6.6.
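
As a rough illustration of why the minimum RTO matters here, the sketch below clamps an RFC 6298-style RTO (smoothed RTT plus four times the RTT variation) to a configurable floor. This is an assumption made for illustration only: it is not the code of nfd::RttEstimator or ndn::util::RttEstimator, the function name `compute_rto` is hypothetical, and the RTT values are made up. Only the two floor values (1 ms and 200 ms) come from the comment above.

```python
def compute_rto(srtt, rttvar, min_rto, k=4):
    """RFC 6298-style RTO: smoothed RTT plus k times the RTT variation, clamped to a floor."""
    return max(min_rto, srtt + k * rttvar)


# Hypothetical smoothed RTT and RTT variation (seconds), chosen only for illustration.
srtt, rttvar = 0.050, 0.010

# With a 1 ms floor the computed value (about 90 ms) is used as-is.
rto_low_floor = compute_rto(srtt, rttvar, min_rto=0.001)
# With a 200 ms floor the same measurements are clamped up to 200 ms.
rto_high_floor = compute_rto(srtt, rttvar, min_rto=0.200)

print(rto_low_floor, rto_high_floor)
```

Under these assumed inputs, the lower floor lets the sender retransmit after well under 200 ms, while the higher floor would hold the retransmission back until 200 ms have passed.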

#5

Updated by Davide Pesavento 14 days ago

Ok, that explains the aggressive retransmissions, but why are we seeing multiple Data/Nacks for the same Interest in some cases?

#6

Updated by Eric Newberry 14 days ago

Davide Pesavento wrote:

Ok, that explains the aggressive retransmissions, but why are we seeing multiple Data/Nacks for the same Interest in some cases?

I'm assuming that you're referring to a Data packet and a Nack for the same Interest, as seen at lines 480-481 and 483 in the attached pcap file, all in response to the Interests at lines 478-479? If so, it looks like the two Data packets are the first transmission and then a retransmission of the same Data after the first one timed out. Both are sent in response to the original Interest. The Nack then occurs because the Interest was retransmitted despite the first copy having been received successfully, so a duplicate Interest with the same nonce was processed and a Duplicate Nack was sent in response. The remaining string of Nacks and Data packets for the same Interest is probably caused by retransmission semantics due to the use of the older nfd::RttEstimator with the lower minimum RTO. My educated guess is that the low RTOs cause the relevant packets to be dropped from the retransmission queue under that TxSequence before they can be acknowledged (with those Acks likely being dropped since they refer to now-unknown TxSequences), causing further retransmissions.

#7

Updated by Davide Pesavento 14 days ago

Eric Newberry wrote:

Then, the Nack occurs because the Interest was retransmitted, despite the first copy having been received successfully, causing a duplicate Interest packet to be processed with the same nonce, causing a Duplicate Nack to be sent in response.

But the Nack is generated by higher layers, which means LpReliability is passing all the retransmitted Interests up to the forwarder/strategy. Is that the expected behavior? I believe LpReliability should be transparent and not cause additional network-layer packets to be seen by the upper layers. In other words, retransmissions should be filtered and dropped.

#8

Updated by Eric Newberry 14 days ago

Davide Pesavento wrote:

Eric Newberry wrote:

Then, the Nack occurs because the Interest was retransmitted, despite the first copy having been received successfully, causing a duplicate Interest packet to be processed with the same nonce, causing a Duplicate Nack to be sent in response.

But the Nack is generated by higher layers, which means LpReliability is passing all the retransmitted Interests up to the forwarder/strategy. Is that the expected behavior? I believe LpReliability should be transparent and not cause additional network-layer packets to be seen by the upper layers. In other words, retransmissions should be filtered and dropped.

Yes, this is seemingly the expected behavior, since LpReliability has no understanding of upper layer semantics (apart from reporting lost Interests to the strategy). I'll look further into the design to see if there was any discussion of this issue around the time of initial implementation.

#9

Updated by Eric Newberry 13 days ago

Eric Newberry wrote:

Davide Pesavento wrote:

Eric Newberry wrote:

Then, the Nack occurs because the Interest was retransmitted, despite the first copy having been received successfully, causing a duplicate Interest packet to be processed with the same nonce, causing a Duplicate Nack to be sent in response.

But the Nack is generated by higher layers, which means LpReliability is passing all the retransmitted Interests up to the forwarder/strategy. Is that the expected behavior? I believe LpReliability should be transparent and not cause additional network-layer packets to be seen by the upper layers. In other words, retransmissions should be filtered and dropped.

Yes, this is seemingly the expected behavior, since LpReliability has no understanding of upper layer semantics (apart from reporting lost Interests to the strategy). I'll look further into the design to see if there was any discussion of this issue around the time of initial implementation.

I looked at the original design document (https://docs.google.com/presentation/d/1cP1ya0oEUw1wjpF3SWqaUgLNLx5iH50Y7v2slVTvLOU) and it appears that deduplicating retransmissions was not considered in the design of this protocol.

#10

Updated by Davide Pesavento 13 days ago

Eric Newberry wrote:

Yes, this is seemingly the expected behavior, since LpReliability has no understanding of upper layer semantics (apart from reporting lost Interests to the strategy).

You don't need to understand upper layer semantics. A simple "packet identifier" that is kept the same for all LP-layer retransmissions should be sufficient.

#11

Updated by Eric Newberry 13 days ago

Davide Pesavento wrote:

Eric Newberry wrote:

Yes, this is seemingly the expected behavior, since LpReliability has no understanding of upper layer semantics (apart from reporting lost Interests to the strategy).

You don't need to understand upper layer semantics. A simple "packet identifier" that is kept the same for all LP-layer retransmissions should be sufficient.

You're right. This already exists as the Sequence field, although it is currently only included in packets that are fragmented, presumably to reduce the size of transmitted frames. I think we could use this to detect duplicates if we also included it when LpReliability is enabled on the link.

#12

Updated by Eric Newberry 13 days ago

  • Status changed from New to In Progress

I added a deduplication mechanism to the design of LpReliability. Received frames will be dropped if their sequence numbers have been seen within the last estimated RTO. The new design is here: https://docs.google.com/presentation/d/1FfT25zv936TVv5BzejYuQWJ6VRBgcnswfR9On7n3lFY/edit?usp=sharing

This design will require that Sequence numbers be generated for and included in LpPackets if LpReliability is enabled (currently Sequence numbers are only included if the LpPacket is fragmented).

#13

Updated by Eric Newberry 12 days ago

Per the NFD call on February 6, 2020, the following changes to the reliability protocol will be adopted to resolve the problems described in this issue:

  • The Sequence field is now required to be included in transmitted LpPackets when the LpReliability feature is enabled.
  • Received LpPackets will have their fragments dropped if their Sequence field matches the Sequence of an already received LpPacket on a link.
  • To enable the above, the receiver will keep track of received Sequence numbers for 1x the estimated link RTO at the time of receipt.

These changes are reflected in NDNLPv2 revision 57.
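
The three changes listed above can be sketched roughly as follows. This is a minimal illustration, not the NFD implementation: the class name `SequenceDeduplicator` and its method are hypothetical, time and RTO are passed in as plain numbers (seconds), and the sketch assumes that a duplicate does not extend the retention window of the Sequence it matches.

```python
class SequenceDeduplicator:
    """Receiver-side filter: drop frames whose Sequence was seen within one RTO."""

    def __init__(self):
        # Maps Sequence number -> expiry time (receipt time + RTO at time of receipt).
        self._expiry = {}

    def should_deliver(self, sequence, now, rto):
        """Return True if the frame should be passed up, False if it is a duplicate."""
        # Forget Sequences whose retention window (1x RTO) has passed.
        expired = [s for s, t in self._expiry.items() if t <= now]
        for s in expired:
            del self._expiry[s]
        if sequence in self._expiry:
            # Same Sequence seen recently: treat as a link-layer retransmission.
            return False
        self._expiry[sequence] = now + rto
        return True


dedup = SequenceDeduplicator()
results = [
    dedup.should_deliver(7, now=0.00, rto=0.2),  # first copy of Sequence 7
    dedup.should_deliver(7, now=0.09, rto=0.2),  # retransmission within the RTO
    dedup.should_deliver(8, now=0.10, rto=0.2),  # a different packet
    dedup.should_deliver(7, now=0.25, rto=0.2),  # retention window for 7 has expired
]
print(results)  # [True, False, True, True]
```

Because the filter keys only on the Sequence field, it needs no knowledge of upper-layer semantics; the retransmitted copy is dropped before the forwarder sees it, so no Duplicate Nack would be generated for it.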

#14

Updated by Eric Newberry 10 days ago

  • Status changed from In Progress to Code review
  • % Done changed from 0 to 100
