Bug #5118

Timeout/packet loss on docker

Added by susmit shannigrahi over 1 year ago. Updated about 1 year ago.

Status: Rejected
Priority: Normal
Assignee: -
Start date:
Due date:
% Done: 0%
Estimated time:

Description

I am not sure if this is a tools problem or NFD problem.

Here is what is happening:

We are trying to containerize NDN and deploy it on the Pacific Research Platform (PRP). The same setup works on other platforms, but here packet retrieval starts and then stops after about 50 packets.
Sometimes the route is dropped; other times the route is still there but packets are dropped.

We looked at the pcap and NFD logs but everything looks normal. We tried both TCP and UDP tunnels and the same problem occurs.

This is reproducible every time we run the commands.

Image: cbmckni/ndn-tools:latest

/usr/local/bin/nfd -c /workspace/ndn/nfd.conf > /workspace/ndn/debug.log 2>&1

nfdc face create tcp4://atmos-csu.research-lan.colostate.edu | tee -a /logs/ndn-debug.log

nfdc route add /BIOLOGY tcp4://atmos-csu.research-lan.colostate.edu | tee -a /logs/ndn-debug.log

ndncatchunks -v /BIOLOGY/SRA/9605/9606/NaN/RNA-Seq/ILLUMINA/TRANSCRIPTOMIC/PAIRED/Kidney/PRJNA359795/SRP095950/SRX2458154/SRR5139395/SRR5139395_1 |& tee -a /logs/ndn-debug.log

# failed after ~50 segments, route/face removed for some reason

ndncatchunks -v /BIOLOGY/SRA/9605/9606/NaN/RNA-Seq/ILLUMINA/TRANSCRIPTOMIC/PAIRED/Kidney/PRJNA359795/SRP095950/SRX2458154/SRR5139395/SRR5139395_1 |& tee -a /logs/ndn-debug.log

# failed immediately, no route

nfdc face create tcp4://atmos-csu.research-lan.colostate.edu | tee -a /logs/ndn-debug.log

nfdc route add /BIOLOGY tcp4://atmos-csu.research-lan.colostate.edu | tee -a /logs/ndn-debug.log

ndncatchunks -v /BIOLOGY/SRA/9605/9606/NaN/RNA-Seq/ILLUMINA/TRANSCRIPTOMIC/PAIRED/Kidney/PRJNA359795/SRP095950/SRX2458154/SRR5139395/SRR5139395_1 |& tee -a /logs/ndn-debug.log

# failed after ~50 segments, but route/face not removed this time

ndncatchunks -v /BIOLOGY/SRA/9605/9606/NaN/RNA-Seq/ILLUMINA/TRANSCRIPTOMIC/PAIRED/Kidney/PRJNA359795/SRP095950/SRX2458154/SRR5139395/SRR5139395_1 |& tee -a /logs/ndn-debug.log

# failed immediately, timeout

Files

putchunks.log (423 Bytes), susmit shannigrahi, 06/10/2020 03:31 PM
catchunks-debug.log (4 KB), susmit shannigrahi, 06/10/2020 03:31 PM
serverside-nfd.log (1.05 MB), susmit shannigrahi, 06/10/2020 03:31 PM
client-side-nfd--debug.log (382 KB), susmit shannigrahi, 06/10/2020 03:31 PM
client-side-pcap.pcap (744 KB), susmit shannigrahi, 06/10/2020 03:32 PM
20200613.pcap (2.89 KB), Junxiao Shi, 06/13/2020 01:21 PM
nfd.log (712 KB), susmit shannigrahi, 06/22/2020 06:24 AM
ndn.log (31 KB), susmit shannigrahi, 06/22/2020 06:24 AM
debug.pcap (3.87 MB), susmit shannigrahi, 06/22/2020 06:24 AM
debug.pcap (64 KB), susmit shannigrahi, 06/22/2020 06:24 AM
ndn.log (143 KB), susmit shannigrahi, 06/22/2020 06:24 AM
nfd.log (4.77 MB), susmit shannigrahi, 06/22/2020 06:24 AM
server-side-nfd.log (791 KB), susmit shannigrahi, 06/22/2020 06:38 AM

#1

Updated by Junxiao Shi over 1 year ago

I tried the published container and couldn't reproduce the bug.
It seems that the Data packet is missing on the server, so that the server responds with Nack.
tcpdump is attached.

#2

Updated by susmit shannigrahi over 1 year ago

> It seems that the Data packet is missing on the server, so that the server responds with Nack.

I verified that no packet is missing. If we pull with a simple pipeline (-s 5), it completes. This has something to do with rate control.

ndnping fails if we push it too hard (1ms interval). Once it fails, nfd needs to be restarted.

Please find the logs attached. I do see link-layer congestion markings.

#3

Updated by susmit shannigrahi over 1 year ago

Successful logs.

#4

Updated by susmit shannigrahi over 1 year ago

Failed logs - client side.

#5

Updated by susmit shannigrahi over 1 year ago

NDN log server side - pcap is too large. I can host it somewhere if needed.

1592253354.344069 DEBUG: [nfd.GenericLinkService] [id=301,local=tcp4://129.82.175.10:6363,remote=tcp4://209.129.248.194:41640] Send queue length dropped below congestion threshold
1592253502.193208 DEBUG: [nfd.GenericLinkService] [id=301,local=tcp4://129.82.175.10:6363,remote=tcp4://209.129.248.194:41640] Send queue length dropped below congestion threshold
1592253510.434830 DEBUG: [nfd.GenericLinkService] [id=301,local=tcp4://129.82.175.10:6363,remote=tcp4://209.129.248.194:41640] Send queue length dropped below congestion threshold
1592253510.476897 DEBUG: [nfd.GenericLinkService] [id=301,local=tcp4://129.82.175.10:6363,remote=tcp4://209.129.248.194:41640] Send queue length dropped below congestion threshold
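
For anyone going through the larger attached logs, here is a minimal sketch of how the congestion-threshold events above could be tallied per face. The function name summarize_congestion is hypothetical and not part of NFD or ndn-tools. It only assumes the log line layout shown in this comment: an epoch timestamp in the first field, and a face ID written as id=NNN.

```shell
# Hypothetical helper, not part of NFD or ndn-tools.
# Reads NFD debug log lines on stdin and prints, for each face ID:
#   <face id> <event count> <first timestamp> <last timestamp>
summarize_congestion() {
  awk '
    /Send queue length dropped below congestion threshold/ {
      # The face ID appears as "[id=301,local=..."; take the digits after "id=".
      if (match($0, /id=[0-9]+/)) {
        id = substr($0, RSTART + 3, RLENGTH - 3)
        # Appending "" keeps the timestamp as the original string.
        if (!(id in count)) first[id] = $1 ""
        count[id]++
        last[id] = $1 ""
      }
    }
    END {
      for (id in count)
        printf "%s %d %s %s\n", id, count[id], first[id], last[id]
    }
  '
}
```

Usage would be something like `summarize_congestion < server-side-nfd.log`. With more than one face in the log, the output order is not guaranteed, so pipe it through sort if that matters.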

#6

Updated by Davide Pesavento about 1 year ago

  • Subject changed from Timeout/packet loss on docker to Timeout/packet loss on docker
  • Status changed from New to Rejected

Based on the June 11th discussion on Slack, this seems to be a problem with the underlying network infrastructure, either hardware or software. Closing.
