Bug #5118
Timeout/packet loss on docker
Status: Closed
% Done: 0%
Description
I am not sure if this is a tools problem or an NFD problem.
Here is what is happening:
We are trying to containerize NDN and deploy it on the Pacific Research Platform (PRP). The same setup works on other platforms, but here packet retrieval starts and then stops after about 50 packets.
Sometimes we see the route being dropped; other times the route is still there but packets are dropped.
We looked at the pcap and the NFD logs, but everything looks normal. We tried both TCP and UDP tunnels, and the same problem occurs.
This is reproducible every time we run the commands.
Image: cbmckni/ndn-tools:latest
/usr/local/bin/nfd -c /workspace/ndn/nfd.conf > /workspace/ndn/debug.log 2>&1
nfdc face create tcp4://atmos-csu.research-lan.colostate.edu | tee -a /logs/ndn-debug.log
nfdc route add /BIOLOGY tcp4://atmos-csu.research-lan.colostate.edu | tee -a /logs/ndn-debug.log
ndncatchunks -v /BIOLOGY/SRA/9605/9606/NaN/RNA-Seq/ILLUMINA/TRANSCRIPTOMIC/PAIRED/Kidney/PRJNA359795/SRP095950/SRX2458154/SRR5139395/SRR5139395_1 |& tee -a /logs/ndn-debug.log
# failed after ~50 segments, route/face removed for some reason
ndncatchunks -v /BIOLOGY/SRA/9605/9606/NaN/RNA-Seq/ILLUMINA/TRANSCRIPTOMIC/PAIRED/Kidney/PRJNA359795/SRP095950/SRX2458154/SRR5139395/SRR5139395_1 |& tee -a /logs/ndn-debug.log
# failed immediately, no route
nfdc face create tcp4://atmos-csu.research-lan.colostate.edu | tee -a /logs/ndn-debug.log
nfdc route add /BIOLOGY tcp4://atmos-csu.research-lan.colostate.edu | tee -a /logs/ndn-debug.log
ndncatchunks -v /BIOLOGY/SRA/9605/9606/NaN/RNA-Seq/ILLUMINA/TRANSCRIPTOMIC/PAIRED/Kidney/PRJNA359795/SRP095950/SRX2458154/SRR5139395/SRR5139395_1 |& tee -a /logs/ndn-debug.log
# failed after ~50 segments, but route/face not removed this time
ndncatchunks -v /BIOLOGY/SRA/9605/9606/NaN/RNA-Seq/ILLUMINA/TRANSCRIPTOMIC/PAIRED/Kidney/PRJNA359795/SRP095950/SRX2458154/SRR5139395/SRR5139395_1 |& tee -a /logs/ndn-debug.log
# failed immediately, timeout
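For anyone reproducing this: the DEBUG-level [nfd.GenericLinkService] messages quoted later in the thread require raising NFD's log verbosity first. A minimal sketch, assuming the standard ndn-cxx NDN_LOG environment variable is honored by this NFD build (the log section of nfd.conf can achieve the same); the paths are the ones from the commands above:
export NDN_LOG='nfd.GenericLinkService=DEBUG'
/usr/local/bin/nfd -c /workspace/ndn/nfd.conf > /workspace/ndn/debug.log 2>&1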
Files
Updated by Junxiao Shi over 4 years ago
- File 20200613.pcap added
I tried the published container and couldn't reproduce the bug.
It seems that the Data packet is missing on the server, so the server responds with a Nack.
tcpdump is attached.
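For context, a capture equivalent to the attached one can be taken on the NFD host with plain tcpdump; the interface name below is illustrative, and 6363 is the default port for both TCP and UDP NDN tunnels:
tcpdump -i eth0 -w 20200613.pcap port 6363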
Updated by susmit shannigrahi over 4 years ago
It seems that the Data packet is missing on the server, so the server responds with a Nack.
I verified that no packets are missing. If we pull with a simple pipeline (-s 5), the retrieval completes. This has something to do with rate control.
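For concreteness, the working pull was along these lines; this is a sketch that assumes ndncatchunks' fixed-window pipeline (selected with --pipeline-type fixed), whose window size is set with -s/--pipeline-size:
ndncatchunks -v --pipeline-type fixed -s 5 /BIOLOGY/SRA/9605/9606/NaN/RNA-Seq/ILLUMINA/TRANSCRIPTOMIC/PAIRED/Kidney/PRJNA359795/SRP095950/SRX2458154/SRR5139395/SRR5139395_1 |& tee -a /logs/ndn-debug.log
With the default adaptive pipeline, the window presumably grows well beyond 5 outstanding Interests, which would be consistent with a rate-control issue.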
ndnping fails if we push it too hard (1 ms interval). Once it fails, NFD needs to be restarted.
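The failing ping test was presumably along these lines; the prefix is hypothetical and assumes an ndnpingserver is reachable behind it, -i is the probe interval in milliseconds, and -c the number of probes:
ndnping -i 1 -c 1000 /BIOLOGY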
Please find the logs attached; I do see link-layer congestion marks.
Updated by susmit shannigrahi over 4 years ago
- File nfd.log added
- File debug.pcap added
- File ndn.log added
Successful logs.
Updated by susmit shannigrahi over 4 years ago
- File nfd.log added
- File debug.pcap added
- File ndn.log added
Failed logs - client side.
Updated by susmit shannigrahi over 4 years ago
- File server-side-nfd.log added
NFD log, server side. The pcap is too large to attach; I can host it somewhere if needed.
1592253354.344069 DEBUG: [nfd.GenericLinkService] [id=301,local=tcp4://129.82.175.10:6363,remote=tcp4://209.129.248.194:41640] Send queue length dropped below congestion threshold
1592253502.193208 DEBUG: [nfd.GenericLinkService] [id=301,local=tcp4://129.82.175.10:6363,remote=tcp4://209.129.248.194:41640] Send queue length dropped below congestion threshold
1592253510.434830 DEBUG: [nfd.GenericLinkService] [id=301,local=tcp4://129.82.175.10:6363,remote=tcp4://209.129.248.194:41640] Send queue length dropped below congestion threshold
1592253510.476897 DEBUG: [nfd.GenericLinkService] [id=301,local=tcp4://129.82.175.10:6363,remote=tcp4://209.129.248.194:41640] Send queue length dropped below congestion threshold
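For reference, the congestion threshold in these messages is a per-face parameter of GenericLinkService. A sketch of how it can be tuned on a face created via nfdc, assuming the congestion-marking options available in recent nfdc versions (values are in milliseconds and bytes and are chosen only for illustration, not as a recommendation):
nfdc face create tcp4://atmos-csu.research-lan.colostate.edu congestion-marking on congestion-marking-interval 100 default-congestion-threshold 65536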
Updated by Davide Pesavento over 4 years ago
- Status changed from New to Rejected
Based on the June 11th discussion on Slack, this seems to be a problem with the underlying network infrastructure, either hardware or software. Closing.