Bug #3902: ndncatchunks with AIMD pipeline randomly hangs after transfer is complete - ndn-tools - NDN project issue tracking system

Actions

Copy link

Bug #3902

open

ndncatchunks with AIMD pipeline randomly hangs after transfer is complete

Added by Shuo Yang about 9 years ago. Updated over 2 years ago.

Status:

New

Priority:

Normal

Assignee:

Start date:

Due date:

% Done:

Estimated time:

Description

$ ndnputchunks /foo < foo.txt

ndncatchunks -v -p aimd /foo

......
32.5547 233.996
32.5548 234
32.5548 234.005
32.5548 234.009
32.5549 234.014

All segments have been received.
Total # of segments received: 171
Time used: 586.431 ms
Total # of packet loss burst: 1
Packet loss rate: 0.00584795
Total # of retransmitted segments: 49
Goodput: 10.1834 Mbit/s
^C
vagrant@ubuntu1604-node1:~$

The size of foo.txt is 736KB, can be larger. Very small file won't trigger the issue because it only happens when there are retransmissions.
It doesn't happen when redirecting to files.

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by Davide Pesavento about 9 years ago

Can this issue be reproduced with a test case?

It doesn't happen when redirecting to files.

This sounds very strange.

Actions

Copy link

Updated by Shuo Yang about 9 years ago

You mean a Boost unit test case? I'll see if I can reproduce it with a test case.

Actions

Copy link

Updated by Anonymous about 9 years ago

The size of foo.txt is 736KB, can be larger. Very small file won't trigger the issue because it only happens when there are retransmissions.
It doesn't happen when redirecting to files.

I did some tests and found the opposite to be true.

I used two variables:

File size: small (19 Bytes), large (3MB)
Output: to console, redirected to file.

The results were that only large files redirected to an output file worked properly:

	Console	To File
Small File	hangs	hangs
Large File	hangs	works

Does writing to a file take (slightly) more time than writing output on the console?

In this case the problem only occurs when the program is done before some threshold of running time. Maybe some thread/process is exiting before another is finished?

Actions

Copy link

Updated by Anonymous about 9 years ago

Looks like it is caused by the retransmission scheduler still running after the rest of the program is done:

m_scheduler.scheduleEvent(m_options.rtoCheckInterval, [this] { checkRto(); });

Indeed, cancelling the scheduling timer after receiving the finalBlockId fixed the problem for all of my 4 test cases:

if (m_hasFinalBlockId && segNo > m_lastSegmentNo && !isRetransmission) { m_scheduler.cancelAllEvents(); return; }

@Shuo: Can you check if this works for you and then patch it in gerrit?

Actions

Copy link

Updated by Anonymous about 9 years ago

Update: My suggestion above avoids the problem, but also prevents any retransmissions.

Thus, a more practical solution should only cancel the timer when the whole file has been retrieved. This is already done in handleData():

if (m_hasFinalBlockId && m_nReceived - 1 >= m_lastSegmentNo) { // all segments have been received
  cancel();
  if (m_options.isVerbose || m_options.outputSummary) {
    printSummary();
  }
}

However, I found that for very short files (1 chunk), handleData() is never called. That is because sendInterest returns after exceeding m_lastSegmentNo:

if (m_hasFinalBlockId && segNo > m_lastSegmentNo && !isRetransmission)
  return;

However, due to the exclusion of the "m_excludedSegmentNo" the initial sequence number is 1:

uint64_t
PipelineInterestsAimd::getNextSegmentNo()
{
  // get around the excluded segment
  if (m_nextSegmentNo == m_excludedSegmentNo)
    m_nextSegmentNo++;
  return m_nextSegmentNo++;
}

I found that removing the above if statement (and first m_nextSegmentNo++) fixed the problem for small files.

For large files the problem still seems to be caused by retx, as Shuo pointed out. The difference between output on the console and redirecting to a file was that the latter somehow avoided retransmissions:

../ndn-tools2/build/bin/ndncatchunks -v -d fixed -t aimd /big/%FD%00%00%01YuDq%D7 

All segments have been received.
Total # of segments received: 693
Time used: 672.743 ms
Total # of packet loss burst: 1
Packet loss rate: 0.001443
Total # of retransmitted segments: 192
Goodput: 36.1796 Mbit/s

../ndn-tools2/build/bin/ndncatchunks -v -d fixed -t aimd /big/%FD%00%00%01YuDq%D7 > output.txt

All segments have been received.
Total # of segments received: 693
Time used: 108.058 ms
Total # of packet loss burst: 0
Packet loss rate: 0
Total # of retransmitted segments: 0
Goodput: 225.246 Mbit/s

Actions

Copy link

Updated by Anonymous about 9 years ago

I also noticed that the doCancel() method doesn't remove all the pending Interests:

void
PipelineInterestsAimd::doCancel()
{
  for (const auto& entry : m_segmentInfo) {
    const SegmentInfo& segInfo = entry.second;
    m_face.removePendingInterest(segInfo.interestId);
  }

  m_segmentInfo.clear();
  m_scheduler.cancelAllEvents();
}

Received all segments!
Face pending Interests: 1
doCancel()
Face pending Interests: 1

Running m_face.removeAllPendingInterests(); also doesn't help.

Does this help anyone to diagnose the underlying problem?

If not, maybe we should move on and use the workaround of m_face.getIoService().stop();

Actions

Copy link

Updated by Davide Pesavento about 9 years ago

Klaus Schneider wrote:

However, I found that for very short files (1 chunk), handleData() is never called. That is because sendInterest returns after exceeding m_lastSegmentNo:

That is normal. Version discovery has already fetched one (the only) segment, so there's no need to send another interest to re-fetch what we already have. That is the reason why m_excludedSegmentNo is skipped.

Actions

Copy link

Updated by Davide Pesavento about 9 years ago

From a very quick look at the code, it seems we forget to keep track of the pending interest for a retransmitted segment in PipelineInterestsAimd::sendInterest().
https://github.com/named-data/ndn-tools/blob/master/tools/chunks/catchunks/pipeline-interests-aimd.cpp#L171

Actions

Copy link

Updated by Anonymous about 9 years ago

Davide Pesavento wrote:

Klaus Schneider wrote:

However, I found that for very short files (1 chunk), handleData() is never called. That is because sendInterest returns after exceeding m_lastSegmentNo:

That is normal. Version discovery has already fetched one (the only) segment, so there's no need to send another interest to re-fetch what we already have. That is the reason why m_excludedSegmentNo is skipped.

In this case, the check for "all segments have been received" (which then runs cancel()) should also be included in the sendInterest() method.

Actions

Copy link

#10

Updated by Anonymous about 9 years ago

Davide Pesavento wrote:

From a very quick look at the code, it seems we forget to keep track of the pending interest for a retransmitted segment in PipelineInterestsAimd::sendInterest().
https://github.com/named-data/ndn-tools/blob/master/tools/chunks/catchunks/pipeline-interests-aimd.cpp#L171

Does this change anything, considering that m_face.removeAllPendingInterests(); and m_scheduler.cancelAllEvents(); aren't able to shut down the face?

Does each interestId have to be removed manually?

Actions

Copy link

#11

Updated by Shuo Yang about 9 years ago

It's not a problem with stdout. I tested a larger file (40MB) with redirecting to file and the program also hangs. It's definitely has something to do with retx. I tested the same data with my original implementation of AIMD pipeline (https://github.com/imsure/ndnchunks_with_congestion_control/blob/master/chunks-updated/tools/chunks/catchunks/pipeline-interests.cpp), the program never hangs. There might be some subtleties in the current code that cause the problem.

Actions

Copy link

#12

Updated by Anonymous about 9 years ago

I agree that (half of) the problem is caused by retransmissions.

Can you:

Try out my fix for 1-chunk files in note-9
Look into Davide's comment about "keeping track of the pending interest" in note-8?

Actions

Copy link

#13

Updated by Shuo Yang about 9 years ago

Klaus Schneider wrote:

I agree that (half of) the problem is caused by retransmissions.

Can you:

Try out my fix for 1-chunk files in note-9

Look into Davide's comment about "keeping track of the pending interest" in note-8?

Yes, for 1-chunk file, there's a bug in the current code. will fix that.
I was aware of this problem and fixed this, but it's not the cause of the problem.

Actions

Copy link

#14

Updated by Shuo Yang about 9 years ago

I compared the current code and my original code, one major difference I found is that I didn't use minRTO in my original code, while the current code uses 200ms as minRTO. For the link with small BDP, this value (200ms) will overrun the network. I think this is the root cause of the problem (though I cannot tell what exactly happened in the network that causes the consumer to hang).

I changed minRTO value to 20ms and it worked well for both small and large files. I think the RTO calculation is self-adaptive and we should give the system flexibility to self-adjust within a broad predefined range.

Actions

Copy link

#15

Updated by Anonymous about 9 years ago

Shuo Yang wrote:

I compared the current code and my original code, one major difference I found is that I didn't use minRTO in my original code, while the current code uses 200ms as minRTO. For the link with small BDP, this value (200ms) will overrun the network. I think this is the root cause of the problem (though I cannot tell what exactly happened in the network that causes the consumer to hang).

I changed minRTO value to 20ms and it worked well for both small and large files. I think the RTO calculation is self-adaptive and we should give the system flexibility to self-adjust within a broad predefined range.

We have established that the problem occurs on retransmissions.

Now you are looking for a way to avoid retransmissions, which indeed will fix the issues in your test case. However, in any real scenario retransmissions are unavoidable. They will occur on a number of problems like bit errors or congestion.

Instead of trying to avoid retransmissions (by tweeking the minRTO) we should try to make the program exit properly when they occur. I still think Davide's note-8 is the best starting point we have so far.

Actions

Copy link

#16

Updated by Davide Pesavento about 9 years ago

Shuo Yang wrote:

I compared the current code and my original code, one major difference I found is that I didn't use minRTO in my original code, while the current code uses 200ms as minRTO. For the link with small BDP, this value (200ms) will overrun the network. I think this is the root cause of the problem (though I cannot tell what exactly happened in the network that causes the consumer to hang).

I changed minRTO value to 20ms and it worked well for both small and large files. I think the RTO calculation is self-adaptive and we should give the system flexibility to self-adjust within a broad predefined range.

This is completely beside the point. The consumer process must run and terminate correctly regardless of any specific minRto or maxRto or any other configuration values. A too high or too low setting cannot be the cause of this bug. The root cause is somewhere else.

Actions

Copy link

#17

Updated by Anonymous about 9 years ago

I did some more testing and I still can't figure out what causes the problem.

The exact same command works sometimes and sometimes it doesn't (some debug output below):

../ndn-tools/build/bin/ndncatchunks -t aimd -d fixed --aimd-rto-min 50  -v /big/%FD%00%00%01Y%A0%F6%8F%10

When it works:

%%EOFReceived all segments: 693, set: 692!
Face pending Interests: 0
PI: Cancel!
doCancel()
DC: Face pending Interests: 0
Face pending Interests: 0

All segments have been received.
Total # of segments sent: 794
Total # of segments received: 693
Time used: 594.106 ms
Total # of packet loss burst: 5
MaxRetx per packet: 1
RetxQueue size: 0, segmentinfo size: 101, inFlight: 0, rto: 50 milliseconds
Packet loss rate: 0.00721501
Total # of retransmitted segments: 101
Goodput: 40.9685 Mbit/s
Start Checking RTO!!!
CheckRTO Is Stopping!!!
After faceProcessEvents()

When it doesn't work:

%%EOFReceived all segments: 693, set: 692!
Face pending Interests: 1
PI: Cancel!
doCancel()
DC: Face pending Interests: 1
Face pending Interests: 1

All segments have been received.
Total # of segments sent: 800
Total # of segments received: 693
Time used: 661.172 ms
Total # of packet loss burst: 6
MaxRetx per packet: 1
RetxQueue size: 0, segmentinfo size: 107, inFlight: 0, rto: 173.123 milliseconds
Packet loss rate: 0.00865801
Total # of retransmitted segments: 107
Goodput: 36.8128 Mbit/s
Received data: 692, set: 693!
Data Stopped: Face pending Interests: 0
Start Checking RTO!!!
CheckRTO Is Stopping!!!

Seams like there are remaining pending Interests and the cancel() method can't remove them. The pending Interest disappears later, but the face doesn't shutdown properly.

Actions

Copy link

#18

Updated by Anonymous about 9 years ago

I found a different bug though: The program exits sometimes without having received every data chunk.

The reason is that m_nReceived counts every received data packet, even if it is a duplicate of a previously received one. This can be fixed by using a set:

m_receivedSet.emplace(data.getName().at(-1).toSegment()); 

if (m_hasFinalBlockId && m_receivedSet.size() - 0 >= m_lastSegmentNo) { // all segments have been received
      cancel();
}

Actions

Copy link

#19

Updated by Anonymous about 9 years ago

I talked with Beichuan about the issue, and we think it's maybe better to just go with the workaround of shutting down the face manually for now.

Actions

Copy link

#20

Updated by Davide Pesavento about 9 years ago

So I studied the code of PipelineInterestsAimd, and I think I understand it fairly well now. A few questions for which I don't have a good answer follow, maybe Shuo can explain:

at the beginning of sendInterest() there is this check:

  if (m_hasFinalBlockId && segNo > m_lastSegmentNo && !isRetransmission)
    return;

Why !isRetransmission? IIUC the logic here is like "if we know the last segment number and we try to fetch a segment after it, we reject the request and do nothing", which makes sense, but why does being a retransmission matter? why do we still send an Interest for a segment number greater than the last one if it's a retx?

Why do we need to track (i.e. keep in m_segmentInfo map) segments in RetxReceived state? Can't we just erase them once we receive the Data? Note that we don't do anything similar for segments that were received on the first try (i.e. without retransmissions), those are simply erased from the map.
As confirmed by a comment in handleData(), we do not use the RTT for retransmitted segments. The check is:

  if (segInfo.state == SegmentState::FirstTimeSent ||
      segInfo.state == SegmentState::InRetxQueue) { // do not sample RTT for retransmitted segments
    ...
    m_rttEstimator.addMeasurement(...); // record RTT
    ...
  }

However, if a segment interest is retransmitted more than once because it always times out, every time it's put on the retx queue its state will be SegmentState::InRetxQueue. Now, if a matching Data packet arrives at that time (which is possible because the "timeout" is a local client-only decision, the interest can still be in the PIT), the "if" condition above will be true, and the RTT estimator will record the RTT of a retransmitted packet! I suspect this is a bug.

Actions

Copy link

#21

Updated by Shuo Yang about 9 years ago

Sorry for the delayed response!

1) yes, the condition of !isRetransmission is not necessary, it doesn't appear in my original code, don't know why Weiwei added this.

2) the reason we track the RetxReceived state is because it's possible that multiple handleData() will be called for the retxed Interest, one for the original one, the other one for the retxed one. This happens even if we call removePendingInterest() before retransmission. I suspect is there something wrong with removePendingInterest()? It doesn't give a return value so cannot check if it succeeds or not.

3) I think you are right. We shouldn't check segInfo.state == SegmentState::InRetxQueue, it won't work for Interests retransmitted more than once.

Actions

Copy link

#22

Updated by Davide Pesavento about 9 years ago

Shuo Yang wrote:

2) the reason we track the RetxReceived state is because it's possible that multiple handleData() will be called for the retxed Interest, one for the original one, the other one for the retxed one. This happens even if we call removePendingInterest() before retransmission. I suspect is there something wrong with removePendingInterest()? It doesn't give a return value so cannot check if it succeeds or not.

But why do you need to keep it in the map? Even if handleData is called twice, the first time it will erase the segment info from the map, thus the second time it won't find anything and the callback will just return without doing anything.

3) I think you are right. We shouldn't check segInfo.state == SegmentState::InRetxQueue, it won't work for Interests retransmitted more than once.

We can't simply delete that condition, or we might lose valid RTT measurements for segments that are satisfied when they are already InRetxQueue but before they had a chance to be retransmitted. I think we need to look at the retxCount in this case.

Actions

Copy link

#23

Updated by Anonymous about 9 years ago

Related to Bug #3957: Face::processEvents blocks forever if no packet is sent added

Actions

Copy link

#24

Updated by Anonymous about 8 years ago

Related to Bug #4439: ndncatchunks fails to terminate on zero-octet collection added

Actions

Copy link

#25

Updated by Davide Pesavento about 8 years ago

Subject changed from ndncatchunks with AIMD pipeline hangs when outputting the transferred file to stdout to ndncatchunks with AIMD pipeline randomly hangs after transfer is complete
Priority changed from High to Normal
Start date deleted (~~12/23/2016~~)

Is Shuo still working on this?

Actions

Copy link

#26

Updated by Anonymous about 8 years ago

Davide Pesavento wrote:

Is Shuo still working on this?

No, Shuo has been gone since a year or so.

Actions

Copy link

#27

Updated by Davide Pesavento about 8 years ago

Assignee deleted (~~Shuo Yang~~)

Actions

Copy link

#28

Updated by Ryan Wickman almost 8 years ago

Assignee set to Ryan Wickman

Actions

Copy link

#29

Updated by Ryan Wickman almost 8 years ago

Status changed from New to In Progress

Actions

Copy link

#30

Updated by Ryan Wickman over 7 years ago

Status changed from In Progress to Code review

Actions

Copy link

#31

Updated by Junxiao Shi over 7 years ago

Description updated (diff)

I'm unable to reproduce this error with ndn-tools:commit:89db73c44333dba5ad3cc0a4926ac50646768ab9 on Ubuntu 16.04. ndnputchunks and ndncatchunks are running on two tabs inside screen program.
stderr for 256MB was redirected to a file. stdout as well as stderr for 4MB were not redirected.

4MB file:

All segments have been received.
Time elapsed: 9049.04 milliseconds
Total # of segments received: 931
Total size: 4096kB
Goodput: 3.621159 Mbit/s
Total # of lost/retransmitted segments: 112 (caused 3 window decreases)
Packet loss rate: 10.7486%
Total # of received congestion marks: 1
RTT min/avg/max = 0.586/26.912/898.029 ms

256MB file:

All segments have been received.
Time elapsed: 860620 milliseconds
Total # of segments received: 58182
Total size: 256000kB
Goodput: 2.379679 Mbit/s
Total # of lost/retransmitted segments: 1057 (caused 20 window decreases)
Packet loss rate: 1.78433%
Total # of received congestion marks: 422
RTT min/avg/max = 0.510/39.208/351278.797 ms

Actions

Copy link

#32

Updated by Ryan Wickman over 7 years ago

Junxiao Shi wrote:

I'm unable to reproduce this error with ndn-tools:commit:89db73c44333dba5ad3cc0a4926ac50646768ab9 on Ubuntu 16.04. ndnputchunks and ndncatchunks are running on two tabs inside screen program.
stderr for 256MB was redirected to a file. stdout as well as stderr for 4MB were not redirected.

4MB file:
All segments have been received.
Time elapsed: 9049.04 milliseconds
Total # of segments received: 931
Total size: 4096kB
Goodput: 3.621159 Mbit/s
Total # of lost/retransmitted segments: 112 (caused 3 window decreases)
Packet loss rate: 10.7486%
Total # of received congestion marks: 1
RTT min/avg/max = 0.586/26.912/898.029 ms
256MB file:
All segments have been received.
Time elapsed: 860620 milliseconds
Total # of segments received: 58182
Total size: 256000kB
Goodput: 2.379679 Mbit/s
Total # of lost/retransmitted segments: 1057 (caused 20 window decreases)
Packet loss rate: 1.78433%
Total # of received congestion marks: 422
RTT min/avg/max = 0.510/39.208/351278.797 ms

That is because the file must be less than chunks size (i.e. it will only send exactly 1 segment). Both of the files you sent are more than 1 chunk size, therefore the bug will not be reproduced. Try sending an empty file.

Actions

Copy link

#33

Updated by Junxiao Shi over 7 years ago

That is because the file must be less than chunks size (i.e. it will only send exactly 1 segment). Both of the files you sent are more than 1 chunk size, therefore the bug will not be reproduced. Try sending an empty file.

The issue description, as quoted below, calls for a large file, not an empty file that qualifies as "very small file".

The size of foo.txt is 736KB, can be larger. Very small file won't trigger the issue because it only happens when there are retransmissions.

Actions

Copy link

#34

Updated by Ryan Wickman over 7 years ago

After looking into it further I realize I mixed this bug up with Bug #4439. I apologize, got a bit confused due to this being a related issue and the comments for this issue assisting me with finding the solution to that bug.

Actions

Copy link

#35

Updated by Davide Pesavento over 7 years ago

Status changed from Code review to New

Actions

Copy link

#36

Updated by Ryan Wickman over 7 years ago

My progress so far:

Fixed chunks so that it will correctly receive all segments before attempting to terminate.
- I now use a set rather than just incrementing number received.
- The original problem was a race condition when two Data segments, with the same number, would arrive to PipelineInterestAimd::handleData() at the same time
Fixed the issue were interestId was not updated in PipelineInterestAimd::m_segmentInfo for retransmissions
Regardless of if there are pending Interest, the process will still hang.
Tried m_scheduler.cancelAllEvents() and m_face.removeAllPendingInterest() in PipelineInterestsAimd::doCancel()
Noticed that DataFetcher::cancel() was never getting called so I call it in DataFetcher::handleData(), DataFetcher::handleTimeout(), and DataFetcher::handleNack().
Since m_schedulercould be preventing the face from shutting down, I print out all the times an event is scheduled. I then checked to see if it could be scheduling an event even after PipelineInterestsAimd::doCancel() is called. But this is not the case.
I'm using Boost Handler Tracking which writes debugging output to the standard error stream.
I've tried using GDB to see what is causing the blocking and it seems the socket is the only thing, atleast in the reactor queue.
I have multiple log files for the non-terminating and terminating transfers.
- The only difference I have noticed between terminating and non-terminating logs is that the socket closes in the non-terminating but not in the terminating (e.g., @asio|1534876042.907770|0|[email protected])
To reproduce the non-terminating transfer I had to use a brute force approach:
1. sudo dd if=/dev/urandom of=/tmp/testfile bs=1k count=2000; cat /tmp/testfile | ndnputchunks -s 1024 ndn:/localhost/demo/gpl3
2. for i in {1..50}; do ndncatchunks -v -d iterative --aimd-rto-min 0 ndn:/localhost/demo/gpl3 1> /dev/null; done

Actions

Copy link

#37

Updated by Junxiao Shi almost 7 years ago

Does this still occur after #3860 is merged?

Actions

Copy link

#38

Updated by Davide Pesavento over 2 years ago

Assignee deleted (~~Ryan Wickman~~)

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

NFD » ndn-tools

Tags

Bug #3902

ndncatchunks with AIMD pipeline randomly hangs after transfer is complete

Updated by Davide Pesavento about 9 years ago

Updated by Shuo Yang about 9 years ago

Updated by Anonymous about 9 years ago

Updated by Anonymous about 9 years ago

Updated by Anonymous about 9 years ago

Updated by Anonymous about 9 years ago

Updated by Davide Pesavento about 9 years ago

Updated by Davide Pesavento about 9 years ago

Updated by Anonymous about 9 years ago

Updated by Anonymous about 9 years ago

Updated by Shuo Yang about 9 years ago

Updated by Anonymous about 9 years ago

Updated by Shuo Yang about 9 years ago

Updated by Shuo Yang about 9 years ago

Updated by Anonymous about 9 years ago

Updated by Davide Pesavento about 9 years ago

Updated by Anonymous about 9 years ago

Updated by Anonymous about 9 years ago

Updated by Anonymous about 9 years ago

Updated by Davide Pesavento about 9 years ago

Updated by Shuo Yang about 9 years ago

Updated by Davide Pesavento about 9 years ago

Updated by Anonymous about 9 years ago

Updated by Anonymous about 8 years ago

Updated by Davide Pesavento about 8 years ago

Updated by Anonymous about 8 years ago

Updated by Davide Pesavento about 8 years ago

Updated by Ryan Wickman almost 8 years ago

Updated by Ryan Wickman almost 8 years ago

Updated by Ryan Wickman over 7 years ago

Updated by Junxiao Shi over 7 years ago

Updated by Ryan Wickman over 7 years ago

Updated by Junxiao Shi over 7 years ago

Updated by Ryan Wickman over 7 years ago

Updated by Davide Pesavento over 7 years ago

Updated by Ryan Wickman over 7 years ago

Updated by Junxiao Shi almost 7 years ago

Updated by Davide Pesavento over 2 years ago