Bug #1769: "error while connecting to forwarder" when using Face.put in a loop on large amount of data - ndn-cxx - NDN project issue tracking system

Bug #1769

closed

"error while connecting to forwarder" when using Face.put in a loop on large amount of data

Added by Anonymous about 11 years ago. Updated almost 11 years ago.

Status:

Abandoned

Priority:

Normal

Assignee:

Alex Afanasyev

Category:

Base

Target version:

v0.3

Start date:

07/16/2014

Due date:

% Done:

70%

Estimated time:

Description

Using Face.put in a loop on a large amount of data (10MB) sometimes produces an "error while connecting to forwarder" exception.

We discovered the problem using a slightly modified version of ndnputchunks (see attached code). Reproduction is inconsistent, but should happen fairly regularly.

Files

ndnputchunks4.cpp (3.86 KB)

Anonymous, 07/16/2014 02:23 PM

Related issues: 1 (0 open, 1 closed)

Updated by Alex Afanasyev about 11 years ago

I believe this error is a result of the current asynchronous implementation of communication with the forwarder. Every time we call put (or expressInterest), the most that happens is that socket::async_send is called. In the past I have seen async_send fail when it is called too many times without the io_service thread processing the queued operations, which is exactly what happens when Face::put is called in a loop.

This needs more careful evaluation. If it is the problem I'm thinking of, then I'm not even sure what the interface around it should be. There should be a way (a signal or callback) to indicate that the face is ready to accept new Interests/Data, but how exactly can this be achieved?

Updated by Junxiao Shi about 11 years ago

Category set to Base

Confirmed. Simplified snippet:

// g++ bug1769.cpp `pkg-config --libs --cflags libndn-cxx`

#include <ndn-cxx/face.hpp>
#include <ndn-cxx/security/key-chain.hpp>

using namespace ndn;

int
main(int argc, char** argv)
{
  Face face;
  KeyChain keyChain;

  static uint8_t buffer[8000];

  for (int i = 0; i < 2000; ++i) {
    shared_ptr<Data> data = make_shared<Data>(Name("/A").appendSegment(i));
    data->setContent(buffer, sizeof(buffer));
    keyChain.sign(*data);
    BOOST_ASSERT(data->wireEncode().size() < 8800);
    face.put(*data);
  }

  face.processEvents();
}

Observations:

The bug does not appear if the same Data is passed to face.put 2000 times.
The bug does not appear if keyChain.sign is replaced with keyChain.signWithSha256.

Updated by Alex Afanasyev about 11 years ago

Status changed from New to In Progress
Assignee set to Alex Afanasyev
Target version set to v0.2

A few observations. This code works if the packet size is <= 8192 bytes. Also, the same code works perfectly when switched to the TCP transport (Face face("localhost")).

I tracked this down to a much bigger problem with the way we use Boost.Asio, which applies to NFD as well. In particular, the socket::async_send we are using is not guaranteed to send all of the supplied data, as I had incorrectly assumed.

The suggested way is to use the boost::asio::async_write free function. However, we cannot simply replace async_send with it, because no other write calls may be issued on the socket until async_write finishes. The current code does not guarantee this, so we will need to add some form of queueing.
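The following plain C++ simulation (no Boost) shows the partial-send behavior described above. FakeStreamSocket and sendOnce are hypothetical names for this sketch. They are not ndn-cxx or Boost.Asio APIs. The 8192-byte per-call cap is an assumption, chosen to mirror the threshold observed in this thread. Real limits depend on the OS and the socket buffer sizes. A single write_some-style call can accept only part of an 8800-byte packet, and if the caller ignores the returned count, the remaining bytes are never sent.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a stream socket. Like socket::async_send and
// async_write_some, one call may accept only part of the buffer.
// ASSUMPTION: the 8192-byte default cap mirrors the threshold observed in
// this thread; real limits depend on the OS and socket buffer sizes.
class FakeStreamSocket
{
public:
  explicit
  FakeStreamSocket(std::size_t maxPerCall = 8192)
    : m_maxPerCall(maxPerCall)
  {
    assert(maxPerCall > 0);
  }

  // Accepts up to maxPerCall bytes and returns how many were accepted.
  std::size_t
  writeSome(const char* data, std::size_t len)
  {
    std::size_t n = std::min(len, m_maxPerCall);
    m_received.append(data, n);
    return n;
  }

  const std::string&
  received() const
  {
    return m_received;
  }

private:
  std::size_t m_maxPerCall;
  std::string m_received;
};

// Mirrors the problematic pattern: one send call per packet, with the
// transferred byte count ignored by the caller.
std::size_t
sendOnce(FakeStreamSocket& socket, const std::string& packet)
{
  return socket.writeSome(packet.data(), packet.size());
}
```

In this model, an 8800-byte packet gets 8192 bytes accepted and the last 608 bytes never reach the socket, so the receiver sees a truncated packet. An 8000-byte packet goes through in a single call.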

Updated by Davide Pesavento about 11 years ago

Alex Afanasyev wrote:

> I tracked this to a much bigger problem with the way we use Boost.Asio. And this applies to NFD as well. In particular, socket::async_send we are using is not guaranteed to send all the supplied data as I was incorrectly assuming.

Yes, apparently async_send behaves exactly like async_write_some.

> The suggested way is to use boost::async_write free function. However, we cannot simply replace async_send with it, since there is a requirement that until async_write finishes, there are no other write calls happen. The current code does not guarantee this and we will need to add some form of queueing.

More simply, can we keep calling async_write_some (which does not have this requirement) with the same buffer (+ offset) until the bytes_transferred argument passed to the completion handler equals the number of bytes remaining?
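The suggestion above can be sketched as a plain C++ simulation rather than real Boost.Asio code. FakeIoService is a minimal single-threaded event loop that stands in for io_service. FakeIoService, FakeAsyncSocket and asyncWriteAll are hypothetical names, and the per-call cap is an assumption. The completion handler reissues the write on the same buffer at offset + bytes_transferred until no bytes remain. The buffer is held by a shared_ptr, so it outlives every pending operation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Minimal single-threaded event loop standing in for io_service
// (hypothetical; only post() and run() are modeled).
class FakeIoService
{
public:
  void
  post(std::function<void()> handler)
  {
    m_handlers.push_back(std::move(handler));
  }

  // Runs queued handlers (including newly posted ones) until none are left.
  void
  run()
  {
    while (!m_handlers.empty()) {
      auto handler = std::move(m_handlers.front());
      m_handlers.pop_front();
      handler();
    }
  }

private:
  std::deque<std::function<void()>> m_handlers;
};

// Hypothetical socket: each async_write_some-style call accepts at most
// maxPerCall bytes and reports that count to its completion handler.
class FakeAsyncSocket
{
public:
  FakeAsyncSocket(FakeIoService& io, std::size_t maxPerCall)
    : m_io(io)
    , m_maxPerCall(maxPerCall)
  {
    assert(maxPerCall > 0);
  }

  void
  asyncWriteSome(const char* data, std::size_t len,
                 std::function<void(std::size_t)> handler)
  {
    std::size_t n = std::min(len, m_maxPerCall);
    m_io.post([this, data, n, handler] {
      m_received.append(data, n);
      handler(n);
    });
  }

  const std::string&
  received() const
  {
    return m_received;
  }

private:
  FakeIoService& m_io;
  std::size_t m_maxPerCall;
  std::string m_received;
};

// Keep calling write_some on the same buffer at an increasing offset until
// every byte has been transferred, then invoke onDone.
void
asyncWriteAll(FakeAsyncSocket& socket, std::shared_ptr<const std::string> buffer,
              std::size_t offset, std::function<void()> onDone)
{
  assert(offset <= buffer->size());
  if (offset == buffer->size()) {
    onDone();
    return;
  }
  socket.asyncWriteSome(buffer->data() + offset, buffer->size() - offset,
    [&socket, buffer, offset, onDone] (std::size_t bytesTransferred) {
      asyncWriteAll(socket, buffer, offset + bytesTransferred, onDone);
    });
}
```

In this model, nothing is written until run() drains the event loop. The sketch handles only a single packet. Nothing prevents a second asyncWriteAll from starting on the same socket while the first one is still in progress.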

Updated by Alex Afanasyev about 11 years ago

We can do that, but the problem is that we need to prevent other async_send calls from being scheduled in between. This is the problem I'm trying to think of a solution for.
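Here is one possible shape for the queueing, sketched under the same assumptions as a plain C++ simulation. FakeIoService, FakeAsyncSocket, naiveSendAll and QueuedTransport are hypothetical names. This is not necessarily how Task #1777 implemented it. naiveSendAll runs an independent write_some loop for each packet, so the pieces of two packets can interleave on the stream. QueuedTransport::send() only enqueues a packet, and it starts writing the next packet only after the previous one has been fully transferred.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Minimal single-threaded event loop standing in for io_service (hypothetical).
class FakeIoService
{
public:
  void post(std::function<void()> h) { m_handlers.push_back(std::move(h)); }

  void
  run()
  {
    while (!m_handlers.empty()) {
      auto h = std::move(m_handlers.front());
      m_handlers.pop_front();
      h();
    }
  }

private:
  std::deque<std::function<void()>> m_handlers;
};

// Hypothetical socket: each write_some-style call accepts at most maxPerCall bytes.
class FakeAsyncSocket
{
public:
  FakeAsyncSocket(FakeIoService& io, std::size_t maxPerCall)
    : m_io(io), m_maxPerCall(maxPerCall) { assert(maxPerCall > 0); }

  void
  asyncWriteSome(const char* data, std::size_t len, std::function<void(std::size_t)> h)
  {
    std::size_t n = std::min(len, m_maxPerCall);
    m_io.post([this, data, n, h] { m_received.append(data, n); h(n); });
  }

  const std::string& received() const { return m_received; }

private:
  FakeIoService& m_io;
  std::size_t m_maxPerCall;
  std::string m_received;
};

// Unserialized: every packet gets its own write_some loop, so the loops of
// two packets sent back to back can interleave on the stream.
void
naiveSendAll(FakeAsyncSocket& socket, std::shared_ptr<const std::string> packet,
             std::size_t offset = 0)
{
  if (offset == packet->size()) {
    return;
  }
  socket.asyncWriteSome(packet->data() + offset, packet->size() - offset,
    [&socket, packet, offset] (std::size_t n) { naiveSendAll(socket, packet, offset + n); });
}

// Serialized: at most one packet is being written at any time.
class QueuedTransport
{
public:
  explicit QueuedTransport(FakeAsyncSocket& socket) : m_socket(socket) {}

  void
  send(std::string packet)
  {
    bool isIdle = m_queue.empty();
    m_queue.push_back(std::move(packet));
    if (isIdle) {
      writeFront(0); // otherwise the in-progress write will reach this packet
    }
  }

private:
  void
  writeFront(std::size_t offset)
  {
    const std::string& packet = m_queue.front();
    if (offset == packet.size()) {
      m_queue.pop_front(); // this packet is done; start the next one, if any
      if (!m_queue.empty()) {
        writeFront(0);
      }
      return;
    }
    // deque::push_back never relocates existing elements, so packet.data()
    // stays valid while this write is pending.
    m_socket.asyncWriteSome(packet.data() + offset, packet.size() - offset,
      [this, offset] (std::size_t n) { writeFront(offset + n); });
  }

  FakeAsyncSocket& m_socket;
  std::deque<std::string> m_queue;
};
```

Take a per-call cap of 8192 bytes and two 10000-byte packets. The naive version writes 8192 bytes of the first packet, then 8192 bytes of the second, then the two 1808-byte tails. The queued version writes the first packet completely before starting the second. A real implementation would also need to handle write errors and socket closure, which this sketch omits.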

Updated by Alex Afanasyev about 11 years ago

% Done changed from 0 to 70

Updated by Alex Afanasyev about 11 years ago

Related to Task #1777: Serialization of write operation in socket stream added

Updated by Alex Afanasyev about 11 years ago

Target version changed from v0.2 to v0.3

Steve, can you verify that you no longer have the problem (with the master branch)?

Updated by Anonymous about 11 years ago

Still getting the problem when running our demo on Ubuntu 14.04. Basically, we have a script that runs nfs-start, sleeps 2 seconds, and then spins off 6 ndnputchunks4 (previously attached) publishers.

I'm now also noticing the following assertion failure from NRD. I think this is new, but I'm not 100% sure:

nrd: /usr/local/include/ndn-cxx/management/nfd-control-parameters.hpp:358: const milliseconds& ndn::nfd::ControlParameters::getExpirationPeriod() const: Assertion `this->hasExpirationPeriod()' failed.

Updated by Anonymous about 11 years ago

Steve DiBenedetto wrote:
> ...nfs-start...

nfd-start

Updated by Alex Afanasyev about 11 years ago

Are you using the release branch of both the library and NFD? This assertion could fail if NFD is on the release branch and the library is on master...

Updated by Alex Afanasyev about 11 years ago

Actually, you need to use the master branch for both, since the error is fixed only in master.

Updated by Anonymous about 11 years ago

My fault. I'm using master for both, but NFD was a little behind. I've updated NFD to the latest master and the assertion failure is gone. However, the "error while connecting to forwarder" problem remains.

Updated by Alex Afanasyev about 11 years ago

I suspect that creating the Data packets for 10 MB takes more than 4 seconds. For now, you can change catchunks to create all of the Data before the initial put call, or space out the Data creation (with the scheduler).

The reason is that the first put initiates the connection, but nothing happens until processEvents() gets a chance to do the work, while the internal scheduler has already recorded the connection initiation time. If more than the default 4 seconds pass, you get an error.
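The timing problem described above can be modeled in the same plain C++ style. FakeFace, ManualClock, createAndPutInLoop and createAllThenPut are illustrative names, not ndn-cxx APIs. In this hypothetical model, the first put() only records when the connection was initiated. processEvents() then reports the connection error if more than the 4-second default has elapsed. The 3 ms per-packet creation cost used in the test is an assumption for illustration, not a measured signing time.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

using std::chrono::milliseconds;

// Hypothetical manual clock so the model is deterministic.
struct ManualClock
{
  milliseconds now{0};
};

// Hypothetical model: the first put() only *initiates* the connection and
// records the time; the connection is processed in processEvents(), which
// fails if more than connectTimeout has passed since initiation.
// ASSUMPTION: the 4-second default mirrors the value quoted in this thread.
class FakeFace
{
public:
  explicit
  FakeFace(const ManualClock& clock, milliseconds connectTimeout = milliseconds(4000))
    : m_clock(clock)
    , m_connectTimeout(connectTimeout)
  {
  }

  void
  put(const std::string& packet)
  {
    if (!m_connectStart) {
      m_connectStart = m_clock.now; // connection initiated, not yet completed
    }
    m_pending.push_back(packet);
  }

  // Returns the number of packets "sent"; throws if the connection timed out.
  std::size_t
  processEvents()
  {
    if (m_connectStart && m_clock.now - *m_connectStart > m_connectTimeout) {
      throw std::runtime_error("error while connecting to forwarder");
    }
    std::size_t n = m_pending.size();
    m_pending.clear();
    return n;
  }

private:
  const ManualClock& m_clock;
  milliseconds m_connectTimeout;
  std::optional<milliseconds> m_connectStart;
  std::vector<std::string> m_pending;
};

// Pattern from the snippet: create (sign) each packet and put it right away,
// calling processEvents() only after the loop.
std::size_t
createAndPutInLoop(FakeFace& face, ManualClock& clock, int count, milliseconds costPerPacket)
{
  for (int i = 0; i < count; ++i) {
    clock.now += costPerPacket; // time spent creating/signing packet i
    face.put("packet " + std::to_string(i));
  }
  return face.processEvents();
}

// Suggested workaround: create every packet before the first put().
std::size_t
createAllThenPut(FakeFace& face, ManualClock& clock, int count, milliseconds costPerPacket)
{
  std::vector<std::string> packets;
  for (int i = 0; i < count; ++i) {
    clock.now += costPerPacket;
    packets.push_back("packet " + std::to_string(i));
  }
  for (const auto& packet : packets) {
    face.put(packet); // the connection is initiated only at this point
  }
  return face.processEvents();
}
```

With 2000 packets at 3 ms each, the loop version initiates the connection at 3 ms but reaches processEvents() at 6000 ms, so it fails. Building every packet first means the connection is initiated at 6000 ms and processed immediately.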
