Bug #1769

closed

"error while connecting to forwarder" when using Face.put in a loop on large amount of data

Added by Anonymous over 10 years ago. Updated over 10 years ago.

Status: Abandoned
Priority: Normal
Category: Base
Target version:
Start date: 07/16/2014
Due date:
% Done: 70%
Estimated time:

Description

Using Face.put in a loop on a large amount of data (10MB) sometimes produces an "error while connecting to forwarder" exception.

We discovered the problem using a slightly modified version of ndnputchunks (see attached code). Reproduction is inconsistent, but should happen fairly regularly.


Files

ndnputchunks4.cpp (3.86 KB), Anonymous, 07/16/2014 02:23 PM

Related issues 1 (0 open, 1 closed)

Related to NFD - Task #1777: Serialization of write operation in socket stream (Closed, Alex Afanasyev, 07/18/2014)

Actions #1

Updated by Alex Afanasyev over 10 years ago

I believe this error is a result of the current async implementation of communication with the forwarder. Every time we "put" (or expressInterest), the most that can happen is that socket::async_send gets called. In my past experience, async_send can fail when it is called too many times without being properly processed by the io_service thread, as happens when Face::put is called in a loop.

This needs more careful evaluation and consideration. If it is the problem I'm thinking of, then I'm not even sure what the interface around it should be. There should be a way (signal, callback) to indicate that the face is now ready to accept a new Interest/Data, but how exactly can this be achieved?

Actions #2

Updated by Junxiao Shi over 10 years ago

  • Category set to Base

Confirmed. Simplified snippet:

// g++ bug1769.cpp `pkg-config --libs --cflags libndn-cxx`

#include <ndn-cxx/face.hpp>
#include <ndn-cxx/security/key-chain.hpp>

using namespace ndn;

int
main(int argc, char** argv)
{
  Face face;
  KeyChain keyChain;

  static uint8_t buffer[8000];

  for (int i = 0; i < 2000; ++i) {
    shared_ptr<Data> data = make_shared<Data>(Name("/A").appendSegment(i));
    data->setContent(buffer, sizeof(buffer));
    keyChain.sign(*data);
    BOOST_ASSERT(data->wireEncode().size() < 8800);
    face.put(*data);
  }

  face.processEvents();
}

Observations:

  • The bug does not appear if the same Data is passed to face.put 2000 times.
  • The bug does not appear if keyChain.sign is replaced with keyChain.signWithSha256.
Actions #3

Updated by Alex Afanasyev over 10 years ago

  • Status changed from New to In Progress
  • Assignee set to Alex Afanasyev
  • Target version set to v0.2

A few observations. This code works if packet sizes are <= 8192 bytes. Also, the same code works perfectly if switched to use the TCP transport (Face face("localhost")).

I tracked this to a much bigger problem with the way we use Boost.Asio. And this applies to NFD as well. In particular, socket::async_send we are using is not guaranteed to send all the supplied data as I was incorrectly assuming.

The suggested way is to use the boost::asio::async_write free function. However, we cannot simply replace async_send with it, since it requires that no other write calls happen until async_write finishes. The current code does not guarantee this, and we will need to add some form of queueing.
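The queueing idea can be sketched without Boost at all. The block below is a minimal simulation, not the ndn-cxx or NFD implementation: `MockSocket`, `SerializedWriter`, `send` and `poll` are hypothetical names, and the mock socket simply accepts a limited number of bytes per call to imitate a short write. The point it illustrates is that callers only enqueue packets, and the next packet is not started until the previous one has been written in full.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical stand-in for a stream socket: each call accepts at most
// maxChunk bytes, imitating a short write. Not a Boost.Asio type.
struct MockSocket {
  std::size_t maxChunk;  // must be > 0
  std::string wire;      // every byte "sent" so far, in order

  std::size_t writeSome(const char* data, std::size_t len)
  {
    std::size_t n = std::min(len, maxChunk);
    wire.append(data, n);
    return n;
  }
};

// Hypothetical write queue: at most one packet is in flight at a time.
class SerializedWriter {
public:
  explicit SerializedWriter(MockSocket& socket)
    : m_socket(socket)
  {
  }

  // Called by the application (e.g. from a put() loop); only enqueues.
  void send(std::string packet)
  {
    m_queue.push_back(std::move(packet));
  }

  // One event-loop step: write one chunk of the packet at the front.
  // Returns true while packets remain queued.
  bool poll()
  {
    assert(m_socket.maxChunk > 0);
    if (m_queue.empty()) {
      return false;
    }
    const std::string& front = m_queue.front();
    m_offset += m_socket.writeSome(front.data() + m_offset,
                                   front.size() - m_offset);
    if (m_offset == front.size()) {
      // Front packet fully written; only now may the next one start.
      m_queue.pop_front();
      m_offset = 0;
    }
    return !m_queue.empty();
  }

private:
  MockSocket& m_socket;
  std::deque<std::string> m_queue;
  std::size_t m_offset = 0;  // bytes of the front packet already written
};
```

Queuing three packets and calling `poll()` until it returns false leaves them on the mock wire back to back, with no bytes of one packet interleaved into another. With real asynchronous I/O, the same bookkeeping would live in the write completion handler instead of `poll()`.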

Actions #4

Updated by Davide Pesavento over 10 years ago

Alex Afanasyev wrote:

I tracked this to a much bigger problem with the way we use Boost.Asio. And this applies to NFD as well. In particular, socket::async_send we are using is not guaranteed to send all the supplied data as I was incorrectly assuming.

Yes, apparently async_send behaves exactly like async_write_some.

The suggested way is to use the boost::asio::async_write free function. However, we cannot simply replace async_send with it, since it requires that no other write calls happen until async_write finishes. The current code does not guarantee this, and we will need to add some form of queueing.

More simply, can we keep calling async_write_some (that does not have this requirement) with the same buffer (+ offset) until the bytes_transferred argument passed to the completion handler equals the number of bytes remaining?
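A rough sketch of that retry loop, written synchronously and against a mock sink rather than a real socket: `CappedSink` and `writeAll` are hypothetical names, and the per-call cap is an assumption chosen only to imitate a write that accepts fewer bytes than requested.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical sink that accepts at most `cap` bytes per call.
struct CappedSink {
  std::size_t cap;   // must be > 0
  std::string wire;  // bytes accepted so far

  std::size_t operator()(const char* data, std::size_t len)
  {
    assert(cap > 0);
    std::size_t n = std::min(len, cap);
    wire.append(data, n);
    return n;
  }
};

// Re-issue the write with the same buffer plus an offset until nothing
// remains. Returns the number of write calls that were needed.
template<typename WriteSome>
std::size_t
writeAll(WriteSome& writeSome, const std::string& packet)
{
  std::size_t offset = 0;
  std::size_t calls = 0;
  while (offset < packet.size()) {
    std::size_t n = writeSome(packet.data() + offset, packet.size() - offset);
    if (n == 0) {
      throw std::runtime_error("write made no progress");
    }
    offset += n;
    ++calls;
  }
  return calls;
}
```

With a cap of 8192 bytes, an 8800-byte packet takes two calls: 8192 bytes, then the remaining 608. In the asynchronous version, the loop body would run inside the completion handler, using bytes_transferred to advance the offset. This loop does not by itself stop a second packet from being written between the two calls.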

Actions #5

Updated by Alex Afanasyev over 10 years ago

We can call that, but the problem is that we need to prevent other async_send calls from being scheduled in between. This is the problem I'm trying to think of a solution for.

Actions #6

Updated by Alex Afanasyev over 10 years ago

  • % Done changed from 0 to 70
Actions #7

Updated by Alex Afanasyev over 10 years ago

  • Related to Task #1777: Serialization of write operation in socket stream added
Actions #8

Updated by Alex Afanasyev over 10 years ago

  • Target version changed from v0.2 to v0.3

Steve, can you verify that you don't have the problem anymore (with the master branch)?

Actions #9

Updated by Anonymous over 10 years ago

Still getting the problem when running our demo on Ubuntu 14.04. Basically, we have a script that runs nfs-start, sleeps 2 seconds, and then spins off 6 ndnputchunks4 (previously attached) publishers.

I'm now also noticing the following assertion failure from NRD. I think this is new, but not 100% sure:

nrd: /usr/local/include/ndn-cxx/management/nfd-control-parameters.hpp:358: const milliseconds& ndn::nfd::ControlParameters::getExpirationPeriod() const: Assertion `this->hasExpirationPeriod()' failed.

Actions #10

Updated by Anonymous over 10 years ago

Steve DiBenedetto wrote:
...nfs-start...

nfd-start

Actions #11

Updated by Alex Afanasyev over 10 years ago

Are you using the release branch of both the library and NFD? This assert could happen if NFD is on release and the library is on master...

Actions #12

Updated by Alex Afanasyev over 10 years ago

Actually, you need to use the master branch on both, since the error is fixed in master only.

Actions #13

Updated by Anonymous over 10 years ago

My fault. I'm using master for both, but NFD was a little behind. I've updated NFD to the latest master and the assertion failure is gone. However, the "error while connecting to forwarder" problem remains.

Actions #14

Updated by Alex Afanasyev over 10 years ago

I suspect that creating the data packets for 10 MB takes more than 4 seconds. What you can do for now is change catchunks to create all data prior to the initial put call, or space out data creation (with the scheduler).

The reason is that the first put will initiate the connection, but nothing happens until you call processEvents() to do the work, while the internal scheduler remembers the connection initiation time. If more than the default 4 seconds pass, you will get an error.
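A toy model of this explanation, using a simulated clock. It only illustrates the hypothesis stated in this note; it is not ndn-cxx code, and whether the real library behaves this way is not established here. `ToyFace`, `runDemo` and the per-packet cost are hypothetical; the 4000 ms constant stands for the default timeout mentioned above.

```cpp
#include <cassert>
#include <stdexcept>

// Assumed default connection timeout, per the note above.
const long kConnectTimeoutMs = 4000;

// Hypothetical model of a face whose first put() starts a connection
// timer, while the connection can only complete inside processEvents().
class ToyFace {
public:
  void put(long nowMs)
  {
    if (!m_connecting) {
      m_connecting = true;
      m_connectStartMs = nowMs;  // timer starts at the first put
    }
    ++m_pending;
  }

  // Returns the number of packets flushed; throws if the connection
  // attempt is already older than the timeout.
  int processEvents(long nowMs)
  {
    assert(nowMs >= m_connectStartMs);
    if (m_connecting && nowMs - m_connectStartMs > kConnectTimeoutMs) {
      throw std::runtime_error("error while connecting to forwarder");
    }
    int sent = m_pending;
    m_pending = 0;
    return sent;
  }

private:
  bool m_connecting = false;
  long m_connectStartMs = 0;
  int m_pending = 0;
};

// Simulates creating `packets` Data packets at `msPerPacket` each.
// prepareFirst == false: create and put in the same loop.
// prepareFirst == true: create everything, then put, then process.
// Returns true if processEvents() succeeds.
bool
runDemo(int packets, long msPerPacket, bool prepareFirst)
{
  ToyFace face;
  long clockMs = 0;
  if (prepareFirst) {
    clockMs += packets * msPerPacket;  // all creation cost paid up front
  }
  for (int i = 0; i < packets; ++i) {
    if (!prepareFirst) {
      clockMs += msPerPacket;  // creation cost paid between puts
    }
    face.put(clockMs);
  }
  try {
    face.processEvents(clockMs);
    return true;
  }
  catch (const std::runtime_error&) {
    return false;
  }
}
```

In this model, 2000 packets at an assumed 3 ms each fail when creation and put share one loop (about 6 seconds pass after the first put), and succeed when all packets are prepared before the first put.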

Actions #15

Updated by Junxiao Shi over 10 years ago

I doubt the workaround in note-14 can help.

I changed the snippet in note-2 as follows:

  • total 20 Data packets
  • add sleep(5) before face.processEvents()

And there is no error.

There is also no error running note-2 snippet unchanged.

Actions #16

Updated by susmit shannigrahi over 10 years ago

Are there any updates on this?
Thanks.

Actions #17

Updated by Alex Afanasyev over 10 years ago

Have you tried the suggestion I made (preparing data packets first, and then putting them to face)?

Actions #18

Updated by susmit shannigrahi over 10 years ago

I am not getting the error with the latest version of ndn-cxx/NFD. I tried with and without the fix Alex suggested.
Could not reproduce either way.

Actions #19

Updated by Junxiao Shi over 10 years ago

  • Status changed from In Progress to Abandoned

This bug is gone after recent ndn-cxx and NFD update, as reported in note-18 and note-15.
