Project

General

Profile

Actions

Bug #3650

closed

nfd-status not working

Added by Junxiao Shi almost 8 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
High
Category:
Tools
Target version:
Start date:
06/14/2016
Due date:
% Done:

100%

Estimated time:

Description

Environment: Ubuntu 14.04

nfd-status does not display NFD status information, but returns immediately without any output or error message.

However nfd-status -h is working properly.


Files

nfd-status-strace.txt (53.4 KB) nfd-status-strace.txt Alex Afanasyev, 06/14/2016 06:01 PM
Actions #1

Updated by Junxiao Shi almost 8 years ago

  • Description updated (diff)
  • Assignee set to Alex Afanasyev
  • Priority changed from Normal to High

Priority is increased to Urgent because there are already five complaints.


Ashlesh has performed a bisect and finds that the last good version is NFD:commit:ace83ac9384da037a36695888e829d7337ce36b0 which means the first bad version is NFD:commit:ab49745afe0967eb59d1005efce92894c66fcf7f.

The only change that might affect nfd-status is in wscript:

conf.check_boost(lib=boost_libs, mt=True)

mt=True is added in this commit.
I wonder, what's the implication when NFD has mt=True while ndn-cxx does not?

This line is added by Alex in patchset7, so I'm assigning this issue to Alex.

Actions #2

Updated by Alex Afanasyev almost 8 years ago

  • Status changed from New to In Progress
Actions #3

Updated by Alex Afanasyev almost 8 years ago

I was finally reproduce the problem in clean Ubuntu 14.04 environment by just compiling master commits of ndn-cxx and NFD. Few observations:

  • nfd-status exits with code 0, but prints nothing
  • strace nfd-status attached
  • currently, all tools depend on core-objects, which for nfd-status results in the following command line
  /usr/bin/g++ -pthread tools/nfd-status.cpp.3.o core/city-hash.cpp.2.o core/config-file.cpp.2.o core/global-io.cpp.2.o core/manager-base.cpp.2.o core/network-interface.cpp.2.o core/network.cpp.2.o core/privilege-helper.cpp.2.o core/random.cpp.2.o core/scheduler.cpp.2.o core/logger-factory.cpp.2.o core/logger.cpp.2.o -o /home/vagrant/NFD/build/bin/nfd-status -Wl,-Bstatic -L/usr/lib/x86_64-linux-gnu -Wl,-Bdynamic -L/usr/lib/x86_64-linux-gnu -L/usr/local/lib -lboost_system -lboost_chrono -lboost_program_options -lboost_random -lboost_thread -lboost_log -lboost_log_setup -lndn-cxx -lboost_system -lboost_filesystem -lboost_date_time -lboost_iostreams -lboost_regex -lboost_program_options -lboost_chrono -lboost_random -lcryptopp -lsqlite3 -lrt -lpthread -lrt -lresolv

Manually removing all core/* dependencies, fixes the issue.

  • ndn-cxx example (I copied consumer.cpp into tools/) when compiled with core/*, stops working (Face::processEvents() returns immediately)

  • Ha. Calling Face::getIoService()::run() makes the app to work correctly.

  • But I'm really puzzled. Here are two code snippets:

    • First
    Face m_face;
    m_face.expressInterest(Name("/temp"),
                           bind([] { std::cerr << "ON DATA" << std::endl; }),
                           bind([] { std::cerr << "ON TIMEOUT" << std::endl; }));
    std::cerr << "RUN indirectly" << std::endl;
    m_face.processEvents();
    std::cerr << "END RUN indirectly" << std::endl;
  • Second
    Face m_face;
    m_face.expressInterest(Name("/temp"),
                           bind([] { std::cerr << "ON DATA" << std::endl; }),
                           bind([] { std::cerr << "ON TIMEOUT" << std::endl; }));
    std::cerr << "RUN indirectly" << std::endl;
    m_face.processEvents();
    std::cerr << "END RUN indirectly" << std::endl;

    std::cerr << "RUN directly" << std::endl;
    m_face.getIoService().run();
    std::cerr << "END RUN directly" << std::endl;

Output of the first:

     vagrant@vagrant-ubuntu-trusty-64:~/NFD$ build/bin/consumer
     RUN indirectly
     END RUN indirectly

Output of the second

     vagrant@vagrant-ubuntu-trusty-64:~/NFD$ build/bin/consumer
     RUN indirectly
     ON TIMEOUT
     END RUN indirectly
     RUN directly
     END RUN directly

How does it make sense?!?

Actions #5

Updated by Alex Afanasyev almost 8 years ago

Forgot to mention. Tried with two different boost versions (default 1.54 and 1.55).

Actions #6

Updated by Alex Afanasyev almost 8 years ago

Could it be some compiler bug? Recompiled ndn-cxx and NFD with clang++-3.6 and issue is gone.

Actions #7

Updated by Junxiao Shi almost 8 years ago

The "directly" and "indirectly" calls in note-3 are not equivalent:
Face::processEvents also has a m_ioService.reset().

Can you try:

  • one processEvents invocation
  • .reset() and .run()
  • two processEvents invocations
  • .reset() and .run() repeated twice
  • one processEvents invocation, followed by .reset() and .run()
  • .reset() and .run(), followed by one processEvents invocation

Also,

  • Print the return value of .run() both inside and outside processEvents, to get an idea on why .run() is returning.
  • Print the value of .stopped() after .run() returns.

note-3 also does not identify which of core-objects is the troublemaker.

Can you disable just core/logger*.o to confirm whether logger is the one causing problem?

Actions #8

Updated by Alex Afanasyev almost 8 years ago

I have checked the library (added printouts). The code is equivalent. The only method that is called from inside the processEvents() is run. Everything else is disabled.

Note. That after adding direct run method, the indirect run method starts working! This is something ridiculous and I cannot blame anything except the compiler.

I tried running directly reset and run (same effect, processEvents magically starts working).

Just in case. I have tried debug and optimized modes. Incorrect behavior with gcc-4.8 is the same.


When nothing is done, return from run is "Success - 0" (boost::system::error_code ec; m_ioService.run(ec); std::cerr << ec.value() << " - " << ec.message() << std::endl;)


The "troublemakers" are core/logger* and two files that use logger: core/network-interface* and core/privilege-helper*. However, removing those binaries simply eliminates dependency on Boost.Log shared lib.


There was somewhat related bug https://bugs.launchpad.net/ubuntu/+source/gcc-4.8/+bug/1338693, but it seem to be fixed.

Actions #9

Updated by Junxiao Shi almost 8 years ago

I'm asking about the return value of io.run() which indicates how many handlers are invoked, not the error_code.


To confirm whether this is same as gcc-4.8 bug 1338693, try the snippet in their link and see what happens.

If it's fixed over there, check the installed version contains that fix.

Actions #10

Updated by Alex Afanasyev almost 8 years ago

The cited bug is similar, but not the same. The snippet from the bug works as expected.

Actions #11

Updated by Junxiao Shi almost 8 years ago

I notice valgrind nfd-status reports several Conditional jump or move depends on uninitialised value(s) errors within io_service on Ubuntu 14.04 64-bit and Boost 1.54, but there's no such error on Ubuntu 16.04 32-bit and Boost 1.58. Both are tested when NFD is not running.

Actions #12

Updated by Alex Afanasyev almost 8 years ago

I have the same observation, seeing a bunch of Conditional jump or move depends on uninitialised value(s).

Additional input. When NFD is compiled with debug flags, the trick with direct running of io_service works. When compiling with optimized flags, execution of an app simply hangs. So it is not a solution.


As for number of executed handlers. I checked, it is 0.


Any idea / suggestion what to do? Tell everybody to use clang-4.6 on Ubuntu 14.04 (it is available from standard packages) and avoid the default gcc-4.8?

I just checked with the recent version of gcc (gcc-6 from PPA). The problem is exactly the same as with gcc-4.8. Are we doing something wrong?

Actions #13

Updated by Alex Afanasyev almost 8 years ago

A few more observations. From valgrind output I see that nfd-status uses some of boost::asio... from libboost_log...:

...
==1952== Conditional jump or move depends on uninitialised value(s)
==1952==    at 0x52DC046: boost::asio::detail::task_io_service::stop() (in /usr/lib/x86_64-linux-gnu/libboost_log.so.1.54.0)
==1952==    by 0x52DEFF7: boost::asio::detail::task_io_service::run(boost::system::error_code&) (in /usr/lib/x86_64-linux-gnu/libboost_log.so.1.54.0)
==1952==    by 0x566A75C: run (io_service.ipp:66)
==1952==    by 0x566A75C: ndn::Face::processEvents(boost::chrono::duration<long, boost::ratio<1l, 1000l> > const&, bool) (face.cpp:448)
==1952==    by 0x41E290: ndn::NfdStatus::fetchInformation() (nfd-status.cpp:823)
==1952==    by 0x418DDA: main (nfd-status.cpp:918)
...

This suggested me that the linker did something weird. I tried to ensure and -lndn-cxx appears before -lboost_log and things got better for nfd-status. But I'm not sure if there is effect on the log library.

Actions #14

Updated by Davide Pesavento almost 8 years ago

Could be an ODR violation somewhere. Try building ndn-cxx with -Wl,-Bsymbolic-functions.

Actions #15

Updated by Davide Pesavento almost 8 years ago

If it's a symbol collision, we knew it was going to happen sooner or later, since ndn-cxx does not hide its private symbols. See also #2859.

Actions #16

Updated by Alex Afanasyev almost 8 years ago

Adding -Wl,-Bsymbolic-functions flag when linking libndn-cxx.so fixes the problem. Should we add it as a default flag to ndn-cxx then?

Actions #17

Updated by Davide Pesavento almost 8 years ago

The real solution would be to implement #2859... but yes, you can add -Wl,-Bsymbolic-functions to the default linker flags [1] in the meantime, it should be pretty safe(*). Note that it's a Linux-only flag, it's implemented in both GNU ld and GNU gold, but not in the Apple linker.

Independently from -Bsymbolic-functions, can you try building ndn-cxx with CXXFLAGS=-fvisibility-inlines-hidden, and see if that flag alone can fix the problem as well?

(*) Though it will break code that compares the address of a library function taken from an executable and the address of the same function taken from the library (the addresses will be different). I don't think we have code that does that, do we?

Actions #18

Updated by Junxiao Shi over 7 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100
Actions

Also available in: Atom PDF