Bug #3650: nfd-status not working - NFD - NDN project issue tracking system

Actions

Copy link

Bug #3650

closed

nfd-status not working

Added by Junxiao Shi about 9 years ago. Updated about 9 years ago.

Status:

Closed

Priority:

High

Assignee:

Alex Afanasyev

Category:

Tools

Target version:

v0.5

Start date:

06/14/2016

Due date:

% Done:

100%

Estimated time:

Description

Environment: Ubuntu 14.04

nfd-status does not display NFD status information, but returns immediately without any output or error message.

However nfd-status -h is working properly.

Files

nfd-status-strace.txt (53.4 KB) nfd-status-strace.txt

Alex Afanasyev, 06/14/2016 06:01 PM

Actions

Copy link

Updated by Junxiao Shi about 9 years ago

Description updated (diff)
Assignee set to Alex Afanasyev
Priority changed from Normal to High

Priority is increased to Urgent because there are already five complaints.

Ashlesh has performed a bisect and finds that the last good version is NFD:commit:ace83ac9384da037a36695888e829d7337ce36b0 which means the first bad version is NFD:commit:ab49745afe0967eb59d1005efce92894c66fcf7f.

The only change that might affect nfd-status is in wscript:

conf.check_boost(lib=boost_libs, mt=True)

mt=True is added in this commit.
I wonder, what's the implication when NFD has mt=True while ndn-cxx does not?

This line is added by Alex in patchset7, so I'm assigning this issue to Alex.

Actions

Copy link

Updated by Alex Afanasyev about 9 years ago

Status changed from New to In Progress

Actions

Copy link

Updated by Alex Afanasyev about 9 years ago

I was finally reproduce the problem in clean Ubuntu 14.04 environment by just compiling master commits of ndn-cxx and NFD. Few observations:

nfd-status exits with code 0, but prints nothing
strace nfd-status attached
currently, all tools depend on core-objects, which for nfd-status results in the following command line

  /usr/bin/g++ -pthread tools/nfd-status.cpp.3.o core/city-hash.cpp.2.o core/config-file.cpp.2.o core/global-io.cpp.2.o core/manager-base.cpp.2.o core/network-interface.cpp.2.o core/network.cpp.2.o core/privilege-helper.cpp.2.o core/random.cpp.2.o core/scheduler.cpp.2.o core/logger-factory.cpp.2.o core/logger.cpp.2.o -o /home/vagrant/NFD/build/bin/nfd-status -Wl,-Bstatic -L/usr/lib/x86_64-linux-gnu -Wl,-Bdynamic -L/usr/lib/x86_64-linux-gnu -L/usr/local/lib -lboost_system -lboost_chrono -lboost_program_options -lboost_random -lboost_thread -lboost_log -lboost_log_setup -lndn-cxx -lboost_system -lboost_filesystem -lboost_date_time -lboost_iostreams -lboost_regex -lboost_program_options -lboost_chrono -lboost_random -lcryptopp -lsqlite3 -lrt -lpthread -lrt -lresolv

Manually removing all core/* dependencies, fixes the issue.

ndn-cxx example (I copied consumer.cpp into tools/) when compiled with core/*, stops working (Face::processEvents() returns immediately)
Ha. Calling Face::getIoService()::run() makes the app to work correctly.
But I'm really puzzled. Here are two code snippets:
- First

    Face m_face;
    m_face.expressInterest(Name("/temp"),
                           bind([] { std::cerr << "ON DATA" << std::endl; }),
                           bind([] { std::cerr << "ON TIMEOUT" << std::endl; }));
    std::cerr << "RUN indirectly" << std::endl;
    m_face.processEvents();
    std::cerr << "END RUN indirectly" << std::endl;

Second

    Face m_face;
    m_face.expressInterest(Name("/temp"),
                           bind([] { std::cerr << "ON DATA" << std::endl; }),
                           bind([] { std::cerr << "ON TIMEOUT" << std::endl; }));
    std::cerr << "RUN indirectly" << std::endl;
    m_face.processEvents();
    std::cerr << "END RUN indirectly" << std::endl;

    std::cerr << "RUN directly" << std::endl;
    m_face.getIoService().run();
    std::cerr << "END RUN directly" << std::endl;

Output of the first:

     vagrant@vagrant-ubuntu-trusty-64:~/NFD$ build/bin/consumer
     RUN indirectly
     END RUN indirectly

Output of the second

     vagrant@vagrant-ubuntu-trusty-64:~/NFD$ build/bin/consumer
     RUN indirectly
     ON TIMEOUT
     END RUN indirectly
     RUN directly
     END RUN directly

How does it make sense?!?

Actions

Copy link

Updated by Alex Afanasyev about 9 years ago

File nfd-status-strace.txt nfd-status-strace.txt added

Actions

Copy link

Updated by Alex Afanasyev about 9 years ago

Forgot to mention. Tried with two different boost versions (default 1.54 and 1.55).

Actions

Copy link

Updated by Alex Afanasyev about 9 years ago

Could it be some compiler bug? Recompiled ndn-cxx and NFD with clang++-3.6 and issue is gone.

Actions

Copy link

Updated by Junxiao Shi about 9 years ago

The "directly" and "indirectly" calls in note-3 are not equivalent:
Face::processEvents also has a m_ioService.reset().

Can you try:

one processEvents invocation
.reset() and .run()
two processEvents invocations
.reset() and .run() repeated twice
one processEvents invocation, followed by .reset() and .run()
.reset() and .run(), followed by one processEvents invocation

Also,

Print the return value of .run() both inside and outside processEvents, to get an idea on why .run() is returning.
Print the value of .stopped() after .run() returns.

note-3 also does not identify which of core-objects is the troublemaker.

Can you disable just core/logger*.o to confirm whether logger is the one causing problem?

Actions

Copy link

Updated by Alex Afanasyev about 9 years ago

I have checked the library (added printouts). The code is equivalent. The only method that is called from inside the processEvents() is run. Everything else is disabled.

Note. That after adding direct run method, the indirect run method starts working! This is something ridiculous and I cannot blame anything except the compiler.

I tried running directly reset and run (same effect, processEvents magically starts working).

Just in case. I have tried debug and optimized modes. Incorrect behavior with gcc-4.8 is the same.

When nothing is done, return from run is "Success - 0" (boost::system::error_code ec; m_ioService.run(ec); std::cerr << ec.value() << " - " << ec.message() << std::endl;)

The "troublemakers" are core/logger* and two files that use logger: core/network-interface* and core/privilege-helper*. However, removing those binaries simply eliminates dependency on Boost.Log shared lib.

There was somewhat related bug https://bugs.launchpad.net/ubuntu/+source/gcc-4.8/+bug/1338693, but it seem to be fixed.

Actions

Copy link

Updated by Junxiao Shi about 9 years ago

I'm asking about the return value of io.run() which indicates how many handlers are invoked, not the error_code.

To confirm whether this is same as gcc-4.8 bug 1338693, try the snippet in their link and see what happens.

If it's fixed over there, check the installed version contains that fix.

Actions

Copy link

#10

Updated by Alex Afanasyev about 9 years ago

The cited bug is similar, but not the same. The snippet from the bug works as expected.

Actions

Copy link

#11

Updated by Junxiao Shi about 9 years ago

I notice valgrind nfd-status reports several Conditional jump or move depends on uninitialised value(s) errors within io_service on Ubuntu 14.04 64-bit and Boost 1.54, but there's no such error on Ubuntu 16.04 32-bit and Boost 1.58. Both are tested when NFD is not running.

Actions

Copy link

#12

Updated by Alex Afanasyev about 9 years ago

I have the same observation, seeing a bunch of Conditional jump or move depends on uninitialised value(s).

Additional input. When NFD is compiled with debug flags, the trick with direct running of io_service works. When compiling with optimized flags, execution of an app simply hangs. So it is not a solution.

As for number of executed handlers. I checked, it is 0.

Any idea / suggestion what to do? Tell everybody to use clang-4.6 on Ubuntu 14.04 (it is available from standard packages) and avoid the default gcc-4.8?

I just checked with the recent version of gcc (gcc-6 from PPA). The problem is exactly the same as with gcc-4.8. Are we doing something wrong?

Actions

Copy link

#13

Updated by Alex Afanasyev about 9 years ago

A few more observations. From valgrind output I see that nfd-status uses some of boost::asio... from libboost_log...:

...
==1952== Conditional jump or move depends on uninitialised value(s)
==1952==    at 0x52DC046: boost::asio::detail::task_io_service::stop() (in /usr/lib/x86_64-linux-gnu/libboost_log.so.1.54.0)
==1952==    by 0x52DEFF7: boost::asio::detail::task_io_service::run(boost::system::error_code&) (in /usr/lib/x86_64-linux-gnu/libboost_log.so.1.54.0)
==1952==    by 0x566A75C: run (io_service.ipp:66)
==1952==    by 0x566A75C: ndn::Face::processEvents(boost::chrono::duration<long, boost::ratio<1l, 1000l> > const&, bool) (face.cpp:448)
==1952==    by 0x41E290: ndn::NfdStatus::fetchInformation() (nfd-status.cpp:823)
==1952==    by 0x418DDA: main (nfd-status.cpp:918)
...

This suggested me that the linker did something weird. I tried to ensure and -lndn-cxx appears before -lboost_log and things got better for nfd-status. But I'm not sure if there is effect on the log library.

Actions

Copy link

#14

Updated by Davide Pesavento about 9 years ago

Could be an ODR violation somewhere. Try building ndn-cxx with -Wl,-Bsymbolic-functions.

Actions

Copy link

#15

Updated by Davide Pesavento about 9 years ago

If it's a symbol collision, we knew it was going to happen sooner or later, since ndn-cxx does not hide its private symbols. See also #2859.

Actions

Copy link

#16

Updated by Alex Afanasyev about 9 years ago

Adding -Wl,-Bsymbolic-functions flag when linking libndn-cxx.so fixes the problem. Should we add it as a default flag to ndn-cxx then?

Actions

Copy link

#17

Updated by Davide Pesavento about 9 years ago

The real solution would be to implement #2859... but yes, you can add -Wl,-Bsymbolic-functions to the default linker flags [1] in the meantime, it should be pretty safe(*). Note that it's a Linux-only flag, it's implemented in both GNU ld and GNU gold, but not in the Apple linker.

Independently from -Bsymbolic-functions, can you try building ndn-cxx with CXXFLAGS=-fvisibility-inlines-hidden, and see if that flag alone can fix the problem as well?

(*) Though it will break code that compares the address of a library function taken from an executable and the address of the same function taken from the library (the addresses will be different). I don't think we have code that does that, do we?

Actions

Copy link

#18

Updated by Junxiao Shi about 9 years ago

Status changed from In Progress to Closed
% Done changed from 0 to 100

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

NFD

Tags

Bug #3650

nfd-status not working

Updated by Junxiao Shi about 9 years ago

Updated by Alex Afanasyev about 9 years ago

Updated by Alex Afanasyev about 9 years ago

Updated by Alex Afanasyev about 9 years ago

Updated by Alex Afanasyev about 9 years ago

Updated by Alex Afanasyev about 9 years ago

Updated by Junxiao Shi about 9 years ago

Updated by Alex Afanasyev about 9 years ago

Updated by Junxiao Shi about 9 years ago

Updated by Alex Afanasyev about 9 years ago

Updated by Junxiao Shi about 9 years ago

Updated by Alex Afanasyev about 9 years ago

Updated by Alex Afanasyev about 9 years ago

Updated by Davide Pesavento about 9 years ago

Updated by Davide Pesavento about 9 years ago

Updated by Alex Afanasyev about 9 years ago

Updated by Davide Pesavento about 9 years ago

Updated by Junxiao Shi about 9 years ago