Bug #3650
closednfd-status not working
100%
Description
Environment: Ubuntu 14.04
nfd-status
does not display NFD status information, but returns immediately without any output or error message.
However nfd-status -h
is working properly.
Files
Updated by Junxiao Shi over 8 years ago
- Description updated (diff)
- Assignee set to Alex Afanasyev
- Priority changed from Normal to High
Priority is increased to Urgent because there are already five complaints.
Ashlesh has performed a bisect and finds that the last good version is NFD:commit:ace83ac9384da037a36695888e829d7337ce36b0 which means the first bad version is NFD:commit:ab49745afe0967eb59d1005efce92894c66fcf7f.
The only change that might affect nfd-status
is in wscript
:
conf.check_boost(lib=boost_libs, mt=True)
mt=True
is added in this commit.
I wonder, what's the implication when NFD has mt=True
while ndn-cxx does not?
This line is added by Alex in patchset7, so I'm assigning this issue to Alex.
Updated by Alex Afanasyev over 8 years ago
- Status changed from New to In Progress
Updated by Alex Afanasyev over 8 years ago
I was finally reproduce the problem in clean Ubuntu 14.04 environment by just compiling master
commits of ndn-cxx and NFD. Few observations:
nfd-status
exits with code 0, but prints nothingstrace nfd-status
attached- currently, all tools depend on core-objects, which for
nfd-status
results in the following command line
/usr/bin/g++ -pthread tools/nfd-status.cpp.3.o core/city-hash.cpp.2.o core/config-file.cpp.2.o core/global-io.cpp.2.o core/manager-base.cpp.2.o core/network-interface.cpp.2.o core/network.cpp.2.o core/privilege-helper.cpp.2.o core/random.cpp.2.o core/scheduler.cpp.2.o core/logger-factory.cpp.2.o core/logger.cpp.2.o -o /home/vagrant/NFD/build/bin/nfd-status -Wl,-Bstatic -L/usr/lib/x86_64-linux-gnu -Wl,-Bdynamic -L/usr/lib/x86_64-linux-gnu -L/usr/local/lib -lboost_system -lboost_chrono -lboost_program_options -lboost_random -lboost_thread -lboost_log -lboost_log_setup -lndn-cxx -lboost_system -lboost_filesystem -lboost_date_time -lboost_iostreams -lboost_regex -lboost_program_options -lboost_chrono -lboost_random -lcryptopp -lsqlite3 -lrt -lpthread -lrt -lresolv
Manually removing all core/* dependencies, fixes the issue.
ndn-cxx example (I copied
consumer.cpp
intotools/
) when compiled with core/*, stops working (Face::processEvents()
returns immediately)Ha. Calling Face::getIoService()::run() makes the app to work correctly.
But I'm really puzzled. Here are two code snippets:
- First
Face m_face;
m_face.expressInterest(Name("/temp"),
bind([] { std::cerr << "ON DATA" << std::endl; }),
bind([] { std::cerr << "ON TIMEOUT" << std::endl; }));
std::cerr << "RUN indirectly" << std::endl;
m_face.processEvents();
std::cerr << "END RUN indirectly" << std::endl;
- Second
Face m_face;
m_face.expressInterest(Name("/temp"),
bind([] { std::cerr << "ON DATA" << std::endl; }),
bind([] { std::cerr << "ON TIMEOUT" << std::endl; }));
std::cerr << "RUN indirectly" << std::endl;
m_face.processEvents();
std::cerr << "END RUN indirectly" << std::endl;
std::cerr << "RUN directly" << std::endl;
m_face.getIoService().run();
std::cerr << "END RUN directly" << std::endl;
Output of the first:
vagrant@vagrant-ubuntu-trusty-64:~/NFD$ build/bin/consumer
RUN indirectly
END RUN indirectly
Output of the second
vagrant@vagrant-ubuntu-trusty-64:~/NFD$ build/bin/consumer
RUN indirectly
ON TIMEOUT
END RUN indirectly
RUN directly
END RUN directly
How does it make sense?!?
Updated by Alex Afanasyev over 8 years ago
- File nfd-status-strace.txt nfd-status-strace.txt added
Updated by Alex Afanasyev over 8 years ago
Forgot to mention. Tried with two different boost versions (default 1.54 and 1.55).
Updated by Alex Afanasyev over 8 years ago
Could it be some compiler bug? Recompiled ndn-cxx and NFD with clang++-3.6 and issue is gone.
Updated by Junxiao Shi over 8 years ago
The "directly" and "indirectly" calls in note-3 are not equivalent:
Face::processEvents
also has a m_ioService.reset()
.
Can you try:
- one
processEvents
invocation .reset()
and.run()
- two
processEvents
invocations .reset()
and.run()
repeated twice- one
processEvents
invocation, followed by.reset()
and.run()
.reset()
and.run()
, followed by oneprocessEvents
invocation
Also,
- Print the return value of
.run()
both inside and outsideprocessEvents
, to get an idea on why.run()
is returning. - Print the value of
.stopped()
after.run()
returns.
note-3 also does not identify which of core-objects
is the troublemaker.
Can you disable just core/logger*.o
to confirm whether logger is the one causing problem?
Updated by Alex Afanasyev over 8 years ago
I have checked the library (added printouts). The code is equivalent. The only method that is called from inside the processEvents() is run. Everything else is disabled.
Note. That after adding direct run method, the indirect run method starts working! This is something ridiculous and I cannot blame anything except the compiler.
I tried running directly reset
and run
(same effect, processEvents magically starts working).
Just in case. I have tried debug and optimized modes. Incorrect behavior with gcc-4.8 is the same.
When nothing is done, return from run is "Success - 0" (boost::system::error_code ec; m_ioService.run(ec); std::cerr << ec.value() << " - " << ec.message() << std::endl;
)
The "troublemakers" are core/logger*
and two files that use logger: core/network-interface*
and core/privilege-helper*
. However, removing those binaries simply eliminates dependency on Boost.Log shared lib.
There was somewhat related bug https://bugs.launchpad.net/ubuntu/+source/gcc-4.8/+bug/1338693, but it seem to be fixed.
Updated by Junxiao Shi over 8 years ago
I'm asking about the return value of io.run()
which indicates how many handlers are invoked, not the error_code
.
To confirm whether this is same as gcc-4.8 bug 1338693, try the snippet in their link and see what happens.
If it's fixed over there, check the installed version contains that fix.
Updated by Alex Afanasyev over 8 years ago
The cited bug is similar, but not the same. The snippet from the bug works as expected.
Updated by Junxiao Shi over 8 years ago
I notice valgrind nfd-status
reports several Conditional jump or move depends on uninitialised value(s)
errors within io_service
on Ubuntu 14.04 64-bit and Boost 1.54, but there's no such error on Ubuntu 16.04 32-bit and Boost 1.58. Both are tested when NFD is not running.
Updated by Alex Afanasyev over 8 years ago
I have the same observation, seeing a bunch of Conditional jump or move depends on uninitialised value(s)
.
Additional input. When NFD is compiled with debug flags, the trick with direct running of io_service
works. When compiling with optimized flags, execution of an app simply hangs. So it is not a solution.
As for number of executed handlers. I checked, it is 0.
Any idea / suggestion what to do? Tell everybody to use clang-4.6 on Ubuntu 14.04 (it is available from standard packages) and avoid the default gcc-4.8?
I just checked with the recent version of gcc (gcc-6 from PPA). The problem is exactly the same as with gcc-4.8. Are we doing something wrong?
Updated by Alex Afanasyev over 8 years ago
A few more observations. From valgrind output I see that nfd-status uses some of boost::asio...
from libboost_log...
:
...
==1952== Conditional jump or move depends on uninitialised value(s)
==1952== at 0x52DC046: boost::asio::detail::task_io_service::stop() (in /usr/lib/x86_64-linux-gnu/libboost_log.so.1.54.0)
==1952== by 0x52DEFF7: boost::asio::detail::task_io_service::run(boost::system::error_code&) (in /usr/lib/x86_64-linux-gnu/libboost_log.so.1.54.0)
==1952== by 0x566A75C: run (io_service.ipp:66)
==1952== by 0x566A75C: ndn::Face::processEvents(boost::chrono::duration<long, boost::ratio<1l, 1000l> > const&, bool) (face.cpp:448)
==1952== by 0x41E290: ndn::NfdStatus::fetchInformation() (nfd-status.cpp:823)
==1952== by 0x418DDA: main (nfd-status.cpp:918)
...
This suggested me that the linker did something weird. I tried to ensure and -lndn-cxx
appears before -lboost_log
and things got better for nfd-status. But I'm not sure if there is effect on the log library.
Updated by Davide Pesavento over 8 years ago
Could be an ODR violation somewhere. Try building ndn-cxx with -Wl,-Bsymbolic-functions
.
Updated by Davide Pesavento over 8 years ago
If it's a symbol collision, we knew it was going to happen sooner or later, since ndn-cxx does not hide its private symbols. See also #2859.
Updated by Alex Afanasyev over 8 years ago
Adding -Wl,-Bsymbolic-functions
flag when linking libndn-cxx.so
fixes the problem. Should we add it as a default flag to ndn-cxx then?
Updated by Davide Pesavento over 8 years ago
The real solution would be to implement #2859... but yes, you can add -Wl,-Bsymbolic-functions
to the default linker flags [1] in the meantime, it should be pretty safe(*). Note that it's a Linux-only flag, it's implemented in both GNU ld and GNU gold, but not in the Apple linker.
Independently from -Bsymbolic-functions
, can you try building ndn-cxx with CXXFLAGS=-fvisibility-inlines-hidden
, and see if that flag alone can fix the problem as well?
(*) Though it will break code that compares the address of a library function taken from an executable and the address of the same function taken from the library (the addresses will be different). I don't think we have code that does that, do we?
Updated by Junxiao Shi over 8 years ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100