Bug #4596
closedSegfault in Face::Impl::satisfyPendingInterests
100%
Description
Ubuntu 17.10
gcc 7.2.0
ndn-cxx: 974b81aeed570be328cf99495e16fbe1b7d0a563 (master HEAD)
NFD: 21e0193e5c81f1c4e8bd50c0d63f2e199eb140c2 (master HEAD)
Running psync-repo and psync-consumer found here:
https://netwisdom.cs.memphis.edu/gitlab/mzhang4/psync/tree/emulation/tools
Experiment I ran:
https://netwisdom.cs.memphis.edu/gitlab/agawande/psync-mini-ndn/blob/emulation/ndn/experiments/partial-sync.py
with the attached topology.
Getting the segfault after face.put here:
https://netwisdom.cs.memphis.edu/gitlab/mzhang4/psync/blob/emulation/src/logic_repo.cpp#L199
I don't know how to reproduce the error except running the Mini-NDN experiment and waiting for repo-exp (psync-repo is the binary that is installed) to
crash. Only some (2-4) of the psync-repo out of 5 running crash. The crashes happen after running for a while (within 2-5 minutes).
Backtrace of the crash is attached.
Segfault happens when m_pendingInterestTable.erase(i) empties m_pendingInterestTable (.size() = 0) iirc.
If I make the following changes it fixes the crash (at least does not happen for ~30 minutes):
diff --git a/src/detail/face-impl.hpp b/src/detail/face-impl.hpp
index 86945ba6..be0c9515 100644
--- a/src/detail/face-impl.hpp
+++ b/src/detail/face-impl.hpp
@@ -136,7 +136,6 @@ public: // consumer
}
NDN_LOG_DEBUG(" satisfying " << *entry->getInterest() << " from " << entry->getOrigin());
- i = m_pendingInterestTable.erase(i);
if (entry->getOrigin() == PendingInterestOrigin::APP) {
hasAppMatch = true;
@@ -145,6 +144,8 @@ public: // consumer
else {
hasForwarderMatch = true;
}
+
+ i = m_pendingInterestTable.erase(i);
}
// if Data matches no pending Interest record, it is sent to the forwarder as unsolicited Data
return hasForwarderMatch || !hasAppMatch;
Files
Updated by Junxiao Shi over 6 years ago
- Assignee set to Ashlesh Gawande
- Target version set to v0.7
- Estimated time set to 1.00 h
The proposed fix is correct. Old code accesses entry
after it has been erased, which is undefined behavior. You may upload your Change to Gerrit. No new test cases are necessary.
Updated by Davide Pesavento over 6 years ago
Junxiao Shi wrote:
Old code accesses
entry
after it has been erased, which is undefined behavior.
Are you sure? The code makes a copy of the shared_ptr
before erasing the entry...
Updated by Ashlesh Gawande over 6 years ago
- File scenario.zip scenario.zip added
I think the problem may be that ChronoSync removes pending interest from the face after publishing data if it satisfies own interest:
https://github.com/named-data/ChronoSync/blob/e374e7000cd443d9b973130c63e88005c26c69e3/src/logic.cpp#L770
A more simplified scenario is attached (without using Mini-NDN).
Run the compile.sh to compile the programs. And run test.sh to run the scenario. One of the chrono-app should crash (time is given to attach gdb to the PIDs of chrono-app before running the consumer which triggers the notification publisher to publish).
Updated by Ashlesh Gawande over 6 years ago
Unit test to reproduce (face.t.cpp), need to make receiveElement NDN_CXX_PUBLIC_WITH_TESTS_ELSE_PROTECTED in face.hpp:
BOOST_AUTO_TEST_CASE(SatisfyPendingInterestSegfault)
{
face.expressInterest(Interest("/localhost/notification/1"),
[&] (const Interest& i, const Data& d) {
std::cout << "Got data from notification stream " << std::endl;
std::cout << "Publish into chronosync" << std::endl;
face.receive(*makeData("/chronosync/sampleDigest/1"));
},
bind([] { BOOST_FAIL("Unexpected Nack"); }),
bind([] { BOOST_FAIL("Unexpected timeout"); }));
advanceClocks(10_ms);
// ChronoSync sets interest filter for sync prefix (/chronosync)
face.setInterestFilter(ndn::InterestFilter("/chronosync").allowLoopback(false),
[&] (const InterestFilter&, const Interest& interest) {
std::cout << "Got interest for ChronoSync" << std::endl;
});
advanceClocks(10_ms);
shared_ptr<Interest> interest = make_shared<Interest>();
interest->setName("/chronosync/sampleDigest");
interest->setNonce(1);
face.expressInterest(*interest,
[&] (const Interest& i, const Data& d) {
std::cout << "Got data for chronosync" << std::endl;
},
bind([] { BOOST_FAIL("Unexpected Nack"); }),
bind([] { BOOST_FAIL("Unexpected timeout"); }));
// Same interest but from forwarder (other chronosync)
interest->setNonce(2);
face.onReceiveElement(interest->wireEncode());
advanceClocks(10_ms);
// Got a notification from subscribed stream
face.receive(*makeData("/localhost/notification/1"));
advanceClocks(10_ms);
}
So I guess here is what happens:
1) face has pending interest from app for /localhost/notification/1
2) face has pending interest from app for /chronosync/sampleDigest
3) face has pending interest from fwd for /chronosync/sampleDigest
4) face receives data for /localhost/notification/1 and goes in satisfyPendingInterest and satisfies the pending interest from 1)
5) face publishes data for /chronosync/sampleData and goes in satisfyPendingInterest
6) /chronosync/sampleData is erased from pending interests [2) and 3) both]
7) Loop from 4) continues and tries to get the next entry (/chronoSync/sampleData) which is already erased
Updated by Ashlesh Gawande over 6 years ago
In the current fix the two /chronosync/... pending interests are deleted first and then the /localhost/... pending interest is deleted which makes the next i to be end instead of /chronosync/... like before.
I am not sure if this is correct.
Seems like we should not allow satisfyPendingInterests to be called again while it is executing. Any suggestions/comments?
Also with this fix nothing crashes in my experiments but notification subscriber does not get either data or timeout from face (I need to investigate this further as to exactly how it happens).
Updated by Davide Pesavento over 6 years ago
I'm a bit confused by the test case in note-4. An application would use Face::put()
to send Data, not receive()
, which is a member of DummyClientFace
and is only used because it's a test case.
If this is really a reentrancy issue, then I suppose the fix would be to use post
instead of dispatch
in Face::put()
.
Updated by Ashlesh Gawande over 6 years ago
With face.put as well the segfault can be reproduced.
Yes, with post there is no segfault. Where to read more about this?
Updated by Davide Pesavento over 6 years ago
The difference between post
and dispatch
is explained in Boost.Asio reference doc https://www.boost.org/doc/libs/1_58_0/doc/html/boost_asio/reference/io_service.html
Updated by Ashlesh Gawande over 6 years ago
- Status changed from New to Code review
Updated by Junxiao Shi over 6 years ago
In the current fix the two /chronosync/... pending interests are deleted first and then the /localhost/... pending interest is deleted which makes the next i to be end instead of /chronosync/... like before.
I am not sure if this is correct.
This old fix is incorrect. It fails in the following case:
- App expresses Interest /A CanBePrefix=1. Its DataCallback puts Data /A/2.
- Face receives Data /A/1 from forwarder.
Face::Impl::satisfyPendingInterests
iterates to PendingInterest /A, and invokes DataCallback.Face::Impl::satisfyPendingInterests
iterates to PendingInterest /A, and invokes DataCallback again.- Program does not terminate.
If this is really a reentrancy issue, then I suppose the fix would be to use
post
instead ofdispatch
inFace::put()
.
This new fix works:
- App expresses Interest /A CanBePrefix=1. Its DataCallback puts Data /A/2.
- Face receives Data /A/1 from forwarder.
Face::Impl::satisfyPendingInterests
iterates to PendingInterest /A, invokes DataCallback, and erases the PendingInterest.Face::Impl::satisfyPendingInterests
finds that there's no PendingInterest, and thus drops the Data.
also, does anyone have a better solution that doesn't require calling post() every time? in the vast majority of cases, dispatch is enough and is more efficient.
One alternative is simply declaring invoking put
in DataCallback or NackCallback as undefined behavior. Applications that need this should use face.getIoService().post
explicitly.
Another alternative is adding a flag "is satisfyPendingInterest running". Face::put
always calls dispatch
(it cannot read the flag because Face::put
is meant to be thread safe). Face::Impl::satisfyPendingInterests
post
s itself if the flag is set.
Updated by Davide Pesavento over 6 years ago
Junxiao Shi wrote:
One alternative is simply declaring invoking
put
in DataCallback or NackCallback as undefined behavior. Applications that need this should useface.getIoService().post
explicitly.
Yes, but this "solution" is less convenient for application developers. Plus, are we going to audit all existing code?
Another alternative is adding a flag "is satisfyPendingInterest running".
Face::put
always callsdispatch
(it cannot read the flag becauseFace::put
is meant to be thread safe).Face::Impl::satisfyPendingInterests
post
s itself if the flag is set.
It may not be just satisfyPendingInterest
, other internal functions might not expect to be called recursively.
Updated by Davide Pesavento over 6 years ago
- Subject changed from Segfault in face-impl satisfyPendingInterests to Segfault in Face::Impl::satisfyPendingInterests
- Status changed from Code review to Closed
- % Done changed from 0 to 100