Project

General

Profile

Actions

Bug #4596

closed

Segfault in Face::Impl::satisfyPendingInterests

Added by Ashlesh Gawande almost 6 years ago. Updated almost 6 years ago.

Status:
Closed
Priority:
Normal
Category:
Base
Target version:
Start date:
04/26/2018
Due date:
% Done:

100%

Estimated time:
1.00 h

Description

Ubuntu 17.10
gcc 7.2.0

ndn-cxx: 974b81aeed570be328cf99495e16fbe1b7d0a563 (master HEAD)
NFD: 21e0193e5c81f1c4e8bd50c0d63f2e199eb140c2 (master HEAD)

Running psync-repo and psync-consumer found here:
https://netwisdom.cs.memphis.edu/gitlab/mzhang4/psync/tree/emulation/tools

Experiment I ran:
https://netwisdom.cs.memphis.edu/gitlab/agawande/psync-mini-ndn/blob/emulation/ndn/experiments/partial-sync.py
with the attached topology.

Getting the segfault after face.put here:
https://netwisdom.cs.memphis.edu/gitlab/mzhang4/psync/blob/emulation/src/logic_repo.cpp#L199

I don't know how to reproduce the error except running the Mini-NDN experiment and waiting for repo-exp (psync-repo is the binary that is installed) to
crash. Only some (2-4) of the psync-repo out of 5 running crash. The crashes happen after running for a while (within 2-5 minutes).
Backtrace of the crash is attached.

Segfault happens when m_pendingInterestTable.erase(i) empties m_pendingInterestTable (.size() = 0) iirc.

If I make the following changes it fixes the crash (at least does not happen for ~30 minutes):

diff --git a/src/detail/face-impl.hpp b/src/detail/face-impl.hpp
index 86945ba6..be0c9515 100644
--- a/src/detail/face-impl.hpp
+++ b/src/detail/face-impl.hpp
@@ -136,7 +136,6 @@ public: // consumer
       }

       NDN_LOG_DEBUG("   satisfying " << *entry->getInterest() << " from " << entry->getOrigin());
-      i = m_pendingInterestTable.erase(i);

       if (entry->getOrigin() == PendingInterestOrigin::APP) {
         hasAppMatch = true;
@@ -145,6 +144,8 @@ public: // consumer
       else {
         hasForwarderMatch = true;
       }
+
+      i = m_pendingInterestTable.erase(i);
     }
     // if Data matches no pending Interest record, it is sent to the forwarder as unsolicited Data
     return hasForwarderMatch || !hasAppMatch;


Files

bt-full.txt (27.7 KB) bt-full.txt Ashlesh Gawande, 04/26/2018 09:15 PM
20cons-5repo.conf (2.11 KB) 20cons-5repo.conf Ashlesh Gawande, 04/26/2018 09:16 PM
scenario.zip (1.01 MB) scenario.zip Ashlesh Gawande, 05/09/2018 01:25 PM
Actions #1

Updated by Junxiao Shi almost 6 years ago

  • Assignee set to Ashlesh Gawande
  • Target version set to v0.7
  • Estimated time set to 1.00 h

The proposed fix is correct. Old code accesses entry after it has been erased, which is undefined behavior. You may upload your Change to Gerrit. No new test cases are necessary.

Actions #2

Updated by Davide Pesavento almost 6 years ago

Junxiao Shi wrote:

Old code accesses entry after it has been erased, which is undefined behavior.

Are you sure? The code makes a copy of the shared_ptr before erasing the entry...

Actions #3

Updated by Ashlesh Gawande almost 6 years ago

I think the problem may be that ChronoSync removes pending interest from the face after publishing data if it satisfies own interest:
https://github.com/named-data/ChronoSync/blob/e374e7000cd443d9b973130c63e88005c26c69e3/src/logic.cpp#L770

A more simplified scenario is attached (without using Mini-NDN).

Run the compile.sh to compile the programs. And run test.sh to run the scenario. One of the chrono-app should crash (time is given to attach gdb to the PIDs of chrono-app before running the consumer which triggers the notification publisher to publish).

Actions #4

Updated by Ashlesh Gawande almost 6 years ago

Unit test to reproduce (face.t.cpp), need to make receiveElement NDN_CXX_PUBLIC_WITH_TESTS_ELSE_PROTECTED in face.hpp:

BOOST_AUTO_TEST_CASE(SatisfyPendingInterestSegfault)
{
  face.expressInterest(Interest("/localhost/notification/1"),
                       [&] (const Interest& i, const Data& d) {
                         std::cout << "Got data from notification stream " << std::endl;
                         std::cout << "Publish into chronosync" << std::endl;
                         face.receive(*makeData("/chronosync/sampleDigest/1"));
                       },
                       bind([] { BOOST_FAIL("Unexpected Nack"); }),
                       bind([] { BOOST_FAIL("Unexpected timeout"); }));

  advanceClocks(10_ms);

  // ChronoSync sets interest filter for sync prefix (/chronosync)
  face.setInterestFilter(ndn::InterestFilter("/chronosync").allowLoopback(false),
    [&] (const InterestFilter&, const Interest& interest) {
      std::cout << "Got interest for ChronoSync" << std::endl;
    });

  advanceClocks(10_ms);

  shared_ptr<Interest> interest = make_shared<Interest>();
  interest->setName("/chronosync/sampleDigest");
  interest->setNonce(1);

  face.expressInterest(*interest,
                       [&] (const Interest& i, const Data& d) {
                         std::cout << "Got data for chronosync" << std::endl;
                       },
                       bind([] { BOOST_FAIL("Unexpected Nack"); }),
                       bind([] { BOOST_FAIL("Unexpected timeout"); }));

  // Same interest but from forwarder (other chronosync)
  interest->setNonce(2);
  face.onReceiveElement(interest->wireEncode());

  advanceClocks(10_ms);

  // Got a notification from subscribed stream
  face.receive(*makeData("/localhost/notification/1"));

  advanceClocks(10_ms);
}

So I guess here is what happens:
1) face has pending interest from app for /localhost/notification/1
2) face has pending interest from app for /chronosync/sampleDigest
3) face has pending interest from fwd for /chronosync/sampleDigest
4) face receives data for /localhost/notification/1 and goes in satisfyPendingInterest and satisfies the pending interest from 1)
5) face publishes data for /chronosync/sampleData and goes in satisfyPendingInterest
6) /chronosync/sampleData is erased from pending interests [2) and 3) both]
7) Loop from 4) continues and tries to get the next entry (/chronoSync/sampleData) which is already erased

Actions #5

Updated by Ashlesh Gawande almost 6 years ago

In the current fix the two /chronosync/... pending interests are deleted first and then the /localhost/... pending interest is deleted which makes the next i to be end instead of /chronosync/... like before.

I am not sure if this is correct.
Seems like we should not allow satisfyPendingInterests to be called again while it is executing. Any suggestions/comments?

Also with this fix nothing crashes in my experiments but notification subscriber does not get either data or timeout from face (I need to investigate this further as to exactly how it happens).

Actions #6

Updated by Davide Pesavento almost 6 years ago

I'm a bit confused by the test case in note-4. An application would use Face::put() to send Data, not receive(), which is a member of DummyClientFace and is only used because it's a test case.

If this is really a reentrancy issue, then I suppose the fix would be to use post instead of dispatch in Face::put().

Actions #7

Updated by Ashlesh Gawande almost 6 years ago

With face.put as well the segfault can be reproduced.

Yes, with post there is no segfault. Where to read more about this?

Actions #8

Updated by Davide Pesavento almost 6 years ago

The difference between post and dispatch is explained in Boost.Asio reference doc https://www.boost.org/doc/libs/1_58_0/doc/html/boost_asio/reference/io_service.html

Actions #9

Updated by Ashlesh Gawande almost 6 years ago

Thanks! So should I post to gerrit?

Actions #10

Updated by Ashlesh Gawande almost 6 years ago

  • Status changed from New to Code review
Actions #11

Updated by Junxiao Shi almost 6 years ago

In the current fix the two /chronosync/... pending interests are deleted first and then the /localhost/... pending interest is deleted which makes the next i to be end instead of /chronosync/... like before.
I am not sure if this is correct.

This old fix is incorrect. It fails in the following case:

  1. App expresses Interest /A CanBePrefix=1. Its DataCallback puts Data /A/2.
  2. Face receives Data /A/1 from forwarder.
  3. Face::Impl::satisfyPendingInterests iterates to PendingInterest /A, and invokes DataCallback.
  4. Face::Impl::satisfyPendingInterests iterates to PendingInterest /A, and invokes DataCallback again.
  5. Program does not terminate.

If this is really a reentrancy issue, then I suppose the fix would be to use post instead of dispatch in Face::put().

This new fix works:

  1. App expresses Interest /A CanBePrefix=1. Its DataCallback puts Data /A/2.
  2. Face receives Data /A/1 from forwarder.
  3. Face::Impl::satisfyPendingInterests iterates to PendingInterest /A, invokes DataCallback, and erases the PendingInterest.
  4. Face::Impl::satisfyPendingInterests finds that there's no PendingInterest, and thus drops the Data.

also, does anyone have a better solution that doesn't require calling post() every time? in the vast majority of cases, dispatch is enough and is more efficient.

One alternative is simply declaring invoking put in DataCallback or NackCallback as undefined behavior. Applications that need this should use face.getIoService().post explicitly.
Another alternative is adding a flag "is satisfyPendingInterest running". Face::put always calls dispatch (it cannot read the flag because Face::put is meant to be thread safe). Face::Impl::satisfyPendingInterests posts itself if the flag is set.

Actions #12

Updated by Davide Pesavento almost 6 years ago

Junxiao Shi wrote:

One alternative is simply declaring invoking put in DataCallback or NackCallback as undefined behavior. Applications that need this should use face.getIoService().post explicitly.

Yes, but this "solution" is less convenient for application developers. Plus, are we going to audit all existing code?

Another alternative is adding a flag "is satisfyPendingInterest running". Face::put always calls dispatch (it cannot read the flag because Face::put is meant to be thread safe). Face::Impl::satisfyPendingInterests posts itself if the flag is set.

It may not be just satisfyPendingInterest, other internal functions might not expect to be called recursively.

Actions #13

Updated by Davide Pesavento almost 6 years ago

  • Subject changed from Segfault in face-impl satisfyPendingInterests to Segfault in Face::Impl::satisfyPendingInterests
  • Status changed from Code review to Closed
  • % Done changed from 0 to 100
Actions

Also available in: Atom PDF