Project

General

Profile

Actions

Bug #2517

closed

NFD state is not properly cleaned up on destruction

Added by Alex Afanasyev almost 10 years ago. Updated over 1 year ago.

Status:
Abandoned
Priority:
Normal
Assignee:
-
Category:
Core
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:

Description

In some environments (specifically Android app), it may be necessary (at least needed for now) to start/stop NFD multiple times within the same process. This requires that Nrd/Nfd destructors properly cancel everything that was posted on io_service without requiring to stop io_service.

The current way we do thing in main() is we stopping the global io_service first, after which we call the destructors. This effectively destructs everything, however there are dangling handlers posted that will be removed when the application is terminated. This approach does not work when nfd needs to be started again within the same process---whenever global io_service is started again, it will fire up a few handlers that point to deallocated memory.


Related issues 3 (0 open3 closed)

Related to ndn-cxx - Bug #2518: Segfault when destructing face without cancelling pending interestsClosedAlex Afanasyev02/12/2015

Actions
Blocked by NFD - Task #2489: Merge nrd into nfdClosedAlex Afanasyev

Actions
Blocked by NFD - Bug #2516: UdpChannel::newPeer does not properly handle error conditionClosedAlex Afanasyev02/12/2015

Actions
Actions #1

Updated by Alex Afanasyev almost 10 years ago

  • Tracker changed from Task to Bug
  • Subject changed from ls to Full restart of NFD inside the same process
  • Category set to Core
  • Assignee set to Alex Afanasyev
  • Target version set to v0.4

While attempting to start/stop NFD multiple times within the same process (without reset of scheduler and io_service) i got stuck with the problem that it is not currently supported.

The quickest way to solve the problem for me is to reset IO and scheduler. I will try to make things in a cleaner way, but if it will take too much time, I will just use the shortcut.

Actions #2

Updated by Junxiao Shi almost 10 years ago

Please update Description (not Notes) with steps to reproduce, expected, and actual behavior.

Actions #3

Updated by Alex Afanasyev almost 10 years ago

I'm working with the test case to reproduce/test the issue.

Actions #4

Updated by Alex Afanasyev almost 10 years ago

  • Status changed from New to Code review
  • % Done changed from 0 to 100
Actions #5

Updated by Alex Afanasyev almost 10 years ago

  • Description updated (diff)
Actions #6

Updated by Junxiao Shi almost 10 years ago

Actions #7

Updated by Junxiao Shi almost 10 years ago

Nfd and Nrd types are not thread safe. The snippet calls their destructors in another thread, which could lead to undefined behavior.

Actions #8

Updated by Alex Afanasyev almost 10 years ago

This is just snippet. The actual test case has a little bit of thread safety handling. The test case reproduces the problem and is reference implementation. The snippet just to illustrate the issue.

Actions #9

Updated by Alex Afanasyev almost 10 years ago

Technically, this task is not blocked by nfd/nrd merge. It is only blocked by #2516

Actions #10

Updated by Alex Afanasyev almost 10 years ago

  • Blocked by Bug #2516: UdpChannel::newPeer does not properly handle error condition added
Actions #11

Updated by Junxiao Shi almost 10 years ago

If the bug report is illustrated with code snippet, the snippet must be correct. Alternatively, describe the steps in text, and the test case in the solution should match this text.

Actions #12

Updated by Alex Afanasyev almost 10 years ago

Junxiao, let's be productive. I don't want to waste time on formal things. My opinion is that the snippet and the test case achieve their different goals.

Actions #13

Updated by Junxiao Shi almost 10 years ago

It's not "formal things". A Bug report must allow others to understand what's the problem. I cannot understand the problem from this snippet, thus it needs revising.

I don't understand the significance of this problem. Nfd never needs to restart within a process. If it crashes, the process should exit.

Actions #14

Updated by Alex Afanasyev almost 10 years ago

  • Description updated (diff)
Actions #15

Updated by Alex Afanasyev almost 10 years ago

  • Related to Bug #2518: Segfault when destructing face without cancelling pending interests added
Actions #16

Updated by Junxiao Shi almost 10 years ago

This approach does not work when nfd needs to be started again within the same process---whenever global io_service is started again, it will fire up a few handlers that point to deallocated memory.

Why would NFD need to be started again within the same process? Just quit the process and let the monitor to restart it.

Actions #17

Updated by Alex Afanasyev almost 10 years ago

I don't have an answer of "why". It is more question of purity and "correctness": I would like to have Nrd/Nfd/Forwarder classes to do proper cleanup in a proper order, which gives ability to re-start NFD within the same process if somebody ever wants to do that.

Actions #18

Updated by Junxiao Shi almost 10 years ago

I don't have an answer of "why". It is more question of purity and "correctness": I would like to have Nrd/Nfd/Forwarder classes to do proper cleanup in a proper order, which gives ability to re-start NFD within the same process if somebody ever wants to do that.

I won't agree with this change until there's a use case.

The easy solution is: let Nfd type own the io_service instance, which is destroyed whenever Nfd is destructed. Same goes for Nrd type.

Actions #19

Updated by Alex Afanasyev almost 10 years ago

What's a point of you blocking the change? There is no functional change and it just cleans up the order things are destructed. I can come up with any work-around, this change is purely for conceptual correctness.

Actions #20

Updated by Junxiao Shi almost 10 years ago

I won't agree with this Change until there's a use case. Without a use case, this is "unnecessary changes".

Actions #21

Updated by Davide Pesavento almost 10 years ago

Even without the "in-process restart", it makes sense to cleanup in destructors... I'd tend to agree with Alex here.

Actions #22

Updated by Junxiao Shi almost 10 years ago

Then update the issue title and description to get rid of "restart".

Actions #23

Updated by Alex Afanasyev almost 10 years ago

Davide's comment on gerrit:

Shouldn't be the channels' responsibility to close the faces though? The channels create them, I'd expect the channels to also destroy them. FaceTable is just a passive observer in this.

Junxiao's comment on gerrit:

Yes

While I'm in principle agree that FaceTable may not be a good place to "own" faces. Channels is not a good place either. First, channels are not the only place where face is created. Some faces are channel-less (e.g., multicast faces) and some faces are completely channel- and factory-less (InternalFace). The only place that has a complete knowledge about all faces is the forwarder itself and in particular FaceTable. Therefore, I would argue that FaceTable is the owner of faces and channels are just observers overseeing faces that they created.

Actions #24

Updated by Davide Pesavento almost 10 years ago

Channels or factories it's not too important... the point is that "normal" faces are owned by channels (or factories), channels are owned by factories, factories are owned by the FaceManager. InternalFace and NullFace are constructed by Nfd and should be closed by it.

This reminds me of a thought I had a while ago, see #2300 note 8.

Also I have a question about this task description: if everything is destructed, why are there "dangling handlers"? Handlers for what specifically?

Actions #25

Updated by Alex Afanasyev almost 10 years ago

  • Subject changed from Full restart of NFD inside the same process to Nrd/Forwarder state is not properly cleanup on destruction
  • Description updated (diff)
Actions #26

Updated by Junxiao Shi almost 10 years ago

In some environments (specifically Android app), it may be necessary (at least needed for now) to start/stop NFD multiple times within the same process. This requires that Nrd/Nfd destructors properly cancel everything that was posted on io_service without requiring to stop io_service.

Why can't Android app destroy and recreate io_service?

Actions #27

Updated by Davide Pesavento almost 10 years ago

  • Subject changed from Nrd/Forwarder state is not properly cleanup on destruction to Nrd/Forwarder state is not properly cleaned up on destruction
Actions #28

Updated by Alex Afanasyev almost 10 years ago

I could agree with using weak_ptr instead of shared_ptr in many cases. However, this would require a lot of changes and I really don't want to do them any time soon.

As for removal and dangling stuff (this may not be entirely correct in the description). From what I can guess, Channels/Factories belong to FaceManager, which intentionally is destructed after Forwarder class (as it has to be created prior). Unless face is already "failed" while Forwarder instance is still alive, destruction of face inside Channel/Factory will result in signalling of failure to the forwarder, while the forwarder is already destructed. I guess, this specific part can be fixed by saving connections and disconnecting them inside FaceTable destructor.

Actions #29

Updated by Alex Afanasyev almost 10 years ago

Junxiao Shi wrote:

Why can't Android app destroy and recreate io_service?

I can. The bug is not exactly about ability to "cheat" (and I'm already doing it). The bug is about correctness during the destruction process.

Actions #30

Updated by Davide Pesavento almost 10 years ago

Alex Afanasyev wrote:

Unless face is already "failed" while Forwarder instance is still alive, destruction of face inside Channel/Factory will result in signalling of failure to the forwarder, while the forwarder is already destructed. I guess, this specific part can be fixed by saving connections and disconnecting them inside FaceTable destructor.

Yes I guess FaceTable should use a ScopedConnection for its subscriptions to the faces then.

Actions #31

Updated by Alex Afanasyev about 9 years ago

  • Target version changed from v0.4 to v0.5
Actions #32

Updated by Junxiao Shi almost 9 years ago

  • Status changed from Code review to In Progress

http://gerrit.named-data.net/1759 is blocked and marked -2 by Change Owner, so Status shouldn't be CodeReview.

Actions #33

Updated by Junxiao Shi over 8 years ago

I rebased https://gerrit.named-data.net/1759 to patchset11, but I couldn't get the test case to work.

One problem is: FaceTable destructor expects Face::close (formerly Face::fail) to close the face immediately, but Face::close only requests the face to be closed, the actual closing is asynchronous.
Thus, FaceTable destructor is not the right place to request closing faces.

The caller needs to request the closure (possible through a helper method on FaceTable class), and then waits until all faces are closed before deallocating FaceTable.

Actions #34

Updated by Davide Pesavento over 8 years ago

Junxiao Shi wrote:

One problem is: FaceTable destructor expects Face::close (formerly Face::fail) to close the face immediately, but Face::close only requests the face to be closed, the actual closing is asynchronous.

Correct.

Thus, FaceTable destructor is not the right place to request closing faces.

This is only one possible conclusion. There is at least another one: FaceTable destructor should not be expecting all faces to be closed at the end of its execution. At that point the faces can be either CLOSING, FAILED, or CLOSED. The io_service should naturally return from run() once there are no more handlers to execute. After that, all faces will be CLOSED (in fact, they will be deallocated immediately after their last handler has returned).

This also means that we need a stop()/reset() method in Nfd or NfdRunner that triggers the destruction of everything.

The caller needs to request the closure (possible through a helper method on FaceTable class), and then waits until all faces are closed before deallocating FaceTable.

The problem here is that "the caller" is another destructor, so how do you "wait"? I don't know if it's a good idea to call io_service::poll() from a destructor.

Actions #35

Updated by Alex Afanasyev almost 8 years ago

  • Status changed from In Progress to New
  • Assignee deleted (Alex Afanasyev)
  • Target version deleted (v0.5)
  • % Done changed from 100 to 0
Actions #36

Updated by Davide Pesavento about 6 years ago

  • Subject changed from Nrd/Forwarder state is not properly cleaned up on destruction to NFD state is not properly cleaned up on destruction
  • Start date deleted (02/12/2015)
Actions #37

Updated by Davide Pesavento over 1 year ago

  • Status changed from New to Abandoned

The corresponding change has been abandoned a long time ago with the following reason:

This probably not needed anymore, will revive if running to the problem

Actions

Also available in: Atom PDF