Feature #2075
open
The library should have an API to inform the application when the connection is lost to the forwarder
Added by Anonymous about 10 years ago. Updated over 8 years ago.
0%
Description
The connection may be lost if the forwarder crashes or for other reasons. The library should have a way to test whether the forwarder is still there, perhaps with a periodic ping. There should also be an API call so that the application can test, when desired, whether the forwarder is still there.
This request comes from an email exchange on nfd-dev, in the thread "Help needed with PyNDN2".
http://www.lists.cs.ucla.edu/pipermail/nfd-dev/2014-September/000392.html
Updated by Anonymous over 9 years ago
- Project changed from 5 to NDN-CCL
- Subject changed from The library should inform the application when the connection is lost to the forwarder to The library should have an API to inform the application when the connection is lost to the forwarder
Updated by Anonymous over 9 years ago
- Assignee changed from Anonymous to Beichuan Zhang
Hello Beichuan. We assigned this to you to see if you can bring this up as an architectural question. Should the application be aware of managing connection drops and reconnecting or do we want the architecture to be able to handle this "under the hood" away from the application?
Updated by Alex Afanasyev over 9 years ago
This is not an architectural question. This is an artifact of the current implementation. If NDN were a networking stack embedded in the kernel, there would be no concept of "connection lost".
Given the artifact, the application needs to be informed in some way that something went wrong. In most languages, the easiest way is to throw an exception (this works in Java, JavaScript, and Python). It is the application's choice what to do next: retry or abort. Different applications will want different behavior, and it should not be hidden from them.
Updated by Jeff Burke over 9 years ago
s/architecture/nfd/g. The point of the question is less the phrasing and more a request to specify if and how the libraries should handle this. Should applications be encouraged to have a notion of "connection" to the forwarder at all?
Updated by Anonymous over 9 years ago
- Assignee changed from Beichuan Zhang to Anonymous
I will do some experiments where the library throws a specific exception when the connection is dropped.
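To make the "specific exception" idea concrete, here is a minimal sketch in Java. It does not use any real NDN library class: `ConnectionLostException`, `ForwarderLink` and `sendWithOneRetry` are hypothetical names made up for illustration. It only shows the shape suggested above, where the library throws and the application chooses between retrying and aborting.

```java
import java.io.IOException;

/** Hypothetical exception type; not part of any NDN library. */
class ConnectionLostException extends IOException {
    ConnectionLostException(String message) { super(message); }
}

/** Hypothetical stand-in for the client's link to the forwarder. */
class ForwarderLink {
    private boolean connected = true;

    void simulateDrop() { connected = false; }
    void reconnect() { connected = true; }

    /** Returns the number of bytes "sent", or throws if the link is known to be down. */
    int send(byte[] packet) throws ConnectionLostException {
        if (!connected)
            throw new ConnectionLostException("connection to forwarder lost");
        return packet.length;
    }
}

class ExceptionDemo {
    /** One possible application policy: reconnect and retry exactly once. */
    static int sendWithOneRetry(ForwarderLink link, byte[] packet)
            throws ConnectionLostException {
        try {
            return link.send(packet);
        } catch (ConnectionLostException e) {
            link.reconnect();
            return link.send(packet); // a second failure propagates to the caller
        }
    }
}
```

Another application could catch the same exception and abort instead; the policy stays outside the library.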
Updated by Andrew Brown over 9 years ago
I duplicated this thread as a bug at http://redmine.named-data.net/issues/2869. Here is my original post and I'm leaning towards option 2:
If a Face is connected to a remote NFD using TcpTransport, an NFD restart will break the TCP socket. If the default constructor is used, the user does not have a reference to the transport to call transport.connect() and re-connect the TCP socket; repeated calls to face.processEvents() will fail without attempting to reconnect. Don't know if this is a bug or a feature request and haven't tested with other transport types. Possible solutions:
- make transport.processEvents() attempt to reconnect the broken socket automatically
- allow users access to the transport from the face so that they can re-connect it (or expose some face-level method to do this)
- force users to pass in a transport instance and keep a reference to it alongside the face reference; write code to trap a face.processEvents() error and trigger a transport.connect (I don't like this option as much)
Updated by Anonymous over 9 years ago
Hi Alex. In note 3, you suggest throwing an exception if the connection is dropped. What should happen for an async producer application which just starts the main event loop and waits for incoming interests (without calling library methods which could throw an exception)? Should the main event loop quit and throw the exception? I ask because the event loop may still be doing other useful things. Should the application tell the face to reconnect and also restart the event loop?
Updated by Andrew Brown over 8 years ago
Jeff, I have some new thoughts about this: option 1 above seems a common enough case that we could add a flag to AsyncTcpTransport.ConnectionInfo to attempt to reconnect on failure. This would be helpful to us, the users, since we don't really want to manage the reconnection like we would have to do in option 2. Would you be interested in a PR to attempt reconnections, perhaps with an exponential back-off to 30 seconds or something?
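A rough sketch of what "exponential back-off to 30 seconds" could look like in Java follows. Everything here is an assumption for illustration: `BackoffReconnector`, the 500 ms starting delay, and the `connect` and `sleeper` parameters are not part of jndn. The sleep function is passed in so the schedule can be checked without actually waiting.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

/** Hypothetical reconnect helper; names and constants are illustrative only. */
class BackoffReconnector {
    static final long INITIAL_DELAY_MS = 500;   // assumed starting delay
    static final long MAX_DELAY_MS = 30_000;    // the 30 second cap from the thread

    /** Delay before retrying after the given failed attempt (0-based): doubles, then caps. */
    static long delayForAttempt(int attempt) {
        long delay = INITIAL_DELAY_MS;
        for (int i = 0; i < attempt && delay < MAX_DELAY_MS; i++)
            delay *= 2;
        return Math.min(delay, MAX_DELAY_MS);
    }

    /**
     * Calls connect until it returns true or maxAttempts is reached.
     * Sleeps between attempts, but not after the last one.
     */
    static boolean reconnect(BooleanSupplier connect, int maxAttempts, LongConsumer sleeper) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (connect.getAsBoolean())
                return true;
            if (attempt + 1 < maxAttempts)
                sleeper.accept(delayForAttempt(attempt));
        }
        return false;
    }
}
```

In a real transport the `sleeper` would wrap `Thread.sleep` or a scheduled executor, and `maxAttempts` would be one of the configuration knobs on the connection info.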
Updated by Anonymous over 8 years ago
I agree that something would be better than nothing, but I'll brain dump the concerns I have which have made me procrastinate:
In the simplest case of a consumer (without register prefix), expressInterest will still call Transport.send with encoded interests. Should the Transport queue these to send when it reconnects, or throw an exception when down? (In time-sensitive applications, sending the interest several seconds later after reconnecting may not be the expected behavior.)
For a producer which has registered a prefix, this is lost when the connection drops. How to re-register the prefix? It doesn't seem that this should be handled at the level of the Transport. Does the Face get a signal from the Transport and automatically try to re-register? Or if the application has to try to re-register then what is the library API to inform the application (the original title of this issue)?
Since this issue was opened, NFD added the concept of a "permanent face" which I don't entirely understand. Would it be of any help here? (I think not. I assume it would only work to reconnect to a reachable address which the client typically doesn't have.)
Any thoughts?
Updated by Anonymous over 8 years ago
... of course for the case of the consumer, the simplest solution when the application calls expressInterest is where the Transport just drops the packet when the connection is down. In this case, the application would get a timeout.
Perhaps there is a fancier solution using network Nack (which I'm working on implementing in the CCL libraries). As in ndn-cxx, the API would update expressInterest to add an onNack callback in addition to onData and onTimeout. Unlike onTimeout, onNack can be called immediately. The callback includes a Nack reason such as NO_ROUTE. We could use this, or abuse the reason with something more specific for a down connection, which is a candidate for the API to inform the application of a lost connection (which would only happen after expressInterest).
Updated by Anonymous over 8 years ago
- Blocks Bug #1017: Firefox add-on should try to reconnect if connection drops added
Updated by Andrew Brown over 8 years ago
Jeff Thompson wrote:
I agree that something would be better than nothing, but I'll brain dump the concerns I have which have made me procrastinate:
In the simplest case of a consumer (without register prefix), expressInterest will still call Transport.send with encoded interests. Should the Transport queue these to send when it reconnects, or throw an exception when down? (In time-sensitive applications, sending the interest several seconds later after reconnecting may not be the expected behavior.)
I like throwing an exception because it is simpler; we could always add the queuing later if necessary.
For a producer which has registered a prefix, this is lost when the connection drops. How to re-register the prefix? It doesn't seem that this should be handled at the level of the Transport. Does the Face get a signal from the Transport and automatically try to re-register? Or if the application has to try to re-register then what is the library API to inform the application (the original title of this issue)?
I don't have a good feeling about this, but we already pass a Runnable in to the connect method and we could replace this with either two runnables or with a callback with two methods, onConnect() and onDisconnect() (or even a third, onFailedConnect()).
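The callback-with-several-methods idea could be sketched as below. This is not the jndn API: `ConnectionListener`, `ListeningTransport` and `ReRegisteringListener` are hypothetical stand-ins, and "registering" a prefix is reduced to writing a log line. The point is only that `onConnect()` gives the application (or the Face) a hook for re-registering prefixes after every reconnect.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical listener that would replace the single Runnable passed to connect. */
interface ConnectionListener {
    void onConnect();
    void onDisconnect();
    void onFailedConnect(Exception cause);
}

/** Hypothetical transport stand-in that reports state changes to one listener. */
class ListeningTransport {
    private final ConnectionListener listener;
    private boolean connected;

    ListeningTransport(ConnectionListener listener) { this.listener = listener; }

    /** The reachable flag simulates whether the forwarder can be contacted. */
    void connect(boolean reachable) {
        if (reachable) {
            connected = true;
            listener.onConnect();
        } else {
            listener.onFailedConnect(new IOException("forwarder unreachable"));
        }
    }

    void simulateDrop() {
        if (connected) {
            connected = false;
            listener.onDisconnect();
        }
    }
}

/** Application-side listener that re-registers its prefixes on every connect. */
class ReRegisteringListener implements ConnectionListener {
    private final List<String> prefixes;
    final List<String> log = new ArrayList<>();

    ReRegisteringListener(List<String> prefixes) { this.prefixes = prefixes; }

    @Override public void onConnect() {
        for (String prefix : prefixes)
            log.add("register " + prefix);
    }
    @Override public void onDisconnect() { log.add("disconnected"); }
    @Override public void onFailedConnect(Exception cause) {
        log.add("failed: " + cause.getMessage());
    }
}
```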
Since this issue was opened, NFD added the concept of a "permanent face" which I don't entirely understand. Would it be of any help here? (I think not. I assume it would only work to reconnect to a reachable address which the client typically doesn't have.)
I agree; not really sure how the clients would have consistent, reachable addresses.
Any thoughts?
Updated by Andrew Brown over 8 years ago
One more comment about that last: if we do reconnect sockets and do re-register prefixes, it seems like there should be some way to configure this (e.g. turn it on/off, control reconnect # attempts, attempt frequency, etc.). Transport already has the ConnectionInfo, but it seems like Face would need something similar.
Updated by Andrew Brown over 8 years ago
Jeff Thompson wrote:
Perhaps there is a fancier solution using network Nack (which I'm working on implementing in the CCL libraries). As in ndn-cxx, the API would update expressInterest to add an onNack callback in addition to onData and onTimeout. Unlike onTimeout, onNack can be called immediately. The callback includes a Nack reason such as NO_ROUTE. We could use this, or abuse the reason with something more specific for a down connection, which is a candidate for the API to inform the application of a lost connection (which would only happen after expressInterest).
I really like this idea because it is re-usable for other purposes; so if my NACK callback is fired then I can control how and when the transport attempts to reconnect by calling transport.connect()? Perhaps OnNack would give me a reference to the transport so I could do this. However, we would still need to address the "re-register prefix on transport failure" case in some other way.
Updated by Anonymous over 8 years ago
We need some kind of callback. But the more I think about it, the onNack callback from expressInterest is really supposed to be about problems routing that specific interest, such as congestion on the route towards the producer.
A dropped connection is really a problem with the transport, and the types of solutions to a dropped connection may be very different depending on the transport. Think of a local Unix socket vs. a remote wifi connection. At the moment I'm not sure how to handle this in generalized Face API, so I'm leaning towards starting with a transport-specific solution where you provide a callback when you create the Transport. You are looking at the AsyncTcpTransport, right?
Updated by Anonymous over 8 years ago
Andrew Brown wrote:
Jeff Thompson wrote:
In the simplest case of a consumer (without register prefix), expressInterest will still call Transport.send with encoded interests. Should the Transport queue these to send when it reconnects, or throw an exception when down? (In time-sensitive applications, sending the interest several seconds later after reconnecting may not be the expected behavior.)
I like throwing an exception because it is simpler; we could always add the queuing later if necessary.
In order to throw an exception, ThreadPoolFace.expressInterest would need to check if the transport connection is down before dispatching to the thread pool to perform the network operation. Can there be a super simple Transport API like isAlive() ?
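As a sketch of the check-before-dispatch idea, assuming a transport that exposes `isAlive()`: the types below (`SimpleTransport`, `CheckedFace`) are hypothetical and much simpler than the real ThreadPoolFace, and `IllegalStateException` is just a placeholder for whatever exception the library would choose.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

/** Hypothetical minimal transport interface with the proposed isAlive() check. */
interface SimpleTransport {
    boolean isAlive();
    void send(byte[] data);
}

/** Hypothetical face that checks the transport before handing work to the pool. */
class CheckedFace {
    private final SimpleTransport transport;
    private final ExecutorService pool;

    CheckedFace(SimpleTransport transport, ExecutorService pool) {
        this.transport = transport;
        this.pool = pool;
    }

    /** Throws on the caller's thread if the transport is down; otherwise dispatches the send. */
    Future<?> expressInterest(byte[] encodedInterest) {
        if (!transport.isAlive())
            throw new IllegalStateException("transport is not connected");
        return pool.submit(() -> transport.send(encodedInterest));
    }
}
```

Note that the check is only advisory: the connection can still drop between `isAlive()` and the actual send on the pool thread, so the asynchronous failure path is needed as well.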
https://github.com/named-data/jndn/blob/72e9b76a768df59a6247899bdedfa326b85d6d69/src/net/named_data/jndn/ThreadPoolFace.java#L116
Updated by Andrew Brown over 8 years ago
We already have getIsConnected() in there; could we use that?
Updated by Andrew Brown over 8 years ago
But that's on a Transport and you need access to that information at the Node level?
Updated by Anonymous over 8 years ago
Andrew Brown wrote:
We already have getIsConnected() in there; could we use that?
AsyncTcpTransport.getIsConnected currently uses channel_.getRemoteAddress() != null. Does getRemoteAddress() go null when the connection is dropped?
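One way to probe this question with plain java.nio is sketched below. It uses a blocking `SocketChannel` on the loopback interface rather than the channel type inside AsyncTcpTransport, so treat it as an experiment under that assumption, not as a statement about jndn. `RemoteAddressProbe` is a made-up name. According to the java.nio Javadoc, `getRemoteAddress()` returns null only when the socket was never connected and throws `ClosedChannelException` once the local channel is closed, so a peer-side close alone is not expected to make it null; the drop shows up as end-of-stream on read instead.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

/** Hypothetical probe: what does the client see after the peer closes the connection? */
class RemoteAddressProbe {
    /** Returns { remote address still non-null after peer close, read returned end-of-stream }. */
    static boolean[] probe() throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Port 0 lets the OS pick a free loopback port.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
                server.accept().close();                      // the peer drops the connection
                int n = client.read(ByteBuffer.allocate(1));  // blocks until end-of-stream
                boolean addressStillSet = client.getRemoteAddress() != null;
                return new boolean[] { addressStillSet, n == -1 };
            }
        }
    }
}
```

If this holds for the asynchronous channel too, a read completing with -1 (or an I/O error) would be a more reliable signal of a dropped connection than the remote address.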
https://github.com/named-data/jndn/blob/728f1a8464921cfabf499d8d2bbaf9ebfa1522ef/src/net/named_data/jndn/transport/AsyncTcpTransport.java#L289
Updated by Anonymous over 8 years ago
- Related to Bug #1047: The removeRegisteredPrefix API should "unregister" from the forwarding daemon added
Updated by Anonymous over 8 years ago
How about adding a reason code to the OnTimeout callback? Currently, when the application calls expressInterest but the Transport's connection is down, the result is that the library calls OnTimeout.
However, if the library knows that the connection is down, it could call OnTimeout with an additional reason code like TRANSPORT_UNCONNECTED. Whether or not the library handles the reason code, it is still going to get the OnTimeout callback. But if it does handle the reason code, it can decide whether to retransmit the interest, etc. What do you think?
Updated by Andrew Brown over 8 years ago
I like this. I would go even further and propose an OnFailure callback that would accept all the reasons that OnTimeout might be called with, all the reasons OnNack might be called with, and any future reasons that the library needs. The reason for this is that adding new callbacks for each of the conceivable categories of Interest failure will result in API thrash for our applications. It seems that a more future-proof approach is to add additional reason codes as they are discovered.
Updated by Anonymous over 8 years ago
Hi Andrew. I'd like to keep OnNack as its own callback since it's a specific message sent by the network and is part of the protocol design with recommended ways to handle it. For all the rest, we can rename OnTimeout to OnFailure as you suggest. I can think of two possible ways to implement OnFailure:
OnFailure(Interest interest, int reasonCode, String message)
or
OnFailure(Interest interest, int reasonCode, Object info)
In the first way, the application can check the reasonCode and show the message. In the second way, the application can check the reasonCode and cast the info Object to a particular class which can contain lots of information specific to the reasonCode. (Maybe instead of Object it should be a subclass of Exception, which at least has getMessage() that the application can show.) What do you think?
Updated by Andrew Brown over 8 years ago
I see your point about the OnNack; for the OnFailure, I like the second version better and the Exception-subclassing even more. That way you could make an NdnException base class and add or remove fields/methods from it as new error handling is added (e.g. OnFailure(Interest interest, NdnException error)). Perhaps you could even collapse the reason code into such an exception? If I have to cast the Object, then I have to have some inner knowledge of how the library works, but an exception base class forms a sort of interface to protect this.
Updated by Anonymous over 8 years ago
Glad you like the idea. Unless someone objects, we'll try the approach with Exception subclassing.
One point. While you're right that an NdnException base class could include the reason code, I hesitate for the following reasons. In some cases the library may encounter the error because another method throws an exception which may not be a subclass of NdnException, so we would have to put it inside a wrapper subclass of NdnException. Also, our other supported dynamic languages like JavaScript and Python don't have a type system to ensure that the object has a reasonCode field so the OnFailure handler would always have to check anyway. Having a separate reasonCode argument avoids this, especially if the handler is only interested in the code (like "timed out") and isn't interested in the details in the object. What do you think?
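The trade-off described here, a separate reason code plus an optional exception object, might look like the following sketch. The names `FailureReason`, `OnFailure`, `CodeOnlyHandler` and `FailureDemo` are hypothetical, the Interest is reduced to its name as a String, and an enum is used for the reason code where the earlier proposal used an int.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reason codes; only these two cases are discussed in the thread. */
enum FailureReason { TIMEOUT, TRANSPORT_UNCONNECTED }

/** Hypothetical callback: the reason code is its own argument, details are optional. */
interface OnFailure {
    void onFailure(String interestName, FailureReason reasonCode, Exception info);
}

/** A handler that only cares about the code and never inspects the info object. */
class CodeOnlyHandler implements OnFailure {
    final List<String> retried = new ArrayList<>();

    @Override
    public void onFailure(String interestName, FailureReason reasonCode, Exception info) {
        if (reasonCode == FailureReason.TIMEOUT)
            retried.add(interestName); // e.g. re-express later
    }
}

class FailureDemo {
    /** A foreign exception is passed through as-is; no NdnException wrapper is needed. */
    static void reportSendError(String interestName, IOException cause, OnFailure callback) {
        callback.onFailure(interestName, FailureReason.TRANSPORT_UNCONNECTED, cause);
    }

    /** A plain timeout has no extra detail, so info is null. */
    static void reportTimeout(String interestName, OnFailure callback) {
        callback.onFailure(interestName, FailureReason.TIMEOUT, null);
    }
}
```

A handler that wants details can still check the type of `info`, but a handler that does not never has to touch it.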
Updated by Andrew Brown over 8 years ago
That makes sense; sometimes I forget you have to consider other languages.
Updated by Anonymous over 8 years ago
Hi Andrew. The first argument to OnFailure will be the Interest which was passed to expressInterest. Since this callback is specific to expressInterest I'm thinking to give it a more specific name like OnExpressFailure. What do you think?
Updated by Anonymous over 8 years ago
Hi Andrew. In branch issue/2075-OnExpressFailure I added expressInterest overloads which take an OnExpressFailure.
https://github.com/named-data/jndn/blob/13924934c95d55ed6ea0ec8720401cb0a67c2859/src/net/named_data/jndn/Face.java#L95
Now we can discuss a way for AsyncTcpTransport to alert the application with ExpressFailureReason.TRANSPORT_UNCONNECTED. The easiest way is for Transport.send to immediately inform expressInterest that the connection is down so that it can call the OnExpressFailure callback right away.
https://github.com/named-data/jndn/blob/e62bfee4951b74d09e0602e75e5f21886249698e/src/net/named_data/jndn/Node.java#L598
Do you think this is enough for you to experiment? Or do we also need to handle the (more difficult) case where an Interest has been sent and later the AsyncTcpTransport discovers that the connection dropped and wants the library to invoke the OnExpressFailure callback?
Updated by Andrew Brown over 8 years ago
Jeff, let's see if I understand: as it looks right now, all we can do is catch the AsyncTcpTransport's IOException in Node and then call OnExpressFailure instead of letting the exception bubble up, right? This does not add much except standardization of how we handle failures. So I wonder if your third paragraph is more the case: it seems like we should think of a way for the asynchronously returned failures to trigger the OnExpressFailure. Perhaps we must pass this callback to AsyncTcpTransport?
Updated by Anonymous over 8 years ago
Hi Andrew. You have a good point. Simply reacting to the synchronous call to send is little better than throwing a fancy exception. So we should focus on the asynchronous case and use the failure callbacks from previous calls to expressInterest. I've been thinking about this and have a basic question. If AsyncTcpTransport discovers a dropped connection, should the library simply call the OnExpressFailure callback for all pending interests and clear out the PendingInterestTable? That means that if the connection is re-established and a Data packet does arrive, there will be no pending interest to match it and the library won't call OnData. What would you expect to happen?
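The "fail everything and clear the table" behavior can be sketched as follows. `SimplePit` is a hypothetical, heavily simplified stand-in for the library's pending interest table: names are plain Strings, matching is exact, and timeouts are left out. It shows the consequence raised in the question, namely that Data arriving after a reconnect finds nothing to match.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical simplified pending interest table. */
class SimplePit {
    interface OnData { void onData(String name, String content); }
    interface OnExpressFailure { void onExpressFailure(String name, String reason); }

    private static final class Pending {
        final OnData onData;
        final OnExpressFailure onFailure;
        Pending(OnData onData, OnExpressFailure onFailure) {
            this.onData = onData;
            this.onFailure = onFailure;
        }
    }

    private final Map<String, Pending> entries = new LinkedHashMap<>();

    void add(String name, OnData onData, OnExpressFailure onFailure) {
        entries.put(name, new Pending(onData, onFailure));
    }

    int size() { return entries.size(); }

    /** On a dropped connection: clear the table first, then fail every former entry. */
    void onTransportDropped() {
        List<Map.Entry<String, Pending>> snapshot = new ArrayList<>(entries.entrySet());
        entries.clear(); // cleared before the callbacks so a callback may safely re-add
        for (Map.Entry<String, Pending> e : snapshot)
            e.getValue().onFailure.onExpressFailure(e.getKey(), "TRANSPORT_UNCONNECTED");
    }

    /** Returns false when no pending interest matches, so OnData is never called. */
    boolean onDataArrived(String name, String content) {
        Pending pending = entries.remove(name);
        if (pending == null)
            return false;
        pending.onData.onData(name, content);
        return true;
    }
}
```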
Updated by Anonymous over 8 years ago
... to elaborate my question, suppose an application that receives an OnExpressFailure callback with TRANSPORT_UNCONNECTED simply calls expressInterest again, which (because it's still unconnected) will immediately call the OnExpressFailure callback again, and we could have a nasty loop. Is it the application's responsibility to delay the call to re-express the interest? An alternative: When the library calls OnExpressFailure with TRANSPORT_UNCONNECTED for an interest, it keeps the interest in the PendingInterestTable until the timeout expires. If the connection is restored and the library receives a Data packet for the interest, then it can simply call OnData. Otherwise when the timeout expires, the library calls OnExpressFailure, maybe with a different reason code like TIMEOUT_DURING_TRANSPORT_UNCONNECTED. And maybe it's at this moment that the application can re-express the interest (thereby avoiding a nasty loop).
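The alternative described here can be sketched with a small table that keeps entries across a drop. As before, this is a hypothetical simplification: `DeferredFailurePit` is not a jndn class, interests are identified by a String name, and time is passed in explicitly (`nowMs`) so the behavior is deterministic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical table that notifies on a drop but keeps entries until they time out. */
class DeferredFailurePit {
    enum Reason { TIMEOUT, TRANSPORT_UNCONNECTED, TIMEOUT_DURING_TRANSPORT_UNCONNECTED }

    interface Callbacks {
        void onData(String name);
        void onExpressFailure(String name, Reason reason);
    }

    private static final class PendingEntry {
        final long deadlineMs;
        final Callbacks callbacks;
        boolean sawDrop;
        PendingEntry(long deadlineMs, Callbacks callbacks) {
            this.deadlineMs = deadlineMs;
            this.callbacks = callbacks;
        }
    }

    private final Map<String, PendingEntry> entries = new LinkedHashMap<>();

    void express(String name, long nowMs, long lifetimeMs, Callbacks callbacks) {
        entries.put(name, new PendingEntry(nowMs + lifetimeMs, callbacks));
    }

    /** Informs every pending interest of the drop but leaves it in the table. */
    void onTransportDropped() {
        List<String> names = new ArrayList<>(entries.keySet());
        for (String name : names) {
            PendingEntry entry = entries.get(name);
            if (entry == null) continue; // removed by an earlier callback
            entry.sawDrop = true;
            entry.callbacks.onExpressFailure(name, Reason.TRANSPORT_UNCONNECTED);
        }
    }

    /** Data that arrives before the deadline still reaches OnData, even after a drop. */
    void onData(String name) {
        PendingEntry entry = entries.remove(name);
        if (entry != null)
            entry.callbacks.onData(name);
    }

    /** Expires entries; the reason records whether a drop happened while pending. */
    void tick(long nowMs) {
        List<String> names = new ArrayList<>(entries.keySet());
        for (String name : names) {
            PendingEntry entry = entries.get(name);
            if (entry == null || nowMs < entry.deadlineMs) continue;
            entries.remove(name);
            entry.callbacks.onExpressFailure(name, entry.sawDrop
                    ? Reason.TIMEOUT_DURING_TRANSPORT_UNCONNECTED
                    : Reason.TIMEOUT);
        }
    }
}
```

An application following this scheme would treat TRANSPORT_UNCONNECTED as informational and re-express only on one of the two timeout reasons, which spaces retries by at least the interest lifetime.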