ASF Strategy should isolate timed-out or NACK'd interests into separate namespaces
In the week of October 9, 2017, problems on the NDN Testbed revealed flaws in the ASF strategy.
Currently, if the local NFD instance has a FIB entry for a very broad prefix (in our case, it was
/ndn/edu/ucla), then even one misbehaving or malicious prefix anchored there can cause major routing disruptions for all traffic under that prefix. The strategy should be able to defend against this kind of failure.
I have one suggestion to mitigate this: Whenever some interest provokes a NACK or is timed-out, the strategy should remember which namespace that was, and only change its nexthop selection for that namespace. Moreover, the strategy could choose to isolate a broader namespace, obtained by removing the last n components from the name of the offending interest. The number of components to remove should be configurable.
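A minimal sketch of this idea in Python, not NFD code. The names `FailureIsolator`, `isolation_prefix`, and `strip`, the face labels, and the Interest names under /ndn/edu/ucla are all hypothetical and only illustrate the bookkeeping:

```python
from collections import defaultdict


def components(name):
    """Split an NDN-style name such as '/ndn/edu/ucla' into its components."""
    return [c for c in name.split("/") if c]


def isolation_prefix(interest_name, strip):
    """Remove the last `strip` components from the Interest name."""
    parts = components(interest_name)
    keep = max(len(parts) - strip, 0)
    return "/" + "/".join(parts[:keep])


class FailureIsolator:
    """Remembers failing nexthops per isolated namespace (hypothetical helper)."""

    def __init__(self, strip):
        self.strip = strip  # configurable number of components to remove
        self.failed = defaultdict(set)  # isolated prefix -> nexthops that failed

    def record_failure(self, interest_name, nexthop):
        """Call on a NACK or timeout for `interest_name` sent via `nexthop`."""
        self.failed[isolation_prefix(interest_name, self.strip)].add(nexthop)

    def usable_nexthops(self, interest_name, nexthops):
        """Filter out nexthops that failed for a namespace covering this name."""
        parts = components(interest_name)
        bad = set()
        for i in range(len(parts) + 1):
            bad |= self.failed.get("/" + "/".join(parts[:i]), set())
        good = [h for h in nexthops if h not in bad]
        return good or list(nexthops)  # fall back to all nexthops if none is left


iso = FailureIsolator(strip=1)
iso.record_failure("/ndn/edu/ucla/scripts/run", "face-A")
print(iso.usable_nexthops("/ndn/edu/ucla/scripts/other", ["face-A", "face-B"]))
print(iso.usable_nexthops("/ndn/edu/ucla/ping", ["face-A", "face-B"]))
```

With `strip=1`, the failure is recorded for /ndn/edu/ucla/scripts only, so unrelated traffic under /ndn/edu/ucla keeps its original nexthop choices.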
Updated by Nicholas Gordon over 2 years ago
I have a few thoughts, but these are more rough and un-discussed, so they are living in a note:
There could be a mechanism by which a prefix is marked "non-responsible": interests under this prefix would always be isolated into their own namespaces, so that other, unknown traffic is considered "innocent", and nexthop choices are only affected for offending interests. This could be used in addition to or instead of the suggestion in the issue description.
/ndn/edu/memphis/nmgordon could be considered non-responsible, so that failures or NACKs from
/ndn/edu/memphis/nmgordon/some-experiment don't cause a change in nexthop selection behavior for
/ndn/edu/memphis/nmgordon/home-automation, or any other prefixes I may advertise.
I'm not sure if failures at
/ndn/edu/memphis should cause selection changes for the
nmgordon sub-prefix, or if it should be considered separately. The main issue I see with something like this is that the strategy begins to subsume the function of the FIB: it would need to keep a detailed and complex set of relationships for each FIB entry, maintaining a sort of "meta-FIB".
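One possible reading of the "non-responsible" idea, as a rough Python sketch. It assumes that each direct child of a non-responsible prefix becomes its own isolation namespace, and that everything else is tracked per FIB entry; the function name `isolation_key` and the trailing name components in the example are hypothetical:

```python
def components(name):
    """Split an NDN-style name into its components."""
    return [c for c in name.split("/") if c]


def isolation_key(interest_name, fib_prefix, non_responsible):
    """Pick the namespace whose nexthop state a failure should affect.

    Assumption: under a non-responsible prefix, the key is that prefix plus
    one more component; otherwise the key is the matching FIB prefix.
    """
    parts = components(interest_name)
    best = None
    for prefix in non_responsible:
        p = components(prefix)
        # Longest non-responsible prefix that is strictly shorter than the name.
        if parts[:len(p)] == p and len(parts) > len(p):
            if best is None or len(p) > len(best):
                best = p
    if best is None:
        return fib_prefix
    return "/" + "/".join(parts[:len(best) + 1])


marked = {"/ndn/edu/memphis/nmgordon"}
print(isolation_key("/ndn/edu/memphis/nmgordon/some-experiment/seg0",
                    "/ndn/edu/memphis", marked))
print(isolation_key("/ndn/edu/memphis/nmgordon/home-automation/lamp",
                    "/ndn/edu/memphis", marked))
```

The two Interests map to different keys, so a failure for some-experiment would leave the state for home-automation untouched. The set `marked` is exactly the extra per-prefix state that makes this look like a "meta-FIB".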
Updated by Nicholas Gordon over 2 years ago
Additionally, I should note that it seems clear to me that if a topology has a generally high degree, failures of this sort are less damaging, because in most cases the performance along alternate routes will be similar, only slightly worse. However, the nodes concerned on the testbed have comparatively low degree, and the difference in costs between their nexthops is very high, as some cross the Pacific Ocean. Possibly a strategy should be configurable with a "preference" to keep using seemingly unreliable links when the alternatives have very high latency.
I understand that the SRTT mechanism is supposed to fill this role, but it evidently needs some kind of tuning.
Updated by John DeHart over 2 years ago
I also wonder if there is perhaps a different NACK that is needed. If I remember correctly,
what was happening was the Interest with a prefix of /ndn/edu/ucla/scripts/ was getting
to the UCLA node but the daemon that should have handled that had died and the FIB entry
that that daemon had registered had gone away. So, the routing and forwarding actually
worked to the extent that the Interest reached the node it was supposed to. But, again
if I remember correctly, since the Interest was within the network region for that node,
it did not get a NACK NoRoute; it just got Rejected. The previous nodes just saw a
timeout for that set of Interests. It seems like what ASF needs is for that type of Interest to
get a NACK NoLocalRoute, or something like it, that indicates that we had reached the
correct place in the network but there was just no local server for it. Then ASF could
ignore Interests that get that NACK and not make any changes.
I'll go back and review the log files and see if I can verify that that was really what I saw.
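A rough Python sketch of the decision the destination node could make under this proposal. NoLocalRoute is the hypothetical reason suggested above, not an existing NACK type, and the function and parameter names (`nack_reason`, `local_region`, `fib_prefixes`) are assumptions made for illustration:

```python
def components(name):
    """Split an NDN-style name into its components."""
    return [c for c in name.split("/") if c]


def is_prefix(prefix, name):
    """True if `prefix` is a component-wise prefix of `name`."""
    p, n = components(prefix), components(name)
    return n[:len(p)] == p


def nack_reason(interest_name, local_region, fib_prefixes):
    """Return None if the Interest can be forwarded, else a NACK reason."""
    if any(is_prefix(p, interest_name) for p in fib_prefixes):
        return None  # some FIB entry still matches
    if is_prefix(local_region, interest_name):
        # Right place in the network, but no local application is registered.
        return "NoLocalRoute"  # hypothetical reason from this comment
    return "NoRoute"


# The daemon serving /ndn/edu/ucla/scripts has died and its FIB entry is gone.
print(nack_reason("/ndn/edu/ucla/scripts/x", "/ndn/edu/ucla", ["/ndn/edu/ucla/ping"]))
```

Upstream nodes would then receive an explicit signal instead of a timeout, and could leave their nexthop selection unchanged.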
Updated by Klaus Schneider over 2 years ago
One question: is the FIB prefix /ndn/edu/ucla split up at some point into finer-grained prefixes?
Whenever some interest provokes a NACK or is timed-out, the strategy should remember which namespace that was, and only change its nexthop selection for that namespace.
How do you know the granularity of the offending namespace? (i.e. the prefix length)
You can make the guess "n components broader than the offending Interest", but then you have to make a trade-off between keeping too much routing state (n too small) and isolating too large a prefix (n too large).
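The trade-off can be made concrete with a small Python sketch that counts how many isolated prefixes the strategy would have to remember for different values of n. The Interest names are made up for illustration:

```python
def strip_components(name, n):
    """Return `name` with its last n components removed."""
    parts = [c for c in name.split("/") if c]
    keep = max(len(parts) - n, 0)
    return "/" + "/".join(parts[:keep])


def isolated_prefixes(failing_names, n):
    """Set of namespaces that would be isolated for a given n."""
    return {strip_components(name, n) for name in failing_names}


# Hypothetical Interests that timed out or were NACKed.
failing = [
    "/ndn/edu/ucla/scripts/a/1",
    "/ndn/edu/ucla/scripts/a/2",
    "/ndn/edu/ucla/scripts/b/1",
]

for n in range(4):
    print(n, sorted(isolated_prefixes(failing, n)))
```

For these names, n = 0 keeps one entry per Interest, n = 2 collapses everything into /ndn/edu/ucla/scripts, and n = 3 isolates all of /ndn/edu/ucla, which is the situation the isolation was meant to avoid.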
- Don't use timeouts as indication of a routing problem
- Use NACKs but consider the specific NACK type (as you said in the other issue)
- Require a signature for the NACKs to prevent/reduce malicious use
Timeouts can be caused by either a problem inside the network (link failure, router failure) or a problem at the end-point application (app not answering). Thus, unless your messages are specifically addressing routers (like the different OSPF message types), I wouldn't use timeouts to influence routing decisions.
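The three points above could be combined into a single check, sketched here in Python. This is not how ASF currently behaves. The `Feedback` type and the `signature_valid` flag are hypothetical, signature verification itself is not shown, and treating only NoRoute as a routing signal is an assumption of this sketch:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feedback:
    kind: str                          # "timeout" or "nack"
    nack_reason: Optional[str] = None  # e.g. "NoRoute", "Congestion", "Duplicate"
    signature_valid: bool = False      # outcome of a verification step not shown here


# Assumption: only these reasons indicate a routing problem.
ROUTING_NACK_REASONS = {"NoRoute"}


def affects_nexthop_selection(fb):
    """Decide whether this feedback should change nexthop selection."""
    if fb.kind == "timeout":
        return False  # could be an unresponsive application, not a routing problem
    if fb.kind == "nack":
        return fb.signature_valid and fb.nack_reason in ROUTING_NACK_REASONS
    return False


print(affects_nexthop_selection(Feedback("timeout")))
print(affects_nexthop_selection(Feedback("nack", "NoRoute", signature_valid=True)))
print(affects_nexthop_selection(Feedback("nack", "NoRoute", signature_valid=False)))
```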