Task #3593
open
Handle old state in Sync digest tree
0%
Description
When a router is restarted in the network, the router sends a Sync recovery Interest and synchronizes with the data of every other node in the network. The synchronized data may include old data as well as current data. The recovering router may therefore learn the last known sequence number of a node that has been down for a long time, and will attempt to fetch that node's LSAs to synchronize its own LSDB.
John reported this issue with regard to newly restarted routers learning about and expressing Interests for PKU's LSAs, even though PKU has been offline for months.
This task should investigate potential solutions for resolving this problem (e.g., by removing or ignoring old data).
One potential solution is to maintain a timestamp for each digest node and pass this timestamp to NLSR when there is a Sync update. NLSR can decide if it will try to fetch the LSAs based on this timestamp and the configured router-dead-interval.
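As a rough illustration of the check proposed above, the following Python sketch compares the timestamp attached to a Sync update against the configured router-dead-interval. The function name `should_fetch_lsa`, the sample interval of one hour, and the sample times are all assumptions made for illustration; none of them come from NLSR or ChronoSync.

```python
from datetime import datetime, timedelta, timezone


def should_fetch_lsa(entry_timestamp, now, router_dead_interval):
    """Return True if the Sync entry is recent enough to be worth fetching.

    entry_timestamp: when the digest node was last updated (hypothetical field).
    router_dead_interval: configured interval after which a router is presumed dead.
    """
    age = now - entry_timestamp
    return age <= router_dead_interval


# Illustrative values only; the interval is an assumed setting, not a default.
now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
dead_interval = timedelta(hours=1)
recent_update = now - timedelta(minutes=10)
stale_update = now - timedelta(days=90)

print(should_fetch_lsa(recent_update, now, dead_interval))
print(should_fetch_lsa(stale_update, now, dead_interval))
```

With a check like this, a recovering router would skip entries such as those of a node that has been offline for months, while still fetching LSAs for recently updated entries.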
Updated by Nicholas Gordon over 7 years ago
- Related to Task #3842: Add timestamp information for LSAs in LSDB added
Updated by Nicholas Gordon over 7 years ago
In an email conversation with Dr. Wang (edited for content):
For Sync, the definition of what set is being sync’ed is important - (a) If the set is defined as any data that has ever been produced under a given name prefix, then whether a node is alive or not does not make any difference; (b) If the definition of the data set is what data the current sync group members have, then when a group member is gone, that should be reflected in the set... (ChronoSync) should have a mechanism to detect the liveness of the group members... without this mechanism, we run into the situation described in the redmine issue. When a router recovers, it syncs with other nodes and gets some data names that belong to nodes that are already dead or gone. It then tries to fetch the data, wasting its CPU and bandwidth. The data may or may not be in the network, depending on when the node disappeared.
For NLSR, we certainly care about whether a node is live or not. When a node is dead or not reachable, its data should not be retrieved or considered in the routing computation. Note that theoretically their data should not affect the results as they are not reachable anyway... We require nodes to periodically refresh their LSAs so any LSAs that haven’t been refreshed for a while will expire and be deleted...
Since ChronoSync does not have a complete solution yet, the solution suggested in the redmine is a quick fix, i.e., if we put a timestamp in every ChronoSync reply to indicate when that data was first put into the sync data set, then NLSR can use that information to decide whether to retrieve the data. This does require ChronoSync to add the timestamp information to its data structure. It’s not a complete solution in the sense that ChronoSync doesn’t do liveness detection and still sync’s obsolete data names. It just gives NLSR some more information to solve the problem mentioned in the redmine. It assumes that NLSR still does its periodic refreshes for removing obsolete LSAs.
Now that I’m thinking about this again, we can actually add the timestamp information to the LSA data name (at the end after any version number). This way NLSR just needs to look at the LSA data name and then decide whether to retrieve it. This does require clock synchronization, but it can be a relatively loose sync as long as we have a relatively long grace period for considering LSAs obsolete.
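The name-based variant described in the email could look roughly like the Python sketch below. The name layout (a decimal millisecond timestamp as the final component, after the version number), the helper names, and the lifetime and grace values are assumptions for illustration; the actual LSA naming scheme is not specified here.

```python
def timestamp_from_name(name):
    """Return the trailing timestamp component (ms since epoch) of an LSA name.

    Assumes a hypothetical layout: .../<version>/<timestamp-ms>
    """
    last_component = name.rstrip("/").split("/")[-1]
    return int(last_component)


def is_obsolete(name, now_ms, lifetime_ms, grace_ms):
    """Decide from the name alone whether the LSA is too old to retrieve.

    grace_ms absorbs loose clock synchronization between routers.
    """
    age_ms = now_ms - timestamp_from_name(name)
    return age_ms > lifetime_ms + grace_ms


# Illustrative values only.
now_ms = 1_500_000_000_000
fresh_name = "/example/site/router1/LSA/name/7/%d" % (now_ms - 10_000)
old_name = "/example/site/router1/LSA/name/7/%d" % (now_ms - 100_000)

print(is_obsolete(fresh_name, now_ms, lifetime_ms=60_000, grace_ms=30_000))
print(is_obsolete(old_name, now_ms, lifetime_ms=60_000, grace_ms=30_000))
```

The grace period is added to the lifetime so that a moderately skewed clock on the producer does not cause a still-valid LSA to be treated as obsolete.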
From this conversation I see two approaches to a solution for this issue:
1. Add a timestamp to Sync
2. Add a timestamp to the LSA
Inarguably (2) will be less work, but I wonder if this is only a band-aid solution and not mending the broken bone, so to speak. I suspect that many applications will want to be able to selectively fetch items in a sync set based on age. In that case, it seems that (1) is the more systematic solution. What's more, (1) doesn't place a limitation on whether the sync set prunes items based on age: if some consumer is interested in the whole set, they can ignore the timestamp. The overhead of this is known and static, specifically the 64-bit UTC timestamp plus the overhead to encode it in the data.
To me, it seems that the solution of adding a timestamp to ChronoSync replies allows ChronoSync to behave like an insertion-only, comprehensive set, but still provide consumers the ability to select a temporally-relevant subset, which satisfies the most people.
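A minimal Python sketch of that idea follows: an insertion-only set that records when each name was first added, so a consumer can either read the whole set or select only recent entries. The class `TimestampedSyncSet` and its methods are hypothetical and do not reflect ChronoSync's actual data structures.

```python
class TimestampedSyncSet:
    """Insertion-only set of data names, each tagged with its insertion time."""

    def __init__(self):
        self._first_seen = {}  # name -> timestamp of first insertion

    def insert(self, name, timestamp):
        # Keep the first timestamp; entries are never removed.
        self._first_seen.setdefault(name, timestamp)

    def all_names(self):
        # Consumers interested in the whole set simply ignore timestamps.
        return sorted(self._first_seen)

    def names_since(self, cutoff):
        # Consumers interested in recent data select a temporal subset.
        return sorted(n for n, t in self._first_seen.items() if t >= cutoff)


sync_set = TimestampedSyncSet()
sync_set.insert("/example/old-router/1", timestamp=100)
sync_set.insert("/example/live-router/5", timestamp=900)
sync_set.insert("/example/old-router/1", timestamp=950)  # ignored: already present

print(sync_set.all_names())
print(sync_set.names_since(cutoff=500))
```

The set itself never prunes anything; the age-based policy lives entirely in the consumer, which is what keeps this approach neutral about whether old items should be dropped.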
Updated by Nicholas Gordon over 7 years ago
In a group meeting on 15-06-17, Dr. Wang suggested that we could also encode the LSA timestamp directly into the Interest as a name component. She says that's what she meant all along. I think this is a good idea. Even though it introduces 8 bytes of overhead for the timestamp, neither of the other solutions avoids this, and this solution requires a router to do minimal processing, i.e., no TLV decoding, to determine the age of an LSA.
At any rate, more discussion is required.