Bug #1968 (closed)

nfd-autoreg: Inconsistency between the existing faces and registered prefixes

Added by Alex Afanasyev over 9 years ago. Updated about 8 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Tools
Target version:
Start date: 09/05/2014
Due date:
% Done: 100%
Estimated time:

Description

Not sure how, but FIB/RIB was missing autoregistered prefixes for some faces. Either re-connecting to the hub or restarting nfd-autoreg solved the problem.

We could try doing periodic checks and re-registering prefixes for existing faces.


Files

20160302_CSU_nfd-autoreg.log (40.6 KB), Junxiao Shi, 03/02/2016 08:53 AM
Actions #1

Updated by Junxiao Shi over 9 years ago

  • Tracker changed from Task to Bug
  • Subject changed from nfd-autoreg: Inconsistency between the exisitng faces and reigsterd prefixes to nfd-autoreg: Inconsistency between the existing faces and registered prefixes

nfd-autoreg relies on face status change notifications, which are unreliable.

Instead of letting every program implement periodic Face Dataset requests, should we extend the FaceMonitor class to make it reliable?
Specifically, the caller could ask FaceMonitor to request the Face Dataset at a specified interval and trigger an event whenever a status change is detected based on the records in the Face Dataset.
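
(For illustration only, here is a rough operational analogue of that idea as a shell sketch: it polls the Face Dataset with nfd-status -f, whose output format appears later in this issue, and diffs the FaceIds between polls. The actual proposal would build this into the FaceMonitor class rather than use a script.)

#!/bin/bash
# Hypothetical sketch: poll the Face Dataset at a fixed interval and report
# faces that appeared or disappeared since the previous poll, i.e. the events
# a notification subscriber may have missed.
INTERVAL=60
prev=$(mktemp); trap 'rm -f "$prev"' EXIT
nfd-status -f | grep -o 'faceid=[0-9]*' | sort > "$prev"
while sleep "$INTERVAL"; do
  cur=$(mktemp)
  nfd-status -f | grep -o 'faceid=[0-9]*' | sort > "$cur"
  comm -13 "$prev" "$cur" | sed 's/^/created:   /'
  comm -23 "$prev" "$cur" | sed 's/^/destroyed: /'
  mv "$cur" "$prev"
done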

Actions #2

Updated by Lan Wang over 9 years ago

Agree. If every application needs this functionality, then building it into the FaceMonitor makes sense. Also, it can detect out-of-order or missing events using sequence numbers, right?

Actions #3

Updated by Junxiao Shi over 9 years ago

I'm analyzing one instance of this bug, which occurred on the ARIZONA hub:
Face 259, connected before nfd-autoreg starts, does not receive prefixes.

The following is seen in /var/log/ndn/nfd.log:

1410022032.492294 INFO: [FaceTable] Added face id=1 remote=internal:// local=internal://
1410022032.498037 INFO: [FaceTable] Added face id=256 remote=udp4://224.0.23.170:56363 local=udp4://128.196.203.36:56363
1410022032.523814 INFO: [FaceTable] Added face id=257 remote=ether://[01:00:5e:00:17:aa] local=dev://eth0
1410022036.147311 INFO: [FaceTable] Added face id=259 remote=udp4://149.255.100.107:6363 local=udp4://128.196.203.36:6363
1410022050.686803 INFO: [FaceTable] Added face id=278 remote=fd://21 local=unix:///run/nfd.sock
1410022052.165563 INFO: [FaceTable] Added face id=279 remote=fd://22 local=unix:///run/nfd.sock
1410022052.507011 INFO: [FaceTable] Added face id=282 remote=fd://25 local=unix:///run/nfd.sock

The following is seen in /var/log/ndn/nrd.log:

1410022052.101653 INFO: [RibManager] Listening on: /localhost/nfd/rib
1410022052.161907 INFO: [RibManager] Start monitoring face create/destroy events
Face Status Dataset request timed out
Face Status Dataset request timed out

nrd starts 20 seconds after nfd starts, because the operator puts sleep 15 in /etc/ndn/nfd-init.sh.

Face 259 is connected before nrd or nfd-autoreg starts.

The following is seen in /var/log/upstart/nfd-autoreg.log:

(process start)
IGNORED: FaceEventNotification(Kind: created, FaceID: 278, RemoteUri: fd://21, LocalUri: unix:///run/nfd.sock, Flags: 3)
IGNORED: FaceEventNotification(Kind: created, FaceID: 279, RemoteUri: fd://22, LocalUri: unix:///run/nfd.sock, Flags: 3)
(omit)
PROCESSING: FaceEventNotification(Kind: created, FaceID: 286, RemoteUri: udp4://128.97.98.7:6363, LocalUri: udp4://128.196.203.36:6363, Flags: 2)
SUCCEED: register /ndn/broadcast on face 286
FAILED: register /ndn/broadcast on face 258 (code: 10060, reason: Command Interest timed out)
FAILED: register /ndn/broadcast on face 259 (code: 10060, reason: Command Interest timed out)
FAILED: register /ndn/edu/arizona on face 259 (code: 10060, reason: Command Interest timed out)

It's obvious that nfd-autoreg is aware of the existence of face 259, but the prefix registration command has failed.
nrd.conf doesn't have a sufficient log level to show what's happening to the registration command,
and we cannot determine whether nrd is running when the registration commands are sent.

nfd-status -f | grep unix gives:

faceid=278 remote=fd://21 local=unix:///run/nfd.sock counters={in={854i 0d 51291B} out={0i 545d 236811B}} local on-demand
faceid=279 remote=fd://22 local=unix:///run/nfd.sock counters={in={1098i 2718d 3155153B} out={2718i 783d 1665197B}} local on-demand
faceid=280 remote=fd://23 local=unix:///run/nfd.sock counters={in={1i 0d 453B} out={0i 1d 829B}} local on-demand
faceid=281 remote=fd://24 local=unix:///run/nfd.sock counters={in={1i 3923d 1604482B} out={3923i 1d 200425B}} local on-demand
faceid=282 remote=fd://25 local=unix:///run/nfd.sock counters={in={139991i 0d 14404969B} out={0i 16610d 30080812B}} local on-demand
faceid=283 remote=fd://26 local=unix:///run/nfd.sock counters={in={5i 0d 2212B} out={0i 5d 4109B}} local on-demand
faceid=284 remote=fd://27 local=unix:///run/nfd.sock counters={in={9416i 2536d 2831205B} out={2537i 8762d 5596785B}} local on-demand
faceid=285 remote=fd://28 local=unix:///run/nfd.sock counters={in={7901i 870d 1781238B} out={8532i 1224d 1772984B}} local on-demand
faceid=547 remote=fd://34 local=unix:///run/nfd.sock counters={in={1i 0d 46B} out={0i 0d 0B}} local on-demand

nfd-status -b | grep -P "(278|279|280|281|282|283|284|285)\s" gives:

/localhost/repo-ng/insert nexthops={faceid=283 (cost=0)}
/ndn/NLSR/LSA/edu/arizona/%C1.Router/hobo nexthops={faceid=284 (cost=0)}
/ndn/edu/arizona/%C1.Router/hobo nexthops={faceid=284 (cost=0)}
/ndn/edu/arizona/ping nexthops={faceid=281 (cost=0), faceid=519 (cost=255)}
/localhost/repo-ng/watch nexthops={faceid=283 (cost=0)}
/ndn/broadcast/KEYS nexthops={faceid=284 (cost=0), faceid=289 (cost=17), faceid=287 (cost=18), faceid=286 (cost=25), faceid=288 (cost=25), faceid=258 (cost=33)}
/localhost/repo-ng/delete nexthops={faceid=283 (cost=0)}
/localhop/ndn-autoconf nexthops={faceid=280 (cost=0)}
/localhost/repo-ng/insert%20check nexthops={faceid=283 (cost=0)}
/localhost/repo-ng nexthops={faceid=283 (cost=0)}
/localhost/nfd/rib nexthops={faceid=279 (cost=0)}
/ndn/NLSR/sync nexthops={faceid=285 (cost=0), faceid=289 (cost=17), faceid=287 (cost=18), faceid=286 (cost=25), faceid=288 (cost=25), faceid=258 (cost=33)}

UNIX face 547 is the nfd-status command itself.
Other than that, faces 278 and 282 have no Route toward them, which matches the design of nfd-autoreg.
It's evident that face 278 is nfd-autoreg: nfd-autoreg contains a FaceMonitor, which sends about one Interest per minute, plus one Interest per face creation/destruction. After running for 6 hours, NInInterests=854 fits this characteristic.
Therefore, face 278 is nfd-autoreg.

It's obvious that face 279 is nrd, as determined by its FIB nexthop record.
This means nfd-autoreg is started BEFORE nrd.

Conclusion:

This problem instance is due to improper operational practice:
the startup scripts start nfd-autoreg before nrd, which causes the problem.

Applying the solution in note-1 cannot solve this problem.

Actions #4

Updated by John DeHart over 9 years ago

Junxiao Shi wrote:

Other than that, faces 278 and 282 have no Route toward them, which matches the design of nfd-autoreg.
It's evident that face 278 is nfd-autoreg: nfd-autoreg contains a FaceMonitor, which sends about one Interest per minute, plus one Interest per face creation/destruction. After running for 6 hours, NInInterests=854 fits this characteristic.
Therefore, face 278 is nfd-autoreg.

It's obvious that face 279 is nrd, as determined by its FIB nexthop record.
This means nfd-autoreg is started BEFORE nrd.

Conclusion:

This problem instance is due to improper operational practice:
the startup scripts start nfd-autoreg before nrd, which causes the problem.

Applying the solution in note-1 cannot solve this problem.

I follow the logic in this argument.

However, the way the Upstart configuration files are set up, nfd-autoreg SHOULD start
after nrd.
Here is the top of /etc/init/nfd-autoreg.conf:

description "NFD Autoreg Server"
author      "Alexander Afanasyev <alexander.afanasyev@ucla.edu>"

start on started  nrd
stop  on stopping nrd

However, what we have found is that a daemon may have started but not yet reached the state necessary for a following daemon to do what it needs.
I'm guessing this is the case here: nrd has actually started already, but hasn't gotten to the point where it can accept incoming requests before nfd-autoreg starts sending it requests.
We have a similar case with nfd. We have to delay the running of nfd-init.sh so that nfd is ready to receive the updates we want to make. I found a wide range of timing on different machines: on some machines 2 seconds is often enough, while on at least one of them I have to wait 15 seconds to get consistent results.

I will experiment with this some more to see if I can understand it better.

Actions #5

Updated by John DeHart over 9 years ago

When I look at the actual process IDs on the hubs, it does indeed look
like nfd-autoreg is starting BEFORE nrd. I obviously need to dig into this
some more. I thought the Upstart controls would keep them in order.

From the ps output below:

nfd PID: 21151

nfd-autoreg PID: 21234

nrd PID: 21245

ndnops@wundngw:~$ ps auxwww | grep nfd
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 21151 2.5 33.7 834496 690624 ? Ssl Sep06 39:06 /usr/bin/nfd --config /etc/ndn/nfd.conf
root 21173 0.0 0.4 110828 9192 ? Ss Sep06 0:13 python2.7 /usr/bin/nfd-status-http-server -a 0.0.0.0 -p 80
ndn 21234 0.0 0.4 71696 8512 ? Ss Sep06 0:20 /usr/bin/nfd-autoreg -b 128.196.203.36 -b 141.225.11.173 -b 72.36.112.82 --prefix /ndn/edu/wustl --all-faces-prefix /ndn/broadcast
ndn 21245 0.0 0.4 71356 10096 ? S Sep06 0:53 /usr/bin/nrd --config /etc/ndn/nfd.conf
ndnops@wundngw:~$

Actions #6

Updated by John DeHart over 9 years ago

As a temporary measure, I have added a pre-start delay of 2 seconds to the nfd-autoreg.conf Upstart file.
This seems to keep nfd-autoreg from starting before nrd is ready to receive requests.
I have done this on all Testbed nodes.

I am also working on some scripts that I plan to put in the post-start phase of nfd and nrd
that will verify that they are each ready to receive requests. These scripts will keep the
Upstart 'started' event for each of them from being emitted until they are actually started
AND ready to receive requests.
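
(For illustration, a sketch of what those Upstart changes might look like; this is not the exact change deployed on the Testbed. The first fragment is the temporary pre-start delay in /etc/init/nfd-autoreg.conf; the second is a possible post-start readiness check for nrd that holds back the 'started' event until the RIB management actually answers. Using nfd-status -r as the readiness probe is an assumption.)

# /etc/init/nfd-autoreg.conf -- temporary workaround: give nrd a head start
pre-start script
    sleep 2
end script

# /etc/init/nrd.conf -- do not emit 'started' until nrd answers a RIB dataset request
post-start script
    for i in $(seq 30); do
        nfd-status -r >/dev/null 2>&1 && exit 0
        sleep 1
    done
    exit 1
end script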

Actions #7

Updated by Alex Afanasyev over 9 years ago

Is this observed on the testbed nodes, or in a controlled environment?

It was observed on a testbed node (spurs).

If it's from testbed, note-3 has given the reason, and John has fixed the problem.

I understand and believe that this fixed the problem. However, the final fix needs to include an update to the Upstart scripts in the https://github.com/named-data/ppa-packaging repo. John, would you mind creating pull requests with the correct version of the scripts?

Actions #8

Updated by John DeHart over 9 years ago

Alex Afanasyev wrote:

Is this observed on the testbed nodes, or in a controlled environment?

It was observed on a testbed node (spurs).

If it's from testbed, note-3 has given the reason, and John has fixed the problem.

I understand and believe that this fixed the problem. However, the final fix needs to include an update to the Upstart scripts in the https://github.com/named-data/ppa-packaging repo. John, would you mind creating pull requests with the correct version of the scripts?

Sure. I'll work on that.

Actions #9

Updated by Junxiao Shi over 8 years ago

Alex said at the 20151117 conference call that there's still a problem if (1) nfd-autoreg starts after NFD and the RIB, and (2) faces are connected before nfd-autoreg starts.

Actions #10

Updated by Junxiao Shi over 8 years ago

  • Status changed from New to Closed

The note-9 issue is in fact #1863, which has been resolved. Thus, there's nothing else to do here, and this issue is closed.

Actions #11

Updated by Junxiao Shi about 8 years ago

I'm reopening this issue because the problem appeared again on the CSU node today, as reported by @John.
The client's face is 15614, but no route was created for it.
The node is running NFD v0.4.0.

nfd-autoreg.log is captured and attached to this post.
The log shows:

  • There are many duplicates of the same notification.
  • FaceIds in "face created" notifications are not consecutive.

nfd-autoreg uses FaceMonitor, which is a specialization of NotificationSubscriber, and its logic is:

  1. send initial Interest for the Notification Stream, rightmost, timeout=60s
  2. after receiving a notification, send Interest with one greater sequence number, timeout=60s
  3. after a timeout, go to step 1

While it's normal for nfd-autoreg to receive duplicate notifications (due to the step 1-3-1-3 loop when no new notification is posted), there shouldn't be a gap between FaceIds in "face created" notifications under normal circumstances.
Most likely, notifications are lost at the notification stream publisher, which is caused by the lack of #2182 combined with cs::PriorityFifoPolicy.

  • Without #2182, notifications are pushed into the ContentStore as unsolicited Data.
  • cs::PriorityFifoPolicy evicts unsolicited Data before anything else.
  • When the ContentStore is filled with non-unsolicited Data, notifications are evicted before the subscriber has a chance to retrieve them.

The solution to this issue is #2182.
A workaround is to use cs::LruPolicy, which does not preferentially evict unsolicited Data.

Actions #12

Updated by John DeHart about 8 years ago

Is cs::LruPolicy something that can be configured in nfd.conf, or is it a build-time configuration?

Actions #13

Updated by Jeff Burke about 8 years ago

Can this please be assigned to someone for resolution?

Actions #14

Updated by Junxiao Shi about 8 years ago

Is cs::LruPolicy something that can be configured in nfd.conf, or is it a build-time configuration?

Until #3148 is implemented, choosing the LRU policy requires modifying code at compile time.

Can this please be assigned to someone for resolution?

There's no need. This bug should resolve itself when #2182 is completed.

Actions #15

Updated by John DeHart about 8 years ago

I am working on analyzing the Testbed management scripts to see how many face events they cause.
Can I rely on the NFD log messages at INFO level to include ALL face event messages?
Specifically, I am looking for these two types of messages:

1457290444.818232 INFO: [FaceTable] Added face id=81175 remote=fd://40 local=unix:///run/nfd.sock
1457290444.827083 INFO: [FaceTable] Removed face id=81175 remote=fd://40 local=unix:///run/nfd.sock
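
(For reference, a quick shell sketch for counting those messages and estimating their rate; it assumes the /var/log/ndn/nfd.log path used earlier in this issue and the INFO-level message format above.)

# Count face creation/removal events recorded by FaceTable.
grep -c '\[FaceTable\] Added face'   /var/log/ndn/nfd.log
grep -c '\[FaceTable\] Removed face' /var/log/ndn/nfd.log

# Average interval between 'Added face' events, using the leading timestamps.
grep '\[FaceTable\] Added face' /var/log/ndn/nfd.log |
  awk '{ if (prev) { sum += $1 - prev; n++ } prev = $1 }
       END { if (n) printf "average interval: %.1f s over %d intervals\n", sum / n, n }'
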
Actions #16

Updated by Junxiao Shi about 8 years ago

Reply to note-15:

Yes.
FaceTable logs all face creation and destruction events, and these events directly trigger notifications.

However, you cannot distinguish which application is causing the face creations.

Actions #17

Updated by John DeHart about 8 years ago

On the WU node I am seeing an average of about 1 face Added and Removed every 4 seconds.
Similar on Remap.

That doesn't seem to me to be a heavy load.

I think in the past our monitoring of the Testbed used nfd-status more and so caused
more face events.

Actions #18

Updated by Alex Afanasyev about 8 years ago

The load is not heavy, but it creates a condition for nfd-autoreg to get timeouts and be out of sync with the latest face events.

Do you need to create faces so many times?

Actions #19

Updated by John DeHart about 8 years ago

Alex Afanasyev wrote:

The load is not heavy, but it creates a condition for nfd-autoreg to get timeouts and be out of sync with the latest face events.

Do you need to create faces so many times?

I'm not even sure at this point that it is the Testbed management that is causing so many.

How do I analyze where the face events are coming from?

Actions #20

Updated by Alex Afanasyev about 8 years ago

We can enable more logging and check what is being requested over those faces. Do you run nfd-status in your scripts? Or get status directly using the management protocols?

Actions #21

Updated by John DeHart about 8 years ago

Alex Afanasyev wrote:

We can enable more logging and check what is being requested over those faces. Do you run nfd-status in your scripts? Or get status directly using the management protocols?

OK. I'm doing that and will analyze the results and see what I find.

Actions #22

Updated by Junxiao Shi about 8 years ago

Actions #23

Updated by Junxiao Shi about 8 years ago

  • Assignee set to Weiwei Liu

As indicated in note-14, this issue is expected to require no implementation and should resolve itself after #2182.

However, we should test that #2182 indeed resolves this bug.

To perform a manual test:

  1. Modify nfd.conf: set tables.cs_max_packets to 0.
  2. Start NFD.
  3. Start nfd-autoreg.
  4. Connect to NFD from any source (a local application will do) at a fast interval (at least 20 connections per second; hint: for I in $(seq 20); do ndnpeek /$I & done would cause 20 applications to connect to NFD at the same time).
  5. Look at nfd-autoreg log: FaceId in "face created" notifications should be consecutive.
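
(A quick way to check step 5, as a sketch: it assumes the nfd-autoreg notification format shown in note-3 and note-11, and that the output has been saved to a file named nfd-autoreg.log, which is an assumption. Benign adjacent duplicate notifications are tolerated.)

# Extract FaceIDs from "face created" notifications and report any gaps.
grep 'Kind: created' nfd-autoreg.log |
  grep -o 'FaceID: [0-9]*' | awk '{ print $2 }' | uniq |
  awk 'NR > 1 && $1 != prev + 1 { printf "gap between %d and %d\n", prev, $1 } { prev = $1 }'
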
Actions #24

Updated by John DeHart about 8 years ago

Alex Afanasyev wrote:

We can enable more logging and check what is being requested over those faces. Do you run nfd-status in your scripts? Or get status directly using the management protocols?

A lot of my scripts use nfd-status, and some of the status data is obtained via nfd-status-http-server.

Actions #25

Updated by Weiwei Liu about 8 years ago

Reply to note-23.

Hi Junxiao, I didn't find the log at the path /var/log/upstart/nfd-autoreg.log. Is it enough to just look at the standard output after running nfd-autoreg? After running "for I in $(seq 20); do ndnpeek /$I & done", there should be 20 "face created" notifications in the log of nfd-autoreg, right? But from the standard output, it seems the log is incomplete; it only shows several "face created" notifications.

Actions #26

Updated by Weiwei Liu about 8 years ago

  • Status changed from New to In Progress
Actions #27

Updated by Junxiao Shi about 8 years ago

After running "for I in $(seq 20); do ndnpeek /$I & done", there should be 20 "face created" notifications in the log of nfd-autoreg, right?

Yes, if there's no bug.

But from the standard output, it seems the log is incomplete; it only shows several "face created" notifications.

This means the NFD under test has the bug described in note-11. This is exactly what you are looking for.


@Weiwei, the patches that are intended to fix this bug are ready for your testing.

ndn-cxx Change http://gerrit.named-data.net/2768

NFD Change http://gerrit.named-data.net/2785

After installing these versions, you can repeat the test, and nfd-autoreg output should show consecutive FaceIds in face creation events.

Actions #28

Updated by Junxiao Shi about 8 years ago

Follow-up to note-27: the code changes are merged to the master branch, so the test can just use the latest version from master.

Actions #29

Updated by Weiwei Liu about 8 years ago

I've verified this on both Mac (10.10.5) and Ubuntu (15.04) with the latest versions of ndn-cxx and NFD, and everything works normally. All the "face created" notifications are consecutive, and there are no duplicate notifications.

Actions #30

Updated by Junxiao Shi about 8 years ago

  • Blocked by deleted (Feature #2182: InMemoryStorage for management)
Actions #31

Updated by Junxiao Shi about 8 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

Thanks @Weiwei for testing.

This issue is considered resolved based on this test result.
