Project

General

Profile

Actions

Bug #3722

closed

Scheduler stops if a callback throws

Added by Junxiao Shi almost 10 years ago. Updated almost 10 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Utils
Target version:
Start date:
08/06/2016
Due date:
% Done:

100%

Estimated time:
2.00 h

Description

Steps to reproduce:

  1. declare a ndn::util::scheduler::Scheduler
  2. scheduler two events; the first event callback throws an exception
  3. invoke io.run(), catch the exception thrown by the first event callback, and restart io.run()

Expected: the second event is executed

Actual: the second event is never executed

Actions #1

Updated by Junxiao Shi almost 10 years ago

  • Related to Bug #3691: Scheduler::cancelAllEvents causes memory error in ScopedEventId added
Actions #2

Updated by Davide Pesavento almost 10 years ago

Just because I noticed this problem during the code review for #3691, it doesn't mean it's "related to" #3691...

Actions #3

Updated by Junxiao Shi almost 10 years ago

  • Status changed from New to In Progress
  • Assignee set to Junxiao Shi
Actions #4

Updated by Junxiao Shi almost 10 years ago

  • Related to deleted (Bug #3691: Scheduler::cancelAllEvents causes memory error in ScopedEventId)
Actions #5

Updated by Junxiao Shi almost 10 years ago

  • Status changed from In Progress to Code review
  • % Done changed from 0 to 100
Actions #6

Updated by Junxiao Shi almost 10 years ago

https://gerrit.named-data.net/3037 patchset1 adds a benchmark for Scheduler.

I compared the Scheduler before #3691 changes, https://gerrit.named-data.net/3035 patchset6, https://gerrit.named-data.net/3036 patchset2.
ONL pc2core, configure line --with-tests --without-pch --enable-static --disable-shared, 1000000 events, duration in micros.

version schedule cancel execute schedule+execute
commit:750472b63d7df6254540356bb09647132b1e3104 368277 506833 133406 501684
commit:901f4a1ee7f71aa7a3ab48dafc3e912c504fdef0 412478 492051 112641 525119
commit:a482df9d511e6269844e990095fa64a8d61b92f3 414450 527419 117375 531825

I believe 30 milliseconds extra overhead per million events to resolve two bugs should be acceptable, and no further optimization is needed.

Actions #7

Updated by Alex Afanasyev almost 10 years ago

While 30 milliseconds doesn't sounds a lot, 6 percent increase which it equivalents sounds non-trivial increase.

Given the problem in this bug is kind of corner case, can we have a solution that does not have such an increase of penalty? What is the reason for the penalty?

Actions #8

Updated by Davide Pesavento almost 10 years ago

Alex Afanasyev wrote:

While 30 milliseconds doesn't sounds a lot, 6 percent increase which it equivalents sounds non-trivial increase.

True, 6% is quite a lot.

Looking at the numbers in note-6 again, the increase in cancel benchmark after commit a482df9d511e6269844e990095fa64a8d61b92f3 is very strange, since that commit doesn't touch cancelEvent(). Junxiao, do you have an explanation?

Actions #9

Updated by Junxiao Shi almost 10 years ago

I'll rerun the benchmark on ONL and see what it gives.
Vagrant box has frequent slowdowns recently.

The result is actually from ONL.
I guess I need to run the benchmark multiple times to average out external factors such as OS scheduling.

Actions #10

Updated by Junxiao Shi almost 10 years ago

I ran the benchmarks again. This time it's on a Vagrant trusty 64-bit box, since ONL is unavailable at the moment.

First I compile three versions into three binaries, statically linked.
Then I run those binaries 50 times.

TRIALS=50
CASES='base 3691 3722'

for CASE in $CASES; do
  > $CASE.log
done

for I in $(seq $TRIALS); do
  echo $I
  for CASE in $CASES; do
    echo >> $CASE.log
    date -u >> $CASE.log
    ./scheduler-benchmark-$CASE -l message >> $CASE.log
  done
done

Finally, I collect the results:

for CASE in $CASES; do
  for METRIC in schedule cancel execute; do
    echo -n $CASE $METRIC ' '
    echo '('$(awk '$1=="'$METRIC'"{print $4}' $CASE.log | tr '\n' '+')'0)/'$(awk '$1=="'$METRIC'"' $CASE.log | wc -l)'/1000' | bc
  done
done

This gives the average duration of each million operations:

$CASE patchset schedule cancel execute schedule+execute
base 3037,4 1940027 19582392 413383 2353410
#3691 3035,7 2036830 19590815 324531 2361361
#3722 3036,3 2039858 19676666 319231 2359089

One thing I notice during the benchmarks is: softirqd kernel thread is using up one CPU core.

Another observation is that there are a small number of failed tests, when the event recording completion time is executed before some other events.
When this occurs, "execute" metric would be unavailable.

vagrant@m0212:~/benchmark$ grep 'error in' base.log 3691.log 3722.log
base.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
base.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
base.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
base.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
base.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
3722.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999995 != 1000000]
3722.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
3722.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
3722.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
3722.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
Actions #11

Updated by Junxiao Shi almost 10 years ago

  • Status changed from Code review to Closed
Actions

Also available in: Atom PDF