Project

General

Profile

Actions

Bug #3722

closed

Scheduler stops if a callback throws

Added by Junxiao Shi over 7 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Utils
Target version:
Start date:
08/06/2016
Due date:
% Done:

100%

Estimated time:
2.00 h

Description

Steps to reproduce:

  1. declare a ndn::util::scheduler::Scheduler
  2. scheduler two events; the first event callback throws an exception
  3. invoke io.run(), catch the exception thrown by the first event callback, and restart io.run()

Expected: the second event is executed

Actual: the second event is never executed

Actions #1

Updated by Junxiao Shi over 7 years ago

  • Related to Bug #3691: Scheduler::cancelAllEvents causes memory error in ScopedEventId added
Actions #2

Updated by Davide Pesavento over 7 years ago

Just because I noticed this problem during the code review for #3691, it doesn't mean it's "related to" #3691...

Actions #3

Updated by Junxiao Shi over 7 years ago

  • Status changed from New to In Progress
  • Assignee set to Junxiao Shi
Actions #4

Updated by Junxiao Shi over 7 years ago

  • Related to deleted (Bug #3691: Scheduler::cancelAllEvents causes memory error in ScopedEventId)
Actions #5

Updated by Junxiao Shi over 7 years ago

  • Status changed from In Progress to Code review
  • % Done changed from 0 to 100
Actions #6

Updated by Junxiao Shi over 7 years ago

https://gerrit.named-data.net/3037 patchset1 adds a benchmark for Scheduler.

I compared the Scheduler before #3691 changes, https://gerrit.named-data.net/3035 patchset6, https://gerrit.named-data.net/3036 patchset2.
ONL pc2core, configure line --with-tests --without-pch --enable-static --disable-shared, 1000000 events, duration in micros.

version schedule cancel execute schedule+execute
commit:750472b63d7df6254540356bb09647132b1e3104 368277 506833 133406 501684
commit:901f4a1ee7f71aa7a3ab48dafc3e912c504fdef0 412478 492051 112641 525119
commit:a482df9d511e6269844e990095fa64a8d61b92f3 414450 527419 117375 531825

I believe 30 milliseconds extra overhead per million events to resolve two bugs should be acceptable, and no further optimization is needed.

Actions #7

Updated by Alex Afanasyev over 7 years ago

While 30 milliseconds doesn't sounds a lot, 6 percent increase which it equivalents sounds non-trivial increase.

Given the problem in this bug is kind of corner case, can we have a solution that does not have such an increase of penalty? What is the reason for the penalty?

Actions #8

Updated by Davide Pesavento over 7 years ago

Alex Afanasyev wrote:

While 30 milliseconds doesn't sounds a lot, 6 percent increase which it equivalents sounds non-trivial increase.

True, 6% is quite a lot.

Looking at the numbers in note-6 again, the increase in cancel benchmark after commit a482df9d511e6269844e990095fa64a8d61b92f3 is very strange, since that commit doesn't touch cancelEvent(). Junxiao, do you have an explanation?

Actions #9

Updated by Junxiao Shi over 7 years ago

I'll rerun the benchmark on ONL and see what it gives.
Vagrant box has frequent slowdowns recently.

The result is actually from ONL.
I guess I need to run the benchmark multiple times to average out external factors such as OS scheduling.

Actions #10

Updated by Junxiao Shi over 7 years ago

I ran the benchmarks again. This time it's on a Vagrant trusty 64-bit box, since ONL is unavailable at the moment.

First I compile three versions into three binaries, statically linked.
Then I run those binaries 50 times.

TRIALS=50
CASES='base 3691 3722'

for CASE in $CASES; do
  > $CASE.log
done

for I in $(seq $TRIALS); do
  echo $I
  for CASE in $CASES; do
    echo >> $CASE.log
    date -u >> $CASE.log
    ./scheduler-benchmark-$CASE -l message >> $CASE.log
  done
done

Finally, I collect the results:

for CASE in $CASES; do
  for METRIC in schedule cancel execute; do
    echo -n $CASE $METRIC ' '
    echo '('$(awk '$1=="'$METRIC'"{print $4}' $CASE.log | tr '\n' '+')'0)/'$(awk '$1=="'$METRIC'"' $CASE.log | wc -l)'/1000' | bc
  done
done

This gives the average duration of each million operations:

$CASE patchset schedule cancel execute schedule+execute
base 3037,4 1940027 19582392 413383 2353410
#3691 3035,7 2036830 19590815 324531 2361361
#3722 3036,3 2039858 19676666 319231 2359089

One thing I notice during the benchmarks is: softirqd kernel thread is using up one CPU core.

Another observation is that there are a small number of failed tests, when the event recording completion time is executed before some other events.
When this occurs, "execute" metric would be unavailable.

vagrant@m0212:~/benchmark$ grep 'error in' base.log 3691.log 3722.log
base.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
base.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
base.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
base.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
base.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
3722.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999995 != 1000000]
3722.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
3722.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
3722.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
3722.log:../tests/integrated/scheduler-benchmark.cpp(72): fatal error in "Execute": critical check nExpired == nEvents failed [999999 != 1000000]
Actions #11

Updated by Junxiao Shi over 7 years ago

  • Status changed from Code review to Closed
Actions

Also available in: Atom PDF