Task #1274
closedOptimize Scheduler
Added by Junxiao Shi almost 11 years ago. Updated over 10 years ago.
100%
Description
Optimize the time efficiency of Scheduler.
timeout.c: Tickless Hierarchical Timing Wheel is a possible starting point.
This task MUST NOT change the public API of Scheduler class.
Files
e15fa83a5d993289fb56d40e1b9776cb5f8655c8.zip (4.56 KB) e15fa83a5d993289fb56d40e1b9776cb5f8655c8.zip | wheel-based scheduler code | Junxiao Shi, 08/15/2014 08:41 PM |
Updated by Junxiao Shi over 10 years ago
This task should be moved to ndn-cpp-dev, because Task #1290 stage 1 imports time structs and Scheduler from the library.
Updated by Alex Afanasyev over 10 years ago
- Project changed from NFD to ndn-cxx
- Category deleted (
Core) - Target version deleted (
v0.1)
Updated by Jerald Paul Abraham over 10 years ago
- % Done changed from 0 to 20
I studied the following references for implementing a hierarchical timing wheel based timer:
http://25thandclement.com/~william/projects/timeout.c.html
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.1519
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.9667&rep=rep1&type=pdf
I have the following conclusions:
- The timeout.c implementation does not have a notion of time and only explains how a timer implementation can be made without the need of having a clock ticking at the lowest granularity of time. It is therefore sort of an API which implements the data structures for the theory and still requires us to actually call the library timer routines.
- The idea is to use the clock timer to interrupt only when the next timeout is expected and not for every tick of time between current time and next timeout. For example, if the granularity of time in our design is 1ms and the next timeout is 15ms away, then instead of having a ticking routine which ticks every 1ms, we can directly set the inbuilt clock timer to 15ms, thereby avoiding 15 ticks (interrupts) in between.
- However, the advantage of using timing wheels over our current implementation can be explained with an example. Suppose if we have three timeouts requested at 15ms, 30ms and 50ms. Then the current scheduler sets three different deadline timers and tags them onto boost ioService object. As a result three distinct timers run within boost which probably requires three separate threads internally. The timing wheel "tickless" approach would only tag a single deadline timer to 15ms and on its expiry start the next one for 15ms (30-15ms). On the expiry of the second timer, it would start the next timer for 20ms (50-30ms). Therefore at any given time, there would only be one deadline timer active and all bookkeeping for requested timers would be done within our code and not inside boost.
- Based on the idea understood, I am implementing a C++ implementation which would be somewhat different from the timeout.c but more aligned with our current ndn-cpp-dev scheduler. This is a slightly complex implementation so its success can only be accepted once the implementation is complete.
Will keep you all posted on the progress.
Please provide any inputs as it would be helpful to have more thoughts.
Also kindly correct me if you feel I have it wrong.
Updated by Junxiao Shi over 10 years ago
- Status changed from New to In Progress
- Priority changed from Low to Normal
You should have changed Status=In Progress when you start working. You also need to complete #1152 first, which blocks this task.
This task should
- expose same API as
ndn:Scheduler
- after each scheduling/cancelling/firing, call
timeouts_timeout
and schedule adeadline_timer
with this timeout - when
deadline_timer
fires, observe the monotonic clock, and invoketimeouts_update
There shouldn’t be any license issues in using timeouts.c (MIT license aka Expat license) in ndn-cpp-dev (3-clause BSD license), becuase MIT license is compatible with BSD.
Updated by Jerald Paul Abraham over 10 years ago
Quick Question
What is the lowest resolution of time required for the timer? Is seconds good enough?
Or would there be timer intervals requested for minutes, hours days?
I am using the highest resolution of timer as milliseconds (1ms).
Updated by Junxiao Shi over 10 years ago
I suggest microseconds granularity for Scheduler. Milliseconds isn't enough for some strategy operations.
time::Duration
is in nanoseconds. Scheduler should round up to a microsecond.
Do not make assumption on the maximum duration. Scheduler should support any duration that could be represented by time::Duration
and doesn't cause time::Point
to overflow.
Updated by Jerald Paul Abraham over 10 years ago
- % Done changed from 20 to 40
Thanks Junxiao. Excellent Suggestions.
One important update is we cannot use timeout.c.
- The tool has no notion of time and therefore we will still need to implement the time granularity aspect.
- The tool calls itself "tickless" as it does not implement the time ticking aspect.
- All it implements are the data structures which fall along the idea of the timing wheels paper.
- Its utility is to identify the list of timers which have expired at a user input time. It is therefore the responsibility of the application using this tool to call upon its function with a time value.
- The ticking aspect will therefore anyway come into picture as our code needs to call such a function at some time granularity - which is ticking.
- One good feature of this tool is that it uses inexpensive bit rotate operations to do this O(1) lookup.
- One bad feature of this tool is that it does not allow the timing wheels to be designed over user desired granularities of time. A time value is segmented to wheels internally through compile time parameters specified as bits and therefore it is not possible to realize user desired tick granularities of microseconds for instance.
I am developing a CPP implementation and am almost halfway through it.
Will keep my progress updated here.
Updated by Junxiao Shi over 10 years ago
I gave an idea of how to integrate timeouts.c in note-4. Why doesn’t it work?
Updated by Jerald Paul Abraham over 10 years ago
It would work Junxiao. But...
Comment at line 476 of timeout.c file states:
This might return a timeout value sooner than any installed timeout if
only higher-order wheels have timeouts pending. We can only know when to
process a wheel, not precisely when a timeout is scheduled.
This is the comment for timeouts_int() which is called by timeouts_timeout()
I intend to optimize this.
Assume after bit-wise segmentation of our time value, the highest resolution wheel is microseconds (1000us clock) and the next lower resolution wheel is milliseconds (1000ms clock). If the highest wheel has no timeouts and the next available timeout is for 200ms, then we would unnecessarily tick 200 times at 1ms. An optimization would be to directly set a deadline_timer to 200ms.
Hope my understanding is correct.
Updated by Junxiao Shi over 10 years ago
This explanation is convincing. We are looking forward to your implementation with reasonable comments including a citation of the paper or other references.
Updated by Jerald Paul Abraham over 10 years ago
- Status changed from In Progress to Code review
- % Done changed from 40 to 100
timingwheel scheduler was implemented(Jerald, Arizona) and optimized(Hang Zhang, BIT).
Code review has been completed (gerrit #621).
The changes will be merged with ndn-cpp-dev master repository after NFD's first release.
Updated by Alex Afanasyev over 10 years ago
I have a thought, which would allow to resolve my problem of submitting the code right now. Instead of replacing current scheduler implementation, can this code be modified to provide an alternative version of the scheduler that applications can choose?
I propose the following. We would have two scheduler classes:
- BasicScheduler (the current one)
- WheelScheduler (this one).
There should also be a typedef BasicScheduler Scheduler;
so existing applications happy with the scheduler if they don't care to choose.
Updated by Junxiao Shi over 10 years ago
Library itself uses Scheduler.
How does an app pick which Scheduler is used by library?
- traffic-generator-client can have thousands of timers, and needs
WheelScheduler
. - ndn-tlv-peek has few timers, and needs
BasicScheduler
.
Should we make a Scheduler base class?
Or, will all library modules that rely on Scheduler become templates?
Updated by Alex Afanasyev over 10 years ago
We could do a trick similar to BOOST_TEST. The default is the simple implementation, but if a special macro is defined, then specific version is selected... Though it would work only with header-only things...
Baseclass implies that we will use virtual calls for scheduling events.. Not sure how big deal is this for scheduler, could be acceptable.
Updated by Junxiao Shi over 10 years ago
Using macro is also risky unless the library is entirely header-only.
Any .cpp in the library will have its Scheduler fixed at library compile time.
Updated by Alex Afanasyev over 10 years ago
OK. Let's do interface, but name "Scheduler" should be kept to refer to basic scheduler.
What about something like:
BaseScheduler
^ ^
/ \
/ \
/ \
BasicScheduler WheelScheduler
(aka Scheduler)
Updated by Jerald Paul Abraham over 10 years ago
Do we really need two Schedulers which serve the same purpose in the library?
The current scheduler implementation has tiny memory benefit over the new timing wheel implementation. Is this why we are including two variants of Scheduler?
We have run performance tests on both implementations and the timing wheel provides an order 10 performance better than the existing implementation (performance measured as cumulative timer fire slippage in case of 10000 timers requested).
As all the API calls are designed alike, there should not be any harm in including this as a replacement.
Updated by Junxiao Shi over 10 years ago
- Category set to Utils
- Status changed from Code review to Feedback
- Target version set to v0.2
As decided in an earlier conference call, the improved scheduler should be merged "after first release". Thus, it can be merged now.
Updated by Junxiao Shi over 10 years ago
20140612 conference call discusses merging the new scheduler.
Alex says he don't like the code because it's too complicated.
Lixia points out that "too complicated" is not a valid reason of rejecting the code.
Alex needs to find out whether the new scheduler would cause performance or other issues; any findings should be replied to this issue.
Otherwise, we should move forward.
Updated by Alex Afanasyev over 10 years ago
The code hasn't been tested with all applicarions that use schdulder.
In particular, applications remove events within the scheduled event. My brief look at the code suggests that this would cause segfault (there should be a test case in the latest code for that).
Updated by Junxiao Shi over 10 years ago
- Assignee changed from Jerald Paul Abraham to 航 张
Hang, please post your performance measurements of Wheel-based scheduler under this Task.
We need to be convinced of its good performance before we can approve the code.
A few bugs have been found in the old scheduler that we suspect also exist in Wheel-based scheduler.
You should rebase your Change and run the test cases added for those bugs.
Updated by Alex Afanasyev over 10 years ago
I have uploaded a scenario to evaluate scheduling performance (just scheduleEvent part) for different sizes of "batches": http://gerrit.named-data.net/#/c/913/
Here are the results I have on my machine (total number of events scheduled is 1,000,000; time includes overhead of creating and destroying Schedulers---this is a "little" unfair comparison)
Old scheduler
Batch of 10 events
Finished in 0.408565 seconds
Batch of 100 events
Finished in 0.354955 seconds
Batch of 1000 events
Finished in 0.345248 seconds
Batch of 10000 events
Finished in 0.3624 seconds
Batch of 100000 events
Finished in 0.506156 seconds
Batch of 1000000 events
Finished in 0.532927 seconds
==================
New scheduler
Batch of 10 events
Finished in 35.6839 seconds
Batch of 100 events
Finished in 3.78675 seconds
Batch of 1000 events
Finished in 0.615329 seconds
Batch of 10000 events
Finished in 0.303142 seconds
Batch of 100000 events
Finished in 0.301545 seconds
Batch of 1000000 events
Finished in 0.323886 seconds
I have other comments about the code, but in any case I don't see 10x improvement. Just scheduling has significant disadvantage with small batches of events (I guess scheduler initialization takes a lot of time) and less then 2x improvement for 1,000,000 scheduled events and the same for 100,000 events. Other parts of the scheduler are the same or more expensive in the optimized scheduler.
Updated by Alex Afanasyev over 10 years ago
Here are the results with the removed Scheduler create/destroy overhead:
Old scheduler
Batch of 10 events
Finished in 0.267596 seconds
Batch of 100 events
Finished in 0.244033 seconds
Batch of 1000 events
Finished in 0.252088 seconds
Batch of 10000 events
Finished in 0.262658 seconds
Batch of 100000 events
Finished in 0.280631 seconds
Batch of 1000000 events
Finished in 0.312653 seconds
==================
New scheduler
Batch of 10 events
Finished in 1.18234 seconds
Batch of 100 events
Finished in 0.263648 seconds
Batch of 1000 events
Finished in 0.242176 seconds
Batch of 10000 events
Finished in 0.222456 seconds
Batch of 100000 events
Finished in 0.222679 seconds
Batch of 1000000 events
Finished in 0.218744 seconds
Updated by Junxiao Shi over 10 years ago
ndn-cxx library is mainly used by an app.
The number of timers in an app is around 100.
The new Scheduler performs worse than the old Scheduler in this scenario.
Improvements are needed before it can replace the old Scheduler for general app usage.
Updated by Alex Afanasyev over 10 years ago
Btw. A "little bit" delayed comment, but want to correct statements made in note-3: the old scheduler has exactly one timer event scheduled. In this regard, there is no difference between old and new schedulers.
Other comments about the current implementation of the optimized scheduler (some of which could have contributed to reduced performance):
Implementation in
timeout.c
has a number of optimizations, which are not present in the current implementation. I got the problem with unnecessary firing of the timer event in some cases when granularity of the Scheduler is different from the granularity of the scheduled events. Though I'm not sure if it is a major problem.WHY all calculations are not the logical shifts??? Division is way slower than any shift operation and there is no reason to split into mus, ms, s.
There is very confusing large amount of conversions from
steady_clock::TimePoint
toScheduler::TimePoint
. Why there is a need to use "absolute time" (since clock's epoch)? I believe everything is relative to the current time point, unless I missing something.duration_cast should be more efficient and more straightforward way to request the desired granularity. Also, this should probably be private to the TimePoint.
Updated by Junxiao Shi over 10 years ago
- Assignee changed from 航 张 to Hang Zhang
Assignee requests to use a different account.
Updated by Hang Zhang over 10 years ago
Alex Afanasyev wrote:
I have uploaded a scenario to evaluate scheduling performance (just scheduleEvent part) for different sizes of "batches": http://gerrit.named-data.net/#/c/913/
...
I have other comments about the code, but in any case I don't see 10x improvement. Just scheduling has significant disadvantage with small batches of events (I guess scheduler initialization takes a lot of time) and less then 2x improvement for 1,000,000 scheduled events and the same for 100,000 events. Other parts of the scheduler are the same or more expensive in the optimized scheduler.
Each scheduler needs to create and initialize 6000(or 7000) slots, that means 24KB(6000 * 4 Byte) memory. So the number of schedulers should not be too much.
For PerTickBookkeeping:
The average cost per unit time for an average of N timers is:
N * c * m / T
c: the cost of reposition timers(heavily depends on the density of timers)
m: the total number of levels in the hierarchy.
T: average timer interval(from start to stop or expiry)
The design of the wheel-scheduler relies heavily on the number of timers. If the number of timers in an app is around 100, the number of wheels and slots in the wheel-scheduler MUST redesign.
Incidentally, if the number of timers is around 100, I think an non-hierarchy scheduler(eg. hash table with unsorted list in each bucket) will be more simple and faster.
Updated by Junxiao Shi over 10 years ago
After reading the book NETWORK ALGORITHMICS, I realize timing wheel based schedule isn't intended for application usage.
It is intended to be implemented in the kernel: a hardware clock interrupts CPU, and the kernel executes per-tick bookkeeping.
A timing wheel in userspace causes a kernel timer to run userspace per-tick bookkeeping at every time unit, which is otherwise avoidable.
I suggest this Task to be Abandoned.
Updated by Alex Afanasyev over 10 years ago
I also would like to abandon this task. However, I would like to preserve the code somehow, but not sure how (just in case it will be useful for somebody else).
Updated by Junxiao Shi over 10 years ago
- File e15fa83a5d993289fb56d40e1b9776cb5f8655c8.zip e15fa83a5d993289fb56d40e1b9776cb5f8655c8.zip added
- Status changed from Feedback to Abandoned
Unfortunately we have to abandon this Task, because a wheel-based scheduler is not suitable for ndn-cxx library.
Latest code is preserved in the attachment.
Thanks to all contributors.