Bug #1432

Performance collapse under 400 Interests/s after 1min

Added by Junxiao Shi almost 10 years ago. Updated almost 10 years ago.

Status: Closed
Priority: Normal
Category: Tables
Target version:
Start date: 04/01/2014
Due date:
% Done: 100%
Estimated time:

Description

Reported by John DeHart.

Under a constant load of Interest traffic (where each Interest is for a unique name), NFD's performance degrades to the point where it can no longer keep up with the incoming Interest traffic.
Interests then start building up in the socket buffer and eventually get dropped at the socket layer.
This occurs even with a total load as low as 100 Interests/sec.

Using 9 ONL hosts:

  • 1 of the hosts is a central router.
  • 4 of the hosts are servers running ndn-traffic-server to serve content.
  • 4 of the hosts are clients running ndn-traffic to generate Interests.
  • All the clients use the central router to get to the servers.

Each client sends a constant load of 100 Interests/sec, for a total load of 400 Interests/sec on the router node. The router keeps up for about 1 minute; then its performance starts tailing off.

With a total of 200 Interests/sec, the router lasts for about 3 minutes before tailing off.

With a total of 100 Interests/sec, the router lasts for about 9 minutes before tailing off.

The above is using UDP faces.


Files

Archive.zip (14.3 KB) - Ilya Moiseenko, 04/05/2014 03:13 PM
Actions #1

Updated by Junxiao Shi almost 10 years ago

  • Category set to Tables
  • Assignee set to Ilya Moiseenko

From John DeHart:

Our theory is that as the content store grows, something is taking longer and longer to process Interests or content.

If we set the ContentStore nMaxPackets=10000 in cs.hpp, we do not see any performance degradation even after several minutes.

This means the CS match algorithm needs improvement, because it is the only module affected by the number of stored packets.

Actions #2

Updated by Ilya Moiseenko almost 10 years ago

I'll take a look. It could be cache cleanup.

Actions #3

Updated by Ilya Moiseenko almost 10 years ago

I've submitted a commit, http://gerrit.named-data.net/612, which disables cache cleanup based on content staleness. The staleness cleanup uses a priority queue, and I think this is the source of the problems. Without it, cache replacement will be pure LRU.

I need performance feedback on this change. If performance improves significantly, I'll be looking for ways to consider packet staleness without strict ordering.
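
A minimal sketch of what pure-LRU replacement looks like, assuming a hypothetical Entry type and cache class (this is not NFD's actual code): entries live in a list ordered by last use, and eviction always removes the least recently used entry.

    // Minimal sketch of pure-LRU replacement (hypothetical types, not NFD code):
    // the list is ordered by last use; eviction removes the least recently used entry.
    #include <cstddef>
    #include <list>
    #include <string>
    #include <unordered_map>

    struct Entry { std::string name; /* Data packet, metadata, ... */ };

    class LruCache
    {
    public:
      explicit LruCache(size_t maxEntries) : m_maxEntries(maxEntries) {}

      void insert(const Entry& e)
      {
        if (m_maxEntries == 0)
          return;                                         // caching disabled
        auto it = m_index.find(e.name);
        if (it != m_index.end()) {
          m_lru.splice(m_lru.begin(), m_lru, it->second); // refresh on re-insert
          *it->second = e;
          return;
        }
        if (m_lru.size() >= m_maxEntries) {
          m_index.erase(m_lru.back().name);               // evict least recently used
          m_lru.pop_back();
        }
        m_lru.push_front(e);
        m_index[e.name] = m_lru.begin();
      }

      const Entry* find(const std::string& name)
      {
        auto it = m_index.find(name);
        if (it == m_index.end())
          return nullptr;
        m_lru.splice(m_lru.begin(), m_lru, it->second);   // move to front on hit
        return &*it->second;
      }

    private:
      size_t m_maxEntries;
      std::list<Entry> m_lru;  // front = most recently used
      std::unordered_map<std::string, std::list<Entry>::iterator> m_index;
    };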

Actions #4

Updated by John DeHart almost 10 years ago

I've rebuilt with commit 612 and re-run my test and there is no change.
NFD performance still degrades in the same way.

Actions #5

Updated by Ilya Moiseenko almost 10 years ago

Ok

Actions #6

Updated by Junxiao Shi almost 10 years ago

It's reported that there may be a platform-dependent issue:

Just a little bit more to observations. When I tried to run Ilya's code on my laptop, it performed ok.
When I tried to run the same code on one of the lab's Ubuntu machines, I got the same problem as John.
There is also a very strange memory "problem". Somehow the test case, even on my machine, consumes a lot of memory (~2 GB) while it shouldn't: the CS cap is around 50k items, and each item is an empty Data packet (~100 bytes).

We found one of our students who had a new Mac and we ran the test on his laptop. We got results similar to yours.
When we run it on slightly older Macs we get crappy results.

I6c5a312456ffcd8cab4529e3fc107d48992801bc disables ContentStore completely.
Please test this patch on different platforms and compilers, in order to determine whether a platform issue outside of ContentStore is causing the bug.

When you report back, be sure to mention: operating system version, compiler version, Boost library version.
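
As a hypothetical convenience (not part of NFD or the patch), the compiler and Boost versions can be printed from a tiny program like the one below; the operating system version still needs to be reported separately.

    // Hypothetical helper, not part of NFD: prints compiler and Boost versions
    // for inclusion in the report. The OS version must be noted separately.
    #include <iostream>
    #include <boost/version.hpp>

    int main()
    {
      std::cout << "Compiler: " << __VERSION__ << "\n"             // GCC/Clang version string
                << "Boost:    " << BOOST_LIB_VERSION << std::endl; // e.g. "1_48"
      return 0;
    }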

Actions #7

Updated by John DeHart almost 10 years ago

We have done some experiments with smaller-sized ContentStores, and we see improved performance as the CS gets smaller.
With a size of 0 we are probably simulating the removal of the CS without changing a lot of code.

To change nMaxPackets, we just change this line in cs.hpp:

    Cs(int nMaxPackets = 65536); // ~500MB with average packet size = 8KB
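
The nMaxPackets=0 experiment described in the next comment amounts to changing that default argument, roughly like this (illustrative only):

    // Illustrative change for the "size 0" experiment in the next comment:
    // a zero default capacity effectively disables caching for this test.
    Cs(int nMaxPackets = 0);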

Actions #8

Updated by John DeHart almost 10 years ago

When we run on ONL with nMaxPackets=0, it works beautifully at 400 Interests/sec. It has been running for about an hour with no performance degradation.

NFD does, however, continue to grow in size.

Actions #9

Updated by Davide Pesavento almost 10 years ago

John DeHart wrote:

NFD does, however, continue to grow in size.

That could be an unrelated memory leak somewhere else in the code. I suggest you file a separate bug report.

Actions #10

Updated by John DeHart almost 10 years ago

I6c5a312456ffcd8cab4529e3fc107d48992801bc disables ContentStore completely.
Please test this patch on different platforms and compilers, in order to determine whether a platform issue outside of ContentStore is causing the bug.

When you report back, be sure to mention: operating system version, compiler version, Boost library version.

When I run this patch on ONL (Ubuntu 12.04.4, g++ 4.6.3, Boost 1.48), 5 minutes so far, I see no performance degradation and NFD does NOT grow in size.

Actions #11

Updated by Yi Huang almost 10 years ago

Settings:

  • 3 VM nodes connected like this: A-B-C. All 3 nodes are the same.
  • Platform: Ubuntu 12.04
  • g++ 4.6.3
  • Boost 1.48

I ran the test at different frequencies, with Junxiao's update that completely bypasses the CS. RTT and Interest-loss rate increase significantly when the frequency changes from 250 Interests/sec (4 ms interval) to ~333 Interests/sec (3 ms interval).

Here's the table I made with collected data:

interval (ms)   total Interests   Interest loss (%)   average RTT (ms)
80              10000             0                   73.7064
40              10000             0.04                194.4467
20              10000             0.06                307.9077
10              10000             0.30                330.7906
5               10000             0.79                415.1157
4               10000             2.07                621.9645
3               10000             9.62                1620.2501
2               10000             55.36               2050.1325
1               10000             75.48               2193.4072

Actions #12

Updated by Ilya Moiseenko almost 10 years ago

Heap allocation is expensive. We do make_shared for every Interest and Data packet. In the ContentStore I also call new on a per-packet basis. But I can pre-allocate memory for my internal processing, so there will be no "new" anywhere in the ContentStore. I'm not sure what to do with incoming packets.
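
A minimal sketch of the pre-allocation idea, assuming a hypothetical CsEntry type and pool class (not the actual patch): all entries are allocated up front and recycled through a free list, so steady-state insertion and eviction never call new.

    // Sketch of the pre-allocation idea (hypothetical names, not the actual patch):
    // allocate all CS entries up front and recycle them through a free list,
    // so steady-state insertion and eviction never touch the heap allocator.
    #include <cstddef>
    #include <vector>

    struct CsEntry { /* Data packet pointer, timestamps, list hooks, ... */ };

    class EntryPool
    {
    public:
      explicit EntryPool(size_t capacity)
        : m_storage(capacity)            // single upfront allocation
      {
        m_free.reserve(capacity);
        for (size_t i = 0; i < capacity; ++i)
          m_free.push_back(&m_storage[i]);
      }

      CsEntry* acquire()                 // O(1), no heap allocation
      {
        if (m_free.empty())
          return nullptr;                // pool exhausted: caller must evict first
        CsEntry* e = m_free.back();
        m_free.pop_back();
        return e;
      }

      void release(CsEntry* e)           // return an evicted entry to the pool
      {
        *e = CsEntry();                  // reset contents
        m_free.push_back(e);
      }

    private:
      std::vector<CsEntry> m_storage;    // fixed backing storage, never reallocated
      std::vector<CsEntry*> m_free;      // free list of unused entries
    };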

Actions #13

Updated by Beichuan Zhang almost 10 years ago

Yi's numbers are premature. They were obtained from 3 virtual machines on the same physical machine. We don't know how to interpret the numbers yet. He's looking into it.

Actions #14

Updated by Junxiao Shi almost 10 years ago

  • Status changed from New to In Progress

Heap allocation is expensive. Using a memory pool based allocator is possible, but that would be a major change that needs a lot of work.

I suspect ContentStore is holding more Data packets than nMaxPackets:

  • CS has three queues: unsolicited, staleness, arrival.
  • Cs::evictItem looks at those queues one by one, and evicts the first entry found.
  • When an entry is evicted, it's deleted from the skip list and popped from the queue where it was found, but not popped from the other queues.
  • For example, if an entry is found in the staleness queue, it won't be popped from the arrival queue, so a shared_ptr to it is still kept in the arrival queue.
  • A Data packet will not be deallocated while a shared_ptr to its CS entry exists in the arrival queue.
  • As long as there is some stale packet, nothing is popped from the arrival queue, so it keeps growing.

This memory problem is related to this bug, because the unbounded growth of the arrival queue causes high memory utilization and makes further memory allocation slower.
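
The suspected retention pattern, sketched with hypothetical types (not the actual CS code): the same shared_ptr is held by several queues, so popping it from only one of them never drops the last reference.

    // Sketch of the suspected retention pattern (hypothetical types, not NFD code):
    // the same shared_ptr<Entry> is held by several queues, so erasing the entry
    // from the skip list and popping it from ONE queue does not free it.
    #include <memory>
    #include <queue>

    struct Entry { /* Data packet, staleness time, arrival time, ... */ };

    std::queue<std::shared_ptr<Entry>> stalenessQueue;
    std::queue<std::shared_ptr<Entry>> arrivalQueue;

    void insert(const std::shared_ptr<Entry>& entry)
    {
      stalenessQueue.push(entry);   // each copy bumps the use count
      arrivalQueue.push(entry);
    }

    void evictOneStale()
    {
      if (stalenessQueue.empty())
        return;
      std::shared_ptr<Entry> victim = stalenessQueue.front();
      stalenessQueue.pop();         // removed from ONE queue only...
      // ...the entry is also erased from the skip list here (not shown), but the
      // copy still sitting in arrivalQueue keeps the use count above zero, so the
      // Entry and its Data packet are never deallocated, and arrivalQueue grows
      // without bound as long as eviction keeps hitting the staleness queue.
    }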

Actions #15

Updated by Ilya Moiseenko almost 10 years ago

The queues have been replaced with a Boost.MultiIndex container. The memory situation is better, but performance is almost the same.
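
A hedged sketch of what the multi_index approach can look like (the entry fields and index choices here are assumptions, not necessarily the attached code): a single container holds each entry once, with one view per eviction policy, so one erase removes the entry from every view at the same time.

    // Sketch of a multi_index-based CS index (assumed fields, not the attached code):
    // one container holds each entry exactly once, with one view per eviction policy,
    // so a single erase removes the entry from all views simultaneously.
    #include <cstdint>
    #include <string>
    #include <boost/multi_index_container.hpp>
    #include <boost/multi_index/member.hpp>
    #include <boost/multi_index/ordered_index.hpp>
    #include <boost/multi_index/sequenced_index.hpp>

    namespace mi = boost::multi_index;

    struct CsEntry
    {
      std::string name;
      std::uint64_t staleAtMillis;   // absolute staleness deadline, in milliseconds
      bool isUnsolicited;
    };

    typedef mi::multi_index_container<
      CsEntry,
      mi::indexed_by<
        mi::sequenced<>,                        // view 0: arrival/LRU order
        mi::ordered_non_unique<                 // view 1: ordered by staleness deadline
          mi::member<CsEntry, std::uint64_t, &CsEntry::staleAtMillis> >,
        mi::ordered_unique<                     // view 2: exact lookup by name
          mi::member<CsEntry, std::string, &CsEntry::name> >
      >
    > CsContainer;

    void evictOldest(CsContainer& cs)
    {
      if (!cs.empty())
        cs.get<0>().pop_front();                // erased from all three views at once
    }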

Actions #16

Updated by Anonymous almost 10 years ago

Is an updated patch set with multi_index available?

Actions #17

Updated by Ilya Moiseenko almost 10 years ago

Yes, archive is attached. Unit test is called "TableCs/StressTest".

Actions #18

Updated by Junxiao Shi almost 10 years ago

I'm uploading I01a6fc08171619f94aa3fe54daa560ecf301adae patchset 3.

This CS stub has full functionality. Cleanup/eviction is also implemented.

Please try this patchset on ONL.

Stress test

Code: I01a6fc08171619f94aa3fe54daa560ecf301adae patchset 3, tests/daemon/cs.cpp, "Stress" test case

Tested units:

  • stubCS: I01a6fc08171619f94aa3fe54daa560ecf301adae patchset 3
  • masterCS: ContentStore in master branch commit b07788af059a6b4128049fe97ddf3fcb6126a5b0

Machines:

  • Ubuntu: Ubuntu 12.04 64-bit, VirtualBox, Xeon E5-2680 2.70GHz, 4GB memory, gcc 4.6.3, Boost 1.48
  • OSX: OSX 10.9 64-bit, Mac Mini, Core i5-3210M 2.50GHz, 4GB memory, clang 500.2.79, Boost 1.55

nIterations   stubCS on Ubuntu   masterCS on Ubuntu   stubCS on OSX   masterCS on OSX
1,000         0.200s             0.320s               0.026s          0.042s
10,000        2.119s             13.062s              0.273s          2.028s
25,000        5.243s             74.396s              0.700s          20.264s
100,000       23.783s            1567.259s            3.175s          1062.815s
1,000,000     252.518s           NA                   33.552s         NA

Observations

stubCS execution time is close to linear. masterCS execution time grows faster than linearly.

Execution time on OSX is faster than on Ubuntu, possibly due to clang optimization.

Actions #19

Updated by Ilya Moiseenko almost 10 years ago

Haowei found a computational bug in the code. John found a memory leak. I optimized heap allocations and memory consumption.

Patch is here http://gerrit.named-data.net/#/c/626/

Performance with the unit test on Linux (Ubuntu 12.04 VM, i7 2.4 GHz):
1M insertions/lookups = 18 seconds
2M insertions/lookups = 37 seconds
4M insertions/lookups = 75 seconds

Performance with the unit test on OS X (10.9, i7 2.4 GHz):
1M insertions/lookups = 14 seconds
2M insertions/lookups = 29 seconds
4M insertions/lookups = 60 seconds

Actions #20

Updated by Ilya Moiseenko almost 10 years ago

With cache size = 200K (3.5x increase):

1M insertions/lookups = 15 seconds
2M insertions/lookups = 30 seconds
4M insertions/lookups = 62 seconds

Actions #21

Updated by Junxiao Shi almost 10 years ago

  • stubCS: I01a6fc08171619f94aa3fe54daa560ecf301adae patchset 3
  • I626-1: I7f6752ec279e64e81c90c0b3e8d756da34194965 patchset 1

nIterations   stubCS on Ubuntu   I626-1 on Ubuntu   stubCS on OSX   I626-1 on OSX
1,000         0.200s             0.114s             0.026s          0.048s
10,000        2.119s             1.275s             0.273s          0.401s
25,000        5.243s             3.288s             0.700s          1.280s
100,000       23.783s            14.186s            3.175s          5.049s
1,000,000     252.518s           149.774s           33.552s         52.406s

With nMaxPackets=50000, Ilya's new code beats the stub on Ubuntu, but not on OSX.

Actions #22

Updated by John DeHart almost 10 years ago

The latest patch from Ilya works well with the 400 interests/sec test on ONL.
I see no performance degradation.

I am now working on beefing up my ONL tests to do higher rates.

I'll also re-check for memory leaks with some longer tests in ONL.

Actions #23

Updated by John DeHart almost 10 years ago

I have made one round of improvements to my ONL tests and am now running NFD (with Ilya's latest patch) at 8000 Interests/sec without any degradation and without any sign of memory leaks.

I will continue to beef up the tests and see how high I can go.

Actions #24

Updated by Junxiao Shi almost 10 years ago

Ilya should update the code to:

  • address review comments
  • disable stress test by default (so that Jenkins won't be slow)

and then mark this bug as Resolved.

Actions #25

Updated by Alex Afanasyev almost 10 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100