Project

General

Profile

Task #3610

FIB+PIT profiling

Added by Junxiao Shi about 4 years ago. Updated over 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
Due date:
% Done:

100%

Estimated time:
12.00 h

Description

Profile the performance of FIB and PIT, and understand where the bottlenecks are.


Files

callgrind.pit-fib-performance-profiling.out (586 KB) callgrind.pit-fib-performance-profiling.out Callgrind profiling output file for pit-fib-performance test on a ONL Host8core machine Chengyu Fan, 08/29/2016 01:39 PM
callgrind.pit-fib-performance-profiling.out (66.1 KB) callgrind.pit-fib-performance-profiling.out Callgrind profiling output file for the portion between two time::steady_clock::now() calls in pit-fib-performance test (9685cc5682c9e1d642747649294f469e316733ab) on a ONL Host8core machine Chengyu Fan, 08/30/2016 01:22 PM
call-graph-for-pit-fib-operations.png (175 KB) call-graph-for-pit-fib-operations.png The profiling call graph for pit-fib operations (exclude generatePacketsAndPopulateFib()) Chengyu Fan, 08/30/2016 02:06 PM

Related issues

Blocked by NFD - Feature #3571: FIB+PIT benchmarkClosedYumin Xia03/28/2016

Actions
#1

Updated by Junxiao Shi about 4 years ago

#2

Updated by Zhuo Li about 4 years ago

  • Start date set to 04/29/2016
#3

Updated by Zhuo Li almost 4 years ago

  • Status changed from New to In Progress
#4

Updated by Junxiao Shi almost 4 years ago

  • Status changed from In Progress to New
  • Assignee deleted (Zhuo Li)
  • Start date deleted (04/29/2016)

Beichuan says Zhuo will not work on this recently.

#5

Updated by Junxiao Shi almost 4 years ago

  • Assignee set to Chengyu Fan

Chengyu agrees to work on this issue but he cannot start until August.

#6

Updated by Chengyu Fan almost 4 years ago

I have run the pit_fib performance test on an ONL Host8core machine. The Callgrind profiling output file is uploaded. Please use kcachegrind to open it for details.

The pit-fib-benchmark.cpp simulates FIB and PIT operations on one million Interest-Data exchanges. It generates packets and populates FIB before the actual tests. The tests include finding longest prefix match in FIB, finding matched Data in PIT and erasing PIT entries.

The results show that the FIB and PIT tests take 11374372 microseconds. The whole pit-fib-benchmark takes 25.655s. This means more than half of the time is used by the generatePacketsAndPopulateFib() function. FIB and PIT operations (exclude generatePacketsAndPopulateFib()) can achieve 87,950 Interest-Data exchanges per second.

More information is given below:

(1) generatePacketsAndPopulateFib() takes 49.15% of the running time.
     Its child, the ndn::Name::Name constructor uses around 26.06%. The other child nfd::fib::insert() takes 8.74%

(2) Pit as a whole takes 44.46%
     nfd::Pit::findOrInsert() takes 19.92%
     nfd::Pit::findAllDataMatches() takes 16.37%
     nfd::Pit::erase() uses the other 8.32%

(3) the most time-consuming ndn function is ndn::Name::wireEncode() 21.51%.
     It is called by both nfd::name_tree::computeHash() 14.00% and nfd::name_tree::computeHashSet() 7.52%

(4) NameTree:
    a. the most time-consuming NameTree function is nfd::NameTree::lookup() 26.01%.
       The Fib::insert() contributes 8.74% (called by generatePacketsAndPopulateFib()). The rest is from Pit::findOrInsert() 19.92%.

    b. another time-consuming function is nfd::NameTree::findAllMatches() 12.90%. They are from nfd::Pit::findAllDataMatches()


Since the ndn::Name::wireEncode() alone uses 21.51% of time, I have tried to use TCMalloc library to optimize the memory management. The results confirmed my assumption. The FIB and PIT operations go down to 8499654 microseconds. This means FIB and PIT operations can achieve 117,647 Interest-Data exchanges per second.

#7

Updated by Junxiao Shi almost 4 years ago

  • Status changed from New to Feedback

I should have said this earlier, but generatePacketsAndPopulateFib should not be part of the profiling: it's the initial setup.

Can you report only the portion between two time::steady_clock::now() calls?

https://github.com/named-data/NFD/blob/dbef6dc8e2e32160b9e826ca0f32b9a00bb759df/tests/other/pit-fib-benchmark.cpp#L118-L133

#8

Updated by Davide Pesavento almost 4 years ago

Please provide the exact versions of NFD and ndn-cxx (git commit SHA) used to perform the test in note-6.

#9

Updated by Chengyu Fan almost 4 years ago

The number I reported is the time between two time::steady_clock::now() calls.

Since I have trouble to install openssl library for ndn-cxx on ONL, I used the older version. The ndn-cxx is 0.4.1 (commit SHA = 2e52d7cab4e03631db7f6c89a2daf7de590dd29e). The NFD is 4f1afaca16ba02b40651f62e0f28e56d0b7cbf9f, which is the commit for face system benchmark, but I modified the pit-fib-benchmark.cpp code as https://gerrit.named-data.net/#/c/3021/.

Please let me know if I chose the wrong version.

Without TCMalloc, the time between two time::steady_clock::now() calls is 11374372 microseconds. With TCMalloc, the time is 8499654 microseconds.

#10

Updated by Junxiao Shi almost 4 years ago

It's necessary to rerun the profiling with latest version, because I have changed NameTree significantly.

The profiling should only report function calls between two nows; it should exclude anything under initial setup procedure.

#11

Updated by Chengyu Fan almost 4 years ago

The profiling results for the latest NFD (9685cc5682c9e1d642747649294f469e316733ab) is uploaded. The ndn-cxx version used for this test is 667370f102f7953968a6d0d60a782cf0ef33d9d5. The profiling is running for the portion between two time::steady_clock::now() calls. Please refer to the uploaded file call-graph-for-pit-fib-operations.png for the call graph.

(1) nfd::Pit::findOrInsert() takes 42.33% of the running time

(2) nfd::Pit::findAllDataMatches() takes 39.51%

(3) nfd::Pit::erase() uses the other 16.23%

(4) ndn::Name::wireEncode() uses 40.27%, and it is called by nfd::name_tree::computesHash()

To measure the time between two time::steady_clock::now() calls, I run 10 experiments for both without TCMalloc and with TCMalloc.
Without TCMalloc, the time is in range (7833889, 10918891) microseconds, and the median is 8606859.5 microseconds. With TCMalloc, the time is in range (6012537, 8827668) microseconds, and the median is 6935110.5 microseconds.

#12

Updated by Davide Pesavento almost 4 years ago

Chengyu Fan wrote:

The profiling is running for the portion between two time::steady_clock::now() calls.

How did you do that? Did you put CALLGRIND_{START,STOP}_INSTRUMENTATION macros in the code?

#13

Updated by Chengyu Fan almost 4 years ago

Davide Pesavento wrote:

Chengyu Fan wrote:

The profiling is running for the portion between two time::steady_clock::now() calls.

How did you do that? Did you put CALLGRIND_{START,STOP}_INSTRUMENTATION macros in the code?

Yes, I added these macros in the pit-fib-benchmark.cpp code. Please let me know if there are any issues

#14

Updated by Davide Pesavento almost 4 years ago

That's good, it's what I had in mind. We should probably add those calls to the committed benchmark code for posterity's sake. I'll prepare a patch.

#16

Updated by Davide Pesavento almost 4 years ago

Junxiao asks on gerrit:

Can you post on Redmine the command line to generate a callgrind report? I need to test this.

$ valgrind --tool=callgrind --instr-atstart=no ./build/whatever

#17

Updated by Davide Pesavento over 2 years ago

  • Category deleted (Integration Tests)
  • Status changed from Feedback to Closed

Also available in: Atom PDF