Task #3567
Face system benchmark
40% done
Description
Benchmark packet processing performance of face system.
- Implement a stripped-down forwarding program that contains just the face system. This program should contain 2n faces, L0, L1, .., Ln-1 and R0, R1, .., Rn-1. Anything received on face Li is sent to face Ri, and anything received on face Ri is sent to face Li. (A minimal sketch of this relay rule appears below this list.)
- In an emulation environment, allocate one node U with the above program, and 2n other nodes with regular NFD that are connected to each face of U. Run consumer programs on nodes connected to Li faces, and producer programs on nodes connected to Ri faces.
- Benchmark the throughput of U until it hits CPU bottleneck, and profile the program using #3564-1 method to find the bottleneck.
- Repeat the test with different Transport types and LinkService choices/settings.
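For illustration, a minimal, self-contained sketch of the relay rule above (hypothetical FaceId type and dispatch, not NFD's actual face API): the program only needs a lookup table that maps each face to its peer, and sends every received packet out of that peer with no PIT/FIB/CS lookup.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

using FaceId = uint64_t;

int main()
{
  const FaceId n = 3;

  // L0..Ln-1 get ids 0..n-1, R0..Rn-1 get ids n..2n-1; each face maps to its peer
  std::unordered_map<FaceId, FaceId> peerOf;
  for (FaceId i = 0; i < n; ++i) {
    peerOf[i] = n + i;  // Li -> Ri
    peerOf[n + i] = i;  // Ri -> Li
  }

  // the only "forwarding" logic: relay every received packet out of the peer face,
  // so only the face system is exercised
  auto onReceive = [&peerOf](FaceId inFace, const std::string& pkt) {
    std::cout << "face " << inFace << " -> face " << peerOf[inFace] << ": " << pkt << '\n';
  };

  onReceive(0, "Interest");  // arrives on L0, relayed out R0
  onReceive(n, "Data");      // arrives on R0, relayed out L0
}
```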
Updated by Davide Pesavento over 8 years ago
What kind of "emulation environment"? Can't we just record and replay a packet trace?
Updated by Junxiao Shi over 8 years ago
What kind of "emulation environment"?
Open Network Laboratory is an emulation environment that can support this benchmark with UDP and TCP transports.
We can start with these.
If a benchmark with other transport types is needed, we'd find some other ways.
Can't we just record and replay a packet trace?
That would need reading a large trace file, which is less efficient than using the traffic generator to create random traffic on the fly.
Updated by Junxiao Shi over 8 years ago
- Assignee set to Zhuo Li
- Estimated time set to 18.00 h
Beichuan decides to assign the programming portion to Zhuo.
We haven't decided how to assign the other portion of this task.
Updated by Junxiao Shi over 8 years ago
- Assignee changed from Zhuo Li to Weiwei Liu
@Weiwei will work on the programming portion of this task.
Exact spec of the program should be developed in consultation with @Chengyu Fan who is expected to do the profiling work.
Updated by Chengyu Fan over 8 years ago
Does this face benchmark exclude the NDNLP? If not, it will affect the results.
Updated by Junxiao Shi over 8 years ago
Answer to note-5:
In the first experiment, NDNLPv2 should not be excluded.
If profiling proves GenericLinkService is a bottleneck, further analysis can be performed.
Updated by Weiwei Liu over 8 years ago
@Junxiao,
According to your suggestion, in the program I just need to create the faces according to the FaceUri pairs and connect each pair of faces to each other, and the faces will automatically listen for traffic? Is there anything else I need to do?
I tested the program with 5 virtual machines, but it doesn't seem to be working. In my test, all the nodes (A, B, C, D, R) are in the same private network. R is the router, and in the program, A is connected to B, C is connected to D. Here is what I did:
1. run benchmark program on node R(192.168.56.6)
2. run `nfdc register ndn:/example udp4://192.168.56.6` on node A(192.168.56.2)
3. run `echo 'hello world' | ndnpoke ndn:/example/a` on node B(192.168.56.3)
4. run `ndnpeek -p ndn:/example/a` on node A
And I observed the following log on node A:
1459559366.746666 WARNING: [UnicastUdpTransport] [id=281,local=udp4://192.168.56.2:6363,remote=udp4://192.168.56.6:6363] Send or receive operation failed: Connection refused
Do you know what I might be doing wrong?
Updated by Davide Pesavento over 8 years ago
Hard to say without seeing the code... How are you creating the transports? Make sure that the socket passed to the transport constructor is already connected. Upload the patch to gerrit if you need more detailed advice.
Updated by Weiwei Liu over 8 years ago
Reply to note-9.
I'm using protocolFactory to create faces (am I supposed to use it?). I don't think this should have the socket problem, since the transport is created automatically. I also tried to create the face with the Face constructor, but the same thing happens.
And here is the code.
http://gerrit.named-data.net/#/c/2803/
Updated by Davide Pesavento over 8 years ago
The root cause is that you're not calling .listen() on the channel(s).
There's a bigger problem with patchset 1 design though. You're creating both Li and Ri faces from R as outgoing "connections". However, at least in the UDP case, node A doesn't know anything about this "connection" because the underlying UDP protocol is connection-less, therefore no face is created on A's side until traffic starts flowing from R to A (which never happens in your case). So when A starts communicating with R, it tries to create a new face towards port 6363 of R (regular NFD behavior), which fails because nothing is listening on R's side.
My suggestion is to create and link the faces together only in response to incoming "connections". Actually the Li faces will be created automatically by the listening channel, you simply need to create the corresponding Ri with Channel::connect() and finally link the two faces together.
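A rough, self-contained illustration of this reactive setup (stand-in Channel and Face types with made-up signatures, not NFD's actual channel API): the Li face is created by the listening channel when traffic first arrives, and its creation callback immediately opens the corresponding Ri and links the pair.

```cpp
#include <functional>
#include <iostream>
#include <memory>
#include <string>

struct Face {
  std::string remoteUri;
  std::function<void(const std::string&)> afterReceive;  // set by whoever links the face
  void send(const std::string& pkt) { std::cout << remoteUri << " <- " << pkt << '\n'; }
};

struct Channel {
  std::function<void(std::shared_ptr<Face>)> onFaceCreated;

  void listen(std::function<void(std::shared_ptr<Face>)> cb) { onFaceCreated = std::move(cb); }

  std::shared_ptr<Face> connect(const std::string& remoteUri) {  // outgoing face, created eagerly
    return std::make_shared<Face>(Face{remoteUri, nullptr});
  }

  // simulates the first packet arriving from a previously unknown remote endpoint
  std::shared_ptr<Face> acceptIncoming(const std::string& remoteUri) {
    auto face = std::make_shared<Face>(Face{remoteUri, nullptr});
    if (onFaceCreated) {
      onFaceCreated(face);
    }
    return face;
  }
};

int main()
{
  Channel channel;
  const std::string rightUri = "udp4://192.0.2.21:6363";  // producer side of this pair

  channel.listen([&channel, rightUri] (std::shared_ptr<Face> left) {
    auto right = channel.connect(rightUri);  // create Ri only when Li appears
    left->afterReceive  = [right] (const std::string& p) { right->send(p); };
    right->afterReceive = [left]  (const std::string& p) { left->send(p); };
  });

  auto li = channel.acceptIncoming("udp4://192.0.2.11:6363");  // consumer traffic creates Li
  li->afterReceive("Interest");                                // relayed out the Ri face
}
```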
Updated by Junxiao Shi over 8 years ago
Reply to note-11:
The "bigger problem" is not an issue with UDP.
From socket point of view, a listening socket on UDP/6363 is not needed to accept packets from a consumer.
Suppose the benchmark program is at 192.0.2.5, a consumer is at 192.0.2.11, and a producer is at 192.0.2.21.
1. When the benchmark program starts, it creates two UDP sockets. Their five-tuples are:
UDP 192.0.2.5:6363 192.0.2.11:6363
UDP 192.0.2.5:6363 192.0.2.21:6363
At this point, neither consumer nor producer knows about these sockets.
2. `nfdc register / udp4://192.0.2.5:6363` is executed at the consumer. The consumer NFD creates a UDP socket. Its five-tuple is:
UDP 192.0.2.11:6363 192.0.2.5:6363
3. The producer NFD listens on UDP/6363. Its five-tuple is:
UDP 0.0.0.0:6363 *:6363
4. An Interest is expressed at the consumer and forwarded to the benchmark program. The UDP packet has source=192.0.2.11:6363 and destination=192.0.2.5:6363. This matches the five-tuple of the benchmark program's first socket, and thus is delivered to the correct face rather than attempting to create a new face.
5. The benchmark program forwards the Interest toward the producer. The UDP packet has source=192.0.2.5:6363 and destination=192.0.2.21:6363. Since the producer NFD does not have a socket with five-tuple UDP 192.0.2.21:6363 192.0.2.5:6363, the packet is delivered to the more generic listening socket.
6. The producer NFD creates a new face with a UDP socket, whose five-tuple is UDP 192.0.2.21:6363 192.0.2.5:6363. This socket will receive all subsequent packets from the benchmark program.
This will not work for TCP because the local port number of TcpTransport is not fixed to 6363.
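For reference, a small sketch of the UDP socket setup described in the steps above, using plain POSIX calls (Linux behavior assumed: SO_REUSEADDR lets several UDP sockets bind the same local port, and a datagram goes to the most specific matching socket; error handling omitted):

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Create a UDP socket bound to localIp:6363 and connect()ed to remoteIp:6363,
// fixing its 5-tuple so it only receives datagrams from that specific remote.
static int makeConnectedUdpSocket(const char* localIp, const char* remoteIp)
{
  int fd = socket(AF_INET, SOCK_DGRAM, 0);
  int yes = 1;
  setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));  // allow multiple binds to :6363

  sockaddr_in local{};
  local.sin_family = AF_INET;
  local.sin_port = htons(6363);
  inet_pton(AF_INET, localIp, &local.sin_addr);
  bind(fd, reinterpret_cast<sockaddr*>(&local), sizeof(local));

  sockaddr_in remote{};
  remote.sin_family = AF_INET;
  remote.sin_port = htons(6363);
  inet_pton(AF_INET, remoteIp, &remote.sin_addr);
  connect(fd, reinterpret_cast<sockaddr*>(&remote), sizeof(remote));
  return fd;
}

int main()
{
  // the two sockets created by the benchmark program in the walkthrough above
  int toConsumer = makeConnectedUdpSocket("192.0.2.5", "192.0.2.11");
  int toProducer = makeConnectedUdpSocket("192.0.2.5", "192.0.2.21");
  // ... recv()/send() on each descriptor as packets arrive ...
  close(toConsumer);
  close(toProducer);
}
```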
Updated by Davide Pesavento over 8 years ago
Junxiao Shi wrote:
From socket point of view, a listening socket on UDP/6363 is not needed to accept packets from a consumer.
"listening socket" has no meaning in a UDP context so I don't understand your point here (unless you're stating the obvious, i.e. that you mustn't call listen(2)
on a UDP socket). It also seems to contradict point 1 below (there is an open socket on UDP port 6363).
When the benchmark program starts, it creates two UDP sockets. Their five-tuples are:
UDP 192.0.2.5:6363 192.0.2.11:6363
UDP 192.0.2.5:6363 192.0.2.21:6363
Right. I didn't see the bind() in the code: https://github.com/named-data/NFD/blob/master/daemon/face/udp-channel.cpp#L164, and I assumed the local port was chosen at random like in the TCP case.
This will not work for TCP because the local port number of TcpTransport is not fixed to 6363.
Exactly. So I think that generally speaking my point and suggestion in note-11 still apply, although the specific example is wrong. We should use the same design whether the faces are TCP or UDP.
Updated by Davide Pesavento over 8 years ago
Weiwei Liu wrote:
And I observed the following log on node A:
1459559366.746666 WARNING: [UnicastUdpTransport] [id=281,local=udp4://192.168.56.2:6363,remote=udp4://192.168.56.6:6363] Send or receive operation failed: Connection refused
Are you sure that VM-to-VM traffic is allowed by the hypervisor? Sometimes it's blocked for security reasons.
Updated by Weiwei Liu over 8 years ago
reply to note 14.
I used vagrant to set up the VMs and made them to be in the same private network, and I can ping from one VM to another. Is it enough?
Updated by Weiwei Liu over 8 years ago
reply to note-11.
I tried to modify the code according to your suggestion. But I'm not sure if I understand it correctly, because I'm still getting the "Connection refused" error.
I've uploaded a new patch set. Please tell me if I'm doing something wrong.
Updated by Davide Pesavento over 8 years ago
Weiwei Liu wrote:
I used vagrant to set up the VMs and made them to be in the same private network, and I can ping from one VM to another. Is it enough?
Make sure iperf -u works between the two machines.
Updated by Junxiao Shi over 8 years ago
Reply to note-13:
"listening socket on UDP/6363" refers to a socket with bind(0.0.0.0:6363)
but without connect(). This socket would "listen" on the port and receive any packet that arrives at UDP/6363, unless that packet matches a more specific connected socket.
Local port for UnicastUdpTransport is not chosen randomly. It's always 6363, as seen on the testbed.
To test TCP, the benchmark program can still make outgoing connections.
At the consumer NFD, instead of executing nfdc register / tcp4://192.0.2.5:6363, the script should determine the FaceId of the on-demand face from nfd-status -f output, and register the prefix toward this FaceId.
Dynamically making new faces in the benchmark increases the complexity of the benchmark program, and worsens the result.
Updated by Davide Pesavento over 8 years ago
Junxiao Shi wrote:
Dynamically making new faces in the benchmark increases the complexity of the benchmark program, and worsens the result.
We're not "dynamically making new faces". There will still be a "setup time" at the beginning of the benchmark that should be excluded from measurements, but even if it's not, its impact should be negligible if the benchmark run is long enough. What's important is that this initial setup happens only once.
Updated by Weiwei Liu over 8 years ago
It seems the UDP test doesn't work between VMs.
I run iperf -s -u -p 12345 on 192.168.56.6, and iperf -c 192.198.56.6 -u -p 12345 on 192.168.56.2, and I'm getting the following.
vagrant@A:~/iperf-2.0.5$ iperf -c 192.198.56.6 -u
Client connecting to 192.198.56.6, UDP port 12345
Sending 1470 byte datagrams
UDP buffer size: 208 KByte (default)
[ 3] local 10.0.2.15 port 48330 connected with 192.198.56.6 port 12345
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.25 MBytes 1.05 Mbits/sec
[ 3] Sent 893 datagrams
[ 3] WARNING: did not receive ack of last datagram after 10 tries.
vagrant@R:~/iperf-2.0.5$ iperf -s -u -p 12345
Server listening on UDP port 12345
Receiving 1470 byte datagrams
UDP buffer size: 208 KByte (default)
Why does it show local 10.0.2.15 rather than 192.168.56.2? Does anyone know how I should configure them to make UDP work?
Davide Pesavento wrote:
Weiwei Liu wrote:
I used vagrant to set up the VMs and made them to be in the same private network, and I can ping from one VM to another. Is it enough?
Make sure iperf -u works between the two machines.
Updated by Eric Newberry over 8 years ago
Weiwei Liu wrote:
It seems the UDP test doesn't work between VMs.
I run iperf -s -u -p 12345 on 192.168.56.6, and iperf -c 192.198.56.6 -u -p 12345 on 192.168.56.2, and I'm getting the following.
vagrant@A:~/iperf-2.0.5$ iperf -c 192.198.56.6 -u
Client connecting to 192.198.56.6, UDP port 12345
Sending 1470 byte datagrams
UDP buffer size: 208 KByte (default)
[ 3] local 10.0.2.15 port 48330 connected with 192.198.56.6 port 12345
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.25 MBytes 1.05 Mbits/sec
[ 3] Sent 893 datagrams
[ 3] WARNING: did not receive ack of last datagram after 10 tries.
vagrant@R:~/iperf-2.0.5$ iperf -s -u -p 12345
Server listening on UDP port 12345
Receiving 1470 byte datagrams
UDP buffer size: 208 KByte (default)
Why does it show local 10.0.2.15 rather than 192.168.56.2? Does anyone know how I should configure them to make UDP work?
Davide Pesavento wrote:
Weiwei Liu wrote:
I used vagrant to set up the VMs and made them to be in the same private network, and I can ping from one VM to another. Is it enough?
Make sure iperf -u works between the two machines.
Using the Vagrantfile you sent me on Friday (with minor modifications), I can perform the iperf test with no issues. What have you changed in the file since then?
Updated by Weiwei Liu over 8 years ago
- File nodes.json added
- File Vagrantfile added
Reply to note-21.
With the one I previously sent you, somehow the VMs can't even ping each other (I don't know why). So I switched to the one above, and I used nodes.json to configure all the nodes.
Updated by Eric Newberry over 8 years ago
Here's what I get with the Vagrantfile and the json:
vagrant@R:~$ iperf -c 192.168.56.2 -u -p 12345
------------------------------------------------------------
Client connecting to 192.168.56.2, UDP port 12345
Sending 1470 byte datagrams
UDP buffer size: 208 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.56.6 port 60825 connected with 192.168.56.2 port 12345
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.25 MBytes 1.05 Mbits/sec
[ 3] Sent 893 datagrams
[ 3] Server Report:
[ 3] 0.0-10.0 sec 1.25 MBytes 1.05 Mbits/sec 0.112 ms 0/ 893 (0%)
vagrant@A:~$ iperf -s -u -p 12345
------------------------------------------------------------
Server listening on UDP port 12345
Receiving 1470 byte datagrams
UDP buffer size: 208 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.56.2 port 12345 connected with 192.168.56.6 port 60825
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 3] 0.0-10.0 sec 1.25 MBytes 1.05 Mbits/sec 0.112 ms 0/ 893 (0%)
I noticed in your example that you attempted to reach 192.198.56.6, instead of 192.168.56.6. Is this a typo?
Updated by Weiwei Liu over 8 years ago
Yes, there was a typo. Now I got the same output as yours. Thanks.
Eric Newberry wrote:
Here's what I get with the Vagrantfile and the json:
[...]
I noticed in your example that you attempted to reach 192.198.56.6, instead of 192.168.56.6. Is this a typo?
Updated by Weiwei Liu over 8 years ago
Sorry, the previous UDP problem was due to my typo. So it turns out UDP does work between VMs. But the program still doesn't work. Did you see any other potential problems either in my code or in the test procedures?
Davide Pesavento wrote:
Junxiao Shi wrote:
Dynamically making new faces in the benchmark increases the complexity of the benchmark program, and worsens the result.
We're not "dynamically making new faces". There will still be a "setup time" at the beginning of the benchmark that should be excluded from measurements, but even if it's not, its impact should be negligible if the benchmark run is long enough. What's important is that this initial setup happens only once.
Updated by Weiwei Liu over 8 years ago
I tested with 2 UDP FaceUri pairs:
pair A (192.168.56.2), B (192.168.56.3)
pair C (192.168.56.4), D (192.168.56.5)
The benchmark is on R (192.168.56.6).
The first pair works well with bidirectional traffic. However, when I try to add the second pair, R somehow gets a "Send or receive operation failed: Connection refused" error, and thus doesn't get the data back. I don't know what is causing this problem. Is it just something related to the behaviour of ndnpoke/ndnpeek, or is something wrong with a certain VM, since such errors didn't appear with the first pair?
I tried ndnpeek/ndnpoke several more times, and sometimes the data could be successfully routed back. But from R's log, I could see the program is creating duplicate faces... In the test of the first pair of faces, the program does not create duplicate faces.
Here is the log of R node:
1460071025.261061 INFO: [UnicastUdpTransport] [id=0,local=udp4://192.168.56.6:6363,remote=udp4://192.168.56.2:6363] Creating transport
Left face created: remote=udp4://192.168.56.2:6363 local=udp4://192.168.56.6:6363
1460071025.261237 INFO: [UnicastUdpTransport] [id=0,local=udp4://192.168.56.6:6363,remote=udp4://192.168.56.3:6363] Creating transport /** first pair of faces created **/
Right face created: remote=udp4://192.168.56.3:6363 local=udp4://192.168.56.6:6363
Interest received, forwarding to udp4://192.168.56.3:6363
Data received, forwarding to udp4://192.168.56.2:6363 /** traffic from B to A **/
Interest received, forwarding to udp4://192.168.56.2:6363
Data received, forwarding to udp4://192.168.56.3:6363 /** traffic from A to B **/
1460071110.261279 INFO: [UnicastUdpTransport] [id=0,local=udp4://192.168.56.6:6363,remote=udp4://192.168.56.5:6363] Creating transport
Left face created: remote=udp4://192.168.56.5:6363 local=udp4://192.168.56.6:6363
1460071110.261375 INFO: [UnicastUdpTransport] [id=0,local=udp4://192.168.56.6:6363,remote=udp4://192.168.56.4:6363] Creating transport
Right face created: remote=udp4://192.168.56.4:6363 local=udp4://192.168.56.6:6363 /** second pair of faces created **/
Interest received, forwarding to udp4://192.168.56.4:6363 /** no data got back **/
1460071110.262076 WARNING: [UnicastUdpTransport] [id=0,local=udp4://192.168.56.6:6363,remote=udp4://192.168.56.4:6363] Send or receive operation failed: Connection refused
1460071110.262096 INFO: [Transport] [id=0,local=udp4://192.168.56.6:6363,remote=udp4://192.168.56.4:6363] setState UP -> FAILED
1460071110.262132 INFO: [Transport] [id=0,local=udp4://192.168.56.6:6363,remote=udp4://192.168.56.4:6363] setState FAILED -> CLOSED
Interest received, forwarding to udp4://192.168.56.4:6363
Interest received, forwarding to udp4://192.168.56.4:6363
1460071377.625094 INFO: [UnicastUdpTransport] [id=0,local=udp4://192.168.56.6:6363,remote=udp4://192.168.56.4:6363] Creating transport
Left face created: remote=udp4://192.168.56.4:6363 local=udp4://192.168.56.6:6363
Right face created: remote=udp4://192.168.56.5:6363 local=udp4://192.168.56.6:6363 /** duplicate pair **/
Interest received, forwarding to udp4://192.168.56.5:6363
1460071377.626418 WARNING: [UnicastUdpTransport] [id=0,local=udp4://192.168.56.6:6363,remote=udp4://192.168.56.4:6363] Send or receive operation failed: Connection refused
1460071377.626447 INFO: [Transport] [id=0,local=udp4://192.168.56.6:6363,remote=udp4://192.168.56.4:6363] setState UP -> FAILED
1460071377.626480 INFO: [Transport] [id=0,local=udp4://192.168.56.6:6363,remote=udp4://192.168.56.4:6363] setState FAILED -> CLOSED
1460071698.892197 INFO: [UnicastUdpTransport] [id=0,local=udp4://192.168.56.6:6363,remote=udp4://192.168.56.4:6363] Creating transport
Left face created: remote=udp4://192.168.56.4:6363 local=udp4://192.168.56.6:6363
Right face created: remote=udp4://192.168.56.5:6363 local=udp4://192.168.56.6:6363 /** duplicate pair **/
Interest received, forwarding to udp4://192.168.56.5:6363
Data received, forwarding to udp4://192.168.56.4:6363 /** got data here **/
Data received, forwarding to udp4://192.168.56.4:6363
Data received, forwarding to udp4://192.168.56.4:6363
1460071698.895647 WARNING: [UnicastUdpTransport] [id=0,local=udp4://192.168.56.6:6363,remote=udp4://192.168.56.4:6363] Send or receive operation failed: Connection refused
1460071698.895668 INFO: [Transport] [id=0,local=udp4://192.168.56.6:6363,remote=udp4://192.168.56.4:6363] setState UP -> FAILED
1460071698.895713 INFO: [Transport] [id=0,local=udp4://192.168.56.6:6363,remote=udp4://192.168.56.4:6363] setState FAILED -> CLOSED
1460071738.295207 INFO: [UnicastUdpTransport] [id=0,local=udp4://192.168.56.6:6363,remote=udp4://192.168.56.4:6363] Creating transport
Left face created: remote=udp4://192.168.56.4:6363 local=udp4://192.168.56.6:6363
Right face created: remote=udp4://192.168.56.5:6363 local=udp4://192.168.56.6:6363 /** duplicate pair **/
Interest received, forwarding to udp4://192.168.56.5:6363
Data received, forwarding to udp4://192.168.56.4:6363 /** got data here **/
Data received, forwarding to udp4://192.168.56.4:6363
Data received, forwarding to udp4://192.168.56.4:6363
Data received, forwarding to udp4://192.168.56.4:6363
Updated by Davide Pesavento over 8 years ago
Make sure that both nfd and ndnpoke are still up and running on 192.168.56.4 (node C) when the Interest packet from D comes. ndnpoke in particular has a timeout (a few seconds by default) and spontaneously quits when the timeout expires (see the -w option).
Updated by Junxiao Shi over 8 years ago
- Assignee changed from Weiwei Liu to Chengyu Fan
- Target version set to v0.5
- % Done changed from 0 to 40
Benchmark program is in place at tests/other/face-benchmark.cpp.
This task is passed to @Chengyu Fan for running the tests.
I expect the throughput from a benchmark (remember to disable debug mode when compiling) and a profiling report.
Updated by Chengyu Fan over 8 years ago
- File gprof.face-system.no-drop.txt added
- File profiling.face-system.no-drop.txt added
- File expandFunctionsWithInclusiveTimeByDepth.v0.4.py added
The attached files are the gprof report, and the script that reads gprof report to produce profiling report.
I did the tests on ONL. 16 physical client-server node pairs communicate through a central node. The central node is a core8 host, but each end host has only one core.
The test uses 32 traffic generator client-server pairs, 5 name components, and each component contains 5 chars.
In the test, each client sends an Interest every 9ms, and the face-system forwards both the Interests and content smoothly (no packet drop).
Note that the profiling report ignores the functions whose inclusive time is smaller than 1%, so don't be surprised if some functions are not in the report. Some functions are "spontaneous", and I listed all of them in the report.
Updated by Davide Pesavento over 8 years ago
Chengyu Fan wrote:
The attached files are the gprof report, and the script that reads gprof report to produce profiling report.
Thanks.
I did the tests on ONL. 16 physical client-server node pairs communicate through a central node. The central node is a core8 host, but each end host is only one core.
face-benchmark is single-threaded so it doesn't matter how many cores the central node has.
In the test, each client sends an Interest every 9ms, and the face-system forwards both the Interests and content smoothly (no packet drop).
So that means about 7100 pkts/s forwarded by the "central node", right? Do you start getting packet losses if you increase the Interest rate?
Note that the profiling report ignores the functions whose inclusive time is smaller than 1%, so don't be surprised if some functions are not in the report.
Yeah, 1% is way too high, it makes the report hard to understand. Can you lower the threshold to about 0.1-0.3% ? (ideally we should at least reach the NFD/ndn-cxx boundary, otherwise we won't be able to identify the culprit)
The report shows 10.30% of the time spent in LpReassembler::receiveFragment or its children... how big were the Data packets? Were they fragmented?
The 8% in Name destructor is very surprising, since Name is basically a Block and has no user-defined destructor, but ~Block() doesn't show up in the report. I'm afraid this is another artifact of gprof. Just to double-check: are you building in release mode? is ndn-cxx also built with profiling enabled?
Any chance we can try and run the benchmark under callgrind (which I trust much more than gprof), or use a different (more modern) sampling profiler such as the one provided by gperftools?
Updated by Junxiao Shi over 8 years ago
I looked at profiling.face-system.no-drop.txt attached to note-29, and I'm unable to determine what's slow in the face system.
For a more useful report, do not ignore any function regardless of its inclusive time.
Updated by Chengyu Fan over 8 years ago
My apologies, I just found the compiler flag I used in the script is "c++0x".
In case it introduced issues, I have to re-generate the report.
Davide Pesavento wrote:
Chengyu Fan wrote:
The attached files are the gprof report, and the script that reads gprof report to produce profiling report.
Thanks.
I did the tests on ONL. 16 physical client-server node pairs communicate through a central node. The central node is a core8 host, but each end host is only one core.
face-benchmark is single-threaded so it doesn't matter how many cores the central node has.
In the test, each client sends an Interest every 9ms, and the face-system forwards both the Interests and content smoothly (no packet drop).
So that means about 7100 pkts/s forwarded by the "central node", right? Do you start getting packet losses if you increase the Interest rate?
Yes, 3500 pkts/s for Interests, and 3500 pkts/s for Content.
After compiling with gprof instrumentation, the program is slow. When I increase the Interest rate, it starts to drop all the content.
Note that the profiling report ignores the functions whose inclusive time is smaller than 1%. So you may not wonder why some functions are not in the report.
Yeah, 1% is way too high, it makes the report hard to understand. Can you lower the threshold to about 0.1-0.3% ? (ideally we should at least reach the NFD/ndn-cxx boundary, otherwise we won't be able to identify the culprit)
The report shows 10.30% of the time spent in LpReassembler::receiveFragment or its children... how big were the Data packets? Were they fragmented?
Data payload is 10B, so the UDP packet size is around 110B. No fragmentation.
The 8% in Name destructor is very surprising, since Name is basically a Block and has no user-defined destructor, but ~Block() doesn't show up in the report. I'm afraid this is another artifact of gprof. Just to double-check: are you building in release mode? is ndn-cxx also built with profiling enabled?
The ~Block() is called by lots of other functions. And its inclusive time in each branch is too small, so it has been ignored.
Any chance we can try and run the benchmark under callgrind (which I trust much more than gprof), or use a different (more modern) sampling profiler such as the one provided by gperftools?
I will. Since gprof sometimes gives me weird records that I cannot explain, I'm also thinking about trying other profilers. But it will take me a little time to change my scripts.
Updated by Davide Pesavento over 8 years ago
Chengyu Fan wrote:
My apologies, I just found the compiler flag I used in the script is "c++0x".
In case it introduced issues, I have to re-generate the report.
No need. In gcc-4.7 and later, -std=c++11 and -std=c++0x have exactly the same meaning and can be used interchangeably.
After compiling with gprof instrumentation, the program is slow. When I increase the Interest rate, it starts to drop all the content.
Interesting. I did not expect so much overhead from the profiling instrumentation. I insist on trying the gperftools profiler which should have minimal overhead.
The 8% in Name destructor is very surprising, since Name is basically a Block and has no user-defined destructor, but ~Block() doesn't show up in the report. I'm afraid this is another artifact of gprof. Just to double-check: are you building in release mode? is ndn-cxx also built with profiling enabled?
The ~Block() is called by lots of other functions. And its inclusive time in each branch is too small, so it has been ignored.
Doesn't make sense. If ~Block() is <1% and ~Name() is 8%, where is ~Name() spending its time? Again, I don't think these numbers are correct. I think the ndn-cxx library should also be instrumented during the benchmark runs.
I will. Since gprof sometimes gives me weird records that I cannot explain, I'm also thinking about trying other profilers. But it will take me a little time to change my scripts.
Note that gperftools comes with a postprocessing tool (pprof) that generates either a graph or gprof-like textual output, so I don't think you need to change anything. Callgrind output file can be opened directly in kcachegrind. Let's not reinvent the wheel.
Updated by Chengyu Fan over 8 years ago
- File gprof.local-360000-16-udp4-5-5-5-100.txt added
- File profiling.face-system.gprof.local-360000-16-udp4-5-5-5-100.txt added
- File expandFunctionsWithInclusiveTimeByDepth.v0.5.py added
Davide Pesavento wrote:
Chengyu Fan wrote:
My apologies, I just found the compiler flag I used in the script is "c++0x".
In case it introduced issues, I have to re-generate the report.
No need. In gcc-4.7 and later, -std=c++11 and -std=c++0x have exactly the same meaning and can be used interchangeably.
After compiling with gprof instrumentation, the program is slow. When I increase the Interest rate, it starts to drop all the content.
Interesting. I did not expect so much overhead from the profiling instrumentation. I insist on trying the gperftools profiler which should have minimal overhead.
It may be because the end-hosts are not powerful enough.
The 8% in Name destructor is very surprising, since Name is basically a Block and has no user-defined destructor, but ~Block() doesn't show up in the report. I'm afraid this is another artifact of gprof. Just to double-check: are you building in release mode? is ndn-cxx also built with profiling enabled?
The ~Block() is called by lots of other functions. And its inclusive time in each branch is too small, so it has been ignored.
Doesn't make sense. If ~Block() is <1% and ~Name() is 8%, where is ~Name() spending its time? Again, I don't think these numbers are correct. I think the ndn-cxx library should also be instrumented during the benchmark runs.
Also, the scripts ignore the functions that are neither in NFD nor callers of NFD functions.
I will. Since gprof sometimes gives me weird records that I cannot explain, I'm also thinking about trying other profilers. But it will take me a little time to change my scripts.
Note that gperftools comes with a postprocessing tool (pprof) that generates either a graph or gprof-like textual output, so I don't think you need to change anything. Callgrind output file can be opened directly in kcachegrind. Let's not reinvent the wheel.
Well, it doesn't hurt to give it a try.
Well, I have already done a new test...
In this new test, the face-system is compiled with -std=c++11.
This test uses 16 client-server pairs. And each client sends an Interest every 5ms. So 6400 pkts/s go through the NFD. Each Interest has 5 name components, and each component contains 5 chars. The content payload is 100B.
Note that this test has almost no drops. I tried several rounds, but still cannot generate enough Interests to overload the NFD. I may need to ask John for help.
If clients send too many Interests, the ONL GUI (which collects stats on the physical router) shows the face-system can forward all the Interests, but the amount of content coming from the server side decreases. I guess it is because the end hosts are not powerful enough.
In the profiling report, a lot of functions appear in different branches, because they are called by different functions and the inclusive time is divided accordingly.
If you are interested in a specific function and its children, you can use the script "expandFunctionsWithInclusiveTimeByDepth.v0.5.py" to print them out.
The profiling report strictly follows the gprof report.
If you think gprof is not good enough, please also let me know; I may need to try another profiler.
Updated by Chengyu Fan over 8 years ago
- File callgrind.face-system.local-360000-16-udp4-5-5-5-100.out added
I repeated the above test using callgrind. The callgrind report is attached.
Compared to gprof, callgrind makes the face-system even slower. As a result, it lost a lot of packets. For a better call graph browsing experience, it is highly recommended to use KCachegrind.
Updated by Davide Pesavento over 8 years ago
Chengyu Fan wrote:
I repeat the above test using callgrind. The callgrind report is attached.
Did you do this on a profiled build? You're not supposed to compile with profiling enabled when running under callgrind, you should use a normal release build.
Updated by Davide Pesavento over 8 years ago
Anyway, assuming the profiling overhead didn't spoil the results too much, the biggest offenders in this benchmark are: memory allocation/deallocation operations, Block copy constructor, Block destructor, and Block::parse. These functions in turn are usually called from various wire encoding/decoding methods, especially Name::wireDecode and lp::Packet::wireEncode. So, I think we could obtain a nice improvement by optimizing buffer and memory handling in the face system and in NFD in general (this is not a surprise).
As far as the face system itself is concerned, something fishy is going on in DatagramTransport<>::doSend. It seems from callgrind output that each invocation of doSend calls Block's copy constructor 6 times. Asio could be doing some of the extra copies (there's at least one in reactive_socket_send_op::do_complete), but I think this is worth investigating.
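As a purely illustrative note on the "optimizing buffer and memory handling" direction mentioned above (a simplified stand-in for ndn::Block, not an analysis of where the actual copies in doSend come from): even when the wire buffer is shared, each copy still touches the shared_ptr control block and duplicates the parsed-element vector, so moving or passing by reference along the send path is worthwhile.

```cpp
#include <cstdint>
#include <cstdio>
#include <memory>
#include <utility>
#include <vector>

static int g_copies = 0;

// Stand-in for a Block-like type: a shared wire buffer plus parsed sub-elements.
struct MiniBlock {
  std::shared_ptr<const std::vector<uint8_t>> buffer;
  std::vector<MiniBlock> elements;

  MiniBlock() = default;
  MiniBlock(const MiniBlock& other)
    : buffer(other.buffer), elements(other.elements) { ++g_copies; }
  MiniBlock(MiniBlock&&) = default;
  MiniBlock& operator=(const MiniBlock&) = default;
  MiniBlock& operator=(MiniBlock&&) = default;
};

static void sendByValue(MiniBlock b) { (void)b; }                                       // one copy per call
static void sendByMove(MiniBlock&& b) { MiniBlock owned = std::move(b); (void)owned; }  // no copy

int main()
{
  MiniBlock pkt;
  pkt.buffer = std::make_shared<const std::vector<uint8_t>>(1500);

  sendByValue(pkt);            // copies the block
  sendByMove(std::move(pkt));  // transfers ownership instead

  std::printf("copies: %d\n", g_copies);  // prints 1
}
```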
Updated by Chengyu Fan over 8 years ago
Davide Pesavento wrote:
Chengyu Fan wrote:
I repeat the above test using callgrind. The callgrind report is attached.
Did you do this on a profiled build? You're not supposed to compile with profiling enabled when running under callgrind, you should use a normal release build.
I see, compile with debugging info, but not with profiling.
Updated by Davide Pesavento over 8 years ago
Chengyu Fan wrote:
I see, compile with debugging info, but not with profiling.
Exactly. And also note that "debugging info" does NOT mean configuring with --debug. We always add -g to CXXFLAGS, even for release builds. And we do want release builds because we want to benchmark a binary that has been optimized by the compiler.
Updated by Chengyu Fan over 8 years ago
Davide Pesavento wrote:
Chengyu Fan wrote:
I see, compile with debugging info, but not with profiling.
Exactly. And also note that "debugging info" does NOT mean configuring with --debug. We always add -g to CXXFLAGS, even for release builds. And we do want release builds because we want to benchmark a binary that has been optimized by the compiler.
I see, so the release build ("./waf configure") is fine for callgrind.
FYI, since someone else is occupying the ONL resources, it will be a while before I can generate the callgrind results.
Updated by Chengyu Fan over 8 years ago
- File callgrind.face-system.no-gprof-flags.local-360000-16-udp4-5-5-5-100.out added
Uploaded the callgrind report generated without the gprof flags.
Updated by Chengyu Fan over 8 years ago
Here are some updates I get from the callgrind report.
Top self-time functions in the face-system (note that some functions are correlated):
- Total self time for functions below is 26.7%. Basically, they are Block copy constructor, Block destructor, shared_ptr counters, and memory allocation. This indicates we can improve performance by optimizing buffer and memory handling in the face system.
5% ndn::Block::Block(ndn::Block const&)’2
4.19% ndn::Block::~Block() <cycle 1>
3.09% std::_Sp_counted_base(_gnu_cxx::)
2.77% ndn::Block::~Block()’2 <cycle 1>
2.35% std::vector<ndn::Block, std::allocator<ndn::Block>>::~vector()
2.29% ndn::Block::Block(ndn::Block const&) <cycle 2>
1.79% ndn::Block* std::__uninitialized_copy<false>::__uninit_copy<__gnu_cxx:: ...>
1.64% ndn::Block* std::__uninitialized_copy<false>::__uninit_copy<__gnu_cxx:: ...>
1.64% ndn::Block::Block(ndn::Block&&)
1.04% std::_Sp_counted_base(_gnu_cxx::)
0.90% std::vector<ndn::Block, std::allocator<ndn::Block>>::operator=(std::vector<ndn::Block, std::...
The callers of the ndn::Block functions may be wireEncode(), wireDecode(), and parse(). These functions in turn are called by the lp::Packet(), Interest(), and Data() constructors etc. They take more than 31% of the time. Surprisingly, another function, nfd::face::DatagramTransport::doSend(), also calls the ndn::Block constructor many times, accounting for about 11.07% of the time.
- nfd::face::GenericLinkService::sendNetPacket() takes 34.93% of the running time. Its child wireEncode() is 13.38%. wireEncode() calls other Block operations (parse() and memory allocation) as well. It seems doSend() is trying to recreate link-protocol packets to adapt to various MTUs. However, this introduces a lot of overhead.
- wireDecode() for Interest and Data is time-consuming, taking 9.74% and 16.83% respectively. Optimizing this function could improve the performance.
- GenericLinkService::doReceivePacket() spends 4.41% of the time in ndn::lp::Packet::Packet() creation. GenericLinkService uses an extra copy for lp::Packet: it creates an lp::Packet at line 172, and later copies the lp::Packet returned from LpReassembler::receiveFragment().
Updated by Junxiao Shi over 8 years ago
As indicated in note-28, I expect both a throughput number from the benchmark and a profiling report.
Several profiling reports have been uploaded, but there isn't a valid throughput number from the benchmark.
The "3500/3500" mentioned in note-32 is a throughput obtained from a compilation with profiling enabled, which is invalid as a benchmark result.
Updated by Chengyu Fan over 8 years ago
Benchmark throughput: when 1 name component is used, around 37k pps of two-way traffic are forwarded. Results with 5 name components are the same. When 25 name components are used, the number is around 33k pps.
Not sure why you need this number... but the face-system forwarded only around 770 pps of two-way traffic in the callgrind profiling experiment, because Valgrind simulates every single instruction the program executes.
Updated by Chengyu Fan over 8 years ago
The ONL people are having trouble connecting clusters, so I cannot get enough resources to benchmark the performance with TCMalloc. As soon as ONL is up, I will report the results here.
Here is the message from John:
We had a couple of surprises when re-assembling ONL.
Some of the racks are farther apart now than they were in the old machine room and the cables we had did not reach. We are in the process of ordering new cables. It will probably be at least the end of this coming week before they will be in.
The missing cables mean that some clusters of machines cannot interconnect with other clusters. So, for large topologies like yours the CRD is not able to make the connections it needs to make a reservation for the topology.
Updated by Chengyu Fan over 8 years ago
I have reported the benchmark results for the face-system with TCMalloc in task #3564. Please refer to http://redmine.named-data.net/issues/3564#note-24 for the details.
Updated by Junxiao Shi about 8 years ago
COPSS: An Efficient Content Oriented Publish/Subscribe System gives an idea for a different aspect in benchmarking:
The time between the incoming and outgoing instants of each packet is measured using Wireshark
Since NFD is single-threaded, this is a valid way to compare performance.
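A minimal sketch of how such per-packet residence times could be computed offline (hypothetical record format exported from the capture, not an actual Wireshark/tshark workflow): pair each packet's incoming and outgoing timestamps and report the difference.

```cpp
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

struct Record {
  char dir;          // 'I' = seen on the incoming face, 'O' = seen on the outgoing face
  double timestamp;  // seconds
  std::string id;    // something that identifies the packet, e.g. its Name
};

int main()
{
  // in practice these records would be exported from the packet capture
  std::vector<Record> records = {
    {'I', 0.000100, "/example/a"},
    {'O', 0.000142, "/example/a"},
  };

  std::unordered_map<std::string, double> inTime;
  for (const auto& r : records) {
    if (r.dir == 'I') {
      inTime[r.id] = r.timestamp;
    }
    else if (inTime.count(r.id) > 0) {
      double delayUs = (r.timestamp - inTime[r.id]) * 1e6;
      std::printf("%s forwarded in %.1f us\n", r.id.c_str(), delayUs);
    }
  }
}
```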
Updated by Davide Pesavento almost 7 years ago
- Status changed from In Progress to Feedback
- Target version deleted (v0.5)