Feature #4172

Optimize TLV decoding for contiguous memory input

Added by Junxiao Shi about 3 years ago. Updated about 3 years ago.

Status: Closed
Priority: Normal
Category: Base
% Done: 100%
Estimated time: 3.00 h

Description

The ndn::tlv::readVarNumber and ndn::tlv::readNonNegativeInteger function templates are designed around the InputIterator concept, through which only one octet of input can be read at a time.
The majority of their use cases operate over contiguous memory, such as const uint8_t* and std::vector<uint8_t> (the underlying type of ndn::Buffer).
Specializations of ndn::tlv::readVarNumber and ndn::tlv::readNonNegativeInteger for contiguous memory may deliver better performance.
Caution: such specializations must avoid misaligned memory accesses (#4097).
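For illustration, a contiguous-memory fast path could look like the sketch below. This is not the actual ndn-cxx code; the function name and signature are hypothetical. Assembling the value byte by byte sidesteps the misaligned-access concern from #4097:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical sketch of a VAR-NUMBER decoder specialized for contiguous
// memory. The NDN TLV format stores the number directly in the first octet
// when it is below 253; otherwise the first octet (253, 254, 255) selects a
// 2-, 4-, or 8-octet big-endian value that follows.
inline uint64_t
readVarNumberContiguous(const uint8_t*& begin, const uint8_t* end)
{
  if (begin == end)
    throw std::invalid_argument("empty input");

  uint8_t firstOctet = *begin++;
  size_t width;
  if (firstOctet < 253)
    return firstOctet;
  else if (firstOctet == 253)
    width = 2;
  else if (firstOctet == 254)
    width = 4;
  else
    width = 8;

  if (static_cast<size_t>(end - begin) < width)
    throw std::invalid_argument("insufficient input");

  uint64_t value = 0;
  for (size_t i = 0; i < width; ++i)   // big-endian assembly, alignment-safe
    value = (value << 8) | begin[i];
  begin += width;
  return value;
}
```

An actual optimized implementation might instead memcpy into an integer and byte-swap; modern compilers typically turn either form into a single load plus bswap.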


Files

c4017,1_benchmark.txz (912 Bytes) - Junxiao Shi, 07/09/2017 12:24 AM
c4016,4_vs_c4017,5.tar.xz (1.13 KB) - Junxiao Shi, 07/12/2017 06:23 PM

Related issues

Related to ndn-cxx - Bug #4097: Misaligned memory accesses in TLV decoding (Closed) - Junxiao Shi

#1

Updated by Junxiao Shi about 3 years ago

  • Related to Bug #4097: Misaligned memory accesses in TLV decoding added
#2

Updated by Junxiao Shi about 3 years ago

  • Subject changed from Optimize TLV decoding for continuous memory input to Optimize TLV decoding for contiguous memory input
  • Description updated (diff)
#3

Updated by Junxiao Shi about 3 years ago

  • Target version set to v0.6

Currently, ndn-cxx itself uses const uint8_t* and std::vector<uint8_t>::const_iterator which are both contiguous.
My question is: what concept check or type traits can I use to identify whether an iterator is contiguous?

#4

Updated by Davide Pesavento about 3 years ago

None. The concept of ContiguousIterator was introduced in C++17, but there's no corresponding tag to identify them (http://en.cppreference.com/w/cpp/iterator/iterator_tags).
Regardless, my suggestion is don't bother, just document the requirement, or use a raw pointer.
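For reference, pre-C++17 code can only approximate this with a whitelist-style trait; the trait below is a hypothetical sketch, not an ndn-cxx API:

```cpp
#include <cstdint>
#include <iterator>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical trait: since C++14 has no contiguous_iterator_tag, the best
// one can do is enumerate iterator types known to be contiguous.
template<typename I>
struct IsContiguousIterator
  : std::integral_constant<bool,
      std::is_pointer<I>::value ||
      std::is_same<I, std::vector<uint8_t>::iterator>::value ||
      std::is_same<I, std::vector<uint8_t>::const_iterator>::value ||
      std::is_same<I, std::string::const_iterator>::value>
{
};

static_assert(IsContiguousIterator<const uint8_t*>::value, "");
static_assert(IsContiguousIterator<std::vector<uint8_t>::const_iterator>::value, "");
static_assert(!IsContiguousIterator<std::istreambuf_iterator<char>>::value, "");
```

C++20 eventually added std::contiguous_iterator_tag and the std::contiguous_iterator concept, making such whitelists unnecessary.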

#5

Updated by Junxiao Shi about 3 years ago

https://gerrit.named-data.net/4017

The benchmark uses the same method as #4097-9. "old" refers to the state prior to #4097.

| octets | offset | old avg | old stdev | contiguous avg | contiguous stdev | contiguous/old |
|--------|--------|-------------|--------------------|---------------|--------------------|----------|
| 3 | 0 | 812482053.6 | 701817.6516277572 | 720417098 | 588594.0613844146 | 0.886687 |
| 3 | 1 | 774634608 | 1319320.5595839852 | 742218213 | 2133410.8850399633 | 0.958153 |
| 3 | 2 | 782558356.2 | 60501227.94289818 | 747937921.4 | 33637689.2025371 | 0.95576 |
| 3 | 3 | 788475709.2 | 948486.7703187536 | 741573266.6 | 1278954.0214060864 | 0.940515 |
| 3 | 4 | 779499019 | 646236.0000595758 | 741723655.2 | 1223270.4198952494 | 0.951539 |
| 3 | 5 | 773129479 | 360371.03450332966 | 741554582.4 | 973496.9046477755 | 0.95916 |
| 3 | 6 | 762311734.6 | 2132297.9371582433 | 722301239.6 | 3976593.032321047 | 0.947514 |
| 3 | 7 | 788610076.8 | 480931.3301851939 | 740936253.2 | 750475.7637267176 | 0.939547 |
| 5 | 0 | 799881597.4 | 28430233.06545826 | 754906487.4 | 31987273.35104906 | 0.943773 |
| 5 | 1 | 811857209.8 | 35273182.40241241 | 760312106 | 30742664.30965074 | 0.93651 |
| 5 | 2 | 800209664.6 | 27915129.02644787 | 755112628.8 | 31329576.904880438 | 0.943643 |
| 5 | 3 | 796349938.6 | 1095874.877914354 | 737592967.6 | 812740.2293188765 | 0.926217 |
| 5 | 4 | 787944984 | 999771.4858626445 | 731449738.6 | 686685.8611354248 | 0.928301 |
| 5 | 5 | 787570458.6 | 1363913.150264818 | 737945971.6 | 946333.7080014639 | 0.93699 |
| 5 | 6 | 796333850.6 | 1415933.5906098492 | 732128598.6 | 1067301.6668914652 | 0.919374 |
| 5 | 7 | 788270998.2 | 1387687.358025827 | 738639585.8 | 1712455.3727124396 | 0.937038 |
| 9 | 0 | 821622216.2 | 33191559.53856336 | 851590708 | 16900278.793912705 | 1.03647 |
| 9 | 1 | 804525165.8 | 35446554.45638711 | 828553603.2 | 27589886.059208523 | 1.02987 |
| 9 | 2 | 821298807.2 | 32628174.377149355 | 824625671.8 | 23805258.964047432 | 1.00405 |
| 9 | 3 | 788152437.2 | 1367256.6681333832 | 808753683 | 1260138.4800185652 | 1.02614 |
| 9 | 4 | 807503746 | 1777432.257083656 | 839466433.4 | 1224056.4257783627 | 1.03958 |
| 9 | 5 | 806602488.2 | 542079.9331820908 | 809092126.4 | 1133905.7031957288 | 1.00309 |
| 9 | 6 | 787717199.4 | 267770.6504357787 | 839255660.4 | 1578352.8814556333 | 1.06543 |
| 9 | 7 | 806359870 | 263729.44687975215 | 808629558.6 | 1309703.1054553166 | 1.00281 |
#6

Updated by Alex Afanasyev about 3 years ago

Hmm. These are the numbers I see when running the benchmark (macOS, 2.3 GHz Intel Core i7). The new code is slower...

Old code
size=1 offset=0 219556000 nanoseconds
size=3 offset=0 268252000 nanoseconds
size=3 offset=1 267114000 nanoseconds
size=5 offset=0 264818000 nanoseconds
size=5 offset=1 270172000 nanoseconds
size=5 offset=2 255616000 nanoseconds
size=5 offset=3 262931000 nanoseconds
size=9 offset=0 287426000 nanoseconds
size=9 offset=1 289188000 nanoseconds
size=9 offset=2 291034000 nanoseconds
size=9 offset=3 295733000 nanoseconds
size=9 offset=4 291629000 nanoseconds
size=9 offset=5 289105000 nanoseconds
size=9 offset=6 277025000 nanoseconds
size=9 offset=7 303509000 nanoseconds

New code
size=1 offset=0 215387000 nanoseconds
size=3 offset=0 337295000 nanoseconds
size=3 offset=1 328851000 nanoseconds
size=5 offset=0 347530000 nanoseconds
size=5 offset=1 344035000 nanoseconds
size=5 offset=2 361555000 nanoseconds
size=5 offset=3 358717000 nanoseconds
size=9 offset=0 347946000 nanoseconds
size=9 offset=1 343940000 nanoseconds
size=9 offset=2 334505000 nanoseconds
size=9 offset=3 347926000 nanoseconds
size=9 offset=4 328437000 nanoseconds
size=9 offset=5 348441000 nanoseconds
size=9 offset=6 333521000 nanoseconds
size=9 offset=7 348286000 nanoseconds

#7

Updated by Davide Pesavento about 3 years ago

Alex Afanasyev wrote:

Hmm. These are the numbers I see when running the benchmark (macOS, 2.3 GHz Intel Core i7). The new code is slower...

Did you build in release mode?

#8

Updated by Davide Pesavento about 3 years ago

Intel Xeon CPU E5-2670 v2, Linux 4.9.20, gcc-4.9.4

OLD
size=1 offset=0 343432052 nanoseconds
size=3 offset=0 395435459 nanoseconds
size=3 offset=1 366619089 nanoseconds
size=5 offset=0 379292242 nanoseconds
size=5 offset=1 379371107 nanoseconds
size=5 offset=2 380444428 nanoseconds
size=5 offset=3 379511179 nanoseconds
size=9 offset=0 383363360 nanoseconds
size=9 offset=1 390663322 nanoseconds
size=9 offset=2 390677738 nanoseconds
size=9 offset=3 389844139 nanoseconds
size=9 offset=4 378852299 nanoseconds
size=9 offset=5 380164130 nanoseconds
size=9 offset=6 379236474 nanoseconds
size=9 offset=7 385322147 nanoseconds

NEW
size=1 offset=0 364241926 nanoseconds
size=3 offset=0 368986408 nanoseconds
size=3 offset=1 365006562 nanoseconds
size=5 offset=0 358115669 nanoseconds
size=5 offset=1 369409896 nanoseconds
size=5 offset=2 357927919 nanoseconds
size=5 offset=3 365090751 nanoseconds
size=9 offset=0 365555763 nanoseconds
size=9 offset=1 361008081 nanoseconds
size=9 offset=2 356402152 nanoseconds
size=9 offset=3 356399986 nanoseconds
size=9 offset=4 356410952 nanoseconds
size=9 offset=5 356396298 nanoseconds
size=9 offset=6 360957043 nanoseconds
size=9 offset=7 356505715 nanoseconds

Except for size=1, the new code is consistently faster.

#9

Updated by Alex Afanasyev about 3 years ago

Yes, optimized mode (./waf configure --with-tests). Repeated tests again with the same results as before (new a bit slower). Just in case, my "old" is 5f34d30e93b87b25e1cca14e91cfcacc9a0b8784 and "new" 11feffda993611dde4286934b7248816e8f13b37.

#10

Updated by Davide Pesavento about 3 years ago

Alex Afanasyev wrote:

Yes, optimized mode (./waf configure --with-tests). Repeated tests again with the same results as before (new a bit slower). Just in case, my "old" is 5f34d30e93b87b25e1cca14e91cfcacc9a0b8784 and "new" 11feffda993611dde4286934b7248816e8f13b37.

Ok, I used exactly the same setup. I'll try with clang later.

#11

Updated by Junxiao Shi about 3 years ago

Ubuntu 16.04 32-bit, ./waf configure --enable-static --disable-shared --with-tests --without-pch --with-sanitizer=address,undefined
UBSan output for ndn-cxx:commit:81a088afbc249c50398d38f1cea77092cc95a830 contains no complaints about misaligned access.

ubuntu@m0213:~/san-ndn-cxx$ build/unit-tests -t Encoding/TestTlv -l test_suite
Running 9 test cases...
Entering test suite "ndn-cxx Tests"
Entering test suite "Encoding"
Entering test suite "TestTlv"
Entering test suite "VarNumber"
Entering test case "SizeOf"
Leaving test case "SizeOf"; testing time: 156mks
Entering test case "Write"
Leaving test case "Write"; testing time: 183mks
Entering test case "ReadFromBuffer"
Leaving test case "ReadFromBuffer"; testing time: 490mks
Entering test case "ReadFromStream"
Leaving test case "ReadFromStream"; testing time: 422mks
Leaving test suite "VarNumber"
Entering test suite "NonNegativeInteger"
Entering test case "SizeOf"
Leaving test case "SizeOf"; testing time: 103mks
Entering test case "Write"
Leaving test case "Write"; testing time: 100mks
Entering test case "ReadFromBuffer"
Leaving test case "ReadFromBuffer"; testing time: 271mks
Entering test case "ReadFromStream"
Leaving test case "ReadFromStream"; testing time: 278mks
Leaving test suite "NonNegativeInteger"
Entering test suite "PrintHelpers"
Entering test case "PrintSignatureTypeValue"
/usr/include/boost/lexical_cast/detail/converter_lexical_streams.hpp:238:17: runtime error: downcast of address 0xbf82eaa4 which does not point to an object of type 'basic_unlockedbuf'
0xbf82eaa4: note: object is of type 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
  04 76 3b 14 28 76 3b 14  81 54 a0 b2 81 54 a0 b2  81 54 a0 b2 80 54 a0 b2  8c 54 a0 b2 80 56 a0 b2
              ^~~~~~~~~~~
              vptr for 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
/usr/include/boost/lexical_cast/detail/converter_lexical_streams.hpp:238:17: runtime error: downcast of address 0xbf82eaa4 which does not point to an object of type 'basic_unlockedbuf'
0xbf82eaa4: note: object is of type 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
  04 76 3b 14 28 76 3b 14  81 51 a0 b2 81 51 a0 b2  81 51 a0 b2 80 51 a0 b2  96 51 a0 b2 80 53 a0 b2
              ^~~~~~~~~~~
              vptr for 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
/usr/include/boost/lexical_cast/detail/converter_lexical_streams.hpp:238:17: runtime error: downcast of address 0xbf82eaa4 which does not point to an object of type 'basic_unlockedbuf'
0xbf82eaa4: note: object is of type 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
  04 76 3b 14 28 76 3b 14  81 4e a0 b2 81 4e a0 b2  81 4e a0 b2 80 4e a0 b2  98 4e a0 b2 80 50 a0 b2
              ^~~~~~~~~~~
              vptr for 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
../src/encoding/tlv.cpp:30:3: runtime error: load of value 4294967295, which is not a valid value for type 'SignatureTypeValue'
/usr/include/boost/lexical_cast/detail/converter_lexical_streams.hpp:238:17: runtime error: downcast of address 0xbf82eaa4 which does not point to an object of type 'basic_unlockedbuf'
0xbf82eaa4: note: object is of type 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
  04 76 3b 14 28 76 3b 14  81 4b a0 b2 81 4b a0 b2  81 4b a0 b2 80 4b a0 b2  96 4b a0 b2 80 4d a0 b2
              ^~~~~~~~~~~
              vptr for 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
Leaving test case "PrintSignatureTypeValue"; testing time: 1079mks
Leaving test suite "PrintHelpers"
Leaving test suite "TestTlv"
Leaving test suite "Encoding"
Leaving test suite "ndn-cxx Tests"

*** No errors detected
#12

Updated by Junxiao Shi about 3 years ago

My benchmark on macOS 10.12 shows that the new code is about 30% slower for 3-octet and larger VAR-NUMBERs.
./waf configure --enable-static --disable-shared --with-tests --without-osx-keychain

| size | offset | c4016,4 avg | c4016,4 stdev | c4017,5 avg | c4017,5 stdev | c4017,5 / c4016,4 |
|------|--------|-------------|--------------------|-------------|--------------------|----------|
| 1 | 0 | 259385800 | 11925135.332565414 | 250835800 | 7100459.787929229 | 0.967038 |
| 3 | 0 | 311677600 | 11584539.947706167 | 423802800 | 5795004.201896665 | 1.35975 |
| 3 | 1 | 303656200 | 12306442.97512486 | 417332600 | 9488965.370365728 | 1.37436 |
| 5 | 0 | 301721000 | 9048607.627696097 | 393108200 | 11236185.26013166 | 1.30289 |
| 5 | 1 | 314792600 | 12129824.413403518 | 397433600 | 9264698.608157743 | 1.26253 |
| 5 | 2 | 306116800 | 8804954.951616732 | 392509000 | 14246129.176025325 | 1.28222 |
| 5 | 3 | 311836600 | 7208830.5084805535 | 396697200 | 10412486.696270013 | 1.27213 |
| 9 | 0 | 329590800 | 8072253.786892481 | 400371600 | 8150640.269328539 | 1.21475 |
| 9 | 1 | 329503400 | 10183242.425671699 | 402694000 | 12135753.314071607 | 1.22212 |
| 9 | 2 | 334964000 | 11775083.48165736 | 418771000 | 9154890.41441786 | 1.2502 |
| 9 | 3 | 336735000 | 20010016.554216042 | 403890400 | 7884189.641301127 | 1.19943 |
| 9 | 4 | 332599800 | 8948226.176175924 | 419815600 | 9103528.81030208 | 1.26222 |
| 9 | 5 | 335047800 | 8720066.467636585 | 401694200 | 10620385.030685093 | 1.19892 |
| 9 | 6 | 332325000 | 10885061.047141628 | 408148800 | 14561003.784080273 | 1.22816 |
| 9 | 7 | 336079400 | 8372866.193843062 | 406360000 | 16991624.304344773 | 1.20912 |

However, I think the change should still be merged:

  • For the 3-octet VAR-NUMBER, which experienced the greatest slowdown ratio, the absolute difference per thousand operations is 0.0022 milliseconds. I estimate that no more than one thousand VAR-NUMBER decodings are needed to process each packet, and 0.0022 ms is too small to worry about.
  • macOS is mainly used on end hosts, not routers, so performance is less critical there.
  • The benchmark in note-5 indicates that the new code is consistently faster for all but 9-octet inputs on Linux, the usual platform for routers. 9-octet VAR-NUMBERs are not used in ndn-cxx: TLV-TYPE is implemented as uint32_t, which encodes to at most 5 octets, and the practical packet size limit of 8800 octets means TLV-LENGTH needs at most 3 octets.
#14

Updated by Davide Pesavento about 3 years ago

Same linux system as note-8, this time using clang-4.0.1

OLD
size=1 offset=0 225139293 nanoseconds
size=3 offset=0 286069693 nanoseconds
size=3 offset=1 285300278 nanoseconds
size=5 offset=0 286577277 nanoseconds
size=5 offset=1 281542178 nanoseconds
size=5 offset=2 286274724 nanoseconds
size=5 offset=3 292053121 nanoseconds
size=9 offset=0 292926056 nanoseconds
size=9 offset=1 299447933 nanoseconds
size=9 offset=2 291830825 nanoseconds
size=9 offset=3 307051689 nanoseconds
size=9 offset=4 292697168 nanoseconds
size=9 offset=5 298853287 nanoseconds
size=9 offset=6 293802120 nanoseconds
size=9 offset=7 298033793 nanoseconds

NEW
size=1 offset=0 222543953 nanoseconds
size=3 offset=0 395634150 nanoseconds
size=3 offset=1 395043022 nanoseconds
size=5 offset=0 364022183 nanoseconds
size=5 offset=1 357715890 nanoseconds
size=5 offset=2 364271467 nanoseconds
size=5 offset=3 357249949 nanoseconds
size=9 offset=0 355419601 nanoseconds
size=9 offset=1 368415047 nanoseconds
size=9 offset=2 357166517 nanoseconds
size=9 offset=3 355884553 nanoseconds
size=9 offset=4 356522603 nanoseconds
size=9 offset=5 361962657 nanoseconds
size=9 offset=6 356480328 nanoseconds
size=9 offset=7 352119360 nanoseconds

The "new code" results with clang are consistent with those obtained with gcc; the notable exception is the size=1 case, which doesn't use memcpy. The "old code" is somewhat faster with clang, which suggests that clang is able to optimize the non-memcpy code better than gcc (and better than the memcpy code).
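The two code shapes under discussion can be sketched as follows (illustrative only, not the ndn-cxx sources; the byte-order macros and __builtin_bswap32 assume a GCC/Clang toolchain):

```cpp
#include <cstdint>
#include <cstring>

// Byte-wise load: no alignment issues; compilers often recognize this
// pattern and emit a single load plus bswap.
inline uint32_t
loadBigEndian32_loop(const uint8_t* p)
{
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i)
    v = (v << 8) | p[i];
  return v;
}

// memcpy-based load: also alignment-safe; an explicit byte swap is needed
// on little-endian hosts to interpret the big-endian wire format.
inline uint32_t
loadBigEndian32_memcpy(const uint8_t* p)
{
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
#if defined(__GNUC__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  v = __builtin_bswap32(v);
#endif
  return v;
}
```

Both functions return the same value for any input; which compiles to faster code depends on how well the optimizer recognizes each idiom, which matches the gcc/clang differences observed above.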

#15

Updated by Alex Afanasyev about 3 years ago

Davide, what's your opinion on this? Proceed with the merge?

#16

Updated by Davide Pesavento about 3 years ago

I'd agree with merging the change.

#17

Updated by Junxiao Shi about 3 years ago

  • Status changed from Code review to Closed
