Feature #4172
closedOptimize TLV decoding for contiguous memory input
Added by Junxiao Shi over 7 years ago. Updated over 7 years ago.
100%
Description
ndn::tlv::readVarNumber
and ndn::tlv::readNonNegativeInteger
function templates are designed for InputIterator, through which only one octet of input can be read at a time.
A majority of use cases for these function templates are over contiguous memory, such as const uint8_t*
and std::vector<uint8_t>
(underlying type of ndn::Buffer
).
Specializations of ndn::tlv::readVarNumber
and ndn::tlv::readNonNegativeInteger
may deliver faster performance.
Caution: those specializations should avoid misaligned memory accesses (#4097).
Files
c4017,1_benchmark.txz (912 Bytes) c4017,1_benchmark.txz | Junxiao Shi, 07/09/2017 12:24 AM | ||
c4016,4_vs_c4017,5.tar.xz (1.13 KB) c4016,4_vs_c4017,5.tar.xz | Junxiao Shi, 07/12/2017 06:23 PM |
Updated by Junxiao Shi over 7 years ago
- Related to Bug #4097: Misaligned memory accesses in TLV decoding added
Updated by Junxiao Shi over 7 years ago
- Subject changed from Optimize TLV decoding for continuous memory input to Optimize TLV decoding for contiguous memory input
- Description updated (diff)
Updated by Junxiao Shi over 7 years ago
- Target version set to v0.6
Currently, ndn-cxx itself uses const uint8_t*
and std::vector<uint8_t>::const_iterator
which are both contiguous.
My question is: what concept check or type traits can I used to identify whether an iterator is contiguous?
Updated by Davide Pesavento over 7 years ago
None. The concept of ContiguousIterator
was introduced in C++17, but there's no corresponding tag to identify them (http://en.cppreference.com/w/cpp/iterator/iterator_tags).
Regardless, my suggestion is don't bother, just document the requirement, or use a raw pointer.
Updated by Junxiao Shi over 7 years ago
- File c4017,1_benchmark.txz c4017,1_benchmark.txz added
- Status changed from New to Code review
- Assignee set to Junxiao Shi
- % Done changed from 0 to 100
https://gerrit.named-data.net/4017
Benchmark uses same method as #4097-9. "old" refers to the state prior to #4097.
octets | offset | old avg | old stdev | contiguous avg | contiguous stdev | contiguous/old |
---|---|---|---|---|---|---|
3 | 0 | 812482053.6 | 701817.6516277572 | 720417098 | 588594.0613844146 | 0.886687 |
3 | 1 | 774634608 | 1319320.5595839852 | 742218213 | 2133410.8850399633 | 0.958153 |
3 | 2 | 782558356.2 | 60501227.94289818 | 747937921.4 | 33637689.2025371 | 0.95576 |
3 | 3 | 788475709.2 | 948486.7703187536 | 741573266.6 | 1278954.0214060864 | 0.940515 |
3 | 4 | 779499019 | 646236.0000595758 | 741723655.2 | 1223270.4198952494 | 0.951539 |
3 | 5 | 773129479 | 360371.03450332966 | 741554582.4 | 973496.9046477755 | 0.95916 |
3 | 6 | 762311734.6 | 2132297.9371582433 | 722301239.6 | 3976593.032321047 | 0.947514 |
3 | 7 | 788610076.8 | 480931.3301851939 | 740936253.2 | 750475.7637267176 | 0.939547 |
5 | 0 | 799881597.4 | 28430233.06545826 | 754906487.4 | 31987273.35104906 | 0.943773 |
5 | 1 | 811857209.8 | 35273182.40241241 | 760312106 | 30742664.30965074 | 0.93651 |
5 | 2 | 800209664.6 | 27915129.02644787 | 755112628.8 | 31329576.904880438 | 0.943643 |
5 | 3 | 796349938.6 | 1095874.877914354 | 737592967.6 | 812740.2293188765 | 0.926217 |
5 | 4 | 787944984 | 999771.4858626445 | 731449738.6 | 686685.8611354248 | 0.928301 |
5 | 5 | 787570458.6 | 1363913.150264818 | 737945971.6 | 946333.7080014639 | 0.93699 |
5 | 6 | 796333850.6 | 1415933.5906098492 | 732128598.6 | 1067301.6668914652 | 0.919374 |
5 | 7 | 788270998.2 | 1387687.358025827 | 738639585.8 | 1712455.3727124396 | 0.937038 |
9 | 0 | 821622216.2 | 33191559.53856336 | 851590708 | 16900278.793912705 | 1.03647 |
9 | 1 | 804525165.8 | 35446554.45638711 | 828553603.2 | 27589886.059208523 | 1.02987 |
9 | 2 | 821298807.2 | 32628174.377149355 | 824625671.8 | 23805258.964047432 | 1.00405 |
9 | 3 | 788152437.2 | 1367256.6681333832 | 808753683 | 1260138.4800185652 | 1.02614 |
9 | 4 | 807503746 | 1777432.257083656 | 839466433.4 | 1224056.4257783627 | 1.03958 |
9 | 5 | 806602488.2 | 542079.9331820908 | 809092126.4 | 1133905.7031957288 | 1.00309 |
9 | 6 | 787717199.4 | 267770.6504357787 | 839255660.4 | 1578352.8814556333 | 1.06543 |
9 | 7 | 806359870 | 263729.44687975215 | 808629558.6 | 1309703.1054553166 | 1.00281 |
Updated by Alex Afanasyev over 7 years ago
Hmm. These are the numbers I see when running the benchmark (macOS, 2.3 GHz Intel Core i7). The new code slower...
Old code
size=1 offset=0 219556000 nanoseconds
size=3 offset=0 268252000 nanoseconds
size=3 offset=1 267114000 nanoseconds
size=5 offset=0 264818000 nanoseconds
size=5 offset=1 270172000 nanoseconds
size=5 offset=2 255616000 nanoseconds
size=5 offset=3 262931000 nanoseconds
size=9 offset=0 287426000 nanoseconds
size=9 offset=1 289188000 nanoseconds
size=9 offset=2 291034000 nanoseconds
size=9 offset=3 295733000 nanoseconds
size=9 offset=4 291629000 nanoseconds
size=9 offset=5 289105000 nanoseconds
size=9 offset=6 277025000 nanoseconds
size=9 offset=7 303509000 nanoseconds
New code
size=1 offset=0 215387000 nanoseconds
size=3 offset=0 337295000 nanoseconds
size=3 offset=1 328851000 nanoseconds
size=5 offset=0 347530000 nanoseconds
size=5 offset=1 344035000 nanoseconds
size=5 offset=2 361555000 nanoseconds
size=5 offset=3 358717000 nanoseconds
size=9 offset=0 347946000 nanoseconds
size=9 offset=1 343940000 nanoseconds
size=9 offset=2 334505000 nanoseconds
size=9 offset=3 347926000 nanoseconds
size=9 offset=4 328437000 nanoseconds
size=9 offset=5 348441000 nanoseconds
size=9 offset=6 333521000 nanoseconds
size=9 offset=7 348286000 nanoseconds
Updated by Davide Pesavento over 7 years ago
Alex Afanasyev wrote:
Hmm. These are the numbers I see when running the benchmark (macOS, 2.3 GHz Intel Core i7). The new code slower...
Did you build in release mode?
Updated by Davide Pesavento over 7 years ago
Intel Xeon CPU E5-2670 v2, Linux 4.9.20, gcc-4.9.4
OLD
size=1 offset=0 343432052 nanoseconds
size=3 offset=0 395435459 nanoseconds
size=3 offset=1 366619089 nanoseconds
size=5 offset=0 379292242 nanoseconds
size=5 offset=1 379371107 nanoseconds
size=5 offset=2 380444428 nanoseconds
size=5 offset=3 379511179 nanoseconds
size=9 offset=0 383363360 nanoseconds
size=9 offset=1 390663322 nanoseconds
size=9 offset=2 390677738 nanoseconds
size=9 offset=3 389844139 nanoseconds
size=9 offset=4 378852299 nanoseconds
size=9 offset=5 380164130 nanoseconds
size=9 offset=6 379236474 nanoseconds
size=9 offset=7 385322147 nanoseconds
NEW
size=1 offset=0 364241926 nanoseconds
size=3 offset=0 368986408 nanoseconds
size=3 offset=1 365006562 nanoseconds
size=5 offset=0 358115669 nanoseconds
size=5 offset=1 369409896 nanoseconds
size=5 offset=2 357927919 nanoseconds
size=5 offset=3 365090751 nanoseconds
size=9 offset=0 365555763 nanoseconds
size=9 offset=1 361008081 nanoseconds
size=9 offset=2 356402152 nanoseconds
size=9 offset=3 356399986 nanoseconds
size=9 offset=4 356410952 nanoseconds
size=9 offset=5 356396298 nanoseconds
size=9 offset=6 360957043 nanoseconds
size=9 offset=7 356505715 nanoseconds
Except for size=1, the new code is consistently faster.
Updated by Alex Afanasyev over 7 years ago
Yes, optimized mode (./waf configure --with-tests
). Repeated tests again with the same results as before (new a bit slower). Just in case, my "old" is 5f34d30e93b87b25e1cca14e91cfcacc9a0b8784 and "new" 11feffda993611dde4286934b7248816e8f13b37.
Updated by Davide Pesavento over 7 years ago
Alex Afanasyev wrote:
Yes, optimized mode (
./waf configure --with-tests
). Repeated tests again with the same results as before (new a bit slower). Just in case, my "old" is 5f34d30e93b87b25e1cca14e91cfcacc9a0b8784 and "new" 11feffda993611dde4286934b7248816e8f13b37.
Ok, I used exactly the same setup. I'll try with clang later.
Updated by Junxiao Shi over 7 years ago
Ubuntu 16.04 32-bit ./waf configure --enable-static --disable-shared --with-tests --without-pch --with-sanitizer=address,undefined
UBSan output for ndn-cxx:commit:81a088afbc249c50398d38f1cea77092cc95a830 has no complaint about misaligned access.
ubuntu@m0213:~/san-ndn-cxx$ build/unit-tests -t Encoding/TestTlv -l test_suite
Running 9 test cases...
Entering test suite "ndn-cxx Tests"
Entering test suite "Encoding"
Entering test suite "TestTlv"
Entering test suite "VarNumber"
Entering test case "SizeOf"
Leaving test case "SizeOf"; testing time: 156mks
Entering test case "Write"
Leaving test case "Write"; testing time: 183mks
Entering test case "ReadFromBuffer"
Leaving test case "ReadFromBuffer"; testing time: 490mks
Entering test case "ReadFromStream"
Leaving test case "ReadFromStream"; testing time: 422mks
Leaving test suite "VarNumber"
Entering test suite "NonNegativeInteger"
Entering test case "SizeOf"
Leaving test case "SizeOf"; testing time: 103mks
Entering test case "Write"
Leaving test case "Write"; testing time: 100mks
Entering test case "ReadFromBuffer"
Leaving test case "ReadFromBuffer"; testing time: 271mks
Entering test case "ReadFromStream"
Leaving test case "ReadFromStream"; testing time: 278mks
Leaving test suite "NonNegativeInteger"
Entering test suite "PrintHelpers"
Entering test case "PrintSignatureTypeValue"
/usr/include/boost/lexical_cast/detail/converter_lexical_streams.hpp:238:17: runtime error: downcast of address 0xbf82eaa4 which does not point to an object of type 'basic_unlockedbuf'
0xbf82eaa4: note: object is of type 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
04 76 3b 14 28 76 3b 14 81 54 a0 b2 81 54 a0 b2 81 54 a0 b2 80 54 a0 b2 8c 54 a0 b2 80 56 a0 b2
^~~~~~~~~~~
vptr for 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
/usr/include/boost/lexical_cast/detail/converter_lexical_streams.hpp:238:17: runtime error: downcast of address 0xbf82eaa4 which does not point to an object of type 'basic_unlockedbuf'
0xbf82eaa4: note: object is of type 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
04 76 3b 14 28 76 3b 14 81 51 a0 b2 81 51 a0 b2 81 51 a0 b2 80 51 a0 b2 96 51 a0 b2 80 53 a0 b2
^~~~~~~~~~~
vptr for 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
/usr/include/boost/lexical_cast/detail/converter_lexical_streams.hpp:238:17: runtime error: downcast of address 0xbf82eaa4 which does not point to an object of type 'basic_unlockedbuf'
0xbf82eaa4: note: object is of type 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
04 76 3b 14 28 76 3b 14 81 4e a0 b2 81 4e a0 b2 81 4e a0 b2 80 4e a0 b2 98 4e a0 b2 80 50 a0 b2
^~~~~~~~~~~
vptr for 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
../src/encoding/tlv.cpp:30:3: runtime error: load of value 4294967295, which is not a valid value for type 'SignatureTypeValue'
/usr/include/boost/lexical_cast/detail/converter_lexical_streams.hpp:238:17: runtime error: downcast of address 0xbf82eaa4 which does not point to an object of type 'basic_unlockedbuf'
0xbf82eaa4: note: object is of type 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
04 76 3b 14 28 76 3b 14 81 4b a0 b2 81 4b a0 b2 81 4b a0 b2 80 4b a0 b2 96 4b a0 b2 80 4d a0 b2
^~~~~~~~~~~
vptr for 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
Leaving test case "PrintSignatureTypeValue"; testing time: 1079mks
Leaving test suite "PrintHelpers"
Leaving test suite "TestTlv"
Leaving test suite "Encoding"
Leaving test suite "ndn-cxx Tests"
*** No errors detected
Updated by Junxiao Shi over 7 years ago
My benchmark on macOS 10.12 shows new code is 30% slower for 3-octet and larger VAR-NUMBERs.
./waf configure --enable-static --disable-shared --with-tests --without-osx-keychain
size | offset | c4016,4 | stdev | c4017,5 | stdev | c4017,5 / c4016,4 |
---|---|---|---|---|---|---|
1 | 0 | 259385800 | 11925135.332565414 | 250835800 | 7100459.787929229 | 0.967038 |
3 | 0 | 311677600 | 11584539.947706167 | 423802800 | 5795004.201896665 | 1.35975 |
3 | 1 | 303656200 | 12306442.97512486 | 417332600 | 9488965.370365728 | 1.37436 |
5 | 0 | 301721000 | 9048607.627696097 | 393108200 | 11236185.26013166 | 1.30289 |
5 | 1 | 314792600 | 12129824.413403518 | 397433600 | 9264698.608157743 | 1.26253 |
5 | 2 | 306116800 | 8804954.951616732 | 392509000 | 14246129.176025325 | 1.28222 |
5 | 3 | 311836600 | 7208830.5084805535 | 396697200 | 10412486.696270013 | 1.27213 |
9 | 0 | 329590800 | 8072253.786892481 | 400371600 | 8150640.269328539 | 1.21475 |
9 | 1 | 329503400 | 10183242.425671699 | 402694000 | 12135753.314071607 | 1.22212 |
9 | 2 | 334964000 | 11775083.48165736 | 418771000 | 9154890.41441786 | 1.2502 |
9 | 3 | 336735000 | 20010016.554216042 | 403890400 | 7884189.641301127 | 1.19943 |
9 | 4 | 332599800 | 8948226.176175924 | 419815600 | 9103528.81030208 | 1.26222 |
9 | 5 | 335047800 | 8720066.467636585 | 401694200 | 10620385.030685093 | 1.19892 |
9 | 6 | 332325000 | 10885061.047141628 | 408148800 | 14561003.784080273 | 1.22816 |
9 | 7 | 336079400 | 8372866.193843062 | 406360000 | 16991624.304344773 | 1.20912 |
However, I think the Change still should be merged:
- For 3-octet VAR-NUMBER which experienced the greatest slowdown ratio, the absolute difference per thousand operations is 0.0022 milliseconds. I estimate no more than one thousand VAR-NUMBER decodings are needed when processing each packet, and 0.0022ms is too small to worry about.
- macOS is mainly used as end hosts, not routers. Performance is less critical on macOS.
- Benchmark in note-5 indicates the new code is consistently faster for all but 9-octet inputs on Linux, the usual platform for routers.
9-octet VAR-NUMBERs are not used in ndn-cxx, because TLV-TYPE is implemented as
uint32_t
which encodes to up to 5 octets, and the practical limit of packet size is 8800 octets so TLV-LENGTH is up to 3 octets.
Updated by Junxiao Shi over 7 years ago
Updated by Davide Pesavento over 7 years ago
Same linux system as note-8, this time using clang-4.0.1
OLD
size=1 offset=0 225139293 nanoseconds
size=3 offset=0 286069693 nanoseconds
size=3 offset=1 285300278 nanoseconds
size=5 offset=0 286577277 nanoseconds
size=5 offset=1 281542178 nanoseconds
size=5 offset=2 286274724 nanoseconds
size=5 offset=3 292053121 nanoseconds
size=9 offset=0 292926056 nanoseconds
size=9 offset=1 299447933 nanoseconds
size=9 offset=2 291830825 nanoseconds
size=9 offset=3 307051689 nanoseconds
size=9 offset=4 292697168 nanoseconds
size=9 offset=5 298853287 nanoseconds
size=9 offset=6 293802120 nanoseconds
size=9 offset=7 298033793 nanoseconds
NEW
size=1 offset=0 222543953 nanoseconds
size=3 offset=0 395634150 nanoseconds
size=3 offset=1 395043022 nanoseconds
size=5 offset=0 364022183 nanoseconds
size=5 offset=1 357715890 nanoseconds
size=5 offset=2 364271467 nanoseconds
size=5 offset=3 357249949 nanoseconds
size=9 offset=0 355419601 nanoseconds
size=9 offset=1 368415047 nanoseconds
size=9 offset=2 357166517 nanoseconds
size=9 offset=3 355884553 nanoseconds
size=9 offset=4 356522603 nanoseconds
size=9 offset=5 361962657 nanoseconds
size=9 offset=6 356480328 nanoseconds
size=9 offset=7 352119360 nanoseconds
The "new code" results with clang are consistent with those obtained with gcc. Notable exception is the case size=1, which doesn't use memcpy
. The "old code" is somewhat faster with clang. This suggests that clang is able to optimize the non-memcpy
code better than gcc (and better than the memcpy
code).
Updated by Alex Afanasyev over 7 years ago
Davide, what's your opinion on this? Proceed with the merge?
Updated by Davide Pesavento over 7 years ago
I'd agree with merging the change.
Updated by Junxiao Shi over 7 years ago
- Status changed from Code review to Closed