Feature #4172

Optimize TLV decoding for contiguous memory input

Added by Junxiao Shi about 3 years ago. Updated about 3 years ago.

Status: Closed
Priority: Normal
Category: Base
% Done: 100%
Estimated time: 3.00 h

Description

The ndn::tlv::readVarNumber and ndn::tlv::readNonNegativeInteger function templates are designed around the InputIterator concept, through which only one octet of input can be read at a time.
The majority of their use cases operate over contiguous memory, such as const uint8_t* and std::vector<uint8_t> (the underlying type of ndn::Buffer).
Specializations of ndn::tlv::readVarNumber and ndn::tlv::readNonNegativeInteger for contiguous memory may deliver better performance.
Caution: such specializations must avoid misaligned memory accesses (#4097).
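For illustration, a contiguous-memory fast path could look like the sketch below. This is not the actual ndn-cxx code; the function name and signature are hypothetical. Assembling the value byte by byte sidesteps the misaligned-access concern from #4097:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical sketch of a VAR-NUMBER decoder specialized for contiguous
// memory. The NDN TLV format stores the number directly in the first octet
// when it is below 253; otherwise the first octet (253, 254, 255) selects a
// 2-, 4-, or 8-octet big-endian value that follows.
inline uint64_t
readVarNumberContiguous(const uint8_t*& begin, const uint8_t* end)
{
  if (begin == end)
    throw std::invalid_argument("empty input");

  uint8_t firstOctet = *begin++;
  size_t width;
  if (firstOctet < 253)
    return firstOctet;
  else if (firstOctet == 253)
    width = 2;
  else if (firstOctet == 254)
    width = 4;
  else
    width = 8;

  if (static_cast<size_t>(end - begin) < width)
    throw std::invalid_argument("insufficient input");

  uint64_t value = 0;
  for (size_t i = 0; i < width; ++i)   // big-endian assembly, alignment-safe
    value = (value << 8) | begin[i];
  begin += width;
  return value;
}
```

An actual optimized implementation might instead memcpy into an integer and byte-swap; modern compilers typically turn either form into a single load plus bswap.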


Files

c4017,1_benchmark.txz (912 Bytes) - Junxiao Shi, 07/09/2017 12:24 AM
c4016,4_vs_c4017,5.tar.xz (1.13 KB) - Junxiao Shi, 07/12/2017 06:23 PM

Related issues

Related to ndn-cxx - Bug #4097: Misaligned memory accesses in TLV decoding (Closed) - Junxiao Shi

#1

Updated by Junxiao Shi about 3 years ago

  • Related to Bug #4097: Misaligned memory accesses in TLV decoding added
#2

Updated by Junxiao Shi about 3 years ago

  • Subject changed from Optimize TLV decoding for continuous memory input to Optimize TLV decoding for contiguous memory input
  • Description updated (diff)
#3

Updated by Junxiao Shi about 3 years ago

  • Target version set to v0.6

Currently, ndn-cxx itself uses const uint8_t* and std::vector<uint8_t>::const_iterator which are both contiguous.
My question is: what concept check or type traits can I use to identify whether an iterator is contiguous?

#4

Updated by Davide Pesavento about 3 years ago

None. The concept of ContiguousIterator was introduced in C++17, but there's no corresponding tag to identify them (http://en.cppreference.com/w/cpp/iterator/iterator_tags).
Regardless, my suggestion is don't bother, just document the requirement, or use a raw pointer.
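For reference, pre-C++17 code can only approximate this with a whitelist-style trait; the trait below is a hypothetical sketch, not an ndn-cxx API:

```cpp
#include <cstdint>
#include <iterator>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical trait: since C++14 has no contiguous_iterator_tag, the best
// one can do is enumerate iterator types known to be contiguous.
template<typename I>
struct IsContiguousIterator
  : std::integral_constant<bool,
      std::is_pointer<I>::value ||
      std::is_same<I, std::vector<uint8_t>::iterator>::value ||
      std::is_same<I, std::vector<uint8_t>::const_iterator>::value ||
      std::is_same<I, std::string::const_iterator>::value>
{
};

static_assert(IsContiguousIterator<const uint8_t*>::value, "");
static_assert(IsContiguousIterator<std::vector<uint8_t>::const_iterator>::value, "");
static_assert(!IsContiguousIterator<std::istreambuf_iterator<char>>::value, "");
```

C++20 eventually added std::contiguous_iterator_tag and the std::contiguous_iterator concept, making such whitelists unnecessary.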

#5

Updated by Junxiao Shi about 3 years ago

https://gerrit.named-data.net/4017

The benchmark uses the same method as #4097-9. "old" refers to the state prior to #4097.

| octets | offset | old avg | old stdev | contiguous avg | contiguous stdev | contiguous/old |
|--------|--------|-------------|--------------------|---------------|--------------------|----------|
| 3 | 0 | 812482053.6 | 701817.6516277572 | 720417098 | 588594.0613844146 | 0.886687 |
| 3 | 1 | 774634608 | 1319320.5595839852 | 742218213 | 2133410.8850399633 | 0.958153 |
| 3 | 2 | 782558356.2 | 60501227.94289818 | 747937921.4 | 33637689.2025371 | 0.95576 |
| 3 | 3 | 788475709.2 | 948486.7703187536 | 741573266.6 | 1278954.0214060864 | 0.940515 |
| 3 | 4 | 779499019 | 646236.0000595758 | 741723655.2 | 1223270.4198952494 | 0.951539 |
| 3 | 5 | 773129479 | 360371.03450332966 | 741554582.4 | 973496.9046477755 | 0.95916 |
| 3 | 6 | 762311734.6 | 2132297.9371582433 | 722301239.6 | 3976593.032321047 | 0.947514 |
| 3 | 7 | 788610076.8 | 480931.3301851939 | 740936253.2 | 750475.7637267176 | 0.939547 |
| 5 | 0 | 799881597.4 | 28430233.06545826 | 754906487.4 | 31987273.35104906 | 0.943773 |
| 5 | 1 | 811857209.8 | 35273182.40241241 | 760312106 | 30742664.30965074 | 0.93651 |
| 5 | 2 | 800209664.6 | 27915129.02644787 | 755112628.8 | 31329576.904880438 | 0.943643 |
| 5 | 3 | 796349938.6 | 1095874.877914354 | 737592967.6 | 812740.2293188765 | 0.926217 |
| 5 | 4 | 787944984 | 999771.4858626445 | 731449738.6 | 686685.8611354248 | 0.928301 |
| 5 | 5 | 787570458.6 | 1363913.150264818 | 737945971.6 | 946333.7080014639 | 0.93699 |
| 5 | 6 | 796333850.6 | 1415933.5906098492 | 732128598.6 | 1067301.6668914652 | 0.919374 |
| 5 | 7 | 788270998.2 | 1387687.358025827 | 738639585.8 | 1712455.3727124396 | 0.937038 |
| 9 | 0 | 821622216.2 | 33191559.53856336 | 851590708 | 16900278.793912705 | 1.03647 |
| 9 | 1 | 804525165.8 | 35446554.45638711 | 828553603.2 | 27589886.059208523 | 1.02987 |
| 9 | 2 | 821298807.2 | 32628174.377149355 | 824625671.8 | 23805258.964047432 | 1.00405 |
| 9 | 3 | 788152437.2 | 1367256.6681333832 | 808753683 | 1260138.4800185652 | 1.02614 |
| 9 | 4 | 807503746 | 1777432.257083656 | 839466433.4 | 1224056.4257783627 | 1.03958 |
| 9 | 5 | 806602488.2 | 542079.9331820908 | 809092126.4 | 1133905.7031957288 | 1.00309 |
| 9 | 6 | 787717199.4 | 267770.6504357787 | 839255660.4 | 1578352.8814556333 | 1.06543 |
| 9 | 7 | 806359870 | 263729.44687975215 | 808629558.6 | 1309703.1054553166 | 1.00281 |
#6

Updated by Alex Afanasyev about 3 years ago

Hmm. These are the numbers I see when running the benchmark (macOS, 2.3 GHz Intel Core i7). The new code is slower...

Old code
size=1 offset=0 219556000 nanoseconds
size=3 offset=0 268252000 nanoseconds
size=3 offset=1 267114000 nanoseconds
size=5 offset=0 264818000 nanoseconds
size=5 offset=1 270172000 nanoseconds
size=5 offset=2 255616000 nanoseconds
size=5 offset=3 262931000 nanoseconds
size=9 offset=0 287426000 nanoseconds
size=9 offset=1 289188000 nanoseconds
size=9 offset=2 291034000 nanoseconds
size=9 offset=3 295733000 nanoseconds
size=9 offset=4 291629000 nanoseconds
size=9 offset=5 289105000 nanoseconds
size=9 offset=6 277025000 nanoseconds
size=9 offset=7 303509000 nanoseconds

New code
size=1 offset=0 215387000 nanoseconds
size=3 offset=0 337295000 nanoseconds
size=3 offset=1 328851000 nanoseconds
size=5 offset=0 347530000 nanoseconds
size=5 offset=1 344035000 nanoseconds
size=5 offset=2 361555000 nanoseconds
size=5 offset=3 358717000 nanoseconds
size=9 offset=0 347946000 nanoseconds
size=9 offset=1 343940000 nanoseconds
size=9 offset=2 334505000 nanoseconds
size=9 offset=3 347926000 nanoseconds
size=9 offset=4 328437000 nanoseconds
size=9 offset=5 348441000 nanoseconds
size=9 offset=6 333521000 nanoseconds
size=9 offset=7 348286000 nanoseconds

#7

Updated by Davide Pesavento about 3 years ago

Alex Afanasyev wrote:

Hmm. These are the numbers I see when running the benchmark (macOS, 2.3 GHz Intel Core i7). The new code is slower...

Did you build in release mode?

#8

Updated by Davide Pesavento about 3 years ago

Intel Xeon CPU E5-2670 v2, Linux 4.9.20, gcc-4.9.4

OLD
size=1 offset=0 343432052 nanoseconds
size=3 offset=0 395435459 nanoseconds
size=3 offset=1 366619089 nanoseconds
size=5 offset=0 379292242 nanoseconds
size=5 offset=1 379371107 nanoseconds
size=5 offset=2 380444428 nanoseconds
size=5 offset=3 379511179 nanoseconds
size=9 offset=0 383363360 nanoseconds
size=9 offset=1 390663322 nanoseconds
size=9 offset=2 390677738 nanoseconds
size=9 offset=3 389844139 nanoseconds
size=9 offset=4 378852299 nanoseconds
size=9 offset=5 380164130 nanoseconds
size=9 offset=6 379236474 nanoseconds
size=9 offset=7 385322147 nanoseconds

NEW
size=1 offset=0 364241926 nanoseconds
size=3 offset=0 368986408 nanoseconds
size=3 offset=1 365006562 nanoseconds
size=5 offset=0 358115669 nanoseconds
size=5 offset=1 369409896 nanoseconds
size=5 offset=2 357927919 nanoseconds
size=5 offset=3 365090751 nanoseconds
size=9 offset=0 365555763 nanoseconds
size=9 offset=1 361008081 nanoseconds
size=9 offset=2 356402152 nanoseconds
size=9 offset=3 356399986 nanoseconds
size=9 offset=4 356410952 nanoseconds
size=9 offset=5 356396298 nanoseconds
size=9 offset=6 360957043 nanoseconds
size=9 offset=7 356505715 nanoseconds

Except for size=1, the new code is consistently faster.

#9

Updated by Alex Afanasyev about 3 years ago

Yes, optimized mode (./waf configure --with-tests). Repeated tests again with the same results as before (new a bit slower). Just in case, my "old" is 5f34d30e93b87b25e1cca14e91cfcacc9a0b8784 and "new" 11feffda993611dde4286934b7248816e8f13b37.

#10

Updated by Davide Pesavento about 3 years ago

Alex Afanasyev wrote:

Yes, optimized mode (./waf configure --with-tests). Repeated tests again with the same results as before (new a bit slower). Just in case, my "old" is 5f34d30e93b87b25e1cca14e91cfcacc9a0b8784 and "new" 11feffda993611dde4286934b7248816e8f13b37.

Ok, I used exactly the same setup. I'll try with clang later.

#11

Updated by Junxiao Shi about 3 years ago

Ubuntu 16.04 32-bit, ./waf configure --enable-static --disable-shared --with-tests --without-pch --with-sanitizer=address,undefined
UBSan output for ndn-cxx:commit:81a088afbc249c50398d38f1cea77092cc95a830 contains no complaints about misaligned access.

ubuntu@m0213:~/san-ndn-cxx$ build/unit-tests -t Encoding/TestTlv -l test_suite
Running 9 test cases...
Entering test suite "ndn-cxx Tests"
Entering test suite "Encoding"
Entering test suite "TestTlv"
Entering test suite "VarNumber"
Entering test case "SizeOf"
Leaving test case "SizeOf"; testing time: 156mks
Entering test case "Write"
Leaving test case "Write"; testing time: 183mks
Entering test case "ReadFromBuffer"
Leaving test case "ReadFromBuffer"; testing time: 490mks
Entering test case "ReadFromStream"
Leaving test case "ReadFromStream"; testing time: 422mks
Leaving test suite "VarNumber"
Entering test suite "NonNegativeInteger"
Entering test case "SizeOf"
Leaving test case "SizeOf"; testing time: 103mks
Entering test case "Write"
Leaving test case "Write"; testing time: 100mks
Entering test case "ReadFromBuffer"
Leaving test case "ReadFromBuffer"; testing time: 271mks
Entering test case "ReadFromStream"
Leaving test case "ReadFromStream"; testing time: 278mks
Leaving test suite "NonNegativeInteger"
Entering test suite "PrintHelpers"
Entering test case "PrintSignatureTypeValue"
/usr/include/boost/lexical_cast/detail/converter_lexical_streams.hpp:238:17: runtime error: downcast of address 0xbf82eaa4 which does not point to an object of type 'basic_unlockedbuf'
0xbf82eaa4: note: object is of type 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
  04 76 3b 14 28 76 3b 14  81 54 a0 b2 81 54 a0 b2  81 54 a0 b2 80 54 a0 b2  8c 54 a0 b2 80 56 a0 b2
              ^~~~~~~~~~~
              vptr for 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
/usr/include/boost/lexical_cast/detail/converter_lexical_streams.hpp:238:17: runtime error: downcast of address 0xbf82eaa4 which does not point to an object of type 'basic_unlockedbuf'
0xbf82eaa4: note: object is of type 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
  04 76 3b 14 28 76 3b 14  81 51 a0 b2 81 51 a0 b2  81 51 a0 b2 80 51 a0 b2  96 51 a0 b2 80 53 a0 b2
              ^~~~~~~~~~~
              vptr for 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
/usr/include/boost/lexical_cast/detail/converter_lexical_streams.hpp:238:17: runtime error: downcast of address 0xbf82eaa4 which does not point to an object of type 'basic_unlockedbuf'
0xbf82eaa4: note: object is of type 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
  04 76 3b 14 28 76 3b 14  81 4e a0 b2 81 4e a0 b2  81 4e a0 b2 80 4e a0 b2  98 4e a0 b2 80 50 a0 b2
              ^~~~~~~~~~~
              vptr for 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
../src/encoding/tlv.cpp:30:3: runtime error: load of value 4294967295, which is not a valid value for type 'SignatureTypeValue'
/usr/include/boost/lexical_cast/detail/converter_lexical_streams.hpp:238:17: runtime error: downcast of address 0xbf82eaa4 which does not point to an object of type 'basic_unlockedbuf'
0xbf82eaa4: note: object is of type 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
  04 76 3b 14 28 76 3b 14  81 4b a0 b2 81 4b a0 b2  81 4b a0 b2 80 4b a0 b2  96 4b a0 b2 80 4d a0 b2
              ^~~~~~~~~~~
              vptr for 'std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >'
Leaving test case "PrintSignatureTypeValue"; testing time: 1079mks
Leaving test suite "PrintHelpers"
Leaving test suite "TestTlv"
Leaving test suite "Encoding"
Leaving test suite "ndn-cxx Tests"

*** No errors detected
#12

Updated by Junxiao Shi about 3 years ago

My benchmark on macOS 10.12 shows that the new code is about 30% slower for 3-octet and larger VAR-NUMBERs.
./waf configure --enable-static --disable-shared --with-tests --without-osx-keychain

| size | offset | c4016,4 avg | c4016,4 stdev | c4017,5 avg | c4017,5 stdev | c4017,5 / c4016,4 |
|------|--------|-------------|--------------------|-------------|--------------------|----------|
| 1 | 0 | 259385800 | 11925135.332565414 | 250835800 | 7100459.787929229 | 0.967038 |
| 3 | 0 | 311677600 | 11584539.947706167 | 423802800 | 5795004.201896665 | 1.35975 |
| 3 | 1 | 303656200 | 12306442.97512486 | 417332600 | 9488965.370365728 | 1.37436 |
| 5 | 0 | 301721000 | 9048607.627696097 | 393108200 | 11236185.26013166 | 1.30289 |
| 5 | 1 | 314792600 | 12129824.413403518 | 397433600 | 9264698.608157743 | 1.26253 |
| 5 | 2 | 306116800 | 8804954.951616732 | 392509000 | 14246129.176025325 | 1.28222 |
| 5 | 3 | 311836600 | 7208830.5084805535 | 396697200 | 10412486.696270013 | 1.27213 |
| 9 | 0 | 329590800 | 8072253.786892481 | 400371600 | 8150640.269328539 | 1.21475 |
| 9 | 1 | 329503400 | 10183242.425671699 | 402694000 | 12135753.314071607 | 1.22212 |
| 9 | 2 | 334964000 | 11775083.48165736 | 418771000 | 9154890.41441786 | 1.2502 |
| 9 | 3 | 336735000 | 20010016.554216042 | 403890400 | 7884189.641301127 | 1.19943 |
| 9 | 4 | 332599800 | 8948226.176175924 | 419815600 | 9103528.81030208 | 1.26222 |
| 9 | 5 | 335047800 | 8720066.467636585 | 401694200 | 10620385.030685093 | 1.19892 |
| 9 | 6 | 332325000 | 10885061.047141628 | 408148800 | 14561003.784080273 | 1.22816 |
| 9 | 7 | 336079400 | 8372866.193843062 | 406360000 | 16991624.304344773 | 1.20912 |

However, I think the change should still be merged:

  • For the 3-octet VAR-NUMBER, which experienced the greatest slowdown ratio, the absolute difference per thousand operations is 0.0022 milliseconds. I estimate that no more than one thousand VAR-NUMBER decodings are needed to process each packet, and 0.0022 ms is too small to worry about.
  • macOS is mainly used on end hosts, not routers, so performance is less critical there.
  • The benchmark in note-5 indicates that the new code is consistently faster for all but 9-octet inputs on Linux, the usual platform for routers. 9-octet VAR-NUMBERs are not used in ndn-cxx: TLV-TYPE is implemented as uint32_t, which encodes to at most 5 octets, and the practical packet size limit of 8800 octets means TLV-LENGTH needs at most 3 octets.
#14

Updated by Davide Pesavento about 3 years ago

Same linux system as note-8, this time using clang-4.0.1

OLD
size=1 offset=0 225139293 nanoseconds
size=3 offset=0 286069693 nanoseconds
size=3 offset=1 285300278 nanoseconds
size=5 offset=0 286577277 nanoseconds
size=5 offset=1 281542178 nanoseconds
size=5 offset=2 286274724 nanoseconds
size=5 offset=3 292053121 nanoseconds
size=9 offset=0 292926056 nanoseconds
size=9 offset=1 299447933 nanoseconds
size=9 offset=2 291830825 nanoseconds
size=9 offset=3 307051689 nanoseconds
size=9 offset=4 292697168 nanoseconds
size=9 offset=5 298853287 nanoseconds
size=9 offset=6 293802120 nanoseconds
size=9 offset=7 298033793 nanoseconds

NEW
size=1 offset=0 222543953 nanoseconds
size=3 offset=0 395634150 nanoseconds
size=3 offset=1 395043022 nanoseconds
size=5 offset=0 364022183 nanoseconds
size=5 offset=1 357715890 nanoseconds
size=5 offset=2 364271467 nanoseconds
size=5 offset=3 357249949 nanoseconds
size=9 offset=0 355419601 nanoseconds
size=9 offset=1 368415047 nanoseconds
size=9 offset=2 357166517 nanoseconds
size=9 offset=3 355884553 nanoseconds
size=9 offset=4 356522603 nanoseconds
size=9 offset=5 361962657 nanoseconds
size=9 offset=6 356480328 nanoseconds
size=9 offset=7 352119360 nanoseconds

The "new code" results with clang are consistent with those obtained with gcc; the notable exception is the size=1 case, which doesn't use memcpy. The "old code" is somewhat faster with clang, which suggests that clang is able to optimize the non-memcpy code better than gcc (and better than the memcpy code).
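The two code shapes under discussion can be sketched as follows (illustrative only, not the ndn-cxx sources; the byte-order macros and __builtin_bswap32 assume a GCC/Clang toolchain):

```cpp
#include <cstdint>
#include <cstring>

// Byte-wise load: no alignment issues; compilers often recognize this
// pattern and emit a single load plus bswap.
inline uint32_t
loadBigEndian32_loop(const uint8_t* p)
{
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i)
    v = (v << 8) | p[i];
  return v;
}

// memcpy-based load: also alignment-safe; an explicit byte swap is needed
// on little-endian hosts to interpret the big-endian wire format.
inline uint32_t
loadBigEndian32_memcpy(const uint8_t* p)
{
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
#if defined(__GNUC__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  v = __builtin_bswap32(v);
#endif
  return v;
}
```

Both functions return the same value for any input; which compiles to faster code depends on how well the optimizer recognizes each idiom, which matches the gcc/clang differences observed above.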

#15

Updated by Alex Afanasyev about 3 years ago

Davide, what's your opinion on this? Proceed with the merge?

#16

Updated by Davide Pesavento about 3 years ago

I'd agree with merging the change.

#17

Updated by Junxiao Shi about 3 years ago

  • Status changed from Code review to Closed
