Bug #2768: Misaligned memory accesses in TLV decoding (and probably elsewhere) - NFD - NDN project issue tracking system

Actions

Copy link

Bug #2768

closed

Misaligned memory accesses in TLV decoding (and probably elsewhere)

Added by Davide Pesavento about 10 years ago. Updated about 8 years ago.

Status:

Rejected

Priority:

Low

Assignee:

Category:

Faces

Target version:

Start date:

Due date:

% Done:

Estimated time:

Description

../tests/daemon/face/ndnlp.t.cpp:131:20: runtime error: load of misaligned address 0x000004992b32 for type 'const uint64_t' (aka 'const unsigned long'), which requires 8 byte alignment
0x000004992b32: note: pointer points here
 02 8c  51 08 00 00 00 00 00 00  00 03 52 01 03 53 01 04  54 fd 02 78 cc cc cc cc  cc cc cc cc cc cc
              ^ 
../daemon/face/ndnlp-data.cpp:56:16: runtime error: load of misaligned address 0x00000498bf44 for type 'const uint64_t' (aka 'const unsigned long'), which requires 8 byte alignment
0x00000498bf44: note: pointer points here
  50 4a 51 08 00 00 00 00  00 00 00 00 54 3e 01 3c  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc
              ^

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Davide Pesavento about 10 years ago

You can reproduce using CXX=clang++ (clang version 3.5.0), CXXFLAGS=-fsanitize=undefined, LINKFLAGS=-fsanitize=undefined.

Actions

Copy link

Updated by Junxiao Shi about 10 years ago

Subject changed from NDNLP misaligned memory access to FaceNdnlp/Slice4 test case: misaligned memory access
Target version deleted (~~v0.4~~)

I can confirm that misaligned read access in this test case is intentional.

Please note that clang on non-OSX platform is unsupported.

Actions

Copy link

Updated by Davide Pesavento about 10 years ago

Junxiao Shi wrote:

I can confirm that misaligned read access in this test case is intentional.

I don't particularly care about the test case, but the same problem occurs in NdnlpData::fromBlock()

  parsed.seq = be64toh(*reinterpret_cast<const uint64_t*>(&*sequenceElement.value_begin()));

This is a performance sensitive code path and unaligned accesses hurt performance on some platforms. On some CPU architectures unaligned accesses are outright illegal and will abort execution.

Please note that clang on non-OSX platform is unsupported.

Here I used clang just to detect the problem, which exists in the code regardless of the compiler used.

Actions

Copy link

Updated by Junxiao Shi about 10 years ago

the same problem occurs in NdnlpData::fromBlock()
This is a performance sensitive code path and unaligned accesses hurt performance on some platforms.

This is a design limitation of NDNLPv1. Unaligned access is intentional.

I don't think eight single-octet reads (plus arithmetic operations) would be faster than one misaligned dword read.

NDN-TLV has the same design limitation in VAR-NUMBER.

On some CPU architectures unaligned accesses are outright illegal and will abort execution.

Maybe, but they are unsupported.

If C++ standard doesn't forbid unaligned access, it's compiler's responsibility to emit instructions that emulate an unaligned read by doing multiple single-octet reads.

Actions

Copy link

Updated by Junxiao Shi about 10 years ago

Subject changed from FaceNdnlp/Slice4 test case: misaligned memory access to NdnlpData::fromBlock misaligned memory access

Actions

Copy link

Updated by Davide Pesavento about 10 years ago

Junxiao Shi wrote:

This is a design limitation of NDNLPv1. Unaligned access is intentional.

It has nothing to do with the protocol (except for the fact that the wire format could be optimized for faster decoding). Unaligned access is a bug in my opinion, and UBSan considers it undefined behavior. Are you saying that there's an intentional bug in the code?

I don't think eight single-octet reads (plus arithmetic operations) would be faster than one misaligned dword read.

Depending on the processor architecture, yes, it can be much faster.

Anyway, you don't need to do single-byte reads, you can use memcpy which the compiler will be able to optimize appropriately in most cases if the target platform supports efficient unaligned accesses.

NDN-TLV has the same design limitation in VAR-NUMBER.

Again, maybe the wire format could be optimized, but it's not a design problem. Nothing in the design says that you have to perform a misaligned memory access.

If C++ standard doesn't forbid unaligned access, it's compiler's responsibility to emit instructions that emulate an unaligned read by doing multiple single-octet reads.

Where did you get this idea from? Anything not specified by the standard is, by definition, unspecified, i.e. each compiler is free to do whatever it wants, therefore you cannot rely on it. Moreover, memory layout and alignment issues are not part of the C++ language specification, but are covered by ABI definitions.

Actions

Copy link

Updated by Junxiao Shi about 10 years ago

Assignee deleted (~~Junxiao Shi~~)

Does unaligned memory access always cause bus errors? answer points out that:
unaligned access
(1) maybe slower
(2) is not supported on some platforms
(3) can be emulated by compiler when requested with a significant performance penalty (but it's not always enabled).

Daniel Lemire, Data alignment for speed: myth or reality? does some measurements and sees no measurable difference on Core i7 processor, but a 10% penalty on cheaper Core 2 processor.

This article also points out that unaligned access is supported on x86 and x64 processors, and newer ARM processors.

I did a benchmark with unaligned access vs memcpy method proposed in note-6.

// g++ -o y -std=c++0x y.cpp $(pkg-config --cflags --libs libndn-cxx)
#include <ndn-cxx/util/time.hpp>

using namespace ndn;

int
main(int argc, char** argv)
{
  const int SZ = 1048576;
  uint8_t arr[SZ * 4 + 3];
  std::memset(arr, 1, SZ * 4 + 3);

  int offset = atoi(argv[1]);
  uint32_t* rep = reinterpret_cast<uint32_t*>(arr + offset);

  auto t1 = time::steady_clock::now();

  int c = 0;
  for (int i = 0; i < SZ; ++i) {
    c += rep[i];
  }

  auto t2 = time::steady_clock::now();
  std::cout << "x " << offset << " " << (t2 - t1) << " " << c << "\n";

  return 0;
}

// g++ -o y -std=c++0x y.cpp $(pkg-config --cflags --libs libndn-cxx)
#include <ndn-cxx/util/time.hpp>

using namespace ndn;

int
main(int argc, char** argv)
{
  const int SZ = 1048576;
  uint8_t arr[SZ * 4 + 3];
  std::memset(arr, 1, SZ * 4 + 3);

  int offset = atoi(argv[1]);
  uint32_t* rep = reinterpret_cast<uint32_t*>(arr + offset);

  auto t1 = time::steady_clock::now();

  int c = 0;
  uint32_t n;
  for (int i = 0; i < SZ; ++i) {
    std::memcpy(&n, &rep[i], sizeof(uint32_t));
    c += n;
  }

  auto t2 = time::steady_clock::now();
  std::cout << "y " << offset << " " << (t2 - t1) << " " << c << "\n";

  return 0;
}

I ran the benchmark with this command:

for I in $(seq 20); do for OFFSET in $(seq 0 3); do ./x $OFFSET; ./y $OFFSET; done; done

The results on x64 are:

offset	unaligned access	memcpy
0	3233526	5971367
1	3573664	5951130
2	3388489	5935154
3	3384905	6038024

Clearly, using memcpy would be much slower than keeping unaligned access.

Therefore, I disagree with doing anything for now.

When there is a supported platform that does not support unaligned access, we could enable memcpy using conditional compilation.

Actions

Copy link

Updated by Davide Pesavento about 10 years ago

Junxiao Shi wrote:

(3) can be emulated by compiler when requested with a significant performance penalty (but it's not always enabled).

For completeness: (4) can be emulated by the OS kernel with an even higher performance penalty.

This article also points out that unaligned access is supported on x86 and x64 processors, and newer ARM processors.

Also on PowerPC if I remember correctly. On ARM, I don't know what the performance penalty is though.

Clearly, using memcpy would be much slower than keeping unaligned access.

On x86/x86_64, yes, it's a known fact. The Linux kernel has get_unaligned/put_unaligned macros that are used when unaligned access is unavoidable, and on x86 they just do the unaligned load/store because everything else would be slower.

Therefore, I disagree with doing anything for now.

When there is a supported platform that does not support unaligned access, we could enable memcpy using conditional compilation.

Fair enough.

Actions

Copy link

Updated by Junxiao Shi over 9 years ago

NDNLPv1 implementation is being deleted in #3170. This issue can close after #3170 is complete.

Actions

Copy link

#10

Updated by Davide Pesavento over 9 years ago

Yes and no. NDNLPv1 will be deleted, but unaligned memory accesses occur in many other places that perform TLV parsing. So move this ticket to ndn-cxx if you want.

Actions

Copy link

#11

Updated by Davide Pesavento almost 9 years ago

Subject changed from NdnlpData::fromBlock misaligned memory access to Misaligned memory accesses in TLV decoding (and probably elsewhere)
Priority changed from Normal to Low

Actions

Copy link

#12

Updated by Junxiao Shi about 8 years ago

Status changed from New to Rejected

unaligned memory accesses occur in many other places that perform TLV parsing.

NDNLPv1 is gone. I'm rejecting this because the issue description is written specifically for NDNLPv1.
If there are other places with the same problem, open an new bug with proper description.

Actions

Copy link

#13