
Bug #4479

closed

Crash in LpReliability::deleteUnackedFrag

Added by susmit shannigrahi about 6 years ago. Updated about 6 years ago.

Status: Closed
Priority: High
Assignee:
Category: Faces
Target version:
Start date:
Due date:
% Done: 100%
Estimated time:

Description

I am trying to transfer a reasonably sized file (10 MB) over an Ethernet unicast face using putchunks and catchunks.
The putchunks segment size is 8800 bytes, and the interface MTU is 9000.
The configuration is the same as in #4012-10.

NFD receives a SIGSEGV, and the host is not out of memory.
The GDB backtrace is below and attached.

#0  0x00007ffff4f68342 in std::_Rb_tree_increment(std::_Rb_tree_node_base const*) () from /lib64/libstdc++.so.6
#1  0x0000000000474029 in std::_Rb_tree_iterator<std::pair<unsigned long const, nfd::face::LpReliability::UnackedFrag> >::operator++ (this=<synthetic pointer>)
    at /usr/include/c++/6.3.1/bits/stl_tree.h:209
#2  std::_Rb_tree<unsigned long, std::pair<unsigned long const, nfd::face::LpReliability::UnackedFrag>, std::_Select1st<std::pair<unsigned long const, nfd::face::LpReliability::UnackedFrag> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, nfd::face::LpReliability::UnackedFrag> > >::erase[abi:cxx11](std::_Rb_tree_iterator<std::pair<unsigned long const, nfd::face::LpReliability::UnackedFrag> >) (__position=..., this=0x7030f8) at /usr/include/c++/6.3.1/bits/stl_tree.h:1046
#3  std::map<unsigned long, nfd::face::LpReliability::UnackedFrag, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, nfd::face::LpReliability::UnackedFrag> > >::erase[abi:cxx11](std::_Rb_tree_iterator<std::pair<unsigned long const, nfd::face::LpReliability::UnackedFrag> >) (__position=..., this=0x7030f8)
    at /usr/include/c++/6.3.1/bits/stl_map.h:951
#4  nfd::face::LpReliability::deleteUnackedFrag (this=this@entry=0x7030a0, fragIt=...) at ../daemon/face/lp-reliability.cpp:327
#5  0x000000000047671a in nfd::face::LpReliability::onLpPacketLost (this=this@entry=0x7030a0, txSeqIt=...) at ../daemon/face/lp-reliability.cpp:248
#6  0x00000000004795fb in nfd::face::LpReliability::processIncomingPacket (this=this@entry=0x7030a0, pkt=...) at ../daemon/face/lp-reliability.cpp:132
#7  0x0000000000459932 in nfd::face::GenericLinkService::doReceivePacket(nfd::face::Transport::Packet&&) (this=0x702f00,
    packet=<unknown type in /usr/local/bin/nfd, CU 0x20a911, DIE 0x266d97>) at ../daemon/face/generic-link-service.cpp:279
#8  0x000000000057b6e4 in nfd::face::EthernetTransport::receivePayload (this=this@entry=0x6f17d0,
    payload=payload@entry=0x1a19dee "dM\375\003D\001\302\375\003D\001\303\375\003D\001\304\375\003D\001\305\375\003D\001\306\375\003D\001\307\375\003H\002\023+P'\005%\a\030\b\004test\b\001\062\b", <incomplete sequence \375>, length=length@entry=79, sender=...) at ../daemon/face/ethernet-transport.cpp:199
#9  0x000000000057e87a in nfd::face::EthernetTransport::handleRead (this=0x6f17d0, error=...) at ../daemon/face/ethernet-transport.cpp:163
#10 0x000000000058107a in boost::_mfi::mf1<void, nfd::face::EthernetTransport, boost::system::error_code const&>::operator() (a1=..., p=<optimized out>, this=<optimized out>)
    at /usr/include/boost/bind/mem_fn_template.hpp:165
#11 boost::_bi::list2<boost::_bi::value<nfd::face::EthernetTransport*>, boost::arg<1> (*)()>::operator()<boost::_mfi::mf1<void, nfd::face::EthernetTransport, boost::system::error_code const&>, boost::_bi::rrlist2<boost::system::error_code const&, unsigned long const&> > (a=<synthetic pointer>..., f=..., this=0x7fffffffca60)
    at /usr/include/boost/bind/bind.hpp:319
#12 boost::_bi::bind_t<void, boost::_mfi::mf1<void, nfd::face::EthernetTransport, boost::system::error_code const&>, boost::_bi::list2<boost::_bi::value<nfd::face::EthernetTransport*>, boost::arg<1> (*)()> >::operator()<boost::system::error_code const&, unsigned long const&> (a2=@0x7fffffffca78: 0, a1=..., this=0x7fffffffca50)
    at /usr/include/boost/bind/bind.hpp:1246
#13 boost::asio::detail::binder2<boost::_bi::bind_t<void, boost::_mfi::mf1<void, nfd::face::EthernetTransport, boost::system::error_code const&>, boost::_bi::list2<boost::_bi::value<nfd::face::EthernetTransport*>, boost::arg<1> (*)()> >, boost::system::error_code, unsigned long>::operator() (this=0x7fffffffca50)
    at /usr/include/boost/asio/detail/bind_handler.hpp:127
#14 boost::asio::asio_handler_invoke<boost::asio::detail::binder2<boost::_bi::bind_t<void, boost::_mfi::mf1<void, nfd::face::EthernetTransport, boost::system::error_code const&>, boost::_bi::list2<boost::_bi::value<nfd::face::EthernetTransport*>, boost::arg<1> (*)()> >, boost::system::error_code, unsigned long> > (function=...)
    at /usr/include/boost/asio/handler_invoke_hook.hpp:69
#15 boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder2<boost::_bi::bind_t<void, boost::_mfi::mf1<void, nfd::face::EthernetTransport, boost::system::error_code const&>, boost::_bi::list2<boost::_bi::value<nfd::face::EthernetTransport*>, boost::arg<1> (*)()> >, boost::system::error_code, unsigned long>, boost::_bi::bind_t<void, boost::_mfi::mf1<void, nfd::face::EthernetTransport, boost::system::error_code const&>, boost::_bi::list2<boost::_bi::value<nfd::face::EthernetTransport*>, boost::arg<1> (*)()> > > (context=..., function=...) at /usr/include/boost/asio/detail/handler_invoke_helpers.hpp:37
#16 boost::asio::detail::reactive_null_buffers_op<boost::_bi::bind_t<void, boost::_mfi::mf1<void, nfd::face::EthernetTransport, boost::system::error_code const&>, boost::_bi::list2<boost::_bi::value<nfd::face::EthernetTransport*>, boost::arg<1> (*)()> > >::do_complete (owner=0x6ae630, base=<optimized out>)
    at /usr/include/boost/asio/detail/reactive_null_buffers_op.hpp:75
#17 0x00000000004430e6 in boost::asio::detail::task_io_service_operation::complete (bytes_transferred=0, ec=..., owner=..., this=<optimized out>)
    at /usr/include/boost/asio/detail/task_io_service_operation.hpp:38
#18 boost::asio::detail::epoll_reactor::descriptor_state::do_complete (owner=0x6ae630, base=0x706a70, ec=..., bytes_transferred=<optimized out>)
    at /usr/include/boost/asio/detail/impl/epoll_reactor.ipp:651
#19 0x00000000004442a7 in boost::asio::detail::task_io_service_operation::complete (bytes_transferred=1, ec=..., owner=..., this=<optimized out>)
    at /usr/include/boost/asio/detail/task_io_service_operation.hpp:38
#20 boost::asio::detail::task_io_service::do_run_one (ec=..., this_thread=..., lock=..., this=0x6ae630) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:372
#21 boost::asio::detail::task_io_service::run (this=0x6ae630, ec=...) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:149
#22 0x000000000044adfe in boost::asio::io_service::run (this=0x6ae5c0) at /usr/include/boost/asio/impl/io_service.ipp:59
#23 nfd::NfdRunner::run (this=this@entry=0x7fffffffd250) at ../daemon/main.cpp:143
#24 0x000000000042e8cc in main (argc=<optimized out>, argv=<optimized out>) at ../daemon/main.cpp:291

Files

crash (6.25 KB): GDB bt, uploaded by susmit shannigrahi, 01/26/2018 08:55 PM

#1

Updated by Junxiao Shi about 6 years ago

  • Description updated (diff)

Please provide:

  • tcpdump (not ndndump) trace
  • bt full (not bt) output

See http://www.lists.cs.ucla.edu/pipermail/nfd-dev/2016-May/001748.html .

#2

Updated by Davide Pesavento about 6 years ago

  • Subject changed from Data transfer over ethernet unicast face crashes NFD to Crash in LpReliability::deleteUnackedFrag
  • Description updated (diff)
  • Category set to Faces

#3

Updated by Junxiao Shi about 6 years ago

  • Assignee set to Eric Newberry
  • Target version set to v0.7

Assigning to Eric, who authored the LpReliability class. Susmit still needs to upload the information requested in note-1.

#4

Updated by Eric Newberry about 6 years ago

@susmit shannigrahi what release/git revision are you using?

#5

Updated by susmit shannigrahi about 6 years ago

Eric Newberry wrote:

@susmit shannigrahi what release/git revision are you using?

Hi Eric, I am using master. NFD revision is 97a010178c0bd340babb3dfabcc6f8f6abbaef42.

I am trying to reproduce the bug.

#6

Updated by susmit shannigrahi about 6 years ago

I am currently unable to reproduce the bug. I will update if I find it again.
Do you think the existing trace would be enough to debug this?

#7

Updated by Davide Pesavento about 6 years ago

susmit shannigrahi wrote:

Do you think the existing trace would be enough to debug this?

Possibly, but I'm really unfamiliar with the LpReliability code. @Eric, did you have time to look into this?

#8

Updated by Eric Newberry about 6 years ago

Davide Pesavento wrote:

susmit shannigrahi wrote:

Do you think the existing trace would be enough to debug this?

Possibly, but I'm really unfamiliar with the LpReliability code. @Eric, did you have time to look into this?

I briefly did, and I think it's likely caused by incorrect tracking of one of the iterators (or one of the pointers to an iterator) in the LpReliability class.

#9

Updated by Davide Pesavento about 6 years ago

susmit shannigrahi wrote:

The segment size for putchunks is 8800, MTU on the interface is 9000.

Are you sure about this? I found a fairly serious bug in LpReliability when used in conjunction with fragmentation, but the above settings imply that packets never get fragmented...

#10

Updated by Eric Newberry about 6 years ago

Davide Pesavento wrote:

susmit shannigrahi wrote:

The segment size for putchunks is 8800, MTU on the interface is 9000.

Are you sure about this? I found a fairly serious bug in LpReliability when used in conjunction with fragmentation, but the above settings imply that packets never get fragmented...

Where did you find this other bug? Has it been fixed?

#11

Updated by Davide Pesavento about 6 years ago

  • Priority changed from Normal to High
  • % Done changed from 0 to 20

The following assertion in LpReliability is wrong:

void
LpReliability::onLpPacketLost(UnackedFrags::iterator txSeqIt)
{
  BOOST_ASSERT(m_unackedFrags.count(txSeqIt->first) > 0);
  ...
}

This test case https://gerrit.named-data.net/c/4590/1/tests/daemon/face/lp-reliability.t.cpp#658 triggers the aforementioned assertion when built in debug mode, and will probably crash in release mode (where assertions are disabled).

#12

Updated by Davide Pesavento about 6 years ago

Davide Pesavento wrote:

The following assertion in LpReliability is wrong:

That was rather cryptic, so let me clarify. What I meant is that the code makes an incorrect assumption, and that assumption is "checked" by the aforementioned assertion.

More specifically, onLpPacketLost assumes that txSeqIt is a valid iterator into the m_unackedFrags map. This does not hold if, during the same rx event (i.e., within a single processIncomingPacket invocation), another fragment belonging to the same net-layer packet was also deemed lost and exceeded its max retx count. In that case, all fragments of that net-layer packet are removed from the map, so one or more of the iterators returned by findLostLpPackets may become dangling.
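The failure mode described above can be illustrated in miniature. The following is a hypothetical, heavily simplified sketch, not NFD's LpReliability code and not the fix that was merged: UnackedFrag, UnackedFrags, MAX_RETX, dropNetPacket and onLostFragments are illustrative names, and the map is assumed to be keyed by TxSequence (the backtrace shows a std::map<unsigned long, UnackedFrag>). Erasing from a std::map invalidates only iterators to the erased elements, so the danger is specifically a pre-collected list of iterators that contains siblings erased in the same pass. One way to avoid it is to collect TxSequence keys instead and re-validate each one with find() before using it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// HYPOTHETICAL, simplified stand-in for LpReliability's bookkeeping.
// Not NFD code: all names here are illustrative.
struct UnackedFrag
{
  std::uint64_t netPkt; // id of the network-layer packet this fragment belongs to
  int nRetx;            // retransmissions performed so far
};

// Assumed to be keyed by TxSequence, like m_unackedFrags.
using UnackedFrags = std::map<std::uint64_t, UnackedFrag>;

constexpr int MAX_RETX = 3; // hypothetical retransmission limit

// Erase every fragment of one network-layer packet (what happens when any of
// its fragments exceeds MAX_RETX). Iterators to the erased entries dangle.
void
dropNetPacket(UnackedFrags& frags, std::uint64_t netPkt)
{
  for (auto it = frags.begin(); it != frags.end();) {
    if (it->second.netPkt == netPkt)
      it = frags.erase(it); // returns the iterator following the erased one
    else
      ++it;
  }
}

// Handle all fragments deemed lost in one rx event. Takes TxSequence keys,
// not iterators, and re-validates each key with find(), so a fragment that a
// sibling already removed is skipped instead of being dereferenced.
// Returns how many fragments would be retransmitted.
std::size_t
onLostFragments(UnackedFrags& frags, const std::vector<std::uint64_t>& lostSeqs)
{
  std::size_t nRetransmitted = 0;
  for (std::uint64_t seq : lostSeqs) {
    auto it = frags.find(seq);
    if (it == frags.end())
      continue; // already erased together with a sibling fragment
    if (++it->second.nRetx > MAX_RETX) {
      dropNetPacket(frags, it->second.netPkt); // netPkt is copied; `it` dangles afterwards
      continue;
    }
    ++nRetransmitted; // a real implementation would resend the fragment here
  }
  return nRetransmitted;
}
```

With fragments 1 and 2 belonging to the same net-layer packet and both already at the limit, handling key 1 erases both entries, so key 2 is skipped. Had the function received a pre-collected std::vector<UnackedFrags::iterator> instead, the entry for key 2 would be dangling at that point, and dereferencing or erasing through it is undefined behavior. That is consistent with the crash inside std::map::erase in the backtrace above, and it also means the BOOST_ASSERT, which itself dereferences txSeqIt, cannot reliably detect the problem.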

#13

Updated by susmit shannigrahi about 6 years ago

Eric Newberry wrote:

Davide Pesavento wrote:

susmit shannigrahi wrote:

The segment size for putchunks is 8800, MTU on the interface is 9000.

Are you sure about this? I found a fairly serious bug in LpReliability when used in conjunction with fragmentation, but the above settings imply that packets never get fragmented...

It has been a few weeks, so no, I'm not very sure.
I will try to trigger the behavior you described.

#14

Updated by Eric Newberry about 6 years ago

  • Status changed from New to In Progress

#15

Updated by Eric Newberry about 6 years ago

  • Status changed from In Progress to Code review
  • % Done changed from 20 to 100

#16

Updated by Eric Newberry about 6 years ago

  • Status changed from Code review to Closed