Task #4095
closed
DNS Lookups on Ubuntu 17.04 Fail Sometimes
100%
Description
IPv6 DNS lookups appear to fail sometimes when running unit tests, depending upon the network environment. In environments where they fail, the test cases in question succeed on second and further runs. The test cases in question are Util/TestDns/AsynchronousV6 and Util/TestDns/AsychronousV4AndV6.
Removing systemd-resolved from the hosts line in /etc/nsswitch.conf appears to fix this issue.
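For reference, the nsswitch.conf workaround above can be sketched as a small script. This is an illustration only: it assumes systemd-resolved is listed on the hosts line under the NSS module name `resolve`, optionally followed by a bracketed action list, and the function name `strip_nss_module` is made up for this example.

```python
import re


def strip_nss_module(hosts_line: str, module: str = "resolve") -> str:
    """Return an nsswitch.conf line with `module` removed.

    Any bracketed action list that directly follows the removed module
    (for example "[!UNAVAIL=return]") is dropped along with it.
    """
    key, _, value = hosts_line.partition(":")
    # Tokens are either a bracketed action list or a bare module name.
    tokens = re.findall(r"\[[^\]]*\]|\S+", value)
    kept = []
    skip_actions = False
    for tok in tokens:
        if tok.startswith("["):
            # An action list belongs to the module that precedes it.
            if not skip_actions:
                kept.append(tok)
            continue
        skip_actions = (tok == module)
        if not skip_actions:
            kept.append(tok)
    return f"{key}: {' '.join(kept)}"
```

The function only rewrites a single line passed to it; applying the result to /etc/nsswitch.conf is left to whoever runs it, and a backup of the file beforehand is advisable.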
Files
Updated by Eric Newberry over 7 years ago
- Blocks Task #4002: Jenkins: Ubuntu 17.04 slave added
Updated by Junxiao Shi over 7 years ago
Can you upload a Wireshark capture of the machine when the problematic unit test is running?
The capture should show port 53 traffic on both IPv4 and IPv6.
Updated by Eric Newberry over 7 years ago
Strangely, I'm not getting this issue anymore. I was only getting it on the University network and not on my home network. However, now when I run the tests from within the University network (on Vagrant), the tests pass.
Updated by Junxiao Shi over 7 years ago
http://jenkins.named-data.net/job/NFD/5409/ failed due to a DNS issue. Can you keep a tcpdump continuously running on the 17.04 slaves, capturing DNS traffic only, so that we have a log whenever the problem appears again?
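A continuous DNS-only capture of the kind requested here could be launched along the following lines. This is a sketch under assumptions: the output path, the hourly rotation interval, and the helper name `build_dns_capture_command` are illustrative and not taken from the actual slave configuration. The filter `port 53` matches both UDP and TCP over IPv4 and IPv6.

```python
import shlex


def build_dns_capture_command(interface="any",
                              out_pattern="/var/tmp/dns-%Y%m%d-%H%M%S.pcap",
                              rotate_seconds=3600):
    """Build a tcpdump argument list that records only port 53 traffic.

    -n  disables name resolution, so tcpdump itself generates no DNS queries
    -G  starts a new output file every `rotate_seconds` seconds
    -w  writes raw packets; the strftime fields are expanded on each rotation
    """
    return [
        "tcpdump",
        "-i", interface,
        "-n",
        "-G", str(rotate_seconds),
        "-w", out_pattern,
        "port", "53",
    ]


def as_shell_string(command):
    """Render the argument list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(arg) for arg in command)
```

The list can be passed to `subprocess.Popen` from a process with capture privileges, or printed with `as_shell_string` and run by hand under sudo.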
Updated by Eric Newberry over 7 years ago
- Subject changed from IPv6 DNS Lookups on Ubuntu 17.04 Fail Sometimes to DNS Lookups on Ubuntu 17.04 Fail Sometimes
It appears that the DNS lookup issues happen when doing reverse lookups (IP address to hostname) during FaceUri canonization and affect IPv4 (and potentially IPv6). The queries appear to be timing out.
Updated by Davide Pesavento over 7 years ago
Eric Newberry wrote:
It appears that the DNS lookup issues happen when doing reverse lookups (IP address to hostname) during FaceUri canonization
AFAIK we don't do any reverse lookups during FaceUri canonization.
Updated by Junxiao Shi over 7 years ago
Is it a problem with the host node, or a problem with Ubuntu 17.04 OS?
I also notice that all slaves on that host node are down at the moment.
I'd suggest ensuring that each OS version has slaves on at least two host nodes. This would expose any host-node-specific issue and also avoid a single point of failure.
UPDATE: Arizona UITS is upgrading its firewall, causing remote nodes to appear offline. But it's still a good idea to spread slaves across two or more sites.
Updated by Junxiao Shi over 7 years ago
- File 5415-enp0s3_nossh.pcap added
- File 5415-lo.pcap added
I did a tcpdump capture on Ubuntu-17.04-64bit-csu-10022.
Unlike previous attempts, all DNS related tests are passing.
Surprisingly, there is no DNS packet seen in tcpdump.
Updated by Eric Newberry over 7 years ago
Junxiao Shi wrote:
Surprisingly, there is no DNS packet seen in tcpdump.
Yes, I noticed this when I did tcpdumps on these nodes yesterday. I believe the DNS queries are being resolved using alternative means, perhaps systemd-resolved. However, systemd-resolved should also use port 53 for its traffic.
Updated by Eric Newberry over 7 years ago
Davide Pesavento wrote:
Eric Newberry wrote:
It appears that the DNS lookup issues happen when doing reverse lookups (IP address to hostname) during FaceUri canonization
AFAIK we don't do any reverse lookups during FaceUri canonization.
Here's some output from the tools unit tests run on Ubuntu 17.04:
../tests/tools/nfdc/face-module.t.cpp(395): error: in "Nfdc/TestFaceModule/CreateCommand/ErrorConflict": check exitCode == 1 has failed [4 != 1]
../tests/tools/nfdc/face-module.t.cpp(397): error: in "Nfdc/TestFaceModule/CreateCommand/ErrorConflict": check err.is_equal("Error 409 when creating face: conflict-409\n") has failed. Output content: "Error when canonizing 'udp://20.53.73.45': Hostname resolution timed out
"
Updated by Davide Pesavento over 7 years ago
I don't have an explanation for that output, but there's no code in ndn-cxx, NFD, or nfdc to perform reverse resolutions, so something else must be going on.
As a side note, IpHostCanonizeProvider could avoid calling dns::asyncResolve() if the host portion of the URI already contains a valid IP address.
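The short-circuit suggested in this side note can be illustrated in a few lines. This is a Python sketch of the idea only, not the ndn-cxx code: `needs_dns_lookup` is a hypothetical helper, and the handling of square brackets around IPv6 literals is an assumption about how the host portion might be written in a URI.

```python
import ipaddress


def needs_dns_lookup(host: str) -> bool:
    """Return False when `host` is already an IPv4 or IPv6 literal.

    A caller could skip the asynchronous resolver for such hosts and use
    the parsed address directly, so no resolver timeout can occur.
    """
    candidate = host
    # IPv6 literals in URIs are commonly wrapped in square brackets.
    if candidate.startswith("[") and candidate.endswith("]"):
        candidate = candidate[1:-1]
    try:
        ipaddress.ip_address(candidate)
    except ValueError:
        return True  # not an IP literal, so a lookup is required
    return False
```

With a check like this, a URI such as 'udp://20.53.73.45' from the failing test output would never reach the resolver.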
Updated by Davide Pesavento over 7 years ago
Updated by Eric Newberry over 7 years ago
I've launched a second 17.04 node at UA and disconnected both 17.04 nodes at CSU. Hopefully, this will resolve the problem, which seems specific to the CSU environment.
Updated by Davide Pesavento over 7 years ago
- Status changed from New to Feedback
Seems to be working for now. Should we close (or reject) this issue?
Updated by Eric Newberry over 7 years ago
Davide Pesavento wrote:
Seems to be working for now. Should we close (or reject) this issue?
Probably not reject since there were changes merged for this issue.
Updated by Eric Newberry over 7 years ago
- Status changed from Feedback to Closed
It appears that this issue was potentially fixed by the merged changes and any remaining issues are possibly specific to the CSU environment. Closing for now.