Debugging Neutron DHCP Failures on OpenStack
Hi All. Today’s adventure touches on a common OpenStack Neutron theme: “I just created a VM. It’s running. It says it has an IP address. And I can’t ping it!!” Read on for tips on debugging Neutron DHCP Failures on OpenStack that we use in our lab.
The Problem with Neutron
Here is the deal: I created a VM against a standard network and it had an assigned IP as follows:
[l.abruce@co1 rc_scripts(lvosksclu100-rc-admin)]$ nova --os-tenant-name=infrastructure show testnew +--------------------------------------+----------------------------------------------------------------+ | Property | Value | +--------------------------------------+----------------------------------------------------------------+ [...output cut...] | infrastructure-net network | 10.0.0.10 | [...output cut...] +--------------------------------------+----------------------------------------------------------------+
You see that, right? 10.0.0.10? That’s expected as I’m using per-tenant private networking. But on the actual running VM, I had a teensy problem: DHCP was timing out. Even though OpenStack thought my VM had that 10.0.0.10 IP address, my running VM didn’t actually get it.
Bummer.
So let’s look at what happens. In my setup (using Neutron GRE) the flow is:
**PHYSICAL Nova Compute HOST*** VM -> qbridge host -> tap device -> br-int -> br-tun -> [physical switch] -> **Neutron Controller (running as a VM)** br-tun -> br-int -> [DHCP namespace] -> DHCP process (dnsmasq)
The problem could be happening at any point in the mix, but I have other VMs that I can start / restart just fine. So let’s look explicitly at the DHCP server.
Locate the DHCP Namespace
Notice above where I used [DHCP namespace] as an identifier. To permit network isolation, Neutron creates a separate network namespace so that each per-tenant network can have its own router, its own DHCP server, and so on. To locate this namespace, first find out your network ID; here’s a command for you:
[l.abruce@co1 rc_scripts(lvosksclu100-rc-admin)]$ nova network-list | grep " infrastructure-net " | 88513920-091b-4bee-a47c-5636593c69f5 | infrastructure-net | None |
So my network ID is 88513920-091b-4bee-a47c-5636593c69f5. Armed with that information, let’s go to the Neutron Controller node and look for the DHCP namespace:
[root@lvosneutr100 ~]# ip netns | grep 88513920-091b-4bee-a47c-5636593c69f5 qdhcp-88513920-091b-4bee-a47c-5636593c69f5
Woohoo, we found it. Now let’s look at the network interfaces associated with that namespace:
[root@lvosneutr100 ~]# ip netns exec qdhcp-88513920-091b-4bee-a47c-5636593c69f5 ip a 36: tap315b3aa8-c3:mtu 1500 qdisc noqueue state UNKNOWN link/ether fa:16:3e:86:d8:5a brd ff:ff:ff:ff:ff:ff inet 10.0.0.3/24 brd 10.0.0.255 scope global tap315b3aa8-c3 inet6 fe80::f816:3eff:fe86:d85a/64 scope link valid_lft forever preferred_lft forever 
We’ve got our DHCP namespace and we know the network interface, so we’re closing in on the problem.
Watch DHCP Traffic
Now, we don’t just want to watch *all* DHCP traffic, only the traffic specific to our failing tenant network (with the failing VM on it). Let’s go over what we know so far:
- Network ID: 88513920-091b-4bee-a47c-5636593c69f5
- DHCP Namespace: qdhcp-88513920-091b-4bee-a47c-5636593c69f5
- Associated Network Interface: tap315b3aa8-c3
Now let’s take a look at the traffic:
[root@lvosneutr100 ~]# ip netns exec qdhcp-88513920-091b-4bee-a47c-5636593c69f5 tcpdump -i tap315b3aa8-c3 -nnvvvNXs 512 tcpdump: listening on tap315b3aa8-c3, link-type EN10MB (Ethernet), capture size 512 bytes
I’m not going into the murky depths of the ip netns exec command mode, nor into the help pages of tcpdump. Suffice to say we’re catching all traffic going to the DHCP server. Let’s take a look at some correct interactions from a working VM:
14:36:07.056008 IP (tos 0x10, ttl 128, id 0, offset 0, flags [none], proto UDP (17), length 328)
    0.0.0.0.68 > 255.255.255.255.67: [udp sum ok] BOOTP/DHCP, Request from fa:16:3e:e3:7f:4c, length 300, xid 0x66d35574, Flags [none] (0x0000)
	  Client-Ethernet-Address fa:16:3e:e3:7f:4c
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Request
	    Requested-IP Option 50, length 4: 10.0.0.7
	    Parameter-Request Option 55, length 13: 
	      Subnet-Mask, BR, Time-Zone, Classless-Static-Route
	      Domain-Name, Domain-Name-Server, Hostname, YD
	      YS, NTP, MTU, Option 119
	      Default-Gateway
	    END Option 255, length 0
	    PAD Option 0, length 0, occurs 35
	0x0000:  4510 0148 0000 0000 8011 3996 0000 0000  E..H......9.....
[...output cut...]
	0x0140:  0000 0000 0000 0000                      ........
14:36:07.062052 IP (tos 0x0, ttl 64, id 11420, offset 0, flags [none], proto UDP (17), length 351)
    10.0.0.3.67 > 10.0.0.7.68: [bad udp cksum 2a24!] BOOTP/DHCP, Reply, length 323, xid 0x66d35574, Flags [none] (0x0000)
	  Your-IP 10.0.0.7
	  Server-IP 10.0.0.3
	  Client-Ethernet-Address fa:16:3e:e3:7f:4c
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: ACK
	    Server-ID Option 54, length 4: 10.0.0.3
	    Lease-Time Option 51, length 4: 86400
	    RN Option 58, length 4: 43200
	    RB Option 59, length 4: 75600
	    Subnet-Mask Option 1, length 4: 255.255.255.0
	    BR Option 28, length 4: 10.0.0.255
	    Domain-Name Option 15, length 14: "openstacklocal"
	    Hostname Option 12, length 13: "host-10-0-0-7"
	    Default-Gateway Option 3, length 4: 10.0.0.1
	    Domain-Name-Server Option 6, length 4: 192.168.1.2
	    END Option 255, length 0
	0x0000:  4500 015f 2c9c 0000 4011 38e9 0a00 0003  E.._,...@.8.....
[...output cut...]
	0x0150:  2d37 0304 0a00 0001 0604 c0a8 0102 ff    -7.............
14:36:07.084041 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.0.7 (ff:ff:ff:ff:ff:ff) tell 0.0.0.0, length 28
	0x0000:  0001 0800 0604 0001 fa16 3ee3 7f4c 0000  ..........>..L..
	0x0010:  0000 ffff ffff ffff 0a00 0007            ............
The above shows the DHCP request coming from the VM, and the DHCP server responding with IP 10.0.0.7 (and a bunch of other information). So let’s take a look at the failing node:
14:36:32.229699 IP (tos 0x10, ttl 128, id 0, offset 0, flags [none], proto UDP (17), length 328)
    0.0.0.0.68 > 255.255.255.255.67: [udp sum ok] BOOTP/DHCP, Request from fa:16:3e:3e:e9:00, length 300, xid 0x98693309, Flags [none] (0x0000)
	  Client-Ethernet-Address fa:16:3e:3e:e9:00
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Discover
	    Parameter-Request Option 55, length 13: 
	      Subnet-Mask, BR, Time-Zone, Classless-Static-Route
	      Domain-Name, Domain-Name-Server, Hostname, YD
	      YS, NTP, MTU, Option 119
	      Default-Gateway
	    END Option 255, length 0
	    PAD Option 0, length 0, occurs 41
	0x0000:  4510 0148 0000 0000 8011 3996 0000 0000  E..H......9.....
[...output cut...]
	0x0140:  0000 0000 0000 0000                      ........
Hmm, there’s no response. Why could that be? Let’s keep going!
Tear into the DHCP Server
The real question now is: what process is running that DHCP server? Because the process is tied to the namespace interface tap315b3aa8-c3, let’s try looking for that in the process list:
[root@lvosneutr100 ~]# ps -efa | grep tap315b3aa8-c3 nobody 4689 1 0 Oct23 ? 00:00:03 dnsmasq --no-hosts --no-resolv --strict-order --bind-interfaces --interface=tap315b3aa8-c3 --except-interface=lo --pid-file=/var/lib/neutron/dhcp/88513920-091b-4bee-a47c-5636593c69f5/pid --dhcp-hostsfile=/var/lib/neutron/dhcp/88513920-091b-4bee-a47c-5636593c69f5/host --addn-hosts=/var/lib/neutron/dhcp/88513920-091b-4bee-a47c-5636593c69f5/addn_hosts --dhcp-optsfile=/var/lib/neutron/dhcp/88513920-091b-4bee-a47c-5636593c69f5/opts --leasefile-ro --dhcp-range=tag0,10.0.0.0,static,86400s --dhcp-lease-max=256 --conf-file= --domain=openstacklocal
We have our answer: dnsmasq is the DHCP server. While I won’t cover the gory details of dnsmasq here, let’s take a look at the arguments passed into that program. Specifically, let’s look at the hosts defined to the server:
[root@lvosneutr100 ~]# cat /var/lib/neutron/dhcp/88513920-091b-4bee-a47c-5636593c69f5/host fa:16:3e:c6:d4:c9,host-10-0-0-2.openstacklocal,10.0.0.2 fa:16:3e:86:d8:5a,host-10-0-0-3.openstacklocal,10.0.0.3 fa:16:3e:50:de:2a,host-10-0-0-5.openstacklocal,10.0.0.5 fa:16:3e:91:d3:6e,host-10-0-0-1.openstacklocal,10.0.0.1 fa:16:3e:e3:7f:4c,host-10-0-0-7.openstacklocal,10.0.0.7 fa:16:3e:0b:21:d1,host-10-0-0-9.openstacklocal,10.0.0.9 fa:16:3e:1f:42:15,host-10-0-0-6.openstacklocal,10.0.0.6 fa:16:3e:c6:e1:48,host-10-0-0-4.openstacklocal,10.0.0.4
And with that…believe it or not…we just found Our Answer. Which deserves another section ๐
The Missing Link
If you look at the output above, you will see that every tenant-private IP address is associated with a MAC address. The way that Neutron configures dnsmasq, if there isn’t a corresponding entry in the hosts file, then no answer will be returned to queries from that MAC.
The MAC address of our failing VM is available from the tcpdump output: fa:16:3e:3e:e9:00. Do you see that MAC anywhere in the hosts file? I don’t!
So the problem is that Neutron is not creating the entry for us in the hosts file. Let’s see if we can give Neutron a Kick by stopping the DHCP service:
[root@lvosneutr100 ~]# service neutron-dhcp-agent stop Stopping neutron-dhcp-agent: [ OK ]
You might reasonably expect Neutron to stop any running DHCP server processes when the Neutron DHCP Agent is stopped. But, you’d be wrong: a known OpenStack bug highlights this problem that the dnsmasq processes aren’t killed when the Neutron DHCP Agent is stopped. That could be a feature rather than a bug, but it’s a problem for us because the Neutron DHCP Agent will only restart dnsmasq for a tenant network…if it isn’t already running. So let’s kill the dnsmasq process:
[root@lvosneutr100 ~]# kill 4689 [root@lvosneutr100 ~]# ps -efa | grep tap315b3aa8-c3 root 20013 13156 0 14:48 pts/0 00:00:00 grep tap315b3aa8-c3
Good, the process is gone. Now let’s restart the Neutron DHCP Agent and check that dnsmasq is restarted on our DHCP interface:
[root@lvosneutr100 ~]# service neutron-dhcp-agent start Starting neutron-dhcp-agent: [ OK ] [root@lvosneutr100 ~]# ps -efa | grep tap315b3aa8-c3 nobody 20212 1 0 14:48 ? 00:00:00 dnsmasq --no-hosts --no-resolv --strict-order --bind-interfaces --interface=tap315b3aa8-c3 --except-interface=lo --pid-file=/var/lib/neutron/dhcp/88513920-091b-4bee-a47c-5636593c69f5/pid --dhcp-hostsfile=/var/lib/neutron/dhcp/88513920-091b-4bee-a47c-5636593c69f5/host --addn-hosts=/var/lib/neutron/dhcp/88513920-091b-4bee-a47c-5636593c69f5/addn_hosts --dhcp-optsfile=/var/lib/neutron/dhcp/88513920-091b-4bee-a47c-5636593c69f5/opts --leasefile-ro --dhcp-range=tag0,10.0.0.0,static,86400s --dhcp-lease-max=256 --conf-file= --domain=openstacklocal
We see that our dnsmasq DHCP server process came back to life. Let’s go check out the hosts file that was passed into it to see if it has the MAC address of our “broken” VM:
[root@lvosneutr100 ~]# cat /var/lib/neutron/dhcp/88513920-091b-4bee-a47c-5636593c69f5/host fa:16:3e:c6:d4:c9,host-10-0-0-2.openstacklocal,10.0.0.2 fa:16:3e:86:d8:5a,host-10-0-0-3.openstacklocal,10.0.0.3 fa:16:3e:50:de:2a,host-10-0-0-5.openstacklocal,10.0.0.5 fa:16:3e:91:d3:6e,host-10-0-0-1.openstacklocal,10.0.0.1 fa:16:3e:e3:7f:4c,host-10-0-0-7.openstacklocal,10.0.0.7 fa:16:3e:0b:21:d1,host-10-0-0-9.openstacklocal,10.0.0.9 fa:16:3e:3e:e9:00,host-10-0-0-10.openstacklocal,10.0.0.10 fa:16:3e:1f:42:15,host-10-0-0-6.openstacklocal,10.0.0.6 fa:16:3e:c6:e1:48,host-10-0-0-4.openstacklocal,10.0.0.4
Yup, the MAC address is in there. All looks good so far…let’s restart networking on our broken VM and watch the traffic:
[root@lvosneutr100 ~]# ip netns exec qdhcp-88513920-091b-4bee-a47c-5636593c69f5 tcpdump -i tap315b3aa8-c3 -nnvvvNXs 512
tcpdump: listening on tap315b3aa8-c3, link-type EN10MB (Ethernet), capture size 512 bytes
14:51:22.269526 IP (tos 0x10, ttl 128, id 0, offset 0, flags [none], proto UDP (17), length 328)
    0.0.0.0.68 > 255.255.255.255.67: [udp sum ok] BOOTP/DHCP, Request from fa:16:3e:3e:e9:00, length 300, xid 0x1d798f61, Flags [none] (0x0000)
	  Client-Ethernet-Address fa:16:3e:3e:e9:00
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Discover
	    Parameter-Request Option 55, length 13: 
	      Subnet-Mask, BR, Time-Zone, Classless-Static-Route
	      Domain-Name, Domain-Name-Server, Hostname, YD
	      YS, NTP, MTU, Option 119
	      Default-Gateway
	    END Option 255, length 0
	    PAD Option 0, length 0, occurs 41
	0x0000:  4510 0148 0000 0000 8011 3996 0000 0000  E..H......9.....
[...output cut...]
	0x0140:  0000 0000 0000 0000                      ........
14:51:22.275186 IP (tos 0x0, ttl 64, id 2824, offset 0, flags [none], proto UDP (17), length 352)
    10.0.0.3.67 > 10.0.0.10.68: [bad udp cksum 62bb!] BOOTP/DHCP, Reply, length 324, xid 0x1d798f61, Flags [none] (0x0000)
	  Your-IP 10.0.0.10
	  Server-IP 10.0.0.3
	  Client-Ethernet-Address fa:16:3e:3e:e9:00
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Offer
	    Server-ID Option 54, length 4: 10.0.0.3
	    Lease-Time Option 51, length 4: 86400
	    RN Option 58, length 4: 43200
	    RB Option 59, length 4: 75600
	    Subnet-Mask Option 1, length 4: 255.255.255.0
	    BR Option 28, length 4: 10.0.0.255
	    Domain-Name Option 15, length 14: "openstacklocal"
	    Hostname Option 12, length 14: "host-10-0-0-10"
	    Default-Gateway Option 3, length 4: 10.0.0.1
	    Domain-Name-Server Option 6, length 4: 192.168.1.2
	    END Option 255, length 0
	0x0000:  4500 0160 0b08 0000 4011 5a79 0a00 0003  E..`....@.Zy....
[...output cut...]
	0x0150:  2d31 3003 040a 0000 0106 04c0 a801 02ff  -10.............
And back on our client VM…we now have an IP address.
Moral? Be prepared to give Neutron a Big Kick sometimes just to keep it in line.
Happy Computing!
I thought “sum error” meant the message will be thrown and re-requested.
A number of stupid questions:
What is the cause of the error?
How does the re-start fix it?
Why not give a simple explanation of the used parameters? Surely not hard and, as a seperate section, can be marked as skippable by tyros.
Thanks for an interesting article.
Please help, On first command :
nova –os-tenant-name=infrastructure show testnew
Several instances do not have an ip. The ones that do are working fine.
The ones that don’t have IP are on different projects and are on different compute nodes, they all worked at some point and just stopped. I suspect something on those specific virtual networks stopped working.
Good Luck, Bobby. OpenStack is frustrating, isn’t it? Things stop working, you start mucking through logs, turning up debug, it’s a terrible way to burn up precious hours of existence. I feel for you!
Hi ANdrew,
So I got a little further, I found that when I ran neutron port-list, the instance was not bound to a compute node, even though horizon showed which compete node, and I checked and confirmed the instance is built on specified compute node. I noticed that the Virtual Tap (interface) for instance was not built either.
I then ran neutron port-update and linked instance to compute node.
Everything shows correctly in neutron now and compute node has proper TAP (virtual interface), I reset everything, but I am still not getting IP. Can you please guide me where I should look next.
Andrew, do you know if this is fixed in Icehouse or another future release of Openstack? I saw some chatter about it being fixed with a ProcessMonitor class, but I couldn’t find if it was in a future release. https://review.openstack.org/#/c/115935/
I think it is fixed in Kilo – I’ve implemented Kilo with HA but I have more to do before I can verify. (Need to implement multiple identity domains to permit Heat to work with WaitHandles – using LDAP as backend fails.)
Hi,
Thanks for your post, very useful.
Right now I am fighting with the neutron dhcp-agent. Instances cannot get the IPs. Host file has the MAC and the ip, dncmasq.log shows the request but
ip netns exec qdhcp- tcpdump -i -nnvvvNXs 512
doesn’t show anything when I start a new VM.
Another thing that is wrong is my setup is related to metadata
[root@smog: ~] # ip netns exec qrouter-6687f9b9-461e-499e-bf70-7718845768f4 iptables -L | grep 9697
DROP tcp — anywhere anywhere tcp dpt:9697
So this should not be “DROP” it should be “ACCEPT” right. I am using nova-metadata.
Can you please give me a hint from where is DROP role is coming?
Regards..
Salman.
Very useful, thanks. Was debugging another issue with neutron/dnsmasq, these steps proved helpful ๐