

Debugging WiFi

By Jeff Root

It Works For Me

Don't you hate it when you have a problem, only to be told, "On my machine, it just works"? I know I do. So when Ben Okopnik wrote in Linux Gazette #127:

 [ I do a lot of travelling, and connect to a wide variety of
   strange WLANs. In my experience, at least, connecting to a
   wireless LAN with Linux is usually just as simple as Edgar
   describes. -- Ben ]

This motivated me to solve a long-standing problem: I could not connect to "open" wifi access points. I had no trouble at all with closed access points; someone gives me the ESSID and the WEP key, and I'm on. But plop me down at a hotel, airport, or coffee shop, and no amount of fiddling would get me connected.

So, being determined, I wrote to Ben and offered to write this article, if The Answer Gang would help me troubleshoot my problem. Ben and the Gang held up their end of the bargain, so here is my tale.

The Approach

First, a word about my specific configuration, so you'll have a shot at changing things to suit your particular setup. I am running on a Dell C600 laptop, with an unsupported wifi card built in. I therefore use a PCMCIA card, a Microsoft MN-520. The machine boots into Debian Sarge, with all current updates.

Since I could easily connect to any closed access point (AP), I knew that the hardware, driver, and networking layers were all working. My problem was almost certainly a configuration issue.

Debian's main network configuration file is /etc/network/interfaces. My configuration for work included the ESSID and WEP key, which should not be needed for an open AP. So I stripped it down to a minimum:

    iface wlan0 inet dhcp
	wireless-mode managed

Now for a trip to the coffee shop, to see what shows up in /var/log/messages:

localhost dhclient: DHCPDISCOVER on wlan0 to 255.255.255.255 port 67 interval 6
localhost dhclient: DHCPDISCOVER on wlan0 to 255.255.255.255 port 67 interval 12
localhost dhclient: DHCPDISCOVER on wlan0 to 255.255.255.255 port 67 interval 11
localhost dhclient: DHCPDISCOVER on wlan0 to 255.255.255.255 port 67 interval 21
localhost dhclient: DHCPDISCOVER on wlan0 to 255.255.255.255 port 67 interval 7
localhost dhclient: DHCPDISCOVER on wlan0 to 255.255.255.255 port 67 interval 4
localhost dhclient: No DHCPOFFERS received.
localhost dhclient: No working leases in persistent database - sleeping.

Not much of use to me here. A bit of googling shows that my machine is broadcasting a request for an address, but is not getting any answers.
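To make this "nobody answered" pattern easy to spot in a longer log, here is a minimal Python sketch that counts the DHCP message names in dhclient syslog lines and makes a rough guess at where the exchange stopped. It assumes the ISC dhclient wording shown above (DHCPDISCOVER, DHCPOFFER, DHCPREQUEST, DHCPACK, DHCPNAK); the function names are hypothetical.

```python
import re
from collections import Counter

# Message names ISC dhclient writes to syslog. The trailing \b means the line
# "No DHCPOFFERS received." is *not* counted as an offer.
DHCP_MSG = re.compile(r"\b(DHCPDISCOVER|DHCPOFFER|DHCPREQUEST|DHCPACK|DHCPNAK)\b")


def summarize_dhcp(log_lines):
    """Count each DHCP message type mentioned in a list of log lines."""
    counts = Counter()
    for line in log_lines:
        match = DHCP_MSG.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def diagnose(counts):
    """Rough guess at how far the DISCOVER/OFFER/REQUEST/ACK exchange got."""
    if counts["DHCPACK"]:
        return "lease obtained"
    if counts["DHCPNAK"]:
        return "server refused the request"
    if counts["DHCPOFFER"]:
        return "offer received, but no ACK"
    if counts["DHCPDISCOVER"]:
        return "discovers sent, no server answered"
    return "no DHCP traffic logged"


# The coffee-shop log from above, verbatim.
FAILED_LOG = """\
localhost dhclient: DHCPDISCOVER on wlan0 to 255.255.255.255 port 67 interval 6
localhost dhclient: DHCPDISCOVER on wlan0 to 255.255.255.255 port 67 interval 12
localhost dhclient: DHCPDISCOVER on wlan0 to 255.255.255.255 port 67 interval 11
localhost dhclient: DHCPDISCOVER on wlan0 to 255.255.255.255 port 67 interval 21
localhost dhclient: DHCPDISCOVER on wlan0 to 255.255.255.255 port 67 interval 7
localhost dhclient: DHCPDISCOVER on wlan0 to 255.255.255.255 port 67 interval 4
localhost dhclient: No DHCPOFFERS received.
localhost dhclient: No working leases in persistent database - sleeping.
""".splitlines()

if __name__ == "__main__":
    counts = summarize_dhcp(FAILED_LOG)
    print(dict(counts), "->", diagnose(counts))
```

Run on the excerpt above, it finds six DHCPDISCOVER messages and no offers, which is exactly the "broadcasting but getting no answers" situation.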

In Ben's first email to me, he included a sample of successful IP address negotiation:

Fenrir pumpd[4711]: starting at (uptime 0 days, 2:43:09)
Fenrir pumpd[4711]: PUMP: sending discover
Fenrir pumpd[4711]: got dhcp offer
Fenrir pumpd[4711]: PUMP: sending second discover
Fenrir pumpd[4711]: PUMP: got an offer
Fenrir pumpd[4711]: PUMP: got lease
Fenrir pumpd[4711]: intf: device: wlan0
Fenrir pumpd[4711]: intf: set: 416
Fenrir pumpd[4711]: intf: bootServer: 10.0.0.1
Fenrir pumpd[4711]: intf: reqLease: 43200
Fenrir pumpd[4711]: intf: ip: 10.0.0.219
Fenrir pumpd[4711]: intf: next server: 10.0.0.1
Fenrir pumpd[4711]: intf: netmask: 255.255.255.0
Fenrir pumpd[4711]: intf: gateways[0]: 10.0.0.1
Fenrir pumpd[4711]: intf: numGateways: 1
Fenrir pumpd[4711]: intf: dnsServers[0]: 10.0.0.1
Fenrir pumpd[4711]: intf: numDns: 1
Fenrir pumpd[4711]: intf: broadcast: 10.0.0.255
Fenrir pumpd[4711]: intf: network: 10.0.0.0
Fenrir pumpd[4711]: configured interface wlan0

Ben uses "pump" rather than "dhclient", but you can clearly see his machine sending a "discover" request, which is answered by the AP with a "dhcp offer". This is simply a suggested address, which your machine is free to reject. (I don't know why you would, though.)
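For the curious, here is roughly what a "discover" looks like on the wire. This Python sketch builds a minimal DHCPDISCOVER using the message layout from RFC 2131: a 236-byte fixed BOOTP header, the DHCP "magic cookie", and option 53 (message type) set to 1. It is an illustration, not a working client: it only builds the bytes and opens no socket. The transaction ID used below is an arbitrary hypothetical value, and the padding to 300 bytes follows the original BOOTP message size from RFC 951, which matches the length of the first request in the Gang's tcpdump trace later in this article.

```python
import struct

DHCP_MAGIC_COOKIE = bytes([0x63, 0x82, 0x53, 0x63])
DHCPDISCOVER = 1          # value of option 53 (DHCP message type)
BOOTP_MIN_LEN = 300       # classic BOOTP message size (RFC 951)


def build_dhcpdiscover(mac, xid):
    """Return the UDP payload of a minimal DHCPDISCOVER for `mac`.

    A client sends this from 0.0.0.0:68 to 255.255.255.255:67, since it has
    no address yet; this function only builds the bytes.
    """
    chaddr = bytes.fromhex(mac.replace(":", ""))
    header = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,              # op: BOOTREQUEST
        1,              # htype: Ethernet
        len(chaddr),    # hlen: 6 bytes of MAC address
        0,              # hops
        xid,            # transaction ID, echoed back in the server's reply
        0,              # secs
        0x8000,         # flags: broadcast bit, ask the server to broadcast its reply
        bytes(4),       # ciaddr: 0.0.0.0, client has no address yet
        bytes(4),       # yiaddr
        bytes(4),       # siaddr
        bytes(4),       # giaddr
        chaddr,         # chaddr, zero-padded to 16 bytes by struct
        b"",            # sname, all zeros
        b"",            # file, all zeros
    )
    options = DHCP_MAGIC_COOKIE + bytes([53, 1, DHCPDISCOVER, 255])  # 255 = end
    return (header + options).ljust(BOOTP_MIN_LEN, b"\x00")


if __name__ == "__main__":
    packet = build_dhcpdiscover("00:02:2d:74:4a:fa", 0x12345678)
    print(len(packet), packet[:12].hex())
```

The server's DHCPOFFER uses the same layout, with op set to 2 (BOOTREPLY) and the offered address in yiaddr.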

[ It might taste a little strange, or smell sorta funky. Or perhaps you might already have an interface mapped to that IP. One never knows. :) -- Ben ]

When In Doubt, Use Brute Force

Ken Thompson's aphorism is handy here. I couldn't get a connection via the automated tools, so I switched to a manual configuration. I knew I needed the ESSID for the AP network, so I started looking for tools that could discover this without a connection.

First, I tried the iwlist command. This tool uses the radio in the WiFi card, and listens for what are known as "beacon" frames. These are simply broadcast messages that advertise the presence of an AP. One problem with iwlist is that it's a one-shot tool; you get a list of whatever happens to be broadcasting at the moment it listens. While you can simply repeat the iwlist command, I felt more comfortable with a tool that listens continuously.
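If you would rather stay with iwlist, one way to get continuous listening is to rerun it in a loop and parse its output. The Python sketch below does the parsing half: it splits `iwlist <interface> scan` text into one record per cell and picks out cells whose "Encryption key" is off, or whose ESSID is empty. The sample text is made up to resemble wireless-tools output (it was not captured from the author's card), and the exact field layout may differ between driver and wireless-tools versions; the function names are hypothetical.

```python
import re
import subprocess

CELL = re.compile(r"Cell \d+ - Address: ([0-9A-Fa-f:]{17})")
ESSID = re.compile(r'ESSID:"(.*)"')
ENCRYPTION = re.compile(r"Encryption key:(on|off)")


def parse_iwlist_scan(text):
    """Split `iwlist <iface> scan` output into one dict per cell (AP)."""
    cells = []
    for raw in text.splitlines():
        line = raw.strip()
        m = CELL.match(line)
        if m:
            cells.append({"address": m.group(1)})
            continue
        if not cells:
            continue  # header line such as "wlan0  Scan completed :"
        m = ESSID.match(line)
        if m:
            cells[-1]["essid"] = m.group(1)
            continue
        m = ENCRYPTION.match(line)
        if m:
            cells[-1]["encrypted"] = m.group(1) == "on"
    return cells


def open_networks(cells):
    """ESSIDs of cells that advertise no encryption."""
    return [c.get("essid", "") for c in cells if c.get("encrypted") is False]


def hidden_networks(cells):
    """Addresses of cells broadcasting an empty ESSID."""
    return [c["address"] for c in cells if c.get("essid") == ""]


def scan_once(interface="wlan0"):
    """Run one real scan (usually needs root); not called in the demo below."""
    result = subprocess.run(["iwlist", interface, "scan"],
                            capture_output=True, text=True, check=True)
    return parse_iwlist_scan(result.stdout)


# Illustrative sample only, not real captured output.
SAMPLE_SCAN = """\
wlan0     Scan completed :
          Cell 01 - Address: 00:11:22:33:44:55
                    ESSID:"CoffeeShop"
                    Mode:Master
                    Channel:6
                    Encryption key:off
          Cell 02 - Address: 66:77:88:99:AA:BB
                    ESSID:""
                    Mode:Master
                    Channel:11
                    Encryption key:on
"""

if __name__ == "__main__":
    cells = parse_iwlist_scan(SAMPLE_SCAN)
    print("open:", open_networks(cells))
    print("hidden:", hidden_networks(cells))
```

Calling `scan_once()` every few seconds and printing new cells would approximate what Kismet does continuously, though Kismet gathers far more detail.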

That tool was Kismet. I used APT to install it, but still had to edit the config file (/etc/kismet/kismet.conf) to tell it to use my HostAP-driven card. Once configured, Kismet showed me a continuously updated list of APs in my vicinity, and told me which ones were open and which were masked (usually done simply by not broadcasting the network ID, known as the SSID or ESSID). Sure enough, one of the open APs used the name of the coffee shop as its ESSID.

So, this means my machine can see the shop's AP. Even the signal strength looked good. Why, then, couldn't I get any response to the DHCPDISCOVER?

This meditation brings enlightenment: wifi is a radio system, and just because I can hear the AP just fine does not mean that the AP can hear me. After all, an AP usually has a strong transmitter, but there's only so much power a PCMCIA transmitter (like the one in my wifi card) can send.
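A rough link-budget calculation shows why "I can hear the AP" does not imply "the AP can hear me". This Python sketch uses the standard free-space path loss formula, FSPL(dB) = 20*log10(d in km) + 20*log10(f in MHz) + 32.44. All power and gain figures are hypothetical, chosen only to illustrate the point, and real indoor links lose considerably more than free space. Because path loss and antenna gains count the same in both directions, the difference between what the laptop hears and what the AP hears comes down to the difference in transmit power.

```python
import math


def fspl_db(distance_m, freq_mhz):
    """Free-space path loss in dB for a distance in metres and frequency in MHz."""
    return 20 * math.log10(distance_m / 1000.0) + 20 * math.log10(freq_mhz) + 32.44


def received_dbm(tx_dbm, tx_gain_dbi, rx_gain_dbi, distance_m, freq_mhz):
    """Received power in dBm over a free-space link."""
    return tx_dbm + tx_gain_dbi + rx_gain_dbi - fspl_db(distance_m, freq_mhz)


# Hypothetical figures, for illustration only.
AP_TX_DBM = 20      # 100 mW access point
CARD_TX_DBM = 13    # roughly 20 mW PCMCIA card
AP_GAIN_DBI = 5     # external AP antenna
CARD_GAIN_DBI = 0   # small antenna inside the card
DISTANCE_M = 60
FREQ_MHZ = 2437     # 802.11b/g channel 6

# AP -> laptop and laptop -> AP use the same path loss and the same antennas.
downlink = received_dbm(AP_TX_DBM, AP_GAIN_DBI, CARD_GAIN_DBI, DISTANCE_M, FREQ_MHZ)
uplink = received_dbm(CARD_TX_DBM, CARD_GAIN_DBI, AP_GAIN_DBI, DISTANCE_M, FREQ_MHZ)

if __name__ == "__main__":
    print(f"AP -> laptop: {downlink:.1f} dBm")
    print(f"laptop -> AP: {uplink:.1f} dBm")
```

With these made-up numbers the laptop hears the AP 7 dB more strongly than the AP hears the laptop. Moving closer reduces the path loss in both directions, which can lift the weaker laptop-to-AP signal to a level the AP can decode.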

Running Blind

Since I can't change the power transmitted by my card, I moved closer to the AP. Now, I get:

localhost pumpd[1275]: PUMP: sending discover 
localhost pumpd[1275]: got dhcp offer 
localhost pumpd[1275]: PUMP: sending second discover
localhost cardmgr[1068]: + Operation failed.
localhost cardmgr[1068]: + Failed to bring up wlan0.

(As you can see, I switched from the ISC dhclient3 to pump, on the Gang's suggestion.) This looks promising, since I now get an offer, but the "second discover" that follows fails. What's that all about?

My guess is that the first discover is used to gather the AP information (ESSID, etc.), which is then used to do the actual IP address negotiation.

At this point, I was somewhat frustrated by how little information the logs contained. But one reason I run Linux is that the system keeps no secrets; you can always get more information, even if you have to recompile the kernel. It was time to stop running blind.

TCPDUMP To The Rescue

At this point, The Answer Gang was as starved for information as I was. They wanted to see tcpdump traces. Tcpdump is a very low-level network tool: it captures packets straight off the underlying link (ethernet, radio, carrier pigeon, whatever is carrying the actual bits), before any application gets to interpret them.

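Tcpdump is normally run against an existing, working network interface, but in my case the interface didn't exist until I inserted the card, and never got an address after that; getting one was the whole object of my quest. So the Gang shared the necessary incantation: tcpdump -vvv -i any. The "any" pseudo-interface listens on all interfaces at once, so tcpdump can be started before the card goes in. (You can use more or fewer 'v' flags, for more or less detail.)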


So now my sequence was: bring up the machine, start capturing the log files, start tcpdump and save the output to a file, then insert the card.
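The tcpdump step of that sequence can be scripted. Below is a minimal Python sketch that builds the Gang's `tcpdump -vvv -i any` command and writes its text output to a file until you press Ctrl-C. Two additions are mine and optional: `-l`, which line-buffers tcpdump's output so the file fills in as packets arrive, and a capture filter limiting the trace to DHCP (UDP ports 67 and 68) and ARP. The Gang's original command used no filter, and the function names and output filename are hypothetical. tcpdump normally needs root privileges.

```python
import subprocess


def tcpdump_argv(interface="any", verbosity=3, bpf_filter="port 67 or port 68 or arp"):
    """Build a tcpdump command line; verbosity=3 gives the Gang's -vvv."""
    argv = ["tcpdump", "-l"]          # -l: line-buffered stdout
    if verbosity > 0:
        argv.append("-" + "v" * verbosity)
    argv += ["-i", interface]         # "any" = all interfaces
    if bpf_filter:
        argv.append(bpf_filter)       # tcpdump accepts the expression as one argument
    return argv


def capture_to_file(path, **kwargs):
    """Run tcpdump, writing its output (and status messages) to `path`.

    Start this, then insert the card; stop it with Ctrl-C.
    """
    with open(path, "w") as out:
        proc = subprocess.Popen(tcpdump_argv(**kwargs), stdout=out,
                                stderr=subprocess.STDOUT)
        try:
            proc.wait()
        except KeyboardInterrupt:
            proc.terminate()
            proc.wait()


if __name__ == "__main__":
    # Only prints the command; call capture_to_file("wifi-debug.txt") as root to capture.
    print(" ".join(tcpdump_argv()))
```

A filter like this would have hidden the odd UDP broadcast to port 2222 that shows up in the Gang's trace below, so an unfiltered capture (`bpf_filter=None`) is the safer first attempt.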

Having captured the failed traffic, I passed it along to the Gang. (Ben asked me to be sure to mention that copying and pasting is the preferred way to send such data; too often they get hand-transcribed output that has been corrupted by a fat finger.) Since they did such a good job of walking through the tcpdump data, I'm just going to paste their analysis right here.

The Gang Speaks

Each line of tcpdump output below is prefixed with '>'; Ben Okopnik's explanation follows each group of lines.

> 15:08:58.385577 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:02:2d:74:4a:fa, length: 300

OK, there goes your "all stations" broadcast for DHCP.

> 15:08:58.395954 arp who-has 169.254.1.51 tell 169.254.1.1
> 15:08:58.397085 arp who-has 169.254.1.51 tell 169.254.1.1
> 15:08:58.398200 arp who-has 169.254.1.51 tell 169.254.1.1

Wait a minute - I've just smelled a rat. 169.254.X.X? That's an IP that's auto-assigned to an interface that requested a DHCP assignment but failed; note that it's not assigned by the server but by your kernel (google for 'APIPA' for more info; this is ridiculously common Windows behavior, and I should have spotted it earlier.) The ARP request is the standard "Before I assign this IP to this interface, is anyone else using it?" check.

In other words, it looks like your system never acknowledges any communication with the DHCP server other than the ESSID assignment. Even that's probably being assigned by your own system, based on nothing more than beacon detection.

Or... OK, this is silly - but could it be that the coffee shop server's IP is actually set to be 169.254.1.1? In that case, your machine could just be rejecting that assignment, and properly so: that /16 is a reserved address range. Not 'reserved' as in '10.x.x.x' or '192.168.x.x', but as in 'do not use for anything but failed DHCP requests.'

It's a weird but possible explanation. Let's see if it holds up:

> 15:09:02.080455 IP 169.254.1.1.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length: 293

Well, something with an address of 169.254.1.1 actually does reply to you. Tres weird...

> 15:09:02.083702 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:02:2d:74:4a:fa, length: 548
> 15:09:06.083860 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:02:2d:74:4a:fa, length: 548
> 15:09:13.082798 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:02:2d:74:4a:fa, length: 548
> 15:09:19.895539 IP 169.254.1.31.53127 > 255.255.255.255.2222: UDP, length: 106

...and something with another IP in the reserved range (169.254.1.31) answers. Yeah, it's beginning to look like a seriously misconfigured network.

> 15:09:21.896457 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:02:2d:74:4a:fa, length: 548

These are all replies from your machine - possibly saying "nope, can't use that IP. How about something else?" - and that request never gets honored. I'm not certain of this, but it seems likely.

I want to thank The Answer Gang for this analysis; I'd have spent weeks looking up all these packets in the RFCs and other specs.
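Part of the Gang's reading can be mechanised. This Python sketch parses tcpdump lines of the shape shown above (`<time> IP <src>.<port> > <dst>.<port>: <details>`) and labels each source address as link-local (169.254.0.0/16, RFC 3927), private (RFC 1918), or the unspecified address 0.0.0.0 that a client uses before it has an address. It skips the ARP lines, and the regular expression is an assumption tuned to this output format, not a general tcpdump parser; the function names are hypothetical.

```python
import ipaddress
import re

# Shape of the IP lines in the trace:
#   <time> IP <src-ip>.<port> > <dst-ip>.<port>: <details>
IP_LINE = re.compile(
    r"^(?P<time>\S+) IP (?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>[^\s:]+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>[^\s:]+): (?P<info>.*)$"
)

LINK_LOCAL = ipaddress.IPv4Network("169.254.0.0/16")
RFC1918 = [ipaddress.IPv4Network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def parse_tcpdump(lines):
    """Yield a dict of fields for every IP line; ARP and other lines are skipped."""
    for line in lines:
        m = IP_LINE.match(line.lstrip("> ").rstrip())
        if m:
            yield m.groupdict()


def classify(addr):
    """Label an IPv4 address by the range it belongs to."""
    ip = ipaddress.IPv4Address(addr)
    if ip == ipaddress.IPv4Address("0.0.0.0"):
        return "unspecified (client has no address yet)"
    if ip in LINK_LOCAL:
        return "link-local (RFC 3927)"
    if any(ip in net for net in RFC1918):
        return "private (RFC 1918)"
    return "other"


def label_sources(lines):
    """Map each source address in the trace to its classification."""
    return {p["src"]: classify(p["src"]) for p in parse_tcpdump(lines)}


# The trace quoted above, without the '>' prefixes.
TRACE = """\
15:08:58.385577 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:02:2d:74:4a:fa, length: 300
15:08:58.395954 arp who-has 169.254.1.51 tell 169.254.1.1
15:08:58.397085 arp who-has 169.254.1.51 tell 169.254.1.1
15:08:58.398200 arp who-has 169.254.1.51 tell 169.254.1.1
15:09:02.080455 IP 169.254.1.1.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length: 293
15:09:02.083702 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:02:2d:74:4a:fa, length: 548
15:09:06.083860 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:02:2d:74:4a:fa, length: 548
15:09:13.082798 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:02:2d:74:4a:fa, length: 548
15:09:19.895539 IP 169.254.1.31.53127 > 255.255.255.255.2222: UDP, length: 106
15:09:21.896457 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:02:2d:74:4a:fa, length: 548
""".splitlines()

if __name__ == "__main__":
    for src, label in label_sources(TRACE).items():
        print(f"{src:15} {label}")
```

On this trace it flags both 169.254.1.1 and 169.254.1.31 as link-local senders, the same two addresses that made Ben suspect a misconfigured network.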

Zeroconf

At this point, it appears that the AP I'm trying to use is misconfigured: it is handing out addresses from 169.254.0.0/16, a range that, as Ben pointed out, is meant only for addresses a machine assigns itself after DHCP fails. But stop and think: if this configuration is invalid, how come other people can use the AP without error? (I even had a friend try it with his SUSE laptop and: "On that machine, it just works".)

We need to dig some more. At this point, I somehow dropped off the email list for this issue, so I'll pick up the Gang's deliberations in progress. Kapil Hari Paranjape thought my tcpdump was capturing a "zeroconf" session. Here's his explanation:

 
   Under "zeroconf" (RFC 3927) two machines can network if they use
   addresses in the range 169.254.0.0/16. So what may be happening here:

   1. The access point tries to communicate with you, using zeroconf.
   2. Provided you establish a zeroconf link, it then requires
   authentication information.
   3. Upon authentication, it provides a "real" DHCP address.

   So, you should allow DHCP to fail and then establish a zeroconf
   address picked "at random". (The zeroconf package in Debian does
   this.)
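The "picked at random" step is spelled out in RFC 3927: a host chooses a pseudo-random address between 169.254.1.0 and 169.254.254.255 (the first and last 256 addresses of the /16 are reserved), seeding its generator from something host-specific such as the MAC address, which also means the same machine tends to pick the same address each time. The Python sketch below generates candidates that way. It leaves out the part that makes zeroconf safe, namely ARP-probing each candidate and moving on if another host answers, and it makes no claim about how the Debian zeroconf or zcip tools do it; the function name is hypothetical.

```python
import ipaddress
import random

# RFC 3927 section 2.1: candidates come from 169.254.1.0 - 169.254.254.255;
# 169.254.0.x and 169.254.255.x are reserved.
FIRST = int(ipaddress.IPv4Address("169.254.1.0"))
LAST = int(ipaddress.IPv4Address("169.254.254.255"))


def link_local_candidates(mac, count):
    """Yield `count` pseudo-random link-local candidate addresses.

    The generator is seeded from the MAC address, so different hosts get
    different sequences and the same host gets the same sequence each time.
    A real implementation would ARP-probe each candidate before using it.
    """
    rng = random.Random(int(mac.replace(":", ""), 16))
    for _ in range(count):
        yield ipaddress.IPv4Address(rng.randint(FIRST, LAST))


if __name__ == "__main__":
    for addr in link_local_candidates("00:50:f2:c3:aa:07", 3):
        print(addr)
```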

The bit about "authentication" seems to apply to systems that are not truly open; airports, for example, often require that you pay for connectivity via a Web page. So you'd get a temporary IP via zeroconf, then point your browser anywhere; the request is redirected to the "pay first" page. Once you've paid, you get a "real" IP address and are on the Net.

At this point, I started looking at Debian support for Zeroconf. It turns out that full zeroconf is only in "unstable", and I prefer not to use that. Debian stable includes a "zcip" tool which is supposed to handle exactly this problem, so I installed that.

But to no avail. I was able to get a bit further, and had a "real" IP address, but I was not able to get to the outside world. So close...

Worldcon

At this point, the opinion of the Gang was that this was a seriously broken coffee shop, and that I should take my trade elsewhere. Also at this point, I took a vacation to the World Science Fiction Convention in Anaheim, California. My first stop was the Salt Lake City airport, which has an open wifi network. You are required to pay for it, but I thought I could at least test the connectivity portion.

And guess what? "On my machine, it just worked right out of the box." I hate that answer.

I tried this experiment again with the hotel wifi network, and it just worked again.

At this point, I realized that my goal was achieved: I could connect to open wifi APs. True, I could not connect to that lone coffee shop, but that appeared to be their problem, or perhaps a symptom of incomplete zeroconf support in Debian.

I was also able to enlist the help of a local network engineer, who took his Linux laptop down to the coffee shop. When he, too, failed to connect, I felt confident that this was just a problem with the AP. While it might be educational to keep after that AP until I figure out what's going on, that was not my original goal.

The Proof of the Pudding ...

... is in the tasting. My friend the network guy recommended another coffee shop that he used regularly. And guess what? It just worked.

So here are the log file captures showing my final result:

From /var/log/daemon.log:

localhost cardmgr[1239]: + Internet Systems Consortium DHCP Client V3.0.1
localhost cardmgr[1239]: + Copyright 2004 Internet Systems Consortium.
localhost cardmgr[1239]: + All rights reserved.
localhost cardmgr[1239]: + For info, please visit https://www.isc.org/products/DHCP
localhost cardmgr[1239]: + 
localhost cardmgr[1239]: + wifi0: unknown hardware address type 801
localhost cardmgr[1239]: + wifi0: unknown hardware address type 801
localhost cardmgr[1239]: + Listening on LPF/wlan0/00:50:f2:c3:aa:07
localhost cardmgr[1239]: + Sending on   LPF/wlan0/00:50:f2:c3:aa:07
localhost cardmgr[1239]: + Sending on   Socket/fallback
localhost cardmgr[1239]: + DHCPDISCOVER on wlan0 to 255.255.255.255 port 67 interval 3
localhost cardmgr[1239]: + DHCPOFFER from 192.168.0.1
localhost cardmgr[1239]: + DHCPREQUEST on wlan0 to 255.255.255.255 port 67
localhost cardmgr[1239]: + DHCPACK from 192.168.0.1
localhost cardmgr[1239]: + bound to 192.168.0.4 -- renewal in 982546229 seconds.

And the Tcpdump log:

tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 96 bytes
18:43:38.008431 IP (tos 0x0, ttl  32, id 33709, offset 0, flags [DF], length: 576) 192.168.0.1.bootps > 192.168.0.4.bootpc: BOOTP/DHCP, Reply, length: 548, xid:0xfbfc2356, flags: [none]
	  Your IP: 192.168.0.4
	  Client Ethernet Address: 00:50:f2:c3:aa:07
	  sname "M-^?" [|bootp]
18:43:38.017371 IP (tos 0x0, ttl  32, id 32941, offset 0, flags [DF], length: 576) 192.168.0.1.bootps > 192.168.0.4.bootpc: BOOTP/DHCP, Reply, length: 548, xid:0xfbfc2356, flags: [none]
	  Your IP: 192.168.0.4
	  Client Ethernet Address: 00:50:f2:c3:aa:07
	  sname "M-^?" [|bootp]

2 packets captured
2 packets received by filter
0 packets dropped by kernel

Final Thoughts

Initially, I had been unable to connect to any of several APs that I had tried. Then, the one I picked for a concentrated debug session turned out to be misconfigured. But I finally got it all working, and the next time an open AP fails me, I'll know it's not my fault. More important, I'll know how to see what's going on, and have a good chance of fixing it.

But this does illustrate some things that are unique to the Linux community and Free Software in general. The most obvious benefit is the existence of such groups as The Answer Gang. While there's no guarantee that anyone will help you with your specific problem, it is much more likely to happen than with any proprietary product, even if you pay for a support contract.

The next important difference is how transparent Linux is. I could easily trace the kernel activity from the moment I inserted the PCMCIA card until I had my real IP address. I also have confidence that this level of transparency is always available, since I have the source code to every bit of software involved. Access to source might not help a non-programmer, but simply adding some printk() calls to a driver is really not that hard a task. That freedom allows me to add transparency where none may have existed before.

And last, I think you have to recognize how powerful choice is when debugging. At each step, I had several choices of tools to use. When iwlist wasn't working for me, I could easily swap in Kismet.

I can't thank The Answer Gang enough for their help. Robert Heinlein told people to "pay it forward". So, I hope the next person with this sort of trouble will find my article and get the help they need.




Jeff Root's first computer was an IBM System/3, programmed in FORTRAN IV using 64-column punch cards. His first love was his Apple ][e, which still sees occasional use today. Jeff never learned BASIC, preferring the "no excuses" nature of assembly language instead.

Jeff started using Linux with Yggdrasil, which was a "run from CD" distribution back in the days of kernel 1.1.13. This was necessary, since Jeff's computer had no hard drive. He's now a dedicated Debian GNU/Linux user, but keeps a Damn Small Linux pendrive handy.

He currently works as a programmer for a semiconductor maker in Pocatello, Idaho. In a previous life, he tested bombers for the USAF.


Copyright © 2007, Jeff Root. Released under the Open Publication License unless otherwise noted in the body of the article. Linux Gazette is not produced, sponsored, or endorsed by its prior host, SSC, Inc.

Published in Issue 135 of Linux Gazette, February 2007
