What’s wrong with these pictures?
I was given logon access to the WCS console at GCCEC at the start of May this year. Shortly afterwards (10 minutes later) I started e-mailing “DANGER, WILL ROBINSON!” messages to the venue and the tech•ed technology team.
The first image shows the current receive and transmit utilisation (Rx. Util. and Tx. Util. respectively) for a given access point (we’re still trying to get to the bottom of understanding how they’re calculated given there are multiple radios in each access point). The Controller also has the access points doing a passive scan in the background to determine the actual RF spectrum utilisation on the channel to which the access point is assigned (that’s the third “Channel Util.” figure off each access point).
The second image shows the number of client associations across the same area.
See the problem? The venue is between events. No one is there. But there is a massive RF utilisation across the building. Some of the access points are yakking their heads off to no one and there is not much bandwidth left over for any users.
When we raised the issue, Cisco TAC and the installers of the network were of the opinion that the issue was caused by external interference. I rejected this explanation immediately because:
- GCCEC is on the coast of Queensland and so constructed to withstand cyclones and severe tropical storms. This means robust reinforced construction materials throughout.
- The venue is bounded by water on three sides and by the Gold Coast Highway on the fourth (building foyer + driveway + park out front + four-lane road + block of shops before you even get to the nearest residential building).
Anyone who has diagnosed complex problems with multiple suppliers in the mix knows that getting traction from people in resolving problems is sometimes hard, especially when people invariably have a foregone conclusion as to a root cause in their minds. Our logical fault finding steps needed to be clear and bulletproof to gain traction and ownership from all involved.
Step 1 – Isolate the cause of interference as internal or external to the building
This part was pretty easy. There is a company called MetaGeek that sells a great little device called a Wi-Spy (it presents itself to Windows as a HID device so there is no mucking around with special drivers and other nonsense) and a companion piece of software called Chanalyzer. Chanalyzer and the Wi-Spy together allow you to see peak and average utilisation of the entire 2.4 GHz spectrum (there’s a version you can buy now that does 5 GHz). You can simultaneously use your lappy’s onboard WLAN NIC to grab a list of SSIDs with the corresponding channels and signal strength information, and then overlay that on the actual RF activity on the network.
There are two useful views here. The Spectral View (aka waterfall) shows a time series utilisation graph of the RF spectrum. You can adjust the sampling period and play back different periods. The other useful view is the topographic view that shows the signature of the RF utilisation pattern overlaid with the SSIDs found using the WLAN NIC in your laptop.
As I mentioned above, it was amply clear that the interference was inside the building because the WCS console was saying as much, but we really needed a smoking gun. This was too easy to produce (by sheer brute force):
- Run up the Wi-Spy, and grab a sample of what the spectrum is doing;
- Shut down the entire wireless system in the venue;
- Repeat step one and compare.
Step 1 showed a lot of RF utilisation. Step 3 showed nearly none. Case closed: The interference was the wireless system in the building. Now we just had to work out why!
Step 2 – What is the RF interference exactly?
After two months and a number of Cisco TAC cases the utilisation figures at the venue were still unacceptably high. We had not received a decent explanation from the parties involved as to the true root cause (well, not one that would satisfy me anyway) so I chose to employ brute force again. 🙂
Brute force this time came in the form of an embedded wireless platform that allowed us raw and unfettered access to the underlying WLAN NIC to do some packet captures of the RF side of the wireless interface. We needed this specialised platform because of a limitation in Windows: 802.11 traffic is presented to applications as 802.3 (Ethernet) traffic as it moves up the driver stack. Under Windows it is therefore not possible to capture raw management frames unless you use a device with a proprietary raw miniport driver that bypasses most of Windows’ normal networking. These drivers are never certified.
The embedded device we used is normally stuck on mining vehicles with neodymium magnets (David Eagles from iVolve brought it in a nice green Coles friends-of-the-Earth recycled shopping bag and told everyone not to put your laptop near it unless you wanted a blank hard drive).
We were pretty much the only users of the WLAN in the North-West of the centre. We ran a packet capture on the RF to see what on Earth was going on and fed the raw file to Wireshark. The results were very revealing:
- We ran the packet capture for 185 seconds
- 39,193 frames were captured (remember no one is using the network at this point!)
- 38,088 frames were 802.11 beacons … !
- Only 1,105 frames were not 802.11 beacons … !!
Further from this you can work out:
- There were approximately 220 beacons per second with a size of 258 bytes each.
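As a sanity check, the capture totals above can be punched into a few lines of Python. This is only a sketch: the variable names are made up for illustration and the inputs are the figures quoted in the list. Averaged over the whole 185 seconds it comes out nearer 206 beacons per second, a little under the approximate 220 quoted above, but the conclusion is the same either way: almost everything on the air was beacons.

```python
# Figures taken from the capture summary above.
capture_seconds = 185
total_frames = 39193
beacon_frames = 38088

# Everything that was not a beacon.
other_frames = total_frames - beacon_frames

# Share of captured frames that were beacons.
beacon_share = beacon_frames / total_frames

# Average beacon rate over the whole capture.
beacons_per_second = beacon_frames / capture_seconds

print(f"non-beacon frames:  {other_frames}")
print(f"beacon share:       {beacon_share:.1%}")
print(f"beacons per second: {beacons_per_second:.0f}")
```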
At this point we knew we were onto something … but why so much traffic?
Step 3 – Analyse the logs
Beacon frames are sent as a normal part of 802.11 management traffic. Normally an access point will send (about) 10 beacon frames per second to advertise its SSID and various information about the capabilities offered. That would account for but a small fraction of the traffic above. We were only pulling traffic from channel 6 in this case, so there could not possibly be enough access points to generate that much traffic at 10 beacons per second each.
Remember we’re looking at 220 beacons per second. A single access point should only generate 10.
The Wireshark traces showed that GCCEC has 5 SSIDs being advertised for use (their public one, internal, one for Telstra and some other stuff). Each of these was being advertised in its own beacon frame. This is helpful as it tells us to expect 5x the number of beacons per access point, and importantly we’re now in the realm of feasibly accounting for the quantity of beacon frames seen in our packet captures (i.e. our packet capture device would easily see 4-5 access points on channel 6).
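The estimate is easy to sketch in Python. The function name and its parameters are purely illustrative, and it assumes every access point sends one beacon stream per SSID at roughly 10 beacons per second, as described above.

```python
def expected_beacon_rate(access_points, ssids_per_ap, beacons_per_ssid_per_second=10.0):
    """Beacons per second heard on one channel.

    Assumes each access point sends a separate beacon stream for every
    advertised SSID, at about 10 beacons per second per stream.
    """
    return access_points * ssids_per_ap * beacons_per_ssid_per_second


# 4-5 access points audible on channel 6, each advertising 5 SSIDs.
for aps in (4, 5):
    print(aps, "access points ->", expected_beacon_rate(aps, 5), "beacons per second")
```

With 4 to 5 access points in earshot this gives 200 to 250 beacons per second, which brackets what the capture showed.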
This answers part of the problem as to why there were so many frames from an ‘unused’ wireless network. Now we just needed to answer the original question we came on site for – why so much RF utilisation?
Step 4 – Punch some numbers into a calculator
To understand the nature of the problem we need to understand a bit about data rates and 802.11 networking. There are a number of bit rates defined for 802.11 networking and clients will choose a bit rate based on signal strength, configuration of the base station, and other things.
The important thing to note here is that all management traffic is sent at the lowest bit rate supported by the base station (its lowest basic rate). In this case that would be … 1 Mbps.
A 1 Mbps bit rate gives you typical data throughput speeds of 500 kbps.
Let’s go back to those figures again:
- 220 beacon frames per second;
- 258 bytes each;
- multiplies out to 454,080 bits per second;
- Typical throughput for 802.11b at 1 Mbps is about half a megabit … which would be about 500,000 bits per second.
We now can account for 80-90% RF utilisation figures based on beacon frames alone. All of these marry up more or less and so now we understand the problem.
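For anyone who wants to replay the sums, here is a minimal Python sketch of the same back-of-the-envelope model. The function names are invented for illustration, and the model is deliberately rough: it simply compares the beacon bit rate against the typical 500 kbps throughput figure and ignores per-frame overheads such as preambles and inter-frame gaps.

```python
BITS_PER_BYTE = 8


def beacon_load_bps(beacons_per_second, beacon_bytes):
    """Raw bits per second consumed by beacon frames."""
    return beacons_per_second * beacon_bytes * BITS_PER_BYTE


def utilisation(load_bps, usable_bps):
    """Fraction of the usable throughput eaten by the given load."""
    return load_bps / usable_bps


load = beacon_load_bps(220, 258)        # figures from the capture
share = utilisation(load, 500_000)      # ~500 kbps typical at 1 Mbps

print(load)              # 454080
print(f"{share:.0%}")    # 91%
```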
Recommendation for tech•ed
There are a few very logical outcomes from this exercise that provide ‘easy wins’.
- We will turn off all advertised BSSIDs except for MicrosoftEvent;
- We will get GCCEC to make their corporate network’s WLAN access require a probe request so it is not causing another SSID to be advertised;
- We will disable 802.11b at the event (sorry to all of you with an iMate Jamin, but it might be time for an upgrade! :));
- We will up the basic rate to 18 Mbps. This alone should cut the airtime taken up by management traffic to roughly 1/18th of what it was before.
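To get a feel for the combined effect, here is a rough Python sketch. The function name is hypothetical and the model is an assumption, not a measurement: it treats beacon airtime as proportional to the number of advertised SSIDs and inversely proportional to the basic rate, and it assumes an SSID that is no longer advertised stops contributing beacons. Fixed per-frame overhead does not shrink with the bit rate, so the real saving will be somewhat smaller.

```python
def relative_beacon_airtime(ssids_before, ssids_after, rate_before_mbps, rate_after_mbps):
    """Fraction of the original beacon airtime left after the changes.

    Rough model: airtime scales with the number of advertised SSIDs and
    inversely with the bit rate the beacons are sent at.
    """
    ssid_factor = ssids_after / ssids_before
    rate_factor = rate_before_mbps / rate_after_mbps
    return ssid_factor * rate_factor


# 5 advertised SSIDs at 1 Mbps today; 1 advertised SSID at 18 Mbps afterwards.
remaining = relative_beacon_airtime(5, 1, 1, 18)
print(f"{remaining:.1%} of the original beacon airtime")
```

Under those assumptions the beacons would drop to about 1/90th of the airtime they take today.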
Hooray! Beers all around at Q1 … not quite
In a case of “solve one problem, find another” we unfortunately did uncover a fair few more issues while conducting the work above over a full day on site. The main outstanding issue is that some of the radio interfaces in particular access points perform very poorly (3 Mbps typical throughput even on 802.11n). This is now our critical concern with the wireless at GCCEC, and we will be scheduling another visit on site as soon as possible to get to the bottom of these problematic radio interfaces. At this stage it looks like a real puzzler, because we did some quick tests in the waning hours of our time on site and we received excellent performance from other radios of the same access point, and excellent performance from the edge switchports that service the access points.
We like a good mystery and hopefully solving this next one won’t bring up more problems so we can have those beers at Q1. 🙂