Diagnosing and resolving extremely high RF utilisation

What’s wrong with these pictures?

Per AP TX/RX figures with channel utilisation across the ground floor of GCCEC

Number of Client Associations across the ground floor of GCCEC

I was given logon access to the WCS console at GCCEC at the start of May this year. About 10 minutes later I started e-mailing “DANGER, WILL ROBINSON!” messages to the venue and the tech•ed technology team.

What’s wrong?

The first image shows the current receive and transmit utilisation (Rx. Util. and Tx. Util. respectively) for each access point (we’re still trying to get to the bottom of how they’re calculated, given there are multiple radios in each access point). The controller also has the access points doing a passive scan in the background to determine the actual RF spectrum utilisation on the channel to which the access point is assigned (that’s the third “Channel Util.” figure for each access point).

The second image shows the number of client associations across the same area.

See the problem? The venue is between events. No one is there. But there is massive RF utilisation across the building. Some of the access points are yakking their heads off to no one and there is not much bandwidth left over for any users.

When we raised the issue, Cisco TAC and the installers of the network were of the opinion that the issue was caused by external interference. I rejected this explanation immediately because:

  1. GCCEC is on the coast of Queensland and so constructed to withstand cyclones and severe tropical storms. This means robust reinforced construction materials throughout.
  2. The venue is bounded by water on three sides and by the Gold Coast Highway on the fourth (building foyer + driveway + park out front + 4 lane road + block of shops before you even get to the nearest residential building).

Gold Coast Convention and Exhibition Centre (from Bing Maps)

Anyone who has diagnosed complex problems with multiple suppliers in the mix knows that getting traction from people in resolving problems is sometimes hard, especially when people invariably have a foregone conclusion as to a root cause in their minds. Our logical fault finding steps needed to be clear and bulletproof to gain traction and ownership from all involved.

Step 1 – Isolate the cause of interference as internal or external to the building

This part was pretty easy. There is a company called Metageek that sells a great little device called a Wi-Spy (it presents itself to Windows as a HID device so there is no mucking around with special drivers and other nonsense) and a companion piece of software called Chanalyzer. Chanalyzer and the Wi-Spy together allow you to see peak and average utilisation of the entire 2.4 GHz spectrum (there’s a version you can buy now that does 5 GHz). You can simultaneously use your lappy’s onboard WLAN NIC to grab a list of SSIDs with the corresponding channels and signal strength information and then overlay that on the actual RF activity.

GCCEC RF Utilisation Level 1, Meeting Room 5-9 Foyer Area

Central Room A Spectrum Utilisation - note you can clearly see the signature of the 802.11n network on channel 1 (the squarish pattern with a dip in the middle)

There are two useful views here. The Spectral View (aka waterfall) shows a time series utilisation graph of the RF spectrum. You can adjust the sampling period and play back different periods. The other useful view is the topographic view that shows the signature of the RF utilisation pattern overlaid with the SSIDs found using the WLAN NIC in your laptop.

As I mentioned above, it was amply clear that the interference was inside the building because the WCS console was saying as much, but we really needed a smoking gun. This was too easy to produce (by sheer brute force):

  1. Run up the Wi-Spy and grab a sample of what the spectrum is doing;
  2. Shut down the entire wireless system in the venue;
  3. Repeat step one and compare.

Step 1 showed a lot of RF utilisation. Step 3 showed nearly none. Case closed: The interference was the wireless system in the building. Now we just had to work out why!

Step 2 – What is the RF interference exactly?

After two months and a number of Cisco TAC cases the utilisation figures at the venue were still unacceptably high. We had not received a decent explanation from the parties involved as to the true root cause (well, not one that would satisfy me anyway) so I chose to employ brute force again. 🙂

Brute force this time came in the form of an embedded wireless platform that allowed us raw and unfettered access to the underlying WLAN NIC to do some packet captures of the RF side of the wireless interface. We needed to use this specialised platform to capture packets due to limitations within the Windows kernel, in which 802.11 traffic is presented to applications as 802.3 (Ethernet) traffic as it moves up the driver stack. Therefore, under Windows, it is not possible to capture raw management frames unless you use devices with a proprietary raw miniport driver that bypasses most of Windows’ normal networking. These drivers are never certified.

The embedded device we used is normally stuck on mining vehicles with neodymium magnets (David Eagles from iVolve brought it in a nice green Coles friends-of-the-Earth recycled shopping bag and told everyone not to put your laptop near it unless you wanted a blank hard drive).

Running packet capture of the RF in North-West expo hall.

Running RF capture in North West of building - the four circles on the brackets on the device will erase your lappy.

We were pretty much the only users of the WLAN in the North-West of the centre. We ran a packet capture on the RF to see what on Earth was going on and fed the raw file to Wireshark. The results were very revealing:

  • We ran the packet capture for 185 seconds
  • 39,193 frames were captured (remember no one is using the network at this point!)
  • 38,088 frames were 802.11 beacons … !
  • Only 1,105 frames were not 802.11 beacons … !!

Further from this you can work out:

  • There were approximately 220 beacons per second with a size of 258 bytes each.
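
For anyone who wants to check the sums, here is a minimal sketch in Python using only the capture totals quoted above (the function and variable names are made up for illustration). Averaged over the whole 185 seconds the totals come out at a little over 200 beacons per second, which is in the same ballpark as the roughly 220 per second used in the rest of this post.

```python
# Figures quoted in the capture summary above.
CAPTURE_SECONDS = 185
TOTAL_FRAMES = 39_193
BEACON_FRAMES = 38_088


def capture_summary(seconds, total_frames, beacon_frames):
    """Return (beacons per second, beacon share of all frames, non-beacon frames)."""
    beacons_per_second = beacon_frames / seconds
    beacon_share = beacon_frames / total_frames
    non_beacon_frames = total_frames - beacon_frames
    return beacons_per_second, beacon_share, non_beacon_frames


rate, share, other = capture_summary(CAPTURE_SECONDS, TOTAL_FRAMES, BEACON_FRAMES)
print(f"{rate:.0f} beacons/s, {share:.1%} of frames, {other} non-beacon frames")
```

Either way you slice it, beacons made up the overwhelming majority of the frames on the air.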

At this point we knew we were onto something … but why so much traffic?

Step 3 – Analyse the logs

Beacon frames are sent as a normal part of 802.11 management traffic. Normally an access point will send (about) 10 frames per second to advertise its SSID and various information about the capabilities offered. That would account for but a small fraction of the traffic above. We were only pulling traffic from channel 6 in this case so there could not possibly be sufficient access points to generate that much traffic.

Remember we’re looking at 220 beacons per second. A single access point should only generate 10.

Wireshark Trace from Central Room A showing the beacon frame spam

The Wireshark traces showed that GCCEC has 5 SSIDs being advertised for use (their public one, internal, one for Telstra and some other stuff). Each of these was being advertised in its own beacon packet. This is helpful as it shows us now to expect 5x the number of beacons per access point and, importantly, we’re now in the realm of feasibly accounting for the quantity of beacon packets being seen in our packet captures (i.e. our packet capture device would easily see 4-5 access points on channel 6).
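
A rough sketch of that sanity check in Python. The 102.4 ms beacon interval is an assumption (it is the usual default of 100 time units, which is where the "about 10 per second" comes from) and was not read off the venue's configuration; the function names are hypothetical.

```python
# Assumption: default beacon interval of 100 time units (1 TU = 1024 microseconds).
BEACON_INTERVAL_S = 0.1024
# From the Wireshark trace: five SSIDs, each advertised in its own beacon.
SSIDS_PER_AP = 5


def beacons_per_second(access_points, ssids_per_ap=SSIDS_PER_AP,
                       interval_s=BEACON_INTERVAL_S):
    """Beacons per second expected from a group of APs sharing one channel."""
    return access_points * ssids_per_ap / interval_s


def access_points_needed(observed_rate, ssids_per_ap=SSIDS_PER_AP,
                         interval_s=BEACON_INTERVAL_S):
    """How many audible APs it takes to produce the observed beacon rate."""
    return observed_rate * interval_s / ssids_per_ap


print(f"One AP: {beacons_per_second(1):.1f} beacons/s")
print(f"APs needed for 220 beacons/s: {access_points_needed(220):.1f}")
```

Under those assumptions a single access point accounts for just under 50 beacons per second, and 220 per second needs between four and five access points in earshot on the channel, which squares with what the capture device could plausibly hear.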

This answers part of the problem as to why there were so many frames from an ‘unused’ wireless network. Now we just needed to answer the original question we came on site for – why so much RF utilisation?

Step 4 – Punch some numbers into a calculator

To understand the nature of the problem we need to understand a bit about data rates and 802.11 networking. There are a number of bit rates defined for 802.11 networking and clients will choose a bit rate based on signal strength, configuration of the base station, and other things.

The important thing to note here is that all management traffic is sent at the lowest bit rate supported by the base station. In this case that would be … 1 Mbps.

A 1 Mbps bit rate gives you typical data throughput speeds of about 500 kbps.

Let’s go back to those figures again:

  • 220 beacon frames per second;
  • 258 bytes each;
  • multiplies out to 454,080 bits per second;
  • Typical throughput for 802.11b at 1 Mbps is about half a megabit per second … which would be about 500,000 bits per second.

BINGO!

We now can account for 80-90% RF utilisation figures based on beacon frames alone. All of these marry up more or less and so now we understand the problem.
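
The back-of-envelope sum, written out as a small Python sketch. It uses the figures quoted in this post and the same rule of thumb (about 500 kbps of usable throughput at a 1 Mbps bit rate); it is not a proper airtime model, and the names are invented for illustration.

```python
BEACONS_PER_SECOND = 220             # approximate figure from the capture
BEACON_SIZE_BYTES = 258              # beacon size seen in the capture
TYPICAL_THROUGHPUT_BPS = 500_000     # rule of thumb for a 1 Mbps bit rate


def beacon_load_bps(beacons_per_second, frame_bytes):
    """Bits per second spent carrying beacon frames."""
    return beacons_per_second * frame_bytes * 8


load = beacon_load_bps(BEACONS_PER_SECOND, BEACON_SIZE_BYTES)
utilisation = load / TYPICAL_THROUGHPUT_BPS
print(f"Beacon load: {load:,} bps, about {utilisation:.0%} of usable throughput")
```

The load works out to 454,080 bits per second, or about 91% of the half-megabit rule of thumb, which is the same region as the channel utilisation figures the WCS console was reporting.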

Recommendation for tech•ed

There are a few very logical outcomes from this exercise that provide ‘easy wins’.

  1. We will turn off all advertised SSIDs except for MicrosoftEvent;
  2. We will get GCCEC to make their corporate network’s WLAN access require a probe request so it is not causing another SSID to be advertised;
  3. We will disable 802.11b at the event (sorry to all of you with an iMate Jamin, but it might be time for an upgrade! :));
  4. We will up the basic rate to 18 Mbps. This alone should mean management traffic takes up roughly 1/18th of the airtime that it did before.
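
To get a feel for what the recommendations should buy us, here is a rough Python projection using the same simple model as the calculator step: beacon airtime scales with the number of advertised SSIDs and inversely with the basic rate. It deliberately ignores preambles and other fixed per-frame overhead, which do not shrink as the data rate goes up, so treat the result as optimistic. The function name and the 0.91 starting point (taken from the sums above) are illustrative assumptions.

```python
def projected_utilisation(current, ssids_before, ssids_after,
                          rate_before_mbps, rate_after_mbps):
    """Scale beacon utilisation by the change in SSID count and basic rate.

    Simplification: fixed per-frame overhead is ignored, so the real
    figure will be somewhat higher than this estimate.
    """
    ssid_factor = ssids_after / ssids_before
    rate_factor = rate_before_mbps / rate_after_mbps
    return current * ssid_factor * rate_factor


before = 0.91  # roughly what the beacon arithmetic above accounts for
after = projected_utilisation(before, ssids_before=5, ssids_after=1,
                              rate_before_mbps=1, rate_after_mbps=18)
print(f"Beacon utilisation: {before:.0%} before, about {after:.1%} after")
```

Even allowing for the overhead this model leaves out, going from five advertised SSIDs at 1 Mbps to one SSID at 18 Mbps should take beacon traffic from dominating the channel to a small fraction of it.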

Hooray! Beers all around at Q1 … not quite

In a case of “solve one problem, find another” we unfortunately did uncover a fair few more issues while conducting the work above over a full day on-site. The main outstanding issue is that some of the radio interfaces in particular access points perform very poorly (3 Mbps typical throughput even on 802.11n). This is now our critical concern with the wireless at GCCEC and we will be scheduling another visit onsite as soon as possible to get to the bottom of these problematic radio interfaces. At this stage it looks like a real puzzler because we did some quick tests in the waning hours of our time on site and we received excellent performance from other radios of the same access point, and excellent performance from the edge switchports that service the access points.

We like a good mystery and hopefully solving this next one won’t bring up more problems so we can have those beers at Q1. 🙂

8 Comments
  1. There was an event in house at the time of the very top screen shot. The day after they left things went back to normal.

  2. @Nathan: Negatory – those top two shots were taken by yours truly in May. They are not the ones we collected last week.

  3. So what was pumping out “220 beacon frames per second”? The SSID broadcasting @ 10 frames per second = 22 SSIDs?

    So what if you turned off the SSID broadcasting altogether? This seems like a pretty big design flaw, glad you’re sorting them out David.

    I was at a US event last year where the DHCP server just stopped handing out IP addresses so only the first few hundred people got access. Will this network handle 3000+ devices?

  4. @John: 10 frames per second * 5 SSIDs on one base station = 50 beacons from one base station. Given the density of access points in a given area, other access points on the same channel were doing the same, so you then multiply the 50 by the number of access points on a given channel in a given area, less some for collisions etc. The 220 was an approximate figure for one sample. We got a lot of different figures from different areas of the venue on the given day… but you can see where the 220 came from. It was made worse by having such a low minimum data rate on the network.

    Each ap MUST broadcast a beacon as the beacon does more than advertise an SSID (Service Set IDentifier). The beacon tells clients stuff like: a) the ap exists :), b) supported data rates, and c) lots of other capability information. Only one guest SSID can be advertised per beacon so the more public SSIDs, then the more beacons that must be sent.

    The model for the event (as per the recommendations above) is that there will be one public SSID only and that will be advertised in the beacon from each ap. That is the minimum we can get away with while still having wifi work.

    The issue above you mention sounds like a classic case of small DHCP scope or lease times too long. We had many thousands of active leases at TechEd last year and it is certainly something that is on my mind from a capacity standpoint.

    In regards to supporting 3000+ clients at once – that is going to be problematic as the design goal for the GCCEC WLAN was only 1000 concurrent users. I’ll cover that off in another post in the coming week or so.

  5. Excellent writeup, and great analysis.

    Particularly agree with this point:

    >getting traction from people in resolving problems is
    >sometimes hard, especially when people invariably have a
    >foregone conclusion as to a root cause in their minds

    cheers
    lb