You will recall that previously we talked about the methodology we used to diagnose why the RF utilisation at GCCEC was so stratospherically high relative to the actual wifi network utilisation and number of associated clients. In the last moments of that day on site we observed a few anomalies in real-world network performance. Given that the wireless infrastructure is state of the art and was one of the first “enterprise” deployments of 802.11n in Australia 12 months ago – this was odd and definitely warranted further investigation before the event, even if only to find there wasn’t a problem at all.
We left Brisbane bright and early on Thursday the 16th of July to spend a day with the guys from GCCEC to get to the bottom of this latest issue.
Step 1: Knowing what the first step is!
Random reboots, firmware updates, and twiddling of settings do not constitute logical fault finding. If you think there is a problem with a system there is only one place to start, and that is producing a clinical and unambiguous statement of the problem at hand. We almost always work in tricky multi-supplier scenarios at these events, so it is really important to reduce the problem to a clinical document we can all look at – that way you’re not fingering individuals or companies, but rather working together with all parties to nail the main pain.
With this in mind we set out allocating the first half of our day to an entirely tedious but necessary exercise of conducting a complete access point by access point, radio interface by radio interface survey of the venue (by hand :(). The goal of the survey was not to measure the RF characteristics of the wireless network (been there, done that) but rather the real-world throughput characteristics.
Step 2: Sanity check the fibre/copper network and backhaul before looking at the wifi performance
GCCEC’s copper and fibre network is a relatively modern star topology of a single gigE fibre core servicing a number of 10/100 copper edge switches with fibre backhaul. This is good as exhibition venues that have grown over time tend to have a lot of zany and miscellaneous long fibre runs with random Vendor X fibre transceivers all over the place making you unsure of what exactly it is that you’re measuring. The GCCEC network on the other hand is simple to test definitively due to a sane overall architecture.
Additionally, the MDF that houses the core features in-room Foxtel. I could have done with that in the basement of the Sheraton Mirage at Port Douglas for last year’s Australian Partner Conference.
Testing Rig @ MDF/network core
We deployed a high performance x64 machine with 4GB of RAM and IIS7 directly connected to the core switch via 1000BaseTX copper. We deployed several files to this IIS7 server being 1 megabyte, 10 megabytes, 50 megabytes, and 100 megabytes in size.
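The test files themselves are nothing special – just fixed-size payloads for the web server to hand out. A minimal sketch of generating equivalent dummy files (ours were simply dropped into the IIS7 web root; the file names here are illustrative only, not what we actually used):

```python
import os

# Sizes of the test payloads, in megabytes (base 10, matching how
# network throughput is quoted).
SIZES_MB = [1, 10, 50, 100]

def make_test_files(directory="."):
    """Write fixed-size dummy files like test-10mb.bin for download tests."""
    names = []
    for size in SIZES_MB:
        name = f"test-{size}mb.bin"
        with open(os.path.join(directory, name), "wb") as f:
            # One megabyte here is 1,000,000 bytes.
            f.write(b"\x00" * (size * 1_000_000))
        names.append(name)
    return names
```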
Consistency in client testing
We used the same clunky Lenovo (waves at the other David :)) for all client tests to ensure that we were always using a consistent platform on both ends of the equation. This slowed us down somewhat because it meant we could not split up to make things faster, but I felt strongly that being able to say that all results have as high a degree of consistency as possible was very important. We configured the Lenovo for maximum performance (read: maximum power wastage) on all interfaces to ensure that results would not be inconsistent if we were to conduct some tests off battery versus off mains. Wherever possible we used mains power for tests.
We used a simple script with wget downloading the test files from the testing rig in the MDF three times. We recorded these results together with some other data at each test point (more on that later).
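Our actual script was just wget in a loop; a rough Python equivalent of the same measurement is below (the server address and file name in the comment are placeholders, not the real test rig’s):

```python
import time
import urllib.request

def timed_download(url, runs=3):
    """Download url `runs` times and return the throughput of each run in mbps."""
    results = []
    for _ in range(runs):
        start = time.monotonic()
        data = urllib.request.urlopen(url).read()
        elapsed = time.monotonic() - start
        # bits transferred / seconds / 1,000,000 = megabits per second
        results.append(len(data) * 8 / elapsed / 1_000_000)
    return results

# e.g. timed_download("http://10.0.0.1/test-10mb.bin")
```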
Validating the test rig and core
A Lenovo notebook was directly connected to a port on the core switch via 1000BaseTX copper. We did several transfer tests between the test rig and the Lenovo, receiving approximately 800mbps of real-world throughput, which is pretty good for a notebook client any day of the week – and certainly sufficient for our tests.
Validating the edge switch at the first test point
The last step before heading off on the Tedium Express was to test the first edge port that services the first access points we were to test. Again, this was a case of running the test scripts from the Lenovo directly connected to the edge switch on the same VLAN as our private test network. These tests reported 99.99% of the maximum throughput one would expect from a 100 meg link – so that was certainly very good.
This validated the performance of the copper and fibre network as satisfactory.
The next step was to confirm whether there was a systemic issue with wifi performance (as per my hunch) or whether there was a simple problem such as a loose TNC connector on one of the two APs we saw as problematic during the last exercise.
Anyone who has worked with me knows that I’m a bit like an old dog with a bone when it comes to pushing for a quality outcome and ensuring that things do what they say on the tin. Given that I tend to push damned hard for what I think is the right thing to do – it also means that I was experiencing some trepidation about the survey, as I’d possibly have some explaining to do regarding billable time if my hunch proved to be false.
Step 3: Conduct the wireless survey
The wireless component of the GCCEC network is comprised of the following elements:
- Cisco 4402 wireless lan controller
- 50 x Cisco 1252 abgn wifi access points
- Each access point has a 2.4GHz and 5GHz radio interface – you can think of these two radios as being akin to having two separate Ethernet ports in your PC … if you’re troubleshooting they BOTH need testing.
Nathan at GCCEC did a wonderful job of building us a mobile testing rig out of a road case, trestle table and ratchet straps, plus onboard GPOs, a 50m power lead and anti-slip mats! Nath – you’re a legend! Christened the Tedium Express, this was our working environment as we tested each AP.
We wanted to ensure that we could conduct a single sweep of the entire venue and not have to go back out to reassess it as we only had funding clearance for a day on site. With this in mind we prepared a comprehensive spreadsheet listing:
- each copper MAC address
- each base radio MAC address
- the associated IDF and edge switch port
The process we used was to park the mobile rig under (or as close as possible to) the antennas under test and collect:
- the advertised radio link speed
- the current channel allocation
- three real-world download performance data points
- repeat steps 1-3 for 2.4GHz and 5GHz.
This meant 50 access points * 2 radios * 5 data points = 500 collected values.
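The bookkeeping for this is simple enough to sketch. The field names below are my own invention, but the structure mirrors the spreadsheet and per-radio data points described above:

```python
APS = 50
RADIOS_PER_AP = 2          # 2.4GHz and 5GHz interfaces per AP
POINTS_PER_RADIO = 5       # link speed, channel, three download runs

def survey_template(ap_count=APS):
    """Build an empty survey table: one record per radio interface."""
    rows = []
    for ap in range(1, ap_count + 1):
        for band in ("2.4GHz", "5GHz"):
            rows.append({
                "ap": f"AP-{ap:02d}",
                "band": band,
                "link_speed_mbps": None,
                "channel": None,
                "downloads_mbps": [None, None, None],
            })
    return rows

rows = survey_template()
# 50 APs x 2 radios = 100 interfaces, 5 data points each = 500 values
assert len(rows) * POINTS_PER_RADIO == 500
```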
The first dozen or so radios took nearly an hour and a half to collect the data from, primarily because the Lenovo was taking so long to switch between 2.4GHz and 5GHz radios. We therefore chose to survey the venue twice, once at 2.4GHz and a second time at 5GHz. Despite having to cover twice the distance we ended up doing the collection much more quickly as we weren’t waiting on client WLAN configuration.
Out of interest we chose the Lenovo as the client test rig as the ThinkPad connections software allows you to view a series of access points listed by MAC address and then choose a MAC address to associate to as your preferred access point. Given that we had a pre-prepared list of all of the access points in the venue together with their MAC addresses, this made validation of what we were testing at any point a snap.
Also, David Cormack from CBO was helpfully driving the wlan controller software and printed installation maps to further validate what we were doing. The whole process worked pretty well but the ‘law of big numbers’ meant that it was going to take a while no matter what – there were 100 interfaces to survey and that will never happen in an hour.
The entire survey took from approximately 0830 to 1430 to complete with four people.
I see dead radios
During the course of the survey we found the 5GHz radio in AP-24 to be faulty. It was extremely difficult to obtain an association with the 5GHz interface and when we did the radio link speed was 9mbps and real world download speeds were 3-4 kilobytes (no, that is not a typo) per second. We had to bench it to work out what was going on.
I don’t have a problem with heights per se as I’m fine on the top of a mountain or on a high rise balcony or observation deck. What I do have a problem with is these damned scissor lifts that are rickety and dodgy and wobbly. If that is not bad enough they actually can drive around when you’re 10m in the air.[youtube XDyguoeQS60]
I edited out the bit where we started percussive maintenance (I’m talking about the AP, not your forehead David Cormack :)). As an aside, it is a pretty good view of the hall from up there. During the video you’ll hear a faint rasping sound … that is the GCCEC staff scraping every last skerrick of masking/gaff tape off the raw concrete floor to ensure the venue is spotless for the next show. Keep this view in mind when you’re walking around the nice carpeted expo halls with fancy games and Xboxes and other stuff … innumerable venue and staging guys go to an amazing amount of work to transform that concrete shell into the event you know and love. While we did our survey they blacked out the entire hall to do metered comparisons of a new super-bright ‘green’ compact fluorescent lamp, to ensure it matched the existing metal-halide units for brightness and colour temperature. In short – there is so much that goes into the minutiae of these events that hours and hours of labour go into light globe selection alone, let alone what the technology team does.
Here’s another view of the area tech•ed will consume from ground level (opposite end of the hall from the previous video):[youtube GEfU4vwUKU0]
Step 4: Analyse the results
I digress … where was I? That’s right – the survey.
So after a couple of laps of the venue we came up with this:
Some of you might see quoted figures of 2 megabytes or 5 megabytes per second and think that is pretty good for wireless, and truth be told, if you were getting sustained rates like that out of some dinky home router I’d probably agree. Our expectations are somewhat higher, though, and you have to remember that since we’re really the only people at the venue there are only two acceptable performance outcomes:
- saturation of the RF
- saturation of the edge port
The results paint an entirely different picture though. On average, the RF segments of the network are only providing 17.80% of associated RF link speed on 2.4GHz and 16.14% of associated RF link speed on 5GHz.
The other problem with these results is that they are highly inconsistent: across the three runs at a given test point, individual results could vary by a factor of two in either direction.
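The efficiency figure is just measured throughput divided by the advertised link speed at the time of the test. A quick sketch, using made-up sample numbers rather than the actual survey data:

```python
def rf_efficiency(link_speed_mbps, downloads_mbps):
    """Return mean measured throughput as a percentage of the RF link speed."""
    mean = sum(downloads_mbps) / len(downloads_mbps)
    return 100.0 * mean / link_speed_mbps

# Hypothetical example: a 130mbps link delivering ~20mbps real-world
# throughput across three runs works out to roughly 15% efficiency,
# the same ballpark as the survey averages.
eff = rf_efficiency(130, [18.0, 22.0, 20.0])
```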
So it turned out that my initial hunch from the last day on site was correct … and we have a serious wifi performance problem on our hands. Ruh Roh!
Step 5: Find the culprit
By this stage we had chewed up most of the day. We had ascertained:
- there was a serious performance problem with wifi and we had quantified it in a clear and clinical way as was our original intention.
- that the problem lay between the access points and the wireless lan controller, or possibly the way the wireless lan controller terminated in the venue core switch.
While this was progress, we really wanted to provide more definitive findings that would further narrow down the source of the problem. I must admit I was very disappointed at the prospect of leaving the venue with only problems identified and no positive prescriptive advice.
Further isolation testing
We had AP-24 in hand so we retreated to the centre of the venue and proceeded to shut down each of the wireless access points. After doing this we patched AP-24 into the wireless lan controller VLAN and reconfigured it in 2.4GHz 802.11n mode (turning off its flaky 5GHz radio altogether). We re-ran the same test suite and found performance to be the same as any of the other access points in the venue during our initial survey. This was helpful as it showed us that the problem was not in some way related to load on the wireless controller or the number of access points talking to the controller.
Reflashing AP-24 as autonomous
You will recall from the previous article that we discussed how the access points in the network are running in lightweight mode and so can only run in conjunction with the wireless LAN controller (see this if you’re interested: http://www.cisco.com/en/US/products/hw/wireless/ps430/products_qanda_item09186a00806a4da3.shtml). We decided it would be nothing if not informative to ‘downgrade’ the access point from lightweight mode to autonomous mode.
Autonomous mode firmware gives the access point a more complete IOS feature set and allows it to bridge the 802.11 (wifi) and 802.3 (Ethernet) networks without a controller. After some mucking around we re-flashed AP-24 into autonomous mode and directly connected it to our test VLAN, bypassing the controller VLAN altogether.
With a link speed of 144 mbps you can expect approximately half of that in terms of real-world download speed, which would be 72mbps in this case. Let’s do some sums. Remember that hard drive sizes and network throughput are universally measured using base 10 arithmetic (1 megabyte is 8,000,000 bits), while file sizes on disk, memory consumption and so on are measured using base 2 arithmetic (1 megabyte is 8,388,608 bits) … I know, I know, I could use the IEC binary prefixes and distinguish mebibytes from megabytes, but most people reading this would have no idea what I was talking about.
Anyway, ignoring framing overhead and so on to make this easier:
- 72 x 1000 x 1000 = 72,000,000 bps
- 72,000,000 / 8 / 1024 / 1024 = 8.5ish megabytes of files downloaded per second.
Above we see 9.8, 8.9, and 9.1 as the figures from the test of AP-24 in autonomous mode. This is what I would call EXCELLENT!
We made a lot of progress in the day on-site.
- We proved and documented that there is a systemic problem with wifi performance at the venue.
- We found and removed a dead AP.
- We proved that the edge and core networks are functioning as expected.
- We proved via our first isolation test that the poor wifi performance was not related to channel/RF interference.
- We proved that the access points provide excellent performance in autonomous mode.
In short, we turned a hunch into a known and well documented problem.
We provided a written report of our findings back to the venue, who have taken it to the installers to get the core issue looked at. The installers have basically said that the issue will be resolved ‘no matter what’ – thank you guys, I love commitment.
We’ll provide a further update when we know definitively what the root cause was and when the matter is corrected to our satisfaction.
Don’t worry, we have a plan B if the above doesn’t get resolved. Fortunately for you this means you will get the best performance possible at the event. Unfortunately for me it means that damned scissor lift and reflashing 50 access points to autonomous mode with individual configurations before the event, and back again afterwards. Hmmm David Eagles; What are you doing this time next month? 😀