We mentioned in a previous post (see BitTorrent, traffic shaping and trusting users) that we had a small number of users who were unfairly monopolising network resources in order to download files via BitTorrent. The whole thing was personally a bit sad: a depressing display of the worse parts of human nature taking advantage of our deliberately liberal and generous policies on network access. We’ve been running the network the same way since 2004 and this is the first time we’ve seen people take advantage of it this way.
We don’t shape the traffic on the network at all, nor do we see it as our role to police or restrict delegate use of the network (a policy that will, no doubt, change for tech•ed 2010 based on last year). Given this, we had to do some seat-of-our-pants work to identify the main culprits and then implement counter-measures to at least lessen their impact on the delegates who were at the event to learn and share (rather than download). The back story is interesting, as it involves a collaboration between us and the team at Microsoft who own the ipnat.sys driver we were using to do the network address translation.
Where is all that data going?
The first thing we noticed was a couple of brief periods during the day where a very small number of clients (maybe 5 or 10) would experience no Internet connectivity. This was odd:
- the network was happily switching a couple of hundred megabits per second of data (that’s the Internet link, not the core).
- there was plenty of CPU and other resource headroom on the servers in question.
- nothing was showing up as unusual on the core switches.
- performance tests carried out in network operations showed plenty of excess capacity.
- we had done all of our sums on port and CPU requirements up front, and the RRAS team had run a simulation for us (see: http://www.techedbackstage.net/2009/08/05/windows-server-2008-r2-nat-performance-guest-post-by-the-windows-team/), so we had done our homework.
I was puzzled. I don’t like fault reports of any kind, and while the number of affected users was small, we really did need to get to the bottom of it. There was also the matter of the amount of data we were pulling from the Internet. We normally run tech•ed with a 100–150 Mbps link with headroom to spare – but this year we were seeing sustained peaks at twice the historical norm. Odd and uncharacteristic.
The quickest way to get to the bottom of anything like this is to point Wireshark at the network and see what is happening. As the network is switched, we need to set one of the ports on the core into SPAN mode (Switched Port ANalyser mode), which instructs the switch to mirror everything on a given VLAN or port out a nominated destination port, regardless of whether it is actually destined for that port. Once you have all of that data, Wireshark does a fantastic (if sometimes slow) job of visually breaking the data down into the corresponding protocols so you can see what is going on. The “Top Talkers” feature in Wireshark is awesome, as it quickly lets you identify which clients are consuming the bulk of the traffic.
This quickly showed us two things:
- A small number of clients were causing a lot of data utilisation.
- The protocol was BitTorrent.
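For anyone wanting to reproduce that kind of analysis without Wireshark, here is a minimal sketch (assuming the scapy package and a hypothetical capture file saved from the SPAN port) that ranks private-side clients by bytes seen:

```python
# Rough "top talkers": total bytes seen per private-side client, counting
# both directions. Assumes the SPAN'd traffic was saved to capture.pcap.
import ipaddress
from collections import Counter
from scapy.all import IP, rdpcap

bytes_by_client = Counter()
for pkt in rdpcap("capture.pcap"):  # hypothetical capture filename
    if IP not in pkt:
        continue
    for addr in (pkt[IP].src, pkt[IP].dst):
        if ipaddress.ip_address(addr).is_private:
            bytes_by_client[addr] += len(pkt)  # frame length in bytes

# The ten chattiest clients, Wireshark "Top Talkers" style.
for client, total in bytes_by_client.most_common(10):
    print(f"{client:15s} {total / 1_000_000:8.1f} MB")
```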
At this point you have to remember that we have a heap of bandwidth available. Some clients chomping through a lot of it isn’t a problem, and running BitTorrent isn’t a problem per se. The aforementioned work on port utilisation planning was already done – but I had not checked it empirically, and so that was the next step.
We pulled up the NAT table for the relevant interfaces in RRAS and the problem was immediately obvious. The table was HUGE (approaching 65K ports). The RRAS machine was experiencing port exhaustion.
Let’s step back and look at the issue from first principles to understand what is going on. Like any corporate network, tech•ed 2009 runs on private address space. Private address space is special IP address space set aside by IANA (in RFC 1918: 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16) and intended solely for IP networks that do not route directly to and from the global Internet.
In order to provide Internet access at the event, we use Network Address Translation on the default gateway. The gateway has two network interfaces:
- one facing the private address space (and the IP address of this is your default route if you do a netstat -r on your PC).
- the other network interface is directly connected to our IP transit provider (in this case it was Telstra for delegates and Over The Wire for staff and speakers) and this interface has a normal public IP address on it.
The following happens when your client tries to send a packet to an Internet destination:
1. Your local IP stack sends the packet to the default gateway specified in your current IP config.
2. The NAT server receives this and pretends that it is going to route it to the Internet.
3. The NAT server alters the contents of the packet so that the source of the packet is the NAT server’s public IP address and possibly a different IP port on that public interface.
4. When doing step 3, the NAT server makes a note of what it has done in what is called a translation table.
5. The NAT server sends the packet on its way to wherever it is meant to go.
6. A reply from the Internet host will come back to the NAT server on the public IP and port that the server provided.
7. The NAT server alters the contents of that packet to be destined for your private IP address and port and then sends the packet to you.
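To make the translation table concrete, here is a toy model of the bookkeeping (a minimal sketch for illustration, not how ipnat.sys is actually implemented):

```python
# Toy NAT translation table: maps each allocated public port back to the
# private (client IP, client port) it was issued for. A real NAT also keys
# on protocol and remote endpoint, and has at most ~65K ports per public IP.
import itertools

PUBLIC_IP = "203.0.113.10"           # example public address (RFC 5737 range)
_next_port = itertools.count(49152)  # hand out ports from the ephemeral range

table = {}  # public_port -> (private_ip, private_port)

def translate_outbound(private_ip, private_port):
    """Rewrite an outbound packet's source and remember the mapping."""
    public_port = next(_next_port)
    table[public_port] = (private_ip, private_port)
    return PUBLIC_IP, public_port

def translate_inbound(public_port):
    """Rewrite an inbound reply's destination using the remembered mapping."""
    return table[public_port]

print(translate_outbound("10.0.1.57", 50123))  # ('203.0.113.10', 49152)
print(translate_inbound(49152))                # ('10.0.1.57', 50123)
```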
So the upshot of the above is that you can ‘hide’ a thousand people behind a few public IP addresses. You don’t need to obtain a public IP address for each user, which is a good thing as IPv4 address space is running out. There are a few limitations on this scheme:
- There is a limit to how much you can store in the translation table. The translations need to be available to kernel mode device drivers and so cannot be swapped out to virtual memory – this means they reside in Nonpaged Pool in the NT kernel (an area of memory that is always resident). Each mapping consumes around 256 bytes of Nonpaged Pool. Assuming 15 mappings × 2,000 clients = 30,000 translation table entries, that’s about 7.5 megabytes out of a Nonpaged Pool of, say, 256 MB. Not a problem (the sketch after this list runs these numbers).
- There is a limit to how much CPU grunt the machine has. The CPU time spent on translation table lookups and re-writing packets on the way in and out of the NAT quickly adds up when you’re talking about thousands and thousands of packets per second.
- IP ports are represented as two bytes, so at most 65,536 TCP ports and 65,536 UDP ports are available on a given IP address. The actual number available is a bit lower, as the server will have a number of ports allocated for its own use depending on what is installed.
- RRAS, we found, only uses the machine’s base IP address for the outside of the NAT. It will not use additional IP addresses in the public address pool, no matter how many addresses are in that pool.
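Running the numbers from the list above shows which limit bites first. A quick back-of-the-envelope sketch (the per-entry cost and client counts are the planning estimates from above, not measured values):

```python
# Back-of-the-envelope: memory is cheap here, ports are not.
ENTRY_BYTES = 256          # rough Nonpaged Pool cost per mapping
CLIENTS = 2000             # planning assumption
MAPPINGS_PER_CLIENT = 15   # planning assumption

entries = CLIENTS * MAPPINGS_PER_CLIENT
print(f"Nonpaged Pool used: {entries * ENTRY_BYTES / 2**20:.1f} MB")  # ~7.3 MB

# But RRAS NATs everything through a single public IP, so all clients
# share one 16-bit TCP port space (less the ports the server itself uses).
TCP_PORTS = 65536
print(f"average TCP ports per client at exhaustion: {TCP_PORTS // CLIENTS}")  # 32
```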
Analysing the ipnat.sys address translation table
The wireless network only had a maximum capacity of 1500 users and we only hit that peak very briefly. Even if we were at that peak, surely it cannot be the case that all users on the network are using an average of 40+ TCP ports in the NAT translation table?! I am currently running Skype, MSN Messenger, Google Talk, WinAMP (streaming 256kbps Internet radio), Chrome (16 tabs including this, two Google Apps instances, Salesforce.com, YouTube and a bunch of other reference material for this post), VPN to our intranet + whatever Windows is doing itself. I have only 20 ports open to Internet destinations that traverse the NAT in our office. There is no way the average user is going to be using double my ‘power user’ scenario.
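If you want to sanity-check your own footprint the same way, here is a minimal sketch (assuming the cross-platform psutil package) that counts this machine’s connections to public addresses, i.e. the ones that would each hold a NAT mapping:

```python
# Count established connections from this machine to public Internet hosts;
# each of these would occupy one entry in a NAT translation table.
import ipaddress
import psutil

nat_mappings = 0
for conn in psutil.net_connections(kind="inet"):
    if not conn.raddr:  # skip listening sockets (no remote address)
        continue
    remote = ipaddress.ip_address(conn.raddr.ip)
    if not (remote.is_private or remote.is_loopback):
        nat_mappings += 1

print(f"connections that would consume a NAT mapping: {nat_mappings}")
```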
We really needed to get our hands on the translation table to analyse what was going on. “No problem!”, I think, “I’ve done that heaps of times!” Unfortunately, what I did not realise was the following:
- There is no way to export the translation table from the management snap-in.
- ipnat.sys does not have a WMI provider.
- ipnat.sys does not have any native and easily accessible Win32 APIs.
- The only way to pull the table was via its MSRPC (Microsoft Remote Procedure Call) interface.
The Windows RPC interface is architecturally … shall we say … “dated”, to be polite. It is overly complex and not remotely easy to write software against (especially when you consider how simple our requirement was). We started to write an extraction utility, but it was getting pretty late and you can’t really keep working into the wee hours of the morning when you have to be on site at 6:30am. We handed the code off to the owners of ipnat.sys, as they said they had plenty of steam left (being in India), so they would finish it off for us.
In the meantime, we implemented certain, ahem, ‘interim countermeasures’. We quickly built a list of all of the top torrent trackers around and got the nod from Jorke to add them all to the local DNS resolver and point them at a local web server containing some RickRoll scripts.
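For the curious, the mechanics look something like the sketch below: for each tracker hostname you create an authoritative zone on the local DNS server and point the zone apex at the web server hosting the payload. The hostnames and IP address here are placeholders, and I’m assuming the standard Windows dnscmd tool rather than showing our actual script:

```python
# Emit dnscmd commands that sinkhole a list of tracker hostnames by
# creating a local authoritative zone for each and answering every
# lookup with the address of the web server hosting the RickRoll page.
SINKHOLE_IP = "10.0.0.80"  # placeholder: the local web server

trackers = [
    "tracker1.example.org",  # placeholder hostnames, not the real list
    "tracker2.example.net",
]

for host in trackers:
    # Create an authoritative primary zone for the tracker's name...
    print(f"dnscmd /ZoneAdd {host} /Primary /file {host}.dns")
    # ...and point the zone apex at the sinkhole address.
    print(f"dnscmd /RecordAdd {host} @ A {SINKHOLE_IP}")
```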
It killed me that I didn’t see anyone getting done by this first hand, but there were hundreds of impressions in the server logs for the Rick Roll scripts, so I did get a fair amount of satisfaction at least. It was the most evil of evil Rick Roll scripts too – worse than any that anyone has used to get me in the past.
The next morning we found that the ipnat.sys developer (being in India) had suffered all sorts of Internet and power problems overnight and was unable to finish the utility. With a new day ahead of us, we completed the utility ourselves (thanks Paul – using software engineers as network admins has some benefits :)) and so were able to pull the translation table via MSRPC and dump it out to text.
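From there the analysis is plain text wrangling. A minimal sketch, assuming a hypothetical dump format of one mapping per line with the private client IP in the first column (our real dump format differed):

```python
# Tally translation-table entries per private client IP from a text dump.
# Assumed format: "private_ip private_port public_port remote_ip remote_port"
from collections import Counter

entries_per_client = Counter()
with open("nat_table.txt") as dump:  # hypothetical dump filename
    for line in dump:
        fields = line.split()
        if fields:
            entries_per_client[fields[0]] += 1

# Sorted ascending, this gives the distribution curve described below.
for client, count in sorted(entries_per_client.items(), key=lambda kv: kv[1]):
    print(f"{client:15s} {count:5d}")
```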
The results were revealing and really made me want to go on a rampage. We were indeed hitting port exhaustion, and the distribution curve of who was using what looked like this:
As soon as you see that – it is a no-brainer. The right-hand side of that graph represents a small number of users who were using 700-800 translation table entries each at the time. We had a one minute sample file that showed a particular individual using 2500 translation table entries! Argh!
So we scheduled the script to run each minute to generate a list of offending MAC addresses. It took a few goes to get the analysis right, but we ended up generating a ‘naughty factor’ based on the number of port mappings, the number of distinct hosts on the other end of those mappings, idle time, and so on, giving us a number between 0.0 and 1.0: 1.0 meant you were very naughty, 0.0 meant you were very good. We reasoned that if you had a lot of mappings, a large proportion of them to many distinct remote hosts, and largely not idle, you were probably a Torrenter. OTOH, if you had, say, 20 connections open to a single host or a small number of hosts, that is probably quite fine.
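The shape of the scoring is simple enough to sketch. A rough illustration with made-up weights and normalisation constants (the real scoring was tuned against our sample files):

```python
# Illustrative "naughty factor": 0.0 = very good, 1.0 = very naughty.
# Weights and thresholds below are invented for illustration only.
def naughty_factor(mappings, distinct_hosts, idle_fraction):
    many_mappings = min(mappings / 500, 1.0)     # lots of table entries
    many_peers = min(distinct_hosts / 200, 1.0)  # spread across many hosts
    busy = 1.0 - idle_fraction                   # mappings mostly active
    return many_mappings * 0.4 + many_peers * 0.4 + busy * 0.2

print(naughty_factor(mappings=800, distinct_hosts=750, idle_fraction=0.1))  # ~0.98
print(naughty_factor(mappings=20, distinct_hosts=1, idle_fraction=0.2))     # ~0.18
```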
These scripts output a list of bad MACs, which we then dropped into a block list on the core switches. The logic proved to be quite sound, as this is what happened when we blocked a couple of dozen users with a high naughtiness factor:
And there you have it: the culprits fingered and booted off the network. Of course, they then just changed their MAC addresses, at which point they were re-identified as soon as their utilisation crept up, and the new MAC was banned.
So, for this year, the users who did the above have driven me to recommend the following at this Friday’s first technology team meeting at Microsoft Brisbane:
- Opt-in “allow list” basis for access to the network: you will need to register your MAC in CommNet if you want access, and we will apply a quota across all of your devices; and/or
- Mandatory rate limiting on a per MAC basis across all users; and/or
- Packeteer/Allot/etc based deep packet inspection and traffic shaping.
It’d be better if we could provide peak throughput to any given person at any time should they need it – but the above shows that a small number of people can ruin the experience for a large number of people. My annual argument that users will respect the network resources and behave sensibly will no longer wash with the rest of the team.