Never gonna give you up!

We mentioned in a previous post (see BitTorrent, traffic shaping and trusting users) that we had a small number of users who were unfairly monopolising network resources in order to download files via BitTorrent. The whole thing was a bit sad for me personally, as I took it as a depressing display of the worse parts of human nature taking advantage of our deliberately liberal and generous policies on network access. We’ve been running the network the same way since 2004 and this is the first time we’ve seen people take advantage of it this way.

We don’t shape the traffic on the network at all and nor do we see it as our role to police or restrict delegate use of the network (a policy that will, no doubt, change for tech•ed 2010 based on last year). Given this, we had to do a bit of seat-of-our-pants work to identify who the main culprits were and then implement counter-measures to at least lessen their impact on the delegates who were at the event to learn and share (rather than download). The back story is interesting as it involves a collaboration between us and the team at Microsoft who own the ipnat.sys driver that we were using to do the network address translation.

Where is all that data going?

The first thing we noticed was a couple of brief periods during the day where a very small number of clients (maybe 5 or 10) would experience no Internet connectivity. This was odd:

  • the network was happily switching a couple of hundred megabits per second of data (that’s the Internet link, not the core).
  • there was plenty of CPU and other resource headroom on the servers in question.
  • nothing was showing up as unusual on the core switches.
  • performance tests carried out in network operations showed plenty of excess capacity.
  • we did all of our sums on port and CPU requirements and the RRAS team ran a simulation for us (see: http://www.techedbackstage.net/2009/08/05/windows-server-2008-r2-nat-performance-guest-post-by-the-windows-team/), so we had done our homework up front.

I was puzzled. I don’t like fault reports of any kind and while the number of affected users was small, we really did need to get to the bottom of it. There was also the matter of the amount of data we were pulling from the Internet. We normally run tech•ed with a 100-150 mbps link with headroom to spare – but this year we were seeing sustained peaks twice the historical norm. Odd and uncharacteristic.

tech•ed 2009 Internet link port utilisation

tech•ed 2009 Internet link port utilisation expressed as megabytes per second (base10; multiply by 10 to get a rough mbps figure)

The quickest way to get to the bottom of anything like this is to point Wireshark at the network and see what is happening. As the network is switched, we need to set one of the ports on the core into SPAN mode (Switched Port ANalyzer mode), which instructs the switch to mirror everything on a given VLAN or port out a nominated destination port, regardless of whether it is actually destined for that port. Once you have all of that data, Wireshark does a fantastic (if sometimes slow) job of visually breaking the data down into the corresponding protocols so you can see what is going on. The “Top Talkers” feature in Wireshark is awesome as it quickly lets you identify which clients are consuming the bulk of the traffic.
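The aggregation behind a top-talkers view is trivial once the capture is parsed. Here is a hypothetical Python sketch, assuming you have already reduced the capture to (source address, byte count) records, e.g. from a tshark CSV export:

```python
from collections import Counter

def top_talkers(packets, n=5):
    """Aggregate bytes per source address and return the n biggest senders.

    `packets` is an iterable of (src_ip, byte_count) tuples -- the input
    format here is an assumption for illustration, not Wireshark's own.
    """
    totals = Counter()
    for src, size in packets:
        totals[src] += size
    return totals.most_common(n)

# Example: three clients, one clearly dominating the capture.
sample = [("10.0.0.5", 1500), ("10.0.0.7", 60), ("10.0.0.5", 1500),
          ("10.0.0.9", 400), ("10.0.0.5", 1500)]
print(top_talkers(sample, n=2))
```

On our network the equivalent view made the handful of heavy clients obvious within seconds.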

This quickly showed us two things:

  1. A small number of clients were causing a lot of data utilisation.
  2. The protocol was BitTorrent.

At this point you have to remember that we have a heap of bandwidth available. Some clients chomping through a lot of bandwidth isn’t a problem and running BitTorrent isn’t a problem per se. The aforementioned work on port utilisation planning was already done – but I had not looked at that empirically and so that was the next step.

Port Exhaustion

We pulled up the NAT table for the relevant interfaces in RRAS and the problem was immediately obvious. The table was HUGE (approaching 65K ports). The RRAS machine was experiencing port exhaustion.

Let’s step back and look at the issue a bit from first principles to understand what is going on. Like any corporate network, tech•ed 2009 runs private address space. Private address space is special IP address space allocated by IANA and intended solely for IP networks that do not route directly to and from the global Internet.

In order to provide Internet access at the event, we use Network Address Translation on the default gateway. The gateway has two network interfaces:

  • one facing the private address space (and the IP address of this is your default route if you do a netstat -r on your PC).
  • the other network interface is directly connected to our IP transit provider (in this case it was Telstra for delegates and Over The Wire for staff and speakers) and this interface has a normal public IP address on it.

The following happens when your client tries to send a packet to an Internet destination:

  1. Your local IP stack sends the packet to the default gateway specified in your current IP config.
  2. The NAT server receives this and pretends that it is going to route it to the Internet.
  3. The NAT server alters the contents of the packet so that its source is the NAT server’s public IP address and, possibly, a different port on that public interface.
  4. When doing step 3, the NAT server makes a note of what it has done in what is called a translation table.
  5. The NAT server sends the packet on its way to wherever it is meant to go.
  6. A reply from the Internet host will come back to the NAT server on the public IP and port that the server provided.
  7. The NAT server alters the contents of that packet to be destined to your private IP address and port and then sends the packet to you.
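The steps above can be sketched as a toy translation table. This is heavily simplified and hypothetical — the real ipnat.sys also tracks protocol, remote endpoint and idle timers — but it captures the core mechanism and the failure mode we were about to hit:

```python
class NatTable:
    """Toy NAT translation table: maps (private_ip, private_port) to a
    public port on the gateway's single public IP address."""

    def __init__(self, public_ip, first_port=1024, last_port=65535):
        self.public_ip = public_ip
        self._ports = iter(range(first_port, last_port + 1))
        self.table = {}   # (priv_ip, priv_port) -> allocated public port

    def translate_out(self, priv_ip, priv_port):
        key = (priv_ip, priv_port)
        if key not in self.table:
            try:
                # Allocate the next free public port for this flow.
                self.table[key] = next(self._ports)
            except StopIteration:
                # No ports left: every new outbound flow now fails.
                raise RuntimeError("port exhaustion")
        return self.public_ip, self.table[key]

nat = NatTable("203.0.113.1")   # documentation-range address, not ours
print(nat.translate_out("10.0.0.5", 50000))  # ('203.0.113.1', 1024)
```

The point to notice is that every distinct outbound flow consumes one entry, and the pool of public ports is finite.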

So the upshot of the above is that you can ‘hide’ a thousand people behind a few public IP addresses. You don’t need to obtain a public IP address for each user, which is a good thing as IPv4 address space is running out. There are a few limitations on this scheme:

  1. There is a limit to how much you can store in the translation table. The translations need to be available to kernel mode device drivers and so cannot be swapped out to virtual memory – this means they reside in Nonpaged Pool in the NT kernel (an area of memory that is always resident). Each mapping consumes around 256 bytes of nonpaged pool. Assuming 15 mappings × 2000 clients = 30,000 translation table entries. That’s about 7.5 megabytes of nonpaged pool out of a pool of, say, 256 MB. Not a problem.
  2. There is a limit to how much CPU grunt the machine has. The CPU time spent making translation table look ups and re-writing the packets on the way in and out of the NAT quickly add up when you’re talking about thousands and thousands of packets per second.
  3. IP ports are represented as two bytes, so a maximum of 65,536 TCP and 65,536 UDP ports is available on a given IP address. The actual number available is a bit lower as the server will have a number of ports allocated for its own use, depending on what is installed.
  4. RRAS, we found, only uses the machine base IP address for the outside of the NAT. It will not use additional IP addresses in the public address pool, no matter how many IP addresses are in that pool.
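The arithmetic in point 1 — and why point 3, not memory, is the real ceiling — checks out in a few lines. The 256-byte figure is our estimate, not a documented constant, and the per-torrenter mapping count is illustrative:

```python
ENTRY_BYTES = 256                 # approximate nonpaged-pool cost per mapping
clients, mappings = 2000, 15
entries = clients * mappings      # 30,000 table entries
pool_mb = entries * ENTRY_BYTES / 2**20
print(round(pool_mb, 1))          # a few MB -- trivial against a ~256 MB pool

# Memory is not the constraint; the 16-bit port field is:
max_ports = 2**16                 # 65,536 TCP (and 65,536 UDP) ports total
torrenters = 80                   # illustrative figure
print(torrenters * 800 / max_ports)  # heavy clients alone nearly fill the space
```

A few dozen clients holding 700–800 mappings each can exhaust the port space long before well-behaved clients would ever notice memory pressure.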

Analysing the ipnat.sys address translation table

The wireless network only had a maximum capacity of 1500 users and we only hit that peak very briefly. Even if we were at that peak, surely it cannot be the case that all users on the network are using an average of 40+ TCP ports in the NAT translation table?! I am currently running Skype, MSN Messenger, Google Talk, WinAMP (streaming 256kbps Internet radio), Chrome (16 tabs including this, two Google Apps instances, Salesforce.com, YouTube and a bunch of other reference material for this post), VPN to our intranet + whatever Windows is doing itself. I have only 20 ports open to Internet destinations that traverse the NAT in our office. There is no way the average user is going to be using double my ‘power user’ scenario.

We really needed to get our hands on the NAT table to analyse what was going on with the address translation table. “No problem!”, I think, “I’ve done that heaps of times!” Unfortunately, what I did not realise was the following:

  • There is no way to export the translation table from the management snap-in.
  • ipnat.sys does not have a WMI provider.
  • ipnat.sys does not have any native and easily accessible Win32 APIs.
  • The only way to pull the table was via its MSRPC (Microsoft Remote Procedure Call) interface.

The Windows RPC interface is architecturally … shall we say … “dated” to be polite. It is overly complex and not remotely easy to write software against (especially when you consider how simple our requirement is). We started to write an extraction utility but it was getting pretty late and you can’t really keep working until the wee hours of the morning when you have to be on site at 6:30am. We handed the code off to the owners of ipnat.sys as they said they had plenty of steam left (being in India) so they would finish it off for us.

In the meantime, we implemented certain, ahem, ‘interim countermeasures’. We quickly built a list of all of the top torrent trackers around and got the nod from Jorke to add them all to the local DNS resolver and point them at a local web server containing some RickRoll scripts.

How professional network administrators deal with Torrent users.


It killed me that I didn’t see anyone getting done by this first hand, but there were hundreds of impressions in the server logs containing the Rick Roll scripts so I did get a fair amount of satisfaction at least. It was the most evil of evil Rick Roll scripts too – worse than any that anyone has used to get me in the past.

The next morning we found that the ipnat.sys developer (being in India) had suffered all sorts of Internet and power problems overnight and was unable to finish the utility. With a new day upon us, we completed the utility ourselves (thanks Paul – using software engineers as network admins has some benefits :)) and so were able to pull the translation table via MSRPC and dump it out to text.

The results were revealing and really made me want to go on a rampage. We were indeed hitting port exhaustion as an issue, and the distribution curve of who was using what looked like this:

The distribution of the number of translation table entries used on a per-client basis. I wonder who's using BitTorrent?


As soon as you see that – it is a no-brainer. The right-hand side of that graph represents a small number of users who were using 700-800 translation table entries each at the time. We had a one minute sample file that showed a particular individual using 2500 translation table entries! Argh!

So we scheduled the script to run each minute to generate a list of offending MAC addresses. It took a few goes to get the analysis right but we ended up generating a ‘naughty factor’ based on the number of port mappings, number of distinct hosts on the other end of the mapping, idle time, and so on to give us a number between 0.0 and 1.0. 1.0 meant you were very naughty. 0.0 meant you were very good. We reasoned that if you had a lot of mappings, and a large proportion of those mappings were to a lot of distinct remote hosts, and largely not idle, then you were probably a Torrenter. OTOH, if you had, say, 20 connections open to a single host or a low number of hosts then this is probably quite fine.
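A simplified, hypothetical version of that scoring looks like this — the caps and equal weighting are illustrative assumptions, not the values we actually tuned on site:

```python
def naughty_factor(mappings, distinct_hosts, idle_fraction,
                   mapping_cap=800, host_cap=200):
    """Score a client from 0.0 (very good) to 1.0 (very naughty).

    Heuristic: many port mappings, spread across many distinct remote
    hosts, and mostly active => probably a torrenter. A client with a
    handful of connections to one host scores low regardless of activity.
    """
    many_mappings = min(mappings / mapping_cap, 1.0)
    many_hosts    = min(distinct_hosts / host_cap, 1.0)
    mostly_active = 1.0 - idle_fraction
    return round((many_mappings + many_hosts + mostly_active) / 3, 2)

print(naughty_factor(750, 300, 0.1))  # torrent-like profile: high score
print(naughty_factor(20, 1, 0.5))     # 20 connections to one host: low score
```

The useful property of scoring behaviour rather than protocol is that it keeps working even when the traffic itself is encrypted.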

These scripts output a list of bad MACs, which we then just dropped into a block list in the core switches. The logic proved to be quite sound, as this is what happened when we blocked a couple of dozen particular users with a high naughtiness factor:

The result of applying a block to a couple of dozen users.


And there you have it. The culprits fingered and booted off the network. Of course, they then just changed their MAC addresses – but they were re-identified as soon as their utilisation crept up, and the new MAC was banned too.
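The per-minute loop then reduces to a set difference. A hypothetical sketch of the block-list maintenance (the real version pushed the result into the core switches):

```python
def new_bans(scores, already_blocked, threshold=0.8):
    """Return MACs that crossed the naughtiness threshold this pass and
    are not yet blocked. Because scoring keys off behaviour rather than
    identity, a user who swaps MAC addresses simply re-qualifies once
    their utilisation creeps back up."""
    return sorted(mac for mac, score in scores.items()
                  if score >= threshold and mac not in already_blocked)

# Illustrative data -- the MACs and scores here are made up.
scores = {"aa:bb:cc:00:00:01": 0.95,   # torrenter
          "aa:bb:cc:00:00:02": 0.12,   # well-behaved
          "aa:bb:cc:00:00:03": 0.88}   # torrenter, already blocked
blocked = {"aa:bb:cc:00:00:03"}
print(new_bans(scores, blocked))  # ['aa:bb:cc:00:00:01']
```

The threshold of 0.8 is an assumed value for the sketch; in practice you tune it until legitimate heavy users stop showing up in the output.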

2010

So this year, the users who did the above have driven me to recommend the following this Friday when we meet at Microsoft Brisbane for the first technology team meeting:

  1. Opt-in “allow list” basis for access to the network. You will need to register your MAC in CommNet if you want access, and we will apply a quota across all of your devices; and/or
  2. Mandatory rate limiting on a per MAC basis across all users; and/or
  3. Packeteer/Allot/etc based deep packet inspection and traffic shaping.

It’d be better if we could provide peak throughput to any given person at any time should they need it – but the above shows that a small number of people ruin the experience for a large number of people. My annual argument that users will respect the network resources and behave sensibly will no longer wash with the rest of the team.

</rant>

44 Comments
  1. Nice work, David. “Play nice or be Rick-rolled”

    If this were advertised widely at the start of the 2010 event, I wonder if the “anti-social” types would even bother to try it on?

    BTW, did anyone ever find out what the “baddies” were downloading?

  2. Trying to selectively shape some users and not others will inevitably kick off an arms race that you cannot win. Once you have two clients talking to each other using transport layer encryption there is nothing you can do in terms of deep packet inspection. The only thing you’re left with is what we did, which is inferring behaviour from the characteristics of the network utilisation – not the actual contents of the packets.

    A combination of blanket traffic shaping and outright quota system is the way forward – but to be honest, I still really don’t like doing that. In my view, it is perfectly valid for someone to legally download 5GB from Connect/MSDN/wherever in one hit if they need, for example, ISOs of Win7/SQL/etc for a demo. Those sorts of emergency scenarios will be casualties of whatever shaping we end up doing. When you hit an Akamai server at a peering exchange the download rates are astronomical so you can expect to give a half gig link a bit of a flogging. Blocking those valid use cases, though, sucks.

    So in some cases – yes – we WANT you to use a lot of data if you need to. That is why we put in hundreds of mbps of capacity.

    Pre-announcing shaping will do nothing to change behaviour. We confronted people directly to have them lie to our faces. One of the guys in particular worked for a charity – Great spend of the donated dollars, grants and goodwill sending him to teched.

  3. When they find out that the torrenters are the ones bringing in the network stack code for the new version of Windows, this is all going to eat its tail, right?

  4. 10% evil, 90% genius. The perfect balance IMHO.

  5. Any chance of posting those rickroll scripts so that others may learn/profit from them?

    David

  6. @David Mills: Sorry mate. They were just some random scripts we found on one of the millions of RR sites out there. The machine hosting the internal ‘fake’ isohunt etc has long since been nuked and shipped back to the hardware sponsor. There is a HEAP of them out there if you google for it.

    There is nothing special about a RR script. You just need a HEAP of EXTREMELY ANNOYING client-side code and Rick Astley! 😉

  7. Is it possible that some of these people just forgot they had torrent running?
    It seems so very mean for them to do it otherwise, and it is a simpler explanation.

  8. @Jax: Read the article and the previous one. There is no simpler explanation as people were approached. Most of the torrenting was done on the free netbooks that MS gave to each delegate to keep (i.e. was brand new for them at the event).

    We’re far from mean and actually bend over backwards to accommodate the delegates.

  9. Is the reason for using Windows for NAT just for fun and feedback? Or are there other benefits over using, say, Cisco?

    If you do rate limiting, couldn’t you exempt HTTP/S? Then pretty much all legitimate users wouldn’t be punished (except, I suppose, people using a non-SSL VPN).

  10. Don’t require MAC registration. It’s super annoying.

    Observe traffic from any MAC address and slowly increase its available bandwidth if it behaves well.

    This way all legit users will soon get good connection, and abusers faking MACs will keep starting from zero.

  11. “RRAS, we found, only uses the machine base IP address for the outside of the NAT. It will not use additional IP addresses in the public address pool, no matter how many IP addresses are in that pool.”

    Fix that, and haven’t you solved your problem?

    “Most of the torrenting was done on the free netbooks that MS gave to each delegate to keep (i.e. was brand new for them at the event).”

    I know there’s a ton of stuff I install on a new machine, and if I were using BitTorrent and wanted to be friendly to other users, I’d be more focussed on limiting the bandwidth than the number of connections – the port exhaustion problem wouldn’t readily occur to me.

    “We confronted people directly to have them lie to our faces.”

    Some people can have good intentions but still react very poorly to being “caught” in the wrong. From my experiences running a website with a lot of users, it’s pretty common.

  12. hi, this is neat + interesting. however, i have to ask – did you ask people not to use bittorrent? ie you made some kind of anouncement explaining what was happening and asking people not to use it? i assume you must have, and i can understand your frustration, but i was surprised that you didn’t (that i could see) mention asking people first in your writeup.

  13. Very good read and some beautiful network mastery!
    Instead of rick rolling them (which I agree must have been gratifying) you could have led them to a page explaining how to use bittorrent correctly i.e. you can limit the number of open connections and therefore open ports in your torrent client / limit the total bandwidth to be used etc. Limits need to be set, especially on a public network such as yours. A page saying “You are damaging the network for everyone around you, please do the following…” is better in my eyes than rick rolling.

    One does not know whether these people did it on purpose or didn’t – but the fact remains that if they would have configured their bittorrent client in a better way, they wouldn’t have “destroyed” your network.

  14. The “good” thing about bittorrent is that the abusers don’t control the peers they are contacting. Most of these peers expect connections on high ports.

    I’ve found on our network that simply blocking ports above 1024 cuts >99% of torrent traffic, while most other stuff still works. Most instant messengers etc. will automatically revert to lower port alternatives. (And you could probably make a small white list for those that don’t without cutting much from the 99% efficiency.)

    I doubt this will cause an “arms race” to use lower ports for bittorrent. Most of the peers out there won’t know and won’t care that the abusers at the event have been limited.

    I realize this isn’t as esthetically pleasing as most of us techies like it, but it’s cheap on resources, very effective, and damage to innocent users is minimal. (IMHO a shaped connection or DPI is worse.)

    (For completeness sake: on our network we don’t block high ports for everyone anymore. We made simple script that’s triggered by traffic from known torrent trackers/peers and blocks high ports for the local user accordingly for 60 minutes.)

  15. Free netbooks, high bittorrent usage? Any chance they were downloading Linux ISOs? 🙂

  16. Rather than all the deep packet inspection, etc, can’t you just directly throttle the ports-per-IP-address limit? Rather than doing the chain of deduction that says “Bittorrent uses many ports, therefore we will throttle Bittorrent”, why not go directly to the problem, which is ports?

  17. 64K ports should be enough for any tech event!

    The sad part is that you had to go through all this work because of the limitations of your NAT — and worse is that no one else will benefit from what you’ve done if they’re stuck using the same software.

  18. Limiting each IP address to about 90 sockets would also do the trick.

  19. Why not just block UDP traffic for ports other than 53 (DNS) and maybe 1194 (OpenVPN)? That will shut down torrent users pretty quickly. I have to say that I’m more in favor of banning people who do dumb shit and leaving the network as open as possible for others, but I also realize that this requires more active network policing.

    I would definitely say though, it’s really inappropriate and inconsiderate that people be torrenting from a public event with shared bandwidth like that. Leave it to your seedboxen at home, folks!

  20. This is an absolutely marvelous explanation, thank you for taking the time to write it up.

    It’s the “tragedy of the commons,” and it and similar scenarios take place commonly on open networks. I had run my home Wi-Fi in our apartment block as a deliberately open network, because many of our neighbors were broke undergrads or grad students to whom a monthly Comcast internet bill was a big expense. We used to get thanks from our neighbors. I also saw it as a reciprocity thing — when our upstream connection failed, which was rare but happened, we could get on a nearby open network and at least get our mail.

    But a funny thing happened: as the number of networks in our neighborhood grew, suddenly everyone was locking them down, and ours was the only open one. Simultaneously we would frequently start to experience “denial of service” on our own connection – and we’d have no nearby network to jump to. Sometimes we wouldn’t even be able to get a NAT’ed IP address from our Netgear router, despite configuring it to reserve a block for the specific MAC addresses used by our computers, and making the public set of NAT IP addresses disjoint from it. (It turns out the router doesn’t really honor those settings correctly). The culprit was either BitTorrent or MMORPGs. We finally had to secure our network.

    I didn’t have the forensic skills (or want to take the time) to diagnose it in detail, but we frequently saw cases where we couldn’t get any bandwidth despite having only a few public users connected. I now wonder if it had something to do with the port exhaustion scenario you outline.

  21. I agree with Dennis: the issue was more with how the torrent software was treating the network than the fact that they were using torrents. I myself have inadvertently crushed my company’s network due to a video RSS feed that decided to download several gigs of old videos.

    If the issue wasn’t the bandwidth but the number of ports being used, that is where the restrictions should be applied. If you were able to limit any IP/MAC to under 50-100 ports, the torrent software would be able to adjust to the environment and function as best it can.

    Everyone would be able to do what they are trying to do, keeping them from trying to get around some artificial block.

  22. Seems kind of silly to inconvenience your guests. Why don’t you just use your script again in 2010 and rick roll them again. 🙂

  23. Also, the mac address approach seems vulnerable to attack. What if a malicious user polls the network and gathers a list of active MACs. He can then masquerade as these MACs and trigger a ban on each one, effectively denying service to other users.

  24. This limit could be avoided by the NAT implementation. Instead of just keying off the port, one could also key off the remote IP. Then you’d hit the 60,000+ port limitation only per remote IP. The remote machine would have the same limitation, though, so it wouldn’t matter.

    All that would be left is the limit on nonpaged memory space, and this method would likely use more memory per table entry anyway.

  25. ha ha i was there and downloading torrents. never got rick rolled cause i use NoScript and private trackers anyway.
    btw thanks for the movies teched. was good to have something to watch on the plane flight home 🙂

  26. Did you consider it may have been people outside the building?

  27. @dude – yeah we know you were torrenting, hope the weather is good in VIC – might want to check where logs go to die pronto…

  28. Sad it had to be done, but awesome work on tracking and banning!

  29. Paul R. Potts, you were actually taking a very big risk running an open WiFi network. Obviously you did it with the best of intentions, however you may not have fully considered the possible consequences. If someone had used your network for some kind of illegal activity, as well as the moral responsibility of facilitating this activity, you could actually be legally liable. Illegal activity could be anything from downloading and distributing illegal pornography, stealing credit cards, to planning and coordinating a terrorist attack.

    You were potentially giving criminals a free pass to do whatever they liked on the internet, in complete anonymity, and safe from any possibility of prosecution.

    All those other (selfish) people who were locking down their networks were actually doing the responsible thing.

  30. It’s sad to see a small minority abusing the system like this, and forcing policing of traffic for everyone else… but I guess that’s the nature of people these days.

    Glad you got to the bottom of it so quickly, nice one.

  31. David,
    I’d just warn people on entry (where you’re suggesting to do the mac address opt-in) that torrenting for fun is not allowed, and that you have the scripts and tools ready to identify and kick off those that misuse the bandwidth.

    As mentioned before, if it becomes an arms race, they will force you to invest more time to stop what they’ve come up with. If you place part of the responsibility with the users, they might stop fighting against you.

    For example, if you register MACs, baddies will snoop other people’s MACs and use those. Mayhem ensues…