An Internet Outage
Nobody likes it. Everyone wants to avoid it. Companies spend millions upon millions of dollars to prevent it, yet it happens to everyone at some point.
The causes range from something as silly as an accidentally unplugged power cord to hidden software bugs that cause cascading problems across very large global networks. Regardless of the cause, after an outage everyone wants to know what happened, and what steps are being taken to ensure that it never happens again. After all, we have books and shoes and cheese to sell online, and there are men and women to be matched and stocks to be traded, and none of it can be interrupted EVER for any reason, right?
This is the story of one such outage, in February of 2009. It did not impact Gotham to any significant degree, but it caused enough widespread panic on the Internet for a few hours to be worth passing along.
Not all networks connected to the Internet have the resources to purchase and maintain expensive Cisco or Juniper routers and switches. You’ll also find an odd assortment of other hardware and software attached to the global network if you look hard enough. In the case of a small-ish service provider in the Czech Republic, we’re talking about MikroTik software.
You might ask yourself, “Who or what is MikroTik?” It is a Latvian company that makes low-cost routers and routing software.
To make a long and mind-numbingly technical story short, the MikroTik routing software used by this Czech ISP fiddled with something called autonomous system (AS) prepending as part of a traffic management scheme. AS prepending is a way to influence how packets reach a given network on the Internet: a network repeats its own AS number in the routes it announces, so that a path looks longer and other routers are less inclined to use it. Things began to unravel that night when these MikroTik routers began telling the rest of the Internet that they could be found at the end of a really, really long chain of AS numbers. Like REALLY long. Insanely long. So why is this of any consequence?
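For the curious, here is a minimal sketch of how prepending tips the scales. It assumes a remote router that chooses purely on AS path length (real BGP route selection weighs several other attributes first), and the AS numbers, upstream names, and helper functions are all made up for illustration.

```python
# Illustrative sketch only: real BGP best-path selection has more steps
# than "shortest AS path wins". All AS numbers here are hypothetical.

ORIGIN_AS = 64500  # the network doing the prepending (made-up number)


def announce(origin_as, prepend_count=0):
    """Build the AS path the origin sends: its own AS number, plus extra copies."""
    return [origin_as] * (1 + prepend_count)


def forward(as_path, transit_as):
    """A transit network adds its own AS number to the front before passing it on."""
    return [transit_as] + as_path


def best_route(candidates):
    """Pick the (name, as_path) pair with the shortest AS path."""
    return min(candidates, key=lambda item: len(item[1]))


# The origin announces normally through upstream A, and with three extra
# copies of its own AS number through upstream B.
via_a = forward(announce(ORIGIN_AS), 64501)
via_b = forward(announce(ORIGIN_AS, prepend_count=3), 64502)

choice = best_route([("upstream A", via_a), ("upstream B", via_b)])
print(choice[0], "wins; path lengths:", len(via_a), "vs", len(via_b))
```

A handful of extra copies is a normal way to steer inbound traffic toward a preferred link. The trouble in this story came from an announcement carrying vastly more copies than anyone would use on purpose.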
Imagine calling a friend and informing him that you have a new mailing address. Now imagine that your new mailing address was so long that it would fill an entire page of The New York Times. Your friend would not believe it, but if you’re a MikroTik router, you insist that it’s true and force him to accept it at face value. Now imagine that your friend has to start a “phone chain” to get your new address to the rest of your friends: relaying the information to the next person, who sends it to the next, and so on. Here’s the rub. Some of your friends look oddly at your gargantuan new mailing address, but pass it along anyway without incident. Some are so maddened by this cumbersome new address that they refuse to talk any longer to the person who tried to give it to them. Suddenly your network of friends and family members is fragmented. Some people are refusing to listen to others, and the normally friendly banter that passes among them comes grinding to a halt.
This is exactly what happened when a comically long AS prepend in the Czech Republic caused a huge number of Internet routers to stop talking to each other. Since the Internet is basically routers talking to each other, this is a bad thing. For a period of approximately an hour, while network engineers around the world tried to figure out what was going on, things ground to a near standstill in many significant places. Network routing information was being changed thousands of times every second as routers began refusing to talk to one another, and we wound up with a noticeable period of time during which a large chunk of the Internet turned to mush.
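The phone chain can be sketched in a few lines as well. This is a toy model, not a reproduction of the actual bug: the `FRAGILE_LIMIT` threshold, the `receive` function, and the "fragile" versus "robust" labels are assumptions invented for illustration.

```python
# Toy model of the "phone chain". The threshold and router behaviour are
# hypothetical; they only illustrate that some routers relay an overlong
# path while others drop the session with whoever sent it.

ORIGIN_AS = 64500      # made-up AS number
FRAGILE_LIMIT = 250    # hypothetical path length a fragile router can tolerate


def receive(as_path, fragile):
    """Return what a router does with an incoming announcement."""
    if fragile and len(as_path) > FRAGILE_LIMIT:
        return "session reset"          # stops talking to the sender
    return "accepted and relayed"       # passes the address along


normal_path = [ORIGIN_AS] * 3
long_path = [ORIGIN_AS] * 252           # the gargantuan mailing address

results = {
    ("robust", "normal"): receive(normal_path, fragile=False),
    ("robust", "long"): receive(long_path, fragile=False),
    ("fragile", "normal"): receive(normal_path, fragile=True),
    ("fragile", "long"): receive(long_path, fragile=True),
}

for (router, path), outcome in results.items():
    print(f"{router} router, {path} path: {outcome}")
```

Only the fragile router faced with the long path hangs up. Because the robust routers keep relaying the announcement, it keeps reaching fragile ones elsewhere in the chain, which is how one odd announcement can fragment a large network.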
Picture yourself during this outage. Your website is down, your customers are screaming at you, you’re screaming at your service provider, and you’re vowing to move all your stuff to a “bulletproof” network that never goes down. We’ve all been there. Odds are we’ll be there again. In this case you’d have learned that the root of the problem was in a small office halfway around the world, that a bit of $99 software exposed a previously unknown weakness present in a very large number of routers costing thousands of times more than that, and that even the largest, most expensive networks on the planet turned out not to be “bulletproof” at all.
The Internet has certainly come a long way since 1992, but it can still be unexpectedly fragile at times. It would serve all of us well to keep that in mind, especially when marketing our services and making promises that people we’ve never met (and never will) can trample all over without even realizing they’re doing it.