The Fallacies Remain...

19 February 2008

Just recently, I got this bit in an email from the Redmond Developer News ezine:

TWO IF BY SEA
In the course of just over a week starting on Jan. 30, a total of five undersea data cables linking Europe, Africa and the Middle East were damaged or disrupted. The first two cables to be lost link Europe with Egypt and terminate near the Port of Alexandria.
http://reddevnews.com/columns/article.aspx?editorialsid=2502
Early speculation placed the blame on ship anchors that might have dragged across the sea floor during heavy weather. But the subsequent loss of cables in the Persian Gulf and the Mediterranean has produced a chilling numbers game. Someone, it seems, may be trying to sabotage the global network.
It's a conclusion that came up at a recent International Telecommunication Union (ITU) press conference. According to an Associated Press report, ITU head of development Sami al-Murshed isn't ready to "rule out that a deliberate act of sabotage caused the damage to the undersea cables over two weeks ago."
http://tinyurl.com/3bjtdg
You think?
In just seven or eight days, five undersea cables were disrupted.
Five. All of them serving or connecting to the Middle East. And thus far, only one cable cut -- linking Oman and the United Arab Emirates -- has been identified as accidental, caused by a dragging ship anchor.
So what does it mean for developers? A lot, actually. Because it means that the coming wave of service-enabled applications needs to take into account the fact that the cloud is, literally, under attack.
This isn't new. For as long as the Internet has been around, concerns about attacks on the network have centered on threats posed by things like distributed denial of service (DDOS) and other network-borne attacks. Twice -- once in 2002 and again in 2007 -- DDOS attacks have targeted the 13 DNS root servers, threatening to disrupt the Internet.
But assaults on the remote physical infrastructure of the global network are especially concerning. These cables lie hundreds or even thousands of feet beneath the surface. This wasn't a script-kiddie kicking off an ill-advised DOS attack on a server. This was almost certainly a sophisticated, well-planned, well-financed and well-thought-out effort to cut off an entire section of the world from the global Internet.
Clearly, efforts need to be made to ensure that the intercontinental cable infrastructure of the Internet is hardened. Redundant, geographically dispersed links, with plenty of excess bandwidth, are a good start.
But development planners need to do their part, as well. Web-based applications shouldn't be crafted with the expectation of limitless bandwidth. Services and apps must be crafted so that they can fail gracefully, shift to lower-bandwidth media (such as satellite) and provide priority to business-critical operations. In short, your critical cloud-reliant apps must continue to work, when almost nothing else will.
And all this, I might add, as the industry prepares to welcome the second generation of rich Internet application tools and frameworks.
Silverlight 2.0 will debut at MIX08 next month. Adobe is upping the ante with its latest offerings. Developers will enjoy a major step up in their ability to craft enriched, Web-entangled applications and environments.
But as you make your plans and write your code, remember this one thing: The people, organization or government that most likely sliced those four or five cables in the Mediterranean and Persian Gulf -- they can do it again.

There's a couple of things to consider here, aside from the geopolitical ramifications of a concerted attack on the global IT infrastructure (which does more to damage corporations and the economy than it does to disrupt military communications, which to my understanding are mostly satellite-based).

First, this attack on the global infrastructure raises a huge issue with respect to outsourcing--if you lose touch with your development staff for a day, a week, a month (just how long does it take to lay down new trunk cable, anyway?), what sort of chaos is this going to strike with your project schedule? In The World is Flat, Friedman mentions that a couple of fast-food restaurants have outsourced the drive-thru--you drive up to the speaker, and as you place your order, you're talking to somebody half a world way who's punching it into a computer that's flashing the data back to the fast-food join in question for harvesting (it's not like they make the food when you order it, just harvest it from the fields of pre-cooked burgers ripening under infrared lamps in the back) and disbursement as you pull forward the remaining fifty feet to the first window.

The ludicrousness of this arrangement notwithstanding, this means that the local fast-food joint is now dependent on the global IT infrastructure in the same way that your ERP system is. Aside from the obvious "geek attraction" to a setup like this, I find it fascinating that at no point did somebody stand up and yell out, "What happened to minimizing the risks?" Effective project development relies heavily on the ability to touch base with the customer every so often to ensure things are progressing in the way the customer was anticipating. When the development team is one ocean and two continents away in one direction, or one ocean and a whole pile of islands away in the other direction, or even just a few states over, that vital communication link is now at the mercy of every single IT node in between them and you.

We can make huge strides, but at the end of the day, the huge distances involved can only be "fractionalized", never eliminated.

Second, as Desmond points out, this has a huge impact on the design of applications that are assuming a 100% or 99.9% Internet uptime. Yes, I'm looking at you, GMail and Google Calendar and the other so-called "next-generation Internet applications" based on technologies like AJAX. (I categorically refuse to call them "Web 2.0" applications--there is no such thing as "Web 2.0".) As much as we keep looking to the future for an "always-on" networking infrastructure, the more we delude ourselves to the practical realities of life: there is no such thing as "always-on" infrastructure. Networking or otherwise.

I know this personally, since last year here in Redmond, some stronger-than-normal winter storms knocked down a whole slew of power lines and left my house without electricity for a week. To very quickly discover how much of modern Western life depends on "always-on" assumptions, go without power to the house for a week. We were fortunate--parts of Redmond and nearby neighborhoods got power back within 24 hours, so if I needed to recharge the laptop or get online to keep doing business, much less get a hot meal or just find a place where it was warm, it meant a quick trip down to the local strip mall where a restaurant with WiFi (Canyon's, for those of you that visit Redmond) kept me going. For others in Redmond, the power outage meant a brief vacation down at the Redmond Town Center Marriott, where power was available pretty much within an hour or two of its disruption.

The First Fallacy of Enterprise Systems states that "The network is reliable". The network is only as reliable as the infrastructure around it, and not just the infrastructure that your company lays down from your workstation to the proxy or gateway or cable modem. Take a "traceroute" reading from your desktop machine to the server on which your application is running--if it's not physically in the building as you, then you're probably looking at 20 - 30 "hops" before it reaches the server. Every single one of those "hops" is a potential point of failure. Granted, the architecture of TCP/IP suggests that we should be able to route around any localized points of failure, but how many of those points are, in fact, to your world view, completely unroutable? If your gateway machine goes down, how does TCP/IP try to route around that? If your ISP gets hammered by a Denial-of-Service attack, how do clients reach the server?

If we cannot guarantee 100% uptime for electricity, something we've had close to a century to perfect, then how can you assume similar kinds of guarantees for network availability? And before any of you point out that "Hey, most of the time, it just works so why worry about it?", I humbly suggest you walk into your Network Operations Center and ask the helpful IT people to point out the Uninterruptible Power Supplies that fuel the servers there "just in case".

When they in turn ask you to point out the "just in case" infrastructure around the application, what will you say?

Remember, the Fallacies only bite you when you ignore them:

1) The network is reliable
2) Latency is zero
3) Bandwidth is infinite
4) The network is secure
5) Topology doesn't change
6) There is one administrator
7) Transport cost is zero
8) The network is homogeneous
9) The system is monolithic
10) The system is finished

Every project needs, at some point, to have somebody stand up in the room and shout out, "But how do we minimize the risks?" If this is truly a "mission-critical" application, then somebody needs the responsibility of cooking up "What if?" scenarios and answers, even if the answer is to say, "There's not much we can reasonably do in that situation, so we'll just accept that the company shuts its doors in that case".

Tags: reading industry distributed systems development processes security

The Fallacies Remain...

In which I discuss how the Fallacies of Distributed Systems plague us still.