Monday, August 20, 2007

Use case this!


Can engineers realistically be expected to think of everything that could go wrong with a system they design? Skype customers seem to think so. This week Skype went down for a few days, and on their company blog Skype offers the following interesting explanation:
The disruption was triggered by a massive restart of our users’ computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update.
So, essentially, everybody logged on to Skype at the same time after Microsoft made them reboot, and took the entire Skype network down with them. Interesting. This seems like a rather obscure chain of events to cause a system outage. Can Skype engineers be forgiven for not thinking of it ahead of time?

One of the fundamental activities of any designer (of software or any other type of system) is working through "use cases". This basically involves putting a "user" of a system through a number of scenarios to test how they react to the system, and to test how the system holds up.

One of the first "use cases" I remember hearing about in Engineering class was the operation of a surgery laser where, if you typed the operation codes in too fast, the safety mechanisms wouldn't engage, and instead of targeting cancer cells in controlled bursts, it would fry you something fierce and cause a cancer way worse than what you started with. The argument is made that if the designers had spent sufficient time putting people through expected "use cases", then this would not have happened, and thus the Engineers failed.

Could Skype have predicted this situation? They certainly couldn't have tested for it (how do you get 9 million people to log on simultaneously?). The jury is out on this one, but I'm willing to give the engineers some slack... mostly because the service is free, so who am I to complain?
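You can't round up millions of real users, but you can at least play with the idea on paper. Here is a rough sketch, not a model of Skype's actual network: it assumes every client logs in exactly once at a random second inside some time window, and compares the busiest second when logins are spread over a day versus crammed into five minutes. The function name, the 90,000-client population, and both window sizes are made up for illustration.

```python
import random
from collections import Counter


def peak_logins_per_second(num_clients, window_seconds, seed=0):
    """Return the largest number of logins that land in any one second.

    Hypothetical model: each client logs in once, at a uniformly random
    second inside the window. This is an illustration only, not a
    description of how Skype's network behaves.
    """
    rng = random.Random(seed)
    buckets = Counter(rng.randrange(window_seconds) for _ in range(num_clients))
    return max(buckets.values())


# Scaled-down population (assumed figure, chosen so the script runs quickly).
clients = 90_000
normal = peak_logins_per_second(clients, window_seconds=24 * 60 * 60)  # spread over a day
storm = peak_logins_per_second(clients, window_seconds=5 * 60)  # crammed into 5 minutes

print(f"normal day peak:   {normal} logins/s")
print(f"reboot storm peak: {storm} logins/s")
```

Even this toy version makes the point: the same number of users produces a far higher peak when everyone shows up inside a short window, which is the kind of scenario a "use case" review would have to imagine rather than observe.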

(As a side note, much of the blog-o-sphere finds Skype's explanation too far-fetched to believe)

1 comment:

Anonymous said...

I agree that the developers couldn't test for every possible set of circumstances. The problem is not so much that they had an outage as the time it took them to fix it. This implies that there is something fundamentally wrong in their system architecture design for business continuity.