Monday, July 26, 2010

How resilient does your DC network need to be?

We had a great conversation this weekend on Packet Pushers about how resilient you should make your datacentre switching. The guts of the conversation was this: on your core chassis devices, do you need to put in multiple supervisors/PSUs/etc.?

Now, the real answer to these kinds of questions is always a variant on 'it depends on the requirements'.
Ethan Banks was making the point that there are networks where a two second loss is a big deal. I guess we're primarily talking about trading houses/banks/call centres etc here, and in those places it's a fair point.

For pretty much anyone else, the first question I'd ask is 'How much money is it worth to avoid a 3 second loss once a year?'. Why those numbers? Assuming you're doing something vaguely sensible with Rapid Spanning Tree and/or your L3 routing protocol of choice, you should expect an unplanned device loss to be recovered in that time. Why once a year? If you're losing devices more often than that, you probably have an environmental issue you need to fix first.
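
To put '3 seconds once a year' next to the usual uptime percentages, here's a small Python sketch of the arithmetic. The function names are just mine for illustration, and it assumes a 365 day year and that the one drop is the only downtime you have.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def availability_percent(downtime_seconds_per_year):
    """Uptime as a percentage, given total downtime in seconds over one year."""
    return 100.0 * (1 - downtime_seconds_per_year / SECONDS_PER_YEAR)


def downtime_budget_seconds(target_percent):
    """Downtime per year, in seconds, that a given availability target allows."""
    return SECONDS_PER_YEAR * (1 - target_percent / 100.0)


if __name__ == "__main__":
    # One 3 second drop a year, expressed as an availability percentage.
    print(f"{availability_percent(3):.7f}%")
    # How much downtime a 'five nines' target would actually allow.
    print(f"{downtime_budget_seconds(99.999):.0f} s")
```

For comparison, a 'five nines' target allows roughly 315 seconds of downtime a year, so a single 3 second drop is well inside it.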

So what's the significance of the 3 second drop? Well, of course you need to test in your environment, but you should expect TCP connections to stay up (or you can tweak your TCP stacks to ensure they will). SMB transfers may drop, VOIP calls will drop. Storage calls will fail. These things should all recover*. I'm sure you can think of a few more things.

* Yeah I know the VOIP call won't 'recover', but you can call them back. As long as it's not a regular event, this is usually an acceptable risk. As for the rest, test your applications, and see what will recover, and what dies horribly. The 'dies horribly' list is a set of applications which are the drivers for 'hitless redundancy'. Make sure it's made clear to the business that these are the apps which are responsible for the extra cost.
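
To make the 'TCP connections should stay up' point a bit more concrete, here's a rough Python sketch of how a sender's retransmission timer backs off during an outage. It's a simplified model, not any particular OS's stack: the function names and all the numbers (1 second initial timeout, doubling each time, 5 retries, 60 second cap) are assumptions for illustration, so check your own stack's real settings and test.

```python
def retransmit_times(initial_rto, max_retries, max_rto=60.0):
    """Seconds after the first lost send at which each retransmission fires.

    Simplified model: the timeout doubles after every retry, up to max_rto.
    """
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(max_retries):
        elapsed += rto
        times.append(elapsed)
        rto = min(rto * 2, max_rto)
    return times


def survives_outage(outage_seconds, initial_rto=1.0, max_retries=5):
    """True if at least one retransmission fires after the outage has ended."""
    return any(t > outage_seconds
               for t in retransmit_times(initial_rto, max_retries))


if __name__ == "__main__":
    print(retransmit_times(1.0, 5))
    print(survives_outage(3.0))
    print(survives_outage(3.0, max_retries=2))
```

With these made-up numbers the retries fire at 1, 3, 7, 15 and 31 seconds, so a 3 second drop is ridden out by the third retry. Cut the retry count to 2 and the connection gives up before the network is back, which is the kind of thing you'd be tweaking.
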
So once we have our understanding of consequence, we can take our techie hat off, and we need to start thinking like a business person. What is avoiding the occasional drop worth to the business? Throw it out to your internal customers - probably some of them will jump up and down and tell you they can't tolerate any loss - it must be up all the time. That's fine. Make them come up with a number - what will this failure scenario cost? Compare this cost to the price from Cisco/Juniper/Whoever for the extra kit, and this is your business case, one way or the other.
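
That comparison can be sketched in a few lines of Python. This is only a back-of-envelope model: the function names, the kit lifetime and every figure in the example are hypothetical, and it ignores things like support contracts and the cost of money.

```python
def avoided_loss(events_per_year, cost_per_event, lifetime_years):
    """Total loss the business expects to avoid over the life of the kit."""
    return events_per_year * cost_per_event * lifetime_years


def redundancy_pays_off(events_per_year, cost_per_event,
                        extra_kit_price, lifetime_years):
    """True if the avoided loss is bigger than the price of the extra kit."""
    return avoided_loss(events_per_year, cost_per_event,
                        lifetime_years) > extra_kit_price


if __name__ == "__main__":
    # Hypothetical: one drop a year, kit kept for 5 years, extra kit at 25,000.
    # An application owner who says a drop costs 2,000 does not justify it...
    print(redundancy_pays_off(1, 2_000, 25_000, 5))
    # ...one who can show a drop costs 10,000 does.
    print(redundancy_pays_off(1, 10_000, 25_000, 5))
```

The useful part isn't the maths, it's that someone has to supply cost_per_event before the answer comes out either way.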

One useful trick (and it depends on how internal financing works in your company) is to budget to build your network to a certain 'reasonable' level of resiliency - and if any particular application owner/customer needs more, then they pay for the extra. It's not about being a smartypants, but about making people understand that these extra uptime percentage points get expensive. There is a consequence to the company's finances for demanding them. Often, it might turn out to be a lot cheaper to re-engineer the application to recover from a network failure.

5 comments:

chris marget said...

I haven't listened to the discussion on the podcast, so don't know if you covered it there... There's another reason to favor dual-supervisors, especially at the access edge.

Some environments have a combination of change control process and application sensitivity that make dual supervisors a matter of expedience, rather than resiliency (which is handled by NIC failover mechanisms on the server).

Imagine that there are hundreds of applications hanging from an access switch, and every one of them has veto power over your IOS upgrade plans. Without a standby supervisor and ISSU, you might never get that upgrade done.

Ethan Banks said...

Dan, you hit on an interesting point with "it might be a lot cheaper to fix the lousy application than throw more network at it." Okay, I paraphrased indelicately, but the point is great. Often, the app guys throw back "it's the network" as a cheap dig to get out of properly engineering their app. But...then when you get into just how much it's going to cost for a hitless environment, management doesn't want to spend the capex and forces the app guys to look at how to fix what can't tolerate the 3 second hit. Sometimes, the fix is as simple as throwing up another instance of the app in another data center, sorting out a database synchronization challenge (a frequent bugaboo), and then using DNS to fling the app to the opposite site during a maintenance window where an outage is likely. Sometimes, it's re-writing lousy code. But so often, the app guys can make it better.

chris marget said...

I finally listened to PPP#13, and heard that you guys covered the 'hitless upgrade' issue. Dan, I loved that you managed to sneak an FCoTR mention in there :-) And I finally noticed that you were focused on the core devices, which is something I missed in the post.

We're definitely in agreement on core redundancy, but I find myself putting it in anyway.

Power, Fabric modules and line cards are all probably redundant anyway. And they're cheap.

So that leaves the supervisor as the only question mark. It's probably not needed, but you're installing redundant supervisors in all of those access switches for the hitless upgrade feature. In a Cat6500, those list for $15K-$38K USD each.

After springing for all of those standby supervisors in the access layer, how would you present the argument against a second N7K-SUP1 in the core when it only costs $25K?

It would be tough to explain to the guy writing the checks why you don't want that level of redundancy in the core.

Dan Hughes said...

It's an interesting point on the large access switches, I hadn't really thought of that. I agree 100% on power/fabric etc - if it's cheap to add redundancy, do it. And I know in reality, we always end up putting in the second Sup - for exactly the reason Chris mentions above. I guess we tend to take an approach of 'if we're spending a million, why not spend 1.3 (made up number) and have belt and braces'.

And that's kinda why I wrote the post: we've got so used to doing that that we (and I put my hand up to this) sometimes don't really think through whether it's really needed. That $300k we could have shaved off the project could make a difference, if all purchases did the same. Different times and all that...

On the access switches - the only thing I'd say - is if there is a real genuine 'yes we're prepared to accept the cost and pay for this' business reason why users can't take a 5 minute out of hours hit for a switch reboot - then they should be dual homed anyway. In which case - you can still take out a switch.

chris marget said...

The environment I left recently had everything we're talking about here: redundant supervisors everywhere, AND dual-homed servers (NIC team / interface bond / IPMP / whathaveyou). Sadly, application managers still had veto power over any network changes with potential for business impact. It was hard to get things done.

Greg brought up a great point: these mechanisms add complexity, complexity adds pain. Consider a switch that's running in my environment today... It's a 4507R with dual Sup6-E. One supervisor is stuck in an endless reboot cycle. The primary supervisor says this:


Switch(config)#
Config mode locked out until standby initializes

configuration mode locked.'Please try later.'
Switch(config)#

...and this:


Switch#hw-module slot 3 reset power-cycle
Proceed with reload of module? [confirm]
Module not completely up
Switch#

Nice, huh?

Congrats on the recert, BTW!