Wednesday, July 28, 2010

EIGRP Feasible Successor routes

I always forget what these mean in between exams (I simply don't use EIGRP anywhere), but it is quite important to understand EIGRP's computation logic. The logic is this - when looking at possible 'backup' routes to hold ready for use, it's important to avoid considering a path learned from a neighbor which is going to send the packet back to you in the end - i.e. a routing loop. Without the 'global view' of link state protocols, you need a way to determine which inbound advertisements are 'safe' and which ones are 'maybe not safe'.

If you've a backup route which you know is safe, you can switch over really quickly in the event of the loss of your primary link. If you don't - you want to take a little time and be sure that you're not going to introduce a loop. You need to wait for the network to re-converge. If you've your head around that - the rest is just terminology.

The first key term is 'Feasible Distance'. This is basically the metric of the current 'best' route (i.e. the live one), including any metric elements that were added by the local router. The 'whole cost' if you like. Let's make up a number and say it's 10,000. The 'Reported Distance' of an inbound route advertisement is the metric that the neighbor is passing to us, before we add on any metric ourselves.

There are two possibilities for the Reported Distance: either it's less than our current 'best' metric (the Feasible Distance), or it's equal or more. If it's more, say 20,000, then it's possible that that neighbor is actually advertising our own advertisement back to us - i.e. a loop. Maybe it's not, but until the network has re-converged, we have to consider the possibility. If the RD is less, then it is not possible that it's going back via us, because any path through us would cost at least our own Feasible Distance. This cannot be a loop, and therefore it's safe to quickly switch to this link.

A Feasible Successor is quite simply an alternative path to the live route (the 'Successor') which has a lower reported distance than the current feasible distance - i.e. a backup route which we know is not a loop. If a router loses its main route and has a feasible successor, it simply promotes it to the 'live' route, and sends out updates to its neighbors to tell them the metric has changed.
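To make the rule concrete, here's a minimal Python sketch of the check. It's a simplification under stated assumptions: the `Path` class, the neighbor names and the numbers are all made up, and I'm treating the metric as 'reported distance plus a local link cost', which is not how the real EIGRP composite metric is calculated. The only thing it's meant to show is the comparison of each RD against the Feasible Distance.

```python
from dataclasses import dataclass


@dataclass
class Path:
    neighbor: str            # hypothetical neighbor name
    reported_distance: int   # metric the neighbor advertises to us (RD)
    link_cost: int           # simplified: what we add to reach that neighbor

    @property
    def total(self) -> int:
        # Simplified 'whole cost' via this neighbor
        return self.reported_distance + self.link_cost


def classify(paths):
    """Return (successor, feasible_successors) for one destination."""
    # The successor is the path with the lowest total metric.
    successor = min(paths, key=lambda p: p.total)
    feasible_distance = successor.total
    # A feasible successor must have an RD strictly below the feasible distance.
    feasible = [
        p for p in paths
        if p is not successor and p.reported_distance < feasible_distance
    ]
    return successor, feasible


paths = [
    Path("R2", 8000, 2000),    # total 10,000: the live route
    Path("R3", 9000, 3000),    # RD 9,000 < 10,000: safe backup
    Path("R4", 20000, 1000),   # RD 20,000 >= 10,000: might be a loop
]
successor, feasible = classify(paths)
print(successor.neighbor)               # R2
print([p.neighbor for p in feasible])   # ['R3']
```

Note that R4 may well be a perfectly good path in reality. The check is deliberately pessimistic: it only accepts backups that can be proven loop free from local information.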

If it doesn't have a feasible successor, then the router has no quick backup path, marks the route as 'active' (a counter-intuitive term which means the router is actively trying to work out how to get to the route in question), and starts querying its neighbors to see if any of them know a way to get to the destination. Either it'll get an answer back with a loop free route, which will be installed as the new successor (i.e. live route), or it'll give up and remove the route from the routing table.
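The two outcomes described above can be sketched as a pair of small Python functions. Again this is just an illustration: the function names, the tuple layout and the 'removed' label are my own inventions, and real EIGRP tracks a lot more state (outstanding queries per neighbor, timers and so on) than this does.

```python
def on_successor_lost(feasible_successors):
    """React to losing the live route.

    feasible_successors: list of (neighbor, total_metric) tuples that
    already passed the feasibility check.
    """
    if feasible_successors:
        # Quick path: promote the best known loop free backup.
        neighbor, metric = min(feasible_successors, key=lambda fs: fs[1])
        return ("passive", neighbor, metric)
    # No safe backup: go 'active' and start querying neighbors.
    return ("active", None, None)


def on_replies(replies):
    """Settle an active route once the neighbors have answered.

    replies: list of (neighbor, total_metric); metric is None when that
    neighbor has no route to the destination.
    """
    usable = [r for r in replies if r[1] is not None]
    if usable:
        neighbor, metric = min(usable, key=lambda r: r[1])
        return ("passive", neighbor, metric)
    # Nobody knows a way there: drop the route.
    return ("removed", None, None)


print(on_successor_lost([("R3", 12000), ("R5", 11000)]))  # ('passive', 'R5', 11000)
print(on_successor_lost([]))                              # ('active', None, None)
print(on_replies([("R4", None), ("R6", 30000)]))          # ('passive', 'R6', 30000)
```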

Why EIGRP hello and hold timers don't have to match

Let's take a simple topology. Two routers, R1 and R2, on a common LAN. Assume basic EIGRP is up. One morning you decide to start 'optimizing' the protocol by adjusting timers, but get halfway through it, get bored and go for tea:

(both)

router eigrp 100
 network 150.150.0.0




R1#sh run int f0/0
!
interface FastEthernet0/0
 ip address 150.150.36.3 255.255.255.128
!! default EIGRP timers of hello (5) and holdtime (15) apply
!
end




R2#sh run int f0/0
!
interface FastEthernet0/0
 ip address 150.150.36.126 255.255.255.0
 ip hello-interval eigrp 100 1
 ip hold-time eigrp 100 3
end

R1's hellos are only going to be sent out every 5 seconds, which is longer than R2's hold time of 3 seconds, so it's going to break, right? No, actually it isn't. In EIGRP, each hello carries the sender's hold time, and the receiving router applies that advertised value to the neighbor rather than its own locally configured one.

So in the scenario above, R1 applies a hold time of 3 seconds to R2 (whose hellos arrive every second), even though it's sending its own hellos out every 5 seconds. Vice versa, even though R2's interface is configured with a hold time of only 3 seconds, it applies the hold time of 15 seconds that R1 advertises, which comfortably covers R1's hellos arriving every 5.
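Here's a rough Python simulation of that behavior, using the timers from the configs above. The `stays_up` function is a made-up helper that ticks in whole seconds and only models the hold timer. It assumes hellos are never lost, which is generous.

```python
def stays_up(hello_interval, hold_time, duration=60):
    """Simulate one direction of a neighborship in whole seconds.

    hello_interval: how often the SENDER transmits hellos.
    hold_time: the hold time the RECEIVER applies to that sender.
    Returns True if the hold timer never expires within `duration`.
    """
    expires_at = hold_time  # first hello received at t=0
    for t in range(1, duration + 1):
        if t >= expires_at:
            return False  # hold timer ran out: neighbor declared down
        if t % hello_interval == 0:
            expires_at = t + hold_time  # hello received, timer reset
    return True


# What actually happens: each side applies the hold time the other advertises.
print(stays_up(hello_interval=1, hold_time=3))    # R1 watching R2: True
print(stays_up(hello_interval=5, hold_time=15))   # R2 watching R1: True

# What would happen if R2 applied its OWN 3 second hold time to R1's hellos.
print(stays_up(hello_interval=5, hold_time=3))    # False
```

The last line is the failure you might have expected from the config. It doesn't happen because the hold time belongs to the sender, not the receiver.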

I don't know if this is a good thing, as it basically makes EIGRP very forgiving of bad configuration, but I've no doubt that if you look hard enough, you'll find neighbor relationships relying on this behavior. Fun eh!

Monday, July 26, 2010

How resilient does your DC network need to be?

We had a great conversation this weekend on packetpushers about how resilient you should make your datacentre switching. The guts of the conversation was this: on your core chassis devices, do you need to put in multiple supervisors, PSUs, etc.?

Now, the real answer to these kinds of questions is always a variant on 'it depends on the requirements'.
Ethan Banks was making the point that there are networks where a two second loss is a big deal. I guess we're primarily talking about trading houses/banks/call centres etc here, and in those places it's a fair point.

For pretty much anyone else, the first question I'd ask is 'How much money is it worth to avoid a 3 second loss once a year?'. Why those numbers? Assuming you're doing something vaguely sensible with Rapid Spanning Tree and/or your L3 routing protocol of choice, you should expect an unplanned device loss to be recovered in that time. Why once a year? If you're losing devices more often than that, you probably have an environmental issue you need to fix first.

So what's the significance of the 3 second drop? Well, of course you need to test in your environment, but you should expect TCP connections to stay up (or you can tweak your TCP stacks to ensure they will). SMB transfers may drop, VOIP calls will drop. Storage calls will fail. These things should all recover*. I'm sure you can think of a few more things.

* Yeah I know the VOIP call won't 'recover', but you can call them back. As long as it's not a regular event, this is usually an acceptable risk. As for the rest, test your applications, and see what will recover, and what dies horribly. The 'dies horribly' list is a set of applications which are the drivers for 'hitless redundancy'. Make sure it's made clear to the business that these are the apps which are responsible for the extra cost.

So once we have our understanding of consequence, we can take our techie hat off, and we need to start thinking like a business person. What is avoiding the occasional drop worth to the business? Throw it out to your internal customers - probably some of them will jump up and down and tell you they can't tolerate any loss - it must be up all the time. That's fine. Make them come up with a number - what will this failure scenario cost? Compare this cost to the price from Cisco/Juniper/Whoever for the extra kit, and this is your business case, one way or the other.
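The comparison itself is simple arithmetic. A back-of-the-envelope sketch in Python, where every figure is invented purely for illustration (your outage cost, failure rate, kit price and equipment lifetime will be different):

```python
def redundancy_case(cost_per_outage, outages_per_year, years, extra_kit_price):
    """Compare the expected cost of outages with the price of avoiding them.

    All inputs are estimates supplied by the business, not measurements.
    Returns (worth_buying, expected_loss).
    """
    expected_loss = cost_per_outage * outages_per_year * years
    return expected_loss > extra_kit_price, expected_loss


# Hypothetical: one 3 second drop a year costing 2,000, over a 5 year
# equipment life, versus 50,000 for redundant supervisors and PSUs.
worth_it, loss = redundancy_case(2000, 1, 5, 50000)
print(worth_it, loss)   # False 10000
```

If the application owner's number makes the first value come out True, they've written your business case for you.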

One useful trick (and it depends on how internal financing works in your company) is to budget to build your network to a certain 'reasonable' level of resiliency - and if any particular application owner/customer needs more, then they pay for the extra. It's not just about being a smartypants, but making people understand that these extra uptime percentage points get expensive. There is a consequence to the company's finances for demanding them. Often, it might turn out to be a lot cheaper to re-engineer the application to recover from a network failure.

Monday, July 19, 2010

The problem with net neutrality..

I've always considered myself in the 'yaa boo ISP opposition to net neutrality is bad' camp, but over the weekend I was speaking to a friend of mine who is a network engineer at a medium-sized UK ISP, who all but hissed when I mentioned it.

Why all the fuss? The best way to look at it (and the thing that ISPs are worried about) is not so much the stuff that people are doing today, but the changes that are on the horizon to how media gets delivered. Do any of you remember the old days, when kids were respectful, policemen kept order, the summers were always hot, and music and books were things you bought in shops? God knows how people got them into their Kindles, but hey.

To the ISPs these things are not huge deals. An eBook is a few hundred kilobytes, an album is say 18-25 MB. And the key point I guess is that they are one-off transactions. A typical user is only going to download them occasionally.

TV and video are next. At the moment we go to the video store, or maybe occasionally rent videos from iTunes or whoever, but primarily we still watch live TV via traditional methods. The ISP business in the UK got a real wakeup call when the BBC launched their iPlayer application, which allowed people to ignore schedules, and watch their TV whenever they wanted. Unlike PVR type systems, the content is delivered on demand from the cloud, rather than the user storing the media locally.

This hurt their networks really badly. But what really scares them is the question of what happens when this becomes the mainstream. Fast forward to a few weeks ago, when Google announced Google TV. A set top box which allows people to scour the Internet for TV content (no doubt with the help of YouTube acting as a CDN). Moderately scary. Then the fact that they've tied up with Sony to build it into their TVs. Scary. Also the story around the campfire is that Sony can hit the switch on however many million PS3s are out there and turn them into GoogleTV set top boxes. Oh and Microsoft can do something similar with the Xbox. Terrifying.

The Internet as it's designed today (a unicast based, far away datacentre centric network) simply couldn't cope. It would die horribly. It simply can't do that huge sustained volume. And this is why they want to be able to limit it. They have to.

Now in reality, this is a 'when' not 'if' scenario. So someone (IETF?) needs to start planning for this, because it's perfectly possible to do, it just takes a different design than we have right now for how traffic gets from A to B across the Internet. Unlike when I say 'someone should do it' at work, I doubt it'll end up being me. However I do suspect CDNs will become extremely important.

The main conclusion I came to over the weekend was to buy some Akamai stock! I know this will end up with big nasty corporate telephone companies controlling what we get access to, rather than the current system where Rupert Murdoch does, but hey, whoever said speech was free, right...

Location: Dublin, Ireland