Wednesday, May 19, 2010

Troubleshooting: Dan's mistake of the week..

I've spent time over the last two weeks pulling my hair out over why new traffic on an existing site-to-site VPN didn't work. Finally got it today, and reminded myself of an important lesson in the process.

Scenario



Take a simple site-to-site internet VPN between two sites (well, there are lots, but this is about a VPN between two sites), with two networks in each site. London has the networks 10.15.1.0/24 and 10.15.2.0/24. Dublin has 10.35.130.0/24 and 10.35.139.0/24. Each site has a pair of ASAs (up-to-date code).

Summary of the configuration here (it's not the real one for obvious reasons, and cut down to the key bits, but is accurate for the sake of the article) :

Dublin-PIX :

interface e0/0
nameif outside
security-level 0
ip address 35.35.35.35 255.255.255.0
!
interface e0/1
nameif inside
security-level 100
ip address 10.35.139.254 255.255.255.0
!
interface e0/2
nameif dmz
security-level 50
ip address 10.35.130.254 255.255.255.0
!
crypto map Dublin 24 match address Dublin-London
crypto map Dublin 24 set peer 15.15.15.15
crypto map Dublin 24 set transform-set AES-256VPN
crypto map Dublin interface outside
!
access-list Dublin-London extended permit ip 10.35.130.0 255.255.252.0 10.15.0.0 255.255.0.0
access-list Dublin-London extended permit ip 10.35.139.0 255.255.255.0 10.15.0.0 255.255.0.0
!
access-list NONAT extended permit ip 10.35.130.0 255.255.252.0 10.0.0.0 255.0.0.0
access-list NONAT extended permit ip 10.35.139.0 255.255.255.0 10.0.0.0 255.0.0.0
!
nat (inside) 0 access-list NONAT
nat (inside) 1 10.35.139.0 255.255.255.0
nat (dmz) 0 access-list NONAT
nat (dmz) 1 10.35.130.0 255.255.255.0
!
global (outside) 1 interface



London-PIX :

interface e0/0
nameif outside
security-level 0
ip address 15.15.15.15 255.255.255.0
!
interface e0/1
nameif inside
security-level 100
ip address 10.15.1.254 255.255.255.0
!
interface e0/2
nameif dmz
security-level 50
ip address 10.15.2.254 255.255.255.0
!
crypto map London 14 match address London-Dublin
crypto map London 14 set peer 35.35.35.35
crypto map London 14 set transform-set AES-256VPN
crypto map London interface outside
!
access-list London-Dublin extended permit ip 10.15.0.0 255.255.0.0 10.35.130.0 255.255.252.0
access-list London-Dublin extended permit ip 10.15.0.0 255.255.0.0 10.35.139.0 255.255.255.0
!
access-list NONAT extended permit ip 10.15.0.0 255.255.0.0 10.0.0.0 255.0.0.0
!
nat (inside) 0 access-list NONAT
nat (inside) 1 10.15.1.0 255.255.255.0
nat (dmz) 0 access-list NONAT
nat (dmz) 1 10.15.2.0 255.255.255.0
!
global (outside) 1 interface


Symptoms



The problem was that while traffic from 10.35.130.0/24 could get to machines in 10.15.1.0/24, traffic from 10.35.139.0/24 consistently could not. Running a packet-trace (at both ends) showed that it should work. Packets were leaving the Dublin (10.35.0.0) site and arriving at the London (10.15.0.0) site, but the response packets never made it back. A capture on the inside interface in London showed the server did respond. The rule access-lists were all correct.

To keep it simple - it's nothing to do with rules or the servers. It has probably never worked, but this is the first traffic to go between these two networks.

Simply - the packets from Dublin->London pass, but packets from London->Dublin don't.

What it could be



Once I've a clear definition of the problem, the next thing is to rule in or out the most likely causes.

The first thing that jumped to mind (and a common cause of issues with these symptoms) was that the NAT exemption wasn't set up correctly. Unless traffic is NAT exempted, it will NAT behind the interface IP, which will put it outside the VPN interesting traffic, and we would get these symptoms. However the packet-trace I did showed that the traffic was 'Allowed' to NAT exempt, so it wasn't that. I was actually still fairly convinced it could be, but eventually moved on.
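The NAT exemption check can also be reasoned about offline. Below is a minimal sketch, assuming a NONAT entry is just a source/destination network pair and using London's entry from the configuration above. The host addresses are hypothetical, chosen only for illustration.

```python
import ipaddress

# London's NONAT entry from the config above: (source network, destination network).
NONAT = [
    (ipaddress.ip_network("10.15.0.0/255.255.0.0"),
     ipaddress.ip_network("10.0.0.0/255.0.0.0")),
]

def is_nat_exempt(src, dst, acl=NONAT):
    """Return True if the (src, dst) pair matches any NONAT entry."""
    s = ipaddress.ip_address(src)
    d = ipaddress.ip_address(dst)
    return any(s in src_net and d in dst_net for src_net, dst_net in acl)

# Hypothetical hosts: a London server replying to a Dublin client.
print(is_nat_exempt("10.15.1.10", "10.35.139.20"))  # True: stays un-NATed
# The same server talking to an internet address is not exempt.
print(is_nat_exempt("10.15.1.10", "8.8.8.8"))       # False
```

The reply traffic is exempt here, which agrees with what the packet-trace reported: NAT was not the culprit.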

Second thought was 'could it be badly written ACLs for the VPN definition', or even some incorrect routing. After lots of staring, answer was nope, none of them.

As sherlock homes (he's a British policeman) once said, 'when you've ruled out the probable, then whatever remains, however improbable, must be the answer'. So we're into the unlikely stuff. I spent hours looking for odd traffic handling quirks of the ASA. Tried a few things. Rebooted them. I was getting nowhere.


So what was it?



Have you worked it out yet (I'll be impressed with you if you have)?

I think you cross a line when you decide it isn't a probable cause. You decide it's going to be something odd, then common sense leaves you and you start looking for crazy stuff. Don't get me wrong, sometimes it can be a bug, or a feature you don't understand properly, or something else crazy. But usually, it's something simple.

I showed you the same parts of the configuration that I focused on - the sections relevant to the connection in question. The sharp-eyed amongst you will have noticed that the sequence numbers in the crypto map hint at other entries prior to this one. Such as :


London :

crypto map London 9 match address London-Cork
crypto map London 9 set peer 9.9.9.9
crypto map London 9 set transform-set AES-256VPN
!
access-list London-Cork extended permit ip 10.15.0.0 255.255.0.0 10.35.136.0 255.255.252.0

You see, the second octet in the IP scheme denotes the country code (353 is Ireland - shortened to 35), and for some reason in the distant past someone decided to use non-contiguous addresses (I can only assume Satan was whispering in their ear). When the (older) Cork connection was set up (which has the 136, 137 and 138 ranges) they simplified it by using 10.35.136.0 255.255.252.0. That mask covers 10.35.136.0 through 10.35.139.255, so it also swallows Dublin's 10.35.139.0/24. As this is a lower sequence number in the crypto map, it gets hit first, and the traffic never gets down to the ACL we want it to hit, so London tries to send the replies down the Cork tunnel instead.
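To make the overlap concrete, here is a minimal sketch that models London's crypto map as an ordered list and returns the first entry a packet matches. It is a simplification for illustration, not how the ASA is implemented. The host addresses are hypothetical, and `strict=False` is an assumption that lets Python accept 10.35.130.0 255.255.252.0 by masking off the host bits.

```python
import ipaddress

def net(addr, mask):
    # strict=False masks off host bits (10.35.130.0/22 becomes 10.35.128.0/22).
    return ipaddress.ip_network(f"{addr}/{mask}", strict=False)

# London's crypto map entries: (sequence, peer, source, destination).
CRYPTO_MAP = [
    (14, "35.35.35.35", net("10.15.0.0", "255.255.0.0"), net("10.35.130.0", "255.255.252.0")),
    (14, "35.35.35.35", net("10.15.0.0", "255.255.0.0"), net("10.35.139.0", "255.255.255.0")),
    (9,  "9.9.9.9",     net("10.15.0.0", "255.255.0.0"), net("10.35.136.0", "255.255.252.0")),
]

def first_match(src, dst):
    """Return (sequence, peer) of the lowest-numbered entry matching the packet."""
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    for seq, peer, src_net, dst_net in sorted(CRYPTO_MAP, key=lambda e: e[0]):
        if s in src_net and d in dst_net:
            return seq, peer
    return None

# Reply from a London server to a host in Dublin's 10.35.139.0/24.
print(first_match("10.15.1.10", "10.35.139.20"))  # (9, '9.9.9.9') - the Cork peer
# Reply to a host in Dublin's 10.35.130.0/24 behaves as intended.
print(first_match("10.15.1.10", "10.35.130.20"))  # (14, '35.35.35.35')
```

The first result is the whole problem in one line: the 10.35.139.0/24 replies match sequence 9 before they ever reach sequence 14.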

As with so many of these things, once you work it out, it's obvious and you're dumb. So where did I go wrong? My biggest mistake was immediately deciding (see the second sentence of this article) that the issue could only be to do with the configuration of this particular site-site VPN (i.e. sequence 14 of the crypto map), or something crazy and probably global. I didn't look at the rest of the crypto map at all. Why would I?

Dan's approach to troubleshooting

When you've spent many years troubleshooting networks, you learn not to beat yourself up over silly mistakes like this. You learn from them and move on. I'll certainly not make this mistake again. However, to help you with the issues you've never encountered before, troubleshooting depends on being logical and methodical. It's so incredibly important, and normally something I take very seriously.

In this case, I wasn't as methodical or as logical as I should have been, and that's the real lesson I need to take away with me.
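One way to make that kind of check routine is to audit the whole crypto map for overlapping entries rather than reading only the sequence you care about. The following is a rough sketch under the same assumptions as the article's cut-down config. The entry list is typed in by hand, and `find_overlaps` is a hypothetical helper name, not a feature of any Cisco tool.

```python
import ipaddress
from itertools import combinations

def net(addr, mask):
    # strict=False masks off host bits so non-aligned entries are accepted.
    return ipaddress.ip_network(f"{addr}/{mask}", strict=False)

# (sequence, source, destination) for every crypto ACL line on the London ASA.
ENTRIES = [
    (9,  net("10.15.0.0", "255.255.0.0"), net("10.35.136.0", "255.255.252.0")),
    (14, net("10.15.0.0", "255.255.0.0"), net("10.35.130.0", "255.255.252.0")),
    (14, net("10.15.0.0", "255.255.0.0"), net("10.35.139.0", "255.255.255.0")),
]

def find_overlaps(entries):
    """List (earlier_seq, later_seq, later_destination) for overlapping entries."""
    hits = []
    ordered = sorted(entries, key=lambda e: e[0])
    for early, late in combinations(ordered, 2):
        if early[0] == late[0]:
            continue  # same sequence number means same tunnel, nothing to flag
        if early[1].overlaps(late[1]) and early[2].overlaps(late[2]):
            hits.append((early[0], late[0], str(late[2])))
    return hits

for early_seq, late_seq, dst in find_overlaps(ENTRIES):
    print(f"sequence {early_seq} overlaps sequence {late_seq} for {dst}")
# prints: sequence 9 overlaps sequence 14 for 10.35.139.0/24
```

Run against the full crypto map instead of one sequence, a check like this would have pointed at the Cork entry straight away.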


2 comments:

Max said...

Dear Dan,

Sherlock Holmes was not a policeman. Also as a fictional character, it is doubtful he has ever 'said' anything. You are quoting from a book by Sir Arthur Conan Doyle. According to my quote dictionary, you have misquoted him as well. *sigh*

The troubleshooting stuff is all very well, but if you're going to stray into narrative, then attention to detail will serve you well in that function too.

Dan Hughes said...

A joke is never funny when you have to explain to people it's a joke...