DNS takes a hit - maybe?

Submitted by dash on Wed, 03/07/2018 - 00:44

According to F5, 41% of the outages in data centers are related to DNS.  Today at around 12:19 PM CST, we took an outage.  The first thing we noticed was that VDI desktops had locked up.  After troubleshooting a bit, I found that DNS was not resolving.  So, the green text on the Umbrella VA... yeah, that was red.  Not good.  Pulling logs on our main access switch stack, I found 2 ports flapping.  The configuration for a port channel, along with the configuration on its member ports, was gone from the access switch.  I quickly shut down an interface and the flapping subsided.  What caused the configuration to disappear is a mystery.

Back to the DNS issue.  After some more troubleshooting, I found the firewall could get to the Umbrella servers, but the DC(s) and VA(s) could not.  I was still seeing issues with internal connectivity, so I opened a ticket with Cisco's ACI fabric team.  While waiting on an engineer, I narrowed the trouble down to the firewall.  But what was wrong with the firewall?  Nothing had changed.  As a workaround, I bypassed Umbrella and pointed our DC(s) to Google.  This required a firewall policy change to allow DNS from our DC(s) to Google.  I changed the DHCP scopes to point to the DNS servers and still nothing; the DC(s) couldn't get to DNS.  WTH...
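For anyone wanting to repeat the "who can actually reach DNS" check from a DC or any other host behind the firewall, here is a minimal sketch in Python, not what I ran that day.  It hand-builds a standard A-record query and sends it over UDP to whichever resolver you name.  The function names, the default test name example.com, and the 2-second timeout are all my own placeholders; 8.8.8.8 is Google's public resolver, and you would substitute your own VA addresses.

```python
import socket
import struct
from typing import Tuple

QUERY_ID = 0x1234  # arbitrary ID so we can match the reply to our query


def build_query(name: str, query_id: int = QUERY_ID) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def parse_header(response: bytes) -> Tuple[int, int, int]:
    """Return (query_id, rcode, answer_count) from a DNS response header."""
    if len(response) < 12:
        raise ValueError("response shorter than a DNS header")
    query_id, flags, _qd, ancount, _ns, _ar = struct.unpack(
        "!HHHHHH", response[:12]
    )
    return query_id, flags & 0x000F, ancount


def probe(resolver: str, name: str = "example.com", timeout: float = 2.0) -> str:
    """Send one query to `resolver` and describe what came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (resolver, 53))
            data, _addr = sock.recvfrom(512)
        except socket.timeout:
            return "timeout"  # typical when something in the path drops UDP/53
        except OSError as exc:
            return f"error: {exc}"
    try:
        query_id, rcode, ancount = parse_header(data)
    except ValueError as exc:
        return f"error: {exc}"
    if query_id != QUERY_ID:
        return "reply did not match our query"
    return f"rcode={rcode} answers={ancount}"
```

Calling probe("8.8.8.8") and then probe() against each VA address from the same DC gives a quick side-by-side.  A "timeout" from inside while the firewall itself can still resolve points at something in the path dropping the traffic rather than at the resolvers.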

Our issue was with the pre-filter policy.  I had to remove the ports from all pre-filter rules because a source and destination were specified (according to my TAC engineer).  Saved, deployed, and boom, DNS started resolving.  I also noticed that the Umbrella VA(s) went all green.  I changed the DNS servers and the DHCP scopes to point back to the VA(s), and left the rules that let the DNS servers hit Google in place in case we need to run through this exercise again.

Pulling logs, we noticed that all traffic from a VA was hitting an allow rule up until the outage.  Then traffic stopped processing: no connection events at all until we restored service.  After that, connection events showed pre-filter instead of allow, even though both sets of events showed that the policy they matched was the applied pre-filter policy.  This was on a pair of FTD 2110(s).  In LA, we have a managed pair of 5525-X firewalls running FTD, and any matches in the pre-filter do not show up in the connection events at all!
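The same two things we spotted by eye (a stretch with no connection events, then a change in the action shown) can be pulled out of an export with a few lines of Python.  This is only a sketch: it assumes the events have already been reduced to (timestamp, action) pairs for a single source, which is my own simplification and not the export format of any Cisco tool, and the 30-minute threshold is an arbitrary default.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Event = Tuple[datetime, str]  # (timestamp, action); a hypothetical, simplified shape


def summarize(
    events: List[Event],
    threshold: timedelta = timedelta(minutes=30),
) -> Tuple[List[Tuple[datetime, datetime]], List[Tuple[datetime, str, str]]]:
    """Find silent stretches and action changes in one source's events.

    Returns (gaps, changes):
      gaps    - (last_seen, next_seen) pairs more than `threshold` apart
      changes - (when, old_action, new_action) for each change of action
    """
    ordered = sorted(events, key=lambda event: event[0])
    gaps = []
    changes = []
    # Walk consecutive pairs of events in time order.
    for (t0, a0), (t1, a1) in zip(ordered, ordered[1:]):
        if t1 - t0 > threshold:
            gaps.append((t0, t1))
        if a0 != a1:
            changes.append((t1, a0, a1))
    return gaps, changes
```

Fed one VA's events, the gaps list should bracket the outage window and the changes list should show where the action flips from allow to pre-filter.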

I'll post the bug ID after TAC updates the case.  Fun day at Tortoise!!