On the morning of March 12, 2024, I was reminded again that the difference between “resilient” and “robust” isn’t just semantics — it’s the margin between inconvenience and outright failure.
This post breaks down another real outage we experienced and the mistakes I made along the way, in hopes that you won’t repeat them.
The Incident
At 6:24 AM, our network monitoring service alerted us that our SFTP server had lost internet access. Initially, it looked isolated — no other alerts, just this one box offline. I dove into OS and network troubleshooting on the server. But as minutes passed and additional reports rolled in, the scope started to expand. It wasn’t just the SFTP server — it was our entire primary internet connection at the data center, suffering from 20–80% packet loss.
The culprit? Our datacenter provider, whose multi-ISP aggregate connection was flaking out. They later admitted to the issue — after I’d already started diagnosing and mitigating. Par for the course.
The Expected vs. Actual Impact
Our firewall’s SD-WAN configuration should have taken care of this. It’s designed to detect degraded circuits and automatically reroute traffic over an alternate WAN connection.
Expected impact:
- Slight (a few seconds) interruption to “realtime” services like Zoom and Teams calls while traffic re-routed to the alternate WAN provider
Actual impact:
- The above, plus…
- No internet on wired devices at our headquarters location
- Confused and frustrated users walking into a network that felt half-dead
The Root Cause (a painful one)
It turns out, some of our firewall policies were hardcoding traffic to use a public IP associated with our degraded internet circuit — including:
- A policy for Bloomberg traffic (with outbound NAT binding to a single public IP address)
- A policy for generic outbound HTTP/S traffic from our HQ location (bound to the same IP)
This NAT configuration was a holdover from the days when IP whitelisting was the norm. Unfortunately, SD-WAN doesn’t override NAT bindings. So while our failover logic worked, traffic was still forced onto a failing route because of this sticky NAT policy.
Lesson learned: if your policy includes NAT bindings tied to a specific circuit, it can undo all the intelligence of your SD-WAN config.
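One way to keep this from creeping back in: if your firewall can export its policy set, a short script can flag any rule whose outbound NAT is pinned to a specific circuit. Here’s a minimal sketch in Python. The JSON schema and field names are purely hypothetical, and the flagged address is an RFC 5737 documentation IP standing in for the real one.

```python
import json

# Hypothetical export format: a list of policy objects, each with an optional
# "nat" block. Field names here are illustrative, not any vendor's schema.
DEGRADED_CIRCUIT_IPS = {"203.0.113.10"}  # documentation IP standing in for the real circuit address


def find_sticky_nat_policies(path):
    """Return (policy name, pinned IP) pairs for policies whose outbound NAT
    is bound to a public IP on a specific circuit."""
    with open(path) as f:
        policies = json.load(f)

    flagged = []
    for policy in policies:
        nat = policy.get("nat", {})
        pinned_ip = nat.get("ip-pool") or nat.get("source-ip")
        if pinned_ip in DEGRADED_CIRCUIT_IPS:
            flagged.append((policy.get("name", "<unnamed>"), pinned_ip))
    return flagged


if __name__ == "__main__":
    for name, ip in find_sticky_nat_policies("firewall_policies.json"):
        print(f"Policy '{name}' binds outbound NAT to {ip} and will ignore SD-WAN path selection")
```

Running something like this on a schedule, or as part of change review, would have caught both of the policies above long before the circuit degraded.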
Our Testing Regimen Had a Blind Spot
Prior to SD-WAN go-live, we tested WAN failover events using a test VLAN in our data center. We never tested from the real production segments.
Because SD-WAN is an overlay and abstracted from the underlying topology, I assumed (wrongly) that path failover would behave the same across all segments. But of course, it didn’t — especially with NAT complicating things.
We’ve now updated our disaster recovery test plan to include multi-segment WAN failover testing, not just isolated VLANs in the DC.
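The check itself doesn’t need to be fancy. Here’s a rough sketch, in plain Python rather than anything vendor-specific, of the kind of per-segment check that belongs in such a plan: from a host on each production segment, during a simulated failure of the primary circuit, confirm the public egress address actually moves to the backup circuit. The addresses below are RFC 5737 documentation IPs, and api.ipify.org is just one example of a public IP-echo service.

```python
import sys
import urllib.request

# Documentation IPs standing in for the real circuit addresses.
PRIMARY_EGRESS = "203.0.113.10"
BACKUP_EGRESS = "198.51.100.20"


def current_egress_ip(timeout=5):
    """Ask a public IP-echo service which address our traffic is leaving from."""
    with urllib.request.urlopen("https://api.ipify.org", timeout=timeout) as resp:
        return resp.read().decode().strip()


if __name__ == "__main__":
    # Run from a host on each production segment (HQ wired, HQ Wi-Fi, DC),
    # once before and once during the simulated failure of the primary circuit.
    ip = current_egress_ip()
    if ip == BACKUP_EGRESS:
        print(f"OK: egressing via the backup circuit ({ip})")
        sys.exit(0)
    print(f"FAIL: still egressing via {ip}; failover did not take effect on this segment")
    sys.exit(1)
```

Run from the HQ wired segment before go-live, a check like this would have surfaced the sticky NAT bindings as a FAIL instead of an outage.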
The Workaround
Once we realized wired traffic was locked onto a dead path, we pivoted quickly:
- Disconnected user docks to force clients onto Wi-Fi
- Wi-Fi SSIDs were already linked to separate uplinks (score one for design)
- Staff reconnected, got back to work, and we exhaled
Our IT staff absolutely killed it under pressure. Our helpdesk engineer, who was the only IT staffer on-site when this broke, calmly and quickly executed the workaround — zero panic, maximum grit. Our infrastructure engineer also skipped a doctor’s appointment to help. That’s the kind of team I’m lucky to lead.
Configuration Changes Post-Mortem
Actions taken immediately after we stopped the bleeding:
- Removed legacy NAT bindings from all firewall rules
- Tuned the SD-WAN “restore to service” threshold (from 5 checks to 1000) to avoid premature reintroduction of degraded links
- Documented the whole mess for better team knowledge sharing (and now, for you)
What I’d Do Differently
If I could hop in a time machine:
- I’d audit every single firewall policy for NAT bindings the day after SD-WAN go-live (maybe even the day before)
- I’d simulate WAN failure from each critical network segment, not just a test VLAN
- I’d set much more conservative thresholds for link restoration — degraded is often worse than dead (see the Appendix below)
Appendix: When broken is better than degraded
Network administrators have struggled for decades with automating routing switchovers between network segments. Getting it right is tricky, and I have met more firewalls with poorly designed auto-failover mechanisms than ones built with adequate forethought. Most of the bad configs work perfectly well when the primary internet connection goes completely dark, but struggle when connections are intermittently degraded. The latter scenario often leads to an unnecessarily high volume of switchover events, which can be disruptive for network users. This is one of the main problems SD-WAN was designed to solve.
SD-WAN is both a network failover and load-balancing tool, and by design it handles degraded connections very well. Switchover events are less disruptive to users under SD-WAN because of its load-balancing nature: rather than being killed upon switchover, existing sessions are allowed to persist on the originating link until they close. The most important detail to get right in these configurations is the threshold after which a previously degraded connection is deemed healthy again (and thus ready for use). Data from today’s event suggested that this threshold should sit well above the default in our firewall’s SD-WAN configuration, so I’ve increased it from 5 consecutive connection-quality checks to 1000. This should result in fewer unnecessary switchover events. If your firewall ships with a similarly aggressive default, you should do the same.
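To make the hysteresis concrete, here’s a minimal Python sketch of the restore logic (a toy model, not any vendor’s implementation): a link only returns to service after an unbroken run of passing health checks, and a single failure resets the counter. How long 1000 checks takes in wall-clock time depends entirely on your health-check interval, so tune the number to your own probes.

```python
from dataclasses import dataclass


@dataclass
class RestoreGate:
    """Gate a link's return to service behind consecutive passing health checks.

    A flapping link never accumulates enough consecutive passes to be
    reintroduced, so traffic stays on the stable backup path instead of
    bouncing back and forth.
    """
    required_good: int = 1000  # our post-incident value; the default was 5
    consecutive_good: int = 0
    in_service: bool = False

    def record_check(self, passed: bool) -> bool:
        """Feed in one health-check result; return whether the link is in service."""
        if passed:
            self.consecutive_good += 1
            if self.consecutive_good >= self.required_good:
                self.in_service = True
        else:
            # A single failed check pulls the link out and restarts the count.
            self.consecutive_good = 0
            self.in_service = False
        return self.in_service
```

In practice, the `passed` flag would come from whatever loss, latency, and jitter thresholds your firewall already uses to declare a link degraded.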