When “Resiliency” Broke the Internet: A Firewall HA Misconfiguration Story

Sometimes the most disruptive outages stem not from hardware failures or zero-day exploits, but from well-intentioned changes made under pressure. This is one of those stories—a tale of mistaken assumptions, lurking misconfigurations, and the curious fragility of high availability setups when their underlying wiring is off by just one port.

This post recounts a production network outage triggered by an LACP misconfiguration between a pair of firewalls and our core switches. The misconfiguration was introduced by a change I made while troubleshooting an unrelated issue. On the surface, everything looked fine, until it wasn’t.

Timeline of Events

October 8, 2023
While troubleshooting a separate issue, I noticed what appeared to be a misconfigured set of LACP groups connecting our core switches to a high availability firewall pair. To increase resiliency, I adjusted the port memberships to align with what I believed was an active/active HA setup: aggregated links spanning both firewalls.

Unbeknownst to me, the firewalls were configured for active/passive HA. The intended design was two separate LACP groups—one for each firewall. My change instead created cross-wired LACP groups combining ports from both firewalls, violating that architecture. Interestingly, this misconfiguration did not break production…at first.
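
The design rule I violated (in an active/passive pair, every LAG's member ports should land on exactly one firewall) is simple enough to check mechanically. Here is a minimal sketch in Python, assuming a hand-maintained map of switch ports to the firewall each one is cabled to. The port names, LAG names, and the `find_cross_wired_lags` helper are hypothetical illustrations, not our actual configuration:

```python
def find_cross_wired_lags(lag_members, port_to_firewall):
    """Return {lag: [firewalls]} for every LAG whose ports reach more than one firewall."""
    problems = {}
    for lag, ports in lag_members.items():
        # Collect the distinct firewalls this LAG's member ports are cabled to.
        firewalls = {port_to_firewall[port] for port in ports}
        if len(firewalls) > 1:
            problems[lag] = sorted(firewalls)
    return problems


# Hypothetical cabling: two core switches, one port from each to each firewall.
port_to_firewall = {
    "sw1/1": "fw-a", "sw2/1": "fw-a",
    "sw1/2": "fw-b", "sw2/2": "fw-b",
}

# Intended design: one LAG per firewall.
intended = {"lag1": ["sw1/1", "sw2/1"], "lag2": ["sw1/2", "sw2/2"]}
# What my change produced, roughly: each LAG mixes ports from both firewalls.
cross_wired = {"lag1": ["sw1/1", "sw1/2"], "lag2": ["sw2/1", "sw2/2"]}

print(find_cross_wired_lags(intended, port_to_firewall))     # {}
print(find_cross_wired_lags(cross_wired, port_to_firewall))  # both LAGs flagged
```

A check like this only helps if the cabling map is accurate, which is exactly the kind of documentation I did not have in front of me on October 8.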

Fast Forward: October 19, 2023

08:04:53: The active firewall’s interface port goes down due to an LACP timeout. The switch sees this and removes the port from its LACP group.
08:04:56: Almost immediately, the switch adds a new port—connected to the passive firewall—into the LACP group. The passive firewall appears to assume the active role due to the failover condition.
08:04:56: The previously failed interface on the active firewall comes back up.

The resulting instability effectively blackholed traffic. Devices on the wired LAN and critical systems in our virtual datacenter lost internet access.

08:09:53: Our monitoring system alerts IT staff. I was en route to another location and diverted to the datacenter.
08:20: Sent an employee-wide outage notice from the roadside.
08:35: Arrived onsite and began triage.
08:45: Disconnected the passive firewall. That change immediately resolved the problem by eliminating the conflicting LACP state.
08:58: Issued an “all clear” to the organization.
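
For the record, the intervals implied by this timeline can be worked out with a few lines of Python. This is just arithmetic on the timestamps above; the only assumption is that entries logged to the minute (08:45, 08:58) are treated as :00 seconds, and the variable names are mine:

```python
from datetime import datetime


def minutes_between(start, end):
    """Minutes elapsed between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


onset = "08:04:53"      # active firewall's port drops
alert = "08:09:53"      # monitoring alerts IT staff
mitigated = "08:45:00"  # passive firewall disconnected (seconds assumed)
all_clear = "08:58:00"  # all-clear issued (seconds assumed)

time_to_detect = minutes_between(onset, alert)
time_to_mitigate = minutes_between(onset, mitigated)
time_to_all_clear = minutes_between(onset, all_clear)

print(time_to_detect)               # 5.0
print(round(time_to_mitigate, 1))   # 40.1
print(round(time_to_all_clear, 1))  # 53.1
```

Roughly five minutes to detect and forty to mitigate, most of which was travel time to the datacenter.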

Root Cause

The issue stemmed from mismatched LACP groupings. Because the firewall pair operated in active/passive mode, only one firewall actively participated in LACP—masking the misconfiguration during normal operations. When the active firewall experienced a brief link failure, the passive firewall attempted to take over. But due to the misaligned switch port aggregation, the result was an inconsistent and broken forwarding path.
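
To make the mechanism concrete, here is a deliberately simplified toy model in Python. It is not a faithful LACP implementation: real selection logic depends on the vendor, system IDs, port priorities, and timers. It encodes just two assumptions: a LAG only bundles ports that share one partner device, and it keeps its current partner as long as any port to that partner is still up. The `ToyLag` class and the port and firewall names are hypothetical:

```python
class ToyLag:
    """Toy LAG: bundles only the up ports that share a single partner device."""

    def __init__(self, ports):
        # ports maps switch port name -> firewall it is cabled to.
        self.ports = dict(ports)
        self.up = {name: True for name in self.ports}
        self.partner = None
        self._reselect()

    def _reselect(self):
        up_partners = [self.ports[p] for p in self.ports if self.up[p]]
        # Keep the current partner while any of its ports is still up.
        if self.partner in up_partners:
            return
        # Otherwise fall over to the partner of the first port that is up.
        self.partner = up_partners[0] if up_partners else None

    def set_link(self, port, is_up):
        self.up[port] = is_up
        self._reselect()


def flap(lag, port):
    """Take a port down and bring it back, then report who the LAG talks to."""
    lag.set_link(port, False)
    lag.set_link(port, True)
    return lag.partner


# Cross-wired LAG: one port to each firewall; fw-a is the active unit.
cross_wired = ToyLag({"sw1/1": "fw-a", "sw1/2": "fw-b"})
# Correct LAG: both ports go to the same firewall.
correct = ToyLag({"sw1/1": "fw-a", "sw2/1": "fw-a"})

print(flap(cross_wired, "sw1/1"))  # fw-b: the LAG is now stuck on the passive unit
print(flap(correct, "sw1/1"))      # fw-a: the second port carried the LAG through
```

In the toy model, a three-second flap on the cross-wired LAG leaves the switch bundled to the passive firewall even after the original port recovers, which matches the broken forwarding path we saw. With one LAG per firewall, the same flap is a non-event.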

This is a textbook case of resiliency misconfiguration: systems built for fault tolerance introducing fragility because of incorrect assumptions about Layer 1/2 wiring and port grouping.

Remediation & Recovery

Given the proximity of a critical company event, my team opted to defer reconfiguration until the weekend to avoid further disruption.

Sunday, October 22: Rebuilt the LACP groups correctly—each firewall now has a discrete, isolated link aggregation to the core switch fabric.
Verified HA failover behavior with controlled link and firewall failures. No service loss observed.
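
One way to put a number on "no service loss" during a failover test is to run a steady probe (a ping, for example) through the firewall pair and then look for gaps in the results. A minimal sketch in Python, assuming the probe results have already been collected as (seconds, success) pairs. The `longest_outage` helper and the sample data are hypothetical and are not measurements from our test:

```python
def longest_outage(probes):
    """Longest outage in seconds, given time-sorted (seconds, success) probe results.

    An outage runs from the first failed probe to the next successful one.
    If the log ends mid-outage, it is measured up to the last probe.
    """
    longest = 0.0
    outage_start = None
    for t, ok in probes:
        if not ok and outage_start is None:
            outage_start = t                      # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, t - outage_start)
            outage_start = None                   # outage ends
    if outage_start is not None:
        longest = max(longest, probes[-1][0] - outage_start)
    return longest


# Made-up sample: one probe per second, two probes lost during a failover.
sample = [(0, True), (1, True), (2, False), (3, False), (4, True), (5, True)]
print(longest_outage(sample))                     # 2
print(longest_outage([(0, True), (1, True)]))     # 0.0
```

The probe interval sets the resolution: with one probe per second, anything shorter than a second can slip through unnoticed.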

Lessons Learned

Document the HA mode before changing switch port behavior.
Active/passive vs. active/active configurations have drastically different requirements for upstream topology.
If it looks wrong but works, verify anyway.
A “working” misconfiguration is a ticking time bomb—especially in high-availability or redundant setups. Test after every change.
Firewall failovers are silent until they aren’t.
The handoff between HA firewalls must be tested end-to-end, including link aggregation behavior.
Layer 1/2 assumptions can cause Layer 3 failures.
Always validate link-level configurations when working with LAGs, HA pairs, and dynamic link negotiation.
Make rollback paths explicit.
In this case, yanking the passive firewall resolved the issue. That decision was made on instinct—documenting and rehearsing similar recovery paths will make future responses smoother.

Visuals

Correct LACP Configuration

A properly configured set of LAGs between firewalls and switches, which allows for the loss of an entire switch and/or an entire firewall with no impact to normal network operations.

Misconfigured (But Functional) LACP Setup

Misconfigured LAGs with a working traffic flow from the switches to the (Active) Firewall.

Misconfigured (Non-Functional) LACP Setup

The aftermath of a LAG failover in a misconfigured state, in which traffic from the switch to the HA firewall pair routes to the (Passive) Firewall B. The result: loss of internet connectivity.

Final Thoughts

This was a case where a subtle misstep, made with good intentions, took down production. These are the scars that shape better infrastructure thinking. High availability is only as resilient as its underlying assumptions. And in network engineering, those assumptions are often buried deep in switch configs and cable diagrams, where they can quietly disagree with what is actually plugged in at Layer 1.

Root cause? Misconfigured LACP groups.
Root lesson? Resiliency without repeat validation is a liability.
