
Welcome to Root Cause — where I share hard-won lessons and practical insights from my work as an IT, infosec, and AI engineer, and my adventures as a gigging musician.
On the morning of March 12, 2024, I was reminded again that the difference between “resilient” and “robust” isn’t just semantics: it’s the margin between inconvenience and outright failure. This post breaks down another real outage we experienced and the mistakes I made along the way, in hopes that you won’t repeat them.

The Incident

At 6:24 AM, our network monitoring service alerted us that our SFTP server had lost internet access. Initially, it looked isolated: no other alerts, just this one box offline. I dove into OS and network troubleshooting on the server. But as minutes passed and additional reports rolled in, the scope started to expand. It wasn’t just the SFTP server; it was our entire primary internet connection at the data center, suffering from 20–80% packet loss. ...
Sometimes the most disruptive outages stem not from hardware failures or zero-day exploits, but from well-intentioned changes made under pressure. This is one of those stories—a tale of mistaken assumptions, lurking misconfigurations, and the curious fragility of high availability setups when their underlying wiring is off by just one port. This post recounts a production network outage triggered by an LACP misconfiguration between a pair of firewalls and core switches, misaligned due to a change made while resolving a previous incident. On the surface, everything looked fine—until it wasn’t. ...
Not the kind you can blame on junior engineers, flaky hardware, or vague documentation. I mean real mistakes. Misconfigured firewalls that broke production. Deployments that triggered outages. Security controls that looked great on paper but failed under pressure. Strategic decisions that aged poorly as the full picture emerged. I used to hide them — tidy them up in postmortems. Frame them as “learnings.” Bury them in chat threads. Smile in meetings and say “we handled it.” That’s what we’re trained to do — protect credibility, defend authority, keep the gears turning. ...