Network issues linked to Microsoft systems causing outages at businesses worldwide

return2ozma@lemmy.world · 4 months ago


SirDerpy@lemmy.world · edit-2 4 months ago

EDIT: There’s no tier 1 failure. My shit’s just fucked.

I don’t have easy access to much high-level, non-public network data outside my organization. But, based on what I can see, it sure looks like at least one piece of the US fiber backbone has gone kaput.

I decided our small organization would react as if this were true. But, we really don’t know and haven’t had time to figure it out, yet.

The worst case scenario for regular people is that their shift gets cut today, possibly also tomorrow, because their assigned computer is fucked.

breadsmasher@lemmy.world · 4 months ago

My place took a hit from this, but it seems to have more to do with CrowdStrike. They issued an update which caused machines to go into reboot loops, requiring manual intervention. My assumption was that Microsoft uses CrowdStrike internally and also took a hit.

SirDerpy@lemmy.world · 4 months ago

deleted by creator

breadsmasher@lemmy.world · 4 months ago

No idea why someone downvoted you for this!

Absolutely, we are entirely unsure of the real root cause too (until Microsoft releases an explanation).

We have pretty simple networking, but for a while internal vnet communication was really all over the place. That seems to have stabilised for us recently.

(UK south / west regions)

SirDerpy@lemmy.world · 4 months ago

deleted by creator

breadsmasher@lemmy.world · 4 months ago

… If your shift isn’t getting cut …

Exactly! We made sure to contact everyone who would travel to an office to inform them it’s at least a half day off and to check in at lunch time. We emailed everyone WFH the same thing, and sent an SMS too, just in case their computer is down.

But I (I assume you are in the same boat) was up much earlier to tackle this! Fun day.

SirDerpy@lemmy.world · edit-2 4 months ago

I’ve got easy mode: On-call woke me up. We’re five techies and equity holders with one employee, an intern. We pushed a few buttons to effect plan B and enable plan C, and sent a medium priority notification by chat.

Then, we sent the intern to the office to watch the meters. If she fucks up, it’s not a problem until about Tuesday. So, we’re waiting to see if she automates the task, uses the existing routing template to ensure she’s notified of wonkiness if she leaves the office, and asks permission. She’s reinventing the wheel and can then compare to our work. I thought I’d give her a couple hours before checking in.

JonsJava@lemmy.world · 4 months ago

The article has been updated with the root cause: CrowdStrike. The reason is simple: Azure has tons of Windows systems that are protected with CrowdStrike Falcon. CrowdStrike released a bad version that is causing boot loops on Windows computers, including Windows VM servers.

SirDerpy@lemmy.world · edit-2 4 months ago

deleted by creator

JonsJava@lemmy.world · edit-2 4 months ago

Microsoft, Azure, and CrowdStrike have all stated the root cause at this point. Furthermore, this tells me most Falcon sensor installs are configured badly, as we also use CrowdStrike and have ours set to “latest version - 1” to ensure this exact thing doesn’t happen.
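
For anyone unfamiliar with the idea, here is a rough sketch of what a “latest version - 1” rule means in general. This is not CrowdStrike’s API or console; the function name `pick_n_minus` and the version numbers are made up for illustration, and it assumes versions are plain dotted-number strings.

```python
def pick_n_minus(available, offset=1):
    """Return the release that sits `offset` steps behind the newest one.

    Hypothetical helper: assumes every version is a dotted string of
    integers such as "7.10.0".
    """
    # Compare numerically so that "7.10.0" ranks above "7.9.0".
    ordered = sorted(
        set(available),
        key=lambda v: tuple(int(part) for part in v.split(".")),
    )
    if len(ordered) <= offset:
        raise ValueError("not enough releases to stay behind the latest")
    return ordered[-1 - offset]


# Made-up version numbers, for illustration only.
releases = ["7.9.0", "7.10.0", "7.11.0"]
print(pick_n_minus(releases))  # the release just before the newest
```

The point of the rule is simply that the newest release gets some time in the wild before your own machines pick it up.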

SirDerpy@lemmy.world · 4 months ago

deleted by creator

JonsJava@lemmy.world · 4 months ago

There aren’t any backbone outages being discussed right now. Many servers that run MANY services are on Windows, using CrowdStrike. That includes flights, banks, and entertainment (some Netflix, for example).

The overall result: it looks like a backbone outage, but isn’t.

SirDerpy@lemmy.world · 4 months ago

Thank you.

But, fuck. That means we screwed up primary design or someone broke the contract.

Gotta work today.