Microsoft 365, Teams and Outlook outage: Here’s what went wrong


Microsoft says a router update caused a huge, hours-long outage last week that affected its wide area network (WAN) and left Azure, Microsoft 365 apps, and Power Platform inaccessible worldwide.

Last Wednesday’s outage affected Microsoft Teams, Exchange Online, Outlook, SharePoint Online, OneDrive for Business, Microsoft Graph, Power BI, M365 Admin Portal, Microsoft Intune, Microsoft Defender for Cloud Apps, and Microsoft Defender for Identity.

Prior to the outage, Microsoft advised customers that a planned update could cause latency or delays when they attempted to connect to resources in public Azure regions, as well as to Microsoft 365 and Power BI. But as European workers started the day, it became clear that the update was causing far more than latency.

A planned change that went wrong

The update directly affected network devices across Microsoft’s WAN, which brought down connections between services in data centers as well as connections over ExpressRoute, Microsoft’s private networking service that lets customers transfer data between data centers.

Microsoft says in its preliminary post-incident review that most regions and services had recovered by 09:00 UTC on Wednesday, but all services were not restored until 12:43 UTC on January 25. The outage also affected Azure Government cloud services that depended on the Azure public cloud, according to Microsoft.

“We determined that a change to Microsoft’s wide area network (WAN) impacted connectivity between customers on the Internet and Azure, connectivity between regions, and connectivity between sites via ExpressRoute,” Microsoft explains in its report.

“As part of a planned change to update the IP address of a WAN router, a command given to the router caused messages to be sent to all other routers in the WAN, requesting that they all recalculate their adjacency and forwarding tables. During this recalculation process, the routers were not able to correctly forward packets. The command causing the problem behaves differently on different network devices, and the command had not been verified using our full qualification process on the router it was run on.”
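To illustrate the failure mode Microsoft describes, here is a toy sketch in Python. It is not a model of Microsoft's network or of any real routing protocol: the three-router topology, the `ToyRouter` class, and the `recomputing` flag are all assumptions made for illustration. It shows only the general idea that a router with no usable forwarding table drops traffic, so a recalculation triggered on every router at once interrupts forwarding everywhere until the tables are rebuilt.

```python
from collections import deque


def build_forwarding_table(adjacency, source):
    """Breadth-first search mapping each destination to the next hop from source."""
    next_hop = {}
    queue = deque()
    for neighbor in adjacency[source]:
        next_hop[neighbor] = neighbor
        queue.append(neighbor)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor != source and neighbor not in next_hop:
                # Reach this neighbor through the same first hop used to reach node.
                next_hop[neighbor] = next_hop[node]
                queue.append(neighbor)
    return next_hop


class ToyRouter:
    """Hypothetical router that cannot forward while it recomputes its table."""

    def __init__(self, name, adjacency):
        self.name = name
        self.table = build_forwarding_table(adjacency, name)
        self.recomputing = False

    def forward(self, destination):
        """Return the next hop, or None if the packet is dropped."""
        if self.recomputing:
            return None
        return self.table.get(destination)


# Hypothetical three-router WAN: A - B - C
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
routers = {name: ToyRouter(name, adjacency) for name in adjacency}
packets = [("A", "C"), ("C", "A"), ("B", "A")]


def delivered(packets):
    """Count packets whose first hop can be resolved by the sending router."""
    return sum(1 for src, dst in packets if routers[src].forward(dst) is not None)


before = delivered(packets)

# One command causes every router to start recomputing at the same time.
for router in routers.values():
    router.recomputing = True
during = delivered(packets)

# Recalculation finishes and forwarding resumes.
for router in routers.values():
    router.table = build_forwarding_table(adjacency, router.name)
    router.recomputing = False
after = delivered(packets)

print(before, during, after)
```

In this sketch all three packets are forwarded before and after the recalculation and none during it. Real routers behave in more nuanced ways, and the report does not say which protocol or device was involved.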

Packet loss

Microsoft’s monitoring systems detected Domain Name System (DNS) and WAN issues at 07:12 UTC. At 08:20 UTC, while automatic recovery was under way, engineers reviewing recent changes identified the “problematic command” causing the issues.

“Due to the impact on the WAN, our automated systems were paused, including the systems for identifying and removing unhealthy devices and the traffic engineering system that optimizes data flow over the network,” Microsoft said.

“Due to the pause of these systems, some network paths experienced increased packet loss beginning at 09:35 UTC until these systems were manually restarted, restoring the WAN to optimal operating conditions. This recovery ended at 12:43 UTC.”
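The timestamps quoted in this article can be turned into durations with a few lines of Python. This is a minimal sketch: the `TIMELINE` keys and the `minutes_between` helper are names chosen here for illustration, and the calculation assumes all five timestamps fall on the same UTC day, as the article indicates.

```python
from datetime import datetime


def minutes_between(start, end):
    """Minutes from start to end, both 'HH:MM' strings on the same UTC day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps (UTC) as reported in the article; the key names are illustrative.
TIMELINE = {
    "detected": "07:12",
    "command_identified": "08:20",
    "most_services_restored": "09:00",
    "packet_loss_began": "09:35",
    "fully_recovered": "12:43",
}

detection_to_diagnosis = minutes_between(
    TIMELINE["detected"], TIMELINE["command_identified"]
)
detection_to_most_restored = minutes_between(
    TIMELINE["detected"], TIMELINE["most_services_restored"]
)
packet_loss_window = minutes_between(
    TIMELINE["packet_loss_began"], TIMELINE["fully_recovered"]
)
detection_to_full_recovery = minutes_between(
    TIMELINE["detected"], TIMELINE["fully_recovered"]
)

print(f"Detection to diagnosis:       {detection_to_diagnosis} min")
print(f"Detection to most restored:   {detection_to_most_restored} min")
print(f"Elevated packet loss window:  {packet_loss_window} min")
print(f"Detection to full recovery:   {detection_to_full_recovery} min")
```

Measured from the moment monitoring detected the problem, full recovery took 331 minutes, or about five and a half hours, and the period of elevated packet loss accounted for 188 of them.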

Microsoft now says it has “blocked the execution of high-impact commands on devices” to prevent this from happening again.
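Microsoft has not described how this block is implemented. As a purely hypothetical sketch of the general idea, a pre-execution check could refuse any high-impact command that has not been qualified on the specific device model it targets. The model names, command names, and the `QUALIFIED` and `HIGH_IMPACT` sets below are invented for illustration and do not correspond to any real tooling.

```python
# Hypothetical set of commands considered high-impact.
HIGH_IMPACT = {"set-interface-address"}

# Hypothetical record of (device_model, command) pairs that passed qualification.
QUALIFIED = {("model-x", "set-interface-address")}


class CommandBlocked(Exception):
    """Raised when a high-impact command is not qualified for the target device."""


def check_command(device_model, command):
    """Return True if the command may run on this device model, else raise."""
    if command in HIGH_IMPACT and (device_model, command) not in QUALIFIED:
        raise CommandBlocked(
            f"{command!r} has not been qualified on {device_model!r}"
        )
    return True


# A qualified pairing is allowed through.
allowed = check_command("model-x", "set-interface-address")

# The same command on an unqualified device model is refused.
try:
    check_command("model-y", "set-interface-address")
    blocked = False
except CommandBlocked:
    blocked = True

print(allowed, blocked)
```

Keying the check on the device model as well as the command reflects the point in Microsoft's statement that the same command can behave differently on different network devices.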

Microsoft plans to release a final post-incident report within the next two weeks.



Source: “ZDNet.com”




