At this moment I see many services becoming unavailable, and portal.azure.com is no longer reachable. The status page on azure.com reports issues with accessing the portal, but the effects appear to be broader than that. At Microsoft, this case is tracked under the Tracking ID KTY1-HW8.
Based on my monitoring, the disruptions started at 12:06 UTC (14:06 CEST).
After the big outage caused by CrowdStrike on July 19th, 2024, the next big issue follows right away on July 30th, 2024. Having contingency plans and disaster recovery processes becomes more and more important.
Problem found: network issue 15:01 CEST
I can see that some traffic is still reaching the services: depending on the ISP, I can access certain services through one provider but not through another.
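To check this from the client side, a minimal reachability probe along the following lines is sufficient (the endpoint list is only an illustrative placeholder, not my actual monitoring setup):

```python
import datetime
import urllib.error
import urllib.request

# Placeholder endpoints - replace with the services you actually depend on.
ENDPOINTS = [
    "https://portal.azure.com",
    "https://azure.status.microsoft/en-us/status",
]

def probe(url: str, timeout: float = 10.0) -> str:
    """Return a short status string for a single HTTP GET."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:       # a response arrived, but with an error code
        return f"HTTP {exc.code}"
    except (urllib.error.URLError, TimeoutError) as exc:  # no usable response at all
        return f"FAILED ({exc})"

if __name__ == "__main__":
    now = datetime.datetime.now(datetime.timezone.utc).isoformat(timespec="seconds")
    for url in ENDPOINTS:
        print(f"{now}  {url}  {probe(url)}")
```

Running this from connections of different ISPs makes the behaviour visible: the same endpoint answers on one path and times out on the other.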
Global issue 15:21 CEST
The status indicators show that this issue is not geographically limited.
Front Door
There are signs that this issue is related to Azure Front Door. Services running behind an Application Gateway do not appear to be experiencing these problems.
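A quick way to check whether one of your own endpoints is served through Azure Front Door is to inspect the response headers; AFD typically stamps responses with an X-Azure-Ref header. This is only a sketch with a placeholder hostname, and the absence of the header is not definitive proof either way:

```python
import urllib.request

# Placeholder hostname - substitute one of your own endpoints.
URL = "https://www.example.com/"

req = urllib.request.Request(URL, method="HEAD")
with urllib.request.urlopen(req, timeout=10) as resp:
    # Azure Front Door usually adds an X-Azure-Ref tracking header to responses.
    ref = resp.headers.get("X-Azure-Ref")
    print("X-Azure-Ref:", ref if ref else "<not present>")
```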
Help and Support
At first, portal.azure.com refused to load, but now it’s available again. However, opening a support case still does not work.
Recovery 15:58 CEST
I can now access my account through more ISPs, and the graphs show that the incident may be resolved.
I’ve seen that my services have been steadily available since 15:58, but as of now (16:27 CEST), there’s still no official update from Microsoft regarding recovery.
An unexpected usage spike resulted in Azure Front Door (AFD) and Azure Content Delivery Network (CDN) components performing below acceptable thresholds, leading to intermittent errors, timeout, and latency spikes.
Source: https://azure.status.microsoft/en-us/status (2024-07-30 22:12)
Official Statement, July 31st
Mitigation Statement – Azure Front Door – Issues accessing a subset of Microsoft services
What happened?
Between approximately 11:45 UTC and 19:43 UTC on 30 July 2024, a subset of customers may have experienced issues connecting to a subset of Microsoft services globally. Impacted services included Azure App Services, Application Insights, Azure IoT Central, Azure Log Search Alerts, Azure Policy, as well as the Azure portal itself and a subset of Microsoft 365 and Microsoft Purview services.
What do we know so far?
An unexpected usage spike resulted in Azure Front Door (AFD) and Azure Content Delivery Network (CDN) components performing below acceptable thresholds, leading to intermittent errors, timeout, and latency spikes. While the initial trigger event was a Distributed Denial-of-Service (DDoS) attack, which activated our DDoS protection mechanisms, initial investigations suggest that an error in the implementation of our defenses amplified the impact of the attack rather than mitigating it.
How did we respond?
Customer impact began at 11:45 UTC and we started investigating. Once the nature of the usage spike was understood, we implemented networking configuration changes to support our DDoS protection efforts, and performed failovers to alternate networking paths to provide relief. Our initial network configuration changes successfully mitigated the majority of the impact by 14:10 UTC. Some customers reported less than 100% availability, which we began mitigating at around 18:00 UTC. We proceeded with an updated mitigation approach, first rolling this out across regions in Asia Pacific and Europe. After validating that this revised approach successfully eliminated the side effect impacts of the initial mitigation, we rolled it out to regions in the Americas. Failure rates returned to pre-incident levels by 19:43 UTC – after monitoring traffic and services to ensure that the issue was fully mitigated, we declared the incident mitigated at 20:48 UTC. Some downstream services took longer to recover, depending on how they were configured to use AFD and/or CDN.
What happens next?
Our team will be completing an internal retrospective to understand the incident in more detail. We will publish a Preliminary Post Incident Review (PIR) within approximately 72 hours, to share more details on what happened and how we responded. After our internal retrospective is completed, generally within 14 days, we will publish a Final Post Incident Review with any additional details and learnings. To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts. For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs. Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness.
Source: https://azure.status.microsoft/en-us/status/history (2024-07-31 09:11)
Preliminary Post Incident Review (PIR) – Azure Front Door – Connectivity issues in multiple regions, August 5th
This is our Preliminary PIR that we endeavor to publish within 3 days of incident mitigation to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a “Final” PIR with additional details/learnings.
What happened?
Between 11:45 and 13:58 UTC on 30 July 2024, a subset of customers experienced intermittent connection errors, timeouts, or latency spikes while connecting to Microsoft services that leverage Azure Front Door (AFD) and Azure Content Delivery Network (CDN).
The two main impacted services were Azure Front Door (AFD) and Azure Content Delivery Network (CDN), and downstream services that rely on these – including the Azure portal, and a subset of Microsoft 365 and Microsoft Purview services.
From 13:58 to 19:43 UTC, a smaller set of customers continued to observe a low rate of connection timeouts.
What went wrong and why?
Azure Front Door (AFD) is Microsoft’s scalable platform for web acceleration, global load balancing, and content delivery, operating in nearly 200 locations worldwide – including datacenters within Azure regions, and edge sites. AFD and Azure CDN are built with platform defenses against network and application layer Distributed Denial-of-Service (DDoS) attacks. In addition to this, these services rely on the Azure network DDoS protection service, for the attacks at the network layer. You can read more about the protection mechanisms at https://learn.microsoft.com/azure/ddos-protection/ddos-protection-overview and https://learn.microsoft.com/azure/frontdoor/front-door-ddos.
Between 10:15 and 10:45 UTC, a volumetric distributed TCP SYN flood DDoS attack occurred at multiple Azure Front Door and CDN sites. This attack was automatically mitigated by the Azure Network DDoS protection service and had minimal customer impact.
At 11:45 UTC, as the Network DDoS protection service was disengaging and resuming default traffic routing to the Azure Front Door service, the network routes could not be updated within one specific site in Europe. This happened because of Network DDoS control plane failures to that specific site, due to a local power outage. Consequently, traffic inside Europe continued to be forwarded to AFD through our DDoS protection services, instead of returning directly to AFD. This event in isolation would not have caused any impact.
However, an unrelated latent network configuration issue caused traffic from outside Europe to be routed to the DDoS protection system within Europe. This led to localized congestion, which caused customers to experience high latency and connectivity failures across multiple regions. The vast majority of the impact was mitigated by 13:58 UTC, around two hours later when we resolved the routing issue. A small subset of customers without retry logic in their application may have experienced residual effects until 19:43 UTC.
How did we respond?
Our internal monitors detected impact on our Europe edge sites at 11:47 UTC, immediately prompting a series of investigations. Once we identified that the network routes could not be updated within that one specific site, we updated the DDoS protection configuration system to avoid traffic congestion. These changes successfully mitigated most of the impact by 13:58 UTC. Availability returned to pre-incident levels by 19:43 UTC once the default network policies were fully restored.
How we are making incidents like this less likely or less impactful
• We have already added the missing configuration on network devices (the gap that resulted in the traffic redirection), to ensure that a DDoS mitigation issue in one geography cannot spread to other geographies in the Europe region. (Completed)
• We are enhancing our existing validation and monitoring in the Azure network, to detect invalid configurations. (Estimated completion: November 2024)
• We are improving our monitoring where our DDoS protection service is unreachable from the control plane, but is still serving traffic. (Estimated completion: November 2024)
How can customers make incidents like this less impactful
• For customers of Azure Front Door/Azure CDN products, implementing retry logic in your client-side applications can help handle temporary failures when connecting to a service or network resource during mitigations of network layer DDoS attacks. For more information, refer to our recommended error-handling design patterns: https://learn.microsoft.com/azure/well-architected/resiliency/app-design-error-handling#implement-retry-logic.
• Applications that use exponential-backoff in their retry strategy may have seen success, as an immediate retry during intervals of high packet loss may have also seen high packet loss. A retry conducted during periods of lower loss would likely have succeeded. For more details on retry patterns, refer to https://learn.microsoft.com/azure/architecture/patterns/retry. (A minimal sketch of such a retry strategy follows after this list.)
• More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency.
• Finally, ensure that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These alerts can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
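To make the two retry bullets above concrete, here is a minimal client-side sketch of retry logic with exponential backoff and full jitter. This is my own illustration, not part of Microsoft's guidance; the URL, status codes, and delay values are placeholders to adapt to your application:

```python
import random
import time
import urllib.error
import urllib.request

def get_with_retries(url: str, max_attempts: int = 5, base_delay: float = 0.5,
                     max_delay: float = 30.0, timeout: float = 10.0) -> bytes:
    """Fetch a URL, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except urllib.error.HTTPError as exc:
            # Retry only throttling / server-side errors; client errors will not heal on retry.
            if exc.code not in (429, 500, 502, 503, 504) or attempt == max_attempts:
                raise
        except (urllib.error.URLError, TimeoutError):
            # Connection errors and timeouts - exactly the symptoms seen during this incident.
            if attempt == max_attempts:
                raise
        # Exponential backoff with full jitter: an immediate retry would likely hit the
        # same window of packet loss, so wait an increasing, randomized interval instead.
        delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
        time.sleep(random.uniform(0, delay))

# Example usage (placeholder URL):
# payload = get_with_retries("https://myapp.example.net/api/health")
```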
How can we make our incident communications more useful?
• You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/KTY1-HW8.
Source: https://azure.status.microsoft/en-us/status/history (2024-08-05 08:36)
My learning
After the big outage caused by CrowdStrike on July 19th, 2024, we saw the next big issue, this time affecting Azure on a global scale, on July 30th, 2024. Having incident management, contingency plans, and disaster recovery processes in place becomes more and more important in the world of cloud computing in 2024.