Intermittent connection time-outs on status pages and admin portal

Incident
August 01, 1:07pm EDT

Status: closed
Start: July 30, 9:30am EDT
End: July 30, 6:00pm EDT
Duration: 8 hours 30 minutes
Affected Components: Status pages, Admin application
Update

July 30, 9:30am EDT

StatusCast engineers were alerted earlier that some users were experiencing sporadic issues connecting to the status page and admin portal. Our hosting provider, Microsoft Azure, has reported on its status page that it is experiencing network issues globally. We will provide an update as soon as more information is available.

Update

July 30, 10:44am EDT

Access to status pages has remained stable, and Azure has updated its status to indicate that failover processes have been engaged to improve service availability. StatusCast's engineers will continue to watch this closely and will post additional updates as necessary.

Update

July 30, 4:47pm EDT

StatusCast's application has remained stable. Our engineers will continue to watch the system closely, as Microsoft has not fully closed out the event on their side. For more specific details on Azure's issue, please refer to their status page. We will provide additional updates as necessary.

Resolved

July 30, 6:00pm EDT

Microsoft has closed the issue on their side and StatusCast's platform continues to operate as expected. Once Microsoft has published more details on this incident, we will provide them here in the form of an RCA.

Root Cause

August 01, 1:07pm EDT

FROM MICROSOFT:
Mitigation Statement - Azure Front Door Issues accessing a subset of Microsoft services
Tracking ID: KTY1-HW8

What happened?

Between approximately 11:45 UTC and 19:43 UTC on 30 July 2024, a subset of customers may have experienced issues connecting to a subset of Microsoft services globally. Impacted services included Azure App Services, Application Insights, Azure IoT Central, Azure Log Search Alerts, Azure Policy, as well as the Azure portal itself and a subset of Microsoft 365 and Microsoft Purview services.

What do we know so far?

An unexpected usage spike resulted in Azure Front Door (AFD) and Azure Content Delivery Network (CDN) components performing below acceptable thresholds, leading to intermittent errors, timeouts, and latency spikes. While the initial trigger event was a Distributed Denial-of-Service (DDoS) attack, which activated our DDoS protection mechanisms, initial investigations suggest that an error in the implementation of our defenses amplified the impact of the attack rather than mitigating it.

How did we respond?

Customer impact began at 11:45 UTC and we started investigating. Once the nature of the usage spike was understood, we implemented networking configuration changes to support our DDoS protection efforts, and performed failovers to alternate networking paths to provide relief. Our initial network configuration changes successfully mitigated the majority of the impact by 14:10 UTC. Some customers reported less than 100% availability, which we began mitigating at around 18:00 UTC. We proceeded with an updated mitigation approach, first rolling this out across regions in Asia Pacific and Europe. After validating that this revised approach successfully eliminated the side effect impacts of the initial mitigation, we rolled it out to regions in the Americas. Failure rates returned to pre-incident levels by 19:43 UTC. After monitoring traffic and services to ensure that the issue was fully mitigated, we declared the incident mitigated at 20:48 UTC. Some downstream services took longer to recover, depending on how they were configured to use AFD and/or CDN.

What happens next?

Our team will be completing an internal retrospective to understand the incident in more detail. We will publish a Preliminary Post Incident Review (PIR) within approximately 72 hours, to share more details on what happened and how we responded. After our internal retrospective is completed, generally within 14 days, we will publish a Final Post Incident Review with any additional details and learnings. To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts. For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs. Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness.