Intermittent Connectivity Issues - East US - Investigating

Incident
January 18, 5:50pm EST

Status: closed
Start: January 18, 9:12am EST
End: January 18, 5:19pm EST
Duration: 8 hours 7 minutes
Affected Components:
Cloud Providers Azure Network Infrastructure
Update

January 18, 2:48pm EST

Impact Statement: Starting at 14:12 UTC on 18 Jan 2024, a limited subset of customers in East US may experience short periods of application latency or intermittent HTTP 500-level response codes and/or timeouts when connecting to resources hosted in this region. Internal telemetry indicates that these interruptions are brief and appear in spikes, each lasting approximately 2-5 minutes, with fewer than 5 spikes over a 3-hour period.


Current Status: Engineering teams have identified a root cause for this issue and are currently exploring mitigation options. The next update will be provided in 2 hours or as events warrant.

Update

January 18, 3:04pm EST

Impact Statement: Starting at 14:12 UTC on 18 Jan 2024, customers in East US may experience short periods of application latency or intermittent HTTP 500-level response codes and/or timeouts when connecting to resources hosted in this region. Internal telemetry indicates that these interruptions are brief and appear in spikes, each lasting approximately 2-5 minutes, with fewer than 5 spikes over a 3-hour period.

Current Status: Engineering teams have identified a root cause for this issue and are currently exploring mitigation options. The next update will be provided in 2 hours or as events warrant.

Update

January 18, 5:07pm EST

Impact Statement: Starting at 14:12 UTC on 18 Jan 2024, customers in East US may experience short periods of application latency or intermittent HTTP 500-level response codes and/or timeouts when connecting to resources hosted in this region. Internal telemetry indicates that these interruptions are brief and appear in spikes, each lasting approximately 2-5 minutes, with fewer than 5 spikes over a 3-hour period.

Current Status: Engineering teams have identified a root cause for this issue and are exploring mitigation options. Our telemetry shows no additional spikes in the past 2-3 hours. We will continue to monitor the service and will provide an update in 2 hours or as events warrant.

Resolved

January 18, 5:19pm EST

Summary of Impact: Between 14:12 UTC and 16:52 UTC on 18 Jan 2024, customers in East US may have experienced short periods of application latency or intermittent HTTP 500-level response codes and/or timeouts when connecting to resources hosted in this region. Internal telemetry indicated that these interruptions were brief and appeared in spikes, each lasting approximately 2-5 minutes, with fewer than 5 spikes over a 3-hour period.

Current Status: This incident is now mitigated. More details will be provided shortly.

Resolved

January 18, 5:50pm EST

Summary of Impact: Between 14:12 UTC and 16:52 UTC on 18 Jan 2024, customers in East US may have experienced short periods of application latency or intermittent HTTP 500-level response codes and/or timeouts when connecting to resources hosted in this region. 
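
The impact window above is stated in UTC, while the incident record uses EST. As a rough cross-check, the following Python sketch converts between the two and computes both durations. The variable names and the fixed UTC-5 offset for EST are assumptions made for illustration; the timestamps are taken from this report.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: EST is modeled as a fixed UTC-5 offset (no daylight saving in January).
EST = timezone(timedelta(hours=-5), "EST")

# Timestamps as stated in this incident report.
impact_start = datetime(2024, 1, 18, 14, 12, tzinfo=UTC)
impact_end = datetime(2024, 1, 18, 16, 52, tzinfo=UTC)
incident_closed = datetime(2024, 1, 18, 17, 19, tzinfo=EST)  # 5:19pm EST


def fmt(delta: timedelta) -> str:
    """Format a timedelta as hours and minutes."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


impact_window = impact_end - impact_start          # period in which spikes occurred
incident_duration = incident_closed - impact_start  # start of impact to incident close

print("Impact start (EST):", impact_start.astimezone(EST).strftime("%H:%M %Z"))
print("Impact end (EST):  ", impact_end.astimezone(EST).strftime("%H:%M %Z"))
print("Impact window:     ", fmt(impact_window))      # 2h 40m
print("Incident duration: ", fmt(incident_duration))  # 8h 07m
```

The 2 hour 40 minute impact window is consistent with the "2-3 hours" cited for the spikes, and the 8 hour 7 minute figure matches the Duration listed for the incident, which also covers the monitoring period after the last spike.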

 

Preliminary Root Cause: Engineers observed a sudden increase in traffic to an underlying network endpoint in the East US region. This increase occurred in quick spikes (fewer than 5) over the course of 2-3 hours. During these spikes, customers with resources in the region whose network traffic was routed through this endpoint may have encountered periods of packet loss and service interruption.

Mitigation: Engineers identified and isolated the source of the sudden increases in network traffic.

Next Steps: Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review to all impacted customers. To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts. For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs. Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness.