Azure

Cloud Providers Azure Network Infrastructure
 
Jan-18, 2:48pm EST

Impact Statement: Starting at 14:12 UTC on 18 Jan 2024, a limited subset of customers in East US may experience short periods of application latency or intermittent HTTP 500-level response codes and/or timeouts when connecting to resources hosted in this region. Internal telemetry indicates that these interruptions are brief and appear in spikes, lasting approximately 2-5 minutes at a time with less than 5 spikes over a 3 hour period.


Current Status: Engineering teams have identified a root cause for this issue and are currently exploring mitigation options. The next update will be provided in 2 hours or as events warrant.

 
Jan-18, 3:04pm EST

Impact Statement: Starting at 14:12 UTC on 18 Jan 2024, customers in East US may experience short periods of application latency or intermittent HTTP 500-level response codes and/or timeouts when connecting to resources hosted in this region. Internal telemetry indicates that these interruptions are brief and appear in spikes, lasting approximately 2-5 minutes at a time with less than 5 spikes over a 3 hour period.

 

Current Status: Engineering teams have identified a root cause for this issue and are currently exploring mitigation options. The next update will be provided in 2 hours or as events warrant.

 
Jan-18, 5:07pm EST

Impact Statement: Starting at 14:12 UTC on 18 Jan 2024, customers in East US may experience short periods of application latency or intermittent HTTP 500-level response codes and/or timeouts when connecting to resources hosted in this region. Internal telemetry indicates that these interruptions are brief and appear in spikes, lasting approximately 2-5 minutes at a time with less than 5 spikes over a 3 hour period.

 

Current Status: Engineering teams have identified a root cause for this issue and are currently exploring mitigation options. We have continued to monitor the status of the service and we can confirm that our telemetry indicates that there have been no additional spikes in the past 2-3 hours. We will continue to monitor and provide an update in 2 hours or as events warrant.

 
Jan-18, 5:19pm EST

Summary of Impact: Between 14:12 UTC and 16:52 UTC on 18 Jan 2024, customers in East US may have experienced short periods of application latency or intermittent HTTP 500-level response codes and/or timeouts when connecting to resources hosted in this region. Internal telemetry indicated that these interruptions were brief and appeared in spikes, lasting approximately 2-5 minutes at a time with less than 5 spikes over a 3 hour period.

 

Current Status: This incident is now mitigated. More details will be provided shortly.

 
Jan-18, 5:50pm EST

Summary of Impact: Between 14:12 UTC and 16:52 UTC on 18 Jan 2024, customers in East US may have experienced short periods of application latency or intermittent HTTP 500-level response codes and/or timeouts when connecting to resources hosted in this region. 

 

Preliminary Root Cause: Engineers observed a sudden increase in traffic to an underlying network endpoint in the East US region. This increase happened in quick spikes(less than 5) over the course of 2-3 hours . When these spikes occurred, customers with resources in the region with network traffic routed through this endpoint may have encountered periods of packet loss and service interruption.

 

Mitigation: Engineers identified and isolated the source of the sudden increases in network traffic.

 

Next Steps: Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review to all impacted customers. To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts. For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs. Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness.

Cloud Providers Azure
 
Nov-28, 1:32am EST

Summary of Impact: Between 23:19 UTC on 27 Nov 2023 and 02:20 UTC on 28 Nov 2023, you were identified as a customer using Azure Cosmos DB in East US, who may have experienced Service Availability issues while performing Management operations in the Azure Portal or Azure CLI.


Preliminary Root Cause: We determined that this issue was caused due to an inadvertent human error when trying to perform a preventative mitigation.


Mitigation: The mitigation involved restoring the affected configurations to their prior state.


Next Steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.

 
Nov-28, 1:41am EST

Summary of Impact: Between 23:19 UTC on 27 Nov 2023 and 02:20 UTC on 28 Nov 2023, you were identified as a customer using Azure Cosmos DB in East US, who may have experienced Service Availability issues while performing Management operations in the Azure Portal or Azure CLI.


Preliminary Root Cause: We determined that this issue was caused due to an inadvertent human error when trying to perform a preventative mitigation.


Mitigation: The mitigation involved restoring the affected configurations to their prior state.


Next Steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.

Cloud Providers Azure Network Infrastructure
 
Nov-22, 3:22pm EST

Summary of Impact: At 18:29 UTC, 20:04 UTC, and 20:24 UTC on 22 Nov 2023 for periods of 1 minute, 15 minutes, and 5 minutes respectively, resources in a single availability zone in East US may have seen intermittent network connection failures, delays, packet loss when reaching out to services across the region. However, the retries would have been successful. 


Preliminary Root Cause: While conducting an urgent break-fix repair on some network capacity in a single availability zone of East US, live traffic was impacted. Traffic was re-routed to alternate spans with sufficient capacity. The region is now stable. Customers can continue to operate workloads in the impacted Availability Zone. 


Mitigation: We have moved traffic to healthy optical fiber spans in the region. The issue is mitigated.


Next Steps: We will investigate why this impact occurred to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.

 
Nov-22, 3:23pm EST
Resolved
Cloud Providers Azure
 
Oct-7, 2:47pm EDT

Impact statement: Beginning as early as 11 Aug 2023, you have been identified as a customer experiencing timeouts and high server load for smaller size caches (C0/C1/C2).


Current status: Investigation revealed the cause to be a change in behavior of one of the Azure security monitoring services agent used by Azure Cache for Redis. Monitoring Agent subscribes to the event log and has scheduled backoff for resetting subscription in case no events are received. In some cases scheduled backoff is not working as expected and can increase the frequency of subscription resetting which can significantly affect CPU usage for smaller size caches. Currently, we are in progress of rolling out of the hotfix to the impacted regions which is 80% completed. Initially we estimated this to complete by 13 Oct 2023, however, progress shows we are expected to complete by 11 Oct 2023. To prevent impact till the fix is rolled out we are applying short term mitigation to all caches which will reduce the log file size. The next update will be provided by 19:00 UTC on 8 Oct 2023 or as events warrant, to allow time for the short term mitigation to progress.

 
Oct-7, 2:54pm EDT

Impact statement: Beginning as early as 11 Aug 2023, you have been identified as a customer experiencing timeouts and high server load for smaller size caches (C0/C1/C2).


Current status: Investigation revealed the cause to be a change in behavior of one of the Azure security monitoring services agent used by Azure Cache for Redis. Monitoring Agent subscribes to the event log and has scheduled backoff for resetting subscription in case no events are received. In some cases scheduled backoff is not working as expected and can increase the frequency of subscription resetting which can significantly affect CPU usage for smaller size caches. Currently, we are in progress of rolling out of the hotfix to the impacted regions which is 80% completed. Initially we estimated this to complete by 11 Oct 2023, however, progress shows we are expected to complete by 09 Oct 2023. To prevent impact till the fix is rolled out we are applying short term mitigation to all caches which will reduce the log file size. The next update will be provided by 19:00 UTC on 8 Oct 2023 or as events warrant, to allow time for the short term mitigation to progress.

 
Oct-8, 2:11pm EDT

Summary of Impact: Between as early as 11 Aug 2023 and 18:00 UTC on 8 Oct 2023, you were identified as a customer who may have experienced timeouts and high server load for smaller size caches (C0/C1/C2).


Current Status: This issue is now mitigated. More information will be provided shortly.

 
Oct-8, 2:55pm EDT

What happened? 

Between as early as 11 Aug 2023 and 18:00 UTC on 8 Oct 2023, you were identified as a customer who may have experienced timeouts and high server load for smaller size caches (C0/C1/C2).

 

What do we know so far? 

We identified a change in behavior of one of the Azure security monitoring services agents used by Azure Cache for Redis. Monitoring Agent subscribes to the event log and has scheduled backoff for resetting subscription in case no events are received. In some cases, scheduled backoff is not working as expected and can increase the frequency of subscription resetting which can significantly affect CPU usage for smaller size caches.

 

How did we respond?

To address this issue, engineers performed manual action on the underlying Virtual Machines of impacted caches. After further monitoring, internal telemetry confirmed this issue is mitigated and full-service functionality was restored.

 

What happens next? 

We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.

Cloud Providers Azure
 
Aug-18, 4:30pm EDT

Summary of Impact: Between 20:30 UTC on 18 Aug. 23 and 05:10 UTC on 19 Aug. 23, you were identified as customer using Workspace-based Application Insights resources who may have experienced 7-10% data gaps during the impact window, and potentially incorrect alert activations.

Preliminary Root Cause: We identified that the issue was caused due to a code bug as part of the latest deployment which has caused some drop in the data.

Mitigation: We have rolled back the deployment to last known good build to mitigate the issue.

Additional Information: Following additional recovery efforts, we have re-ingested the data that was not correctly ingested due to this event, after further investigation it was discovered that the initial re-ingested data had incorrect TimeGenerated values, instead of the original TimeGenerated value. This may cause incorrect query results which may further cause incorrect alerts or report generation. We have investigated the issue that caused this behavior so future events utilizing data recovery processes will re-ingest the data with the correct, original TimeGenerated value.

If you need any further assistance on this, please raise the support ticket for the same.

 
Aug-18, 4:31pm EDT
Resolved
 
Sep-13, 2:44pm EDT

Summary of Impact: Between 20:30 UTC on 18 Aug. 23 and 05:10 UTC on 19 Aug. 23, you were identified as customer using Workspace-based Application Insights resources who may have experienced 7-10% data gaps during the impact window, and potentially incorrect alert activations.

 

Preliminary Root Cause: We identified that the issue was caused due to a code bug as part of the latest deployment which has caused some drop in the data.

 

Mitigation: We have rolled back the deployment to last known good build to mitigate the issue.

 

Additional Information: Following additional recovery efforts, we re-ingested the data that was not correctly ingested due to this event, after further investigation it was discovered that the initial re-ingested data had incorrect TimeGenerated values, instead of the original TimeGenerated value. This may have caused incorrect query results which may have further caused incorrect alerts or report generation. Our investigation extended past previous mitigation and we were able to identify a secondary code bug that caused this behavior. We deployed a hotfix using our Safe Deployment Procedures that re-ingested the data with the correct, original TimeGenerated value. All regions are now recovered and previously incorrect TimeGenerated values are now corrected.

 

If you need any further assistance on this, please raise the support ticket for the same.

 

Next Steps: We will continue to investigate to establish the full root cause and prevent future occurrence. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.

Cloud Providers Azure
 
Jul-23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.

Current Status: We are actively investigating this issue and will provide more information within 60 minutes.

 
Jul-23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.

Current Status: We are actively investigating this issue and will provide more information within 60 minutes.

 
Jul-23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We anticipate this mitigation workstream to take up to 4 hours to complete. An update on the status of this mitigation effort will be provided within 2 hours.


 
Jul-23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We anticipate this mitigation workstream to take up to 4 hours to complete. An update on the status of this mitigation effort will be provided within 2 hours.

 
Jul-23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We completed more than half of this mitigation workstream, which is anticipated to restore affected services and mitigate customer impact once completed. Customers may begin seeing signs of recovery and resolution of this event is anticipated to occur within 2 hours. An update on the status of this mitigation effort will be provided within 60 minutes.

 
Jul-23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We completed more than half of this mitigation workstream, which is anticipated to restore affected services and mitigate customer impact once completed. Customers may begin seeing signs of recovery and resolution of this event is anticipated to occur within 2 hours. An update on the status of this mitigation effort will be provided within 60 minutes.

 
Jul-23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We are approximately 855 complete with this final mitigation workstream, which is anticipated to restore affected services and mitigate customer impact once completed. Customers may begin seeing signs of recovery and resolution of this event is anticipated to occur within 60 minutes. The next update will be provided within 60 minutes.

 
Jul-23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We are approximately 85% complete with this final mitigation workstream, which is anticipated to restore affected services and mitigate customer impact once completed. Customers may begin seeing signs of recovery and resolution of this event is anticipated to occur within 60 minutes. The next update will be provided within 60 minutes.

 
Jul-23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. We have completed our latest workstream across all instances of the affected ingestion service. A very small subset of instances remains unhealthy, where additional action is ongoing to complete recovery of the ingestion service and mitigate remaining impact. Customers may be seeing signs of recovery. An update on the status of the mitigation effort will be provided within 60 minutes.

 
Jul-23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. We have completed our latest workstream across all instances of the affected ingestion service. A very small subset of instances remains unhealthy, where additional action is ongoing to complete recovery of the ingestion service and mitigate remaining impact. Customers may be seeing signs of recovery. An update on the status of the mitigation effort will be provided within 60 minutes.

 
Jul-23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. We are progressing with the recovery of the remaining unhealthy service instances, which is estimated to complete within 60 minutes. Customers may be seeing signs of recovery. An update on the status of the service instance recovery effort will be provided within 60 minutes.

 
Jul-23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. We are progressing with the recovery of the remaining unhealthy service instances, which is estimated to complete within 60 minutes. Customers may be seeing signs of recovery. An update on the status of the service instance recovery effort will be provided within 60 minutes.

 
Jul-23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. Our telemetry shows that ingestion errors have almost returned back to normal and most customers should be seeing signs of recovery at this time. We are continuing to address what remains to be a small number of errors occurring on some ingestion service instances. An update will be provided within 60 minutes, or as soon as mitigation has been confirmed.

 
Jul-23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. Our telemetry shows that ingestion errors have almost returned back to normal and most customers should be seeing signs of recovery at this time. We are continuing to address what remains to be a small number of errors occurring on some ingestion service instances. An update will be provided within 60 minutes, or as soon as mitigation has been confirmed.

 
Jul-23, 9:39am EDT

Summary of Impact: Between 07:15 UTC on 23 July 2023 and 00:05 UTC on July 24 2023, a subset of customer using Application Insights workspaces enabled and Azure Monitor Storage Logs may have experienced intermittent data latency, and incorrect alert activation.


This incident is now mitigated. More information will be provided shortly.

 
Jul-23, 9:39am EDT

Summary of Impact: Between 07:15 UTC on 23 July 2023 and 00:05 UTC on Jul 24 2023, a subset of customer using Application Insights workspaces enabled and Azure Monitor Storage Logs may have experienced intermittent data gaps, data latency, and incorrect alert activation.

This incident is now mitigated. More information will be provided shortly.

 
Jul-23, 9:39am EDT

Summary of Impact: Between 07:15 UTC on 23 July 2023 and 00:05 UTC on Jul 24 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may have experienced intermittent data gaps, data latency, and incorrect alert activation.


Preliminary Root Cause: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform.

 

Mitigation: We rolled back services to previous version to restore the health of the system. It took longer for full mitigation due to various cache layers that required clearing it out so that data can be ingested as expected.

 

Next Steps: We are continuing to investigate the underlying cause of this event to identify additional repairs to help prevent future occurrences for this class of issue. . Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.

 
Jul-23, 9:40am EDT

Summary of Impact: Between 07:15 UTC on 23 July 2023 and 00:05 UTC on July 24 2023, a subset of customer using Application Insights workspaces enabled and Azure Monitor Storage Logs may have experienced intermittent data latency, and incorrect alert activation.

 

Preliminary Root Cause: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform.

 

Mitigation: We rolled back services to previous version to restore the health of the system. It took longer for full mitigation due to various cache layers that required clearing it out so that data can be ingested as expected.

 

Next Steps: We are continuing to investigate the underlying cause of this event to identify additional repairs to help prevent future occurrences for this class of issue. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.

June 26, 9:47am EDT
Cloud Providers Azure Azure Frontdoor
 
Jun-26, 9:47am EDT

Impact Statement: Starting at 06:45 UTC on 26 Jun 2023 you have been identified as a customer using Azure Front Door who may have encountered intermittent HTTP 502 errors response codes when accessing Azure Front Door CDN services.


Current Status: Based on our initial investigation, we have determined that a subset of AFD POPs became unhealthy and unable to handle the load of incoming requests, which in turn impacted Azure Front Door availability.


After a successfully simulation we are currently applying the potential mitigation workstream, by removing impacted instances from rotation while monitoring traffic allocations to remaining healthy cluster. The first set of impacted instances have been successfully removed at 15:30 UTC on 26 Jun 2023. Since the removal we have been seeing a reduction of errors and we will continue to monitor impact. We are currently working on removing the rest of the impacted instances and allocating the resources to a healthy alternative. Some customers may already begin to see signs of recovery. The next update will be provided in 60 minutes, or as events warrant.

 
Jun-26, 9:47am EDT

Summary of Impact: Between 06:45 UTC and 17:15 UTC on 26 Jun 2023 you were identified as a customer using Azure Front Door who may have encountered intermittent HTTP 502 errors response codes when accessing Azure Front Door CDN services.


This issue is now mitigated, more information on mitigation will be provided shortly.

 
Jun-26, 9:48am EDT

Summary of Impact: Between 06:45 UTC and 17:15 UTC on 26 Jun 2023 you were identified as a customer using Azure Front Door who may have encountered intermittent HTTP 502 errors response codes when accessing Azure Front Door CDN services.


Preliminary Root Cause: We found that a subset of AFD POPs were throwing errors and unable to process requests. 


Mitigation: We moved the resources from the affected AFD POPs to healthy alternatives which returned the service to a healthy state.


Next Steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation. 

How to stay informed about Azure service issues


Cloud Providers Azure Network Infrastructure
 
Jun-10, 4:30pm EDT

Summary of Impact: Between 20:33 UTC - 21:00 UTC on 10 Jun. 2023, customers in East US may have experienced impact on network communications, due to hardware failure of a router, during a planned maintenance. Retries would have been successful.

 

Preliminary Root Cause: We have determined that the router self-healed at 21:00 UTC.

 

Mitigation: We have isolated the device as a precaution and stopped further upgrades.

 

Next Steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.

 
Jun-10, 4:31pm EDT
Resolved
Cloud Providers Azure
 
Mar-16, 5:24am EDT

Between 03:30 UTC and 13:52 UTC on 16 Mar 2023 you were identified as a customer using Azure Redis Cache that may have experienced some service degradation such as unexpected failovers, timeouts and intermittent connectivity issues.

This issue is now mitigated; more information will follow shortly. 

 
Mar-16, 5:24am EDT

We are investigating an alert for Azure Redis Cache. We will provide more information as it becomes available. 

 
Mar-16, 5:24am EDT

Impact Statement: Starting at 03:30 UTC on 16 Mar 2023, you have been identified as a customer using Azure Redis Cache who may experience some service degradation such as unexpected failovers, timeouts and intermittent connectivity issues.


Current Status: We are aware of the issue and are actively investigating. The next update will be provided in 60 minutes or as events warrant.

 
Mar-16, 5:24am EDT

Impact Statement: Starting at 03:30 UTC on 16 Mar 2023, you have been identified as a customer using Azure Redis Cache who may experience some service degradation such as unexpected failovers, timeouts and intermittent connectivity issues.


Current Status: We have identified the cause of this to be an issue with a recent deployment. We are in the early stages of developing a hotfix to mitigate this issue. The next update will be provided in 1 hour or as events warrant.

 
Mar-16, 5:24am EDT

Impact Statement: Starting at 03:30 UTC on 16 Mar 2023, you have been identified as a customer using Azure Redis Cache who may experience some service degradation such as unexpected failovers, timeouts and intermittent connectivity issues.


Current Status: We have identified the potential root cause to be an issue with a recent deployment. We are in the early stages of developing a hotfix to mitigate this, in the meantime, we are looking into pausing the deployment to avoid further disruption. The next update will be provided in 2 hours, or as events warrant.

 
Mar-16, 5:24am EDT

Impact Statement: Starting at 03:30 UTC on 16 Mar 2023, you have been identified as a customer using Azure Redis Cache who may experience some service degradation such as unexpected failovers, timeouts and intermittent connectivity issues.


Current Status: We have identified the potential root cause to be a bug which was introduced during a recent deployment and is causing the unexpected failovers to occur. We are working to pause the deployment to avoid further disruption. There is no current workaround for this issue, however, once failover process is completed for all nodes of a cache resource, cache health should return to normal. We are also working on a hotfix that will be deployed in the coming days to resolve the underlying issue.

The next update will be provided in 2 hours, or as events warrant.

 
Mar-16, 5:24am EDT

Impact Statement: Starting at 03:30 UTC on 16 Mar 2023, you have been identified as a customer using Azure Redis Cache who may experience some service degradation such as unexpected failovers, timeouts and intermittent connectivity issues.


Current Status: We have identified the cause of this to be a bug which was introduced during a recent deployment and is causing the unexpected failovers to occur. We have put a pause on the deployment as a short term fix while we continue to work on developing the hotfix to address the underlying issue which we expect to mitigate fully this. There is no current workaround for this, however, once failover process is completed for all nodes of a cache resource, cache health should return to normal. 

The next update will be provided in 2 hours, or as events warrant.

 
Mar-16, 5:24am EDT

Summary of Impact: Between 03:30 UTC and 13:52 UTC on 16 Mar 2023 you were identified as a customer using Azure Redis Cache that may have experienced some service degradation such as unexpected failovers, timeouts and intermittent connectivity issues.


Preliminary Root Cause: We identified that a recent deployment introduced a bug, which led to the unexpected failovers, timeouts and connectivity issues mentioned above. 


Mitigation: We mitigated this by stopping the deployment which was causing these issues, and can confirm that this has mitigated impact for customers. We are still in the process of rolling out the hotfix as a long term fix. 


Next steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.

Cloud Providers Azure
 
Jan-25, 2:17am EST

Summary of Impact: Between 07:05 UTC and 09:45 UTC on 25 January 2023, customers experienced issues with networking connectivity, manifesting as network latency and/or timeouts when attempting to connect to Azure resources in Public Azure regions, as well as other Microsoft services including M365 and PowerBI.


Preliminary Root Cause: We determined that a change made to the Microsoft Wide Area Network (WAN) impacted connectivity between clients on the internet to Azure, as well as connectivity between services in different regions, as well as ExpressRoute connections.


Mitigation: We identified a recent change to WAN as the underlying cause and have rolled back this change. Networking telemetry shows recovery from 09:00 UTC onwards across all regions and services with the final networking equipment recovering at 09:35 UTC. Most impacted Microsoft services automatically recovered once network connectivity was restored, and we worked to recover the remaining impacted services.


Next Steps: We will follow up in 3 days with a preliminary Post Incident Report (PIR), which will cover the initial root cause and repair items. We'll follow that up 14 days later with a final PIR where we will share a deep dive into the incident. You can stay informed about Azure service issues, maintenance events, or advisories by creating custom service health alerts (https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation) and you will be notified via your preferred communication channel(s).

 
Jan-25, 2:18am EST
Resolved