Application Insights - Advisory

Incident
July 23, 9:40am EDT

Application Insights - Advisory

Status: closed
Start: July 23, 9:39am EDT
End: July 23, 9:39am EDT
Duration: 1m
Affected Components:
Cloud Providers Azure
Update

July 23, 9:39am EDT

July 23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.

Current Status: We are actively investigating this issue and will provide more information within 60 minutes.

Update

July 23, 9:39am EDT

July 23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.

Current Status: We are actively investigating this issue and will provide more information within 60 minutes.

Update

July 23, 9:39am EDT

July 23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We anticipate this mitigation workstream to take up to 4 hours to complete. An update on the status of this mitigation effort will be provided within 2 hours.


Update

July 23, 9:39am EDT

July 23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We anticipate this mitigation workstream to take up to 4 hours to complete. An update on the status of this mitigation effort will be provided within 2 hours.

Update

July 23, 9:39am EDT

July 23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We completed more than half of this mitigation workstream, which is anticipated to restore affected services and mitigate customer impact once completed. Customers may begin seeing signs of recovery and resolution of this event is anticipated to occur within 2 hours. An update on the status of this mitigation effort will be provided within 60 minutes.

Update

July 23, 9:39am EDT

July 23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We completed more than half of this mitigation workstream, which is anticipated to restore affected services and mitigate customer impact once completed. Customers may begin seeing signs of recovery and resolution of this event is anticipated to occur within 2 hours. An update on the status of this mitigation effort will be provided within 60 minutes.

Update

July 23, 9:39am EDT

July 23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We are approximately 855 complete with this final mitigation workstream, which is anticipated to restore affected services and mitigate customer impact once completed. Customers may begin seeing signs of recovery and resolution of this event is anticipated to occur within 60 minutes. The next update will be provided within 60 minutes.

Update

July 23, 9:39am EDT

July 23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. To completely restore data ingestion back to normal, we are actively rebooting instances of an ingestion component. We are approximately 85% complete with this final mitigation workstream, which is anticipated to restore affected services and mitigate customer impact once completed. Customers may begin seeing signs of recovery and resolution of this event is anticipated to occur within 60 minutes. The next update will be provided within 60 minutes.

Update

July 23, 9:39am EDT

July 23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. We have completed our latest workstream across all instances of the affected ingestion service. A very small subset of instances remains unhealthy, where additional action is ongoing to complete recovery of the ingestion service and mitigate remaining impact. Customers may be seeing signs of recovery. An update on the status of the mitigation effort will be provided within 60 minutes.

Update

July 23, 9:39am EDT

July 23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. We have completed our latest workstream across all instances of the affected ingestion service. A very small subset of instances remains unhealthy, where additional action is ongoing to complete recovery of the ingestion service and mitigate remaining impact. Customers may be seeing signs of recovery. An update on the status of the mitigation effort will be provided within 60 minutes.

Update

July 23, 9:39am EDT

July 23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. We are progressing with the recovery of the remaining unhealthy service instances, which is estimated to complete within 60 minutes. Customers may be seeing signs of recovery. An update on the status of the service instance recovery effort will be provided within 60 minutes.

Update

July 23, 9:39am EDT

July 23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. We are progressing with the recovery of the remaining unhealthy service instances, which is estimated to complete within 60 minutes. Customers may be seeing signs of recovery. An update on the status of the service instance recovery effort will be provided within 60 minutes.

Update

July 23, 9:39am EDT

July 23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data gaps, data latency, and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. Our telemetry shows that ingestion errors have almost returned back to normal and most customers should be seeing signs of recovery at this time. We are continuing to address what remains to be a small number of errors occurring on some ingestion service instances. An update will be provided within 60 minutes, or as soon as mitigation has been confirmed.

Update

July 23, 9:39am EDT

July 23, 9:39am EDT

Starting at 07:15 UTC on 23 Jul 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may be experiencing intermittent data latency and incorrect alert activation.

Current Status: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform. We have rolled back the unhealthy deployment to prevent further impact and restore parts of the ingestion layer. Our telemetry shows that ingestion errors have almost returned back to normal and most customers should be seeing signs of recovery at this time. We are continuing to address what remains to be a small number of errors occurring on some ingestion service instances. An update will be provided within 60 minutes, or as soon as mitigation has been confirmed.

Resolved

July 23, 9:39am EDT

July 23, 9:39am EDT

Summary of Impact: Between 07:15 UTC on 23 July 2023 and 00:05 UTC on July 24 2023, a subset of customer using Application Insights workspaces enabled and Azure Monitor Storage Logs may have experienced intermittent data latency, and incorrect alert activation.


This incident is now mitigated. More information will be provided shortly.

Resolved

July 23, 9:39am EDT

July 23, 9:39am EDT

Summary of Impact: Between 07:15 UTC on 23 July 2023 and 00:05 UTC on Jul 24 2023, a subset of customer using Application Insights workspaces enabled and Azure Monitor Storage Logs may have experienced intermittent data gaps, data latency, and incorrect alert activation.

This incident is now mitigated. More information will be provided shortly.

Resolved

July 23, 9:39am EDT

July 23, 9:39am EDT

Summary of Impact: Between 07:15 UTC on 23 July 2023 and 00:05 UTC on Jul 24 2023, a subset of customers using Application Insights workspaces enabled and Azure Monitor Storage Logs may have experienced intermittent data gaps, data latency, and incorrect alert activation.


Preliminary Root Cause: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform.

 

Mitigation: We rolled back services to previous version to restore the health of the system. It took longer for full mitigation due to various cache layers that required clearing it out so that data can be ingested as expected.

 

Next Steps: We are continuing to investigate the underlying cause of this event to identify additional repairs to help prevent future occurrences for this class of issue. . Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.

Resolved

July 23, 9:40am EDT

July 23, 9:40am EDT

Summary of Impact: Between 07:15 UTC on 23 July 2023 and 00:05 UTC on July 24 2023, a subset of customer using Application Insights workspaces enabled and Azure Monitor Storage Logs may have experienced intermittent data latency, and incorrect alert activation.

 

Preliminary Root Cause: We identified a recent deployment that included a code regression, which caused connectivity issues between some services that make up the data ingestion layer on the platform.

 

Mitigation: We rolled back services to previous version to restore the health of the system. It took longer for full mitigation due to various cache layers that required clearing it out so that data can be ingested as expected.

 

Next Steps: We are continuing to investigate the underlying cause of this event to identify additional repairs to help prevent future occurrences for this class of issue. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.