Redis Cache - High CPU usage - Applying mitigation

Incident

October 08, 2:55pm EDT

Status: closed

Start: September 20, 8:49pm EDT

End: October 08, 2:11pm EDT

Duration: 17 days 17 hours 21 minutes

Affected Components:

Cloud Providers Azure

Update

October 07, 2:47pm EDT

Impact statement: You have been identified as a customer experiencing timeouts and high server load on smaller caches (C0/C1/C2), beginning as early as 11 Aug 2023.

Current status: Investigation revealed the cause to be a change in behavior of one of the Azure security monitoring services agents used by Azure Cache for Redis. The monitoring agent subscribes to the event log and uses a scheduled backoff to reset its subscription when no events are received. In some cases the scheduled backoff does not work as expected, which can increase the frequency of subscription resets and significantly affect CPU usage on smaller caches. We are currently rolling out the hotfix to the impacted regions, and the rollout is 80% complete. We initially estimated completion by 13 Oct 2023; current progress indicates completion by 11 Oct 2023. To prevent impact until the fix is rolled out, we are applying a short-term mitigation to all caches that reduces the log file size. The next update will be provided by 19:00 UTC on 8 Oct 2023, or as events warrant, to allow time for the short-term mitigation to progress.

Update

October 07, 2:54pm EDT

Impact statement: You have been identified as a customer experiencing timeouts and high server load on smaller caches (C0/C1/C2), beginning as early as 11 Aug 2023.

Current status: Investigation revealed the cause to be a change in behavior of one of the Azure security monitoring services agents used by Azure Cache for Redis. The monitoring agent subscribes to the event log and uses a scheduled backoff to reset its subscription when no events are received. In some cases the scheduled backoff does not work as expected, which can increase the frequency of subscription resets and significantly affect CPU usage on smaller caches. We are currently rolling out the hotfix to the impacted regions, and the rollout is 80% complete. We initially estimated completion by 11 Oct 2023; current progress indicates completion by 09 Oct 2023. To prevent impact until the fix is rolled out, we are applying a short-term mitigation to all caches that reduces the log file size. The next update will be provided by 19:00 UTC on 8 Oct 2023, or as events warrant, to allow time for the short-term mitigation to progress.

Resolved

October 08, 2:11pm EDT

Summary of Impact: From as early as 11 Aug 2023 until 18:00 UTC on 8 Oct 2023, you were identified as a customer who may have experienced timeouts and high server load on smaller caches (C0/C1/C2).

Current Status: This issue is now mitigated. More information will be provided shortly.

Resolved

October 08, 2:55pm EDT

What happened?

From as early as 11 Aug 2023 until 18:00 UTC on 8 Oct 2023, you were identified as a customer who may have experienced timeouts and high server load on smaller caches (C0/C1/C2).

What do we know so far?

We identified a change in behavior of one of the Azure security monitoring services agents used by Azure Cache for Redis. The monitoring agent subscribes to the event log and uses a scheduled backoff to reset its subscription when no events are received. In some cases, the scheduled backoff does not work as expected, which can increase the frequency of subscription resets and significantly affect CPU usage on smaller caches.
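
To illustrate why a faulty backoff matters, here is a minimal Python sketch of a backoff schedule for subscription resets. It is not the monitoring agent's actual code, which is not public in this report: the class name ResetBackoff, the helper resets_within, and all timing values (5-second base, doubling factor, 300-second cap) are assumptions chosen only to show the effect. The sketch counts how many resets occur during one hour with no events, first with a delay that grows and then with a delay that never grows.

```python
from dataclasses import dataclass


@dataclass
class ResetBackoff:
    """Hypothetical backoff schedule for re-subscribing to an event log."""

    base_seconds: float = 5.0    # assumed first wait before a reset
    factor: float = 2.0          # assumed growth per consecutive idle reset
    max_seconds: float = 300.0   # assumed cap on the wait
    idle_resets: int = 0         # consecutive resets with no events seen

    def next_delay(self) -> float:
        """Return the wait before the next reset and advance the schedule."""
        delay = min(self.base_seconds * self.factor ** self.idle_resets,
                    self.max_seconds)
        self.idle_resets += 1
        return delay

    def on_event(self) -> None:
        """An event arrived, so the schedule starts over from the base wait."""
        self.idle_resets = 0


def resets_within(window_seconds: float, backoff: ResetBackoff) -> int:
    """Count the resets that fit in a window during which no events arrive."""
    elapsed, count = 0.0, 0
    while True:
        elapsed += backoff.next_delay()
        if elapsed > window_seconds:
            return count
        count += 1


# One hour of silence on the event log, under the assumed parameters.
healthy = resets_within(3600, ResetBackoff())
# Simulated fault: factor=1.0 means the delay never grows past the base wait.
broken = resets_within(3600, ResetBackoff(factor=1.0))
print(f"resets per idle hour: healthy={healthy}, broken={broken}")
```

Under these assumed parameters the growing schedule performs 16 resets in the hour, while the flat schedule performs 720. Each reset costs some CPU, so on a small cache with limited compute the second pattern could plausibly consume a noticeable share of the available capacity.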

How did we respond?

To address this issue, engineers took manual action on the underlying virtual machines of the impacted caches. After further monitoring, internal telemetry confirmed that the issue was mitigated and full service functionality was restored.

What happens next?

We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.