Status pages and admin portal unavailable July 21st 2023

Incident
July 21, 12:02pm EDT

Status: closed
Start: July 21, 9:40am EDT
End: July 21, 10:55am EDT
Duration: 1 hour 15 minutes
Affected Components:
Status pages, Admin application
Update

July 21, 9:40am EDT

At approximately 9:40am EDT, StatusCast’s engineers were alerted to application errors that were preventing users from accessing both their status pages and the administrative portal. The engineers have identified a potential issue with StatusCast’s service provider, Azure, and are currently working with Microsoft to diagnose and resolve it.

Resolved

July 21, 10:55am EDT

At this time, services have been restored and should be operating normally. If you continue to have any issues, please contact support@statuscast.com to open a ticket. We will follow up on this event with an RCA detailing what occurred and how we will handle this moving forward.

Root Cause

July 21, 12:02pm EDT

Describe the full incident details below:

On July 21st, 2023, at approximately 9:40am EDT, StatusCast’s engineers received alerts that the application was returning an HTTP Error 500.30 on any attempt to access a *.status.page status page or the admin portal. During this period, any notifications in progress or triggered by scheduled maintenance would have continued to work as expected. Additionally, anyone using StatusCast’s legacy (*.statuscast.com) version of the application was not impacted.

Describe action taken by StatusCast to mitigate issue:

Engineers immediately began to investigate the cause of the problem. StatusCast’s service provider, Azure, indicated that it was undergoing maintenance in the region where StatusCast is primarily hosted (US East). Engineers contacted Microsoft to confirm this and to get additional insight, as the issue was also impacting the failover region (US West). During this process, StatusCast deployed an additional instance to another Azure region, which experienced the same errors as both East and West.

The root cause of the problem was ultimately related to Azure’s maintenance and the availability of one of StatusCast’s databases used for managing connections to the application. Leading up to the outage, StatusCast’s operations team was preparing for its monthly penetration test, which regularly involves a fresh test database for a reserved test application. The updated connection was not properly propagated to all of StatusCast’s application servers and the traffic manager, which unfortunately caused the subsequent errors.
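
As an illustration of the failure mode described above, the following is a minimal sketch of a propagation check that compares the connection setting each target actually holds against the expected one. It is not StatusCast's tooling: the function names, the target names, and the way deployed settings are collected are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List


def fingerprint(connection_string: str) -> str:
    """Return a short SHA-256 digest so the raw secret is never logged or printed."""
    return hashlib.sha256(connection_string.encode("utf-8")).hexdigest()[:12]


def find_stale_targets(expected: str, deployed: Dict[str, str]) -> List[str]:
    """Return the sorted names of targets whose deployed setting differs from the expected one.

    `deployed` maps a hypothetical target name (application server, traffic
    manager, ...) to the connection string that target currently holds.
    """
    want = fingerprint(expected)
    return sorted(
        name for name, value in deployed.items() if fingerprint(value) != want
    )


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    expected = "Server=db-new;Database=connections"
    deployed = {
        "app-east": "Server=db-new;Database=connections",
        "app-west": "Server=db-old;Database=connections",
        "traffic-manager": "Server=db-old;Database=connections",
    }
    print(find_stale_targets(expected, deployed))
```

An empty result would indicate that every listed target holds the expected setting; any names returned are the targets to update before traffic is routed to them.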

Once the issue had been identified, StatusCast’s engineers were able to restore service quickly. StatusCast’s development team will be performing an emergency patch today (July 21st, 2023) to ensure that an issue like this can be caught without the application becoming unavailable.
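
The report does not describe the contents of the patch. One general way to catch this class of issue before it reaches users is a pre-switch gate that runs a connectivity check against every target and refuses to proceed if any check fails. The sketch below assumes a caller-supplied `check` function and hypothetical target names; it is an illustration of the idea, not the actual fix.

```python
from typing import Callable, List, Tuple


def run_checks(
    targets: List[str], check: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Run `check` against every target and split them into (healthy, failing).

    A check that raises is treated as a failure, so one broken target cannot
    abort the evaluation of the others.
    """
    healthy: List[str] = []
    failing: List[str] = []
    for target in targets:
        try:
            ok = bool(check(target))
        except Exception:
            ok = False
        (healthy if ok else failing).append(target)
    return healthy, failing


def safe_to_switch(targets: List[str], check: Callable[[str], bool]) -> bool:
    """Return True only if every target passes its check."""
    _, failing = run_checks(targets, check)
    return not failing


if __name__ == "__main__":
    # Hypothetical check: pretend only the West region cannot reach its database.
    def fake_check(target: str) -> bool:
        if target == "app-west":
            raise ConnectionError("database unreachable")
        return True

    targets = ["app-east", "app-west"]
    print(run_checks(targets, fake_check))
    print(safe_to_switch(targets, fake_check))
```

In practice `check` would be replaced by a real probe, such as opening a database connection or requesting a health endpoint, and the gate would run before a configuration change is made live.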