Timeouts experienced on status pages and admin portal

Incident
November 18, 3:58pm EST


Status: closed
Start: November 17, 9:47am EST
End: November 17, 11:00am EST
Duration: 1 hour 13 minutes
Affected Components: Status pages, Admin application
Update

November 17, 9:47am EST


StatusCast's engineers have been alerted that some users attempting to access their status page and/or administrative portal have experienced significant delays in response time, and in some cases the application would not load at all. We are working to diagnose and resolve this issue as quickly as possible and will provide updates as they become available.

Resolved

November 17, 11:00am EST


Engineers have confirmed that this morning StatusCast experienced an unexpected and significant spike in traffic that affected response times for many users. In some cases, occasional timeouts were reported when loading status pages as well. We have temporarily scaled up our servers while we investigate the root cause of the spike, and we will be making changes to ensure long-term scalability.


At this time all services should be operating as expected, and we will follow up with a detailed RCA once our investigation is concluded.

Root Cause

November 18, 3:58pm EST


On November 17th at approximately 9:45 AM EST, StatusCast experienced a major spike in inbound traffic (over 3x our historical maximum), which caused the application’s primary caching mechanism to become overloaded. As a result, many connection requests to the application experienced either major delays in page loads or complete timeouts. During this time StatusCast’s own status page was also affected, preventing customers from checking on the status of the service and the actions being taken.


Engineers mitigated the issue by scaling out the service and performing an emergency flush of the caching system in order to restore service while investigating the source of the traffic spike.


Once the system had been fully restored, engineers continued their investigation into the traffic spike and determined that it was not malicious in nature. The engineering and development teams have spent the last 24 hours making and preparing the following changes to StatusCast’s service offering:


  1. Permanently scaled up the resource baseline for all of StatusCast’s servers

  2. Added additional servers to the pool used to maintain the application

  3. Revisited auto-scaling rules around resource baselines so that similar spikes can be mitigated automatically

  4. Planned caching updates for StatusCast’s December release that will relieve caching resource constraints

  5. Migrated StatusCast’s own status page to an environment that is completely separate from the production environment in which clients are deployed.


StatusCast’s team will continue to monitor the health of its service offerings and analyze traffic patterns in order to gauge whether additional changes to its infrastructure are necessary.