Starting August 30th, 2023, for public status pages that allow SMS subscriptions, StatusCast will require that a valid email address be confirmed before a person can fully establish a new SMS subscription.
This change in the subscription workflow is intended to help prevent malicious parties from attempting SMS fraud, which has become a growing concern for many SaaS companies dealing with mass notifications. We here at StatusCast have witnessed this trend: in the past 6 months, the quantity of malicious traffic attempting to commit SMS fraud has increased drastically. While we have continued to implement industry best practices to safeguard against this sort of activity, real user confirmation is ultimately the most effective way to prevent such unwanted attention.
At this time services have been restored and should be operating as normal. If you continue to have any issues, please contact email@example.com to open a ticket. We will follow up on this event with an RCA detailing what occurred and how we will handle this moving forward.
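For readers who want to picture the confirmation gate, here is a minimal sketch of an email-confirmed SMS signup. It is an illustration only, not StatusCast's implementation: the `SubscriptionRegistry` class, its method names, and the token scheme are all assumptions made for this example.

```python
import secrets
from dataclasses import dataclass


@dataclass
class PendingSubscription:
    """A signup that is waiting for its email address to be confirmed."""
    email: str
    phone: str


class SubscriptionRegistry:
    """Hypothetical registry: SMS becomes active only after email confirmation."""

    def __init__(self) -> None:
        self._pending: dict[str, PendingSubscription] = {}
        self.active_sms: set[str] = set()

    def request_sms(self, email: str, phone: str) -> str:
        # Nothing is sent by SMS yet. A real service would email this
        # one-time token to the address instead of returning it.
        token = secrets.token_urlsafe(16)
        self._pending[token] = PendingSubscription(email, phone)
        return token

    def confirm_email(self, token: str) -> bool:
        # Tokens are single-use; an unknown or reused token activates nothing.
        pending = self._pending.pop(token, None)
        if pending is None:
            return False
        self.active_sms.add(pending.phone)
        return True
```

The point of the design is that an automated signup with an unreachable mailbox never reaches the step that would trigger SMS traffic.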
Describe the full incident details below:
On July 21st, 2023 at approximately 9:40 EDT, StatusCast’s engineers received alerts that the application was displaying an HTTP Error 500.30 when users attempted to access any *.status.page status page or admin portal. During this period, any notifications in progress or from scheduled maintenance would have continued to work as expected. Additionally, anyone using StatusCast’s legacy (*.statuscast.com) version of the application was not impacted during this period.
Describe action taken by StatusCast to mitigate issue:
Engineers immediately began to investigate the cause of the problem. StatusCast’s service provider, Azure, indicated that it was undergoing maintenance in the region in which StatusCast is primarily hosted (US East). Engineers contacted Microsoft to confirm this and to get additional insight, as the issue was also impacting the failover region (US West). During this process StatusCast deployed an additional instance to another Azure region, which experienced the same errors as both East and West.
The root cause of the problem was ultimately related to Azure’s maintenance and the availability of one of StatusCast’s databases used for managing connections to the application. Leading up to the outage, StatusCast’s operations team was preparing for its monthly penetration test, which regularly involves a fresh test database for a reserved test application. The updated connection was not properly propagated to all of StatusCast’s application servers and traffic manager, which unfortunately caused the subsequent errors.
Once the issue had been identified, StatusCast’s engineers were quickly able to restore service. StatusCast’s development team will be performing an emergency patch today (July 21st, 2023) to ensure that an issue like this can be caught without the application becoming unavailable.
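One way to catch this class of problem early is to compare what each node reports against the intended configuration before traffic depends on it. The sketch below is an assumption-laden illustration, not the patch described above: `fingerprint`, `find_stale_nodes`, and the node names are hypothetical, and it assumes each node can report a hash of the connection setting it currently holds.

```python
import hashlib


def fingerprint(value: str) -> str:
    """Hash a connection setting so nodes can report it without exposing it."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def find_stale_nodes(expected: str, reported: dict[str, str]) -> list[str]:
    """Return the nodes whose reported fingerprint differs from the expected one.

    `reported` maps a node name to the fingerprint that node says it is using.
    """
    want = fingerprint(expected)
    return sorted(node for node, got in reported.items() if got != want)
```

A non-empty result would mean the update has not reached every application server or the traffic manager, which is the condition described in the root cause above.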
StatusCast engineers have detected a possible performance-impacting event affecting status pages and the admin application. This event is not impacting notification processing. We apologize for this inconvenience and will provide an update shortly.
This event has been resolved.
The StatusCast team will be performing maintenance on February 17 at 6:00am EST; the estimated duration is 60. We do not expect any impact to your service, but in some cases there may be a brief interruption.
StatusCast's engineers have been alerted that some users, while attempting to access their status page and/or administrative portal, experienced significant delays in response time, and in some cases the application would not load at all. We are working to diagnose and resolve this issue ASAP and will provide updates as available.
Engineers have confirmed that this morning StatusCast experienced an unexpected, significant spike in traffic that affected response time for many users. In some cases, occasional timeouts were also reported when loading status pages. We have scaled our servers temporarily while we investigate the root cause of the spike, and we will be putting mitigations in place for long-term scalability.
At this time all services should be operating as expected, and we will follow up with a detailed RCA once our investigation is concluded.
On November 17th at approximately 9:45AM EST, StatusCast experienced a tremendous spike in inbound traffic (over 3x our historical max), which caused the primary caching mechanism for the application to become overloaded. This caused many connection requests to the application to experience either major delays in page loads or complete timeouts. During this time StatusCast’s own status page was also affected, preventing customers from checking the status of the service and the actions being taken.
Engineers mitigated the issue by scaling out the service and performing an emergency flush of the caching system in order to restore service while investigating the source of the traffic spike.
Once the system had been fully restored engineers continued their investigation into the traffic spike and determined that it was not malicious in nature. The engineering and development teams have spent the last 24 hours making and preparing the following changes to StatusCast’s service offering:
Permanently scaled up the resource baseline for all of StatusCast’s servers
Added additional servers into the pool of servers used to maintain the application
Revisited auto-scaling rules around resource baselines for auto-mitigation purposes
Planned caching updates for StatusCast’s December release that will aid in caching resource constraints
Migrated StatusCast’s own page to an environment that is totally separate from the production space that clients are deployed to
StatusCast’s team will continue to monitor the health of its service offerings and to analyze traffic patterns in order to gauge whether additional changes to its infrastructure are necessary.
StatusCast engineers were alerted to an issue affecting some users' access to the status.page and admin versions of the application, resulting in slow load times or page timeouts.
At this point service should be operating as expected for all users; however, if you continue to experience any issues, please contact firstname.lastname@example.org.
StatusCast's engineers have been alerted that some users are experiencing latency when attempting to access their status page as well as their administrative portal. At this time this latency does not appear to be impacting all users.
Engineers are working to resolve this now and we will post an update shortly when more information is available.
At this time all services should be operating as expected. If you continue to experience any further issues please reach out to StatusCast support at email@example.com.
We will follow up with additional information related to the root cause of this latency at a later time.
StatusCast’s engineers were alerted to an issue affecting some customers accessing the status.page version of the application. Engineers confirmed that a certificate renewal was not properly propagated to all servers. This did not impact customers utilizing the statuscast.com domain or those utilizing a custom domain name.
Once we were made aware of this issue, the updated certificate was pushed out directly to all instances. At this point service should be operating as expected for all users; however, if you continue to experience any issues, please contact firstname.lastname@example.org.
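As an illustration of how a partially propagated certificate could be detected, the sketch below fetches the certificate each instance presents and compares its SHA-256 fingerprint with the expected one. This is a hedged example rather than StatusCast's tooling: `fetch_cert_fingerprint` and `hosts_with_old_cert` are hypothetical names, and it assumes each instance is reachable directly at its own address. Only standard-library calls are used (`ssl.get_server_certificate`, `ssl.PEM_cert_to_DER_cert`, `hashlib.sha256`).

```python
import hashlib
import ssl
from typing import Callable, Iterable


def fetch_cert_fingerprint(host: str, port: int = 443) -> str:
    """Return the SHA-256 fingerprint of the certificate served by host:port."""
    pem = ssl.get_server_certificate((host, port))
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()


def hosts_with_old_cert(
    hosts: Iterable[str],
    expected_fingerprint: str,
    fetch: Callable[[str], str] = fetch_cert_fingerprint,
) -> list[str]:
    """List the hosts whose served certificate does not match the expected one.

    `fetch` is injectable so the comparison can be exercised without a network.
    """
    return [host for host in hosts if fetch(host) != expected_fingerprint]
```

Running such a check against every instance after a renewal, rather than only against the public hostname, is what would surface a server that was missed.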
StatusCast's engineers were alerted that scheduled maintenance events created from StatusCast's legacy application ("V2") were not properly auto-closing after their estimated duration had been reached. After an initial investigation, engineers confirmed the cause in the responsible service and applied a patch to correct the error. Any maintenance that was overdue for closure should have been resolved, and StatusCast's engineers will continue to monitor the legacy process to ensure no other issues occur.