First and foremost, we sincerely apologize for any inconvenience this issue may have caused.
At approximately 11:00 a.m. (MDT), a back-end deployment made the VictorOps portal unreachable from the web and mobile applications for approximately 20 minutes. Alert ingestion, incident processing, and notification delivery were NOT affected during this time.
For the duration of the disruption, users were unable to connect to the VictorOps platform. As a result, any attempts by users to acknowledge or resolve incidents directly from a notification (push, phone, SMS) would have failed. Additionally, the VictorOps-Slack integration was inoperable during the same period.
Any events that occurred during this disruption will not be reflected in the main timeline, the incident details view, or reporting for this specific 20-minute period.
10:58 a.m. - Restarted back-end servers following a deploy (standard procedure)
11:00 a.m. - Monitoring tools detected failures to load the timeline on web and mobile
11:01 a.m. - Reverted the deploy, which did not immediately resolve the issue
11:07 a.m. - StatusPage incident created
11:11 a.m. - Identified problem in configuration
11:20 a.m. - Portal access was restored
11:26 a.m. - StatusPage incident resolved
Total duration of customer impact: ~20 minutes
Time to acknowledge the issue (internal): Less than 1 minute
In addition to general improvements in the way we handle application configuration changes, we intend to verify configuration changes separately before deployment. We are also adding monitoring around our configuration management system and all of its touch points with back-end applications.
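As an illustration of what verifying a configuration change separately before deployment could look like, here is a minimal sketch. It is not the VictorOps implementation: the JSON format, the key names (db_host, db_port, timeline_service_url), and the validate_config function are all hypothetical and chosen only for the example.

```python
import json

# Hypothetical schema: required keys and the type each value must have.
REQUIRED_KEYS = {
    "db_host": str,
    "db_port": int,
    "timeline_service_url": str,
}


def validate_config(raw_text):
    """Return a list of problems; an empty list means the config passed."""
    try:
        config = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    if not isinstance(config, dict):
        return ["top-level value must be an object"]

    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif type(config[key]) is not expected_type:
            # Exact type check so that, for example, true is not accepted as an int.
            problems.append(
                f"wrong type for {key}: expected {expected_type.__name__}, "
                f"got {type(config[key]).__name__}"
            )
    return problems


if __name__ == "__main__":
    candidate = '{"db_host": "db.internal", "db_port": "5432"}'
    for problem in validate_config(candidate):
        print(problem)
```

Run as a separate step ahead of the deploy, a check like this would stop the rollout whenever the returned list is non-empty, so a malformed configuration is caught before any back-end server restarts with it.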