VictorOps Portal Unavailable on Web and Mobile

Incident Report for Splunk On Call

Postmortem

First and foremost, we sincerely apologize for any inconvenience this issue may have caused.

Summary:

Around 11:00 AM (MDT), a back-end deployment rendered the VictorOps portal unreachable via the web portal and mobile applications for approximately 20 minutes. Alert ingestion, incident processing, and notification delivery were NOT adversely affected during this period.

Immediate temporary effects:

For the duration of the disruption, users were unable to connect to the VictorOps platform. As a result, any attempts by users to acknowledge or resolve incidents directly from a notification (push, phone, SMS) would have failed. Additionally, the VictorOps-Slack integration was inoperable during the same period.

Long term effects:

Any events that occurred during this disruption will not be reflected in the main timeline, incident details view, or reporting for this specific 20-minute period.

Timeline:

10:58 a.m. - Restarted back-end servers following a deploy (standard procedure)

11:00 a.m. - Monitoring tools detected failures to load the timeline on web and mobile

11:01 a.m. - Reverted the deploy, which did not immediately resolve the issue

11:07 a.m. - StatusPage incident created

11:11 a.m. - Identified problem in configuration

11:20 a.m. - Portal access was restored

11:26 a.m. - StatusPage incident resolved

Total duration of customer impact: ~20 minutes

Time to acknowledge the issue (internal): Less than 1 minute
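The duration figures can be checked against the timeline with a little arithmetic. The sketch below is illustrative only: the dictionary keys and the helper name `minutes_between` are made up for this example, and the timestamps are copied from the timeline entries (local time, MDT).

```python
from datetime import datetime

# Timestamps copied from the timeline (local time, MDT).
# The key names are hypothetical labels chosen for this example.
TIMELINE = {
    "servers_restarted": "10:58",
    "failures_detected": "11:00",
    "statuspage_created": "11:07",
    "config_problem_identified": "11:11",
    "portal_restored": "11:20",
    "statuspage_resolved": "11:26",
}


def minutes_between(start, end):
    """Return the whole minutes elapsed between two timeline entries."""
    fmt = "%H:%M"
    delta = datetime.strptime(TIMELINE[end], fmt) - datetime.strptime(TIMELINE[start], fmt)
    return int(delta.total_seconds() // 60)


# Customer impact: from first detected failure to restored portal access.
impact_minutes = minutes_between("failures_detected", "portal_restored")
# Delay between detection and the public StatusPage notice.
notice_delay_minutes = minutes_between("failures_detected", "statuspage_created")
```

Here `impact_minutes` evaluates to 20, matching the stated customer impact, and `notice_delay_minutes` evaluates to 7.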

Improvements & Countermeasures:

In addition to improving the way we handle application configuration changes in general, we intend to verify configuration changes separately before deployment. We are also adding monitoring around our configuration management system and all touch points it has with back-end applications.
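As a rough illustration of what verifying a configuration change separately before deployment could look like, the sketch below checks a JSON configuration for required keys and value types before any server is restarted. It is not a description of the actual VictorOps tooling: the JSON format, the key names in `REQUIRED_KEYS`, and the function `validate_config` are all assumptions made for this example.

```python
import json

# Hypothetical required settings and their expected types. The real
# configuration schema is not described in this report.
REQUIRED_KEYS = {
    "portal_host": str,
    "portal_port": int,
    "session_timeout_seconds": int,
}


def validate_config(raw):
    """Return a list of problems found in a JSON config string.

    An empty list means the configuration passed every check.
    """
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        return ["not valid JSON: " + exc.msg]
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]

    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            problems.append("missing key: " + key)
            continue
        value = config[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append("wrong type for " + key)
    return problems
```

A deploy script could call `validate_config` on the candidate configuration and stop before restarting anything if the returned list is not empty, so that a bad configuration is caught independently of the code change it ships with.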

Posted Jun 26, 2018 - 23:17 UTC

Resolved

We identified the source of the issue as a configuration change included in our most recent deployment. We rolled back the change and all functionality, platform-wide, has been restored.

If you have any questions or concerns, please contact our support team at support@victorops.com or submit the form on our Contact Support page (https://victorops.com/contact-support/).

We will release a more complete Post Incident Review as soon as we have completed a more thorough internal investigation.

Posted Jun 26, 2018 - 17:26 UTC

Investigating

The VictorOps portal is currently unavailable via web and mobile. We are ingesting alerts, processing incidents, and delivering notifications to users as normal. You can ack or resolve incidents by responding to phone call and SMS notifications.

We are actively working to resolve this issue as quickly as possible.

Use the Subscribe to Updates option or follow @VOSupport on Twitter for updates. If you have any immediate questions or concerns, our support team is standing by to respond. Email support@victorops.com or submit the form on our Contact Support page (https://victorops.com/contact-support/).

Posted Jun 26, 2018 - 17:07 UTC

This incident affected: Clients (Web Client (Portal), Android Client - Mobile App, iOS Client - Mobile App).