Splunk On-Call Service Disruption: Outbound Notification Delays
Incident Report for VictorOps
Postmortem

Basic Timeline & Incident Overview:

Starting at approximately 6:18 AM (Mountain) on the morning of 02.26.21, the Splunk On-Call (SpOC) platform began experiencing delays in the outbound delivery of notifications.

These delays were directly related to a critical service disruption associated with one of our primary telecommunications providers.

Although the service disruption primarily affected outbound Phone and SMS notifications, it also temporarily delayed our delivery of Push notifications.

At approximately 7:40 AM (Mountain), actions taken by the SpOC Engineering team alleviated the Push notification delays and returned Push notification delivery to standard operational efficiency.

The proper delivery of SMS notifications resumed at approximately 8:30 AM (Mountain).

Phone notifications were queued and delivered with a delay of roughly 40 minutes until approximately 9:30 AM (Mountain).

The overall incident ran from approximately 6:18 AM to 9:30 AM (Mountain) on the morning of 02.26.21.

After internal testing and direct communication with the telecommunications provider in question, we deemed the incident resolved at 9:30 AM (Mountain) on the morning of 02.26.21.

. . .

As with any and all such incidents, the appropriate Splunk On-Call (SpOC) teams will conduct a collaborative and intensive Post Incident Review (PIR) process aimed at both preventative measures and improved responsiveness in addressing any future issues. If you have any immediate questions or concerns, please contact the Splunk On-Call (SpOC) Support Team at:

victorops-support@splunk.com

Once again, we sincerely apologize for any unintended inconvenience this incident may have caused.

Posted Feb 26, 2021 - 16:22 MST

Resolved
The issue has been resolved. All message delivery types (Phone, SMS, Push, and Email) and associated notifications are fully functional.

We will be providing additional updates on our status page upon the completion of our internal review.

If you have any immediate questions, please reach out to the Splunk On-Call Support team:

victorops-support@splunk.com

We sincerely apologize for any unintended inconvenience this issue may have caused.
Posted Feb 26, 2021 - 09:38 MST
Update
We are continuing to investigate this issue.
Posted Feb 26, 2021 - 09:34 MST
Update
SMS message delivery is improving significantly and is returning to fully functional.

Phone messaging is still delayed.

We are continuing to investigate the issue. Updates to follow.
Posted Feb 26, 2021 - 08:58 MST
Update
We are continuing to investigate this issue.
Posted Feb 26, 2021 - 08:56 MST
Update
As we continue to investigate this issue, we recommend enabling Push notifications in your Splunk On-Call personal paging policy settings. For guidance:

https://help.victorops.com/knowledge-base/paging-policy-setup/

The service disruption we are continuing to troubleshoot is primarily affecting Phone and SMS messaging.

Updates to follow.
Posted Feb 26, 2021 - 08:38 MST
Update
We are continuing to investigate this issue.
Posted Feb 26, 2021 - 08:30 MST
Update
Please be advised that message delivery and alerting associated with the Splunk On-Call platform are significantly delayed as a result of this issue.

As we continue to investigate and work this issue to ground, we recommend that you directly evaluate any monitoring solutions you may have integrated with Splunk On-Call to better determine the health and status of your associated systems, applications, and/or platform.

Updates to follow.
Posted Feb 26, 2021 - 07:55 MST
Update
The service-affecting issues that we are actively investigating appear to be associated with a service disruption being experienced by one of our communications partners, Twilio:

https://status.twilio.com/

We are continuing to pursue the investigation and resolution of this issue with the highest levels of urgency.

Updates to follow.
Posted Feb 26, 2021 - 07:46 MST
Update
We are continuing to investigate this issue.
Posted Feb 26, 2021 - 07:33 MST
Investigating
We are actively investigating a service-affecting issue with the message delivery and alerting associated with our platform. We are approaching this issue with the highest level of urgency, and the appropriate parties are actively engaged in troubleshooting the issue.

Updates to follow as soon as they are available.

If you have any immediate questions, please contact the Splunk On-Call Support team: victorops-support@splunk.com
Posted Feb 26, 2021 - 07:33 MST
This incident affected: Delivery Systems (Notifications - SMS, Notifications - Google Push, Notifications - Apple Push, Notifications - Phone).