Synergis SaaS - System degradation

Incident Report for Genetec Cloud Products

Postmortem

We apologize for the downtime caused by this incident. We understand the impact it had on your operations, and we assure you that appropriate actions have been taken to continuously improve our service.

Customers affected by this service degradation may have experienced the following symptoms in their SaaS systems:

Access Control events being slow to appear in Security Center
Manual operations being slow to execute, or not executing at all (such as manually unlocking doors)
Synchronization of Access Control units failing, or taking longer than usual
Access Manager roles being in a warning state

The following is a detailed summary of the incident, our investigation, and the resolution:

Friday April 21st 2023

8:23am EST: Our team received an automated alert from our monitoring infrastructure. There was a high rate of exceptions, causing delays in the system.
Upon investigation, we found that the delays were caused by errors reading commands from Synergis units and Access Manager roles, which send messages to our infrastructure through our Event Hub Azure resource.
We restarted our backend App Services. This did not resolve the issue, but it gave us a better understanding of the situation.
10:00am EST: We followed our disaster recovery procedure by failing over to a backup Event Hub Azure resource.
10:50am EST: Outage was declared and customers were notified via the Genetec status page.
Throughout the day, we monitored the system. As the Synergis units and Access Manager roles began writing to the backup Event Hub Azure resource, the number of exceptions decreased over time.
4:19pm EST: Incident closed. Most Synergis units and Access Manager roles were writing to the backup Event Hub Azure resource. No delays or exceptions were observed.
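
As an illustration of why the exceptions decreased gradually rather than all at once, the sketch below models writers that pick up a new target on their own schedule. This report does not describe how units and roles actually switch to the backup resource, so the refresh-based mechanism and every name here (`Writer`, `refresh_every`, `writers_on_backup`) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Writer:
    """Hypothetical writer (a unit or role) that periodically re-reads its target."""
    name: str
    refresh_every: int       # ticks between re-reads of the configured target
    target: str = "primary"  # where this writer currently sends messages

    def tick(self, now: int, configured_target: str) -> None:
        # The writer only notices a configuration change on its refresh tick.
        if now % self.refresh_every == 0:
            self.target = configured_target


def writers_on_backup(writers: List[Writer], ticks: int) -> List[int]:
    """Return, for each tick, how many writers are sending to the backup."""
    counts = []
    for now in range(1, ticks + 1):
        for writer in writers:
            writer.tick(now, "backup")  # failover already done: target is "backup"
        counts.append(sum(w.target == "backup" for w in writers))
    return counts


if __name__ == "__main__":
    fleet = [Writer("unit-a", 1), Writer("unit-b", 2), Writer("role-c", 4)]
    print(writers_on_backup(fleet, 4))
```

Under these assumptions, the number of writers still using the original resource shrinks step by step, which would be consistent with a slowly falling exception rate after a failover.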

Saturday April 22nd 2023 and Sunday April 23rd 2023

No disruptions were observed within the service.

Monday April 24th 2023

8:18am EST: Our team received an automated alert from our monitoring infrastructure. There was a high rate of exceptions, causing delays in the system similar to Friday’s alert.
For this incident, we observed that the exceptions and delays involved not only our usage of the Event Hub Azure resource, but also the Service Bus Azure resource.
We identified an issue in our production code where such exceptions could cause delays in our processing of messages coming from the Event Hub, and these delays would get worse over time.
11:00am EST: A fix was pushed to production.
After monitoring, we confirmed that the fix resolved the delays we were seeing in our system. However, exceptions were still occurring.
12:17pm EST: The issue was escalated to Microsoft. We have continued working with Microsoft to investigate the issue.
8:39pm EST: Root cause of the incident was identified: our backend App Service was consuming too many network sockets when managing connections with our Event Hub and Service Bus Azure resources. This issue was not introduced by any recent code changes; it surfaced only now because of the current scale of our system. Our monitoring tools did not have access to the metrics that would have allowed us to proactively prevent this issue. As a result, the necessary metrics, as well as additional tools, have been implemented and will be reviewed periodically to ensure optimal service monitoring.
9:19pm EST: Service was fully operational.
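
To illustrate the class of problem named in the root cause, the sketch below contrasts two ways of managing connections. This report does not state how the sockets were being consumed or what the fix changed, so this is only one common way socket usage can grow with load. `FakeConnection` is a hypothetical stand-in that counts sockets; it is not an Azure SDK type.

```python
from typing import Iterable


class FakeConnection:
    """Hypothetical stand-in for a broker connection; each instance holds one socket."""
    open_sockets = 0  # class-wide count of sockets currently held

    def __init__(self) -> None:
        FakeConnection.open_sockets += 1
        self.closed = False

    def send(self, message: str) -> None:
        if self.closed:
            raise RuntimeError("connection is closed")

    def close(self) -> None:
        if not self.closed:
            self.closed = True
            FakeConnection.open_sockets -= 1


def send_with_connection_per_message(messages: Iterable[str]) -> int:
    """Anti-pattern: a new connection per message, none of them closed."""
    lingering = []
    for message in messages:
        conn = FakeConnection()
        conn.send(message)
        lingering.append(conn)  # kept alive to mimic sockets that are not released
    return FakeConnection.open_sockets  # grows with the number of messages


def send_with_shared_connection(messages: Iterable[str]) -> int:
    """One long-lived connection reused for every message, closed at the end."""
    conn = FakeConnection()
    try:
        for message in messages:
            conn.send(message)
        return FakeConnection.open_sockets  # stays at 1 regardless of volume
    finally:
        conn.close()


if __name__ == "__main__":
    print(send_with_connection_per_message(["event"] * 100))
    FakeConnection.open_sockets = 0  # reset the counter between demonstrations
    print(send_with_shared_connection(["event"] * 100))
```

In the first function socket usage scales with message volume, which is the kind of behavior that may stay hidden at low scale and only appear as a system grows. In the second, usage is constant.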

Tuesday April 25th 2023

10:30am EST: A fix was pushed to production. We then determined that another fix would be required.
12:30pm EST: A final fix was pushed to production.
The service was monitored for the rest of the day. No exceptions or delays were observed while our system was in peak usage.
5:44pm EST: Incident was closed.

If you have any questions, please contact the Genetec Technical Assistance Center.

Synergis Product Team

Posted Apr 26, 2023 - 16:53 EDT

Resolved

This incident has been resolved.

Posted Apr 25, 2023 - 17:44 EDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Apr 25, 2023 - 12:30 EDT

Identified

The issue has been identified and we are actively working on a fix.

Posted Apr 25, 2023 - 08:08 EDT

Update

We are continuing to investigate this issue.

Posted Apr 24, 2023 - 08:52 EDT

Investigating

We are currently investigating an issue which might affect client systems. You might see:
- Access Control events being slow to appear in Security Center
- Manual operations being slow to execute, such as unlocking doors
- Synchronization of your Access Control units failing, or taking longer than usual
- Access Manager roles being in a warning state

Posted Apr 24, 2023 - 08:52 EDT

This incident affected: Synergis SaaS.