Major incident: Outage ZITADEL Cloud

Summary

On November 18, 2022, between 10:26 and 11:47 UTC, we detected an increased error rate in our Customer Portal, which caused a secondary incident in the ZITADEL Core Services.
Impacted customers experienced higher-than-usual response times or timeouts.

We sincerely apologize for any impact this has caused to you and your users.

Root Cause Analysis

The primary incident was triggered by port exhaustion in one of our NAT Gateways in the eu-west6 (Switzerland) region, which led to dropped packets. After we allocated additional IPs, traffic started to flow again. The volume of backlogged traffic triggered a correct scaling operation of the ZITADEL Core Services in our eu-west6 region. During scaling, however, a software bug in our storage layer led to a locked query, which caused head-of-line blocking for all later storage queries. After a quick investigation, the lock was resolved manually.
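To illustrate why allocating additional IPs relieves port exhaustion, here is a minimal back-of-the-envelope sketch. It assumes that each NAT IP offers the ports 1024 to 65535 for outbound connections to a single destination; the function names, the headroom factor, and the example numbers are hypothetical and not taken from our production configuration.

```python
import math

# Assumption: ports 0-1023 are reserved, leaving 64,512 usable source ports
# per NAT IP (per protocol, per destination IP and port).
USABLE_PORTS_PER_IP = 65536 - 1024


def max_concurrent_connections(nat_ips: int,
                               ports_per_ip: int = USABLE_PORTS_PER_IP) -> int:
    """Upper bound on simultaneous connections to one destination."""
    return nat_ips * ports_per_ip


def nat_ips_needed(peak_connections: int,
                   headroom: float = 0.3,
                   ports_per_ip: int = USABLE_PORTS_PER_IP) -> int:
    """Number of NAT IPs needed for a peak load plus a safety margin."""
    required_ports = peak_connections * (1 + headroom)
    return max(1, math.ceil(required_ports / ports_per_ip))


if __name__ == "__main__":
    # Hypothetical peak of 100,000 concurrent outbound connections.
    print(max_concurrent_connections(1))
    print(nat_ips_needed(100_000))
```

Once the number of concurrent outbound connections exceeds this bound, new connections cannot obtain a source port and their packets are dropped, which matches the behavior described above.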

Resolution

To prevent this issue from happening again, a storage fix was implemented and tested immediately.
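This report does not describe the storage fix in detail. As a generic illustration of head-of-line blocking only, and not a description of the actual fix, the following sketch models queries served in order on a single connection. The function finish_times, its timeout parameter, and the durations are hypothetical. It shows that one query that never completes stalls everything behind it, while a per-query deadline lets later queries proceed.

```python
import math
from typing import List, Optional


def finish_times(durations: List[float],
                 timeout: Optional[float] = None) -> List[Optional[float]]:
    """Simulate queries executed strictly in order on one connection.

    Returns the completion time of each query, or None if the query
    never completes (it is stuck, cancelled, or blocked behind a stuck one).
    """
    clock = 0.0
    blocked = False
    results: List[Optional[float]] = []
    for duration in durations:
        if blocked:
            # A stuck query ahead in line holds the connection forever.
            results.append(None)
        elif timeout is not None and duration > timeout:
            # The query is cancelled at its deadline; the line moves on.
            clock += timeout
            results.append(None)
        elif math.isinf(duration):
            # No deadline configured: the stuck query blocks the line.
            blocked = True
            results.append(None)
        else:
            clock += duration
            results.append(clock)
    return results


if __name__ == "__main__":
    queries = [0.5, math.inf, 0.5]  # the second query is locked forever
    print(finish_times(queries))               # no deadline
    print(finish_times(queries, timeout=2.0))  # 2-second deadline per query
```

Without a deadline, only the first query completes. With the deadline, the locked query is cancelled after 2 seconds and the third query completes at 3.0 seconds.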

Timeline

All times in UTC

10:26 - Early alerts fired for the Customer Portal

10:31 - Start investigation

10:45 - Initial analysis hints at a problem with outgoing traffic

10:50 - Engineering allocates more resources to improve outbound traffic

10:58 - Backlogged traffic triggers correct autoscaling for ZITADEL Core Services in eu-west6

11:04 - Sudden increase in latency of ZITADEL Core Services triggers secondary alert

11:07 - Public status page updated with a minor incident

11:10 - Escalated public status to major incident

11:22 - Analysis highlights a deadlocked storage connection

11:34 - Rollout of the mitigation

11:47 - De-escalation of alert to minor and transition to monitoring phase

12:00 - Declaring end of incident

November 23, 2022 · 22:05 CET

Resolved

All errors returned to normal levels.

We are closing this incident and will provide a post-mortem in due course.

November 18, 2022 · 13:00 CET

De-escalate

We are closely monitoring the situation as it improves.

A post-mortem will be provided in due course.

November 18, 2022 · 12:47 CET

Monitoring

We are currently rolling out a fix to mitigate a connection issue to our storage layer.

Error rates are improving across the board.

November 18, 2022 · 12:36 CET

Investigating

We are currently investigating a cascading error with elevated error rates across our services.

November 18, 2022 · 12:23 CET

Issue

We’re currently experiencing degraded performance and elevated error levels in our Customer Portal. Customer instances are not affected.

Our team is working to restore normal performance levels. We apologize for any inconvenience.

November 18, 2022 · 12:07 CET
