Outage ZITADEL Cloud
Updates
Summary
On November 18, 2022, between 10:26 and 11:47 UTC, we detected an increased error rate in our Customer Portal, which caused a secondary incident in the ZITADEL Core Services.
Impacted customers experienced higher-than-usual response times or timeouts.
We sincerely apologize for any impact this has caused to you and your users.
Root Cause Analysis
The primary incident was triggered when one of our NAT Gateways in the eu-west6 (Switzerland) region ran out of available ports, which led to dropped packets. After we allocated additional IPs, traffic started to flow again. The volume of backed-up traffic triggered a correct scaling operation of the ZITADEL Core Services in our eu-west6 region. During scaling, however, a software bug in our storage layer led to a locked query that caused head-of-line blocking for all subsequent storage queries. After a quick investigation, the lock was resolved manually.
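The blocking effect described above, where one locked query holds up every query queued behind it, can be illustrated with a small simulation. This is a minimal sketch and not ZITADEL's storage code: the function name `completion_times`, the FIFO single-connection model, the durations, and the `timeout` parameter are all assumptions chosen for illustration. The timeout is shown only as one generic way to bound the damage of a stuck query; it is not a description of the fix that was actually implemented.

```python
from typing import List, Optional


def completion_times(durations: List[float], timeout: Optional[float] = None) -> List[float]:
    """Serve queries one at a time in FIFO order and return each finish time.

    Hypothetical model: a single connection processes queries sequentially.
    If `timeout` is set, a query that runs longer is cancelled at the timeout.
    """
    clock = 0.0
    finished = []
    for duration in durations:
        # A cancelled query still occupies the connection until the timeout hits.
        served = duration if timeout is None else min(duration, timeout)
        clock += served
        finished.append(clock)
    return finished


# Made-up numbers: one stuck query (600 s) ahead of three normal ones (0.05 s each).
queue = [600.0, 0.05, 0.05, 0.05]

without_timeout = completion_times(queue)
with_timeout = completion_times(queue, timeout=2.0)

print("no timeout:  ", without_timeout)
print("2 s timeout: ", with_timeout)
```

In this model, the three fast queries cannot finish before the stuck one releases the connection, so their latency is dominated by a query they have nothing to do with. With a per-query limit, the worst-case wait for later queries is bounded by that limit.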
Resolution
To prevent this issue from happening again, a storage fix was implemented and tested immediately.
Timeline
All times in UTC
10:26 - Early alerts fired for the Customer Portal
10:31 - Start investigation
10:45 - Initial analysis hints at a problem with outgoing traffic
10:50 - Engineering allocates more resources to improve outbound traffic
10:58 - Backed-up traffic triggers correct autoscaling for ZITADEL Core Services in eu-west6
11:04 - Sudden increase in latency of ZITADEL Core Services triggers secondary alert
11:07 - Public status page updated with a minor incident
11:10 - Escalated public status to major incident
11:22 - Analysis highlights a deadlocked storage connection
11:34 - Rollout of the mitigation
11:47 - De-escalation of alert to minor and transition to monitoring phase
12:00 - Declaring end of incident
All errors returned to normal levels.
We are closing this incident and will provide a post-mortem in due course.
We are closely monitoring the situation as it improves.
A post-mortem will be provided in due course.
We are currently rolling out a fix to mitigate a connection issue to our storage layer.
Error rates are improving across the board.
We are currently investigating a cascading error with elevated error rates across our services.
We’re currently experiencing degraded performance and elevated error levels with our Customer Portal. Customer instances are not affected.
Our team is working to restore normal performance levels. We apologize for any inconvenience.