Storage outage

Major incident
Affected: Regions (Global, Switzerland, GDPR safe countries), Core Services, Login Service, Management API, OpenID Connect / OAuth API, Support Services, Network Services, Customer Portal, Identity Brokering, Authentication API
2023-02-15 09:26 CET · 1 hour, 3 minutes

Updates

Post-mortem

Summary

On February 15th, between 8:31 UTC and 9:25 UTC, our on-call team was alerted to a total loss of storage connectivity.
During this time, impacted customers experienced a high rate of HTTP 404 or HTTP 429 status codes.

We sincerely apologize for any impact this has caused to you and your users. We are working closely with our storage vendor to prevent a recurrence of such an error.

Root Cause Analysis

Our storage vendor applied an update during routine maintenance using a rolling rollout strategy. This operation happens often and had never caused problems before. However, in this case, the minor update prevented all ZITADEL instances from connecting to the storage engine.

After an initial analysis by our on-call team, we declared a P1 incident with our storage vendor, who in turn started working with us to resolve the problem. To mitigate the problem, it was decided to roll back to the earlier version and to start an RCA on the storage vendor’s end. The RCA found an error in the storage engine that could only be triggered in rare cases. Nonetheless, our storage vendor decided to postpone the global rollout of the affected version and started working on a hot-fix.

Resolution

The situation was resolved by rolling back the storage engine's version and by creating a patched version that fixes the bug. Besides that, we are working closely with the storage vendor to improve test strategies.

Timeline

All times in UTC

08:21 - Storage update rollout
08:25 - Alerts notifying on-call team
08:28 - Declaring major outage, total loss of storage connectivity
08:58 - Declaring P1 incident with storage vendor
09:15 - Start rollback of storage version
09:20 - De-escalation, traffic starts flowing again with elevated error levels
09:30 - Case closed, error levels returned to normal
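
As a rough reading aid, the elapsed time between the timeline entries above can be worked out with a short script. This is only an illustrative sketch: the names TIMELINE, minutes_between and intervals are hypothetical and not part of any ZITADEL tooling, and the times are copied from the timeline as published.

```python
from datetime import datetime

# Timeline entries copied from the post-mortem (all times UTC, 2023-02-15).
# The structure and names here are illustrative assumptions.
TIMELINE = [
    ("08:21", "Storage update rollout"),
    ("08:25", "Alerts notifying on-call team"),
    ("08:28", "Declaring major outage"),
    ("08:58", "Declaring P1 incident with storage vendor"),
    ("09:15", "Start rollback of storage version"),
    ("09:20", "De-escalation"),
    ("09:30", "Case closed"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end (HH:MM, same day)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def intervals(timeline):
    """Yield (minutes since first event, minutes since previous event, label)."""
    first = timeline[0][0]
    prev = first
    for time, label in timeline:
        yield minutes_between(first, time), minutes_between(prev, time), label
        prev = time


for since_start, since_prev, label in intervals(TIMELINE):
    print(f"+{since_start:>3} min (step {since_prev:>2} min)  {label}")
```

Read this way, the first alert came 4 minutes after the rollout began, traffic started flowing again after 59 minutes, and the case was closed after 69 minutes.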

February 27, 2023 · 16:00 CET
Resolved

All errors returned to normal levels.

We are closing this incident and will provide a post-mortem in due course.

February 15, 2023 · 10:30 CET
De-escalate

We are closely monitoring the situation as it improves.

A post-mortem will be provided in due course.

February 15, 2023 · 10:20 CET
Investigating

We are currently investigating an issue with one of our vendors.

February 15, 2023 · 10:09 CET
Investigating

We are currently investigating the problem. ZITADEL is still unavailable.

February 15, 2023 · 10:04 CET
Issue

All ZITADEL services are currently unavailable.

February 15, 2023 · 09:28 CET
