On February 15th, between 08:31 UTC and 09:25 UTC, our on-call team was alerted to a total loss of storage connectivity.
During this time, impacted customers experienced a high rate of HTTP 404 or HTTP 429 status codes.
We sincerely apologize for any impact this may have caused to you and your users. We are working closely with our storage vendor to prevent a recurrence of such an error.
Root Cause Analysis
Our storage vendor applied an update during routine maintenance using a rolling rollout strategy. This operation happens often and had never caused problems before. However, in this case the minor update prevented all ZITADEL instances from connecting to the storage engine.
After an initial analysis by our on-call team, we declared a P1 incident with our storage vendor, who in turn started working with us to resolve the problem. To mitigate the problem, it was decided to roll back to the earlier version and to start an RCA on the storage vendor’s end. The RCA found an error in the storage engine that could only be triggered in rare cases. Nonetheless, our storage vendor decided to postpone the global rollout of the affected version and started working on a hotfix.
The situation was resolved by rolling back the storage engine's version and by creating a patched version that fixes the bug. In addition, we are working closely with the storage vendor to improve test strategies.
All times in UTC
08:21 - Storage update rollout
08:25 - Alerts notifying on-call team
08:28 - Declaring major outage, total loss of storage connectivity
08:58 - Declaring P1 incident with storage vendor
09:15 - Start rollback of storage version
09:20 - De-escalation, traffic starts flowing again with elevated error levels
09:30 - Case closed, error levels returned to normal
All errors returned to normal levels.
We are closing this incident and will provide a post-mortem in due course.
We are closely monitoring the situation as it improves.
A post-mortem will be provided in due course.
We are currently investigating an issue with one of our vendors.
We are currently investigating the problem. ZITADEL is still unavailable.
All ZITADEL services are currently unavailable.