Investigating issues ZITADEL Cloud

Minor incident
Affected components: Regions, Global, Switzerland, GDPR safe countries, Core Services, Login Service, Management API, OpenID Connect / OAuth API, Support Services, Network Services, Customer Portal, Identity Brokering, Authentication API
2023-10-16 19:58 CET · 2 hours, 23 minutes

Updates

Post-mortem

Summary

On October 16th, between 17:21 UTC and 20:05 UTC, customers were affected by elevated error rates and reduced availability of our services. Our on-call team responded immediately after being alerted about abnormal storage CPU usage. The following timeline sums up the outage:

17:56 - 18:18 UTC ZITADEL Cloud was unavailable, response codes 404 and 403 were returned
18:18 - 19:36 UTC Elevated response times and error rate
19:36 - 20:03 UTC ZITADEL Cloud was shut down, response code 502 was returned
from 20:03 UTC ZITADEL Cloud was restarted and availability of all services was restored
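To put the phases above into perspective, the following minimal sketch computes how long each one lasted. It only uses the boundaries listed above; the phase labels and the helper names `minutes_between` and `phase_durations` are illustrative assumptions and not part of any ZITADEL tooling.

```python
from datetime import datetime

# Phase boundaries (UTC) copied from the list above.
# The labels are shorthand chosen for this sketch.
PHASES = [
    ("unavailable (404/403)", "17:56", "18:18"),
    ("elevated response times and error rate", "18:18", "19:36"),
    ("shut down (502)", "19:36", "20:03"),
]


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def phase_durations(phases):
    """Map each phase label to its duration in minutes."""
    return {label: minutes_between(start, end) for label, start, end in phases}


if __name__ == "__main__":
    durations = phase_durations(PHASES)
    for label, minutes in durations.items():
        print(f"{label}: {minutes} min")
    print(f"total: {sum(durations.values())} min")
```

With these boundaries the three phases last 22, 78, and 27 minutes, which adds up to 127 minutes.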

We apologize for the impact this caused. To mitigate such unexpected behavior in the future, we will review our operational procedures to further reduce the risk of changes to the production storage. This includes updating our controls so that an additional review layer for live storage changes is put in place.

Root Cause Analysis

After the storage update on October 14th, our development team was working on further optimizations of the storage layer. The optimizations included a change intended to increase performance for running ZITADEL instances. The change ignored a call to the storage during startup of ZITADEL, which increased CPU usage of the storage to its limits (starting at 17:20 UTC) and consequently increased the response times of the storage and of ZITADEL itself. After 17:58 UTC the error propagated to all services of ZITADEL Cloud.

After an initial analysis by our on-call team, our engineering team started an in-depth analysis and decided to shut down ZITADEL Cloud completely between 19:36 and 20:03 UTC. During this time every request received an HTTP 502 response code. The shutdown was required to lower the CPU usage of our storage and to revert the changes made beforehand. At 20:03 UTC ZITADEL was restarted and error rates started to recover.

Mitigation in the future

To mitigate such unexpected behavior in the future, we will add an additional review layer for live storage changes so that multiple people can intervene on fundamental changes.

Resolution

The situation was resolved by reverting prior changes made to the storage layer. After restarting our services, error rates recovered and all services were available.

Timeline

All times in UTC

17:21 - Alerts notify the on-call team
17:58 - Declaration of service interruption and elevated error rates
18:07 - Escalation to major outage of ZITADEL Cloud
19:11 - De-escalation to degraded performance
19:26 - Escalation to major outage
20:07 - De-escalation, response times started to decrease
20:20 - Case closed, response times and error rates returned to normal

October 20, 2023 · 14:42 CET
Resolved

All response times returned to normal levels.
We are closing this incident and will provide a post-mortem in due course.

October 16, 2023 · 22:20 CET
Monitoring

Response times have returned to normal. We are still investigating.

October 16, 2023 · 22:15 CET
De-escalate

We’re currently experiencing degraded performance. Our team is working to restore normal performance levels. We apologize for any inconvenience. Next update in 30 minutes.

October 16, 2023 · 22:07 CET
Investigating

We are still investigating the issue. Thanks for your patience. Next update in 30 minutes.

October 16, 2023 · 21:49 CET
Escalate

Our service is experiencing an outage with ZITADEL. Our team is working to restore the affected service. We apologize for any inconvenience. Login is currently not possible and users are affected. Next update in 30 minutes.

October 16, 2023 · 21:26 CET
De-escalate

We’re currently experiencing degraded performance. Our team is working to restore normal performance levels. We apologize for any inconvenience. Next update in 30 minutes.

October 16, 2023 · 21:11 CET
Investigating

We are still investigating the issue. Thanks for your patience. Next update in 30 minutes.

October 16, 2023 · 21:00 CET
Investigating

We are still investigating the issue. Thanks for your patience. Next update in 30 minutes.

October 16, 2023 · 20:28 CET
Escalate

Our service is experiencing an outage with ZITADEL. Our team is working to restore the affected service. We apologize for any inconvenience. Login is currently not possible and users are affected. Next update in 15 minutes.

October 16, 2023 · 20:07 CET
Issue

Our team is currently investigating reports of a potential service interruption and elevated error levels with ZITADEL. We apologize for any inconvenience and will post another update as soon as we learn more.

October 16, 2023 · 19:58 CET
