Elevated Error Rates
Requests to the ZITADEL Cloud Service and Customer Portal returned HTTP status 500. ZITADEL Cloud customers were affected during these time frames:
- The first incident lasted 30 minutes, from 2:39 pm to 3:09 pm UTC on May 11, 2023.
- The second incident lasted 36 minutes, from 7:55 pm to 8:31 pm UTC on May 16, 2023.
We decided to publish a single post mortem for both incidents, as the root cause is the same.
We sincerely apologize for any impact this may have caused to you and your users. We are working closely with our infrastructure vendor to prevent a recurrence of this error.
Why did it happen?
Our cloud infrastructure provider changed its infrastructure logic in a way that prevented the ZITADEL process from accessing its configuration file. As a result, ZITADEL was unable to start new instances. Running ZITADEL revisions also lost access to the config file because, due to a configuration problem, our infrastructure provisioning tool deleted ZITADEL's previous configuration from its storage location.
Why did it happen twice?
After the first incident, we assumed that it was a temporary failure on our infrastructure provider's side. We assumed this because a deployment to one of our test environments at the same time failed for the exact same reason, while deployments before and after it succeeded. This assumption turned out to be wrong. Instead, the infrastructure provider was rolling out a change gradually, so the problem reappeared later and led to the second incident.
How do we mitigate it in the future?
- We change the way we provision the ZITADEL configuration, to make sure that the old configuration versions are always available. This ensures that falling back to the previous ZITADEL versions works as expected in error cases.
- We improve our monitoring and alerting to watch out for specific indications of this issue.
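The first mitigation can be illustrated with a minimal sketch. It is not our actual provisioning code: the `VersionedConfigStore` class, the file naming scheme, and the sample config contents are all hypothetical, and a local directory stands in for the real storage location. The idea it shows is that every configuration revision is written to its own path and never overwritten or deleted, so a running or rolled-back ZITADEL revision can still read the version it was started with.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


class VersionedConfigStore:
    """Hypothetical store that keeps every config revision in its own file."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, version: int) -> Path:
        # Assumed naming scheme: config-v000001.yaml, config-v000002.yaml, ...
        return self.root / f"config-v{version:06d}.yaml"

    def versions(self) -> list[int]:
        """Return all stored version numbers in ascending order."""
        return sorted(
            int(p.stem.split("-v")[1]) for p in self.root.glob("config-v*.yaml")
        )

    def publish(self, content: str) -> int:
        """Store a new revision next to the old ones and return its number."""
        existing = self.versions()
        version = existing[-1] + 1 if existing else 1
        # Mode "x" fails if the file exists, so an earlier revision
        # can never be silently replaced.
        with open(self._path(version), "x", encoding="utf-8") as f:
            f.write(content)
        return version

    def load(self, version: int) -> str:
        """Read the exact revision a deployment was pinned to."""
        return self._path(version).read_text(encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = VersionedConfigStore(Path(tmp))
        old = store.publish("setting: old\n")  # placeholder config content
        new = store.publish("setting: new\n")
        # A revision pinned to the old version can still read its config
        # after the new version has been published.
        print(store.load(old), end="")
```

Because publishing only ever adds files, a fallback to the previous ZITADEL revision does not depend on the newest configuration being readable.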
Timeline of the first incident on May 11, 2023 (all times in UTC)
- 02:39 pm - We deployed a new ZITADEL version with a changed config to production. The ZITADEL service went down immediately.
- 02:40 pm - Ops Team starts investigation.
- 02:54 pm - We redeployed ZITADEL, and the service started to work again.
Timeline of the second incident on May 16, 2023 (all times in UTC)
- 07:35 pm - HTTP response errors with status 500 start to rise.
- 07:57 pm - HTTP response errors with status 500 reach critical levels and alerts trigger.
- 08:10 pm - Ops Team starts investigation.
- 08:21 pm - Starting deployment of a hotfix.
- 08:29 pm - Deployment rollout of the hotfix finished.
- 08:31 pm - Error levels returned to normal conditions.
After a brief observation period we can see normal error levels.
Our engineering team deployed a hotfix to resolve this problem.
Error levels should return to normal levels immediately.
This error relates to https://status.zitadel.com/incidents/130606 and we will publish an RCA in the coming days.
We’re currently experiencing elevated error levels with ZITADEL Cloud. Our team is working to restore normal levels. We apologize for any inconvenience.
Next update in 15 minutes