Investigating issues with ZITADEL Cloud
On January 8th and 10th, customers were affected by elevated response times and error rates and decreased availability of our services. Our on-call team responded immediately after being alerted about the unusually high response times. The following timeline presents the events and timestamps of the outages (times in UTC):
January 8th:\
19:05 - 20:24: ZITADEL Cloud was unavailable, requests took up to 5 minutes and 500 class response codes were returned in many cases\
20:25: ZITADEL Cloud services were restored and served normal traffic\
21:37: Engineering decided to roll out a hotfix as a result of the ongoing root cause analysis

January 10th:\
20:25 - 20:42: ZITADEL Cloud was unavailable, requests took up to 5 minutes and 500 class response codes were returned in many cases\
20:42 - 21:02: ZITADEL Cloud recovered and availability of all services was restored\
21:02 - 21:35: ZITADEL Cloud was unavailable again, requests took up to 5 minutes and 500 class response codes were returned in many cases\
21:35: ZITADEL Cloud services were restored and served normal traffic\
23:31: Engineering decided to roll out a hotfix for further mitigation
After the initial assessment by our on-call team, we decided to restart ZITADEL Cloud on 2024-01-08 at 20:24 UTC and on 2024-01-10 at 21:26 UTC. After each restart, services began to recover immediately.
Root Cause Analysis
During this outage, an exceptional event triggered an outdated code path that exhausted ZITADEL’s database connections. This code failed to handle an error path, which prevented SQL connections from being returned to the database connection pool. After ZITADEL was restarted, the connection pool was again able to provision new connections. However, connections continued to leak unnoticed following the restart, giving rise to the second outage on January 10th.
The hotfix deployed on 2024-01-10 at 23:31 UTC ensures that all error cases result in a rollback of the transaction and that database connections are correctly returned to the connection pool.
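The failure mode described above can be sketched with a toy connection pool. This is an illustrative example, not ZITADEL's actual code: `leakyQuery` mirrors the faulty error path that never returns its connection, while `safeQuery` mirrors the hotfix by releasing on every path (the same idea as `defer tx.Rollback()` with Go's `database/sql`, where rolling back after a successful commit is a harmless no-op).

```go
package main

import (
	"errors"
	"fmt"
)

// Pool is a toy fixed-size connection pool standing in for a database
// connection pool. (Hypothetical sketch, not ZITADEL code.)
type Pool struct {
	conns chan struct{}
}

func NewPool(size int) *Pool {
	p := &Pool{conns: make(chan struct{}, size)}
	for i := 0; i < size; i++ {
		p.conns <- struct{}{}
	}
	return p
}

// Acquire takes a connection, failing when the pool is exhausted.
func (p *Pool) Acquire() error {
	select {
	case <-p.conns:
		return nil
	default:
		return errors.New("pool exhausted")
	}
}

// Release returns a connection to the pool.
func (p *Pool) Release() { p.conns <- struct{}{} }

// leakyQuery mimics the faulty code path: on error the connection is
// never released, so repeated failures drain the pool.
func leakyQuery(p *Pool, fail bool) error {
	if err := p.Acquire(); err != nil {
		return err
	}
	if fail {
		return errors.New("query failed") // bug: connection leaked
	}
	p.Release()
	return nil
}

// safeQuery releases the connection on every path, mirroring the hotfix
// (analogous to deferring the transaction rollback).
func safeQuery(p *Pool, fail bool) error {
	if err := p.Acquire(); err != nil {
		return err
	}
	defer p.Release() // always runs, even on the error path
	if fail {
		return errors.New("query failed")
	}
	return nil
}

func main() {
	leaky := NewPool(2)
	leakyQuery(leaky, true)
	leakyQuery(leaky, true)
	fmt.Println("leaky pool after two failures:", leakyQuery(leaky, false))

	fixed := NewPool(2)
	safeQuery(fixed, true)
	safeQuery(fixed, true)
	fmt.Println("fixed pool after two failures:", safeQuery(fixed, false))
}
```

With a pool of two connections, two failing calls through the leaky path leave the pool empty, so even a healthy request then fails; the safe variant survives any number of failures. This matches the observed behavior: errors piled up until the pool was exhausted, and a restart (which rebuilds the pool) restored service.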
We apologize for the impact this has caused. We will review our development and quality assurance procedures to further reduce unexpected behavior, with additional focus on testing during development and on code review. Additionally, we will review our on-call procedures and handbooks so that service interruptions are declared and updated on status.zitadel.com without delay.
All times in UTC
January 8th:\
19:04 - Alerts notified the on-call team\
19:28 - Public declaration of service interruption and elevated error rates\
20:27 - De-escalation, response times returned to normal levels\
20:50 - Case closed, response times and error rates returned to normal\
21:37 - Deployment of first hotfix, root cause analysis ongoing

January 10th:\
20:27 - Alerts notified the on-call team\
20:42 - ZITADEL Cloud recovered\
21:06 - Alerts notified the on-call team\
21:35 - ZITADEL Cloud recovered\
23:31 - Deployment of hotfix
We are closing this incident since errors and latency have returned to normal levels.
Please allow us to apologize once more for the impact this may have had on you and your business. Rest assured that we will debrief on this incident and provide you with a root cause analysis in due course.
Remediation is showing effect; we are seeing improvements across all services as error levels return to normal.
We continue to monitor the situation and apologize again for the inconvenience.
Remediation is still ongoing; we are sorry for the inconvenience.
We have identified a potential cause and are working on a short-term remediation.
Our team is currently investigating reports of a potential service interruption and elevated error levels in ZITADEL’s services. We apologize for any inconvenience and will post another update as soon as we learn more.