Low response success rate

Updates

Summary

On January 8th and 10th, customers were affected by elevated response times, elevated error rates, and decreased availability of our services. Our on-call team responded immediately after being alerted to the unusually high response times. The following timeline presents the events and timestamps of the outages (times in UTC):

2024-01-08

19:05 - 20:24: ZITADEL Cloud was unavailable; requests took up to 5 minutes and 500-class response codes were returned in many cases\
20:25: ZITADEL Cloud services were restored and served normal traffic\
21:37: Engineering decided to roll out a hotfix as a result of the ongoing root cause analysis

2024-01-10

20:25 - 20:42: ZITADEL Cloud was unavailable; requests took up to 5 minutes and 500-class response codes were returned in many cases\
20:42 - 21:02: ZITADEL Cloud recovered and availability of all services was restored\
21:02 - 21:35: ZITADEL Cloud was unavailable again; requests took up to 5 minutes and 500-class response codes were returned in many cases\
21:31: ZITADEL Cloud services were restored and served normal traffic\
23:31: Engineering decided to roll out a hotfix for further mitigation

Mitigation

After the initial assessment by our on-call team, we decided to restart ZITADEL Cloud at 2024-01-08 20:24 UTC and 2024-01-10 21:26 UTC. After each restart, services started to recover immediately.

Root Cause Analysis

During this outage, an exceptional event triggered outdated code, which exhausted ZITADEL’s database connections. This code did not handle an error path, so SQL connections on that path were never returned to the database connection pool. After ZITADEL was restarted, the connection pool was able to provide new connections again. However, connections continued to be consumed unnoticed after the restart, which led to the second outage on January 10th.

Resolution

The hotfix deployed at 2024-01-10 23:31 UTC ensures that all error cases result in a rollback of the transaction and that database connections are correctly returned to the connection pool.
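To illustrate the class of bug described above, here is a minimal sketch in Go. It is not ZITADEL's actual code: the `pool` type and the `leakyTx` and `safeTx` functions are hypothetical, and the pool is a toy stand-in that reports exhaustion immediately, whereas a real connection pool would typically make callers wait, which is consistent with the long response times seen during the outage.

```go
package main

import (
	"errors"
	"fmt"
)

// pool is a toy stand-in for a database connection pool (hypothetical).
// Each token in the channel represents one idle connection.
type pool struct {
	idle chan struct{}
}

func newPool(size int) *pool {
	p := &pool{idle: make(chan struct{}, size)}
	for i := 0; i < size; i++ {
		p.idle <- struct{}{}
	}
	return p
}

// acquire takes an idle connection, or reports exhaustion instead of blocking.
func (p *pool) acquire() error {
	select {
	case <-p.idle:
		return nil
	default:
		return errors.New("pool exhausted")
	}
}

// release returns a connection to the pool.
func (p *pool) release() { p.idle <- struct{}{} }

// available reports how many connections are currently idle.
func (p *pool) available() int { return len(p.idle) }

var errStep = errors.New("step failed")

// leakyTx releases the connection only on success.
// The early return on error never gives the connection back.
func leakyTx(p *pool, step func() error) error {
	if err := p.acquire(); err != nil {
		return err
	}
	if err := step(); err != nil {
		return err // bug: connection is leaked on this path
	}
	p.release()
	return nil
}

// safeTx defers the release, so every path returns the connection.
func safeTx(p *pool, step func() error) error {
	if err := p.acquire(); err != nil {
		return err
	}
	defer p.release() // runs on success and on error
	return step()
}

func main() {
	failing := func() error { return errStep }

	leaky := newPool(3)
	for i := 0; i < 5; i++ {
		_ = leakyTx(leaky, failing)
	}
	fmt.Println("leaky pool, idle connections:", leaky.available())

	safe := newPool(3)
	for i := 0; i < 5; i++ {
		_ = safeTx(safe, failing)
	}
	fmt.Println("safe pool, idle connections:", safe.available())
}
```

With Go's `database/sql` package, the equivalent idiom is to call `defer tx.Rollback()` directly after a successful `BeginTx`. A rollback after a successful `Commit` only returns `sql.ErrTxDone`, so the deferred call is harmless on the success path.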

Further improvements

We apologize for the impact this has caused. We will review our development and quality assurance procedures to further reduce unexpected behavior. This includes additional focus on testing during development and on code review. Additionally, we will review our on-call procedure and handbooks so that updates and declarations of service interruptions are published promptly on status.zitadel.com.

Timeline

All times in UTC

2024-01-08

19:04 - Alerts notified the on-call team\
19:28 - Public declaration of service interruption and elevated error rates\
20:27 - De-escalation, response times returned to normal levels\
20:50 - Case closed, response times and error rates returned to normal\
21:37 - Deployment of first hotfix, root cause analysis is ongoing

2024-01-10

20:27 - Alerts notified the on-call team\
20:42 - ZITADEL Cloud recovered\
21:06 - Alerts notified the on-call team\
21:35 - ZITADEL Cloud recovered\
23:31 - Deployment of hotfix

January 22, 2024 · 21:02 CET

Issue

Between 22:05 and 22:36 CET, the response rate of ZITADEL dropped. Clients saw response times of up to 5 minutes.

We apologize once more for the impact this may have had on you and your business. Rest assured that we will debrief this incident and provide you with a root cause analysis in due course.

January 11, 2024 · 09:00 CET
