Investigating Service Degradation
Updates
Summary
On November 17th at 05:00 AM UTC, a bug in our database locking logic caused a service degradation impacting login, logout, and token requests in the Global and Switzerland regions. The incident lasted approximately 27 hours and 48 minutes, causing intermittent disruptions for users attempting to access these services. We applied traffic management, infrastructure scaling, and database optimization to mitigate the impact during the incident. A hotfix was deployed, resolving the issue and restoring full service stability by November 18th, 08:48 AM UTC.
What Happened
Our system uses a “with existing” query mechanism together with database locks to prevent conflicts when multiple operations attempt to modify the same data. When pushing events, this query locks the latest event of the affected aggregate IDs using FOR UPDATE to prevent parallel inserts on the same aggregate.
However, a critical oversight in the locking logic was that this lock did not encompass newly written events. As a result, a query waiting on the lock still computed its sequence from the state before the initial push, leading to a unique constraint violation in the database. The issue was further compounded by Zitadel’s automatic retry mechanism, which by default retries the operation up to 15 times, potentially intensifying the database load.
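To illustrate the general shape of this pattern, here is a minimal Go sketch. The events table, its columns, and the pushEvent function are simplified assumptions for the example and do not reflect Zitadel's actual schema or code; the point is that the FOR UPDATE lock only covers rows that already existed, so a waiting transaction still sees the old latest sequence after the first pusher commits and then violates the unique constraint.

```go
// Minimal sketch of the flawed locking pattern (assumptions: a Postgres-compatible
// database, an "events" table with columns aggregate_id, sequence, payload and a
// UNIQUE (aggregate_id, sequence) constraint; all names are illustrative only).
package eventstore

import (
	"context"
	"database/sql"
	"errors"
	"fmt"

	_ "github.com/lib/pq" // driver choice is an assumption for the example
)

func pushEvent(ctx context.Context, db *sql.DB, aggregateID string, payload []byte) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Lock the latest *existing* event of the aggregate to serialize pushers.
	// The lock does not cover events inserted after this query took its snapshot.
	var lastSeq int64
	err = tx.QueryRowContext(ctx, `
		SELECT sequence FROM events
		WHERE aggregate_id = $1
		ORDER BY sequence DESC
		LIMIT 1
		FOR UPDATE`, aggregateID).Scan(&lastSeq)
	if errors.Is(err, sql.ErrNoRows) {
		lastSeq = 0 // first event for this aggregate, nothing to lock
	} else if err != nil {
		return err
	}

	// A transaction that was blocked on the row above wakes up after the first
	// pusher commits, but still sees the old lastSeq, computes the same next
	// sequence, and fails here with a unique constraint violation.
	_, err = tx.ExecContext(ctx, `
		INSERT INTO events (aggregate_id, sequence, payload)
		VALUES ($1, $2, $3)`, aggregateID, lastSeq+1, payload)
	if err != nil {
		return fmt.Errorf("push event: %w", err)
	}
	return tx.Commit()
}
```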
Impact
The service degradation resulted in intermittent failures for users attempting to log in, log out, or use applications that require tokens. While the exact number of impacted users has not yet been quantified, customer support received reports of potential disruptions. We are analyzing system logs to determine the full extent of the impact.
Our Response
- Traffic Management: Implemented traffic limiting and blocking, particularly for the token endpoint, to alleviate strain on the system (a rate-limiting sketch follows this list).
- Truncating Old Data: To counteract some of the negative impact, we removed old token.added events that are no longer used by our customers from the database.
- Infrastructure Scaling: Upgraded the database server to increase capacity and handle the elevated processing demands.
- Database Optimization: Optimized database indexes to improve query performance.
- Hotfix Deployment: Developed and deployed a hotfix to address the locking logic flaw.
- Hotfix Release: Released the hotfix as a bugfix: https://github.com/zitadel/zitadel/pull/8816
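As a rough illustration of the traffic-management measure above, the sketch below applies a token-bucket limit to a single endpoint. The endpoint path, the limits, and the use of golang.org/x/time/rate are assumptions chosen for the example, not the actual configuration used during the incident.

```go
// Minimal sketch of per-endpoint rate limiting (assumptions: plain net/http,
// golang.org/x/time/rate, and illustrative paths and limits).
package main

import (
	"log"
	"net/http"

	"golang.org/x/time/rate"
)

// limit wraps a handler with a token bucket allowing r requests per second
// with a burst of b; requests over the limit are rejected with HTTP 429.
func limit(next http.Handler, r rate.Limit, b int) http.Handler {
	limiter := rate.NewLimiter(r, b)
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, req)
	})
}

func main() {
	token := http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		w.Write([]byte("token response placeholder"))
	})
	// Apply a stricter limit to the token endpoint than to the rest of the API.
	http.Handle("/oauth/v2/token", limit(token, rate.Limit(100), 20))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```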
What We’re Doing to Prevent This in the Future
- Targeted Testing: Expanding testing procedures with scenarios that replicate the conditions leading to the incident.
- Load Testing: Investing in sophisticated synthetic load testing to simulate realistic production environments and identify performance bottlenecks and concurrency issues earlier in development (see the load-generation sketch after this list).
- Performance Monitoring: Our performance team is actively engaged in addressing all performance-related issues, including those highlighted by this incident, and continuously working to improve system efficiency and scalability.
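For a sense of what such synthetic load tests look like, the sketch below fires concurrent requests at a placeholder endpoint and counts failures. The URL, concurrency, and duration are made-up values, and this is not our actual load-testing tooling; real tests would also ramp load, vary payloads, and record latency percentiles.

```go
// Minimal sketch of a synthetic load generator (assumptions: target URL,
// concurrency, and duration are placeholders).
package main

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const (
		workers  = 50               // concurrent virtual users (placeholder)
		duration = 30 * time.Second // test duration (placeholder)
		target   = "https://auth.example.com/oauth/v2/token" // hypothetical endpoint
	)

	var ok, failed int64
	deadline := time.Now().Add(duration)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			client := &http.Client{Timeout: 5 * time.Second}
			for time.Now().Before(deadline) {
				resp, err := client.Post(target, "application/x-www-form-urlencoded", nil)
				if err != nil || resp.StatusCode >= 500 {
					atomic.AddInt64(&failed, 1)
				} else {
					atomic.AddInt64(&ok, 1)
				}
				if resp != nil {
					resp.Body.Close()
				}
			}
		}()
	}
	wg.Wait()
	fmt.Printf("requests ok=%d failed=%d\n", ok, failed)
}
```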
Lessons Learned
- Database Locking: The incident highlighted the critical importance of robust database locking mechanisms in concurrent environments. We need to enhance code reviews and testing specifically around database interactions.
- Retry Mechanisms: While designed to improve resilience, retry mechanisms can amplify issues under certain failure conditions. We will review and refine our retry logic to prevent unintended consequences (a generic backoff sketch follows this list).
- Monitoring and Alerting: We need to improve our monitoring and alerting systems to provide earlier and more granular notifications of potential service degradations.
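To make the retry point concrete, the sketch below bounds the number of attempts and adds exponential backoff with jitter so that conflicting pushes do not retry in lockstep against the database. This is a generic pattern shown for illustration, not Zitadel's actual retry implementation.

```go
// Minimal sketch of a bounded retry with exponential backoff and jitter
// (assumption: a generic pattern, not Zitadel's retry code).
package retry

import (
	"context"
	"errors"
	"math/rand"
	"time"
)

// errConflict stands in for a retryable error such as a unique constraint violation.
var errConflict = errors.New("sequence conflict")

// retryPush retries a conflicting push a small, bounded number of times,
// sleeping for an exponentially growing, jittered interval between attempts.
func retryPush(ctx context.Context, push func(context.Context) error, maxAttempts int) error {
	backoff := 50 * time.Millisecond
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = push(ctx); err == nil || !errors.Is(err, errConflict) {
			return err // success, or an error that should not be retried
		}
		// Jitter spreads concurrent retries out instead of letting them all
		// hit the database again at the same moment.
		sleep := backoff + time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
	}
	return err
}
```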
Timeline
17th, 05:00 AM UTC - Early Signals of Service Degradation
17th, 06:30 AM UTC - Team Alerted
17th, 06:35 AM UTC - Team Starts Working on the Incident
17th, 06:52 AM UTC - Notifications received from customers about potential disruptions
17th, 08:51 AM UTC - Identified the potential problem
17th, 08:53 AM UTC - Deployed stricter rate limits to reduce system strain
17th, 09:09 AM UTC - Monitoring the Situation
17th, 03:00 PM UTC - Further fixes needed, tasking engineering to improve push and retry logic
17th, 04:35 PM UTC - Upgraded Database Server to Adapt to Increased Load
18th, 05:53 AM UTC - Hotfix Rollout Started
18th, 06:58 AM UTC - Hotfix Rollout Completed
18th, 07:00 AM UTC - Monitoring the Situation
18th, 08:48 AM UTC - Closing Incident
We’re closing the report about this problem since it’s been solved. We’re sorry that some of our customers were affected. We’ll share what caused the problem in the first place in a detailed root cause analysis.
We are going to de-escalate this incident now, since we have good signals that our service quality is back within our defined goals.
The team will further monitor the situation and work on a root cause analysis in the coming days.
We are actively monitoring the situation and will provide an update on further progress in 30 minutes.
All error levels and latencies have returned to normal levels. We will further monitor the situation.
We are sorry for any potential impact this had on our customers. The team will conduct a root cause analysis and publish it here in due course.
While API latency has remained stable, we still observe elevated error rates on the authentication endpoints (login, tokens, …).
The team is doing further investigations.
We are currently rolling out a mitigation for the infrastructure-related issue we are experiencing.
We identified an additional infrastructure-related problem.
Work on the mitigation is ongoing.
The team is currently investigating additional solutions to further improve the latency.
We identified a potential problem and started rolling out a temporary mitigation to improve the service quality.
Our team is currently investigating reports of a potential service interruption and elevated error levels. We apologize for any inconvenience and will post another update as soon as we learn more.