Investigating Service Degradation

Minor incident · Regions: Global, Switzerland, GDPR safe countries
2024-11-17 09:40 CET · 1 day, 1 minute

Updates

Post-mortem

Summary

On November 17th, 05:00 AM UTC, a bug in our database locking logic caused a service degradation impacting login, logout, and token requests in the Global and Switzerland regions. The incident lasted approximately 39 hours and 48 minutes, causing intermittent disruptions for users attempting to access these services. We implemented traffic management, infrastructure scaling, and database optimization to mitigate impact during the incident. A hotfix was successfully deployed, resolving the issue and restoring full service stability by November 18th, 08:48 AM UTC.

What Happened

Our system uses a query on existing events combined with database locks to prevent conflicts when multiple operations attempt to modify the same data. When events are pushed, this query locks the latest event of the affected aggregate IDs using FOR UPDATE to prevent parallel inserts on the same aggregate.

However, a critical oversight in the locking logic was that the lock did not cover newly written events. As a result, a waiting query still saw the same latest sequence as before the initial push completed, and its subsequent insert failed with a unique constraint violation in the database. The issue was further compounded by Zitadel’s automatic retry mechanism, which by default retries the operation up to 15 times, potentially intensifying the database load.
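
For illustration, the following minimal Go sketch shows the push pattern and the race described above. The simplified events table, its unique (aggregate_id, sequence) constraint, and all names are assumptions made for this sketch, not our actual schema or code.

    // Minimal sketch of the push pattern and the race described above. The
    // simplified events table, its unique (aggregate_id, sequence) constraint,
    // and all names here are illustrative assumptions.
    package eventstore

    import (
        "context"
        "database/sql"
        "errors"
        "fmt"
    )

    // pushEvent locks the latest stored event of an aggregate with FOR UPDATE,
    // derives the next sequence from it, and inserts the new event in the same
    // transaction.
    func pushEvent(ctx context.Context, db *sql.DB, aggregateID string, payload []byte) error {
        tx, err := db.BeginTx(ctx, nil)
        if err != nil {
            return err
        }
        defer tx.Rollback()

        var lastSeq int64
        err = tx.QueryRowContext(ctx,
            `SELECT sequence FROM events
              WHERE aggregate_id = $1
              ORDER BY sequence DESC
              LIMIT 1
              FOR UPDATE`,
            aggregateID,
        ).Scan(&lastSeq)
        if err != nil && !errors.Is(err, sql.ErrNoRows) {
            return fmt.Errorf("lock latest event: %w", err)
        }

        // The FOR UPDATE lock only covers rows that already existed when the
        // query ran. An event inserted by a parallel push is neither locked nor
        // reflected in lastSeq, so a waiting transaction computes the same next
        // sequence and the INSERT below fails with a unique constraint violation.
        next := lastSeq + 1
        if _, err := tx.ExecContext(ctx,
            `INSERT INTO events (aggregate_id, sequence, payload) VALUES ($1, $2, $3)`,
            aggregateID, next, payload,
        ); err != nil {
            return fmt.Errorf("insert event: %w", err)
        }
        return tx.Commit()
    }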

Impact

The service degradation resulted in intermittent failures for users attempting to log in, log out, or use applications that require tokens. While the exact number of impacted users is not yet quantified, customer support received notifications of potential disruptions. We are working to analyze system logs to determine the full extent of the impact.

Our Response

  • Traffic Management: Implemented traffic limiting and blocking, particularly for the token endpoint, to alleviate strain on the system.
  • Truncating old data: To counteract some of the negative impact, we removed old token.added events that our customers no longer use from the database (a sketch of such a batched cleanup follows this list).
  • Infrastructure Scaling: Upgraded the database server to increase capacity and handle the elevated processing demands.
  • Database Optimization: Optimized database indexes to improve query performance.
  • Hotfix Deployment: Developed and deployed a hotfix to address the locking logic flaw.
  • Hotfix Release: Released the hotfix as a bugfix: https://github.com/zitadel/zitadel/pull/8816
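
As an illustration of the data-truncation step, the Go sketch below deletes old token.added events in small batches so that each DELETE holds locks only briefly. The table and column names, the event-type string, and the batching approach are assumptions for illustration, not the exact cleanup we ran.

    // deleteOldTokenEvents removes token.added events created before the cutoff
    // in small batches. Schema and event-type names are assumed for illustration.
    package cleanup

    import (
        "context"
        "database/sql"
        "time"
    )

    func deleteOldTokenEvents(ctx context.Context, db *sql.DB, cutoff time.Time, batchSize int) (int64, error) {
        var total int64
        for {
            res, err := db.ExecContext(ctx,
                `DELETE FROM events
                  WHERE id IN (
                        SELECT id FROM events
                         WHERE event_type = 'token.added'
                           AND creation_date < $1
                         LIMIT $2)`,
                cutoff, batchSize,
            )
            if err != nil {
                return total, err
            }
            deleted, err := res.RowsAffected()
            if err != nil {
                return total, err
            }
            total += deleted
            if deleted < int64(batchSize) {
                return total, nil // no full batch left, cleanup finished
            }
        }
    }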

What We’re Doing to Prevent This in the Future

  • Targeted Testing: Expanding testing procedures with scenarios that replicate the conditions leading to the incident.
  • Load Testing: Investing in sophisticated synthetic load testing to simulate realistic production environments and identify performance bottlenecks and concurrency issues earlier in development.
  • Performance Monitoring: Our performance team is actively engaged in addressing all performance-related issues, including those highlighted by this incident, and continuously working to improve system efficiency and scalability.

Lessons Learned

  • Database Locking: The incident highlighted the critical importance of robust database locking mechanisms in concurrent environments. We need to enhance code reviews and testing specifically around database interactions.
  • Retry Mechanisms: While designed to improve resilience, retry mechanisms can amplify issues under certain failure conditions. We will review and refine our retry logic to prevent unintended consequences (a generic sketch follows this list).
  • Monitoring and Alerting: We need to improve our monitoring and alerting systems to provide earlier and more granular notifications of potential service degradations.
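
To make the retry point concrete, here is a generic Go sketch of a bounded retry with exponential backoff and jitter, one common way to keep retries from amplifying load under failure. It is illustrative only and not Zitadel’s actual retry implementation.

    // Package retry shows a generic bounded retry with exponential backoff and
    // full jitter, as an illustration of the general technique.
    package retry

    import (
        "context"
        "math/rand"
        "time"
    )

    // Do runs op up to maxAttempts times. Between attempts it sleeps a random
    // duration in [0, wait) and then doubles wait, so concurrent callers back
    // off instead of hammering the database in lockstep. base must be > 0.
    func Do(ctx context.Context, maxAttempts int, base time.Duration, op func() error) error {
        var err error
        wait := base
        for attempt := 1; attempt <= maxAttempts; attempt++ {
            if err = op(); err == nil {
                return nil
            }
            if attempt == maxAttempts {
                break
            }
            select {
            case <-time.After(time.Duration(rand.Int63n(int64(wait)))):
            case <-ctx.Done():
                return ctx.Err()
            }
            wait *= 2
        }
        return err
    }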

Timeline

17th, 05:00 AM UTC - Early signals of service degradation
17th, 06:30 AM UTC - Team alerted
17th, 06:35 AM UTC - Team starts working on the incident
17th, 06:52 AM UTC - Notifications received from customers about potential disruptions
17th, 08:51 AM UTC - Potential problem identified
17th, 08:53 AM UTC - Stricter rate limits deployed to reduce system strain
17th, 09:09 AM UTC - Monitoring the situation
17th, 03:00 PM UTC - Further fixes needed; engineering tasked with improving the push and retry logic
17th, 04:35 PM UTC - Database server scaled up to handle the increased load
18th, 05:53 AM UTC - Hotfix rollout started
18th, 06:58 AM UTC - Hotfix rollout completed
18th, 07:00 AM UTC - Monitoring the situation
18th, 08:48 AM UTC - Incident closed

December 2, 2024 · 14:51 CET
Resolved

We’re closing the report about this problem since it’s been solved. We’re sorry that some of our customers were affected. We’ll share what caused the problem in the first place in a detailed root cause analysis.

November 18, 2024 · 09:38 CET
De-escalate

We are going to de-escalate this incident now, since we have good signals that our service quality is back within our defined goals.

The team will further monitor the situation and work on a root cause analysis in the coming days.

November 18, 2024 · 08:22 CET
Monitoring

We are actively monitoring the situation and will provide an update on the further process in 30 minutes.

November 18, 2024 · 07:44 CET
Monitoring

All error levels and latencies have returned to normal levels. We will further monitor the situation.

We are sorry for any potential impact on our customers. The team will conduct a root cause analysis and publish it here in due course.

November 17, 2024 · 21:08 CET
Investigating

While API latency has remained stable, we still observe elevated error rates on the authentication endpoints (Login, Tokens, …).

The team is doing further investigations.

November 17, 2024 · 16:00 CET
Investigating

We are currently rolling out a mitigation for the infrastructure-related issue we are experiencing.

November 17, 2024 · 13:42 CET
Escalate

We identified an additional infrastructure-related problem.

Work for the mitigation is ongoing.

November 17, 2024 · 12:04 CET
Investigating

The team is currently investigating additional solutions to further improve the latency.

November 17, 2024 · 11:29 CET
Investigating

We identified a potential problem and started rolling out a temporary mitigation to improve the service quality.

November 17, 2024 · 10:26 CET
Issue

Our team is currently investigating reports of a potential service interruption and elevated error levels. We apologize for any inconvenience and will post another update as soon as we learn more.

November 17, 2024 · 09:40 CET
