Authorization failures on ZITADEL API requests from service users using JWT Profile and Client Credentials Grant.
Updates
Incident Title
Authorization failures on ZITADEL API requests from service users using JWT Profile and Client Credentials Grant.
Incident Description
On October 1st, 2024, the deployment of ZITADEL Cloud version 2.62.3 introduced a critical error that caused authorization failures for customers using the JWT Profile or Client Credentials Grant. Affected service users could not access ZITADEL APIs and received the error message “could not read projectid by clientid (AUTH-GHpw2)”.
Incident Timeline
All times are in UTC on October 1st, 2024.
- 07:01 UTC: Version 2.62.3 rolled out to all regions.
- 08:14 UTC: Customers reported issues on Discord.
- 08:16 UTC: Customers reported issues on a GitHub issue.
- 08:34 UTC: Customer contacted support.
- 08:35 UTC: Incident surfaced in internal chat.
- 08:46 UTC: Team convened for an internal video chat to assess the situation.
- 08:48 UTC: Urgent rollback initiated for global and Switzerland regions.
- 08:57 UTC: Rollback to version 2.60.3 completed for all regions.
- 09:00 UTC: Monitoring showed the situation improving.
- 09:05 UTC: Customers confirmed a return to normal functionality.
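The intervals implied by the timeline can be derived with a few lines of Python. This is only an arithmetic sketch over the timestamps listed above; the dictionary keys and the helper name `minutes_between` are labels chosen here for illustration, not terms used by the incident team.

```python
from datetime import datetime

# Timestamps copied from the timeline above (UTC, October 1st, 2024).
# The key names are illustrative labels, not official incident terminology.
TIMELINE = {
    "rollout": "07:01",
    "first_customer_report": "08:14",
    "rollback_initiated": "08:48",
    "rollback_completed": "08:57",
    "customers_confirmed": "09:05",
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two timeline entries."""
    fmt = "%H:%M"
    delta = datetime.strptime(TIMELINE[end], fmt) - datetime.strptime(TIMELINE[start], fmt)
    return int(delta.total_seconds() // 60)


print(minutes_between("rollout", "first_customer_report"))       # 73
print(minutes_between("rollback_initiated", "rollback_completed"))  # 9
print(minutes_between("rollout", "rollback_completed"))          # 116
```

Read this way, roughly 73 minutes passed between the rollout and the first customer report, the rollback itself took about 9 minutes, and about 116 minutes elapsed between the rollout and the completed rollback.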
Impact of the Incident
Customers relying on tokens issued via JWT Profile or Client Credentials Grants experienced authorization failures, disrupting their services and workflows.
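The affected flow can be sketched as a small post-deployment check: obtain a token through the Client Credentials Grant, then call an API with it. This is a minimal sketch under stated assumptions, not the check ZITADEL runs. The function name, the `Transport` signature, and the `probe_path` parameter are hypothetical; the token path `/oauth/v2/token` and the scope `urn:zitadel:iam:org:project:id:zitadel:aud` are assumed from ZITADEL's public documentation and should be verified against the documentation for your instance.

```python
import base64
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Dict, Optional, Tuple

# Hypothetical transport signature: (method, url, headers, body) -> (status, body).
# Injecting it keeps the check testable without network access.
Transport = Callable[[str, str, Dict[str, str], Optional[bytes]], Tuple[int, bytes]]


def urllib_transport(method, url, headers, body):
    """Default transport built on the standard library only."""
    request = urllib.request.Request(url, data=body, headers=headers, method=method)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # Non-2xx responses are returned as data rather than raised.
        return err.code, err.read()


def check_service_user_access(
    issuer: str,
    client_id: str,
    client_secret: str,
    probe_path: str,
    transport: Transport = urllib_transport,
) -> Optional[str]:
    """Return None if the service user can call the API, else a failure description."""
    # Client authentication via HTTP Basic, as described in RFC 6749.
    pair = f"{urllib.parse.quote(client_id, safe='')}:{urllib.parse.quote(client_secret, safe='')}"
    basic = base64.b64encode(pair.encode()).decode()
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        # Assumed scope for reaching the ZITADEL APIs; verify for your setup.
        "scope": "openid urn:zitadel:iam:org:project:id:zitadel:aud",
    }).encode()
    status, body = transport(
        "POST",
        f"{issuer}/oauth/v2/token",  # assumed token endpoint path
        {
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        form,
    )
    if status != 200:
        return f"token request failed with HTTP {status}"
    token = json.loads(body)["access_token"]

    # Use the token on a real API call; in this incident, this step is where
    # customers saw the authorization error.
    status, body = transport(
        "GET", f"{issuer}{probe_path}", {"Authorization": f"Bearer {token}"}, None
    )
    if status != 200:
        return f"API call failed with HTTP {status}: {body.decode(errors='replace')}"
    return None
```

Because token issuance and token use are checked separately, a check of this shape would distinguish "no token was issued" from "the token was issued but rejected by the API", which matches the error described in this report.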
Root Cause
The root cause was a code change (https://github.com/zitadel/zitadel/pull/8580) that added the clientID to service account tokens and their OIDC session, which led to the observed authorization errors.
Resolution Steps
The issue was resolved by rolling back the problematic release and opening a fix (https://github.com/zitadel/zitadel/pull/8704).
Key Learnings
- Enhanced Observability: The incident highlighted the need for improved monitoring and alerting mechanisms to detect spikes in 4xx errors, enabling faster incident identification and response.
- Thorough Testing: Rigorous testing of code changes, particularly those impacting core authorization flows, is crucial to prevent regressions.
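The observability learning above, detecting spikes in 4xx errors, could be approached with a simple sliding-window check. The sketch below is illustrative only and does not describe ZITADEL's monitoring setup. The class name, window size, baseline ratio, multiplier, and minimum request count are all hypothetical values that would need tuning against real traffic.

```python
from collections import deque


class ClientErrorSpikeDetector:
    """Flags a spike when the 4xx ratio over a sliding window exceeds a threshold.

    All defaults are hypothetical placeholders, not recommended values.
    """

    def __init__(self, window_size=5, baseline_ratio=0.02, factor=5.0, min_requests=100):
        self.window = deque(maxlen=window_size)  # (total, client_errors) per interval
        self.threshold = baseline_ratio * factor
        self.min_requests = min_requests

    def observe(self, total_requests: int, client_errors: int) -> bool:
        """Record one interval's counts and report whether the window is in a spike."""
        self.window.append((total_requests, client_errors))
        total = sum(t for t, _ in self.window)
        errors = sum(e for _, e in self.window)
        # Avoid alerting on ratios computed from very little traffic.
        if total < self.min_requests:
            return False
        return errors / total > self.threshold


detector = ClientErrorSpikeDetector()
print(detector.observe(1000, 10))   # False: 1% of requests are 4xx
print(detector.observe(1000, 400))  # True: 410 of 2000 requests (20.5%) are 4xx
```

Comparing a ratio rather than an absolute count keeps the check meaningful as traffic volume changes, and the minimum request count prevents a handful of failed requests during quiet periods from raising an alert.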