Authorization failures on ZITADEL API requests from service users using JWT Profile and Client Credentials Grant.

Minor incident Regions Global Switzerland GDPR safe countries Core Services OpenID Connect / OAuth API EU US AU
2024-10-01 09:01 CEST · 2 hours, 4 minutes

Updates

Issue

Incident Title

Authorization failures on ZITADEL API requests from service users using JWT Profile and Client Credentials Grant.

Incident Description

On October 1st, 2024, the deployment of ZITADEL Cloud version 2.62.3 introduced a critical error that caused authorization failures for customers using the JWT Profile or Client Credentials Grant. Affected service users could not access ZITADEL APIs and received the error message “could not read projectid by clientid (AUTH-GHpw2)”.

Incident Timeline

All times are in UTC on October 1st, 2024.

  • 07:01 UTC: Version 2.62.3 rolled out to all regions.
  • 08:14 UTC: Customers reported issues on Discord.
  • 08:16 UTC: Customers reported issues on a GitHub issue.
  • 08:34 UTC: Customer contacted support.
  • 08:35 UTC: Incident surfaced in internal chat.
  • 08:46 UTC: Team convened for an internal video chat to assess the situation.
  • 08:48 UTC: Urgent rollback initiated for global and Switzerland regions.
  • 08:57 UTC: Rollback to version 2.60.3 completed for all regions.
  • 09:00 UTC: Monitoring showed the situation improving.
  • 09:05 UTC: Customers confirmed a return to normal functionality.
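
The intervals implied by the timeline above can be checked with a short calculation. This is only a sketch: the timestamps are taken from the list, while the event labels (such as first_report) are names chosen here for illustration and are not part of the original report.

```python
from datetime import datetime

FMT = "%H:%M"

# Timestamps copied from the timeline (UTC, October 1st, 2024).
# The labels are illustrative names, not official terminology.
events = {
    "rollout": "07:01",
    "first_report": "08:14",
    "rollback_started": "08:48",
    "rollback_completed": "08:57",
    "recovery_confirmed": "09:05",
}


def minutes_since_rollout(label: str) -> int:
    """Return whole minutes between the rollout and the given event."""
    start = datetime.strptime(events["rollout"], FMT)
    end = datetime.strptime(events[label], FMT)
    return int((end - start).total_seconds() // 60)


intervals = {
    label: minutes_since_rollout(label)
    for label in events
    if label != "rollout"
}

for label, minutes in intervals.items():
    print(f"{label}: {minutes} min after rollout")
```

The interval from rollout to confirmed recovery comes out to 124 minutes, which matches the 2 hours, 4 minutes stated at the top of the report.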

Impact of the Incident

Customers relying on tokens issued via JWT Profile or Client Credentials Grants experienced authorization failures, disrupting their services and workflows.

Root Cause

The root cause was a code change (https://github.com/zitadel/zitadel/pull/8580) that added the clientID to service account tokens and their OIDC session, which led to the observed authorization errors.

Resolution Steps

The issue was resolved by rolling back the problematic release. A fix was subsequently opened in https://github.com/zitadel/zitadel/pull/8704.

Key Learnings

  • Enhanced Observability: The incident highlighted the need for improved monitoring and alerting mechanisms to detect spikes in 4xx errors, enabling faster incident identification and response.
  • Thorough Testing: Rigorous testing of code changes, particularly those impacting core authorization flows, is crucial to prevent regressions.
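
As an illustration of the observability point above, the following is a minimal sketch of a sliding-window check that flags a spike in 4xx responses. It is not ZITADEL's monitoring setup: the class name, window size, threshold, and minimum sample count are all assumptions chosen for the example, and a production system would more likely express this as an alerting rule in its metrics stack.

```python
from collections import deque


class ClientErrorSpikeDetector:
    """Flags when the share of 4xx responses in a sliding window is too high.

    All parameters are hypothetical defaults for illustration only.
    """

    def __init__(self, window_size: int = 100, threshold: float = 0.2,
                 min_samples: int = 20) -> None:
        # Each entry is True if the response was a 4xx, False otherwise.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, status_code: int) -> bool:
        """Record one response; return True if the 4xx ratio exceeds the threshold."""
        self.window.append(400 <= status_code < 500)
        if len(self.window) < self.min_samples:
            # Too little data to judge; avoid alerting on a handful of requests.
            return False
        return sum(self.window) / len(self.window) > self.threshold


def first_alert_index(status_codes, detector):
    """Return the 1-based position of the first response that triggers an alert."""
    for index, code in enumerate(status_codes, start=1):
        if detector.observe(code):
            return index
    return None
```

For example, feeding the detector a stream of healthy 200 responses followed by a run of 4xx responses returns the position at which the ratio first crosses the threshold, which is the moment an alert would fire.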
October 4, 2024 · 13:44 CEST
