Authentication Latency on DUO1 Deployment
Duo · DUO1 deployment
On August 29th, 2018, between 14:11 and 15:13 UTC, the DUO1 deployment experienced significant performance degradation. This led to increased authentication latency and intermittent request timeouts for all customer applications protected by the Duo service on this deployment. The incident resembled an earlier outage on August 20th.
The root cause was identified as a capacity issue on the DUO1 deployment, stemming from a combination of factors: specific request types, background jobs, inefficient database queries, and automatic retry mechanisms. Critically, requests piled up in the application’s request queue while waiting for database connections. The resulting backlog prevented the database from recovering and led to a cascading failure.
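The interaction between a growing backlog and automatic retries can be illustrated with a toy model. The sketch below is only an illustration under stated assumptions: the function name `simulate_backlog`, the per-tick capacity, the arrival numbers, and the rule that a fixed fraction of waiting requests is retried each tick are all hypothetical and are not taken from Duo's systems or data.

```python
def simulate_backlog(arrivals, capacity, retry_fraction):
    """Return the queue depth after each tick of a toy overload model.

    arrivals       -- new requests arriving at each tick (hypothetical numbers)
    capacity       -- requests the database tier can serve per tick (assumed)
    retry_fraction -- share of still-waiting requests that clients retry on
                      the next tick; the original requests stay queued, so
                      every retry is extra load (assumed behavior)
    """
    depth = 0
    retries = 0
    history = []
    for new_requests in arrivals:
        # New work plus retries of requests that are still waiting.
        depth += new_requests + retries
        # Serve as much as capacity allows this tick.
        depth -= min(depth, capacity)
        # Waiting clients time out and retry, duplicating queued work.
        retries = int(depth * retry_fraction)
        history.append(depth)
    return history


# A short spike above capacity, followed by traffic back below capacity.
traffic = [90] * 5 + [150] * 5 + [90] * 10

without_retries = simulate_backlog(traffic, capacity=100, retry_fraction=0.0)
with_retries = simulate_backlog(traffic, capacity=100, retry_fraction=0.5)

print("final depth without retries:", without_retries[-1])
print("final depth with retries:   ", with_retries[-1])
```

In this model, the backlog drains slowly after the spike when there are no retries, but keeps growing when waiting requests are retried, even though new traffic has dropped back below capacity. That is the kind of feedback loop the paragraph above describes.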
Immediate remediation involved implementing a maximum limit on the request queue. With this limit in place, excess requests are rejected immediately instead of being queued, which helps prevent the database from becoming overwhelmed. Additionally, queue depth monitoring was established to provide early alerts for potential future issues.
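A bounded queue with a depth check can be sketched as follows. This is a minimal illustration using Python's standard `queue` module, not Duo's implementation; the class name `BoundedRequestQueue`, the `max_depth` and `alert_depth` parameters, and the specific limits in the usage example are assumptions made for the example.

```python
import queue


class BoundedRequestQueue:
    """Request queue that rejects work once a maximum depth is reached."""

    def __init__(self, max_depth, alert_depth):
        # queue.Queue enforces the limit when maxsize is greater than zero.
        self._queue = queue.Queue(maxsize=max_depth)
        self.alert_depth = alert_depth  # hypothetical alerting threshold
        self.rejected = 0               # count of requests shed so far

    def submit(self, request):
        """Enqueue a request, or reject it immediately if the queue is full."""
        try:
            self._queue.put_nowait(request)
        except queue.Full:
            self.rejected += 1
            return False
        return True

    def take(self):
        """Return the oldest queued request, or None if the queue is empty."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None

    def depth(self):
        """Current number of queued requests."""
        return self._queue.qsize()

    def should_alert(self):
        """True when depth has reached the alerting threshold."""
        return self.depth() >= self.alert_depth


# Usage: a queue limited to 3 entries that alerts at a depth of 2.
requests = BoundedRequestQueue(max_depth=3, alert_depth=2)
results = [requests.submit(f"auth-{i}") for i in range(5)]
print(results, requests.rejected, requests.should_alert())
```

Rejecting at the queue boundary trades a fast failure for a small number of callers against a slow failure for everyone. The alert threshold is set below the hard limit so that operators are notified before requests start being rejected.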
For long-term resolution and improved scalability, Duo initiated several measures. Customer accounts were migrated off DUO1 to alleviate current load, and the database capacity for DUO1 was scheduled to be doubled. The database tier is also being re-architected to allow for customer-specific database servers, so that capacity can be added more flexibly and in isolation. The lessons learned are being incorporated into Duo’s ongoing capacity planning processes.