Incident

Service disruption on the Portal and KB Site for private projects

Resolved | Dec 14, 2023 | 15:24 GMT+00:00

Impact:
Customers who had enabled private mode could not access the Knowledgebase and portal site pages. Requests for the home page or portal consistently timed out after approximately 30 seconds.

Cause:
Our investigation found that the identity server received a sudden, unusually high number of requests. The surge caused a notable increase in thread connections and starved the connection pool, so the service could not establish the database connections it needed. Requests waiting on the SQL server then timed out, which produced the timeouts customers observed.
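The connection pool starvation described above can be illustrated with a small simulation. This is a minimal sketch, not the production configuration: the function name simulate_pool, the pool size of 10, and the 0.5 second hold time are hypothetical, and the 30 second wait limit is only assumed to match the timeout customers saw.

```python
import heapq


def simulate_pool(pool_size, arrival_times, hold_seconds, acquire_timeout):
    """Simulate requests sharing a fixed-size connection pool.

    Returns (served, timed_out). All parameters are illustrative
    assumptions, not values taken from the affected service.
    """
    # Min-heap holding the time at which each pooled connection becomes free.
    free_at = [0.0] * pool_size
    heapq.heapify(free_at)

    served = 0
    timed_out = 0
    for arrival in sorted(arrival_times):
        # A request waits until the earliest connection is released.
        wait = max(0.0, free_at[0] - arrival)
        if wait > acquire_timeout:
            # No connection frees up in time, so the request gives up.
            timed_out += 1
            continue
        # The request holds the connection for hold_seconds once it starts.
        heapq.heapreplace(free_at, arrival + wait + hold_seconds)
        served += 1
    return served, timed_out


if __name__ == "__main__":
    steady = [i * 0.1 for i in range(600)]  # 10 requests per second for 60 s
    burst = [0.0] * 2000                    # 2000 requests arriving at once
    print("steady:", simulate_pool(10, steady, 0.5, 30.0))
    print("burst: ", simulate_pool(10, burst, 0.5, 30.0))
```

Under these assumed numbers the steady load never waits for a connection, while most of the burst is still queued when the wait limit expires. The pool, not the database itself, becomes the bottleneck.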

Mitigation:
Our auto-heal process detected the elevated server load and scaled out resources. This reactive measure allowed the system to absorb the increased demand and restored access to the Knowledgebase and portal pages for customers in private mode.
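A load-based scale-out decision of the kind described above might look like the following. This is a sketch under stated assumptions: the report does not describe the actual auto-heal rule, so the function name desired_instances, the 60 percent target, and the instance cap are all hypothetical.

```python
def desired_instances(current, load_percent_samples, target_percent=60, max_instances=10):
    """Return how many instances are needed to bring average load back to target.

    Hypothetical rule: scale out proportionally to the average load and
    never scale in. Loads are integer percentages per instance.
    """
    if not load_percent_samples:
        return current

    total = sum(load_percent_samples)
    count = len(load_percent_samples)

    # Average load is at or below target: keep the current instance count.
    if total <= target_percent * count:
        return current

    # Ceiling of current * average / target, using integer arithmetic
    # to avoid floating-point rounding.
    numerator = current * total
    denominator = target_percent * count
    needed = -(-numerator // denominator)

    return min(max_instances, needed)
```

For example, with 2 instances averaging 90 percent load and a 60 percent target, this rule asks for 3 instances. A real policy would usually add a cooldown period and a scale-in path, which are omitted here.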

Next Steps:
We are analyzing the incident to identify improvements that will prevent similar incidents in the future. Our aim is to implement proactive measures that make the system more resilient and responsive during periods of unexpected demand.

We appreciate our customers' understanding and patience during this incident, and we remain committed to improving our systems to provide a reliable service.