Incident

Service disruption in private projects KB Site and Portal

Resolved | Feb 26, 2024 | 10:40 GMT+00:00

Incident:
On February 20, from 09:21 PM to 10:41 PM IST, and on February 21, from 07:26 PM to 09:11 PM IST, a production outage occurred in the Identity Server, disrupting service for multiple customers.

Duration:
The slowness persisted for 185 minutes in total (80 minutes on February 20 and 105 minutes on February 21), causing interruptions in service availability and connectivity.

Impact:
Customers using the Document360 portal and private projects experienced intermittent timeouts.

Cause:
During peak traffic periods, the system experienced performance degradation characterized by delayed response times and subsequent timeouts. Automatic horizontal scaling mechanisms were activated to manage the increased load on our servers. However, as horizontal scaling reached a critical threshold, the added instances increased the number of concurrent connections to the SQL database, exacerbating connection issues and resulting in timeouts.
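
As a rough illustration of this failure mode, the sketch below uses hypothetical numbers (the per-instance connection pool size, the instance counts, and the SQL connection cap are assumptions, not our actual configuration). It shows why adding instances can eventually exhaust a fixed database connection limit instead of relieving load, since every new instance brings its own pool of connections.

```python
# Illustrative sketch only -- pool size, instance counts, and the SQL
# connection cap below are hypothetical, not actual production values.
SQL_MAX_CONNECTIONS = 400       # assumed cap on concurrent SQL connections
POOL_SIZE_PER_INSTANCE = 50     # assumed connection pool opened by each app instance

def total_connections(instances: int) -> int:
    """Connections demanded when every instance opens a full pool."""
    return instances * POOL_SIZE_PER_INSTANCE

for instances in (4, 6, 8, 10):
    demanded = total_connections(instances)
    status = "OK" if demanded <= SQL_MAX_CONNECTIONS else "connection errors / timeouts"
    print(f"{instances:2d} instances -> {demanded} connections ({status})")
```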

Resolution:
To address this, we vertically scaled our server infrastructure to higher-core machines capable of handling increased loads with high availability. This adjustment improves performance and reliability during peak usage periods, mitigating potential disruptions caused by resource limitations.
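
Continuing with the same hypothetical numbers as above, the sketch below illustrates the reasoning behind this choice: fewer, higher-core machines can provide the same total compute while keeping the number of database connection pools, and therefore concurrent connections, within the SQL limit.

```python
# Illustrative sketch only -- core counts, pool size, and the SQL connection
# cap are hypothetical, not actual production values.
SQL_MAX_CONNECTIONS = 400
POOL_SIZE_PER_INSTANCE = 50

def capacity(instances: int, cores_per_instance: int) -> dict:
    """Total compute and database connections for a given topology."""
    connections = instances * POOL_SIZE_PER_INSTANCE
    return {
        "total_cores": instances * cores_per_instance,
        "db_connections": connections,
        "within_sql_limit": connections <= SQL_MAX_CONNECTIONS,
    }

print("scale out:", capacity(instances=10, cores_per_instance=4))  # more pools, limit exceeded
print("scale up :", capacity(instances=5, cores_per_instance=8))   # same cores, fewer pools
```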

Thank you for your understanding and continued support.
Should you have any further inquiries or require additional information, please do not hesitate to reach out to our support team.

Identified | Feb 20, 2024 | 04:02 GMT+00:00

Between 04:02 PM and 04:57 PM UTC on 20 February 2024, customers who had set up private mode were unable to access their Knowledge base site and portal pages, causing a disruption in service. Requests for the home page or the portal would consistently time out after approximately 30 seconds.
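
For reference, a minimal probe along the lines below (the project URL is a placeholder, not a real one) reproduces what affected customers saw: the request hangs and then times out at roughly the 30-second mark instead of returning the page.

```python
# Minimal availability probe; the URL is a hypothetical placeholder.
import socket
import urllib.error
import urllib.request

URL = "https://example.document360.io/"   # placeholder private project URL
TIMEOUT_S = 30.0                          # matches the ~30 s timeout observed

def probe(url: str, timeout: float = TIMEOUT_S) -> str:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"HTTP {resp.status}"
    except socket.timeout:
        return f"timed out after {timeout:.0f}s"
    except urllib.error.URLError as exc:
        return f"failed: {exc.reason}"

print(probe(URL))
```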

Root Cause:
Our investigation determined that the Identity Server encountered a sudden and substantial increase in traffic, characterized by an unusually high volume of requests. This surge pushed the database to its maximum concurrent-requests limit, resulting in delayed data writes.
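
The sketch below illustrates this failure mode in miniature: a bounded semaphore stands in for the database's concurrent-request limit, and the cap, timings, and request volume are all hypothetical. When a surge of writes exceeds the available slots, the excess requests queue up and time out rather than completing.

```python
# Illustrative sketch only -- the concurrency cap, durations, and request
# volume are hypothetical, not actual production values.
import threading
import time

MAX_CONCURRENT_WRITES = 5      # assumed database concurrent-request cap
WRITE_DURATION_S = 0.6         # assumed time each write holds a slot
CLIENT_TIMEOUT_S = 1.0         # how long a caller waits before giving up

db_slots = threading.BoundedSemaphore(MAX_CONCURRENT_WRITES)
results = []

def write(request_id: int) -> None:
    # Waiting for a free slot models queueing behind the concurrency limit.
    if db_slots.acquire(timeout=CLIENT_TIMEOUT_S):
        try:
            time.sleep(WRITE_DURATION_S)          # simulated write
            results.append((request_id, "ok"))
        finally:
            db_slots.release()
    else:
        results.append((request_id, "timed out"))

# A surge of 20 simultaneous writes against 5 concurrent slots.
threads = [threading.Thread(target=write, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

ok = sum(1 for _, outcome in results if outcome == "ok")
print(f"{ok} writes succeeded, {len(results) - ok} timed out")
```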

Mitigation:
The delay in the data write process caused temporary system slowness, which in turn delayed the automatic scale-up process defined in the system.
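
One way to picture this chain of events (purely illustrative; the threshold, load figures, and lag are hypothetical) is a threshold-based scale-up decision that reads a load metric persisted by the same slow write path: when the metric writes lag, the autoscaler acts on stale readings and scales up later than it should.

```python
# Illustrative sketch only -- threshold, load values, and write lag are hypothetical.
SCALE_UP_THRESHOLD = 0.80        # assumed load level that triggers scale-up
actual_load = [0.55, 0.70, 0.85, 0.92, 0.95, 0.97]   # real load per tick

def scale_up_tick(loads, lag):
    """First tick at which the autoscaler *sees* load above the threshold."""
    for tick in range(len(loads)):
        observed = loads[max(0, tick - lag)]   # stale reading when writes lag
        if observed >= SCALE_UP_THRESHOLD:
            return tick
    return None

print("scale-up with timely metrics :", scale_up_tick(actual_load, lag=0))  # tick 2
print("scale-up with delayed writes :", scale_up_tick(actual_load, lag=3))  # tick 5
```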

Next Steps:
We are currently engaged in a comprehensive analysis to identify potential areas for improvement to prevent similar incidents in the future.

We appreciate the understanding and patience of our customers during this incident, and we are dedicated to continuously enhancing our systems to provide a seamless and reliable service.