Root Cause:
Upon investigation, the product team determined that the root cause of the incident was the unexpected termination of the internal container hosting the Azure Document DB process. Metrics showed that the cluster's health status was stable and that CPU and memory were not under sustained pressure, but a 'kill' event was recorded against the internal container at the onset of the issue. This event interrupted the primary process and made the cluster unavailable, even though the underlying VM remained healthy throughout. The Gateway Availability metric also showed a degraded state during the incident, but a gap in telemetry complicated immediate detection and alerting.
There was no evidence that a code defect caused the disruption. Instead, the incident resulted from a backend operational event within the container infrastructure, in which the process was lost for reasons that remain under review. The absence of related logs and core dump data limited the team's ability to pinpoint why the container was terminated.
High availability was not enabled for this cluster. The cluster therefore depended on the health of a single node and had no automated resilience against a process disruption.
Mitigation and Next Steps:
Once the unavailability was identified, product team engineers reconfigured the backend infrastructure, which restored cluster connectivity. Metrics after the intervention showed successful requests and stable health, confirming that the service was available to the customer again. The team is working to improve monitoring sensitivity and to close the telemetry gaps that delayed incident detection and escalation. Work is underway to tune the Gateway Availability monitor so that similar incidents are surfaced promptly in the future.
We regret the inconvenience caused by this unexpected database interruption. To reduce exposure to single-node failures, we recommend enabling high availability for clusters whose production workloads depend on continuous access. As part of the ongoing corrective actions, the engineering team is also reviewing container lifecycle controls and diagnostic data retention to support faster root cause identification. Thank you for your patience as we strengthen our operational safeguards to prevent a recurrence of this disruption.
Please let us know if you have any questions or concerns.