
Azure DocumentDB (with MongoDB compatibility) Connection Timeouts

Phillip Stenger 5 Reputation points
2026-05-18T15:23:26.5566667+00:00

We have started to experience connection timeouts for our Azure DocumentDB MongoDB resource. We have an app running in Azure Container Apps that connects to the database over Private Link, and it is suddenly unable to connect. Connecting from my local machine, which previously worked, also fails.

The issue started early this morning around 8:15 AM. The only changes in the Activity Log this morning were a "Create role assignment" and a "Create or update resource diagnostic setting", both of which happened as part of a Terraform deployment. Both the app and my local machine authenticate with a password, so the role assignment is unlikely to be the cause.

This is the error I get from my local machine: Unable to connect: connect ETIMEDOUT

In my container app: pymongo.errors.ServerSelectionTimeoutError: [redacted].mongocluster.cosmos.azure.com:10260: timed out, Timeout: 30s,

Another observation: I noticed an unusual CPU spike to 60% at 5:40 AM this morning. The previous maximum over the past couple of months was ~12%. I could not find any requests in the database logs at that time.

Azure Cosmos DB

An Azure NoSQL database service for app development.


2 answers

  1. Manoj Kumar Boyini 16,725 Reputation points Microsoft External Staff Moderator
    2026-05-26T13:20:48.2833333+00:00

    Hi @Phillip Stenger

    Root Cause:  

    Upon investigation, the product team determined that the root cause of the incident was the unexpected termination of the internal container hosting the Azure DocumentDB process. Metrics showed that the health status of the cluster was stable, and resources such as CPU and memory did not indicate consistent pressure, but a 'kill' event was recorded against the internal container at the onset of the issue. This event interrupted the primary process, making the cluster unavailable even though the underlying VM remained healthy throughout.

    The Gateway Availability metric also showed a degraded state during the incident, but a gap in telemetry complicated immediate detection and alerting.

    There was no evidence of a code defect causing the disruption. Instead, the incident was the result of a backend operational event within the container infrastructure, where the process was lost for reasons that remain under review. The absence of related logs and core dump data limited the ability to pinpoint why the container was terminated.

    High availability was not enabled for this cluster, which increased reliance on the health of the single node and left no automated resilience in case of process disruption.

    Mitigation and Next Steps: 

    After identifying the unavailability, product team engineers reconfigured the backend infrastructure, which restored cluster connectivity. Metrics after the intervention showed successful requests and stable health, confirming that the service was available to the customer again. The team is actively working to improve monitoring sensitivity and close the observed telemetry gaps, which delayed incident detection and escalation. Work is underway to tune the Gateway Availability monitor so that incidents are surfaced promptly in the future.

    We regret the inconvenience experienced due to this unexpected database interruption. As ongoing corrective actions, we recommend enabling high availability for clusters where production workloads depend on continuous access, reducing exposure to single-node failures. The engineering team is also reviewing container lifecycle controls and diagnostic data retention to aid rapid root cause identification. Thank you for your patience as we reinforce our operational defenses to prevent a recurrence of this disruption.   

    Please let us know if you have any questions or concerns.



  2. Vinodh247-1375 43,101 Reputation points Volunteer Moderator
    2026-05-18T15:41:20.62+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    This is not an authentication issue. It is a connectivity failure.

    Most likely causes:

    • Private Link/DNS issue (endpoint not resolving to private IP)

    • Firewall or network rules reset (public access blocked or IP not allowed)

    • Private endpoint not in Approved state

    • Possible Cosmos DB backend failover/service issue (CPU spike is a clue)

    Key signal: Both local and container connections time out and there are no DB logs, so requests are not reaching Cosmos DB.

    Immediate checks:

    • Run nslookup <account>.mongocluster.cosmos.azure.com

    • Verify the Private Endpoint shows as Connected

    • Temporarily enable public access to isolate the issue

    • Check Azure Service Health

    Conclusion: Focus on DNS + networking + private endpoint, not role assignment.
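The DNS and reachability checks above can also be scripted from the machine or container that is timing out. The following is a minimal sketch using only the Python standard library. The function names `is_private_ip` and `check_endpoint` are illustrative, not part of any Azure or pymongo API, and the default port 10260 is taken from the error message in the question. It only tests name resolution and a raw TCP connection, so it cannot say anything about TLS or authentication.

```python
import ipaddress
import socket


def is_private_ip(address: str) -> bool:
    """Return True if the address is in a non-public range.

    Note: Python's ipaddress module also counts loopback addresses as private.
    """
    # Drop an IPv6 scope id such as "%eth0" before parsing.
    return ipaddress.ip_address(address.split("%")[0]).is_private


def check_endpoint(host: str, port: int = 10260, timeout: float = 5.0) -> dict:
    """Resolve host, then attempt a plain TCP connection to host:port."""
    result = {
        "host": host,
        "port": port,
        "addresses": [],   # every IP address the name resolved to
        "private": [],     # subset of addresses in a private range
        "tcp_ok": False,   # True if a TCP connection could be opened
        "error": None,
    }

    # Step 1: DNS. Over Private Link the name is expected to resolve
    # to a private IP inside the virtual network.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = f"DNS resolution failed: {exc}"
        return result

    addresses = sorted({info[4][0] for info in infos})
    result["addresses"] = addresses
    result["private"] = [a for a in addresses if is_private_ip(a)]

    # Step 2: TCP. A timeout here means packets are not reaching the
    # endpoint (or replies are not coming back).
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["tcp_ok"] = True
    except OSError as exc:  # includes timeouts and refused connections
        result["error"] = f"TCP connect failed: {exc}"

    return result
```

Calling `check_endpoint("<account>.mongocluster.cosmos.azure.com")` with the real cluster host name returns a dictionary: an empty `private` list from inside the virtual network points at DNS, while `tcp_ok` being False with a resolved address points at network rules or the service itself.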

    Please 'Upvote' (thumbs-up) and 'Accept' as answer if the reply was helpful. This will benefit other community members who face the same issue.


