Cosmos DB: Network level Metrics #1495

@sourabh1007

Description

Area(s)

area:db

Is your change request related to a problem? Please describe.

In the Cosmos DB SDK, a single operation involves several network calls. Currently, if something goes wrong (e.g., high latency) with these network calls, customers rely solely on the logs they have implemented in their applications. When investigating such issues, we are dependent on the information provided by the customer and backend telemetry. To improve monitoring and make it more aligned with potential errors, I am proposing a set of metrics that the SDK should collect to enhance observability.

Describe the solution you'd like

The metrics proposed below cover network calls. The SDK makes two kinds of network calls:

  1. Gateway (i.e. HTTP)
  2. TCP (i.e. RNTBD, proprietary to Microsoft)

Gateway (Meter name: Azure.Cosmos.Client.Request)

We cannot use the default HTTP metrics because these metrics need our own custom dimensions. The proposed dimensions and metrics are listed below:

Dimensions

| Tag/dimension name | Sample value |
| --- | --- |
| db.system | cosmosdb |
| db.collection.name | myCollectionName |
| db.namespace | myDatabaseName |
| server.address | myaccountname.documents.azure.com |
| server.port | 443 |
| db.operation.name | query_items |
| db.response.status_code | 200, 429, etc. |
| db.cosmosdb.sub_status_code | 1002, etc. |
| db.cosmosdb.consistency_level | Eventual, ConsistentPrefix, BoundedStaleness, Strong, or Session |
| network.protocol.name | http for gateway mode, rntbd for direct mode |
| network.protocol.host | host from http://<host>:<port> |
| network.protocol.port | port from http://<host>:<port> |
| cloud.region | name of the region the request was sent to |
| db.cosmosdb.network.response.status_code | 200, 429, etc. |
| db.cosmosdb.network.response.sub_status_code | 1002, etc. |
| db.cosmosdb.network.routing_id (opt-in) | pkrangeid (gateway mode), partitionid/replicaid (direct mode) |
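
To make the network-level rows of the table concrete, here is a minimal Python sketch of how those tags could be assembled for a single network call. It is an illustration only, not SDK code: the function `network_dimensions` and its parameters (`direct_mode`, `include_routing_id`, and so on) are hypothetical names, and the default ports are assumptions for URLs that omit a port.

```python
from urllib.parse import urlsplit


def network_dimensions(
    request_url,
    *,
    direct_mode,
    region,
    status_code,
    sub_status_code,
    routing_id=None,
    include_routing_id=False,
):
    """Build the network-level tags for one call (hypothetical helper)."""
    parts = urlsplit(request_url)
    # Assumption: fall back to the well-known port when the URL omits one.
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    dims = {
        # "http" for gateway mode, "rntbd" for direct mode, as in the table.
        "network.protocol.name": "rntbd" if direct_mode else "http",
        "network.protocol.host": parts.hostname,
        "network.protocol.port": parts.port or default_port,
        "cloud.region": region,
        "db.cosmosdb.network.response.status_code": status_code,
        "db.cosmosdb.network.response.sub_status_code": sub_status_code,
    }
    # The routing id is opt-in, so it is only attached when explicitly enabled.
    if include_routing_id and routing_id is not None:
        dims["db.cosmosdb.network.routing_id"] = routing_id
    return dims
```

Keeping the routing id behind an explicit flag mirrors the "(opt-in)" note in the table: the tag is omitted unless the caller enables it.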

Metrics

| Name | Unit | Type | Description |
| --- | --- | --- | --- |
| db.client.cosmosdb.request.duration | {seconds} | Histogram | Duration of client requests. |
| db.client.cosmosdb.request.count | {requests} | Histogram | Number of requests made. |
| db.client.cosmosdb.request.body.size | By | Histogram | Size of client request bodies. |
| db.client.cosmosdb.response.body.size | By | Histogram | Size of client response bodies. |
| db.client.cosmosdb.request.channel_aquisition.duration | {seconds} | Histogram | Duration of successfully established outbound TCP connections, i.e. channel acquisition time (for direct mode). |
| db.server.cosmosdb.request.duration db.client.cosmosdb.request.service_duration | {seconds} | Histogram | Backend latency (for direct mode). |
| db.client.cosmosdb.request.pipelined.duration | {seconds} | Histogram | Time spent in the "Pipelined" stage (for direct mode). |
| db.client.cosmosdb.request.transit.duration | {seconds} | Histogram | Time spent on the wire (for direct mode). |
| db.client.cosmosdb.request.received.duration | {seconds} | Histogram | Time spent in the "Received" stage (for direct mode). |
| db.client.cosmosdb.request.completed.duration | {seconds} | Histogram | Time spent in the "Completed" stage (for direct mode). |
| db.client.cosmosdb.request.failed.duration | {seconds} | Histogram | Time spent in the "Failed" stage (for direct mode). |
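
As a rough illustration of how one of these histograms might be fed, the sketch below times a block of code and records the elapsed seconds together with a set of dimension tags. It uses only the Python standard library. `InMemoryHistogram` and `timed` are hypothetical stand-ins invented for this example; an actual SDK would presumably record through its metrics API instead.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class InMemoryHistogram:
    """Hypothetical stand-in for a metrics-API histogram instrument."""

    def __init__(self, name, unit):
        self.name = name
        self.unit = unit
        self._points = defaultdict(list)

    def record(self, value, attributes):
        # Group measurements by their full tag set, independent of key order.
        key = tuple(sorted(attributes.items()))
        self._points[key].append(value)

    def values(self, attributes):
        # Return a copy of the measurements recorded for this exact tag set.
        key = tuple(sorted(attributes.items()))
        return list(self._points.get(key, []))


@contextmanager
def timed(histogram, attributes):
    """Record the wall-clock duration of the wrapped block, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed calls are measured too.
        histogram.record(time.perf_counter() - start, attributes)


request_duration = InMemoryHistogram(
    "db.client.cosmosdb.request.duration", "{seconds}"
)
```

A call site would then wrap each network call, for example `with timed(request_duration, {"network.protocol.name": "http"}): ...`, so that every measurement carries the dimensions proposed above.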

Describe alternatives you've considered

No response

Additional context

Reference: Java SDK metrics: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos/docs/Metrics.md
