Description
Area(s)
area:db
Is your change request related to a problem? Please describe.
In the Cosmos DB SDK, a single operation involves several network calls. Currently, if something goes wrong (e.g., high latency) with these network calls, customers rely solely on the logs they have implemented in their applications. When investigating such issues, we are dependent on the information provided by the customer and backend telemetry. To improve monitoring and make it more aligned with potential errors, I am proposing a set of metrics that the SDK should collect to enhance observability.
Describe the solution you'd like
Below is the proposed list of metrics for network calls. The SDK makes two kinds of network calls:
- Gateway (i.e. HTTP)
- TCP (i.e. RNTBD, proprietary to Microsoft)
Gateway (Meter name: Azure.Cosmos.Client.Request)
We cannot use the default HTTP metrics because we need custom dimensions on these metrics. Below are the proposed metrics and their dimensions:
Dimensions
| Tag/dimension name | Sample value |
|---|---|
| db.system | cosmosdb |
| db.collection.name | myCollectionName |
| db.namespace | myDatabaseName |
| server.address | myaccountname.documents.azure.com |
| server.port | 443 |
| db.operation.name | query_items |
| db.response.status_code | 200 or 429 etc. |
| db.cosmosdb.sub_status_code | 1002 etc. |
| db.cosmosdb.consistency_level | Eventual, ConsistentPrefix, BoundedStaleness, Strong or Session |
| network.protocol.name | http for gateway mode, rntbd for direct mode |
| network.protocol.host | host from http://<host>:<port> |
| network.protocol.port | port from http://<host>:<port> |
| cloud.region | region name, where request was sent |
| db.cosmosdb.network.response.status_code | 200 or 429 etc. |
| db.cosmosdb.network.response.sub_status_code | 1002 etc. |
| db.cosmosdb.network.routing_id (opt-in) | pkrangeid (gateway mode), partitionid/replicaid (direct mode) |
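To make the dimension set concrete, here is a minimal sketch of how the tags for one network call could be assembled. It uses only the Python standard library and is not SDK code: `NetworkCall`, `build_dimensions`, and all field names are hypothetical, and only the tag names and sample values come from the table above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union

TagValue = Union[str, int]


@dataclass(frozen=True)
class NetworkCall:
    """Hypothetical record of one network call made for an operation."""
    mode: str                     # "gateway" or "direct"
    account_host: str             # e.g. myaccountname.documents.azure.com
    account_port: int
    database: str
    collection: str
    operation: str                # e.g. query_items
    status_code: int              # operation-level status
    sub_status_code: int
    consistency_level: str
    network_host: str             # host the call was actually sent to
    network_port: int
    network_status_code: int      # status of this specific network call
    network_sub_status_code: int
    region: str
    routing_id: Optional[str] = None


def build_dimensions(call: NetworkCall, include_routing_id: bool = False) -> Dict[str, TagValue]:
    """Map a network call onto the proposed tag names."""
    if call.mode not in ("gateway", "direct"):
        raise ValueError(f"unknown connection mode: {call.mode!r}")
    dims: Dict[str, TagValue] = {
        "db.system": "cosmosdb",
        "db.collection.name": call.collection,
        "db.namespace": call.database,
        "server.address": call.account_host,
        "server.port": call.account_port,
        "db.operation.name": call.operation,
        "db.response.status_code": call.status_code,
        "db.cosmosdb.sub_status_code": call.sub_status_code,
        "db.cosmosdb.consistency_level": call.consistency_level,
        # http for gateway mode, rntbd for direct mode
        "network.protocol.name": "http" if call.mode == "gateway" else "rntbd",
        "network.protocol.host": call.network_host,
        "network.protocol.port": call.network_port,
        "cloud.region": call.region,
        "db.cosmosdb.network.response.status_code": call.network_status_code,
        "db.cosmosdb.network.response.sub_status_code": call.network_sub_status_code,
    }
    # The routing id is opt-in, so it is only attached when explicitly requested.
    if include_routing_id and call.routing_id is not None:
        dims["db.cosmosdb.network.routing_id"] = call.routing_id
    return dims
```

Keeping the routing id behind a flag mirrors the opt-in marker in the table; the other tags are always emitted in this sketch.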
Metrics
| Name | Unit | Type | Description |
|---|---|---|---|
| db.client.cosmosdb.request.duration | {seconds} | Histogram | Duration of client requests. |
| db.client.cosmosdb.request.count | {requests} | Histogram | Number of requests made. |
| db.client.cosmosdb.request.body.size | By | Histogram | Size of client request bodies. |
| db.client.cosmosdb.response.body.size | By | Histogram | Size of client response bodies. |
| db.client.cosmosdb.request.channel_aquisition.duration | {seconds} | Histogram | Duration of successfully established outbound TCP connections, i.e. channel acquisition time (for direct mode). |
| db.server.cosmosdb.request.duration / db.client.cosmosdb.request.service_duration | {seconds} | Histogram | Backend latency (for direct mode). |
| db.client.cosmosdb.request.pipelined.duration | {seconds} | Histogram | Time spent in the "Pipelined" stage (for direct mode). |
| db.client.cosmosdb.request.transit.duration | {seconds} | Histogram | Time spent on the wire (for direct mode). |
| db.client.cosmosdb.request.received.duration | {seconds} | Histogram | Time spent in the "Received" stage (for direct mode). |
| db.client.cosmosdb.request.completed.duration | {seconds} | Histogram | Time spent in the "Completed" stage (for direct mode). |
| db.client.cosmosdb.request.failed.duration | {seconds} | Histogram | Time spent in the "Failed" stage (for direct mode). |
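As an illustration of how these histograms might behave, the sketch below records durations into explicit buckets, with one series per distinct set of dimensions. It is a standard-library stand-in, not the OpenTelemetry API or any SDK implementation: `InMemoryHistogram` and the bucket bounds are hypothetical, and only the metric names and units are taken from the table above.

```python
from bisect import bisect_left
from typing import Dict, List, Mapping, Sequence, Tuple, Union

TagValue = Union[str, int]
SeriesKey = Tuple[Tuple[str, TagValue], ...]


class InMemoryHistogram:
    """Hypothetical explicit-bucket histogram keyed by dimension set."""

    def __init__(self, name: str, unit: str, bounds: Sequence[float]) -> None:
        self.name = name
        self.unit = unit
        self.bounds = sorted(bounds)
        self._counts: Dict[SeriesKey, List[int]] = {}
        self._sums: Dict[SeriesKey, float] = {}

    @staticmethod
    def _key(attributes: Mapping[str, TagValue]) -> SeriesKey:
        # Sort by tag name so the same dimensions always map to the same series.
        return tuple(sorted(attributes.items()))

    def record(self, value: float, attributes: Mapping[str, TagValue]) -> None:
        key = self._key(attributes)
        counts = self._counts.setdefault(key, [0] * (len(self.bounds) + 1))
        # Upper-inclusive buckets: a value equal to a bound lands in that bucket;
        # values above the last bound land in the final overflow bucket.
        counts[bisect_left(self.bounds, value)] += 1
        self._sums[key] = self._sums.get(key, 0.0) + value

    def snapshot(self, attributes: Mapping[str, TagValue]) -> Tuple[List[int], float]:
        key = self._key(attributes)
        empty = [0] * (len(self.bounds) + 1)
        return list(self._counts.get(key, empty)), self._sums.get(key, 0.0)


# Hypothetical bucket bounds, in seconds.
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25]

transit = InMemoryHistogram("db.client.cosmosdb.request.transit.duration", "{seconds}", BOUNDS)

direct_dims = {"network.protocol.name": "rntbd", "db.operation.name": "query_items"}
transit.record(0.004, direct_dims)  # fast call
transit.record(0.2, direct_dims)    # slow call, e.g. a high-latency investigation target
```

Because each stage has its own histogram, a latency spike can be attributed to a specific stage (for example transit versus backend) without relying on application logs.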
Describe alternatives you've considered
No response
Additional context
Ref. Java SDK metrics: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos/docs/Metrics.md