Commit
OpenTelemetry Metrics: Adds support to collect Operation level metrics (#4682)

## Description
1. Added a new flag, `IsClientMetricsEnabled`, to `CosmosClientTelemetryOptions` to enable or disable metrics. It is disabled by default. (Inspired by the Java SDK: https://github.com/Azure/azure-sdk-for-java/blob/5bc07ca75c7c0520c1098b5a6264258b6e043435/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/models/CosmosClientTelemetryConfig.java#L61)
2. If enabled, the metrics below are collected.

ref. open-telemetry/semantic-conventions#1438
ref. Java Metric Documentation, https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos/docs/Metrics.md
ref. Discussion with the OpenTelemetry community, open-telemetry/semantic-conventions#1438

This PR has **Contract Changes**.

## Perf Testing
I haven't observed any performance impact from this change. In this feature, any performance issues would likely stem from the Exporter implementation or the aggregation interval. The tests were conducted using a no-op exporter subscribed to these meters, to isolate any performance impact specifically related to data recording.
![image](https://github.com/user-attachments/assets/1a0ab16a-7b1b-44fb-a0ff-2eacd87d2d93)

## Type of change
- [] New feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kiran Kumar Kolli <[email protected]>
Co-authored-by: Debdatta Kunda <[email protected]>
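As a rough illustration of item 1 in the description, the flag could be switched on when the client is constructed. This is only a sketch: the endpoint and key strings are placeholders, and it assumes an SDK build that already contains `IsClientMetricsEnabled` (the description notes contract changes, so the flag may not be present in every released package).

```csharp
using Microsoft.Azure.Cosmos;

// Sketch only: opt in to client metrics, which are disabled by default.
CosmosClientOptions options = new CosmosClientOptions
{
    CosmosClientTelemetryOptions = new CosmosClientTelemetryOptions
    {
        // Flag introduced by this commit.
        IsClientMetricsEnabled = true
    }
};

// Placeholder credentials (assumptions, not real values).
using CosmosClient client = new CosmosClient(
    "<account-endpoint>",
    "<account-key>",
    options);
```

Nothing is exported by the flag alone; a listener or exporter still has to subscribe to the operation meter.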
1 parent d7169f4, commit 9fa85ee
Showing 29 changed files with 1,377 additions and 164 deletions.
143 changes: 143 additions & 0 deletions
Microsoft.Azure.Cosmos/src/Telemetry/OpenTelemetry/CosmosDbClientMetrics.cs
@@ -0,0 +1,143 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// ------------------------------------------------------------

namespace Microsoft.Azure.Cosmos
{
    /// <summary>
    /// The CosmosDbClientMetrics class provides constants related to OpenTelemetry metrics for Azure Cosmos DB.
    /// These metrics are useful for tracking various aspects of Cosmos DB client operations and are compliant with the OpenTelemetry Semantic Conventions.
    /// It defines standardized names, units, descriptions, and histogram buckets for measuring and monitoring performance through OpenTelemetry.
    /// </summary>
    public sealed class CosmosDbClientMetrics
    {
        /// <summary>
        /// Constants for operation-level metrics
        /// </summary>
        public static class OperationMetrics
        {
            /// <summary>
            /// Name of the operation meter
            /// </summary>
            public const string MeterName = "Azure.Cosmos.Client.Operation";

            /// <summary>
            /// Version of the operation meter
            /// </summary>
            public const string Version = "1.0.0";

            /// <summary>
            /// Metric Names
            /// </summary>
            public static class Name
            {
                /// <summary>
                /// Total request units per operation (sum of RUs for all requests needed when processing an operation)
                /// </summary>
                public const string RequestCharge = "db.client.cosmosdb.operation.request_charge";

                /// <summary>
                /// Total end-to-end duration of the operation
                /// </summary>
                public const string Latency = "db.client.operation.duration";

                /// <summary>
                /// For feed operations (query, readAll, readMany, change feed) and batch operations, this meter captures the actual item count in responses from the service.
                /// </summary>
                public const string RowCount = "db.client.response.row_count";

                /// <summary>
                /// Number of active SDK client instances.
                /// </summary>
                public const string ActiveInstances = "db.client.cosmosdb.active_instance.count";
            }

            /// <summary>
            /// Unit for metrics
            /// </summary>
            public static class Unit
            {
                /// <summary>
                /// Unit representing a simple count
                /// </summary>
                public const string Count = "#";

                /// <summary>
                /// Unit representing time in seconds
                /// </summary>
                public const string Sec = "s";

                /// <summary>
                /// Unit representing request units
                /// </summary>
                public const string RequestUnit = "# RU";
            }

            /// <summary>
            /// Provides descriptions for metrics.
            /// </summary>
            public static class Description
            {
                /// <summary>
                /// Description for operation duration
                /// </summary>
                public const string Latency = "Total end-to-end duration of the operation";

                /// <summary>
                /// Description for total request units per operation
                /// </summary>
                public const string RequestCharge = "Total request units per operation (sum of RUs for all requested needed when processing an operation)";

                /// <summary>
                /// Description for the item count metric in responses
                /// </summary>
                public const string RowCount = "For feed operations (query, readAll, readMany, change feed) batch operations this meter capture the actual item count in responses from the service";

                /// <summary>
                /// Description for the active SDK client instances metric
                /// </summary>
                public const string ActiveInstances = "Number of active SDK client instances.";
            }
        }

        /// <summary>
        /// Explicit histogram bucket boundaries for the operation metrics
        /// </summary>
        public static class HistogramBuckets
        {
            /// <summary>
            /// ExplicitBucketBoundaries for "db.client.cosmosdb.operation.request_charge" Metrics
            /// </summary>
            /// <remarks>
            /// <b>1, 5, 10</b>: Low Usage Levels. These smaller buckets allow for precise tracking of operations that consume a minimal number of Request Units. This is important for lightweight operations, such as basic read requests or small queries, where resource utilization should be optimized. Monitoring these low usage levels can help ensure that the application is not inadvertently using more resources than necessary.<br></br>
            /// <b>25, 50</b>: Moderate Usage Levels. These ranges capture more moderate operations, which are typical in many applications. For example, queries that return a reasonable amount of data or perform standard CRUD operations may fall within these limits. Identifying usage patterns in these buckets can help detect efficiency issues in routine operations.<br></br>
            /// <b>100, 250</b>: Higher Usage Levels. These boundaries represent operations that may require significant resources, such as complex queries or larger transactions. Monitoring RUs in these ranges can help identify performance bottlenecks or costly queries that might lead to throttling.<br></br>
            /// <b>500, 1000</b>: Very High Usage Levels. These buckets capture operations that consume a substantial number of Request Units, which may indicate potentially expensive queries or batch processes. Understanding the frequency and patterns of such high RU usage can be critical in optimizing performance and ensuring the application remains within provisioned throughput limits.
            /// </remarks>
            public static readonly double[] RequestUnitBuckets = new double[] { 1, 5, 10, 25, 50, 100, 250, 500, 1000 };

            /// <summary>
            /// ExplicitBucketBoundaries for "db.client.operation.duration" Metrics.
            /// </summary>
            /// <remarks>
            /// <b>0.001, 0.005, 0.010</b> seconds: Higher Precision at Low-Millisecond Levels. These values (1 ms to 10 ms) suit high-performance workloads, especially microservices or low-latency queries.<br></br>
            /// <b>0.050, 0.100, 0.200</b> seconds: Granularity for Standard Web Applications. These values allow detailed tracking for latencies between 50ms and 200ms, which are common in web applications. Fine-grained buckets in this range help in identifying performance issues before they grow critical, while covering the typical response times expected in Cosmos DB.<br></br>
            /// <b>0.500, 1.000</b> seconds: Wider Range for Intermediate Latencies. Operations that take longer, in the range of 500ms to 1 second, are still important for performance monitoring. By capturing these values, you maintain awareness of potential bottlenecks or slower requests that may need optimization.<br></br>
            /// <b>2.000, 5.000</b> seconds: Capturing Outliers and Slow Queries. It’s important to track outliers that might go beyond 1 second. Having buckets for 2 and 5 seconds enables identification of rare, long-running operations that may require further investigation.
            /// </remarks>
            public static readonly double[] RequestLatencyBuckets = new double[] { 0.001, 0.005, 0.010, 0.050, 0.100, 0.200, 0.500, 1.000, 2.000, 5.000 };

            /// <summary>
            /// ExplicitBucketBoundaries for "db.client.response.row_count" Metrics
            /// </summary>
            /// <remarks>
            /// <b>1, 10, 50, 100</b>: Small Response Sizes. These buckets are useful for capturing scenarios where only a small number of items are returned. Such small queries are common in real-time or interactive applications where performance and quick responses are critical. They also help in tracking the efficiency of operations that should return minimal data, minimizing resource usage.<br></br>
            /// <b>250, 500, 1000</b>: Moderate Response Sizes. These values represent typical workloads where moderate amounts of data are returned in each query. This is useful for applications that need to return more information, such as data analytics or reporting systems. Tracking these ranges helps identify whether the system is optimized for these relatively larger data sets and if they lead to any performance degradation.<br></br>
            /// <b>2000, 5000</b>: Larger Response Sizes. These boundaries capture scenarios where the query returns large datasets, often used in batch processing or in-depth analytical queries. These larger page sizes can potentially increase memory or CPU usage and may lead to longer query execution times, making it important to track performance in these ranges.<br></br>
            /// <b>10000</b>: Very Large Response Sizes (Outliers). This boundary is included to capture rare, very large response sizes. Such queries can put significant strain on system resources, including memory, CPU, and network bandwidth, and can often lead to performance issues such as high latency or even network drops.
            /// </remarks>
            public static readonly double[] RowCountBuckets = new double[] { 1, 10, 50, 100, 250, 500, 1000, 2000, 5000, 10000 };
        }
    }
}
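To make the role of the meter name and the bucket boundaries concrete, here is a minimal sketch that uses only the .NET base class library (`System.Diagnostics.Metrics`, .NET 6 or later). It is not how the SDK or an OpenTelemetry exporter is implemented. The constants are local copies of the values in the diff above. The stand-in `Meter` that records three made-up latencies is an assumption for illustration, in place of real Cosmos DB operations. Measurements are assigned to buckets with upper-inclusive boundaries, which is the usual OpenTelemetry explicit-bucket convention.

```csharp
using System;
using System.Diagnostics.Metrics;

// Local copies of constants from CosmosDbClientMetrics (see the diff above).
const string MeterName = "Azure.Cosmos.Client.Operation";
const string LatencyName = "db.client.operation.duration";
double[] latencyBuckets = { 0.001, 0.005, 0.010, 0.050, 0.100, 0.200, 0.500, 1.000, 2.000, 5.000 };

// counts[i] holds measurements <= latencyBuckets[i] (and above the previous
// boundary); the extra last slot is the overflow bucket (> 5 seconds).
long[] counts = new long[latencyBuckets.Length + 1];

using MeterListener listener = new MeterListener();

// Subscribe only to the operation-duration histogram of the operation meter.
listener.InstrumentPublished = (instrument, meterListener) =>
{
    if (instrument.Meter.Name == MeterName && instrument.Name == LatencyName)
    {
        meterListener.EnableMeasurementEvents(instrument);
    }
};

// Place each recorded duration into its bucket.
listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
{
    int index = Array.BinarySearch(latencyBuckets, value);
    if (index < 0)
    {
        index = ~index; // index of the first boundary greater than value
    }
    counts[index]++;
});
listener.Start();

// Hypothetical stand-in for the SDK: a meter with the same name that records
// a few invented durations, in seconds.
using Meter meter = new Meter(MeterName, "1.0.0");
Histogram<double> histogram = meter.CreateHistogram<double>(
    LatencyName, "s", "Total end-to-end duration of the operation");
histogram.Record(0.003);
histogram.Record(0.120);
histogram.Record(7.5);

for (int i = 0; i < counts.Length; i++)
{
    string upperBound = i < latencyBuckets.Length ? latencyBuckets[i].ToString("0.000") : "+Inf";
    Console.WriteLine($"<= {upperBound} s: {counts[i]}");
}
```

In a real application the same meter name would more likely be passed to an OpenTelemetry meter provider with a histogram view configured from `RequestLatencyBuckets`. The listener here only keeps the example dependency-free.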