-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Description
Description
The MetricsCollectorSpec.MetricsCollector_should_collector_accurate_metrics_for_node test is consistently failing on Windows CI due to timeout issues. Investigation reveals several underlying problems with the DefaultCollector implementation that need to be addressed.
Problems Identified
1. Incorrect semantic meaning of MemoryAvailable metric
The MemoryAvailable metric is calculated incorrectly:
| var availableMemory = NodeMetrics.Types.Metric.Create(StandardMetrics.MemoryAvailable, process.WorkingSet64 + process.PagedMemorySize64); |
var availableMemory = NodeMetrics.Types.Metric.Create(StandardMetrics.MemoryAvailable, process.WorkingSet64 + process.PagedMemorySize64);This calculates the memory used by the process (WorkingSet64 + PagedMemorySize64), not the available system memory. This is semantically incorrect and misleading.
2. No error handling for invalid process memory values
On Windows CI environments with restricted memory access, process.PagedMemorySize64 may return 0 or negative values, causing metric creation to fail silently:
akka.net/src/contrib/cluster/Akka.Cluster.Metrics/Serialization/Metric.cs
Lines 141 to 150 in 3abeec2
| } | |
| /// <summary> | |
| /// Internal usage | |
| /// </summary> | |
| [InternalApi] | |
| public static Either<long, double> ConvertNumber(AnyNumber number) | |
| { | |
| switch (number.Type) | |
| { |
The Defined() method rejects negative values, but there's no fallback when process memory properties fail.
3. Test timeout set exactly at xUnit limit
The test timeout was increased to 60 seconds in PR #7798, which exactly matches xUnit's longRunningTestSeconds threshold:
akka.net/src/xunit.runner.json
Line 3 in 3abeec2
| "longRunningTestSeconds": 60, |
This leaves no margin for error and causes immediate test failure if metric collection takes slightly longer.
4. Synchronous Thread.Sleep on first sample
The DefaultCollector blocks for 500ms on the first CPU sample:
akka.net/src/contrib/cluster/Akka.Cluster.Metrics/Collectors/DefaultCollector.cs
Line 120 in 3abeec2
| Thread.Sleep(500); |
While this alone shouldn't cause a 60-second timeout, it contributes to the problem when combined with retry logic.
Reproduction
The issue manifests specifically on Windows CI runners but not on Linux or local Windows environments, suggesting it's related to constrained/virtualized environments where process memory APIs behave differently.
Error from PR #7801:
System.Exception : Failed to initialize test data
---- System.Threading.Tasks.TaskCanceledException : A task was canceled.
at Akka.Cluster.Metrics.Tests.MetricsCollectorSpec.MetricsCollector_should_collector_accurate_metrics_for_node() in D:\a\1\s\src\contrib\cluster\Akka.Cluster.Metrics.Tests\MetricsCollectorSpec.cs:line 64
Proposed Solutions
- Fix MemoryAvailable calculation: Either rename the metric to reflect what it actually measures, or implement proper available system memory detection
- Add resilient fallbacks: When
PagedMemorySize64returns invalid values, use alternative memory measurements - Reduce test timeout: Set to 10-15 seconds to stay well below xUnit's limit
- Make metrics optional: Allow the collector to return partial metrics rather than failing completely
- Add diagnostics: Include detailed logging about which metrics failed and why
Impact
This issue causes consistent CI failures on Windows, blocking PR merges and reducing confidence in the metrics collection system.
Related PRs
- Fix metrics specs race conditions #7798 - Initial attempt to fix race conditions (increased timeout to 60s)
- Fix race condition in
ClusterLogVerboseDefaultSpectest #7801 - Where the issue continues to manifest