
DefaultCollector: MemoryAvailable metric reports wrong value and fails on Windows CI #7803

@Aaronontheweb

Description

The MetricsCollectorSpec.MetricsCollector_should_collector_accurate_metrics_for_node test is consistently failing on Windows CI due to timeout issues. Investigation reveals several underlying problems with the DefaultCollector implementation that need to be addressed.

Problems Identified

1. Incorrect semantic meaning of MemoryAvailable metric

The MemoryAvailable metric is calculated incorrectly:

var availableMemory = NodeMetrics.Types.Metric.Create(StandardMetrics.MemoryAvailable, process.WorkingSet64 + process.PagedMemorySize64);

This calculates the memory used by the process (WorkingSet64 + PagedMemorySize64), not the available system memory. This is semantically incorrect and misleading.
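
To make the distinction concrete, here is a minimal sketch that reports the two quantities separately. It assumes .NET Core 3.0 or later, where GC.GetGCMemoryInfo() is available; EstimateAvailableBytes is a hypothetical helper, not part of Akka.NET, and the GC figures are only an approximation of free system memory (they reflect the runtime's view, typically as of the most recent garbage collection, and respect container limits).

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not an Akka.NET API): available = total - load, clamped at zero.
static long EstimateAvailableBytes(long totalBytes, long loadBytes) =>
    Math.Max(0L, totalBytes - loadBytes);

using var process = Process.GetCurrentProcess();

// What the current expression measures: memory consumed by this process.
long processUsedBytes = process.WorkingSet64 + process.PagedMemorySize64;

// One possible approximation of memory still available to the runtime.
// Assumption: the GC's view is close enough for a cluster metric.
GCMemoryInfo gcInfo = GC.GetGCMemoryInfo();
long availableBytes = EstimateAvailableBytes(
    gcInfo.TotalAvailableMemoryBytes,
    gcInfo.MemoryLoadBytes);

Console.WriteLine($"process used:        {processUsedBytes} bytes");
Console.WriteLine($"estimated available: {availableBytes} bytes");
```

Whether this approximation is acceptable, or whether the metric should simply be renamed, is a design decision for the maintainers.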

2. No error handling for invalid process memory values

On Windows CI environments with restricted memory access, process.PagedMemorySize64 may return 0 or negative values, causing metric creation to fail silently:

}

/// <summary>
/// Internal usage
/// </summary>
[InternalApi]
public static Either<long, double> ConvertNumber(AnyNumber number)
{
    switch (number.Type)
    {

The Defined() method rejects negative values, but there's no fallback when process memory properties fail.
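
A fallback along these lines might look like the sketch below. FirstPositive is a hypothetical helper, and the order of the fallback candidates (working set, private memory, managed heap size) is an assumption made for illustration, not the behavior of DefaultCollector.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: return the first strictly positive candidate, or null if none is usable.
static long? FirstPositive(params long[] candidates)
{
    foreach (long value in candidates)
    {
        if (value > 0) return value;
    }
    return null;
}

using var process = Process.GetCurrentProcess();
long workingSet = process.WorkingSet64;
long paged = process.PagedMemorySize64;

// Use the original sum only when both components look valid;
// otherwise fall back to progressively coarser measurements.
long? usedBytes = workingSet > 0 && paged > 0
    ? workingSet + paged
    : FirstPositive(
        workingSet,
        process.PrivateMemorySize64,
        GC.GetTotalMemory(forceFullCollection: false));

Console.WriteLine(usedBytes is long bytes
    ? $"process memory: {bytes} bytes"
    : "process memory: unavailable");
```

Returning null instead of a sentinel value lets the caller decide whether to skip the metric or report it as missing.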

3. Test timeout set exactly at xUnit limit

The test timeout was increased to 60 seconds in PR #7798, which exactly matches xUnit's longRunningTestSeconds threshold:

"longRunningTestSeconds": 60,

This leaves no margin for error and causes immediate test failure if metric collection takes slightly longer.

4. Synchronous Thread.Sleep on first sample

The DefaultCollector blocks for 500ms on the first CPU sample.

While this alone shouldn't cause a 60-second timeout, it contributes to the problem when combined with retry logic.
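
For comparison, a non-blocking first sample could be sketched as follows. SampleCpuAsync is a hypothetical name, the 500ms window simply mirrors the delay described above, and the calculation (process CPU time divided by wall-clock time across all cores) is one common approach rather than the collector's actual formula.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: measure this process's CPU usage over a window
// without blocking the calling thread.
static async Task<double> SampleCpuAsync(TimeSpan window, CancellationToken ct = default)
{
    using var process = Process.GetCurrentProcess();
    TimeSpan cpuBefore = process.TotalProcessorTime;
    var wallClock = Stopwatch.StartNew();

    // Task.Delay yields the thread; Thread.Sleep would hold it for the whole window.
    await Task.Delay(window, ct);

    process.Refresh(); // discard cached process counters before reading again
    double cpuMs = (process.TotalProcessorTime - cpuBefore).TotalMilliseconds;
    double wallMs = wallClock.Elapsed.TotalMilliseconds;

    // Fraction of total CPU capacity used by this process, clamped to [0, 1].
    return Math.Clamp(cpuMs / (wallMs * Environment.ProcessorCount), 0.0, 1.0);
}

double cpu = await SampleCpuAsync(TimeSpan.FromMilliseconds(500));
Console.WriteLine($"cpu fraction: {cpu:F3}");
```

The cancellation token gives a test a way to abandon the sample early instead of waiting out the full window.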

Reproduction

The issue manifests specifically on Windows CI runners but not on Linux or local Windows environments, suggesting it's related to constrained/virtualized environments where process memory APIs behave differently.

Error from PR #7801:

System.Exception : Failed to initialize test data
---- System.Threading.Tasks.TaskCanceledException : A task was canceled.
   at Akka.Cluster.Metrics.Tests.MetricsCollectorSpec.MetricsCollector_should_collector_accurate_metrics_for_node() in D:\a\1\s\src\contrib\cluster\Akka.Cluster.Metrics.Tests\MetricsCollectorSpec.cs:line 64

Proposed Solutions

  1. Fix MemoryAvailable calculation: Either rename the metric to reflect what it actually measures, or implement proper available system memory detection
  2. Add resilient fallbacks: When PagedMemorySize64 returns invalid values, use alternative memory measurements
  3. Reduce test timeout: Set to 10-15 seconds to stay well below xUnit's limit
  4. Make metrics optional: Allow the collector to return partial metrics rather than failing completely
  5. Add diagnostics: Include detailed logging about which metrics failed and why
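
Proposals 4 and 5 can be combined in one collection loop. The sketch below is illustrative only: CollectPartial and the metric names are hypothetical and not part of Akka.Cluster.Metrics, and a real implementation would write to the actor system's logger rather than the console.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: read every source, keep the values that succeed,
// and record a diagnostic message for each one that fails.
static (Dictionary<string, long> Collected, List<string> Failures) CollectPartial(
    IEnumerable<(string Name, Func<long> Read)> sources)
{
    var collected = new Dictionary<string, long>();
    var failures = new List<string>();

    foreach (var (name, read) in sources)
    {
        try
        {
            long value = read();
            if (value < 0)
                failures.Add($"{name}: negative value {value}");
            else
                collected[name] = value;
        }
        catch (Exception ex)
        {
            failures.Add($"{name}: {ex.GetType().Name}: {ex.Message}");
        }
    }

    return (collected, failures);
}

// Illustrative sources: one that works, one invalid, one that throws.
var sources = new (string Name, Func<long> Read)[]
{
    ("memory-used", () => GC.GetTotalMemory(false)),
    ("always-negative", () => -1L),
    ("always-throws", () => throw new InvalidOperationException("simulated failure")),
};

var (metrics, problems) = CollectPartial(sources);
foreach (string problem in problems)
    Console.WriteLine($"skipped metric {problem}");
Console.WriteLine($"collected {metrics.Count} metric(s)");
```

With this shape, a single unreadable property degrades the sample instead of failing it, and the failure list records which metric was dropped and why.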

Impact

This issue causes consistent CI failures on Windows, blocking PR merges and reducing confidence in the metrics collection system.
