@TimothyMothra TimothyMothra commented Jun 16, 2022

Fix Issue #1186.

This PR mitigates the potential infinite while loop in MetricValuesBuffer by adding an exit condition.

Changes

  • added an exit condition to MetricValuesBuffer.GetAndResetValue().

Explanation

MetricValuesBuffer has a section that could get stuck in an infinite loop here:

value = this.GetAndResetValueOnce(this.values, index);
while (this.IsInvalidValue(value))
{
    spinWait.SpinOnce();
    if (spinWait.Count % 100 == 0)
    {
        // In tests (including stress tests) we always finished waiting before 100 cycles.
        // However, this is a protection against an extreme case on a slow machine.
        Task.Delay(10).ConfigureAwait(continueOnCapturedContext: false).GetAwaiter().GetResult();
    }
    value = this.GetAndResetValueOnce(this.values, index);
}

GetAndResetValueOnce reads the value at an index in the buffer and resets that slot to Double.NaN.
The loop has no way to break out if another thread has already reset the value.

This would also affect the lock in MetricSeriesAggregatorBase:

lock (buffer)
{
    int maxFlushIndex = Math.Min(buffer.PeekLastWriteIndex(), buffer.Capacity - 1);
    int minFlushIndex = buffer.NextFlushIndex;
    if (minFlushIndex > maxFlushIndex)
    {
        return;
    }
    stage1Result = this.UpdateAggregate_Stage1(buffer, minFlushIndex, maxFlushIndex);
    buffer.NextFlushIndex = maxFlushIndex + 1;
}
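
To make the exit condition concrete, here is a minimal sketch of a bounded spin loop. It is not the code merged in this PR: the class BoundedResetBuffer, the method TryGetAndResetValue, and the limit MaxSpinCycles are hypothetical names chosen for illustration, and the sketch assumes that Double.NaN is the marker for an empty slot.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the buffer discussed above; not the SDK's actual class.
public sealed class BoundedResetBuffer
{
    // Assumed limit, chosen only for illustration.
    private const int MaxSpinCycles = 100;

    private readonly double[] values;

    public BoundedResetBuffer(int capacity)
    {
        this.values = new double[capacity];
        for (int i = 0; i < capacity; i++)
        {
            this.values[i] = double.NaN; // assumption: NaN marks an empty slot
        }
    }

    public void Write(int index, double value)
    {
        Interlocked.Exchange(ref this.values[index], value);
    }

    // Returns false when the slot stays empty for MaxSpinCycles spins,
    // so the caller can skip this one value instead of waiting forever.
    public bool TryGetAndResetValue(int index, out double value)
    {
        var spinWait = new SpinWait();
        value = this.GetAndResetValueOnce(index);
        while (double.IsNaN(value))
        {
            if (spinWait.Count >= MaxSpinCycles)
            {
                return false; // exit condition: give up on this slot
            }

            spinWait.SpinOnce();
            value = this.GetAndResetValueOnce(index);
        }

        return true;
    }

    // Atomically reads the slot and resets it to NaN.
    private double GetAndResetValueOnce(int index)
    {
        return Interlocked.Exchange(ref this.values[index], double.NaN);
    }
}
```

With this shape, a caller that finds a slot already reset by another thread gets false back after a bounded wait and can move on to the next index.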

Alternatives considered

My previous PR, #2595, replaced the lock in MetricSeriesAggregatorBase.

However, the two approaches differ in what is lost when we break out.
Breaking out of the lock in MetricSeriesAggregatorBase may cause the SDK to lose a complete batch of metrics.
Breaking out of the loop in MetricValuesBuffer drops only a single metric value, so this PR takes that approach.
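
The difference in what gets lost can be sketched as follows. This is an illustrative model only, not the SDK's aggregation code: FlushSketch and FlushSkippingEmptySlots are hypothetical names, and the sketch assumes a slot that could not be read shows up as Double.NaN.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of a flush over a range of buffer slots.
public static class FlushSketch
{
    // Aggregates every readable slot in [minFlushIndex, maxFlushIndex] and
    // counts the slots that had to be skipped, instead of abandoning the batch.
    public static (double Sum, int Count, int Dropped) FlushSkippingEmptySlots(
        IReadOnlyList<double> slots, int minFlushIndex, int maxFlushIndex)
    {
        double sum = 0;
        int count = 0;
        int dropped = 0;

        for (int i = minFlushIndex; i <= maxFlushIndex; i++)
        {
            double value = slots[i];
            if (double.IsNaN(value))
            {
                dropped++; // only this one value is lost
                continue;
            }

            sum += value;
            count++;
        }

        return (sum, count, dropped);
    }
}
```

In this model, one unreadable slot costs one value, while the remaining values in the range still reach the aggregate.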

Checklist

  • I ran Unit Tests locally.
  • CHANGELOG.md updated with a one-line description of the fix, and a link to the original issue if available.

For significant contributions please make sure you have completed the following items:

  • Design discussion issue #
  • Changes in public surface reviewed

The PR will trigger build, unit tests, and functional tests automatically. Please follow these instructions to build and test locally.

Notes for authors:

  • FxCop and other analyzers will fail the build. To see these errors yourself, compile locally using the Release configuration.

Notes for reviewers:

  • We support comment build triggers
    • /AzurePipelines run will queue all builds
    • /AzurePipelines run <pipeline-name> will queue a specific build

@TimothyMothra TimothyMothra requested a review from cijothomas June 16, 2022 00:08
@TimothyMothra TimothyMothra marked this pull request as ready for review June 16, 2022 20:40
@TimothyMothra TimothyMothra changed the title fix Metric deadlock by replacing potential infinate loop in MetricValuesBuffer.GetAndResetValue fix Metric livelock by replacing potential infinate loop in MetricValuesBuffer.GetAndResetValue Jun 16, 2022
Contributor

@cijothomas cijothomas left a comment

LGTM. This needs some validation before it goes into a stable release.

@TimothyMothra TimothyMothra enabled auto-merge (squash) June 17, 2022 18:37
@TimothyMothra TimothyMothra merged commit e9d4974 into main Jun 17, 2022
@TimothyMothra TimothyMothra deleted the tilee/1186_GetAndResetValue branch June 17, 2022 19:03