Skip to content

Conversation

@vwxyzh
Copy link
Contributor

@vwxyzh vwxyzh commented Oct 18, 2018

When app server failed to handshake with service, it will stop connect.
But the app server will running to a strange state:

  1. app server is alive
  2. app server never connect to service, even issues in service are gone.
  3. customer must restart the app server to reconnect with service, and in most case customer have to restart it manually.

Solutions:

  1. stop app server when all server connections are broken.
    Big impact when service has some temporary issue.
  2. retry to connect service for ever.

}
}

private static TimeSpan GetRetryDelay(ref int retryCount)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment explaining that this is exponential back off up to 5.

@davidfowl
Copy link
Member

When would this happen?

@vwxyzh
Copy link
Contributor Author

vwxyzh commented Oct 19, 2018

Handshake failed.

@davidfowl
Copy link
Member

Sure, when does that happen?

@davidfowl
Copy link
Member

BTW I think this change is good I just want to understand the reasoning. Is it pre-emptive or did we see issues here?

@vicancy
Copy link
Member

vicancy commented Oct 19, 2018

@davidfowl The issue #222 (comment) here, the server connections dropped, and never came back until a new deployment. So we are thinking it might be more robust for the server side to always try connecting

@davidfowl
Copy link
Member

This is specifically this case right:

// False means we got a HandshakeResponseMessage with error. Will take below actions:

When does this happen?

@vwxyzh
Copy link
Contributor Author

vwxyzh commented Oct 19, 2018

And we want to return more error in handshake, e.g. reach rate limit for free tiers, and when customer upgrade it should be recover.
Currently, when customer reaching rate limit, service will disconnect the connection without any useful message (PR #236 want to send a useful message to app server), and SDK will try to reconnect service in a very high rate (delay 0~1s * 5 connection per hub * hub count).

@vwxyzh
Copy link
Contributor Author

vwxyzh commented Oct 19, 2018

And I think this case:

catch (Exception ex)
{
Log.FailedToConnect(_logger, ex);
await DisposeConnection();
await Task.Delay(ReconnectInterval);
}

should also return false, instead of reconnecting with a short delay.

For example, if my access key is wrong, SDK will try to connect service, and always get an exception, web socket never connected, and retry in a high rate.

@davidfowl
Copy link
Member

For example, if my access key is wrong, SDK will try to connect service, and always get an exception, web socket never connected, any retry in a high rate.

I agree with most of what you're saying. I'd like to see some tests for these cases.

@vwxyzh vwxyzh merged commit 45236fd into Azure:dev Oct 30, 2018
@vwxyzh vwxyzh deleted the z/reconnect branch October 30, 2018 05:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants