Skip to content

Retry API requests in case of temporary network issues #190

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Aug 2, 2021

Conversation

seanlinsley
Copy link
Member

This PR adds https://github.com/hashicorp/go-retryablehttp to retry any network requests not directly handled by the AWS SDK (since it has its own retry mechanism).

If any of these non-SDK requests should be only tried once we'll need to switch them back to server.Config.HTTPClient.

Here's an example of a request being retried:

Screen Shot 2021-07-27 at 6 22 48 PM

@seanlinsley seanlinsley requested a review from lfittl July 27, 2021 23:35
main.go Outdated
@@ -61,7 +61,8 @@ func run(ctx context.Context, wg *sync.WaitGroup, globalCollectionOpts state.Col
prefixedLogger := logger.WithPrefix(server.SectionName)
prefixedLogger.PrintVerbose("Identified as api_system_type: %s, api_system_scope: %s, api_system_id: %s", server.SystemType, server.SystemScope, server.SystemID)

conf.Servers[idx].HTTPClient = config.CreateHTTPClient(server)
conf.Servers[idx].HTTPClient = config.CreateHTTPClient(server, false)
conf.Servers[idx].HTTPClientWithRetry = config.CreateHTTPClient(server, true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could pass the prefixedLogger here so HTTP error logs have the right prefix (I suspect we may also need that for the existing HTTP client?)

@seanlinsley
Copy link
Member Author

Okay @lfittl I pushed updates to this branch. Unfortunately, my attempt at unifying the loggers is introducing formatting issues:

Screen Shot 2021-07-28 at 2 27 04 PM

Maybe this would fix it? hashicorp/go-retryablehttp#101

@lfittl
Copy link
Member

lfittl commented Jul 28, 2021

Okay @lfittl I pushed updates to this branch. Unfortunately, my attempt at unifying the loggers is introducing formatting issues:

Screen Shot 2021-07-28 at 2 27 04 PM

Maybe this would fix it? hashicorp/go-retryablehttp#101

Hmm, interesting. Yeah, I think at the very least we'd need something that hides the verbose messages (unless one is running with -v).

I wonder if we should just set Logger to nil? That gives a clean output, and I don't think the retry behavior is that interesting to output to end users:

lfittl@starfish pganalyze-collector % ./pganalyze-collector --test   
2021/07/28 14:31:44 I Running collector test with pganalyze-collector 0.40.0
2021/07/28 14:31:44 I [server1] Testing statistics collection...
2021/07/28 14:31:59 E [server1] Could not process server: Get "http://localhost:5000/v2/snapshots/grant": GET http://localhost:5000/v2/snapshots/grant giving up after 5 attempt(s): Get "http://localhost:5000/v2/snapshots/grant": dial tcp 127.0.0.1:5000: connect: connection refused

Additionally, when looking at the log output, I noticed something else we should consider. Previously when the server collection failed, e.g. because the API server isn't reachable at all, the actual error message from the collector (ignoring any extra output we'd get from the HTTP client) was like this:

2021/07/28 14:10:10 I [server1] Testing statistics collection...
2021/07/28 14:10:10 E [server1] Could not process server: Get "http://localhost:5000/v2/snapshots/grant": \
dial tcp [::1]:5000: connect: connection refused

Now the message looks like this:

2021/07/28 14:31:59 E [server1] Could not process server: Get "http://localhost:5000/v2/snapshots/grant": GET http://localhost:5000/v2/snapshots/grant giving up after 5 attempt(s): Get "http://localhost:5000/v2/snapshots/grant": dial tcp 127.0.0.1:5000: connect: connection refused

I wonder if we can get rid of the first Get "http://localhost:5000/v2/snapshots/grant" at least? (I think that may be caused by wrapping the HTTP error somewhere)

@seanlinsley
Copy link
Member Author

Okay, Logger has been set to nil to remove the more verbose logs.

Retryablehttp adds the URL twice to error messages, and we then wrap that error, causing it to show the URL three times. I settled on a regex hack to remove extra URLs from the message so we can still keep the connection refused detail.

Not perfect, but here's the current version of the errors when the server is down:

2021/07/29 10:06:15 E [server1] Could not collect activity for server: could not get default grant for activity snapshot: "http://localhost:5000/v2/snapshots/grant": dial tcp [::1]:5000: connect: connection refused
2021/07/29 10:06:20 E [server1] Could not get log grant: Get "http://localhost:5000/v2/snapshots/grant_logs": "http://localhost:5000/v2/snapshots/grant_logs": dial tcp [::1]:5000: connect: connection refused

And when the server responds with an error:

2021/07/29 10:18:17 E [server1] Could not collect activity for server: could not get default grant for activity snapshot: http://localhost:5000/v2/snapshots/grant giving up after 5 attempt(s)
2021/07/29 10:18:19 E [server1] Could not get log grant: Get "http://localhost:5000/v2/snapshots/grant_logs": http://localhost:5000/v2/snapshots/grant_logs giving up after 5 attempt(s)

If you think this isn't a terrible approach and would like them to be more consistent, I'll revisit the error parsing.

@lfittl
Copy link
Member

lfittl commented Jul 29, 2021

Retryablehttp adds the URL twice to error messages, and we then wrap that error, causing it to show the URL three times. I settled on a regex hack to remove extra URLs from the message so we can still keep the connection refused detail.

Not perfect, but here's the current version of the errors when the server is down:

2021/07/29 10:06:15 E [server1] Could not collect activity for server: could not get default grant for activity snapshot: "http://localhost:5000/v2/snapshots/grant": dial tcp [::1]:5000: connect: connection refused
2021/07/29 10:06:20 E [server1] Could not get log grant: Get "http://localhost:5000/v2/snapshots/grant_logs": "http://localhost:5000/v2/snapshots/grant_logs": dial tcp [::1]:5000: connect: connection refused

And when the server responds with an error:

2021/07/29 10:18:17 E [server1] Could not collect activity for server: could not get default grant for activity snapshot: http://localhost:5000/v2/snapshots/grant giving up after 5 attempt(s)
2021/07/29 10:18:19 E [server1] Could not get log grant: Get "http://localhost:5000/v2/snapshots/grant_logs": http://localhost:5000/v2/snapshots/grant_logs giving up after 5 attempt(s)

If you think this isn't a terrible approach and would like them to be more consistent, I'll revisit the error parsing.

I think something that modifies the error message with a regexp could work, but I'm not a fan of doing this in the central logger, since we're introducing knowledge about a very specific error into the generic codepath.

Maybe instead we could do this at the individual call sites, e.g. by having a FriendlyHTTPErrorMessage method (or something like that) that is called before we pass back the error from functions that are using the retrying HTTP client.

@seanlinsley
Copy link
Member Author

Okay, the log cleaner is now in a separate util function. Now all four scenarios only show the URL once:

2021/07/30 08:58:05 E [server1] Could not collect activity for server: could not get default grant for activity snapshot: "http://localhost:5000/v2/snapshots/grant": dial tcp [::1]:5000: connect: connection refused
2021/07/30 08:58:18 E [server1] Could not get log grant: "http://localhost:5000/v2/snapshots/grant_logs": dial tcp [::1]:5000: connect: connection refused
2021/07/30 09:01:48 E [server1] Could not collect activity for server: could not get default grant for activity snapshot: http://localhost:5000/v2/snapshots/grant giving up after 5 attempt(s)
2021/07/30 09:01:52 E [server1] Could not get log grant: http://localhost:5000/v2/snapshots/grant_logs giving up after 5 attempt(s)

Copy link
Member

@lfittl lfittl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@seanlinsley seanlinsley merged commit 314edc8 into main Aug 2, 2021
@seanlinsley seanlinsley deleted the retry-api-requests branch August 2, 2021 14:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants