-
Notifications
You must be signed in to change notification settings - Fork 355
Upgrade AWS EC2 usage to Go SDK V2 #1146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade AWS EC2 usage to Go SDK V2 #1146
Conversation
|
This issue is currently awaiting triage. If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Welcome @gargipanatula! |
|
Hi @gargipanatula. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
| ec2.InstanceStateNameShuttingDown, | ||
| ec2.InstanceStateNameStopping, | ||
| ec2.InstanceStateNameStopped, | ||
| string(ec2types.InstanceStateNamePending), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: change the array type []string{..} to the new ec2 type and cast it to string when needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this would be better, but it's only used as a parameter to newEc2Filter, which expects a []string{}. Converting aliveFilter to strings just for that function might be a little messy, so maybe we can keep it as a []string{}? Let me know if that sounds ok, open to either way.
|
|
||
| // An interface to satisfy the ec2.Client API. | ||
| // More details about this pattern: https://docs.aws.amazon.com/sdk-for-go/v2/developer-guide/unit-testing.html | ||
| type EC2API interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this interface type? is it for testing purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Previously, awsSdkEC2 and MockedEC2API wrapped ec2iface.EC2API to mock the ec2 client. In v2, the API doesn't exist, so we have to create our own interface and wrap that instead.
More details: https://docs.aws.amazon.com/sdk-for-go/v2/developer-guide/unit-testing.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, we can only keep the interface for functions required for testing for simplicity. For example, if only CreateSecurityGroup is needed for the tests, we can only keep it and remove others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that makes sense. All these guys are mocked though, mostly by FakeEC2Impl, so maybe we can keep it as is for now to minimize changes in this PR and address it in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: rename it to EC2APIForTesting?
pkg/providers/v1/aws_sdk.go
Outdated
| return kmsClient, nil | ||
| } | ||
|
|
||
| func (p *awsSDKProvider) AddHandlersV2(ctx context.Context, regionName string, cfg awsv2.Config) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a code comment here to note that original implementation in SDK v1?
| type EC2 interface { | ||
| // Query EC2 for instances matching the filter | ||
| DescribeInstances(request *ec2.DescribeInstancesInput) ([]*ec2.Instance, error) | ||
| DescribeInstances(ctx context.Context, request *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) ([]ec2types.Instance, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need EC2API interface above, if we already have this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EC2API is used to mock the EC2 client in general, whereas this interface is used to provide custom implementations over existing EC2 functionality (e.g. returning []*ec2.SecurityGroup instead of the default ec2.DescribeSecurityGroupsOutput in DescribeSecurityGroups()). I assume this was originally created for mocking, because some operations return paginated results which are hard to mock.
One example of how this is in findSecurityGroup. Cloud.ec2 is iface.EC2, and the result of cloud.ec2.DescribeSecurityGroups is []*ec2.SecurityGroup.
Compare this to DescribeSecurityGroups, where awsSdkEC2.ec2 is EC2API, and the result of awsSdkEC2.ec2.DescribeSecurityGroups is *ec2.DescribeSecurityGroupsOutput.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get it, the detailed implementation for mock can be different but shouldn't the interface be the same? We can only keep this interface and reuse it for mocking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's because the original authors wanted two levels of mocking. The first one is that they wanted to mock the EC2 Client functions, hence their usage of ec2iface.EC2API (now EC2API). The second one is that they had their own set of functions, and to mock these, they made iface.EC2. These functions are not really related to the EC2 Client, they're essentially functions that happen to call the EC2 API and do their own thing.
I believe that if we take out iface.EC2, then we don't have a way to mock things like awsSdkEC2.DescribeSecurityGroups
| return out, metadata, err | ||
| } | ||
|
|
||
| func awsServiceAndNameV2(ctx context.Context) (string, string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you share the original implementation for this in v1?
| klog.Infof("Starting the tagging controller") | ||
| for i := 0; i < tc.workerCount; i++ { | ||
| go wait.Until(tc.work, tc.nodeMonitorPeriod, stopCh) | ||
| go wait.Until(func() { tc.work(ctx) }, tc.nodeMonitorPeriod, stopCh) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why wrap in a func ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't wrap, it interprets tc.work(ctx) as a function call, so I wrapped it to satisfy the fact that .Until needs a func()
| } | ||
|
|
||
| go taggingcontroller.Run(ctx.Done()) | ||
| go taggingcontroller.Run(ctx, ctx.Done()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just pass ctx as the argument as the second one is based on ctx anyway?
pkg/providers/v1/aws.go
Outdated
|
|
||
| request := &ec2.DescribeInstancesInput{ | ||
| InstanceIds: []*string{instanceID.awsString()}, | ||
| InstanceIds: []string{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the instance if removed? without the filter it will query all the instances right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad - fixed.
pkg/providers/v1/aws_sdk.go
Outdated
| }) | ||
|
|
||
| func (p *awsSDKProvider) Compute(ctx context.Context, regionName string) (iface.EC2, error) { | ||
| cfg, err := awsConfig.LoadDefaultConfig(context.TODO(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ctx is already provided as function param right, can we use it instead of TODO()?
| Credentials: p.creds, | ||
| } | ||
| awsConfig = awsConfig.WithCredentialsChainVerboseErrors(true). | ||
| WithEndpointResolver(p.cfg.GetResolver()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do provide override for the endpoint to be used https://github.com/kubernetes/cloud-provider-aws/blob/master/pkg/providers/v1/config/config.go#L192-L204
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you - added client level overrides!
pkg/providers/v1/aws_sdk.go
Outdated
| } | ||
|
|
||
| func (p *awsSDKProvider) AddHandlersV2(ctx context.Context, regionName string, cfg awsv2.Config) { | ||
| cfg.APIOptions = append(cfg.APIOptions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way to test these in unit test? if not can you share the log details just to make sure we the behavior is same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep will paste the logs in the PR description, thank you.
pkg/providers/v1/aws_sdk.go
Outdated
|
|
||
| delayer := p.getCrossRequestRetryDelay(regionName) | ||
| if delayer != nil { | ||
| cfg.APIOptions = append(cfg.APIOptions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we test this using an unit test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarification, do we want to add a test for getCrossRequestRetryDelay or for the middleware functions themselves? The functions called in middleware, like ComputeDelayForRequest are tested in separate unit tests.
pkg/providers/v1/retry_handler.go
Outdated
| } | ||
| func (r customRetryer) IsErrorRetryable(err error) bool { | ||
| var ae smithy.APIError | ||
| if errors.As(err, &ae) && strings.Contains(ae.Error(), NON_RETRYABLE_ERROR) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see you are creating this error as a normal one using errors.new, then will errors.As succeed here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah you're right - errors.As would be false for NON_RETRYABLE_ERRORs. Will do a simple error message check instead.
| func (c *Cloud) addSecurityGroupIngress(securityGroupID string, addPermissions []*ec2.IpPermission) (bool, error) { | ||
| func (c *Cloud) addSecurityGroupIngress(ctx context.Context, securityGroupID string, addPermissions []ec2types.IpPermission) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we pass the value instead of pointer type here?
| func (c *Cloud) removeSecurityGroupIngress(securityGroupID string, removePermissions []*ec2.IpPermission) (bool, error) { | ||
| func (c *Cloud) removeSecurityGroupIngress(ctx context.Context, securityGroupID string, removePermissions []ec2types.IpPermission) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
| var found *ec2.IpPermission | ||
| var found *ec2types.IpPermission | ||
| for _, groupPermission := range group.IpPermissions { | ||
| if ipPermissionExists(removePermission, groupPermission, hasUserID) { | ||
| found = removePermission | ||
| if ipPermissionExists(&removePermission, &groupPermission, hasUserID) { | ||
| found = &removePermission |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if passing value is desired for this function, we can directly use value type var found ec2types.IpPermission. There is no need to transform to pointer and back then.
|
|
||
| existing := subnetsByAZ[az] | ||
| if existing == nil { | ||
| existing, exists := subnetsByAZ[az] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe worth thinking a new name for existing
|
|
||
| // An interface to satisfy the ec2.Client API. | ||
| // More details about this pattern: https://docs.aws.amazon.com/sdk-for-go/v2/developer-guide/unit-testing.html | ||
| type EC2API interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, we can only keep the interface for functions required for testing for simplicity. For example, if only CreateSecurityGroup is needed for the tests, we can only keep it and remove others.
pkg/providers/v1/aws_sdk.go
Outdated
| cfg, err := awsConfig.LoadDefaultConfig(context.TODO(), | ||
| awsConfig.WithRegion(regionName), | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When creating the config, can we use
config.LoadDefaultConfig(ctx, config.WithDefaultsMode(aws.DefaultsModeInRegion),
awsConfig.WithRegion(regionName),
... // plus any other configurations
)
that will give us default configurations https://github.com/aws/aws-sdk-go-v2/blob/main/aws/defaults/defaults.go#L26-L32
pkg/providers/v1/aws_sdk.go
Outdated
| cfg.Region = regionName | ||
| cfg.Credentials = p.credsV2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can embed them into awsConfig.Withxxx in awsConfig.LoadDefaultConfig above
| o.Retryer = &customRetryer{ | ||
| retry.NewStandard(), | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: same for retry, which can also be embedded into awsConfig.WithRetryer. But given we specify the default mode, the standard retryer will be picked by default so we don't need to explicitly set this.
If there is a need to set up custom retryer we can do something like:
awsConfig.WithRetryer(func() aws.Retryer {
return retry.AddWithMaxAttempts(retry.NewStandard(), 5)
})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason we need to have a custom retryer is because we need to mark requests as non retryable under specific conditions. In v2, we do this via a combination of middleware and custom retry logic (see here: https://github.com/kubernetes/cloud-provider-aws/pull/1146/files#r2082328615).
The custom retryer you shared would definitely limit the number of retries, but it would do it for all requests. In this case, we need to be able to catch a specific case and never retry that.
| type EC2 interface { | ||
| // Query EC2 for instances matching the filter | ||
| DescribeInstances(request *ec2.DescribeInstancesInput) ([]*ec2.Instance, error) | ||
| DescribeInstances(ctx context.Context, request *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) ([]ec2types.Instance, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get it, the detailed implementation for mock can be different but shouldn't the interface be the same? We can only keep this interface and reuse it for mocking
pkg/providers/v1/retry_handler.go
Outdated
| // Standard retry implementation, except that it doesn't retry RequestLimitExceeded errors. | ||
| // This works in tandem with CrossRequestRetryDelay, which reports these errors before | ||
| // the middleware exits upon seeing this error | ||
| type customRetryer struct { | ||
| awsv2.Retryer | ||
| } | ||
| func (r customRetryer) IsErrorRetryable(err error) bool { | ||
| var ae smithy.APIError | ||
| if errors.As(err, &ae) && strings.Contains(ae.Error(), NON_RETRYABLE_ERROR) { | ||
| return false | ||
| } | ||
| return r.Retryer.IsErrorRetryable(err) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this? I didn't see it used anywhere.
v2 retryer itself has an implementation to decide which errors are retryable https://github.com/aws/aws-sdk-go-v2/blob/main/aws/retry/retryable_error.go#L15. But I didn't see the purpose of custom retryer here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're replicating a handler function, which used to mark a request as non retryable. In v2, we can't directly mark a request as non retryable, so we have to 1) throw a custom error from the middleware stage and 2) have a retryer that can catch that error and return false for IsErrorRetryable().
Original handler:
| func (c *CrossRequestRetryDelay) BeforeSign(r *request.Request) { |
It's used for the EC2 aws sdk go v2 client:
| o.Retryer = &customRetryer{ |
|
/ok-to-test |
|
/retest |
|
/lgtm |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kmala, yue9944882 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
The AWS SDK Go V1 is being deprecated on July 31, 2025. This PR migrates usage of the EC2 SDK to use V2.
Context: https://aws.amazon.com/blogs/developer/announcing-end-of-support-for-aws-sdk-for-go-v1-on-july-31-2025/
Changes:
BeforeSigncaused some requests to be marked as non-retryable in some cases. Now, due to the V2's migration to middleware, the Finalize step (throughdelayPreSign()) now throws errors with a customnonRetryableErrorerror code when the same conditions are met. The EC2 Client is configured with a custom retrier,customRetryer, which will catch this error and cause the request to not be retried.Old logs:
New logs:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Testing details:
Added unit tests to verify custom endpoint resolver and retrier in the EC2 Client, included in the
aws_sdk_test.gofile.Ran
make && make test. Created a personal cluster and manually added nodes to verify that tags were still properly added. Ran e2e tests on the cluster using the following commands: