Description
What happened:
I upgraded an EKS Anywhere (v0.22.5) cluster running on Docker after adding podIamConfig to the cluster spec for IRSA:
podIamConfig:
  serviceAccountIssuer: https://<idp-url>
The upgrade process using:
eksctl anywhere upgrade cluster -f cluster_name.yaml
results in a failed state where the control plane fails to initialize properly. Key system pods experience API connection failures.
Key symptoms:
- cert-manager-cainjector fails with:
0617 13:05:13.617199 1 controller.go:218] "error checking if certificates.cert-manager.io CRD is installed" err="failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://10.96.0.1:443/apis/apiextensions.k8s.io/v1\": dial tcp 10.96.0.1:443: connect: connection refused" logger="cert-manager"
E0617 13:05:13.617228 1 controller.go:225] "error retrieving certificate.cert-manager.io CRDs" err="failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://10.96.0.1:443/apis/apiextensions.k8s.io/v1\": dial tcp 10.96.0.1:443: connect: connection refused" logger="cert-manager"
E0617 13:05:13.617236 1 main.go:43] "error executing command" err="failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://10.96.0.1:443/apis/apiextensions.k8s.io/v1\": dial tcp 10.96.0.1:443: connect: connection refused" logger="cert-manager"
- kube-apiserver logs:
E0617 13:08:41.681859 1 cacher.go:478] cacher (machinedeployments.cluster.x-k8s.io): unexpected ListAndWatch error: failed to list cluster.x-k8s.io/v1alpha3, Kind=MachineDeployment: conversion webhook for cluster.x-k8s.io/v1beta1, Kind=MachineDeployment failed: Post "https://capi-webhook-service.capi-system.svc:443/convert?timeout=30s": dial tcp 10.107.129.178:443: connect: connection refused; reinitializing...
W0617 13:08:41.683128 1 reflector.go:569] storage/cacher.go:/cluster.x-k8s.io/clusters: failed to list cluster.x-k8s.io/v1alpha4, Kind=Cluster: conversion webhook for cluster.x-k8s.io/v1beta1, Kind=Cluster failed: Post "https://capi-webhook-service.capi-system.svc:443/convert?timeout=30s": dial tcp 10.107.129.178:443: connect: connection refused
E0617 13:08:41.683152 1 cacher.go:478] cacher (clusters.cluster.x-k8s.io): unexpected ListAndWatch error: failed to list cluster.x-k8s.io/v1alpha4, Kind=Cluster: conversion webhook for cluster.x-k8s.io/v1beta1, Kind=Cluster failed: Post "https://capi-webhook-service.capi-system.svc:443/convert?timeout=30s": dial tcp 10.107.129.178:443: connect: connection refused; reinitializing...
W0617 13:08:41.683132 1 reflector.go:569] storage/cacher.go:/bootstrap.cluster.x-k8s.io/kubeadmconfigs: failed to list bootstrap.cluster.x-k8s.io/v1alpha4, Kind=KubeadmConfig: conversion webhook for bootstrap.cluster.x-k8s.io/v1beta1, Kind=KubeadmConfig failed: Post "https://capi-kubeadm-bootstrap-webhook-service.capi-kubeadm-bootstrap-system.svc:443/convert?timeout=30s": dial tcp 10.96.203.89:443: connect: connection refused
E0617 13:08:41.683161 1 cacher.go:478] cacher (kubeadmconfigs.bootstrap.cluster.x-k8s.io): unexpected ListAndWatch error: failed to list bootstrap.cluster.x-k8s.io/v1alpha4, Kind=KubeadmConfig: conversion webhook for bootstrap.cluster.x-k8s.io/v1beta1, Kind=KubeadmConfig failed: Post "https://capi-kubeadm-bootstrap-webhook-service.capi-kubeadm-bootstrap-system.svc:443/convert?timeout=30s": dial tcp 10.96.203.89:443: connect: connection refused; reinitializing...
W0617 13:08:41.683181 1 reflector.go:569] storage/cacher.go:/cluster.x-k8s.io/clusters: failed to list cluster.x-k8s.io/v1alpha3, Kind=Cluster: conversion webhook for cluster.x-k8s.io/v1beta1, Kind=Cluster failed: Post "https://capi-webhook-service.capi-system.svc:443/convert?timeout=30s": dial tcp 10.107.129.178:443: connect: connection refused
E0617 13:08:41.683190 1 cacher.go:478] cacher (clusters.cluster.x-k8s.io): unexpected ListAndWatch error: failed to list cluster.x-k8s.io/v1alpha3, Kind=Cluster: conversion webhook for cluster.x-k8s.io/v1beta1, Kind=Cluster failed: Post "https://capi-webhook-service.capi-system.svc:443/convert?timeout=30s": dial tcp 10.107.129.178:443: connect: connection refused; reinitializing...
W0617 13:08:41.683181 1 reflector.go:569] storage/cacher.go:/bootstrap.cluster.x-k8s.io/kubeadmconfigs: failed to list bootstrap.cluster.x-k8s.io/v1alpha3, Kind=KubeadmConfig: conversion webhook for bootstrap.cluster.x-k8s.io/v1beta1, Kind=KubeadmConfig failed: Post "https://capi-kubeadm-bootstrap-webhook-service.capi-kubeadm-bootstrap-system.svc:443/convert?timeout=30s": dial tcp 10.96.203.89:443: connect: connection refused
E0617 13:08:41.683199 1 cacher.go:478] cacher (kubeadmconfigs.bootstrap.cluster.x-k8s.io): unexpected ListAndWatch error: failed to list bootstrap.cluster.x-k8s.io/v1alpha3, Kind=KubeadmConfig: conversion webhook for bootstrap.cluster.x-k8s.io/v1beta1, Kind=KubeadmConfig failed: Post "https://capi-kubeadm-bootstrap-webhook-service.capi-kubeadm-bootstrap-system.svc:443/convert?timeout=30s": dial tcp 10.96.203.89:443: connect: connection refused; reinitializing...
- eksa-controller-manager logs:
failed to get API group resources: unable to retrieve the complete list of server APIs: clusterctl.cluster.x-k8s.io/v1alpha3: Get "https://10.96.0.1:443/apis/clusterctl.cluster.x-k8s.io/v1alpha3": dial tcp 10.96.0.1:443: connect: connection refused
Service state:
default/kubernetes ClusterIP 10.96.0.1
capi-system/capi-webhook-service ClusterIP 10.107.129.178
The control plane becomes unreachable and the cluster fails to recover post-upgrade.
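To make it easier to line the log errors up with the service IPs listed above, the refused endpoints can be tallied from saved log output. This is a minimal sketch, not part of the original report: the function name `refused_endpoints` is hypothetical, and it assumes the logs were saved as plain text in the klog format shown above.

```shell
# Summarize which endpoints refuse connections in saved log output.
# Reads log text on stdin and prints "<count> <ip:port>" per endpoint,
# most frequent first. The function name is hypothetical.
refused_endpoints() {
  grep 'connection refused' \
    | grep -oE 'dial tcp [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+' \
    | awk '{ print $3 }' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}
```

For example, `refused_endpoints < kube-apiserver.log` (the file name is an assumption) should list 10.107.129.178:443 and 10.96.203.89:443 for the log excerpt above, which separates failures reaching the API server itself (10.96.0.1:443) from failures reaching the CAPI webhook services.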
What you expected to happen:
Cluster upgrade to complete successfully with podIamConfig enabled, and all system components (including the Kubernetes API server and webhook services) to start and operate without failure.
How to reproduce it (as minimally and precisely as possible):
1. Deploy a working EKS Anywhere cluster on Docker using a basic cluster spec.
2. Modify cluster_name.yaml and add:
   podIamConfig:
     serviceAccountIssuer: https://<your-issuer-url>
3. Run:
   eksctl anywhere upgrade cluster -f cluster_name.yaml
Anything else we need to know?
- The issue appears directly tied to the addition of podIamConfig.
- The capi-webhook-service becomes unreachable post-upgrade.
- This may relate to how IRSA modifies the API server configuration and webhook certificates in a local Docker environment.
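One way to check the API server configuration hypothesis is to compare the service-account flags in the kube-apiserver static pod manifest before and after the upgrade. The sketch below is an assumption-laden helper, not a confirmed diagnosis: `apiserver_sa_flags` is a hypothetical name, and it assumes the manifest has already been copied out of the control plane node (kubeadm writes it to /etc/kubernetes/manifests/kube-apiserver.yaml).

```shell
# List the service-account related flags found in a kube-apiserver
# static pod manifest, one per line, sorted.
# $1: path to a local copy of the manifest. The function name is hypothetical.
apiserver_sa_flags() {
  # "--" ends grep's option parsing so the pattern may start with dashes.
  grep -oE -- '--service-account-[a-z-]+=[^[:space:]"]+' "$1" | sort
}
```

On the Docker provider the control plane node is itself a container, so the manifest can likely be copied with something like `docker exec <control-plane-container> cat /etc/kubernetes/manifests/kube-apiserver.yaml > kube-apiserver.yaml`, where the container name is a placeholder. Running `apiserver_sa_flags kube-apiserver.yaml` on copies taken before and after the upgrade would show whether the issuer or key file flags changed.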
Environment:
EKS Anywhere Release: v0.22.5
EKS Distro Release:
kube-apiserver:v1.32.3-eks-1-32-13
kube-controller-manager:v1.32.3-eks-1-32-13
kube-scheduler:v1.32.3-eks-1-32-13
kube-proxy:v1.32.3-eks-1-32-13
coredns:v1.11.4-eks-1-32-13
Docker version: 24.x
Host OS: Ubuntu 22.04
Platform: Local development environment using Docker