EKS Anywhere upgrade with podIamConfig causes API server crash and webhook connection failures #9824

@Raghavendiran-2002

What happened:

This occurs when upgrading an EKS Anywhere (v0.22.5) cluster running on Docker after adding podIamConfig to the cluster spec to configure IRSA:

podIamConfig:
  serviceAccountIssuer: https://<idp-url>

The upgrade process using:

eksctl anywhere upgrade cluster -f cluster_name.yaml

results in a failed state: the control plane does not initialize properly, and key system pods cannot reach the API server.

Key symptoms:

  • cert-manager-cainjector fails with:
E0617 13:05:13.617199       1 controller.go:218] "error checking if certificates.cert-manager.io CRD is installed" err="failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://10.96.0.1:443/apis/apiextensions.k8s.io/v1\": dial tcp 10.96.0.1:443: connect: connection refused" logger="cert-manager"
E0617 13:05:13.617228       1 controller.go:225] "error retrieving certificate.cert-manager.io CRDs" err="failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://10.96.0.1:443/apis/apiextensions.k8s.io/v1\": dial tcp 10.96.0.1:443: connect: connection refused" logger="cert-manager"
E0617 13:05:13.617236       1 main.go:43] "error executing command" err="failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://10.96.0.1:443/apis/apiextensions.k8s.io/v1\": dial tcp 10.96.0.1:443: connect: connection refused" logger="cert-manager"
  • kube-apiserver logs:
E0617 13:08:41.681859       1 cacher.go:478] cacher (machinedeployments.cluster.x-k8s.io): unexpected ListAndWatch error: failed to list cluster.x-k8s.io/v1alpha3, Kind=MachineDeployment: conversion webhook for cluster.x-k8s.io/v1beta1, Kind=MachineDeployment failed: Post "https://capi-webhook-service.capi-system.svc:443/convert?timeout=30s": dial tcp 10.107.129.178:443: connect: connection refused; reinitializing...
W0617 13:08:41.683128       1 reflector.go:569] storage/cacher.go:/cluster.x-k8s.io/clusters: failed to list cluster.x-k8s.io/v1alpha4, Kind=Cluster: conversion webhook for cluster.x-k8s.io/v1beta1, Kind=Cluster failed: Post "https://capi-webhook-service.capi-system.svc:443/convert?timeout=30s": dial tcp 10.107.129.178:443: connect: connection refused
E0617 13:08:41.683152       1 cacher.go:478] cacher (clusters.cluster.x-k8s.io): unexpected ListAndWatch error: failed to list cluster.x-k8s.io/v1alpha4, Kind=Cluster: conversion webhook for cluster.x-k8s.io/v1beta1, Kind=Cluster failed: Post "https://capi-webhook-service.capi-system.svc:443/convert?timeout=30s": dial tcp 10.107.129.178:443: connect: connection refused; reinitializing...
W0617 13:08:41.683132       1 reflector.go:569] storage/cacher.go:/bootstrap.cluster.x-k8s.io/kubeadmconfigs: failed to list bootstrap.cluster.x-k8s.io/v1alpha4, Kind=KubeadmConfig: conversion webhook for bootstrap.cluster.x-k8s.io/v1beta1, Kind=KubeadmConfig failed: Post "https://capi-kubeadm-bootstrap-webhook-service.capi-kubeadm-bootstrap-system.svc:443/convert?timeout=30s": dial tcp 10.96.203.89:443: connect: connection refused
E0617 13:08:41.683161       1 cacher.go:478] cacher (kubeadmconfigs.bootstrap.cluster.x-k8s.io): unexpected ListAndWatch error: failed to list bootstrap.cluster.x-k8s.io/v1alpha4, Kind=KubeadmConfig: conversion webhook for bootstrap.cluster.x-k8s.io/v1beta1, Kind=KubeadmConfig failed: Post "https://capi-kubeadm-bootstrap-webhook-service.capi-kubeadm-bootstrap-system.svc:443/convert?timeout=30s": dial tcp 10.96.203.89:443: connect: connection refused; reinitializing...
W0617 13:08:41.683181       1 reflector.go:569] storage/cacher.go:/cluster.x-k8s.io/clusters: failed to list cluster.x-k8s.io/v1alpha3, Kind=Cluster: conversion webhook for cluster.x-k8s.io/v1beta1, Kind=Cluster failed: Post "https://capi-webhook-service.capi-system.svc:443/convert?timeout=30s": dial tcp 10.107.129.178:443: connect: connection refused
E0617 13:08:41.683190       1 cacher.go:478] cacher (clusters.cluster.x-k8s.io): unexpected ListAndWatch error: failed to list cluster.x-k8s.io/v1alpha3, Kind=Cluster: conversion webhook for cluster.x-k8s.io/v1beta1, Kind=Cluster failed: Post "https://capi-webhook-service.capi-system.svc:443/convert?timeout=30s": dial tcp 10.107.129.178:443: connect: connection refused; reinitializing...
W0617 13:08:41.683181       1 reflector.go:569] storage/cacher.go:/bootstrap.cluster.x-k8s.io/kubeadmconfigs: failed to list bootstrap.cluster.x-k8s.io/v1alpha3, Kind=KubeadmConfig: conversion webhook for bootstrap.cluster.x-k8s.io/v1beta1, Kind=KubeadmConfig failed: Post "https://capi-kubeadm-bootstrap-webhook-service.capi-kubeadm-bootstrap-system.svc:443/convert?timeout=30s": dial tcp 10.96.203.89:443: connect: connection refused
E0617 13:08:41.683199       1 cacher.go:478] cacher (kubeadmconfigs.bootstrap.cluster.x-k8s.io): unexpected ListAndWatch error: failed to list bootstrap.cluster.x-k8s.io/v1alpha3, Kind=KubeadmConfig: conversion webhook for bootstrap.cluster.x-k8s.io/v1beta1, Kind=KubeadmConfig failed: Post "https://capi-kubeadm-bootstrap-webhook-service.capi-kubeadm-bootstrap-system.svc:443/convert?timeout=30s": dial tcp 10.96.203.89:443: connect: connection refused; reinitializing...
  • eksa-controller-manager logs:
failed to get API group resources: unable to retrieve the complete list of server APIs: clusterctl.cluster.x-k8s.io/v1alpha3: Get "https://10.96.0.1:443/apis/clusterctl.cluster.x-k8s.io/v1alpha3": dial tcp 10.96.0.1:443: connect: connection refused
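For triage, it can help to reduce the log dumps above to the distinct endpoints that refuse connections. The following is a minimal sketch, not part of EKS Anywhere tooling: the names `DIAL_REFUSED` and `refused_targets` are hypothetical, and the sample text is shortened from the log excerpts in this report.

```python
import re
from collections import Counter

# Matches the "dial tcp <ip>:<port>: connect: connection refused" fragment
# seen in the cert-manager, kube-apiserver and eksa-controller-manager logs.
DIAL_REFUSED = re.compile(
    r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+): connect: connection refused"
)


def refused_targets(log_text: str) -> Counter:
    """Count connection-refused errors per (ip, port) target in a log dump."""
    return Counter(
        (m.group(1), int(m.group(2))) for m in DIAL_REFUSED.finditer(log_text)
    )


if __name__ == "__main__":
    # Shortened excerpts from the logs in this issue.
    sample = (
        'Get "https://10.96.0.1:443/apis/clusterctl.cluster.x-k8s.io/v1alpha3": '
        "dial tcp 10.96.0.1:443: connect: connection refused\n"
        'Post "https://capi-webhook-service.capi-system.svc:443/convert?timeout=30s": '
        "dial tcp 10.107.129.178:443: connect: connection refused; reinitializing...\n"
    )
    for (ip, port), count in refused_targets(sample).most_common():
        print(f"{ip}:{port} refused {count} time(s)")
```

Run against the full logs, this should show whether the failures concentrate on the API server address or on the webhook services.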

Service state:

default/kubernetes                          ClusterIP   10.96.0.1
capi-system/capi-webhook-service           ClusterIP   10.107.129.178
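The addresses in the error messages can be matched against this listing to see which service each refused connection was aimed at. A rough sketch, assuming the three-column `namespace/name  TYPE  CLUSTER-IP` layout shown above; `parse_service_table` and `label_target` are hypothetical helper names.

```python
def parse_service_table(text: str) -> dict:
    """Map ClusterIP -> 'namespace/name' from lines shaped like
    'capi-system/capi-webhook-service  ClusterIP  10.107.129.178'."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and anything that is not a ClusterIP row.
        if len(parts) == 3 and parts[1] == "ClusterIP":
            mapping[parts[2]] = parts[0]
    return mapping


def label_target(ip: str, services: dict) -> str:
    """Return the service behind an IP, or a marker if it is not listed."""
    return services.get(ip, "unknown (not in the listing)")


if __name__ == "__main__":
    listing = """\
default/kubernetes                          ClusterIP   10.96.0.1
capi-system/capi-webhook-service           ClusterIP   10.107.129.178
"""
    services = parse_service_table(listing)
    # IPs taken from the "dial tcp" errors in this issue.
    for ip in ("10.96.0.1", "10.107.129.178", "10.96.203.89"):
        print(ip, "->", label_target(ip, services))
```

Note that 10.96.0.1 is the `default/kubernetes` service, so the refused connections to it mean the API server itself is unavailable, not only the CAPI webhooks. The third address, 10.96.203.89, is not in the listing; the kube-apiserver log lines associate it with capi-kubeadm-bootstrap-webhook-service.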

The control plane becomes unreachable and the cluster fails to recover post-upgrade.

What you expected to happen:

Cluster upgrade to complete successfully with podIamConfig enabled, and all system components (including the Kubernetes API server and webhook services) to start and operate without failure.

How to reproduce it (as minimally and precisely as possible):

1. Deploy a working EKS Anywhere cluster on Docker using a basic cluster spec.

2. Modify cluster_name.yaml and add:

podIamConfig:
  serviceAccountIssuer: https://<your-issuer-url>

3. Run:

eksctl anywhere upgrade cluster -f cluster_name.yaml

Anything else we need to know?

The issue appears to be directly tied to the addition of podIamConfig.
The capi-webhook-service becomes unreachable post-upgrade.
It may relate to how enabling IRSA modifies the API server configuration and webhook certificates in a local Docker environment.

Environment:

EKS Anywhere Release: v0.22.5
EKS Distro Release:
kube-apiserver:v1.32.3-eks-1-32-13
kube-controller-manager:v1.32.3-eks-1-32-13
kube-scheduler:v1.32.3-eks-1-32-13
kube-proxy:v1.32.3-eks-1-32-13
coredns:v1.11.4-eks-1-32-13
Docker version: 24.x
Host OS: Ubuntu 22.04
Platform: Local development environment using Docker

Labels: external (an issue, bug, or feature request filed from outside the AWS org)