
Failed to "KillPodSandbox" due to calico connection is unauthorized #7220


Description

@sysnet4admin

After some period of time, Pods can no longer be created or deleted, failing with this message:

$ kubectl describe pod <name>
error killing pod: failed to "KillPodSandbox" for "9f91266a-70a9-428f-a1d6-a2ae8d5427d1" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"4657b77480472f4352e413d52e0c5d5545c675da862cc56c8e6f22d7b0577031\": plugin type=\"calico\" failed (delete): error getting ClusterInformation: connection is unauthorized: Unauthorized"

This seems to be related to the service account token policy change introduced in Kubernetes v1.26.0:
https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#manual-secret-management-for-serviceaccounts

Here is a workaround:
force calico-node to re-read its service account credentials by restarting or deleting it.

$ kubectl rollout restart ds -n kube-system calico-node
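
To check which token the CNI plugin is actually using, you can inspect the kubeconfig that calico-node writes for the plugin on each node (a sketch, assuming the default CNI config directory; the path may differ per install):

[root@m-k8s ~]# grep 'token:' /etc/cni/net.d/calico-kubeconfig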

Expected Behavior

kubectl create and delete work fine.

Current Behavior

Pod creation and deletion do not work properly:

[root@m-k8s ~]# kubectl get po
NAME                                      READY   STATUS              RESTARTS      AGE
dpy-nginx-6564b9dbcc-d7jj5                0/1     ContainerCreating   0             17m
dpy-nginx-6564b9dbcc-vgjmw                0/1     ContainerCreating   0             17m
dpy-nginx-6564b9dbcc-wbr59                0/1     ContainerCreating   0             17m
nfs-client-provisioner-7596fb9c9c-gmpmn   0/1     Terminating         0             47h
nfs-client-provisioner-7596fb9c9c-jvmnm   1/1     Running             1 (46m ago)   42h
nginx-76d9fbf4fb-7xjgb                    0/1     Terminating         0             42h
nginx-76d9fbf4fb-dv48n                    1/1     Running             0             42h
nginx-76d9fbf4fb-kqp5j                    1/1     Running             0             42h
nginx-76d9fbf4fb-qrl4p                    1/1     Running             0             42h
nginx-76d9fbf4fb-wlpwd                    1/1     Running             0             42h

Possible Solution

The workaround is to restart the DaemonSet or delete the pod.

OR

The possible solution is to create a long-lived secret token for the service account instead of the bound token below,
and to use that secret with the calico-node service account (related to #5712 and #6421).

sh-4.4# cat /var/run/secrets/kubernetes.io/serviceaccount/token 
eyJhbGciOiJSUzI1NiIsImtpZCI6IjlpTFk5RXlJR29yb01VZjlXOGg0UGhvLWhLRGhtZnNvekdyeU0xdVlFUTAifQ.eyJhdWQiOlsiaHR0cHM6Ly9rdWJlcm5ldGVzLmRlZmF1bHQuc3ZjLmNsdXN0ZXIubG9jYWwiXSwiZXhwIjoxNzA1OTc1ODA5LCJpYXQiOjE2NzQ0Mzk4MDksImlzcyI6Imh0dHBzOi8va3ViZXJuZXRlcy5kZWZhdWx0LnN2Yy5jbHVzdGVyLmxvY2FsIiwia3ViZXJuZXRlcy5pbyI6eyJuYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsInBvZCI6eyJuYW1lIjoiY2FsaWNvLW5vZGUtOWRnZzIiLCJ1aWQiOiIxY2UwODRlYS1kNzIzLTQ5MDAtYjI1ZC00YzRhNTVmMmI0OWYifSwic2VydmljZWFjY291bnQiOnsibmFtZSI6ImNhbGljby1ub2RlIiwidWlkIjoiM2RhYmI5MmYtN2UzYy00ZTkyLWI4OTUtZmM3NzczM2RlMTBmIn0sIndhcm5hZnRlciI6MTY3NDQ0MzQxNn0sIm5iZiI6MTY3NDQzOTgwOSwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Omt1YmUtc3lzdGVtOmNhbGljby1ub2RlIn0.SC5WdggKDD-SE2ZnIfNYaMROXNvJVqqdKXdF6SCN_qrLBwmLwXbSHnQA_vkBBFHqi1qsQP2CuBx0beYUzm5VkcBt7LMZeDBHaOfDIfBvwMbzkAAMcSoqd6bnZi1mZa8Mf2ZTVEvhLOJSyb9npGAa0te6xfWAvEbTmGWTOvZaQ59y-RqJ9OfqAiYYWoEDCLpjjjG0F1-ke2_6eRx7m6Ri2Ne47WKGGURfMVvf2GAtV0xrYuI2tvA8UhivzhaPiJx56RfyVmVAnrl8qfBk0rG6J43TkPGA59R52vbvJkI_9k-kPw_OXJv35YDqgExn3i7CswGUZCX9TAGkET5mpm7u4w
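
A minimal sketch of that idea, following the manual Secret management approach from the Kubernetes docs linked above (the Secret name calico-node-token is illustrative, not part of the Calico manifests):

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: calico-node-token
  namespace: kube-system
  annotations:
    kubernetes.io/service-account.name: calico-node
type: kubernetes.io/service-account-token
EOF

The token controller then populates this Secret with a non-expiring token for the calico-node service account.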

Steps to Reproduce (for bugs)

  1. Deploy native Kubernetes with the Vagrant script (link)
  2. Wait for 1-2 days
  3. Deploy a new Deployment:
[root@m-k8s ~]# k create deploy new-nginx --image=nginx --replicas=3
deployment.apps/new-nginx created
  4. Check the deployment status:
[root@m-k8s ~]# kubectl get po
NAME                                                       READY   STATUS              RESTARTS      AGE
new-nginx-6564b9dbcc-<hash>              0/1     ContainerCreating   0               15m
new-nginx-6564b9dbcc-<hash>              0/1     ContainerCreating   0               15m
new-nginx-6564b9dbcc-<hash>              0/1     ContainerCreating   0               15m

Context

The fix from #6218 has already been applied in this code:
node/pkg/cni/token_watch.go

// CNI tokens are requested with a 24h validity and refreshed well before
// expiry (after validity / defaultRefreshFraction, i.e. roughly every 6h).
const defaultCNITokenValiditySeconds = 24 * 60 * 60
const minTokenRetryDuration = 5 * time.Second
const defaultRefreshFraction = 4

func NewTokenRefresher(clientset *kubernetes.Clientset, namespace string, serviceAccountName string) *TokenRefresher {
	return NewTokenRefresherWithCustomTiming(clientset, namespace, serviceAccountName, defaultCNITokenValiditySeconds, minTokenRetryDuration, defaultRefreshFraction)
}

So I decoded the JWT in use on the calico-node.
It confirmed a validity of 1 year (exp 1705975809 - iat 1674439809 = 31,536,000 s = 365 days).
JWT

sh-4.4# cat /var/run/secrets/kubernetes.io/serviceaccount/token 
eyJhbGciOiJSUzI1NiIsImtpZCI6IjlpTFk5RXlJR29yb01VZjlXOGg0UGhvLWhLRGhtZnNvekdyeU0xdVlFUTAifQ.eyJhdWQiOlsiaHR0cHM6Ly9rdWJlcm5ldGVzLmRlZmF1bHQuc3ZjLmNsdXN0ZXIubG9jYWwiXSwiZXhwIjoxNzA1OTc1ODA5LCJpYXQiOjE2NzQ0Mzk4MDksImlzcyI6Imh0dHBzOi8va3ViZXJuZXRlcy5kZWZhdWx0LnN2Yy5jbHVzdGVyLmxvY2FsIiwia3ViZXJuZXRlcy5pbyI6eyJuYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsInBvZCI6eyJuYW1lIjoiY2FsaWNvLW5vZGUtOWRnZzIiLCJ1aWQiOiIxY2UwODRlYS1kNzIzLTQ5MDAtYjI1ZC00YzRhNTVmMmI0OWYifSwic2VydmljZWFjY291bnQiOnsibmFtZSI6ImNhbGljby1ub2RlIiwidWlkIjoiM2RhYmI5MmYtN2UzYy00ZTkyLWI4OTUtZmM3NzczM2RlMTBmIn0sIndhcm5hZnRlciI6MTY3NDQ0MzQxNn0sIm5iZiI6MTY3NDQzOTgwOSwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Omt1YmUtc3lzdGVtOmNhbGljby1ub2RlIn0.SC5WdggKDD-SE2ZnIfNYaMROXNvJVqqdKXdF6SCN_qrLBwmLwXbSHnQA_vkBBFHqi1qsQP2CuBx0beYUzm5VkcBt7LMZeDBHaOfDIfBvwMbzkAAMcSoqd6bnZi1mZa8Mf2ZTVEvhLOJSyb9npGAa0te6xfWAvEbTmGWTOvZaQ59y-RqJ9OfqAiYYWoEDCLpjjjG0F1-ke2_6eRx7m6Ri2Ne47WKGGURfMVvf2GAtV0xrYuI2tvA8UhivzhaPiJx56RfyVmVAnrl8qfBk0rG6J43TkPGA59R52vbvJkI_9k-kPw_OXJv35YDqgExn3i7CswGUZCX9TAGkET5mpm7u4w
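
For reference, the payload below was obtained by base64url-decoding the second dot-separated segment of the token, e.g. (padding complaints on stderr are harmless):

sh-4.4# cut -d. -f2 /var/run/secrets/kubernetes.io/serviceaccount/token | tr '_-' '/+' | base64 -d 2>/dev/null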

Decoded JWT's Payload

{
  "aud": [
    "https://kubernetes.default.svc.cluster.local"
  ],
  "exp": 1705975809,    <<<< Tue Jan 23 2024 02:10:09 GMT+0000 
  "iat": 1674439809,
  "iss": "https://kubernetes.default.svc.cluster.local",
  "kubernetes.io": {
    "namespace": "kube-system",
    "pod": {
      "name": "calico-node-9dgg2",
      "uid": "1ce084ea-d723-4900-b25d-4c4a55f2b49f"
    },
    "serviceaccount": {
      "name": "calico-node",
      "uid": "3dabb92f-7e3c-4e92-b895-fc77733de10f"
    },
    "warnafter": 1674443416
  },
  "nbf": 1674439809,
  "sub": "system:serviceaccount:kube-system:calico-node"
}
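
The epoch timestamps convert with GNU date, matching the annotation on exp above:

[root@m-k8s ~]# date -u -d @1705975809
Tue Jan 23 02:10:09 UTC 2024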

Thus the token itself is still valid, and this issue seems to come from a different part of Kubernetes' authorization verification logic.


/var/log/messages on all nodes shows the following when it happens.

[control-plane node]

Jan 23 09:10:35 m-k8s kubelet: E0123 09:10:35.298683    4180 server.go:299] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"
Jan 23 09:10:50 m-k8s kubelet: E0123 09:10:50.303499    4180 server.go:299] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"
Jan 23 09:11:05 m-k8s kubelet: E0123 09:11:05.308058    4180 server.go:299] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"
Jan 23 09:11:20 m-k8s kubelet: E0123 09:11:20.300704    4180 server.go:299] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"
Jan 23 09:11:35 m-k8s kubelet: E0123 09:11:35.290727    4180 server.go:299] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"
<snipped>

[worker node]

Jan 21 16:44:12 w2-k8s kubelet: E0121 16:44:12.656423    3630 server.go:299] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"
Jan 21 16:44:27 w2-k8s kubelet: E0121 16:44:27.650877    3630 server.go:299] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"

Your Environment

  • Calico version: v3.24.5, v3.25.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt): native-kubernetes v1.26.0
[root@m-k8s ~]# kubectl get nodes -o wide 
NAME     STATUS   ROLES           AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
m-k8s    Ready    control-plane   2d19h   v1.26.0   192.168.1.10    <none>        CentOS Linux 7 (Core)   3.10.0-1127.19.1.el7.x86_64   containerd://1.6.10
w1-k8s   Ready    <none>          2d19h   v1.26.0   192.168.1.101   <none>        CentOS Linux 7 (Core)   3.10.0-1127.19.1.el7.x86_64   containerd://1.6.10
w2-k8s   Ready    <none>          2d19h   v1.26.0   192.168.1.102   <none>        CentOS Linux 7 (Core)   3.10.0-1127.19.1.el7.x86_64   containerd://1.6.10
w3-k8s   Ready    <none>          2d18h   v1.26.0   192.168.1.103   <none>        CentOS Linux 7 (Core)   3.10.0-1127.19.1.el7.x86_64   containerd://1.6.10
  • Operating System and version: CentOS 7.9 (3.10.0-1127.19.1.el7.x86_64)
  • Link to your project (optional):
