Conversation

@jschaul (Member) commented Jul 5, 2023

https://wearezeta.atlassian.net/browse/WPB-3003

Checklist

  • Add a new entry in an appropriate subdirectory of changelog.d
  • Read and follow the PR guidelines

@zebot added the ok-to-test label (Approved for running tests in CI; overrides not-ok-to-test if both labels exist) on Jul 5, 2023
@jschaul (Member, Author) commented Jul 6, 2023

From https://concourse.ops.zinfra.io/builds/25484828, an example log when pods can't be scheduled and are stuck in "Pending":

Checking pods in namespace 'test-uia95z8cusqa' that failed to schedule...

Pod rabbitmq-0 failed to schedule for the following reasons:

Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  10m    default-scheduler  0/9 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/9 nodes are available: 9 No preemption victims found for incoming pod..
  Warning  FailedScheduling  4m45s  default-scheduler  0/9 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/9 nodes are available: 9 No preemption victims found for incoming pod..


Pod redis-cluster-0 failed to schedule for the following reasons:
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  10m                  default-scheduler  0/9 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/9 nodes are available: 9 No preemption victims found for incoming pod..
  Warning  FailedScheduling  4m45s (x2 over 10m)  default-scheduler  0/9 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/9 nodes are available: 9 No preemption victims found for incoming pod..

Pod redis-cluster-1 failed to schedule for the following reasons:
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  10m                  default-scheduler  0/9 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/9 nodes are available: 9 No preemption victims found for incoming pod..
  Warning  FailedScheduling  4m45s (x2 over 10m)  default-scheduler  0/9 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/9 nodes are available: 9 No preemption victims found for incoming pod..

(...)

As well as some events:

3m56s       Warning   ProvisioningFailed   persistentvolumeclaim/redis-data-redis-cluster-0                   storageclass.storage.k8s.io "this-one-doesn-t-exist" not found
3m56s       Warning   ProvisioningFailed   persistentvolumeclaim/redis-data-redis-cluster-1                   storageclass.storage.k8s.io "this-one-doesn-t-exist" not found
3m56s       Warning   ProvisioningFailed   persistentvolumeclaim/redis-data-redis-cluster-2                   storageclass.storage.k8s.io "this-one-doesn-t-exist" not found
3m56s       Warning   ProvisioningFailed   persistentvolumeclaim/redis-data-redis-cluster-3                   storageclass.storage.k8s.io "this-one-doesn-t-exist" not found
3m56s       Warning   ProvisioningFailed   persistentvolumeclaim/redis-data-redis-cluster-4                   storageclass.storage.k8s.io "this-one-doesn-t-exist" not found

These show the source of the error (a misconfigured storageclass).
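
For reference, the gist of such a check can be sketched in a few lines of shell (an illustrative sketch only, not the exact CI script in this PR; the namespace value is a placeholder):

  # Find pods stuck in Pending and print the scheduler events explaining why.
  NAMESPACE="test-uia95z8cusqa"   # placeholder; CI uses a per-run test namespace
  for pod in $(kubectl -n "$NAMESPACE" get pods --field-selector=status.phase=Pending -o name); do
    echo "Pod ${pod#pod/} failed to schedule for the following reasons:"
    # 'kubectl describe' ends with the Events section containing the FailedScheduling warnings
    kubectl -n "$NAMESPACE" describe "$pod" | sed -n '/^Events:/,$p'
  done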

@jschaul (Member, Author) commented Jul 6, 2023

And example logs for failing pods:

Checking pods in namespace 'test-dmirobvahzzn' that are crashlooping...

Pod brig-767b57d45f-j64jg is crashlooping for the following reasons:

brig: Network.Socket.getAddrInfo (called with preferred socket type/protocol: AddrInfo {addrFlags = [AI_ADDRCONFIG], addrFamily = AF_UNSPEC, addrSocketType = Stream, addrProtocol = 0, addrAddress = 0.0.0.0:0, addrCanonName = Nothing}, host name: Just "this-one-does-not-exist", service name: Just "9042"): does not exist (Name or service not known)


Pod galley-544c88fd6f-kprnk is crashlooping for the following reasons:

{"error":"ConnectionError (HttpExceptionRequest Request {\n  host                 = \"brig\"\n  port                 = 8080\n  secure               = False\n  requestHeaders       = [(\"Accept\",\"application/json;charset=utf-8,application/json\")]\n  path                 = \"/i/federation/remotes\"\n  queryString          = \"?\"\n  method               = \"GET\"\n  proxy                = Nothing\n  rawBody              = False\n  redirectCount        = 10\n  responseTimeout      = ResponseTimeoutDefault\n  requestVersion       = HTTP/1.1\n}\n (ConnectionFailure Network.Socket.connect: <socket: 11>: does not exist (Connection refused)))","level":"Info","msgs":["Failed to reach brig for federation setup, retrying..."]}

from https://concourse.ops.zinfra.io/builds/25541814
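
For reference, detecting crashlooping pods and dumping their most recent logs can be sketched roughly as follows (again an illustrative sketch, not the exact script; the namespace and tail length are placeholders):

  # List pods whose containers are waiting in CrashLoopBackOff and show the logs
  # of their last terminated container instance.
  NAMESPACE="test-dmirobvahzzn"   # placeholder; CI uses a per-run test namespace
  pods=$(kubectl -n "$NAMESPACE" get pods \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
    | awk '/CrashLoopBackOff/ {print $1}')
  for pod in $pods; do
    echo "Pod $pod is crashlooping for the following reasons:"
    kubectl -n "$NAMESPACE" logs "$pod" --previous --tail=20 || true
  done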

@akshaymankar (Member) left a comment

Overall looks good 🚀

Comment on lines +59 to +60
kubectl -n "$NAMESPACE_1" get events | grep -v "Normal "
kubectl -n "$NAMESPACE_2" get events | grep -v "Normal "
Reviewer (Member):

Grepping out Normal might create red herrings because I guess we won't know if things got better. How much noise does this save?

@jschaul (Member, Author):

A lot of noise - it's 95% Normal events. The events only show if helm install fails, and then these warnings/errors are likely indicative of a problem; even if they're sometimes unrelated to the failure, I suppose often there's something to it.
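
For reference, roughly the same filtering can also be done server-side with a field selector instead of grep (a sketch; output should be equivalent assuming only Normal and Warning event types occur):

  kubectl -n "$NAMESPACE_1" get events --field-selector type!=Normal
  kubectl -n "$NAMESPACE_2" get events --field-selector type!=Normal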

Reviewer (Member):

Makes sense :shipit:

@jschaul merged commit c675bf0 into develop on Jul 6, 2023
@jschaul deleted the debug-info-ci-failures branch on Jul 6, 2023 at 13:08