Conversation

@jschaul (Member) commented Jul 5, 2023

https://wearezeta.atlassian.net/browse/WPB-3003

Checklist

  • Add a new entry in an appropriate subdirectory of changelog.d
  • Read and follow the PR guidelines

@zebot added the ok-to-test label (Approved for running tests in CI; overrides not-ok-to-test if both labels exist) on Jul 5, 2023
@jschaul (Member, Author) commented Jul 6, 2023

From https://concourse.ops.zinfra.io/builds/25484828, an example log when pods can't be scheduled and are stuck in "Pending":

Checking pods in namespace 'test-uia95z8cusqa' that failed to schedule...

Pod rabbitmq-0 failed to schedule for the following reasons:

Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  10m    default-scheduler  0/9 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/9 nodes are available: 9 No preemption victims found for incoming pod..
  Warning  FailedScheduling  4m45s  default-scheduler  0/9 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/9 nodes are available: 9 No preemption victims found for incoming pod..


Pod redis-cluster-0 failed to schedule for the following reasons:
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  10m                  default-scheduler  0/9 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/9 nodes are available: 9 No preemption victims found for incoming pod..
  Warning  FailedScheduling  4m45s (x2 over 10m)  default-scheduler  0/9 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/9 nodes are available: 9 No preemption victims found for incoming pod..

Pod redis-cluster-1 failed to schedule for the following reasons:
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  10m                  default-scheduler  0/9 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/9 nodes are available: 9 No preemption victims found for incoming pod..
  Warning  FailedScheduling  4m45s (x2 over 10m)  default-scheduler  0/9 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/9 nodes are available: 9 No preemption victims found for incoming pod..

(...)

As well as some events:

3m56s       Warning   ProvisioningFailed   persistentvolumeclaim/redis-data-redis-cluster-0                   storageclass.storage.k8s.io "this-one-doesn-t-exist" not found
3m56s       Warning   ProvisioningFailed   persistentvolumeclaim/redis-data-redis-cluster-1                   storageclass.storage.k8s.io "this-one-doesn-t-exist" not found
3m56s       Warning   ProvisioningFailed   persistentvolumeclaim/redis-data-redis-cluster-2                   storageclass.storage.k8s.io "this-one-doesn-t-exist" not found
3m56s       Warning   ProvisioningFailed   persistentvolumeclaim/redis-data-redis-cluster-3                   storageclass.storage.k8s.io "this-one-doesn-t-exist" not found
3m56s       Warning   ProvisioningFailed   persistentvolumeclaim/redis-data-redis-cluster-4                   storageclass.storage.k8s.io "this-one-doesn-t-exist" not found

These show the source of the error (a misconfigured storageclass).
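
For reference, the gist of such a check can be sketched in a few lines of shell (an illustrative sketch only, not the exact CI script in this PR; the namespace value is a placeholder):

  # Find pods stuck in Pending and print the scheduler events explaining why.
  NAMESPACE="test-uia95z8cusqa"   # placeholder; CI uses a per-run test namespace
  for pod in $(kubectl -n "$NAMESPACE" get pods --field-selector=status.phase=Pending -o name); do
    echo "Pod ${pod#pod/} failed to schedule for the following reasons:"
    # 'kubectl describe' ends with the Events section containing the FailedScheduling warnings
    kubectl -n "$NAMESPACE" describe "$pod" | sed -n '/^Events:/,$p'
  done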

@jschaul (Member, Author) commented Jul 6, 2023

And example logs for failing pods:

Checking pods in namespace 'test-dmirobvahzzn' that are crashlooping...

Pod brig-767b57d45f-j64jg is crashlooping for the following reasons:

brig: Network.Socket.getAddrInfo (called with preferred socket type/protocol: AddrInfo {addrFlags = [AI_ADDRCONFIG], addrFamily = AF_UNSPEC, addrSocketType = Stream, addrProtocol = 0, addrAddress = 0.0.0.0:0, addrCanonName = Nothing}, host name: Just "this-one-does-not-exist", service name: Just "9042"): does not exist (Name or service not known)


Pod galley-544c88fd6f-kprnk is crashlooping for the following reasons:

{"error":"ConnectionError (HttpExceptionRequest Request {\n  host                 = \"brig\"\n  port                 = 8080\n  secure               = False\n  requestHeaders       = [(\"Accept\",\"application/json;charset=utf-8,application/json\")]\n  path                 = \"/i/federation/remotes\"\n  queryString          = \"?\"\n  method               = \"GET\"\n  proxy                = Nothing\n  rawBody              = False\n  redirectCount        = 10\n  responseTimeout      = ResponseTimeoutDefault\n  requestVersion       = HTTP/1.1\n}\n (ConnectionFailure Network.Socket.connect: <socket: 11>: does not exist (Connection refused)))","level":"Info","msgs":["Failed to reach brig for federation setup, retrying..."]}

from https://concourse.ops.zinfra.io/builds/25541814
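
For reference, detecting crashlooping pods and dumping their most recent logs can be sketched roughly as follows (again an illustrative sketch, not the exact script; the namespace and tail length are placeholders):

  # List pods whose containers are waiting in CrashLoopBackOff and show the logs
  # of their last terminated container instance.
  NAMESPACE="test-dmirobvahzzn"   # placeholder; CI uses a per-run test namespace
  pods=$(kubectl -n "$NAMESPACE" get pods \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
    | awk '/CrashLoopBackOff/ {print $1}')
  for pod in $pods; do
    echo "Pod $pod is crashlooping for the following reasons:"
    kubectl -n "$NAMESPACE" logs "$pod" --previous --tail=20 || true
  done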

@akshaymankar (Member) left a comment

Overall looks good 🚀

Comment on lines +59 to +60
kubectl -n "$NAMESPACE_1" get events | grep -v "Normal "
kubectl -n "$NAMESPACE_2" get events | grep -v "Normal "
Reviewer (Member):

Grepping out Normal might create red herrings because I guess we won't know if things got better. How much noise does this save?

@jschaul (Member, Author):

A lot of noise - it's 95% Normal events. The events only show if helm install fails, and then these warnings/errors are likely indicative of a problem; even if they're sometimes unrelated to the failure, I suppose often there's something to it.
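
For reference, roughly the same filtering can also be done server-side with a field selector instead of grep (a sketch; output should be equivalent assuming only Normal and Warning event types occur):

  kubectl -n "$NAMESPACE_1" get events --field-selector type!=Normal
  kubectl -n "$NAMESPACE_2" get events --field-selector type!=Normal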

Reviewer (Member):

Makes sense :shipit:

@jschaul merged commit c675bf0 into develop on Jul 6, 2023
@jschaul deleted the debug-info-ci-failures branch on Jul 6, 2023 at 13:08