Metrics collector fails to create watcher #1769

@drawesomenic

/kind bug

What steps did you take and what happened:
I started Katib runs using Kale. Roughly half of the pipelines succeed and the other half fail, seemingly at random, with the following error in the "metrics-logger-and-collector" container:

Mon, Jan 10 2022 4:07:47 pm | I0110 15:07:47.414005 20 main.go:342] Trial Name: test-dev-blo6q-ptpgnwzg
Mon, Jan 10 2022 4:07:47 pm | 2022/01/10 15:07:47 FATAL -- failed to create Watcher
Mon, Jan 10 2022 4:07:47 pm | goroutine 34 [running]:
Mon, Jan 10 2022 4:07:47 pm | runtime/debug.Stack()
Mon, Jan 10 2022 4:07:47 pm | /usr/local/go/src/runtime/debug/stack.go:24 +0x65
Mon, Jan 10 2022 4:07:47 pm | github.com/hpcloud/tail/util.Fatal({0xcc1a11, 0x0}, {0x0, 0x0, 0x0})
Mon, Jan 10 2022 4:07:47 pm | /go/pkg/mod/github.com/hpcloud/[email protected]/util/util.go:22 +0x97
Mon, Jan 10 2022 4:07:47 pm | github.com/hpcloud/tail/watch.(*InotifyTracker).run(0xc0000bc000)
Mon, Jan 10 2022 4:07:47 pm | /go/pkg/mod/github.com/hpcloud/[email protected]/watch/inotify_tracker.go:219 +0x68
Mon, Jan 10 2022 4:07:47 pm | created by github.com/hpcloud/tail/watch.glob..func1
Mon, Jan 10 2022 4:07:47 pm | /go/pkg/mod/github.com/hpcloud/[email protected]/watch/inotify_tracker.go:54 +0x173
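
A note for triage: the trace shows the fatal message being raised from `InotifyTracker.run` in `github.com/hpcloud/tail`, i.e. while the tail library sets up its inotify-based file watcher. One possible cause, which nothing in this report confirms, is that the node has reached a per-user inotify limit. The Python sketch below is only a diagnostic aid under that assumption. It reads the standard Linux limit files under `/proc/sys/fs/inotify`; the helper names `parse_limit` and `read_inotify_limits` are illustrative and not part of Katib or Kale.

```python
from pathlib import Path

# Standard Linux procfs location of the per-user inotify limits.
INOTIFY_DIR = Path("/proc/sys/fs/inotify")
LIMIT_NAMES = ("max_user_instances", "max_user_watches", "max_queued_events")


def parse_limit(text):
    """Parse the contents of one limit file (a single integer plus newline)."""
    return int(text.strip())


def read_inotify_limits(base=INOTIFY_DIR):
    """Return {limit_name: value} for every limit file found under base.

    Hypothetical helper for diagnosis only; files that do not exist
    (for example on a non-Linux host) are skipped.
    """
    limits = {}
    for name in LIMIT_NAMES:
        path = Path(base) / name
        if path.is_file():
            limits[name] = parse_limit(path.read_text())
    return limits


if __name__ == "__main__":
    # Run this on the node (or in a pod on the node) that hosts the failing trial.
    for name, value in read_inotify_limits().items():
        print(f"{name} = {value}")
```

If `max_user_instances` turns out to be small relative to the number of pods on the node that tail files, that would be consistent with the intermittent failures described here, but it would still need to be verified on the affected cluster.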

What did you expect to happen:
No error should be thrown. In the pipelines that succeed, the container shows normal output instead:

Wed, Jan 5 2022 8:26:38 pm | I0105 19:26:37.970244 16 main.go:342] Trial Name: test-dev-gtbb0-847s8svl
Wed, Jan 5 2022 8:26:39 pm | I0105 19:26:39.075769 16 main.go:136] 2022-01-05 19:26:39 Kale kfputils:176 [INFO] Creating KFP experiment 'test-dev-gtbb0'...

Anything else you would like to add:
I also tried increasing the resources via katib-config, but that did not resolve the issue. The error is not tied to specific pipeline parameters; it occurs randomly. The workflow itself completes successfully, but because the "metrics-logger-and-collector" container fails, the related job and trial fail as well.

Environment:

  • Katib version (check the Katib controller image version): 0.12.0
  • Kubernetes version: (kubectl version):
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.9", GitCommit:"7a576bc3935a6b555e33346fd73ad77c925e9e4a", GitTreeState:"clean", BuildDate:"2021-07-15T20:56:38Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}
  • OS (uname -a): Linux dashboard-shell-w5nrd 5.4.0-88-generic 99-Ubuntu SMP Thu Sep 23 17:29:00 UTC 2021 x86_64 Linux

Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍
