-
Notifications
You must be signed in to change notification settings - Fork 235
Description
Bug Report
What did you do?
I've updated the Docker image for our operator integration tests to rancher/k3s:v1.31.0-k3s1.
What did you expect to see?
I expected our tests to continue to work.
What did you see instead? Under which circumstances?
In our integration tests the operator doesn't start anymore, throwing the following stacktrace:
Failed to start quarkus
java.lang.RuntimeException: Failed to start quarkus
at io.quarkus.runner.ApplicationImpl.doStart(Unknown Source)
at io.quarkus.runtime.Application.start(Application.java:101)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at io.quarkus.runner.bootstrap.StartupActionImpl.run(StartupActionImpl.java:305)
at io.quarkus.test.junit.QuarkusTestExtension.doJavaStart(QuarkusTestExtension.java:248)
at io.quarkus.test.junit.QuarkusTestExtension.ensureStarted(QuarkusTestExtension.java:601)
at io.quarkus.test.junit.QuarkusTestExtension.beforeAll(QuarkusTestExtension.java:651)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
Caused by: io.javaoperatorsdk.operator.OperatorException: Error starting operator
at io.javaoperatorsdk.operator.Operator.start(Operator.java:166)
at io.javaoperatorsdk.operator.OperatorProducer_ProducerMethod_operator_4a13E9CyUpNLli_Z-91F5zETApk_ClientProxy.start(Unknown Source)
at io.quarkiverse.operatorsdk.runtime.AppEventListener.onStartup(AppEventListener.java:28)
at io.quarkiverse.operatorsdk.runtime.AppEventListener_Observer_onStartup_lQpwZkDdRQXLkHyaStWxDBo99eQ.notify(Unknown Source)
at io.quarkus.arc.impl.EventImpl$Notifier.notifyObservers(EventImpl.java:351)
at io.quarkus.arc.impl.EventImpl$Notifier.notify(EventImpl.java:333)
at io.quarkus.arc.impl.EventImpl.fire(EventImpl.java:80)
at io.quarkus.arc.runtime.ArcRecorder.fireLifecycleEvent(ArcRecorder.java:156)
at io.quarkus.arc.runtime.ArcRecorder.handleLifecycleEvents(ArcRecorder.java:107)
at io.quarkus.deployment.steps.LifecycleEventsBuildStep$startupEvent1144526294.deploy_0(Unknown Source)
at io.quarkus.deployment.steps.LifecycleEventsBuildStep$startupEvent1144526294.deploy(Unknown Source)
... 8 more
Caused by: io.javaoperatorsdk.operator.OperatorException: java.util.concurrent.ExecutionException: io.javaoperatorsdk.operator.OperatorException: Couldn't start source ControllerResourceEventSource
at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.lambda$executeAndWaitForAllToComplete$2(ExecutorServiceManager.java:72)
at java.base@21.0.4/java.util.ArrayList.forEach(ArrayList.java:1596)
at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.executeAndWaitForAllToComplete(ExecutorServiceManager.java:67)
at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.boundedExecuteAndWaitForAllToComplete(ExecutorServiceManager.java:47)
at io.javaoperatorsdk.operator.ControllerManager.start(ControllerManager.java:44)
at io.javaoperatorsdk.operator.Operator.start(Operator.java:161)
... 18 more
Caused by: java.util.concurrent.ExecutionException: io.javaoperatorsdk.operator.OperatorException: Couldn't start source ControllerResourceEventSource
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.lambda$executeAndWaitForAllToComplete$2(ExecutorServiceManager.java:70)
... 23 more
Caused by: io.javaoperatorsdk.operator.OperatorException: Couldn't start source ControllerResourceEventSource
at io.javaoperatorsdk.operator.processing.event.EventSourceManager.startEventSource(EventSourceManager.java:127)
at io.javaoperatorsdk.operator.processing.event.EventSourceManager.start(EventSourceManager.java:68)
at io.javaoperatorsdk.operator.processing.Controller.start(Controller.java:344)
at io.javaoperatorsdk.operator.ControllerManager.lambda$start$0(ControllerManager.java:45)
at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.lambda$executeAndWaitForAllToComplete$0(ExecutorServiceManager.java:61)
at java.base@21.0.4/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at java.base@21.0.4/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base@21.0.4/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base@21.0.4/java.lang.Thread.run(Thread.java:1583)
Caused by: io.javaoperatorsdk.operator.OperatorException: java.util.concurrent.ExecutionException: io.javaoperatorsdk.operator.OperatorException: Couldn't start informer for hivemq-platforms.hivemq.com/v1 resources
at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.lambda$executeAndWaitForAllToComplete$2(ExecutorServiceManager.java:72)
at java.base@21.0.4/java.util.ArrayList.forEach(ArrayList.java:1596)
at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.executeAndWaitForAllToComplete(ExecutorServiceManager.java:67)
at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.boundedExecuteAndWaitForAllToComplete(ExecutorServiceManager.java:47)
at io.javaoperatorsdk.operator.processing.event.source.informer.InformerManager.start(InformerManager.java:59)
at io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource.start(ManagedInformerEventSource.java:87)
at io.javaoperatorsdk.operator.processing.event.source.controller.ControllerResourceEventSource.start(ControllerResourceEventSource.java:60)
at io.javaoperatorsdk.operator.processing.event.NamedEventSource.start(NamedEventSource.java:27)
at io.javaoperatorsdk.operator.processing.event.EventSourceManager.startEventSource(EventSourceManager.java:122)
... 8 more
Caused by: java.util.concurrent.ExecutionException: io.javaoperatorsdk.operator.OperatorException: Couldn't start informer for hivemq-platforms.hivemq.com/v1 resources
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.lambda$executeAndWaitForAllToComplete$2(ExecutorServiceManager.java:70)
... 16 more
Caused by: io.javaoperatorsdk.operator.OperatorException: Couldn't start informer for hivemq-platforms.hivemq.com/v1 resources
at io.javaoperatorsdk.operator.processing.event.source.informer.InformerWrapper.start(InformerWrapper.java:111)
at io.javaoperatorsdk.operator.processing.event.source.informer.InformerManager.lambda$start$0(InformerManager.java:62)
... 5 more
Caused by: io.javaoperatorsdk.operator.OperatorException: java.util.concurrent.ExecutionException: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://localhost:32964/apis/hivemq.com/v1/hivemq-platforms?resourceVersion=0. Message: storage is (re)initializing. Received status: Status(apiVersion=v1, code=429, details=StatusDetails(causes=[], group=null, kind=null, name=null, retryAfterSeconds=1, uid=null, additionalProperties={}), kind=Status, message=storage is (re)initializing, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=TooManyRequests, status=Failure, additionalProperties={}).
at io.javaoperatorsdk.operator.processing.event.source.informer.InformerWrapper.start(InformerWrapper.java:95)
... 6 more
Caused by: java.util.concurrent.ExecutionException: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://localhost:32964/apis/hivemq.com/v1/hivemq-platforms?resourceVersion=0. Message: storage is (re)initializing. Received status: Status(apiVersion=v1, code=429, details=StatusDetails(causes=[], group=null, kind=null, name=null, retryAfterSeconds=1, uid=null, additionalProperties={}), kind=Status, message=storage is (re)initializing, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=TooManyRequests, status=Failure, additionalProperties={}).
at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2096)
at io.javaoperatorsdk.operator.processing.event.source.informer.InformerWrapper.start(InformerWrapper.java:88)
... 6 more
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://localhost:32964/apis/hivemq.com/v1/hivemq-platforms?resourceVersion=0. Message: storage is (re)initializing. Received status: Status(apiVersion=v1, code=429, details=StatusDetails(causes=[], group=null, kind=null, name=null, retryAfterSeconds=1, uid=null, additionalProperties={}), kind=Status, message=storage is (re)initializing, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=TooManyRequests, status=Failure, additionalProperties={}).
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:660)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:640)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:589)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:549)
at java.base@21.0.4/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
at java.base@21.0.4/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
at java.base@21.0.4/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:142)
at java.base@21.0.4/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
at java.base@21.0.4/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
at java.base@21.0.4/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
at java.base@21.0.4/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:51)
at java.base@21.0.4/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
at java.base@21.0.4/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
at java.base@21.0.4/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
at java.base@21.0.4/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
at io.fabric8.kubernetes.client.vertx.VertxHttpRequest.lambda$null$1(VertxHttpRequest.java:121)
at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:279)
at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:261)
at io.vertx.core.http.impl.HttpEventHandler.handleEnd(HttpEventHandler.java:76)
at io.vertx.core.http.impl.HttpClientResponseImpl.handleEnd(HttpClientResponseImpl.java:250)
at io.vertx.core.http.impl.Http1xClientConnection$StreamImpl.lambda$new$0(Http1xClientConnection.java:421)
at io.vertx.core.streams.impl.InboundBuffer.handleEvent(InboundBuffer.java:279)
at io.vertx.core.streams.impl.InboundBuffer.write(InboundBuffer.java:157)
at io.vertx.core.http.impl.Http1xClientConnection$StreamImpl.handleEnd(Http1xClientConnection.java:709)
at io.vertx.core.impl.ContextImpl.lambda$execute$4(ContextImpl.java:315)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 moreEnvironment
Kubernetes cluster type:
A custom testcontainer class OperatorK3sContainer extends K3sContainer switched from v1.30.4-k3s1 to rancher/k3s:v1.31.0-k3s1. This container is started in a custom QuarkusTestResourceLifecycleManager:
@QuarkusTest
@QuarkusTestResource(OperatorTestResourceLifecycleManager.class)
$ Mention java-operator-sdk version from pom.xml file
2024-09-05 13:19:33,260 INFO [io.qua.ope.dep.VersionAlignmentCheckingStep] (build-6) QOSDK: 6.7.3 (commit: 2f026a2) built on Fri Aug 09 22:01:53 CEST 2024
2024-09-05 13:19:33,270 INFO [io.qua.ope.dep.VersionAlignmentCheckingStep] (build-6) JOSDK: 4.9.2 (commit: 841bf79) built on Fri May 31 21:44:13 CEST 2024
$ java -version
openjdk version "21.0.4" 2024-07-16
OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu224.04)
OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu224.04, mixed mode, sharing)
$ kubectl version
Client Version: v1.31.0
Kustomize Version: v5.4.2
Possible Solution
I think there might be a missing retry in the watcher/informer logic.
As a workaround it helps to disable this feature gate: --kube-apiserver-arg=feature-gates=ResilientWatchCacheInitialization=false (see additional context). But this only reduces the likelihood for the error to occur (1/22 test jobs succeeded -> 20/22 test jobs succeeded).
Additional context
I first stumbled across this "Fix flaking TestStoreListResourceVersion" K8s PR since it has the same error message.
After a bit of investigation I think the root cause is KEP-4568: Resilient watchcache initialization that was implemented in this PR.
ResilientWatchCacheInitialization - handling the changes to returning 429 errors instead of keeping the requests hanging
I'm not sure about a JOSDK standalone test setup, but I think with Quarkus the operator might be started concurrently or even before the test resources. My assumption is that with older K8s versions the CRD watcher was just hanging until the storage was initialized. And now it is rejected right away with a 429 Message: storage is (re)initializing.
I further assume that this needs a retry logic in the operator SDK, because this is probably not handled by the fabric8 K8s client, only by the GoLang K8s client.