
Error starting operator with K8s v1.31.0 (429 Message: storage is (re)initializing) #2526

@Donnerbart

Description

Bug Report

What did you do?

I've updated the Docker image for our operator integration tests to rancher/k3s:v1.31.0-k3s1.

What did you expect to see?

I expected our tests to continue to work.

What did you see instead? Under which circumstances?

In our integration tests, the operator no longer starts and fails with the following stack trace:

Failed to start quarkus
java.lang.RuntimeException: Failed to start quarkus
	at io.quarkus.runner.ApplicationImpl.doStart(Unknown Source)
	at io.quarkus.runtime.Application.start(Application.java:101)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at io.quarkus.runner.bootstrap.StartupActionImpl.run(StartupActionImpl.java:305)
	at io.quarkus.test.junit.QuarkusTestExtension.doJavaStart(QuarkusTestExtension.java:248)
	at io.quarkus.test.junit.QuarkusTestExtension.ensureStarted(QuarkusTestExtension.java:601)
	at io.quarkus.test.junit.QuarkusTestExtension.beforeAll(QuarkusTestExtension.java:651)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
Caused by: io.javaoperatorsdk.operator.OperatorException: Error starting operator
	at io.javaoperatorsdk.operator.Operator.start(Operator.java:166)
	at io.javaoperatorsdk.operator.OperatorProducer_ProducerMethod_operator_4a13E9CyUpNLli_Z-91F5zETApk_ClientProxy.start(Unknown Source)
	at io.quarkiverse.operatorsdk.runtime.AppEventListener.onStartup(AppEventListener.java:28)
	at io.quarkiverse.operatorsdk.runtime.AppEventListener_Observer_onStartup_lQpwZkDdRQXLkHyaStWxDBo99eQ.notify(Unknown Source)
	at io.quarkus.arc.impl.EventImpl$Notifier.notifyObservers(EventImpl.java:351)
	at io.quarkus.arc.impl.EventImpl$Notifier.notify(EventImpl.java:333)
	at io.quarkus.arc.impl.EventImpl.fire(EventImpl.java:80)
	at io.quarkus.arc.runtime.ArcRecorder.fireLifecycleEvent(ArcRecorder.java:156)
	at io.quarkus.arc.runtime.ArcRecorder.handleLifecycleEvents(ArcRecorder.java:107)
	at io.quarkus.deployment.steps.LifecycleEventsBuildStep$startupEvent1144526294.deploy_0(Unknown Source)
	at io.quarkus.deployment.steps.LifecycleEventsBuildStep$startupEvent1144526294.deploy(Unknown Source)
	... 8 more
Caused by: io.javaoperatorsdk.operator.OperatorException: java.util.concurrent.ExecutionException: io.javaoperatorsdk.operator.OperatorException: Couldn't start source ControllerResourceEventSource
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.lambda$executeAndWaitForAllToComplete$2(ExecutorServiceManager.java:72)
	at java.base@21.0.4/java.util.ArrayList.forEach(ArrayList.java:1596)
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.executeAndWaitForAllToComplete(ExecutorServiceManager.java:67)
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.boundedExecuteAndWaitForAllToComplete(ExecutorServiceManager.java:47)
	at io.javaoperatorsdk.operator.ControllerManager.start(ControllerManager.java:44)
	at io.javaoperatorsdk.operator.Operator.start(Operator.java:161)
	... 18 more
Caused by: java.util.concurrent.ExecutionException: io.javaoperatorsdk.operator.OperatorException: Couldn't start source ControllerResourceEventSource
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.lambda$executeAndWaitForAllToComplete$2(ExecutorServiceManager.java:70)
	... 23 more
Caused by: io.javaoperatorsdk.operator.OperatorException: Couldn't start source ControllerResourceEventSource
	at io.javaoperatorsdk.operator.processing.event.EventSourceManager.startEventSource(EventSourceManager.java:127)
	at io.javaoperatorsdk.operator.processing.event.EventSourceManager.start(EventSourceManager.java:68)
	at io.javaoperatorsdk.operator.processing.Controller.start(Controller.java:344)
	at io.javaoperatorsdk.operator.ControllerManager.lambda$start$0(ControllerManager.java:45)
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.lambda$executeAndWaitForAllToComplete$0(ExecutorServiceManager.java:61)
	at java.base@21.0.4/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base@21.0.4/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base@21.0.4/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base@21.0.4/java.lang.Thread.run(Thread.java:1583)
Caused by: io.javaoperatorsdk.operator.OperatorException: java.util.concurrent.ExecutionException: io.javaoperatorsdk.operator.OperatorException: Couldn't start informer for hivemq-platforms.hivemq.com/v1 resources
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.lambda$executeAndWaitForAllToComplete$2(ExecutorServiceManager.java:72)
	at java.base@21.0.4/java.util.ArrayList.forEach(ArrayList.java:1596)
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.executeAndWaitForAllToComplete(ExecutorServiceManager.java:67)
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.boundedExecuteAndWaitForAllToComplete(ExecutorServiceManager.java:47)
	at io.javaoperatorsdk.operator.processing.event.source.informer.InformerManager.start(InformerManager.java:59)
	at io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource.start(ManagedInformerEventSource.java:87)
	at io.javaoperatorsdk.operator.processing.event.source.controller.ControllerResourceEventSource.start(ControllerResourceEventSource.java:60)
	at io.javaoperatorsdk.operator.processing.event.NamedEventSource.start(NamedEventSource.java:27)
	at io.javaoperatorsdk.operator.processing.event.EventSourceManager.startEventSource(EventSourceManager.java:122)
	... 8 more
Caused by: java.util.concurrent.ExecutionException: io.javaoperatorsdk.operator.OperatorException: Couldn't start informer for hivemq-platforms.hivemq.com/v1 resources
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at io.javaoperatorsdk.operator.api.config.ExecutorServiceManager.lambda$executeAndWaitForAllToComplete$2(ExecutorServiceManager.java:70)
	... 16 more
Caused by: io.javaoperatorsdk.operator.OperatorException: Couldn't start informer for hivemq-platforms.hivemq.com/v1 resources
	at io.javaoperatorsdk.operator.processing.event.source.informer.InformerWrapper.start(InformerWrapper.java:111)
	at io.javaoperatorsdk.operator.processing.event.source.informer.InformerManager.lambda$start$0(InformerManager.java:62)
	... 5 more
Caused by: io.javaoperatorsdk.operator.OperatorException: java.util.concurrent.ExecutionException: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://localhost:32964/apis/hivemq.com/v1/hivemq-platforms?resourceVersion=0. Message: storage is (re)initializing. Received status: Status(apiVersion=v1, code=429, details=StatusDetails(causes=[], group=null, kind=null, name=null, retryAfterSeconds=1, uid=null, additionalProperties={}), kind=Status, message=storage is (re)initializing, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=TooManyRequests, status=Failure, additionalProperties={}).
	at io.javaoperatorsdk.operator.processing.event.source.informer.InformerWrapper.start(InformerWrapper.java:95)
	... 6 more
Caused by: java.util.concurrent.ExecutionException: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://localhost:32964/apis/hivemq.com/v1/hivemq-platforms?resourceVersion=0. Message: storage is (re)initializing. Received status: Status(apiVersion=v1, code=429, details=StatusDetails(causes=[], group=null, kind=null, name=null, retryAfterSeconds=1, uid=null, additionalProperties={}), kind=Status, message=storage is (re)initializing, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=TooManyRequests, status=Failure, additionalProperties={}).
	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2096)
	at io.javaoperatorsdk.operator.processing.event.source.informer.InformerWrapper.start(InformerWrapper.java:88)
	... 6 more
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://localhost:32964/apis/hivemq.com/v1/hivemq-platforms?resourceVersion=0. Message: storage is (re)initializing. Received status: Status(apiVersion=v1, code=429, details=StatusDetails(causes=[], group=null, kind=null, name=null, retryAfterSeconds=1, uid=null, additionalProperties={}), kind=Status, message=storage is (re)initializing, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=TooManyRequests, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:660)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:640)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:589)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:549)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
	at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:142)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
	at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:51)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
	at java.base@21.0.4/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
	at io.fabric8.kubernetes.client.vertx.VertxHttpRequest.lambda$null$1(VertxHttpRequest.java:121)
	at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:279)
	at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:261)
	at io.vertx.core.http.impl.HttpEventHandler.handleEnd(HttpEventHandler.java:76)
	at io.vertx.core.http.impl.HttpClientResponseImpl.handleEnd(HttpClientResponseImpl.java:250)
	at io.vertx.core.http.impl.Http1xClientConnection$StreamImpl.lambda$new$0(Http1xClientConnection.java:421)
	at io.vertx.core.streams.impl.InboundBuffer.handleEvent(InboundBuffer.java:279)
	at io.vertx.core.streams.impl.InboundBuffer.write(InboundBuffer.java:157)
	at io.vertx.core.http.impl.Http1xClientConnection$StreamImpl.handleEnd(Http1xClientConnection.java:709)
	at io.vertx.core.impl.ContextImpl.lambda$execute$4(ContextImpl.java:315)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	... 1 more

Environment

Kubernetes cluster type:

A custom Testcontainers class OperatorK3sContainer extends K3sContainer, which we switched from rancher/k3s:v1.30.4-k3s1 to rancher/k3s:v1.31.0-k3s1. The container is started in a custom QuarkusTestResourceLifecycleManager:

@QuarkusTest
@QuarkusTestResource(OperatorTestResourceLifecycleManager.class)

java-operator-sdk version (from the build logs):

2024-09-05 13:19:33,260 INFO  [io.qua.ope.dep.VersionAlignmentCheckingStep] (build-6) QOSDK: 6.7.3 (commit: 2f026a2) built on Fri Aug 09 22:01:53 CEST 2024
2024-09-05 13:19:33,270 INFO  [io.qua.ope.dep.VersionAlignmentCheckingStep] (build-6) JOSDK: 4.9.2 (commit: 841bf79) built on Fri May 31 21:44:13 CEST 2024

$ java -version

openjdk version "21.0.4" 2024-07-16
OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu224.04)
OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu224.04, mixed mode, sharing)

$ kubectl version

Client Version: v1.31.0
Kustomize Version: v5.4.2

Possible Solution

I think there might be a missing retry in the watcher/informer logic.

As a workaround, disabling this feature gate helps: --kube-apiserver-arg=feature-gates=ResilientWatchCacheInitialization=false (see additional context). But this only reduces the likelihood of the error occurring (1/22 test jobs succeeded -> 20/22 test jobs succeeded).
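For reference, the workaround can be wired into the Testcontainers setup roughly like this. This is a sketch of a hypothetical version of our OperatorK3sContainer; the configure() override mirrors how K3sContainer builds its k3s server command internally, but those internals may differ between Testcontainers versions:

```java
import org.testcontainers.k3s.K3sContainer;
import org.testcontainers.utility.DockerImageName;

// Hypothetical sketch of the OperatorK3sContainer mentioned above.
public class OperatorK3sContainer extends K3sContainer {

    public OperatorK3sContainer() {
        super(DockerImageName.parse("rancher/k3s:v1.31.0-k3s1"));
    }

    @Override
    protected void configure() {
        super.configure();
        // Re-set the k3s command with the extra apiserver flag appended:
        // disabling ResilientWatchCacheInitialization restores the pre-1.31
        // behavior of letting list/watch requests hang until the watch cache
        // is ready, instead of rejecting them with 429.
        setCommand("server",
                "--disable=traefik",
                "--tls-san=" + getHost(),
                "--kube-apiserver-arg=feature-gates=ResilientWatchCacheInitialization=false");
    }
}
```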

Additional context

I first stumbled across this "Fix flaking TestStoreListResourceVersion" K8s PR since it has the same error message.

After a bit of investigation, I think the root cause is KEP-4568: Resilient watchcache initialization, which was implemented in this PR.

ResilientWatchCacheInitialization - handling the changes to returning 429 errors instead of keeping the requests hanging

https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4568-resilient-watchcache-initialization#risks-and-mitigations

I'm not sure about a JOSDK standalone test setup, but I think with Quarkus the operator might be started concurrently with or even before the test resources. My assumption is that with older K8s versions the CRD watcher simply hung until the storage was initialized, whereas now the request is rejected right away with a 429 "storage is (re)initializing".

I further assume this needs retry logic in the operator SDK, because the 429 is probably not handled by the fabric8 Kubernetes client, only by the Go K8s client.
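To illustrate, here is a minimal sketch of the kind of retry the SDK could apply around its initial list call. ApiException and the fixed backoff are assumptions for the sketch, not JOSDK or fabric8 API:

```java
import java.util.concurrent.Callable;

public class RetryOn429 {

    // Hypothetical stand-in for fabric8's KubernetesClientException,
    // which exposes the HTTP status code of a failed request.
    static class ApiException extends RuntimeException {
        final int code;

        ApiException(int code, String message) {
            super(message);
            this.code = code;
        }
    }

    // Retry a list/watch call while the API server answers 429
    // (TooManyRequests, e.g. "storage is (re)initializing" while the
    // watch cache is still warming up), up to maxAttempts.
    static <T> T withRetryOn429(Callable<T> call, int maxAttempts, long backoffMillis)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (ApiException e) {
                if (e.code != 429 || attempt >= maxAttempts) {
                    throw e; // not retryable, or retry budget exhausted
                }
                // A real implementation should honor the server's
                // retryAfterSeconds hint instead of a fixed backoff.
                Thread.sleep(backoffMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate an API server that rejects the first two list requests
        // with 429 and then serves the third one.
        int[] calls = {0};
        String result = withRetryOn429(() -> {
            if (++calls[0] < 3) {
                throw new ApiException(429, "storage is (re)initializing");
            }
            return "list succeeded";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```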
