
Failed to execute job because missing mist_worker.jar #528

@zero88

Description


Context

  • Stack: 1 Spark master (without Mist) + 1 Spark worker (without Mist) + 1 Mist master
  • Spark version: 2.4.0
  • Mist version: 1.1.0

Mist context configuration:
downtime="3600s"
max-conn-failures=5
max-parallel-jobs=1
precreated=false
run-options=""
spark-conf {
    "spark.master"="spark://spark-master:7077"
    "spark.submit.deployMode"="cluster"
    "spark.dynamicAllocation.enabled"="true"
    "spark.shuffle.service.enabled"="true"
}
streaming-duration="1s"
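
Note: with spark.master pointing at the standalone master and spark.submit.deployMode=cluster, the driver for the Mist worker is launched on one of the Spark worker nodes, and that node has to read the application jar from its own filesystem. The two settings that matter here, annotated (same values as above; the comments are my reading of the behaviour, not taken from the Mist sources):

spark-conf {
    # Standalone master decides which worker node runs the driver.
    "spark.master"="spark://spark-master:7077"
    # Cluster mode: the driver (the Mist worker JVM) is started on a Spark
    # worker node, which then copies the application jar from a local path
    # on that node (/opt/mist/mist-worker.jar in the log below).
    "spark.submit.deployMode"="cluster"
}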

Log

mist_1           | 2018-11-09 10:59:42,857 INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started
spark-master_1   | 2018-11-09 10:59:42,937 INFO  org.apache.spark.deploy.master.Master Registering worker 29c1c06e51e3:9099 with 8 cores, 13.7 GB RAM
spark-worker_1   | 2018-11-09 10:59:42,966 INFO  org.apache.spark.deploy.worker.Worker Successfully registered with master spark://spark-master:7077
mist_1           | 2018-11-09 10:59:43,184 INFO  akka.remote.Remoting Starting remoting
mist_1           | 2018-11-09 10:59:43,412 INFO  akka.remote.Remoting Remoting started; listening on addresses :[akka.tcp://[email protected]:2551]
mist_1           | 2018-11-09 10:59:43,521 INFO  org.flywaydb.core.internal.util.VersionPrinter Flyway 4.1.1 by Boxfuse
mist_1           | 2018-11-09 10:59:43,826 INFO  org.flywaydb.core.internal.dbsupport.DbSupportFactory Database: jdbc:h2:file:/opt/mist/data/recovery.db (H2 1.4)
mist_1           | 2018-11-09 10:59:44,014 INFO  org.flywaydb.core.internal.command.DbValidate Successfully validated 2 migrations (execution time 00:00.018s)
mist_1           | 2018-11-09 10:59:44,027 INFO  org.flywaydb.core.internal.command.DbMigrate Current version of schema "PUBLIC": 2
mist_1           | 2018-11-09 10:59:44,027 INFO  org.flywaydb.core.internal.command.DbMigrate Schema "PUBLIC" is up to date. No migration necessary.
mist_1           | 2018-11-09 10:59:44,540 INFO  io.hydrosphere.mist.master.MasterServer$ LogsSystem started
mist_1           | 2018-11-09 10:59:46,042 WARN  org.apache.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
mist_1           | 2018-11-09 10:59:46,995 INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started
mist_1           | 2018-11-09 10:59:47,264 INFO  akka.remote.Remoting Starting remoting
mist_1           | 2018-11-09 10:59:47,601 INFO  akka.remote.Remoting Remoting started; listening on addresses :[akka.tcp://[email protected]:40605]
mist_1           | 2018-11-09 10:59:48,197 INFO  io.hydrosphere.mist.master.MasterServer$ FunctionInfoProvider started
mist_1           | 2018-11-09 10:59:48,646 INFO  io.hydrosphere.mist.master.MasterServer$ Main service started
mist_1           | 2018-11-09 10:59:49,686 INFO  io.hydrosphere.mist.master.MasterServer$ Http interface started
mist_1           | 2018-11-09 10:59:49,692 INFO  io.hydrosphere.mist.master.Master$ Mist master started
mist_1           | 2018-11-09 11:00:04,797 INFO  io.hydrosphere.mist.master.execution.ContextFrontend Starting executor k8s-master_96a1ce36-460a-4f3b-b8ba-735ddb2a33fe for k8s-master
mist_1           | 2018-11-09 11:00:04,833 INFO  io.hydrosphere.mist.master.execution.ContextFrontend Context k8s-master - connected state(active connections: 0, max: 1)
mist_1           | 2018-11-09 11:00:04,845 INFO  io.hydrosphere.mist.master.execution.workers.starter.LocalSparkSubmit Try submit local worker k8s-master_96a1ce36-460a-4f3b-b8ba-735ddb2a33fe_1, cmd: /opt/spark/bin/spark-submit --conf spark.eventLog.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.submit.deployMode=cluster --conf spark.master=spark://spark-master:7077 --conf spark.eventLog.dir=/data/spark/events --conf spark.dynamicAllocation.enabled=true --conf spark.eventLog.compress=true --class io.hydrosphere.mist.worker.Worker /opt/mist/mist-worker.jar --master 172.20.0.5:2551 --name k8s-master_96a1ce36-460a-4f3b-b8ba-735ddb2a33fe_1
spark-master_1   | 2018-11-09 11:00:07,315 INFO  org.apache.spark.deploy.master.Master Driver submitted org.apache.spark.deploy.worker.DriverWrapper
spark-master_1   | 2018-11-09 11:00:07,318 INFO  org.apache.spark.deploy.master.Master Launching driver driver-20181109110007-0000 on worker worker-20181109105941-29c1c06e51e3-9099
spark-worker_1   | 2018-11-09 11:00:07,355 INFO  org.apache.spark.deploy.worker.Worker Asked to launch driver driver-20181109110007-0000
spark-worker_1   | 2018-11-09 11:00:07,367 INFO  org.apache.spark.deploy.worker.DriverRunner Copying user jar file:/opt/mist/mist-worker.jar to /opt/spark/work/driver-20181109110007-0000/mist-worker.jar
spark-worker_1   | 2018-11-09 11:00:07,390 INFO  org.apache.spark.util.Utils Copying /opt/mist/mist-worker.jar to /opt/spark/work/driver-20181109110007-0000/mist-worker.jar
spark-worker_1   | 2018-11-09 11:00:07,400 INFO  org.apache.spark.deploy.worker.DriverRunner Killing driver process!
spark-worker_1   | 2018-11-09 11:00:07,404 WARN  org.apache.spark.deploy.worker.Worker Driver driver-20181109110007-0000 failed with unrecoverable exception: java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar
spark-master_1   | 2018-11-09 11:00:07,460 INFO  org.apache.spark.deploy.master.Master Removing driver: driver-20181109110007-0000
spark-master_1   | 2018-11-09 11:00:12,769 INFO  org.apache.spark.deploy.master.Master 172.20.0.5:40290 got disassociated, removing it.
spark-master_1   | 2018-11-09 11:00:12,770 INFO  org.apache.spark.deploy.master.Master 172.20.0.5:42207 got disassociated, removing it.
mist_1           | 2018-11-09 11:00:12,897 ERROR io.hydrosphere.mist.master.execution.workers.ExclusiveConnector Could not start worker connection
mist_1           | java.lang.RuntimeException: Process terminated with error java.lang.RuntimeException: Process exited with status code 255 and out: 2018-11-09 11:00:06,479 WARN  org.apache.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable;2018-11-09 11:00:12,424 ERROR org.apache.spark.deploy.ClientEndpoint Exception from cluster was: java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar;java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar;	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86);	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102);	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107);	at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526);	at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253);	at java.nio.file.Files.copy(Files.java:1274);	at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:664);	at org.apache.spark.util.Utils$.copyFile(Utils.scala:635);	at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:719);	at org.apache.spark.util.Utils$.fetchFile(Utils.scala:509);	at org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:155);	at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:173);	at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:92)
mist_1           | 	at io.hydrosphere.mist.master.execution.workers.WorkerRunner$DefaultRunner$$anonfun$continueSetup$1$1.applyOrElse(WorkerRunner.scala:39)
mist_1           | 	at io.hydrosphere.mist.master.execution.workers.WorkerRunner$DefaultRunner$$anonfun$continueSetup$1$1.applyOrElse(WorkerRunner.scala:39)
mist_1           | 	at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
mist_1           | 	at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
mist_1           | 	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
mist_1           | 	at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
mist_1           | 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
mist_1           | 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
mist_1           | 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
mist_1           | 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Job log

INFO 2018-11-09T11:58:22.53 [bec629b4-7cc4-482e-8ccb-9a7856f701d2] Waiting worker connection
INFO 2018-11-09T11:58:22.534 [bec629b4-7cc4-482e-8ccb-9a7856f701d2] InitializedEvent(externalId=None)
INFO 2018-11-09T11:58:22.534 [bec629b4-7cc4-482e-8ccb-9a7856f701d2] QueuedEvent
ERROR 2018-11-09T11:59:02.636 [bec629b4-7cc4-482e-8ccb-9a7856f701d2] FailedEvent with Error: 
 java.lang.RuntimeException: Context is broken
	at io.hydrosphere.mist.master.execution.JobActor$$anonfun$io$hydrosphere$mist$master$execution$JobActor$$initial$1.applyOrElse(JobActor.scala:59)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
	at io.hydrosphere.mist.master.execution.JobActor.akka$actor$Timers$$super$aroundReceive(JobActor.scala:24)
	at akka.actor.Timers$class.aroundReceive(Timers.scala:44)
	at io.hydrosphere.mist.master.execution.JobActor.aroundReceive(JobActor.scala:24)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527)
	at akka.actor.ActorCell.invoke(ActorCell.scala:496)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException: Process terminated with error java.lang.RuntimeException: Process exited with status code 255 and out: 2018-11-09 11:58:56,046 WARN  org.apache.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable;2018-11-09 11:59:01,870 ERROR org.apache.spark.deploy.ClientEndpoint Exception from cluster was: java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar;java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar;	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86);	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102);	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107);	at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526);	at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253);	at java.nio.file.Files.copy(Files.java:1274);	at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:664);	at org.apache.spark.util.Utils$.copyFile(Utils.scala:635);	at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:719);	at org.apache.spark.util.Utils$.fetchFile(Utils.scala:509);	at org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:155);	at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:173);at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:92)
	at io.hydrosphere.mist.master.execution.workers.WorkerRunner$DefaultRunner$$anonfun$continueSetup$1$1.applyOrElse(WorkerRunner.scala:39)
	at io.hydrosphere.mist.master.execution.workers.WorkerRunner$DefaultRunner$$anonfun$continueSetup$1$1.applyOrElse(WorkerRunner.scala:39)
	at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
	at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
	at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Local worker log

2018-11-09 11:58:24,154 WARN  org.apache.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-11-09 11:58:30,109 ERROR org.apache.spark.deploy.ClientEndpoint Exception from cluster was: java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar
java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526)
	at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
	at java.nio.file.Files.copy(Files.java:1274)
	at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:664)
	at org.apache.spark.util.Utils$.copyFile(Utils.scala:635)
	at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:719)
	at org.apache.spark.util.Utils$.fetchFile(Utils.scala:509)
	at org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:155)
	at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:173)
	at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:92)

Suspected cause

It looks like the path to mist-worker.jar passed to the Spark worker is hard-coded from $MIST_HOME, so the worker looks for /opt/mist/mist-worker.jar on its own filesystem, where the jar does not exist:

org.apache.spark.deploy.worker.DriverRunner Copying user jar file:/opt/mist/mist-worker.jar to /opt/spark/work/driver-20181109110007-0000/mist-worker.jar
org.apache.spark.deploy.worker.Worker Driver driver-20181109110007-0000 failed with unrecoverable exception: java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar
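
A possible workaround, assuming the hard-coded path is indeed the cause: run the context in client deploy mode, so the Mist worker driver starts inside the Mist container, where /opt/mist/mist-worker.jar actually exists. Sketch only (same context settings as above with only the deploy mode changed; not verified against Mist 1.1.0):

downtime="3600s"
max-conn-failures=5
max-parallel-jobs=1
precreated=false
run-options=""
spark-conf {
    "spark.master"="spark://spark-master:7077"
    # Client mode: the driver runs inside the Mist container, so the worker
    # jar is read locally instead of being fetched on a Spark worker node.
    "spark.submit.deployMode"="client"
    "spark.dynamicAllocation.enabled"="true"
    "spark.shuffle.service.enabled"="true"
}
streaming-duration="1s"

Alternatively, cluster mode should work if mist-worker.jar is made available at the same /opt/mist/mist-worker.jar path on the Spark worker host, for example by mounting it into the spark-worker container.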
