Closed
Description
Context
- Stack includes: 1 Spark Master (without Mist) + Worker (without Mist) + Mist Master
- Spark version: 2.4.0
- Mist version: 1.1.0
downtime="3600s"
max-conn-failures=5
max-parallel-jobs=1
precreated=false
run-options=""
spark-conf {
"spark.master"="spark://spark-master:7077"
"spark.submit.deployMode"="cluster"
"spark.dynamicAllocation.enabled"="true"
"spark.shuffle.service.enabled"="true"
}
streaming-duration="1s"
Log
mist_1 | 2018-11-09 10:59:42,857 INFO akka.event.slf4j.Slf4jLogger Slf4jLogger started
spark-master_1 | 2018-11-09 10:59:42,937 INFO org.apache.spark.deploy.master.Master Registering worker 29c1c06e51e3:9099 with 8 cores, 13.7 GB RAM
spark-worker_1 | 2018-11-09 10:59:42,966 INFO org.apache.spark.deploy.worker.Worker Successfully registered with master spark://spark-master:7077
mist_1 | 2018-11-09 10:59:43,184 INFO akka.remote.Remoting Starting remoting
mist_1 | 2018-11-09 10:59:43,412 INFO akka.remote.Remoting Remoting started; listening on addresses :[akka.tcp://[email protected]:2551]
mist_1 | 2018-11-09 10:59:43,521 INFO org.flywaydb.core.internal.util.VersionPrinter Flyway 4.1.1 by Boxfuse
mist_1 | 2018-11-09 10:59:43,826 INFO org.flywaydb.core.internal.dbsupport.DbSupportFactory Database: jdbc:h2:file:/opt/mist/data/recovery.db (H2 1.4)
mist_1 | 2018-11-09 10:59:44,014 INFO org.flywaydb.core.internal.command.DbValidate Successfully validated 2 migrations (execution time 00:00.018s)
mist_1 | 2018-11-09 10:59:44,027 INFO org.flywaydb.core.internal.command.DbMigrate Current version of schema "PUBLIC": 2
mist_1 | 2018-11-09 10:59:44,027 INFO org.flywaydb.core.internal.command.DbMigrate Schema "PUBLIC" is up to date. No migration necessary.
mist_1 | 2018-11-09 10:59:44,540 INFO io.hydrosphere.mist.master.MasterServer$ LogsSystem started
mist_1 | 2018-11-09 10:59:46,042 WARN org.apache.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
mist_1 | 2018-11-09 10:59:46,995 INFO akka.event.slf4j.Slf4jLogger Slf4jLogger started
mist_1 | 2018-11-09 10:59:47,264 INFO akka.remote.Remoting Starting remoting
mist_1 | 2018-11-09 10:59:47,601 INFO akka.remote.Remoting Remoting started; listening on addresses :[akka.tcp://[email protected]:40605]
mist_1 | 2018-11-09 10:59:48,197 INFO io.hydrosphere.mist.master.MasterServer$ FunctionInfoProvider started
mist_1 | 2018-11-09 10:59:48,646 INFO io.hydrosphere.mist.master.MasterServer$ Main service started
mist_1 | 2018-11-09 10:59:49,686 INFO io.hydrosphere.mist.master.MasterServer$ Http interface started
mist_1 | 2018-11-09 10:59:49,692 INFO io.hydrosphere.mist.master.Master$ Mist master started
mist_1 | 2018-11-09 11:00:04,797 INFO io.hydrosphere.mist.master.execution.ContextFrontend Starting executor k8s-master_96a1ce36-460a-4f3b-b8ba-735ddb2a33fe for k8s-master
mist_1 | 2018-11-09 11:00:04,833 INFO io.hydrosphere.mist.master.execution.ContextFrontend Context k8s-master - connected state(active connections: 0, max: 1)
mist_1 | 2018-11-09 11:00:04,845 INFO io.hydrosphere.mist.master.execution.workers.starter.LocalSparkSubmit Try submit local worker k8s-master_96a1ce36-460a-4f3b-b8ba-735ddb2a33fe_1, cmd: /opt/spark/bin/spark-submit --conf spark.eventLog.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.submit.deployMode=cluster --conf spark.master=spark://spark-master:7077 --conf spark.eventLog.dir=/data/spark/events --conf spark.dynamicAllocation.enabled=true --conf spark.eventLog.compress=true --class io.hydrosphere.mist.worker.Worker /opt/mist/mist-worker.jar --master 172.20.0.5:2551 --name k8s-master_96a1ce36-460a-4f3b-b8ba-735ddb2a33fe_1
spark-master_1 | 2018-11-09 11:00:07,315 INFO org.apache.spark.deploy.master.Master Driver submitted org.apache.spark.deploy.worker.DriverWrapper
spark-master_1 | 2018-11-09 11:00:07,318 INFO org.apache.spark.deploy.master.Master Launching driver driver-20181109110007-0000 on worker worker-20181109105941-29c1c06e51e3-9099
spark-worker_1 | 2018-11-09 11:00:07,355 INFO org.apache.spark.deploy.worker.Worker Asked to launch driver driver-20181109110007-0000
spark-worker_1 | 2018-11-09 11:00:07,367 INFO org.apache.spark.deploy.worker.DriverRunner Copying user jar file:/opt/mist/mist-worker.jar to /opt/spark/work/driver-20181109110007-0000/mist-worker.jar
spark-worker_1 | 2018-11-09 11:00:07,390 INFO org.apache.spark.util.Utils Copying /opt/mist/mist-worker.jar to /opt/spark/work/driver-20181109110007-0000/mist-worker.jar
spark-worker_1 | 2018-11-09 11:00:07,400 INFO org.apache.spark.deploy.worker.DriverRunner Killing driver process!
spark-worker_1 | 2018-11-09 11:00:07,404 WARN org.apache.spark.deploy.worker.Worker Driver driver-20181109110007-0000 failed with unrecoverable exception: java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar
spark-master_1 | 2018-11-09 11:00:07,460 INFO org.apache.spark.deploy.master.Master Removing driver: driver-20181109110007-0000
spark-master_1 | 2018-11-09 11:00:12,769 INFO org.apache.spark.deploy.master.Master 172.20.0.5:40290 got disassociated, removing it.
spark-master_1 | 2018-11-09 11:00:12,770 INFO org.apache.spark.deploy.master.Master 172.20.0.5:42207 got disassociated, removing it.
mist_1 | 2018-11-09 11:00:12,897 ERROR io.hydrosphere.mist.master.execution.workers.ExclusiveConnector Could not start worker connection
mist_1 | java.lang.RuntimeException: Process terminated with error java.lang.RuntimeException: Process exited with status code 255 and out: 2018-11-09 11:00:06,479 WARN org.apache.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable;2018-11-09 11:00:12,424 ERROR org.apache.spark.deploy.ClientEndpoint Exception from cluster was: java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar;java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar; at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86); at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102); at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107); at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526); at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253); at java.nio.file.Files.copy(Files.java:1274); at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:664); at org.apache.spark.util.Utils$.copyFile(Utils.scala:635); at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:719); at org.apache.spark.util.Utils$.fetchFile(Utils.scala:509); at org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:155); at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:173); at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:92)
mist_1 | at io.hydrosphere.mist.master.execution.workers.WorkerRunner$DefaultRunner$$anonfun$continueSetup$1$1.applyOrElse(WorkerRunner.scala:39)
mist_1 | at io.hydrosphere.mist.master.execution.workers.WorkerRunner$DefaultRunner$$anonfun$continueSetup$1$1.applyOrElse(WorkerRunner.scala:39)
mist_1 | at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
mist_1 | at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
mist_1 | at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
mist_1 | at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
mist_1 | at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
mist_1 | at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
mist_1 | at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
mist_1 | at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Job log
INFO 2018-11-09T11:58:22.53 [bec629b4-7cc4-482e-8ccb-9a7856f701d2] Waiting worker connection
INFO 2018-11-09T11:58:22.534 [bec629b4-7cc4-482e-8ccb-9a7856f701d2] InitializedEvent(externalId=None)
INFO 2018-11-09T11:58:22.534 [bec629b4-7cc4-482e-8ccb-9a7856f701d2] QueuedEvent
ERROR 2018-11-09T11:59:02.636 [bec629b4-7cc4-482e-8ccb-9a7856f701d2] FailedEvent with Error:
java.lang.RuntimeException: Context is broken
at io.hydrosphere.mist.master.execution.JobActor$$anonfun$io$hydrosphere$mist$master$execution$JobActor$$initial$1.applyOrElse(JobActor.scala:59)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at io.hydrosphere.mist.master.execution.JobActor.akka$actor$Timers$$super$aroundReceive(JobActor.scala:24)
at akka.actor.Timers$class.aroundReceive(Timers.scala:44)
at io.hydrosphere.mist.master.execution.JobActor.aroundReceive(JobActor.scala:24)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527)
at akka.actor.ActorCell.invoke(ActorCell.scala:496)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException: Process terminated with error java.lang.RuntimeException: Process exited with status code 255 and out: 2018-11-09 11:58:56,046 WARN org.apache.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable;2018-11-09 11:59:01,870 ERROR org.apache.spark.deploy.ClientEndpoint Exception from cluster was: java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar;java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar; at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86); at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102); at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107); at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526); at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253); at java.nio.file.Files.copy(Files.java:1274); at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:664); at org.apache.spark.util.Utils$.copyFile(Utils.scala:635); at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:719); at org.apache.spark.util.Utils$.fetchFile(Utils.scala:509); at org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:155); at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:173);at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:92)
at io.hydrosphere.mist.master.execution.workers.WorkerRunner$DefaultRunner$$anonfun$continueSetup$1$1.applyOrElse(WorkerRunner.scala:39)
at io.hydrosphere.mist.master.execution.workers.WorkerRunner$DefaultRunner$$anonfun$continueSetup$1$1.applyOrElse(WorkerRunner.scala:39)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Local worker log
2018-11-09 11:58:24,154 WARN org.apache.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-11-09 11:58:30,109 ERROR org.apache.spark.deploy.ClientEndpoint Exception from cluster was: java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar
java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526)
at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
at java.nio.file.Files.copy(Files.java:1274)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:664)
at org.apache.spark.util.Utils$.copyFile(Utils.scala:635)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:719)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:509)
at org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:155)
at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:173)
at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:92)
Suspicious
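The path to mist-worker.jar appears to be hard-coded from $MIST_HOME on the Mist master. With spark.submit.deployMode=cluster, the Spark worker then tries to copy the jar from that same local path, which does not exist on a worker that has no Mist installation: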
org.apache.spark.deploy.worker.DriverRunner Copying user jar file:/opt/mist/mist-worker.jar to /opt/spark/work/driver-20181109110007-0000/mist-worker.jar
org.apache.spark.deploy.worker.Worker Driver driver-20181109110007-0000 failed with unrecoverable exception: java.nio.file.NoSuchFileException: /opt/mist/mist-worker.jar
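To make the suspicion above easier to check, here is a minimal sketch that inspects a spark-submit command like the one Mist logged and flags the combination that fails here: cluster deploy mode together with an application jar given as a plain local path. In cluster mode the driver is launched on a worker, so the jar has to be reachable from that node. The helper names `parse_spark_submit` and `jar_is_node_local` are hypothetical and are not part of Mist or Spark, and the parser assumes every non-`--conf` option takes exactly one value.

```python
import shlex
from urllib.parse import urlparse


def parse_spark_submit(cmd):
    """Split a spark-submit command into its --conf map and application jar."""
    tokens = shlex.split(cmd)
    conf = {}
    jar = None
    i = 1  # skip the spark-submit executable itself
    while i < len(tokens):
        tok = tokens[i]
        if tok == "--conf" and i + 1 < len(tokens):
            key, _, value = tokens[i + 1].partition("=")
            conf[key] = value
            i += 2
        elif tok.startswith("--") and i + 1 < len(tokens):
            # Assumption: any other option (e.g. --class) takes one value.
            i += 2
        else:
            # The first positional argument is the application jar;
            # everything after it belongs to the application.
            jar = tok
            break
    return conf, jar


def jar_is_node_local(conf, jar):
    """True when cluster deploy mode is combined with a local jar path."""
    scheme = urlparse(jar).scheme
    is_cluster = conf.get("spark.submit.deployMode") == "cluster"
    return is_cluster and scheme in ("", "file")


# Shortened form of the command from the Mist log above.
cmd = (
    "/opt/spark/bin/spark-submit "
    "--conf spark.submit.deployMode=cluster "
    "--conf spark.master=spark://spark-master:7077 "
    "--class io.hydrosphere.mist.worker.Worker "
    "/opt/mist/mist-worker.jar --master 172.20.0.5:2551"
)
conf, jar = parse_spark_submit(cmd)
if jar_is_node_local(conf, jar):
    print(f"{jar} must exist at the same path on every Spark worker")
```

If the check fires, the likely options are to make the jar available at the same path on the worker nodes or to run the context in client deploy mode; which of these Mist supports is not confirmed by the logs in this report.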