Apache Spark job runs locally but throwing null pointer on Google Cloud Cluster
I have an Apache Spark application that until now I have been running/testing on my local machine with:

    spark-submit --class "main.SomeMainClass" --master local jarfile.jar
Everything runs fine locally, but when I submit this very same job to Google Cloud Dataproc it throws a NullPointerException:

    Caused by: java.lang.NullPointerException
        at geneticClasses.FitnessCalculator.calculateFitness(FitnessCalculator.java:30)
        at geneticClasses.StringIndividualMapReduce.calculateFitness(StringIndividualMapReduce.java:91)
        at mapreduce.Mapper.lambda$mapCalculateFitness$3d84c37$1(Mapper.java:30)
        at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1018)
        at . . .
The error is thrown on a worker node, since it occurs during the map phase. What is the difference between local mode and a real cluster, other than local mode simulating the worker nodes as separate threads?
FitnessCalculator lives on the driver node and all of its methods are static. Do I need to make it Serializable so it can be shipped to the worker nodes along with the rest of the code?
Thank you.

Author: MichaelDD. Posted: 27.10.2019 08:04
You say that FitnessCalculator only has static methods and that it works in local mode. My guess is that you have some static field (initialized to null) that you set on the driver and then try to read inside a Spark task at FitnessCalculator.java:30. Unfortunately, that won't work.
Changes to static fields aren't propagated to Spark workers. The reason it works in local mode is that the workers run inside the same JVM (Java Virtual Machine) as the driver, so they coincidentally see the new value.
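To make the failure mode concrete, here is a minimal, self-contained sketch (plain JDK, no Spark; the field name `target` and the `SerFn` interface are made up for illustration, not taken from the asker's code). Spark ships a task to a worker by serializing the closure: local variables captured by the lambda are serialized and travel with it, but a static field is re-read from the worker's own JVM at call time, where it still has its initial (null) value. The sketch simulates "shipping" by serializing the lambda, resetting the static field as a fresh worker JVM would see it, then deserializing and invoking:

```java
import java.io.*;

public class StaticCaptureDemo {
    // Hypothetical static field, set on the "driver" and never seen by a remote worker JVM.
    static String target = null;

    // A serializable single-method interface, so lambdas implementing it can be serialized.
    interface SerFn extends Serializable { int apply(String s); }

    public static void main(String[] args) {
        target = "HELLO";

        // BAD: the lambda reads the static field when it runs. On a remote
        // worker JVM, `target` is still null, so this throws a NullPointerException there.
        SerFn bad = s -> s.compareTo(target);

        // GOOD: copy the value into a local variable. The copy is captured by
        // the lambda and serialized with it, so it travels to the worker.
        final String captured = target;
        SerFn good = s -> s.compareTo(captured);

        // Simulate shipping the closure: serialize it, reset the static field
        // (a fresh worker JVM never saw the driver's assignment), deserialize, invoke.
        byte[] shippedBytes = serialize(good);
        target = null;
        SerFn shipped = (SerFn) deserialize(shippedBytes);
        System.out.println(shipped.apply("HELLO")); // 0: the captured value survived the trip
    }

    // Serialization helpers; checked exceptions wrapped for brevity.
    static byte[] serialize(Object o) {
        try (ByteArrayOutputStream b = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(b)) {
            out.writeObject(o);
            out.flush();
            return b.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    static Object deserialize(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

In Spark terms, the same fix is either to capture the value in a local variable before building the RDD transformation, or, for larger shared data, to use a broadcast variable (`JavaSparkContext.broadcast`), which Spark distributes to every worker explicitly.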