Spark cluster full of heartbeat timeouts, executors exiting on their own

Asked on January 12, 2019 in Apache-spark.
  • 3 Answer(s)

    Set spark.network.timeout to a higher value in spark-defaults.conf.

    When submitting with spark-submit, the timeout can also be set on the command line (note the = in --conf key=value):

    $SPARK_HOME/bin/spark-submit --conf spark.network.timeout=10000000 --class myclass.neuralnet.TrainNetSpark --master spark://master.cluster:7077 --driver-memory 30G --executor-memory 14G --num-executors 7 --executor-cores 8 --conf spark.driver.maxResultSize=4g --conf spark.executor.heartbeatInterval=10000000 path/to/my.jar
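    Equivalently, the same properties can go in spark-defaults.conf. The values below simply mirror the command above and are illustrative, not recommendations; note that Spark expects spark.executor.heartbeatInterval to be significantly smaller than spark.network.timeout:

    ```
    # $SPARK_HOME/conf/spark-defaults.conf
    # Illustrative values mirroring the spark-submit command above.
    spark.network.timeout            10000000
    # Should be well below spark.network.timeout in practice.
    spark.executor.heartbeatInterval 10000000
    ```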
    
    Answered on January 12, 2019.

    Executors are being killed by YARN due to OOMs. Inspect the logs on the individual executors (look for the text “running beyond physical memory”). If there are many executors, rather than inspecting all of the logs manually, monitor the job in the Spark UI while it runs: when a task fails, the UI reports the cause. Be aware that some tasks will report failure because their executor has already been killed, so look at the cause of each individual failing task.
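    On a YARN cluster with log aggregation enabled, one way to search the executor logs for this message is the yarn logs CLI (the application ID below is a placeholder):

    ```shell
    # Fetch aggregated logs for the finished application and search for
    # YARN's physical-memory kill message. The application ID is a placeholder.
    yarn logs -applicationId application_1547000000000_0001 \
      | grep -B 2 "running beyond physical memory"
    ```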

    Note: most OOM problems can be solved quickly by simply repartitioning the data at appropriate places in the code. Otherwise, scale up the machines to accommodate the need for memory.
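    A minimal sketch of the repartitioning idea, assuming PySpark; the input path and the partition count of 200 are hypothetical and should be tuned to the data and cluster:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-sketch").getOrCreate()

    # Placeholder input; substitute the real dataset.
    df = spark.read.parquet("path/to/input")

    # Increase the number of partitions before a memory-heavy stage so each
    # task processes a smaller slice of the data and stays within executor memory.
    df = df.repartition(200)
    ```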

    Answered on January 12, 2019.

    The answer was rather simple: in my spark-defaults.conf I set spark.network.timeout to a higher value. The heartbeat interval was somewhat irrelevant to the problem (though tuning it is handy).

    When using spark-submit I was also able to set the timeout as follows:

    $SPARK_HOME/bin/spark-submit --conf spark.network.timeout=10000000 --class myclass.neuralnet.TrainNetSpark --master spark://master.cluster:7077 --driver-memory 30G --executor-memory 14G --num-executors 7 --executor-cores 8 --conf spark.driver.maxResultSize=4g --conf spark.executor.heartbeatInterval=10000000 path/to/my.jar
    Answered on January 13, 2019.
