Impact Analysis of System.exit() Usage in Spark Jobs

Is System.exit() a show-stopper in Spark jobs running through YARN in cluster mode? What is its impact?




Motivation

    The first question that needs to be addressed: is it good to use System.exit() in Spark jobs? Ideally, not only in Spark jobs but in any JVM-based application in general, it's a big NaaaaaH!

    It kills the Spark job prematurely by abruptly shutting down the whole JVM. Its usage is discouraged even for the lowest-priority business applications, so please do some extensive research of your own on it. This post articulates only the impact of System.exit() usage on Spark applications. I faced a real production issue that motivated me to write this, and what I put forward in this article is entirely my own experience.
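    To see the effect in isolation, here is a minimal, standalone sketch (the class name is hypothetical) of what System.exit() does to a running JVM:

```scala
object ExitDemo {
  def main(args: Array[String]): Unit = {
    println("before exit")
    System.exit(1)            // the JVM terminates here with status code 1
    println("never reached")  // unreachable: nothing after the call runs
  }
}
```

    A non-zero status code is the conventional signal of failure to whatever launched the JVM, and that is exactly how YARN interprets it, as we will see below.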


Background

    Precisely speaking, I recently started working on a project for one of my financial clients, where the team had already migrated traditional MapReduce jobs to Spark jobs. These jobs process files that land in the inbound directory of the edge node, arriving from legacy systems over the network, and generate validation email reports/alerts about the processed data for the higher levels of the business. Those jobs were executed in Spark client mode earlier. I know what's running through your mind right now. Yes, email reports are undoubtedly an ancient technique, but the architecture can't be transfigured as quick as a wink; it's a continuous effort.

    However, the team faced some challenges sending email reports in cluster mode: since the driver can be launched on any of the underlying data nodes in cluster mode, SMTP would need to be set up on every node of the cluster. And since it's a highly secured environment that processes confidential customer data, the IT team didn't support/encourage SMTP installations on all the nodes of the cluster. I know it sounds a bit ridiculous, but such environments exist. So the team decided to use client mode, where the driver runs locally on the edge node, which is where SMTP had been set up, and executes the email function there.


Root Cause

    Running the Spark jobs in client mode was a big drawback: it was degrading the performance of the jobs and causing latency. So I suggested that the team switch to cluster mode, which gives the Spark job more computational power and handles job failures more efficiently. I also proposed a solution for sending email reports in cluster mode that bypasses the need for an SMTP setup on all the nodes of the cluster, which I will talk about later in this article.

    Confining to our point of interest: as they moved to cluster mode, they hit another interesting bug. The jobs used System.exit() to end a Spark job without further processing whenever bad data that doesn't satisfy the business rules was supplied to the data pipeline. Whenever this scenario occurred and System.exit() was triggered within a Spark job running in cluster mode, the job ended abruptly, and YARN treated it as a Spark job failure caused by some underlying intermittent or network issue. So YARN started another ApplicationMaster (AM), which tried to run the Spark job a second time, because the spark.yarn.maxAppAttempts parameter defaults to YARN's yarn.resourcemanager.am.max-attempts, which is 2 by default. The corrupted data files had already been moved to the archive directory after the first run, but the second attempt still tried to access the data files in the same inbound directory, which left the Spark job as a failure in the second attempt as well. In both the first and second runs, an ETL-failure email report was sent to the business. This was a hassle for our team to tackle, and it alone took me a few days of deep diving through the Spark driver and YARN ApplicationMaster logs to find the root cause.
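    Here is a minimal sketch of the anti-pattern; the validation rule, paths, and class name are hypothetical, but the control flow mirrors what the jobs were doing:

```scala
import org.apache.spark.sql.SparkSession

object ValidationJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("validation-job").getOrCreate()
    val input = spark.read.option("header", "true").csv("/inbound/files/*.csv")

    // Hypothetical business rule: reject the batch if any amount is missing.
    if (input.filter("amount IS NULL").count() > 0) {
      // Anti-pattern: in YARN cluster mode the driver runs inside the
      // ApplicationMaster, so this kills the AM JVM with a non-zero exit
      // code. YARN records the attempt as FAILED and launches a second AM,
      // but the input files have already been archived by then, so the
      // second attempt fails too, and two failure emails go out.
      System.exit(1)
    }

    // ... normal processing ...
    spark.stop()
  }
}
```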


Solution

    So I tried to convince the team to make a few modifications to the current architecture. Here they are:

  1. Implement an audit table that captures the details of each job run, along with all the processing metrics required for the email report.
  2. Remove all the System.exit() calls from the current Spark jobs. Instead of abruptly ending a Spark job, let it finish gracefully and trigger a post-job action after it completes.
  3. Have that simple post-job action collect all the data from the audit table for the run date and generate the email report. It executes only on the edge node, where SMTP is installed, bypassing the need for an SMTP setup on all the nodes of the cluster (see the sketch after this list).
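    To make the flow concrete, here is a minimal sketch under stated assumptions: the audit table name, its columns, and the validation rule are all hypothetical, and the post-job action is assumed to be a small program launched on the edge node after spark-submit returns.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object ValidationJobGraceful {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("validation-job").getOrCreate()
    import spark.implicits._

    val input = spark.read.option("header", "true").csv("/inbound/files/*.csv")
    val total = input.count()
    val bad   = input.filter("amount IS NULL").count() // hypothetical rule

    // Record the outcome in the audit table instead of calling System.exit().
    Seq((java.time.LocalDate.now.toString, total, bad,
         if (bad > 0) "FAILED_VALIDATION" else "SUCCESS"))
      .toDF("run_date", "records_read", "bad_records", "status")
      .write.mode(SaveMode.Append).saveAsTable("etl_audit.job_runs")

    // Finish gracefully: YARN marks the attempt SUCCEEDED and no retry happens.
    spark.stop()
  }
}
```

    The post-job action then runs only on the edge node, queries etl_audit.job_runs for the current run date, and sends the report through the locally installed SMTP server, so no SMTP is needed anywhere else in the cluster.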

    This way I was able to resolve two major issues with one simple solution.


Takeaway

    As an industry best practice, try to avoid the System.exit() functionality in general. There are multiple alternative techniques, such as the graceful-finish approach described above.

    Based on my research, there are a few blogs that suggest setting spark.yarn.maxAppAttempts to 1, which I strictly don't recommend: that way we literally lose YARN's capability to re-attempt the application after genuine intermittent or network failures, which is part of what makes cluster mode resilient.
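    If you want to inspect the retry settings and the current attempt from inside the driver (for example, while debugging duplicate failure emails), here is a small sketch; the application name is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object AttemptInfo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("attempt-info").getOrCreate()
    val sc = spark.sparkContext

    // spark.yarn.maxAppAttempts, when unset, falls back to YARN's
    // yarn.resourcemanager.am.max-attempts (2 by default).
    val maxAttempts = sc.getConf
      .getOption("spark.yarn.maxAppAttempts")
      .getOrElse("<cluster default: yarn.resourcemanager.am.max-attempts>")

    // On YARN this is "1" on the first attempt, "2" on the retry.
    val attemptId = sc.applicationAttemptId.getOrElse("<not running on YARN>")

    println(s"max attempts = $maxAttempts, current attempt = $attemptId")
    spark.stop()
  }
}
```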

