Understanding and Handling Spark Errors#
Spark errors can be very long, often contain redundant information, and can appear intimidating at first. However, if you know which parts of the error message to look at, you will often be able to resolve it. Sometimes you may want to handle errors programmatically, enabling you to simplify the output of an error message, or to continue the code execution in some circumstances.
Errors can be rendered differently depending on the software you are using to write code, e.g. CDSW will generally give you long passages of red text whereas Jupyter notebooks have code highlighting. This example uses the CDSW error messages as this is the most commonly used tool to write code at the ONS. The general principles are the same regardless of IDE used to write code.
There are some examples of errors given here but the intention of this article is to help you debug errors for yourself rather than being a list of all potential problems that you may encounter. We focus on error messages that are caused by Spark code.
Practical tips for error handling in Spark#
Spark error messages can be long, but the most important principle is that the first line returned is the most important. This first line gives a description of the error, put there by the package developers. The output when you get an error will often be larger than the length of the screen and so you may have to scroll up to find this. The examples in the next sections show some PySpark and sparklyr errors.
The most likely cause of an error is your code being incorrect in some way. Use the information given on the first line of the error message to try and resolve it. In many cases this will give you enough information to help diagnose and attempt to resolve the situation. If you are still struggling, try using a search engine; Stack Overflow will often be the first result and whatever error you have you are very unlikely to be the first person to have encountered it. If you are still stuck, then consulting your colleagues is often a good next step.
Remember that Spark uses the concept of lazy evaluation, which means that your error might be elsewhere in the code from where you think it is, since the plan will only be executed upon calling an action. If you suspect this is the case, try putting an action earlier in the code and see if it runs. Repeat this process until you have found the line of code which causes the error. With more experience of coding in Spark you will come to know which areas of your code could cause potential issues.
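As a rough illustration of this approach, the sketch below inserts temporary actions between transformations to narrow down where an error is raised. The DataFrame, column names and file path are placeholders for your own code, and a running Spark session is assumed.

```python
from pyspark.sql import functions as F

# Placeholder read; replace with your own data and transformations
df = spark.read.parquet(file_path)

df = df.withColumn("count_doubled", F.col("count") * 2)
df.count()  # temporary action: if the error appears here, it is in the code above

df = df.groupBy("animal_group").agg(F.sum("count_doubled").alias("total"))
df.count()  # if the error only appears here, the problem is in the aggregation

# Remove the temporary actions once the problem is found, as they slow the code down
```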
Errors that appear to be related to memory are worth mentioning here. The first solution should not be simply to increase the amount of memory; instead see if other solutions can work, for instance breaking the lineage with checkpointing or staging tables. See the Ideas for optimising Spark code article in the first instance. Increasing the memory should be the last resort.
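For reference, a minimal sketch of breaking the lineage with a checkpoint is shown below; df is a placeholder DataFrame and the checkpoint directory is an assumption, so set it to a suitable location on your platform.

```python
# Checkpointing writes an intermediate copy of the DataFrame to disk,
# which truncates the plan that Spark has to keep track of
spark.sparkContext.setCheckpointDir("checkpoints")  # placeholder path

df = df.checkpoint()
```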
Occasionally your error may be because of a software or hardware issue with the Spark cluster rather than your code. It is worth resetting as much as possible, e.g. if you are using a Docker container then close and reopen a session. If there are still issues then raise a ticket with your organisation's IT support department.
If you are struggling to get started with Spark then ensure that you have read the Getting Started with Spark article; in particular, ensure that your environment variables are set correctly.
A recap of Python and R errors#
Python Errors
PySpark errors are just a variation of Python errors and are structured the same way, so it is worth looking at the documentation for errors and the base exceptions.
Some PySpark errors are fundamentally Python coding issues, not PySpark. An example is where you try and use a variable that you have not defined, for instance, when creating a new DataFrame without a valid Spark session:
from pyspark.sql import SparkSession, functions as F
data = [["Cat", 10], ["Dog", 5]]
columns = ["animal", "count"]
animal_df = spark.createDataFrame(data, columns)
NameError: name 'spark' is not defined
NameError Traceback (most recent call last)
in engine
----> 1 animal_df = spark.createDataFrame(data, columns)
NameError: name 'spark' is not defined
The error message on the first line here is clear: name 'spark' is not defined, which is enough information to resolve the problem: we need to start a Spark session.
This error has two parts, the error message and the stack trace. The stack trace tells us the specific line where the error occurred, but this can be long when using nested functions and packages. Generally you will only want to look at the stack trace if you cannot understand the error from the error message or want to locate the line of code which needs changing.
R Errors
sparklyr errors are just a variation of base R errors and are structured the same way. Some sparklyr errors are fundamentally R coding issues, not sparklyr. An example is where you try and use a variable that you have not defined, for instance, when creating a new sparklyr DataFrame without first setting sc to be the Spark session:
library(sparklyr)
library(dplyr)
# Create a base R DataFrame
animal_df <- data.frame(
animal = c("Cat", "Dog"),
count = c(10, 5),
stringsAsFactors = FALSE)
# Copy base R DataFrame to the Spark cluster
animal_sdf <- sparklyr::sdf_copy_to(sc, animal_df)
Error in sdf_copy_to(sc, animal_df) : object 'sc' not found
The error message here is easy to understand: sc, the Spark connection object, has not been defined. To resolve this, we just have to start a Spark session. Not all base R errors are as easy to debug as this, but they will generally be much shorter than Spark specific errors.
Example Spark error: missing file#
When using Spark, sometimes errors from other languages that the code is compiled into can be raised. You may see messages about Scala and Java errors. Do not be overwhelmed, just locate the error message on the first line rather than being distracted. An example is reading a file that does not exist.
Python Example
For more details on why Python error messages can be so long, especially with Spark, you may want to read the documentation on Exception Chaining.
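As a quick, Spark-free sketch of exception chaining: raising a new error inside an except block links the two tracebacks together, which is one reason Spark errors in Python can run to many screens.

```python
try:
    {}["missing_key"]
except KeyError as e:
    # The ValueError's traceback will also show the original KeyError,
    # because the two exceptions are chained together
    raise ValueError("could not find the key") from e
    # Using "from None" instead of "from e" would suppress the chained output
```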
Try using spark.read.parquet() with an incorrect file path:
spark = (SparkSession.builder.master("local[2]")
.appName("errors")
.getOrCreate())
file_path = "this/is_not/a/file_path.parquet"
no_df = spark.read.parquet(file_path)
AnalysisException: 'Path does not exist: hdfs://.../this/is_not/a/file_path.parquet;'
Py4JJavaError Traceback (most recent call last)
...
The full error message is not given here as it is very long and some of it is platform specific, so try running this code in your own Spark session. You will see a long error message that has raised both a Py4JJavaError and an AnalysisException. The Py4JJavaError is caused by Spark and has become an AnalysisException in Python.
We can ignore everything else apart from the first line as this contains enough information to resolve the error:
AnalysisException: 'Path does not exist: hdfs://.../this/is_not/a/file_path.parquet;'
The code will work if the file_path is correct; this can be confirmed with .show():
import yaml
with open("ons-spark/config.yaml") as f:
config = yaml.safe_load(f)
file_path = config["rescue_path"]
rescue = spark.read.parquet(file_path)
rescue.select("incident_number", "animal_group").show(3)
+---------------+------------+
|incident_number|animal_group|
+---------------+------------+
| 80771131| Cat|
| 141817141| Horse|
|143166-22102016| Bird|
+---------------+------------+
only showing top 3 rows
R Example
Try using spark_read_parquet() with an incorrect file path:
sc <- sparklyr::spark_connect(
master = "local[2]",
app_name = "errors",
config = sparklyr::spark_config())
file_path <- "this/is_not/a/file_path.parquet"
no_df <- sparklyr::spark_read_parquet(sc, path=file_path)
Error: org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://.../this/is_not/a/file_path.parquet;
...
The full error message is not given here as it is very long and some of it is platform specific, so try running this code in your own Spark session. Although both java and scala are mentioned in the error, ignore this and look at the first line as this contains enough information to resolve the error:
Error: org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://.../this/is_not/a/file_path.parquet;
The code will work if the file_path is correct; this can be confirmed with glimpse():
:
rescue <- sparklyr::spark_read_parquet(sc, path=config$rescue_path)
rescue %>%
sparklyr::select(incident_number, animal_group) %>%
pillar::glimpse()
Rows: ??
Columns: 2
Database: spark_connection
$ incident_number <chr> "80771131", "141817141", "143166-22102016", "43051141"…
$ animal_group <chr> "Cat", "Horse", "Bird", "Cat", "Dog", "Deer", "Deer", …
Understanding Errors: Summary of key points#
Spark error messages can be long, but most of the output can be ignored
Look at the first line; this is the error message and will often give you all the information you need
The stack trace tells you where the error occurred but can be very long and can be misleading in some circumstances
Error messages can contain information about errors in other languages such as Java and Scala, but these can mostly be ignored
Handling exceptions in Spark#
When there is an error with Spark code, the code execution will be interrupted and an error message will be displayed. In many cases this will be desirable, giving you the chance to fix the error and then restart the script. You can however use error handling to print out a more useful error message. Sometimes you may want to handle the error and then let the code continue.
It is recommended to read the sections above on understanding errors first, especially if you are new to error handling in Python or base R.
The most important principle for handling errors is to look at the first line of the error message. This will tell you the exception type and it is this that needs to be handled. The examples here use error outputs from CDSW; they may look different in other editors.
Error handling can be a tricky concept and can actually make understanding errors more difficult if implemented incorrectly, so you may want to get more experience before trying some of the ideas in this section.
What is error handling and why use it?#
You will often have lots of errors when developing your code and these can be put in two categories: syntax errors and runtime errors. A syntax error is where the code has been written incorrectly, e.g. a missing comma, and has to be fixed before the code will compile. A runtime error is where the code compiles and starts running, but then gets interrupted and an error message is displayed, e.g. trying to divide by zero or trying to read in a non-existent file. We saw some examples in the section above. Only runtime errors can be handled.
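As a quick, Spark-free illustration of the difference:

```python
# Syntax error: the code cannot be parsed, so nothing runs at all
# print("hello"   <- missing closing bracket; raises SyntaxError before execution

# Runtime error: the code parses and starts running, then is interrupted part-way
numerator = 10
denominator = 0
result = numerator / denominator  # raises ZeroDivisionError while running
```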
We saw that Spark errors are often long and hard to read. You can use error handling to test if a block of code returns a certain type of error and instead return a clearer error message. This can save time when debugging.
You can also set the code to continue after an error, rather than being interrupted. You may want to do this if the error is not critical to the end result. If you do this it is a good idea to print a warning with the print() statement or use logging, e.g. using the Python logger.
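A minimal sketch of the logging approach is shown below; it assumes a running Spark session and an existing file_path variable, and only handles the AnalysisException raised when a path does not exist.

```python
import logging
from pyspark.sql.utils import AnalysisException

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)

try:
    rescue = spark.read.parquet(file_path)
except AnalysisException:
    # Log a warning rather than halting the script; rescue will be None downstream
    logger.warning("Could not read %s; continuing without this data", file_path)
    rescue = None
```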
Warning
Just because the code runs does not mean it gives the desired results, so make sure you always test your code!
It is useful to know how to handle errors, but do not overuse it. Remember that errors do occur for a reason and you do not usually need to try and catch every circumstance where the code might fail.
How to handle Spark errors#
Handling Errors in PySpark
PySpark errors can be handled in the usual Python way, with a try/except block. Python contains some base exceptions that do not need to be imported, e.g. NameError and ZeroDivisionError. Package authors sometimes create custom exceptions which need to be imported to be handled; for PySpark errors you will likely need to import AnalysisException from pyspark.sql.utils and potentially Py4JJavaError from py4j.protocol:
from py4j.protocol import Py4JJavaError
from pyspark.sql.utils import AnalysisException
Handling Errors in sparklyr
Unlike Python (and many other languages), R uses a function for error handling, tryCatch(). sparklyr errors are still R errors, and so can be handled with tryCatch(). Error handling functionality is contained in base R, so there is no need to reference other packages. Advanced R has more details on tryCatch().
Although error handling in this way is unconventional if you are used to other languages, one advantage is that you will often use functions when coding anyway and it becomes natural to assign tryCatch() to a custom function.
Error Handling Examples#
Example 1: No Spark session#
A simple example of error handling is ensuring that we have a running Spark session. If you want to run this code yourself, restart your container or console entirely before looking at this section.
Python Example
Recall the NameError from earlier:
from pyspark.sql import SparkSession, functions as F
data = [["Cat", 10], ["Dog", 5]]
columns = ["animal", "count"]
animal_df = spark.createDataFrame(data, columns)
NameError: name 'spark' is not defined
NameError Traceback (most recent call last)
in engine
----> 1 animal_df = spark.createDataFrame(data, columns)
NameError: name 'spark' is not defined
We can handle this exception and give a more useful error message.
In Python you can test for specific error types and the content of the error message. This ensures that we capture only the error which we want and others can be raised as usual.
In this example, first test for NameError and then check that the error message is "name 'spark' is not defined".
The syntax here is worth explaining:
The code within the try: block has active error handling. Code outside this will not have any errors handled.
If a NameError is raised, it will be handled. Other errors will be raised as usual.
e is the error message object; to test the content of the message convert it to a string with str(e)
Within the except: block str(e) is tested and if it is "name 'spark' is not defined", a NameError is raised but with a custom error message that is more useful than the default
Raising the error from None prevents exception chaining and reduces the amount of output
If the error message is not "name 'spark' is not defined" then the exception is raised as usual
try:
animal_df = spark.createDataFrame(data, columns)
animal_df.show()
except NameError as e:
if str(e) == "name 'spark' is not defined":
raise NameError("No running Spark session. Start one before creating a DataFrame") from None
else:
raise
NameError: No running Spark session. Start one before creating a DataFrame
NameError Traceback (most recent call last)
in engine
4 except NameError as e:
5 if str(e) == "name 'spark' is not defined":
----> 6 raise NameError("No running Spark session. Start one before creating a DataFrame") from None
7 else:
8 raise
NameError: No running Spark session. Start one before creating a DataFrame
This error message is more useful than the previous one as we know exactly what to do to get the code to run correctly: start a Spark session and run the code again:
spark = (SparkSession.builder.master("local[2]")
.appName("errors")
.getOrCreate())
try:
animal_df = spark.createDataFrame(data, columns)
animal_df.show()
except NameError as e:
if str(e) == "name 'spark' is not defined":
raise NameError("No running Spark session. Start one before creating a DataFrame") from None
else:
raise
+------+-----+
|animal|count|
+------+-----+
| Cat| 10|
| Dog| 5|
+------+-----+
As there are no errors in the try block, the except block is ignored here and the desired result is displayed.
R Example
Recall the object 'sc' not found error from earlier:
library(sparklyr)
library(dplyr)
# Create a base R DataFrame
animal_df <- data.frame(
animal = c("Cat", "Dog"),
count = c(10, 5),
stringsAsFactors = FALSE)
# Copy base R DataFrame to the Spark cluster
animal_sdf <- sdf_copy_to(sc, animal_df)
Error in sdf_copy_to(sc, animal_df) : object 'sc' not found
We can handle this exception and give a more useful error message.
In R you can test for the content of the error message. This ensures that we capture only the specific error which we want and others can be raised as usual. In this example, see if the error message contains object 'sc' not found.
The syntax here is worth explaining:
The expression to test and the error handling code are both contained within the tryCatch() statement; code outside this will not have any errors handled.
Code assigned to expr will be attempted to run
If there is no error, the rest of the code continues as usual
If an error is raised, the error function is called, with the error message e as an input
grepl() is used to test if "object 'sc' not found" is within e; if it is, then an error is raised with a custom error message that is more useful than the default
If the message is anything else, stop(e) will be called, which raises an error with e as the message
tryCatch(
expr = {
# Copy base R DataFrame to the Spark cluster
animal_sdf <- sparklyr::sdf_copy_to(sc, animal_df)
# Preview data
pillar::glimpse(animal_sdf)
},
error = function(e){
# Test to see if the error message contains `object 'sc' not found`
if(grepl("object 'sc' not found", e, fixed=TRUE)){
# Raise error with custom message if true
stop("No running Spark session. Start one before creating a sparklyr DataFrame")
}else{
# Raise error without modification
stop(e)
}
})
Error in value[[3L]](cond) :
No running Spark session. Start one before creating a sparklyr DataFrame
This error message is more useful than the previous one as we know exactly what to do to get the code to run correctly: start a Spark session and run the code again:
# Start Spark session
sc <- sparklyr::spark_connect(
master = "local[2]",
app_name = "errors",
config = sparklyr::spark_config())
tryCatch(
expr = {
# Copy base R DataFrame to the Spark cluster
animal_sdf <- sparklyr::sdf_copy_to(sc, animal_df)
# Preview data
pillar::glimpse(animal_sdf)
},
error = function(e){
# Test to see if the error message contains `object 'sc' not found`
if(grepl("object 'sc' not found", e, fixed=TRUE)){
# Raise error with custom message if true
stop("No running Spark session. Start one before creating a sparklyr DataFrame")
}else{
# Raise error without modification
stop(e)
}
})
Rows: ??
Columns: 2
Database: spark_connection
$ animal <chr> "Cat", "Dog"
$ count <dbl> 10, 5
As there are no errors in expr, the error statement is ignored here and the desired result is displayed.
Example 2: Handle multiple errors in a function#
This example shows how functions can be used to handle errors.
Python Example
We have started to see how useful try/except blocks can be, but they add extra lines of code which interrupt the flow for the reader. As such it is a good idea to wrap error handling in functions. You should document why you are choosing to handle the error and the docstring of a function is a natural place to do this.
As an example, define a wrapper function for spark.read.csv which reads a CSV file from HDFS. This can handle two types of errors:
If the Spark context has been stopped, it will return a custom error message that is much shorter and more descriptive
If the path does not exist, the same error message will be returned but raised from None to shorten the stack trace
Only the first error which is hit at runtime will be returned. This makes sense: the code could have multiple problems, but the execution will halt at the first one, meaning the rest can go undetected until that error is fixed.
This function uses some Python string methods to test for error message equality: str.find() and slicing strings with [:].
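As a quick, Spark-free reminder of how these two string methods behave (the example message here is made up):

```python
message = "'Path does not exist: hdfs://some/path;'"

# str.find() returns the index of the first match, or -1 if the text is not there
print(message.find("Path does not exist"))    # 1
print(message.find("stopped SparkContext"))   # -1

# Slicing with [:21] returns the first 21 characters of the string
print(message[:21])    # 'Path does not exist:
```

The wrapper function itself: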
from py4j.protocol import Py4JJavaError
from pyspark.sql.utils import AnalysisException
def read_csv_handle_exceptions(file_path):
"""
Read a CSV from HDFS and return a Spark DF
Custom exceptions will be raised for trying to read the CSV from a stopped
Spark context and if the path does not exist.
Args:
file_path (string): path of CSV on HDFS
Returns:
spark DataFrame
"""
try:
return spark.read.csv(file_path, header=True, inferSchema=True)
except Py4JJavaError as e:
# Uses str(e).find() to search for specific text within the error
if str(e).find("java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext") > -1:
# Use ... from None to ignore the stack trace in the output
raise Exception("Spark session has been stopped. Please start a new Spark session.") from None
else:
# Raise an exception if the error message is anything else
raise
except AnalysisException as e:
# See if the first 21 characters are the error we want to capture
if str(e)[:21] == "'Path does not exist:":
raise Exception(e) from None
else:
raise
Stop the Spark session and try to read in a CSV:
spark.stop()
no_df = read_csv_handle_exceptions("this/is_not/a/file_path.csv")
Exception: 'Path does not exist: hdfs://.../this/is_not/a/file_path.csv;'
Exception Traceback (most recent call last)
in engine
----> 1 df = read_csv_handle_exceptions("this/is_not/a/file_path.csv")
<ipython-input-1-394f508cffc3> in read_csv_handle_exceptions(file_path)
13 # See if the first 21 characters are the error we want to capture
14 if str(e)[:21] == "'Path does not exist:":
---> 15 raise Exception(e) from None
16 else:
17 raise
Exception: 'Path does not exist: hdfs://.../this/is_not/a/file_path.csv;'
Fix the path; this will give the other error:
import yaml
with open("ons-spark/config.yaml") as f:
config = yaml.safe_load(f)
rescue_path_csv = config["rescue_path_csv"]
rescue = read_csv_handle_exceptions(rescue_path_csv)
Exception: Spark session has been stopped. Please start a new Spark session.
Exception Traceback (most recent call last)
in engine
----> 1 rescue = read_csv_handle_exceptions(rescue_path_csv)
<ipython-input-1-de3ee93967c9> in read_csv_handle_exceptions(file_path)
17 if str(e).find("java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext") > -1:
18 # Use ... from None to ignore the stack trace in the output
---> 19 raise Exception("Spark session has been stopped. Please start a new Spark session.") from None
20 else:
21 # Raise an exception if the error message is anything else
Exception: Spark session has been stopped. Please start a new Spark session.
Correct both errors by starting a Spark session and reading the correct path:
spark = (SparkSession.builder.master("local[2]")
.appName("errors")
.getOrCreate())
rescue = read_csv_handle_exceptions(rescue_path_csv)
rescue.select("IncidentNumber", "AnimalGroupParent").show(3)
+--------------+-----------------+
|IncidentNumber|AnimalGroupParent|
+--------------+-----------------+
| 139091| Dog|
| 275091| Fox|
| 2075091| Dog|
+--------------+-----------------+
only showing top 3 rows
A better way of writing this function would be to add spark as a parameter to the function:
def read_csv_handle_exceptions(spark, file_path):
Writing the code in this way prompts for a Spark session and so should lead to fewer user errors when writing the code.
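A minimal sketch of what that change might look like is given below; the error handling body is unchanged from the function above and is elided here for brevity.

```python
def read_csv_handle_exceptions(spark, file_path):
    """Read a CSV from HDFS and return a Spark DataFrame.

    The Spark session is passed in explicitly rather than relied on as a global.
    """
    # (the try/except blocks from the function above go here, using the spark argument)
    return spark.read.csv(file_path, header=True, inferSchema=True)

# The caller now has to supply the session, making the dependency obvious
rescue = read_csv_handle_exceptions(spark, rescue_path_csv)
```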
R Example
We have started to see how useful the tryCatch() function is, but it adds extra lines of code which interrupt the flow for the reader. It is easy to assign a tryCatch() function to a custom function and this will make your code neater. You should document why you are choosing to handle the error in your code.
As an example, define a wrapper function for spark_read_csv() which reads a CSV file from HDFS. This can handle two types of errors:
If the Spark context has been stopped, it will return a custom error message that is much shorter and descriptive
If the path does not exist, a custom error message will be returned asking for a valid file path
Only the first error which is hit at runtime will be returned. This makes sense: the code could have multiple problems, but the execution will halt at the first one, meaning the rest can go undetected until that error is fixed.
This function uses grepl() to test if the error message contains a specific string:
read_csv_handle_exceptions <- function(file_path){
tryCatch(
expr = {
# Read a CSV file from HDFS
sdf <- sparklyr::spark_read_csv(sc,
file_path,
header=TRUE,
infer_schema=TRUE)
return(sdf)
},
error = function(e){
# See if the error is invalid connection and return custom error message if true
if(grepl("invalid connection", e, fixed=TRUE)){
stop("No running Spark session. Start one before creating a sparklyr DataFrame")
# See if the file path is valid; if not, return custom error message
}else if(grepl("AnalysisException: Path does not exist", e, fixed=TRUE)){
stop(paste("File path:",
file_path,
"does not exist. Please supply a valid file path.",
sep=" "))
# If the error message is neither of these, return the original error
}else{stop(e)}
})
}
Stop the Spark session and try to read in a CSV:
sparklyr::spark_disconnect(sc)
file_path <- "this/is_not/a/file_path.csv"
no_sdf <- read_csv_handle_exceptions(file_path)
Error in value[[3L]](cond) :
No running Spark session. Start one before creating a sparklyr DataFrame
Start a Spark session and try the function again; this will give the other error:
sc <- sparklyr::spark_connect(
master = "local[2]",
app_name = "errors",
config = sparklyr::spark_config())
no_sdf <- read_csv_handle_exceptions(file_path)
Error in value[[3L]](cond) :
File path: this/is_not/a/file_path.csv does not exist. Please supply a valid file path
Run without errors by supplying a correct path:
config <- yaml::yaml.load_file("ons-spark/config.yaml")
rescue <- read_csv_handle_exceptions(config$rescue_path_csv)
rescue %>%
sparklyr::select(IncidentNumber, AnimalGroupParent) %>%
pillar::glimpse()
Rows: ??
Columns: 2
Database: spark_connection
$ IncidentNumber <chr> "139091", "275091", "2075091", "2872091", "3553091",…
$ AnimalGroupParent <chr> "Dog", "Fox", "Dog", "Horse", "Rabbit", "Unknown - H…
A better way of writing this function would be to add sc as a parameter to the function:
read_csv_handle_exceptions <- function(sc, file_path)
Writing the code in this way prompts for a Spark session and so should lead to fewer user errors when writing the code.
Example 3: Capture and ignore the error#
Another option is to capture the error and ignore it. Generally you will only want to do this in limited circumstances when you are ignoring errors that you expect, and even then it is better to anticipate them using logic. After all, the code returned an error for a reason!
This example counts the number of distinct values in a column, returning 0 and printing a message if the column does not exist.
Python Example
Define a Python function in the usual way:
def distinct_count(df, input_column):
"""
Returns the number of unique values of a specified column in a Spark DF.
    Returns 0 and prints a message if input_column is not in df
Args:
df (spark DataFrame): input DataFrame
input_column (string): name of a column in df for which the distinct count is required
Returns:
int: Count of unique values in input_column
"""
try:
return df.select(input_column).distinct().count()
except AnalysisException as e:
# Derive an expected error
expected_error_str = f"cannot resolve '`{input_column}`' given input columns"
# Test if the error contains the expected_error_str
if str(e).find(expected_error_str) > -1:
# Print a message and continue
print(f"Column `{input_column}` does not exist. Returning `0`")
return 0
else:
# Raise an error otherwise
raise
Try one column which exists and one which does not:
rescue_path = config["rescue_path"]
rescue = spark.read.parquet(rescue_path)
distinct_count(rescue, "incident_number")
5898
distinct_count(rescue, "column_that_does_not_exist")
Column `column_that_does_not_exist` does not exist. Returning `0`
0
A better way would be to avoid the error in the first place by checking if the column exists before the .distinct():
def distinct_count(df, input_column):
# Test if column exists
if input_column in df.columns:
return df.select(input_column).distinct().count()
# Return 0 and print message if it does not exist
else:
print(f"Column `{input_column}` does not exist. Returning `0`")
return 0
distinct_count(rescue, "column_that_does_not_exist")
Column `column_that_does_not_exist` does not exist. Returning `0`
0
R Example
Define an R function in the usual way:
distinct_count <- function(sdf, input_column){
tryCatch(
expr = {
# Get the distinct count of input_column
return(
sdf %>%
sparklyr::select(all_of(input_column)) %>%
sparklyr::distinct() %>%
sparklyr::sdf_nrow()
)},
error = function(e){
# If the column does not exist, return 0 and print out a message
# rather than raise an exception
if(grepl("Can't subset columns that don't exist", e, fixed=TRUE)){
print(paste("Column",
input_column,
"does not exist. Returning `0`",
sep = " "))
return(0)
# If the error is anything else, return the original error message
}else{stop(e)}
}
)
}
Try one column which exists and one which does not:
incident_count <- distinct_count(rescue, "IncidentNumber")
incident_count
[1] 5898
zero_count <- distinct_count(rescue, "column_that_does_not_exist")
zero_count
[1] "Column column_that_does_not_exist does not exist. Returning `0`"
[1] 0
A better way would be to avoid the error in the first place by checking if the column exists:
distinct_count <- function(sdf, input_column){
col_names <- sdf %>% colnames()
# See if the column exists
if(input_column %in% col_names){
return(
# Get the distinct count of input_column
sdf %>%
sparklyr::select(all_of(input_column)) %>%
sparklyr::distinct() %>%
sparklyr::sdf_nrow()
)
}else{
# If the column does not exist, return 0 and print out a message
print(paste("Column",
input_column,
"does not exist. Returning `0`",
sep = " "))
return(0)
}
}
zero_count <- distinct_count(rescue, "column_that_does_not_exist")
zero_count
[1] "Column column_that_does_not_exist does not exist. Returning `0`"
[1] 0
Clean up options#
It is worth briefly mentioning the finally clause, which exists in both Python and R.
In Python, finally is added at the end of a try/except block. This is where clean up code is placed, which will always be run regardless of the outcome of the try/except. See Defining Clean Up Action for more information.
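A minimal sketch of finally is given below; it assumes a running Spark session and an existing file_path, and stops the session whether or not the read succeeds.

```python
from pyspark.sql.utils import AnalysisException

try:
    rescue = spark.read.parquet(file_path)
    rescue.show(3)
except AnalysisException:
    print(f"Could not read {file_path}")
finally:
    # Runs whether the read succeeded or failed, so the session is always stopped
    spark.stop()
```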
The tryCatch() function in R has two other options:
warning: Used to handle warnings; the usage is the same as error
finally: This is code that will be run regardless of any errors, often used for clean up if needed
Further Resources#
Spark at the ONS Articles:
PySpark Documentation:
pyspark.sql.utils: source code for AnalysisException
Python Documentation:
sparklyr and tidyverse Documentation:
Other Links:
Py4J Protocol: Details of Py4J Protocol errors