Rounding differences in Python, R and Spark#

Python, R and Spark have different ways of rounding numbers which end in .5: Python and R round to the nearest even integer (sometimes called bankers rounding), whereas Spark rounds away from zero (up for positive numbers and down for negative numbers), in the same way as Excel.
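
To see the two conventions side by side in plain Python, the decimal module lets you choose the rounding rule explicitly; a minimal sketch:

from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# ROUND_HALF_EVEN is bankers rounding (Python 3, R); ROUND_HALF_UP rounds
# ties away from zero (Spark, Excel)
for value in [Decimal("0.5"), Decimal("1.5"), Decimal("2.5")]:
    print(value,
          value.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN),
          value.quantize(Decimal("1"), rounding=ROUND_HALF_UP))
# prints:
# 0.5 0 1
# 1.5 2 2
# 2.5 2 3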

This can be confusing when using PySpark and sparklyr if you are used to the behaviour in Python and R.

Comparison of rounding methods#

Create a DataFrame with numbers all ending in .5, both positive and negative, using spark.range()/sdf_seq() and then dividing the id column by 2:

from pyspark.sql import SparkSession, functions as F
import pandas as pd
import numpy as np

spark = (SparkSession.builder.master("local[2]")
         .appName("rounding")
         .getOrCreate())

sdf = spark.range(-7, 8, 2).select((F.col("id") / 2).alias("half_id"))
sdf.show()
+-------+
|half_id|
+-------+
|   -3.5|
|   -2.5|
|   -1.5|
|   -0.5|
|    0.5|
|    1.5|
|    2.5|
|    3.5|
+-------+

Round using Spark with F.round()/round(); this will round away from zero (up for positive numbers and down for negative):

sdf = sdf.withColumn("spark_round", F.round("half_id"))
sdf.toPandas()
   half_id  spark_round
0     -3.5         -4.0
1     -2.5         -3.0
2     -1.5         -2.0
3     -0.5         -1.0
4      0.5          1.0
5      1.5          2.0
6      2.5          3.0
7      3.5          4.0

Now try using Python/R; this will use the bankers method of rounding:

pdf = sdf.toPandas()
pdf["python_round"] = round(pdf["half_id"], 0)
pdf
   half_id  spark_round  python_round
0     -3.5         -4.0          -4.0
1     -2.5         -3.0          -2.0
2     -1.5         -2.0          -2.0
3     -0.5         -1.0          -0.0
4      0.5          1.0           0.0
5      1.5          2.0           2.0
6      2.5          3.0           2.0
7      3.5          4.0           4.0

The two methods have returned different results, despite both using functions named round().
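
The same behaviour can be seen by calling the built-in round() directly; a quick illustration:

# Python's built-in round() uses bankers rounding: ties go to the nearest even integer
print(round(0.5), round(1.5), round(2.5), round(3.5))        # 0 2 2 4
print(round(-0.5), round(-1.5), round(-2.5), round(-3.5))    # 0 -2 -2 -4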

Just like in Python, pandas and numpy also use bankers rounding:

pdf["pd_round"] = pdf["half_id"].round()
pdf["np_round"] = np.round(pdf["half_id"])
pdf
   half_id  spark_round  python_round  pd_round  np_round
0     -3.5         -4.0          -4.0      -4.0      -4.0
1     -2.5         -3.0          -2.0      -2.0      -2.0
2     -1.5         -2.0          -2.0      -2.0      -2.0
3     -0.5         -1.0          -0.0      -0.0      -0.0
4      0.5          1.0           0.0       0.0       0.0
5      1.5          2.0           2.0       2.0       2.0
6      2.5          3.0           2.0       2.0       2.0
7      3.5          4.0           4.0       4.0       4.0

You can use the Python and R style of bankers rounding in Spark with F.bround()/bround():

sdf = sdf.withColumn("spark_bround", F.bround("half_id"))
sdf.toPandas()
   half_id  spark_round  spark_bround
0     -3.5         -4.0          -4.0
1     -2.5         -3.0          -2.0
2     -1.5         -2.0          -2.0
3     -0.5         -1.0           0.0
4      0.5          1.0           0.0
5      1.5          2.0           2.0
6      2.5          3.0           2.0
7      3.5          4.0           4.0

Other information on rounding#

UDFs and spark_apply()#

User Defined Functions (UDFs) in Python, and R code run on the Spark cluster with spark_apply(), will use bankers rounding, in common with regular Python and R.
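
For instance, a PySpark UDF applied to the DataFrame above calls the built-in round() and so rounds to even, even though it runs on the Spark cluster; a minimal sketch (the UDF name is just for illustration):

from pyspark.sql import types as T

# A Python UDF runs ordinary Python code on the executors, so round() inside it
# uses bankers rounding rather than Spark's round-away-from-zero
bankers_round_udf = F.udf(lambda x: float(round(x)), T.DoubleType())

sdf.withColumn("udf_round", bankers_round_udf("half_id")).show()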

Python 2#

The rounding method changed to bankers rounding in Python 3; Python 2 used the round-away-from-zero method, the same as Spark. It is strongly recommended to use Python 3 for any new code development, and note that Spark 3 has dropped support for Python 2.

Other common software#

Both Excel and SPSS Statistics use the Spark method of rounding away from zero. If you are new to coding and are learning Python or R predominantly to use Spark, be careful when using regular Python or R rounding functions, as their results will differ from Spark and Excel.

Testing#

Given that there are different ways of rounding depending on the language used, it is a good idea to thoroughly unit test your functions to ensure that they behave as expected.
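
For example, a pytest-style check of the two Spark functions might look like the sketch below (the fixture and test names are illustrative):

import pytest
from pyspark.sql import SparkSession, functions as F

@pytest.fixture(scope="module")
def spark():
    # Small local session for the tests; adjust the master setting to suit your environment
    return (SparkSession.builder.master("local[2]")
            .appName("rounding-tests")
            .getOrCreate())

def test_round_is_half_away_from_zero(spark):
    sdf = spark.createDataFrame([(0.5,), (1.5,), (-0.5,)], ["value"])
    result = [row["rounded"] for row in
              sdf.select(F.round("value").alias("rounded")).collect()]
    assert result == [1.0, 2.0, -1.0]

def test_bround_is_half_to_even(spark):
    sdf = spark.createDataFrame([(0.5,), (1.5,), (-0.5,)], ["value"])
    result = [row["rounded"] for row in
              sdf.select(F.bround("value").alias("rounded")).collect()]
    assert result == [0.0, 2.0, 0.0]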

Further Resources#

Spark at the ONS Articles:

PySpark Documentation:

sparklyr Documentation:

Spark SQL Documentation: