Data binning#

Data binning is a data pre-processing method which transforms continuous or discrete data to categorical. The original values which fall into a specific interval are replaced by a value representive of that interval.

In a spark dataframe this is easily implemented by applying the when() function for PySpark and the if_else() (or case_when()) function from dplyr package for sparklyr.

In R, the case_when() package is a general vectorised if_else() and provides a more flexible and readable way to express conditional logic statements. With case_when() we can apply multiple conditions in a single call, quite convenient for 3 or more conditions since if_else() is designed for binary conditions.

Creating spark session and sample data#

We will start by creating a sample dataframe with Python or R and moving it into a spark cluster. For demonstration reasons we will create a dataframe with 10 rows with id and age columns, we will also populate age column with random values from 1 to 30.

#import packages

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pandas as pd
import numpy as np

spark = (SparkSession.builder.master("local[2]")
         .appName("bin-continuous-variables-pyspark")
         .getOrCreate())

np.random.seed(2023)

#number of rows to create synthetic data
n = 10

#create pandas dataframe
df = pd.DataFrame({
  'id':np.arange(n),
  'age':np.random.randint(1,30,n)
  })

#convert to spark dataframe
sdf = spark.createDataFrame(df)

#quick look
sdf.show()

Let’s have a quick look at the data:

+---+---+
|age| id|
+---+---+
| 24|  0|
| 26|  1|
|  7|  2|
| 24|  3|
|  2|  4|
| 29|  5|
|  4|  6|
| 21|  7|
| 21|  8|
| 23|  9|
+---+---+

Binning data into 2 groups#

The next step is defining how to parse the values into bins. This article focuses on the implentation of binning in a spark cluster, the allocation of data into bins is outside the scope of this article. For demonstation reasons we will create a new column age_bracket with the value young when age is less than or equal to 9 and old otherwise.

sdf = (
    sdf.withColumn('age_bracket',
                  F.when(sdf.age <= 9, 'young')
                    .otherwise('old')
                    )
)

#quick look
sdf.show()

Let’s have another look at the data:

+---+---+-----------+
|age| id|age_bracket|
+---+---+-----------+
| 24|  0|        old|
| 26|  1|        old|
|  7|  2|      young|
| 24|  3|        old|
|  2|  4|      young|
| 29|  5|        old|
|  4|  6|      young|
| 21|  7|        old|
| 21|  8|        old|
| 23|  9|        old|
+---+---+-----------+

Binning data into 3 or more groups#

To split assign the data to 3 or more categories the same approach can be implemented as shown below. In this example, age_bracket is labeled as young when age is less than or equal to 9, as old when age is less than or equal to 19 and oldest otherwise.

In the R code snippet we are using case_when(). By avoiding the need of nested if_else() statements, we are expressing multiple conditions in a more readable and concise way, hence an easier code to understand and maintain.

sdf = (
    sdf.withColumn('age_bracket',
                  F.when(sdf.age <= 9, 'young')
                    .when(sdf.age <= 19, 'older')
                    .otherwise('oldest')
                    )
                  
)

sdf.show()

Let’s check the output again:

+---+---+-----------+
|age| id|age_bracket|
+---+---+-----------+
| 24|  0|     oldest|
| 26|  1|     oldest|
|  7|  2|      young|
| 24|  3|     oldest|
|  2|  4|      young|
| 29|  5|     oldest|
|  4|  6|      young|
| 21|  7|     oldest|
| 21|  8|     oldest|
| 23|  9|     oldest|
+---+---+-----------+