Data binning

Data binning#

Data binning is a data pre-processing method which transforms continuous or discrete data to categorical. The original values which fall into a specific interval are replaced by a value representive of that interval.

In a spark dataframe this is easily implemented by applying the when() function for PySpark and the if_else() (or case_when()) function from dplyr package for sparklyr.

In R, the case_when() package is a general vectorised if_else() and provides a more flexible and readable way to express conditional logic statements. With case_when() we can apply multiple conditions in a single call, quite convenient for 3 or more conditions since if_else() is designed for binary conditions.

Creating spark session and sample data#

We will start by creating a sample dataframe with Python or R and moving it into a spark cluster. For demonstration reasons we will create a dataframe with 10 rows with id and age columns, we will also populate age column with random values from 1 to 30.

#import packages

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pandas as pd
import numpy as np

spark = (SparkSession.builder.master("local[2]")
         .appName("bin-continuous-variables-pyspark")
         .getOrCreate())

np.random.seed(2023)

#number of rows to create synthetic data
n = 10

#create pandas dataframe
df = pd.DataFrame({
  'id':np.arange(n),
  'age':np.random.randint(1,30,n)
  })

#convert to spark dataframe
sdf = spark.createDataFrame(df)

#quick look
sdf.show()

library(sparklyr)
library(dplyr)

sc <- sparklyr::spark_connect(
  master = "local[2]",
  app_name = "bin-continuous-variables-sparklyr",
  config = sparklyr::spark_config())
  
set.seed(2024)

#number of rows to create synthetic data
n = 10

df <- data.frame(
  "id" = 1:n,
  "age" = sample(1:30,n,rep=TRUE))

sdf <- copy_to(sc ,df)

#quick look
sdf

Let’s have a quick look at the data:

+---+---+
|age| id|
+---+---+
| 24|  0|
| 26|  1|
|  7|  2|
| 24|  3|
|  2|  4|
| 29|  5|
|  4|  6|
| 21|  7|
| 21|  8|
| 23|  9|
+---+---+

# Source: spark<?> [?? x 3]
      id   age
   <int> <int>
   1    26
   2    10
   3    21
   4    21
   5    14
   6    22
   7    13
   8    10
   9    27
  10     4

Binning data into 2 groups#

The next step is defining how to parse the values into bins. This article focuses on the implentation of binning in a spark cluster, the allocation of data into bins is outside the scope of this article. For demonstation reasons we will create a new column age_bracket with the value young when age is less than or equal to 9 and old otherwise.

sdf = (
    sdf.withColumn('age_bracket',
                  F.when(sdf.age <= 9, 'young')
                    .otherwise('old')
                    )
)

#quick look
sdf.show()

sdf <- sdf %>%
  mutate(age_bracket = ifelse(age <= 9,
                              "young",
                              "old"
                              )
        )

#quick look
sdf

Let’s have another look at the data:

+---+---+-----------+
|age| id|age_bracket|
+---+---+-----------+
| 24|  0|        old|
| 26|  1|        old|
|  7|  2|      young|
| 24|  3|        old|
|  2|  4|      young|
| 29|  5|        old|
|  4|  6|      young|
| 21|  7|        old|
| 21|  8|        old|
| 23|  9|        old|
+---+---+-----------+

# Source: spark<?> [?? x 3]
      id   age age_bracket
   <int> <int> <chr>      
   1    26 old        
   2    10 old        
   3    21 old        
   4    21 old        
   5    14 old        
   6    22 old        
   7    13 old        
   8    10 old        
   9    27 old        
  10     4 young      

Binning data into 3 or more groups#

To split assign the data to 3 or more categories the same approach can be implemented as shown below. In this example, age_bracket is labeled as young when age is less than or equal to 9, as old when age is less than or equal to 19 and oldest otherwise.

In the R code snippet we are using case_when(). By avoiding the need of nested if_else() statements, we are expressing multiple conditions in a more readable and concise way, hence an easier code to understand and maintain.

sdf = (
    sdf.withColumn('age_bracket',
                  F.when(sdf.age <= 9, 'young')
                    .when(sdf.age <= 19, 'older')
                    .otherwise('oldest')
                    )
                  
)

sdf.show()

sdf <- sdf %>%
  mutate(age_bracket = case_when(
  age <= 9 ~ "young",
  age <= 19 ~ "old",
  .default = "oldest"))

sdf

Let’s check the output again:

+---+---+-----------+
|age| id|age_bracket|
+---+---+-----------+
| 24|  0|     oldest|
| 26|  1|     oldest|
|  7|  2|      young|
| 24|  3|     oldest|
|  2|  4|      young|
| 29|  5|     oldest|
|  4|  6|      young|
| 21|  7|     oldest|
| 21|  8|     oldest|
| 23|  9|     oldest|
+---+---+-----------+

# Source: spark<?> [?? x 3]
      id   age age_bracket
   <int> <int> <chr>      
   1    26 oldest     
   2    10 old        
   3    21 oldest     
   4    21 oldest     
   5    14 old        
   6    22 oldest     
   7    13 old        
   8    10 old        
   9    27 oldest     
  10     4 young   

Data binning

Contents

Data binning#

Creating spark session and sample data#

Binning data into 2 groups#

Binning data into 3 or more groups#