Add leading zeros with lpad()
Some datasets contain code columns: fixed-width values that are stored as strings even though they look like numbers, for instance "000123" and "456789". However, these can sometimes end up as integers, which loses the leading zeros, so "000123" becomes 123. Commonly this happens if at some point the data is manipulated in Excel and later read in from a CSV file; it can also happen if the column is converted to a numeric type. An ONS example is the Standard Industrial Classification (SIC). SIC codes are five digits, but some begin with 0.
There is an easy way to change this back into the correct format: F.lpad() from the functions module in PySpark, or the Spark function lpad() inside mutate() in sparklyr. This pads the start of each value with a string that you specify, making every value a fixed-width string. For instance, with a width of 6 and a pad of "0", "123" becomes "000123" but "456789" remains the same.
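Before turning to real data, here is a minimal sketch of the problem and the fix, using a hypothetical two-row DataFrame (the column names are illustrative):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").appName("lpad-demo").getOrCreate()

# Hypothetical codes stored correctly as strings
codes = spark.createDataFrame([("000123",), ("456789",)], ["code"])

# Casting to an integer drops the leading zeros: "000123" becomes 123
codes = codes.withColumn("as_int", F.col("code").cast("int"))

# lpad() restores the fixed width of 6, padding with "0"; "456789" is unchanged
codes.withColumn("restored", F.lpad(F.col("as_int"), 6, "0")).show()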
Example: Incident Numbers
The Animal Rescue data has an incident_number column, which is unique and of variable length. We will add leading zeros to this column to make it a consistent length.
First, start a Spark session, read in the Animal Rescue data, filter on "Police" and select the relevant columns:
import yaml
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").appName("padding").getOrCreate()

with open("../../../config.yaml") as f:
    config = yaml.safe_load(f)
rescue_path = config["rescue_path"]

rescue = (spark.read.parquet(rescue_path)
    .filter(F.col("origin_of_call") == "Police")
    .orderBy("date_time_of_call")
    .select("incident_number", "origin_of_call"))

rescue.show(5)
library(sparklyr)

sc <- sparklyr::spark_connect(
    master = "local[2]",
    app_name = "padding",
    config = sparklyr::spark_config())

config <- yaml::yaml.load_file("ons-spark/config.yaml")

rescue <- sparklyr::spark_read_parquet(sc, config$rescue_path) %>%
    sparklyr::filter(origin_of_call == "Police") %>%
    dplyr::arrange(date_time_of_call) %>%
    sparklyr::select(incident_number, origin_of_call)

rescue %>%
    head(5) %>%
    sparklyr::collect() %>%
    print()
+---------------+--------------+
|incident_number|origin_of_call|
+---------------+--------------+
| 146647151| Police|
| 66969111| Police|
| 103407111| Police|
| 137525091| Police|
| 158794091| Police|
+---------------+--------------+
only showing top 5 rows
# A tibble: 5 × 2
incident_number origin_of_call
<chr> <chr>
1 146647151 Police
2 66969111 Police
3 103407111 Police
4 137525091 Police
5 158794091 Police
The input to lpad() will most often be either a string or an integer. In this example, incident_number is a string:
rescue.printSchema()
pillar::glimpse(rescue)
root
|-- incident_number: string (nullable = true)
|-- origin_of_call: string (nullable = true)
Rows: ??
Columns: 2
Database: spark_connection
Ordered by: date_time_of_call
$ incident_number <chr> "146647151", "66969111", "103407111", "137525091", "15…
$ origin_of_call <chr> "Police", "Police", "Police", "Police", "Police", "Pol…
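lpad() is not limited to string input. A brief sketch with a toy DataFrame (reusing spark and F from above; the column name is illustrative) shows that an integer column is also accepted, and that the padded result is always a string:
# Toy integer column: Spark infers bigint for Python ints
ints = spark.createDataFrame([(123,), (456789,)], ["n"])

# lpad() implicitly casts the integers to strings before padding
padded = ints.withColumn("padded", F.lpad(F.col("n"), 6, "0"))
padded.printSchema()  # n: bigint, padded: string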
Use F.length() in PySpark or length() in sparklyr to demonstrate that the incident_number column is not always the same length:
rescue = rescue.withColumn("incident_no_length", F.length("incident_number"))
rescue.groupBy("incident_no_length").count().orderBy("incident_no_length").show()
rescue <- rescue %>%
    sparklyr::mutate(incident_no_length = length(incident_number))

rescue %>%
    dplyr::group_by(incident_no_length, origin_of_call) %>%
    dplyr::summarise(count = n()) %>%
    head(5) %>%
    sparklyr::collect() %>%
    print()
+------------------+-----+
|incident_no_length|count|
+------------------+-----+
| 6| 1|
| 7| 6|
| 8| 57|
| 9| 65|
+------------------+-----+
# A tibble: 4 × 3
incident_no_length origin_of_call count
<int> <chr> <dbl>
1 6 Police 1
2 8 Police 57
3 7 Police 6
4 9 Police 65
We want the incident_number column to be a string with a fixed width of 9, with zeros at the start if the value is shorter than this.
lpad() takes three arguments. The first, col, is the column name; the second, len, is the fixed width of the string; and the third, pad, is the value to pad it with if it is too short, often "0". The data type returned from lpad() will always be a string.
rescue = rescue.withColumn("padded_incident_no", F.lpad(F.col("incident_number"), 9, "0"))
rescue.orderBy("incident_no_length").show(5)
rescue.orderBy("incident_no_length", ascending=False).show(5)
rescue <- rescue %>%
    sparklyr::mutate(padded_incident_no = lpad(incident_number, 9, "0"))

rescue %>%
    dplyr::arrange(incident_no_length) %>%
    head(5) %>%
    sparklyr::collect() %>%
    print()

rescue %>%
    dplyr::arrange(desc(incident_no_length)) %>%
    head(5) %>%
    sparklyr::collect() %>%
    print()
+---------------+--------------+------------------+------------------+
|incident_number|origin_of_call|incident_no_length|padded_incident_no|
+---------------+--------------+------------------+------------------+
| 955141| Police| 6| 000955141|
| 7003121| Police| 7| 007003121|
| 6311101| Police| 7| 006311101|
| 5930131| Police| 7| 005930131|
| 3223101| Police| 7| 003223101|
+---------------+--------------+------------------+------------------+
only showing top 5 rows
+---------------+--------------+------------------+------------------+
|incident_number|origin_of_call|incident_no_length|padded_incident_no|
+---------------+--------------+------------------+------------------+
| 205017101| Police| 9| 205017101|
| 207037111| Police| 9| 207037111|
| 135844101| Police| 9| 135844101|
| 216289101| Police| 9| 216289101|
| 145879151| Police| 9| 145879151|
+---------------+--------------+------------------+------------------+
only showing top 5 rows
# A tibble: 5 × 4
incident_number origin_of_call incident_no_length padded_incident_no
<chr> <chr> <int> <chr>
1 955141 Police 6 000955141
2 3215101 Police 7 003215101
3 3223101 Police 7 003223101
4 7003121 Police 7 007003121
5 6311101 Police 7 006311101
# A tibble: 5 × 4
incident_number origin_of_call incident_no_length padded_incident_no
<chr> <chr> <int> <chr>
1 163121101 Police 9 163121101
2 103225141 Police 9 103225141
3 114153091 Police 9 114153091
4 101172091 Police 9 101172091
5 110211101 Police 9 110211101
Be careful: if you set the fixed width too short you will lose data, as values longer than len are truncated:
rescue = rescue.withColumn("too_short_inc_no", F.lpad(F.col("incident_number"), 3, "0"))
rescue.orderBy("incident_no_length", ascending=False).show(5)
rescue <- rescue %>%
    sparklyr::mutate(too_short_inc_no = lpad(incident_number, 3, "0"))

rescue %>%
    dplyr::arrange(desc(incident_no_length)) %>%
    head(5) %>%
    sparklyr::collect() %>%
    print()
+---------------+--------------+------------------+------------------+----------------+
|incident_number|origin_of_call|incident_no_length|padded_incident_no|too_short_inc_no|
+---------------+--------------+------------------+------------------+----------------+
| 114153091| Police| 9| 114153091| 114|
| 138096091| Police| 9| 138096091| 138|
| 110211101| Police| 9| 110211101| 110|
| 101172091| Police| 9| 101172091| 101|
| 102278091| Police| 9| 102278091| 102|
+---------------+--------------+------------------+------------------+----------------+
only showing top 5 rows
# A tibble: 5 × 5
incident_number origin_of_call incident_no_length padded_incident_no
<chr> <chr> <int> <chr>
1 163121101 Police 9 163121101
2 103225141 Police 9 103225141
3 114153091 Police 9 114153091
4 101172091 Police 9 101172091
5 110211101 Police 9 110211101
# … with 1 more variable: too_short_inc_no <chr>
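One way to avoid this hazard (a defensive sketch, not part of the original pipeline) is to confirm that no value is longer than the chosen width before padding, reusing the incident_no_length column computed earlier:
# Find the longest incident number; padding to width 9 is only safe if
# nothing exceeds 9 characters
max_len = rescue.agg(F.max("incident_no_length")).collect()[0][0]
assert max_len <= 9, f"a width of 9 would truncate values of length {max_len}"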
The pad value does not have to be "0", and it can be more than one character wide, although there are fewer use cases for this. For example:
rescue = rescue.withColumn("silly_example", F.lpad(F.col("incident_number"), 14, "xyz"))
rescue.orderBy("incident_no_length").show(5)
rescue <- rescue %>%
    sparklyr::mutate(silly_example = lpad(incident_number, 14, "xyz"))

rescue %>%
    dplyr::arrange(incident_no_length) %>%
    head(5) %>%
    sparklyr::collect() %>%
    print()
+---------------+--------------+------------------+------------------+----------------+--------------+
|incident_number|origin_of_call|incident_no_length|padded_incident_no|too_short_inc_no| silly_example|
+---------------+--------------+------------------+------------------+----------------+--------------+
| 955141| Police| 6| 000955141| 955|xyzxyzxy955141|
| 7003121| Police| 7| 007003121| 700|xyzxyzx7003121|
| 6311101| Police| 7| 006311101| 631|xyzxyzx6311101|
| 5930131| Police| 7| 005930131| 593|xyzxyzx5930131|
| 3223101| Police| 7| 003223101| 322|xyzxyzx3223101|
+---------------+--------------+------------------+------------------+----------------+--------------+
only showing top 5 rows
# A tibble: 5 × 6
incident_number origin_of_call incident_no_length padded_incident_no
<chr> <chr> <int> <chr>
1 955141 Police 6 000955141
2 3215101 Police 7 003215101
3 3223101 Police 7 003223101
4 7003121 Police 7 007003121
5 6311101 Police 7 006311101
# … with 2 more variables: too_short_inc_no <chr>, silly_example <chr>
Padding on the right: rpad()
There is also a similar function, F.rpad() in PySpark and rpad() in sparklyr, that works in the same way with identical arguments, just padding on the right instead:
rescue = rescue.withColumn("right_padded_inc_no", F.rpad(F.col("incident_number"), 9, "0"))
rescue.select("incident_number", "right_padded_inc_no").show(5)
rescue <- rescue %>%
    sparklyr::mutate(right_padded_inc_no = rpad(incident_number, 9, "0"))

rescue %>%
    dplyr::arrange(right_padded_inc_no) %>%
    sparklyr::select(incident_number, right_padded_inc_no) %>%
    head(5) %>%
    sparklyr::collect() %>%
    print()
+---------------+-------------------+
|incident_number|right_padded_inc_no|
+---------------+-------------------+
| 146647151| 146647151|
| 66969111| 669691110|
| 103407111| 103407111|
| 137525091| 137525091|
| 158794091| 158794091|
+---------------+-------------------+
only showing top 5 rows
# A tibble: 5 × 2
incident_number right_padded_inc_no
<chr> <chr>
1 53242091 532420910
2 45531111 455311110
3 19747141 197471410
4 21512131 215121310
5 220308091 220308091
Further Resources
Spark at the ONS Articles:
Using Spark Functions in sparklyr: note that lpad/rpad are inherited directly from Spark SQL and need to be used inside mutate(), rather than imported from the sparklyr package as standalone functions
PySpark Documentation: F.lpad(), F.rpad(), F.length()
Spark SQL Documentation: lpad, rpad, length