Add leading zeros with lpad()#

There are some datasets that consist of code columns, which are fixed width and are stored as strings even though they look like numbers. For instance, the data might have values such as "000123" and "456789".

However, sometimes these can appear as integers which will lose the initial zeros. So "000123" becomes 123. Commonly this can happen if at some point the data is manipulated in Excel and later read in a CSV file; it can also happen if the column is converted to a numeric type. An ONS example is Standard Industrial Classification (SIC). SIC codes are five digits, but some begin with \(0\).

There is an easy way to change this back into the correct format: F.lpad() from the functions package in PySpark and the Spark function lpad() inside mutate() in sparklyr. This will add a string that you specify to the start, making every value a fixed width string. For instance, "123" becomes "000123" but "456789" remains the same.

Example: Incident Numbers#

The Animal Rescue data has an incident_number column, which is unique and of variable length. We will add leading zeros to this column to make it of a consistent length.

First, start a Spark session and read in the Animal Rescue data, filter on "Police" and select the relevant columns:

import yaml
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").appName("padding").getOrCreate()

with open("../../../config.yaml") as f:
    config = yaml.safe_load(f)
    
rescue_path = config["rescue_path"]
rescue = (spark.read.parquet(rescue_path)
          .filter(F.col("origin_of_call") == "Police")
          .orderBy("date_time_of_call")
          .select("incident_number", "origin_of_call"))
rescue.show(5)
+---------------+--------------+
|incident_number|origin_of_call|
+---------------+--------------+
|      146647151|        Police|
|       66969111|        Police|
|      103407111|        Police|
|      137525091|        Police|
|      158794091|        Police|
+---------------+--------------+
only showing top 5 rows

The input to lpad() will most often be either a string or an integer. In this example, incident_number is a string:

rescue.printSchema()
root
 |-- incident_number: string (nullable = true)
 |-- origin_of_call: string (nullable = true)

Use F.length()/length() to demonstrate that the incident_number column is not always the same size:

rescue = rescue.withColumn("incident_no_length", F.length("incident_number"))
rescue.groupBy("incident_no_length").count().orderBy("incident_no_length").show()
+------------------+-----+
|incident_no_length|count|
+------------------+-----+
|                 6|    1|
|                 7|    6|
|                 8|   57|
|                 9|   65|
+------------------+-----+

We want the incident_number column to be a string with a fixed width of \(9\), with zeros at the start if the length is shorter than this.

lpad() takes three arguments. The first argument, col is the column name, the second, len is the fixed width of the string, and the third, pad, the value to pad it with if it is too short, often "0". The data type returned from lpad() will always be a string.

rescue = rescue.withColumn("padded_incident_no", F.lpad(F.col("incident_number"), 9, "0"))
rescue.orderBy("incident_no_length").show(5)
rescue.orderBy("incident_no_length", ascending=False).show(5)
+---------------+--------------+------------------+------------------+
|incident_number|origin_of_call|incident_no_length|padded_incident_no|
+---------------+--------------+------------------+------------------+
|         955141|        Police|                 6|         000955141|
|        7003121|        Police|                 7|         007003121|
|        6311101|        Police|                 7|         006311101|
|        5930131|        Police|                 7|         005930131|
|        3223101|        Police|                 7|         003223101|
+---------------+--------------+------------------+------------------+
only showing top 5 rows

+---------------+--------------+------------------+------------------+
|incident_number|origin_of_call|incident_no_length|padded_incident_no|
+---------------+--------------+------------------+------------------+
|      205017101|        Police|                 9|         205017101|
|      207037111|        Police|                 9|         207037111|
|      135844101|        Police|                 9|         135844101|
|      216289101|        Police|                 9|         216289101|
|      145879151|        Police|                 9|         145879151|
+---------------+--------------+------------------+------------------+
only showing top 5 rows

Be careful; if you set the fixed width to be too short you can lose data:

rescue = rescue.withColumn("too_short_inc_no", F.lpad(F.col("incident_number"), 3, "0"))
rescue.orderBy("incident_no_length", ascending=False).show(5)
+---------------+--------------+------------------+------------------+----------------+
|incident_number|origin_of_call|incident_no_length|padded_incident_no|too_short_inc_no|
+---------------+--------------+------------------+------------------+----------------+
|      114153091|        Police|                 9|         114153091|             114|
|      138096091|        Police|                 9|         138096091|             138|
|      110211101|        Police|                 9|         110211101|             110|
|      101172091|        Police|                 9|         101172091|             101|
|      102278091|        Police|                 9|         102278091|             102|
+---------------+--------------+------------------+------------------+----------------+
only showing top 5 rows

You can have values other than zero for the last argument and they do not have to be width 1, although there are fewer use cases for this. For example:

rescue = rescue.withColumn("silly_example", F.lpad(F.col("incident_number"), 14, "xyz"))
rescue.orderBy("incident_no_length").show(5)
+---------------+--------------+------------------+------------------+----------------+--------------+
|incident_number|origin_of_call|incident_no_length|padded_incident_no|too_short_inc_no| silly_example|
+---------------+--------------+------------------+------------------+----------------+--------------+
|         955141|        Police|                 6|         000955141|             955|xyzxyzxy955141|
|        7003121|        Police|                 7|         007003121|             700|xyzxyzx7003121|
|        6311101|        Police|                 7|         006311101|             631|xyzxyzx6311101|
|        5930131|        Police|                 7|         005930131|             593|xyzxyzx5930131|
|        3223101|        Police|                 7|         003223101|             322|xyzxyzx3223101|
+---------------+--------------+------------------+------------------+----------------+--------------+
only showing top 5 rows

Padding on the right: rpad()#

There is also a similar function, F.rpad()/rpad, that works in the same way with identical arguments, just padding to the right instead:

rescue = rescue.withColumn("right_padded_inc_no", F.rpad(F.col("incident_number"), 9, "0"))
rescue.select("incident_number", "right_padded_inc_no").show(5)
+---------------+-------------------+
|incident_number|right_padded_inc_no|
+---------------+-------------------+
|      146647151|          146647151|
|       66969111|          669691110|
|      103407111|          103407111|
|      137525091|          137525091|
|      158794091|          158794091|
+---------------+-------------------+
only showing top 5 rows

Further Resources#

Spark at the ONS Articles:

  • Using Spark Functions in sparklyr: note that lpad/rpad are inherited directly from Spark SQL and need to be used inside mutate(), rather than imported from the sparklyr package as standalone functions

PySpark Documentation:

Spark SQL Documentation:

Other links: