Arrays Functions in PySpark#
PySpark DataFrames can contain array columns. You can think of a PySpark array column in a similar way to a Python list. Arrays can be useful if you have data of a variable length. They can be tricky to handle, so you may want to create new rows for each element in the array, or change them to a string.
There are many functions for handling arrays. Here we will just demonstrate a few of them. For a full list, take a look at the PySpark documentation.
Note that since Spark 3.0, arrays are supported in sparklyr, although they are not covered in this article. The relevant sparklyr functions begin hof_
(higher order function), e.g. hof_transform()
Creating a DataFrame with arrays#
You will encounter arrays most frequently when reading in data. Spark can read parquet files that contain array columns. For this example, we will create a small DataFrame manually with an array column. To do this, simply create the DataFrame in the usual way, but supply a Python list for the column values to spark.createDataFrame()
. The lists do not have to have the same number of elements.
If using a schema to create the DataFrame, import ArrayType()
or use array<type>
if using DDL notation, which is array<string>
in this example.
The example used here will use champions of the Wimbledon tennis tournament. The score for a tennis match is often listed by individual sets, which can be displayed as an array. This array will be of variable length, as the match stops once someone wins two sets in women’s matches and three in men’s, meaning the array can vary between two and five elements. As the tournament was cancelled in 2020, these entries will be null
.
from pyspark.sql import SparkSession, functions as F
spark = (SparkSession.builder.master("local[2]")
.appName("arrays")
.getOrCreate())
champions = spark.createDataFrame([
[2017, "Gentlemen's Singles", "Federer", "Čilić", ["6-3", "6-1", "6-4"]],
[2018, "Gentlemen's Singles", "Djokovic", "Anderson", ["6-2", "6-2", "7-6"]],
[2019, "Gentlemen's Singles", "Djokovic", "Federer", ["7-6", "1-6", "7-6", "4-6", "13-12"]],
[2020, "Gentlemen's Singles", None, None, None], #No tournament in 2020
[2021, "Gentlemen's Singles", "Djokovic", "Berrettini", ["6-7", "6-4", "6-4", "6-3"]],
[2017, "Ladies' Singles", "Muguruza", "V Williams", ["7-5", "6-0"]],
[2018, "Ladies' Singles", "Kerber", "S Williams", ["6-3", "6-3"]],
[2019, "Ladies' Singles", "Halep", "S Williams", ["6-2", "6-2"]],
[2020, "Ladies' Singles", None, None, None], #No tournament in 2020]
[2021, "Ladies' Singles", "Barty", "Plíšková", ["6-3", "6-7", "6-3"]],
], ["tournament_year", "event", "champion", "runner_up", "final_score"])
champions.show(truncate=False)
+---------------+-------------------+--------+----------+---------------------------+
|tournament_year|event |champion|runner_up |final_score |
+---------------+-------------------+--------+----------+---------------------------+
|2017 |Gentlemen's Singles|Federer |Čilić |[6-3, 6-1, 6-4] |
|2018 |Gentlemen's Singles|Djokovic|Anderson |[6-2, 6-2, 7-6] |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|
|2020 |Gentlemen's Singles|null |null |null |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |
|2017 |Ladies' Singles |Muguruza|V Williams|[7-5, 6-0] |
|2018 |Ladies' Singles |Kerber |S Williams|[6-3, 6-3] |
|2019 |Ladies' Singles |Halep |S Williams|[6-2, 6-2] |
|2020 |Ladies' Singles |null |null |null |
|2021 |Ladies' Singles |Barty |Plíšková |[6-3, 6-7, 6-3] |
+---------------+-------------------+--------+----------+---------------------------+
Use .dtypes
to confirm that final_score
is a string array:
champions.dtypes
[('tournament_year', 'bigint'),
('event', 'string'),
('champion', 'string'),
('runner_up', 'string'),
('final_score', 'array<string>')]
Get length of array: F.size()
#
F.size()
returns the number of elements in the array. This returns -1
for null
values. In this example, we can see how many sets were played in each match:
champions_size = champions.withColumn("total_sets", F.size("final_score"))
champions_size.show(truncate=False)
+---------------+-------------------+--------+----------+---------------------------+----------+
|tournament_year|event |champion|runner_up |final_score |total_sets|
+---------------+-------------------+--------+----------+---------------------------+----------+
|2017 |Gentlemen's Singles|Federer |Čilić |[6-3, 6-1, 6-4] |3 |
|2018 |Gentlemen's Singles|Djokovic|Anderson |[6-2, 6-2, 7-6] |3 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|5 |
|2020 |Gentlemen's Singles|null |null |null |-1 |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |4 |
|2017 |Ladies' Singles |Muguruza|V Williams|[7-5, 6-0] |2 |
|2018 |Ladies' Singles |Kerber |S Williams|[6-3, 6-3] |2 |
|2019 |Ladies' Singles |Halep |S Williams|[6-2, 6-2] |2 |
|2020 |Ladies' Singles |null |null |null |-1 |
|2021 |Ladies' Singles |Barty |Plíšková |[6-3, 6-7, 6-3] |3 |
+---------------+-------------------+--------+----------+---------------------------+----------+
Convert array to string: F.concat_ws()
#
You may be familiar with F.concat_ws()
to concatenate string columns. It can also be used with arrays. null
values will be mapped to an empty string.
The first argument is the separator, followed by the columns to concatenate.
champions_concat = (champions
# Traditional use of concat_ws: concatenate strings
.withColumn("match", F.concat_ws(" v ", "champion", "runner_up"))
# Can also use with arrays
.withColumn("final_score_string", F.concat_ws(", ", "final_score"))
.select("tournament_year", "event", "match", "final_score", "final_score_string"))
champions_concat.show(truncate=False)
+---------------+-------------------+---------------------+---------------------------+-------------------------+
|tournament_year|event |match |final_score |final_score_string |
+---------------+-------------------+---------------------+---------------------------+-------------------------+
|2017 |Gentlemen's Singles|Federer v Čilić |[6-3, 6-1, 6-4] |6-3, 6-1, 6-4 |
|2018 |Gentlemen's Singles|Djokovic v Anderson |[6-2, 6-2, 7-6] |6-2, 6-2, 7-6 |
|2019 |Gentlemen's Singles|Djokovic v Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|7-6, 1-6, 7-6, 4-6, 13-12|
|2020 |Gentlemen's Singles| |null | |
|2021 |Gentlemen's Singles|Djokovic v Berrettini|[6-7, 6-4, 6-4, 6-3] |6-7, 6-4, 6-4, 6-3 |
|2017 |Ladies' Singles |Muguruza v V Williams|[7-5, 6-0] |7-5, 6-0 |
|2018 |Ladies' Singles |Kerber v S Williams |[6-3, 6-3] |6-3, 6-3 |
|2019 |Ladies' Singles |Halep v S Williams |[6-2, 6-2] |6-2, 6-2 |
|2020 |Ladies' Singles | |null | |
|2021 |Ladies' Singles |Barty v Plíšková |[6-3, 6-7, 6-3] |6-3, 6-7, 6-3 |
+---------------+-------------------+---------------------+---------------------------+-------------------------+
Convert string to array: F.split()
#
The opposite of F.concat_ws()
is F.split()
. This example uses the separator ", "
(as above), but it can also handle regexes.
Note that the empty strings (that were originally null
values in the array column) have been mapped to empty arrays.
champions_array = (champions_concat
.withColumn("final_score_array",
F.split("final_score_string", ", ")))
champions_array.drop("final_score").show(truncate=False)
+---------------+-------------------+---------------------+-------------------------+---------------------------+
|tournament_year|event |match |final_score_string |final_score_array |
+---------------+-------------------+---------------------+-------------------------+---------------------------+
|2017 |Gentlemen's Singles|Federer v Čilić |6-3, 6-1, 6-4 |[6-3, 6-1, 6-4] |
|2018 |Gentlemen's Singles|Djokovic v Anderson |6-2, 6-2, 7-6 |[6-2, 6-2, 7-6] |
|2019 |Gentlemen's Singles|Djokovic v Federer |7-6, 1-6, 7-6, 4-6, 13-12|[7-6, 1-6, 7-6, 4-6, 13-12]|
|2020 |Gentlemen's Singles| | |[] |
|2021 |Gentlemen's Singles|Djokovic v Berrettini|6-7, 6-4, 6-4, 6-3 |[6-7, 6-4, 6-4, 6-3] |
|2017 |Ladies' Singles |Muguruza v V Williams|7-5, 6-0 |[7-5, 6-0] |
|2018 |Ladies' Singles |Kerber v S Williams |6-3, 6-3 |[6-3, 6-3] |
|2019 |Ladies' Singles |Halep v S Williams |6-2, 6-2 |[6-2, 6-2] |
|2020 |Ladies' Singles | | |[] |
|2021 |Ladies' Singles |Barty v Plíšková |6-3, 6-7, 6-3 |[6-3, 6-7, 6-3] |
+---------------+-------------------+---------------------+-------------------------+---------------------------+
Create a new row for each array element: F.explode()
and F.explode_outer()
#
F.explode()
is used to create a new row for each element in the array. It can be much easier to handle the data from the array in this format.
In this example, F.explode()
extracts the results of the individual sets, so the new column is called set_score
.
champions_explode = champions.withColumn("set_score",
F.explode("final_score"))
champions_explode.show(truncate=False)
+---------------+-------------------+--------+----------+---------------------------+---------+
|tournament_year|event |champion|runner_up |final_score |set_score|
+---------------+-------------------+--------+----------+---------------------------+---------+
|2017 |Gentlemen's Singles|Federer |Čilić |[6-3, 6-1, 6-4] |6-3 |
|2017 |Gentlemen's Singles|Federer |Čilić |[6-3, 6-1, 6-4] |6-1 |
|2017 |Gentlemen's Singles|Federer |Čilić |[6-3, 6-1, 6-4] |6-4 |
|2018 |Gentlemen's Singles|Djokovic|Anderson |[6-2, 6-2, 7-6] |6-2 |
|2018 |Gentlemen's Singles|Djokovic|Anderson |[6-2, 6-2, 7-6] |6-2 |
|2018 |Gentlemen's Singles|Djokovic|Anderson |[6-2, 6-2, 7-6] |7-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|7-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|1-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|7-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|4-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|13-12 |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |6-7 |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |6-4 |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |6-4 |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |6-3 |
|2017 |Ladies' Singles |Muguruza|V Williams|[7-5, 6-0] |7-5 |
|2017 |Ladies' Singles |Muguruza|V Williams|[7-5, 6-0] |6-0 |
|2018 |Ladies' Singles |Kerber |S Williams|[6-3, 6-3] |6-3 |
|2018 |Ladies' Singles |Kerber |S Williams|[6-3, 6-3] |6-3 |
|2019 |Ladies' Singles |Halep |S Williams|[6-2, 6-2] |6-2 |
+---------------+-------------------+--------+----------+---------------------------+---------+
only showing top 20 rows
F.explode_outer()
works in the same way, but retains null
values:
champions_explode_outer = champions.withColumn("set_score",
F.explode_outer("final_score"))
champions_explode_outer.show(truncate=False)
+---------------+-------------------+--------+----------+---------------------------+---------+
|tournament_year|event |champion|runner_up |final_score |set_score|
+---------------+-------------------+--------+----------+---------------------------+---------+
|2017 |Gentlemen's Singles|Federer |Čilić |[6-3, 6-1, 6-4] |6-3 |
|2017 |Gentlemen's Singles|Federer |Čilić |[6-3, 6-1, 6-4] |6-1 |
|2017 |Gentlemen's Singles|Federer |Čilić |[6-3, 6-1, 6-4] |6-4 |
|2018 |Gentlemen's Singles|Djokovic|Anderson |[6-2, 6-2, 7-6] |6-2 |
|2018 |Gentlemen's Singles|Djokovic|Anderson |[6-2, 6-2, 7-6] |6-2 |
|2018 |Gentlemen's Singles|Djokovic|Anderson |[6-2, 6-2, 7-6] |7-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|7-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|1-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|7-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|4-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|13-12 |
|2020 |Gentlemen's Singles|null |null |null |null |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |6-7 |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |6-4 |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |6-4 |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |6-3 |
|2017 |Ladies' Singles |Muguruza|V Williams|[7-5, 6-0] |7-5 |
|2017 |Ladies' Singles |Muguruza|V Williams|[7-5, 6-0] |6-0 |
|2018 |Ladies' Singles |Kerber |S Williams|[6-3, 6-3] |6-3 |
|2018 |Ladies' Singles |Kerber |S Williams|[6-3, 6-3] |6-3 |
+---------------+-------------------+--------+----------+---------------------------+---------+
only showing top 20 rows
Explode with position: F.posexplode()
and F.posexplode_outer()
#
F.posexplode()
returns two columns, the position of the element within the array as well as the element itself. In common with Python lists, this starts at zero.
Note that you have to use .select()
with .alias()
rather than .withColumn()
, due to the function returning two columns. .select("*")
selects all the existing columns, similar to SELECT *
in SQL.
champions_posexplode = (champions
.select("*",
F.posexplode("final_score").alias("set_no", "set_score"))
# Add 1 to set number, as index starts from 0
.withColumn("set_no", F.col("set_no") + 1))
champions_posexplode.show(truncate=False)
+---------------+-------------------+--------+----------+---------------------------+------+---------+
|tournament_year|event |champion|runner_up |final_score |set_no|set_score|
+---------------+-------------------+--------+----------+---------------------------+------+---------+
|2017 |Gentlemen's Singles|Federer |Čilić |[6-3, 6-1, 6-4] |1 |6-3 |
|2017 |Gentlemen's Singles|Federer |Čilić |[6-3, 6-1, 6-4] |2 |6-1 |
|2017 |Gentlemen's Singles|Federer |Čilić |[6-3, 6-1, 6-4] |3 |6-4 |
|2018 |Gentlemen's Singles|Djokovic|Anderson |[6-2, 6-2, 7-6] |1 |6-2 |
|2018 |Gentlemen's Singles|Djokovic|Anderson |[6-2, 6-2, 7-6] |2 |6-2 |
|2018 |Gentlemen's Singles|Djokovic|Anderson |[6-2, 6-2, 7-6] |3 |7-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|1 |7-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|2 |1-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|3 |7-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|4 |4-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|5 |13-12 |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |1 |6-7 |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |2 |6-4 |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |3 |6-4 |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |4 |6-3 |
|2017 |Ladies' Singles |Muguruza|V Williams|[7-5, 6-0] |1 |7-5 |
|2017 |Ladies' Singles |Muguruza|V Williams|[7-5, 6-0] |2 |6-0 |
|2018 |Ladies' Singles |Kerber |S Williams|[6-3, 6-3] |1 |6-3 |
|2018 |Ladies' Singles |Kerber |S Williams|[6-3, 6-3] |2 |6-3 |
|2019 |Ladies' Singles |Halep |S Williams|[6-2, 6-2] |1 |6-2 |
+---------------+-------------------+--------+----------+---------------------------+------+---------+
only showing top 20 rows
Combine with .filter()
to extract just the result of the first set:
champions_posexplode.filter(F.col("set_no") == 1).show(truncate=False)
+---------------+-------------------+--------+----------+---------------------------+------+---------+
|tournament_year|event |champion|runner_up |final_score |set_no|set_score|
+---------------+-------------------+--------+----------+---------------------------+------+---------+
|2017 |Gentlemen's Singles|Federer |Čilić |[6-3, 6-1, 6-4] |1 |6-3 |
|2018 |Gentlemen's Singles|Djokovic|Anderson |[6-2, 6-2, 7-6] |1 |6-2 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|1 |7-6 |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |1 |6-7 |
|2017 |Ladies' Singles |Muguruza|V Williams|[7-5, 6-0] |1 |7-5 |
|2018 |Ladies' Singles |Kerber |S Williams|[6-3, 6-3] |1 |6-3 |
|2019 |Ladies' Singles |Halep |S Williams|[6-2, 6-2] |1 |6-2 |
|2021 |Ladies' Singles |Barty |Plíšková |[6-3, 6-7, 6-3] |1 |6-3 |
+---------------+-------------------+--------+----------+---------------------------+------+---------+
Use F.posexplode_outer()
to retain the null
values:
champions_posexplode_outer = (champions
.select("*",
F.posexplode_outer("final_score").alias("set_no", "set_score"))
# Add 1 to set number, as index starts from 0
.withColumn("set_no", F.col("set_no") + 1))
champions_posexplode_outer.show(truncate=False)
+---------------+-------------------+--------+----------+---------------------------+------+---------+
|tournament_year|event |champion|runner_up |final_score |set_no|set_score|
+---------------+-------------------+--------+----------+---------------------------+------+---------+
|2017 |Gentlemen's Singles|Federer |Čilić |[6-3, 6-1, 6-4] |1 |6-3 |
|2017 |Gentlemen's Singles|Federer |Čilić |[6-3, 6-1, 6-4] |2 |6-1 |
|2017 |Gentlemen's Singles|Federer |Čilić |[6-3, 6-1, 6-4] |3 |6-4 |
|2018 |Gentlemen's Singles|Djokovic|Anderson |[6-2, 6-2, 7-6] |1 |6-2 |
|2018 |Gentlemen's Singles|Djokovic|Anderson |[6-2, 6-2, 7-6] |2 |6-2 |
|2018 |Gentlemen's Singles|Djokovic|Anderson |[6-2, 6-2, 7-6] |3 |7-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|1 |7-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|2 |1-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|3 |7-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|4 |4-6 |
|2019 |Gentlemen's Singles|Djokovic|Federer |[7-6, 1-6, 7-6, 4-6, 13-12]|5 |13-12 |
|2020 |Gentlemen's Singles|null |null |null |null |null |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |1 |6-7 |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |2 |6-4 |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |3 |6-4 |
|2021 |Gentlemen's Singles|Djokovic|Berrettini|[6-7, 6-4, 6-4, 6-3] |4 |6-3 |
|2017 |Ladies' Singles |Muguruza|V Williams|[7-5, 6-0] |1 |7-5 |
|2017 |Ladies' Singles |Muguruza|V Williams|[7-5, 6-0] |2 |6-0 |
|2018 |Ladies' Singles |Kerber |S Williams|[6-3, 6-3] |1 |6-3 |
|2018 |Ladies' Singles |Kerber |S Williams|[6-3, 6-3] |2 |6-3 |
+---------------+-------------------+--------+----------+---------------------------+------+---------+
only showing top 20 rows
Further Resources#
Spark at the ONS Articles:
PySpark Documentation
Acknowledgements#
Thanks to Johannes Hechler for his material on PySpark arrays.