Date functions#
Handling dates is tricky in most programming languages, and Spark is no exception. This article gives examples of a few date functions, including interval, which is not well documented in PySpark. This page was written in December, so we have decided to make our examples festive! Note that this is far from exhaustive; see the links in Further Resources, which explain dates in more detail. To demonstrate the date functions we first need to start a Spark session and read in the Animal Rescue data:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.master("local[2]").appName("date-functions").getOrCreate()
rescue = (spark.read.csv("/training/animal_rescue.csv", header=True, inferSchema=True)
    .withColumnRenamed("IncidentNumber", "incident_number")
    .withColumnRenamed("DateTimeOfCall", "date_time_of_call")
    .select("incident_number", "date_time_of_call"))
rescue.limit(5).toPandas()
library(sparklyr)
library(dplyr)
default_config <- sparklyr::spark_config()
sc <- sparklyr::spark_connect(
    master = "local[2]",
    app_name = "date-functions",
    config = default_config)
rescue <- sparklyr::spark_read_csv(
    sc,
    path = "/training/animal_rescue.csv",
    header = TRUE,
    infer_schema = TRUE) %>%
    rename(
        incident_number = IncidentNumber,
        date_time_of_call = DateTimeOfCall) %>%
    select(incident_number, date_time_of_call)
rescue
incident_number date_time_of_call
0 139091 01/01/2009 03:01
1 275091 01/01/2009 08:51
2 2075091 04/01/2009 10:07
3 2872091 05/01/2009 12:27
4 3553091 06/01/2009 15:23
# Source: spark<?> [?? x 2]
incident_number date_time_of_call
<chr> <chr>
1 139091 01/01/2009 03:01
2 275091 01/01/2009 08:51
3 2075091 04/01/2009 10:07
4 2872091 05/01/2009 12:27
5 3553091 06/01/2009 15:23
6 3742091 06/01/2009 19:30
7 4011091 07/01/2009 06:29
8 4211091 07/01/2009 11:55
9 4306091 07/01/2009 13:48
10 4715091 07/01/2009 21:24
# ℹ more rows
The date_time_of_call column contains dates and times, but is actually stored as a string. From the R output we can check this: under the column name the data type is shown as <chr>, which is short for character and is used for storing string values. To get the same insight in PySpark we need to print the schema:
rescue.printSchema()
root
|-- incident_number: string (nullable = true)
|-- date_time_of_call: string (nullable = true)
Convert string to date with to_date()#
Before using any date functions we want to change the date_time_of_call column to a date type, omitting the time, using F.to_date(). This takes two arguments: the name of the column and the date format, dd/MM/yyyy in this example:
rescue = rescue.withColumn("date_of_call", F.to_date("date_time_of_call", "dd/MM/yyyy")).drop("date_time_of_call")
rescue.limit(5).toPandas()
rescue <- rescue %>% mutate(date_of_call = to_date(date_time_of_call, "dd/MM/yyyy")) %>%
select(-c(date_time_of_call))
print(rescue, n = 5)
incident_number date_of_call
0 139091 2009-01-01
1 275091 2009-01-01
2 2075091 2009-01-04
3 2872091 2009-01-05
4 3553091 2009-01-06
# Source: spark<?> [?? x 2]
incident_number date_of_call
<chr> <date>
1 139091 2009-01-01
2 275091 2009-01-01
3 2075091 2009-01-04
4 2872091 2009-01-05
5 3553091 2009-01-06
# ℹ more rows
From the R output we can confirm that the date_of_call column is now a date data type. For PySpark we must again print the schema to confirm this.
rescue.printSchema()
root
|-- incident_number: string (nullable = true)
|-- date_of_call: date (nullable = true)
Format dates with F.date_format()#
F.date_format() converts a date into a string in a given format. Here we extract the month and day as an MMM-dd string, which we will use later to identify the festive days:
rescue = rescue.withColumn("call_month_day", F.date_format("date_of_call", "MMM-dd"))
rescue.limit(5).toPandas()
rescue <- rescue %>%
mutate(call_month_day = date_format(date_of_call, "MMM-dd"))
print(rescue, n = 5)
incident_number date_of_call call_month_day
0 139091 2009-01-01 Jan-01
1 275091 2009-01-01 Jan-01
2 2075091 2009-01-04 Jan-04
3 2872091 2009-01-05 Jan-05
4 3553091 2009-01-06 Jan-06
# Source: spark<?> [?? x 3]
incident_number date_of_call call_month_day
<chr> <date> <chr>
1 139091 2009-01-01 Jan-01
2 275091 2009-01-01 Jan-01
3 2075091 2009-01-04 Jan-04
4 2872091 2009-01-05 Jan-05
5 3553091 2009-01-06 Jan-06
# ℹ more rows
If only one component of the date is required then there are specific functions: F.year(), F.month() and F.dayofmonth(); in theory these should be more efficient than F.date_format(), although less elegant to write if being combined. For sparklyr we can achieve the same result by applying year(), month() and dayofmonth() within mutate().
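For example, here is a minimal PySpark sketch of extracting the year, month and day as separate columns; the call_year, call_month and call_day names are purely illustrative and are not used elsewhere on this page:
# Extract individual date components with the dedicated functions
(rescue
    .withColumn("call_year", F.year("date_of_call"))
    .withColumn("call_month", F.month("date_of_call"))
    .withColumn("call_day", F.dayofmonth("date_of_call"))
    .limit(5)
    .toPandas())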
In this example we are aiming to analyse the incidents that occurred on festive days. There are many ways to approach this problem; ideally we want to make use of Spark's narrow transformations where possible, since they are much more efficient than wide transformations.
For PySpark, we first create a dictionary of the festive days and then filter the DataFrame on its keys. This approach is convenient because we can later pass the same dictionary to the replace() function to label each festive day.
In sparklyr we approach the same problem a little differently: we create a new column with case_when() and then filter on the matched cases.
festive_dates = {
    "Dec-24": "Christmas Eve",
    "Dec-25": "Christmas Day",
    "Dec-26": "Boxing Day",
    "Jan-01": "New Years Day",
}
festive_rescues = rescue.filter(F.col("call_month_day").isin(list(festive_dates.keys())))
festive_rescues.limit(5).toPandas()
festive_rescues <- rescue %>%
    mutate(festive_day = case_when(
        call_month_day == "Dec-24" ~ "Christmas Eve",
        call_month_day == "Dec-25" ~ "Christmas Day",
        call_month_day == "Dec-26" ~ "Boxing Day",
        call_month_day == "Jan-01" ~ "New Years Day",
        .default = NA)) %>%
    filter(!is.na(festive_day))
print(festive_rescues, n = 5 )
incident_number date_of_call call_month_day
0 139091 2009-01-01 Jan-01
1 275091 2009-01-01 Jan-01
2 229900091 2009-12-24 Dec-24
3 230352091 2009-12-25 Dec-25
4 230868091 2009-12-26 Dec-26
# Source: spark<?> [?? x 4]
incident_number date_of_call call_month_day festive_day
<chr> <date> <chr> <chr>
1 139091 2009-01-01 Jan-01 New Years Day
2 275091 2009-01-01 Jan-01 New Years Day
3 229900091 2009-12-24 Dec-24 Christmas Eve
4 230352091 2009-12-25 Dec-25 Christmas Day
5 230868091 2009-12-26 Dec-26 Boxing Day
# ℹ more rows
We can now get the number of animal rescue incidents on each festive day:
(festive_rescues
    .groupBy("date_of_call", "call_month_day")
    .count()
    .orderBy("call_month_day", "date_of_call")
    .replace(festive_dates)
    .withColumnRenamed("call_month_day", "festive_day")
    .limit(10)
    .toPandas())
festive_rescues %>%
group_by(date_of_call,festive_day) %>%
summarize(count = n()) %>%
arrange(date_of_call) %>%
print(n = 10)
date_of_call festive_day count
0 2009-12-24 Christmas Eve 1
1 2010-12-24 Christmas Eve 3
2 2011-12-24 Christmas Eve 2
3 2012-12-24 Christmas Eve 1
4 2013-12-24 Christmas Eve 1
5 2014-12-24 Christmas Eve 3
6 2016-12-24 Christmas Eve 1
7 2018-12-24 Christmas Eve 2
8 2009-12-25 Christmas Day 1
9 2010-12-25 Christmas Day 1
# Source: spark<?> [?? x 3]
# Groups: date_of_call
# Ordered by: date_of_call
date_of_call festive_day count
<date> <chr> <dbl>
1 2009-01-01 New Years Day 2
2 2009-12-24 Christmas Eve 1
3 2009-12-25 Christmas Day 1
4 2009-12-26 Boxing Day 2
5 2010-01-01 New Years Day 2
6 2010-12-24 Christmas Eve 3
7 2010-12-25 Christmas Day 1
8 2010-12-26 Boxing Day 3
9 2011-01-01 New Years Day 2
10 2011-12-24 Christmas Eve 2
# ℹ more rows
Add and subtract dates with interval#
The interval function adds days, months and years to a date, or subtracts them from it. Despite being a very useful function it is unfortunately poorly documented within PySpark. interval is not contained in the PySpark functions module and needs to be wrapped in F.expr(). The most common format is:
F.expr("column_name + interval y years m months d days")
We can apply the same expression in sparklyr by leveraging the sql() function.
The PySpark functions module does contain F.date_add(), F.date_sub() and F.add_months(). These functions are not the easiest to remember as they are inconsistently named, and there is no function for adding years; instead you can pass the number of years multiplied by 12 to F.add_months(). interval is easier to use and a better choice in most circumstances.
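As a rough sketch of the built-in alternative, adding two years to the date of call would look like this; the two_years_later column is purely illustrative and is not used below:
# There is no F.add_years(), so pass the number of years multiplied by 12 to F.add_months()
rescue.withColumn("two_years_later", F.add_months("date_of_call", 2 * 12)).limit(5).toPandas()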
For instance, if we assume that a report has to be produced for each animal rescue incident within a year and three months, then we can use interval 1 year 3 months. The singular year is used here rather than years; both work correctly regardless of the integer value used.
festive_rescues.withColumn("report_date", F.expr("date_of_call + interval 1 year 3 months")).limit(5).toPandas()
rescue %>%
mutate(report_date = sql("date_of_call + interval 1 year 3 months")) %>%
print(n = 5)
incident_number date_of_call call_month_day report_date
0 139091 2009-01-01 Jan-01 2010-04-01
1 275091 2009-01-01 Jan-01 2010-04-01
2 229900091 2009-12-24 Dec-24 2011-03-24
3 230352091 2009-12-25 Dec-25 2011-03-25
4 230868091 2009-12-26 Dec-26 2011-03-26
# Source: spark<?> [?? x 4]
incident_number date_of_call call_month_day report_date
<chr> <date> <chr> <date>
1 139091 2009-01-01 Jan-01 2010-04-01
2 275091 2009-01-01 Jan-01 2010-04-01
3 2075091 2009-01-04 Jan-04 2010-04-04
4 2872091 2009-01-05 Jan-05 2010-04-05
5 3553091 2009-01-06 Jan-06 2010-04-06
# ℹ more rows
interval can also be subtracted from a date (or, equivalently, used with negative numbers). As a festive example, in some Orthodox churches in central and eastern Europe Christmas is celebrated 13 days later than in Britain, with Christmas Day on the 7th of January. To calculate how many animals were rescued on the Orthodox festive days, rather than redefining the festive dates themselves we can subtract interval 13 days from the date of call and then match the adjusted dates against the festive days. We do not need to specify months or years as we are only using days. Boxing Day is also filtered out, as it is not widely celebrated outside of the Commonwealth.
rescue = rescue.withColumn("adjusted_date", F.expr("date_of_call - interval 13 days"))
rescue.limit(5).toPandas()
rescue <- rescue %>%
mutate(adjusted_date = sql("date_of_call - interval 13 days"))
print(rescue, n = 5 )
incident_number date_of_call call_month_day adjusted_date
0 139091 2009-01-01 Jan-01 2008-12-19
1 275091 2009-01-01 Jan-01 2008-12-19
2 2075091 2009-01-04 Jan-04 2008-12-22
3 2872091 2009-01-05 Jan-05 2008-12-23
4 3553091 2009-01-06 Jan-06 2008-12-24
# Source: spark<?> [?? x 4]
incident_number date_of_call call_month_day adjusted_date
<chr> <date> <chr> <date>
1 139091 2009-01-01 Jan-01 2008-12-19
2 275091 2009-01-01 Jan-01 2008-12-19
3 2075091 2009-01-04 Jan-04 2008-12-22
4 2872091 2009-01-05 Jan-05 2008-12-23
5 3553091 2009-01-06 Jan-06 2008-12-24
# ℹ more rows
orthodox_festive_days = {
    "Dec-24": "Christmas Eve",
    "Dec-25": "Christmas Day",
    "Jan-01": "New Years Day"
}
orthodox_festive_rescues = (rescue
    .withColumn("adjusted_month_day", F.date_format("adjusted_date", "MMM-dd"))
    .filter(F.col("adjusted_month_day").isin(list(orthodox_festive_days.keys())))
    .replace(orthodox_festive_days)
    .selectExpr("incident_number", "date_of_call", "adjusted_month_day as orthodox_festive_day"))
orthodox_festive_rescues.limit(5).toPandas()
orthodox_festive_rescues <- rescue %>%
    mutate(adjusted_month_day = date_format(adjusted_date, "MMM-dd")) %>%
    mutate(orthodox_festive_day = case_when(
        adjusted_month_day == "Dec-24" ~ "Christmas Eve",
        adjusted_month_day == "Dec-25" ~ "Christmas Day",
        adjusted_month_day == "Jan-01" ~ "New Years Day",
        .default = NA)) %>%
    filter(!is.na(orthodox_festive_day))
print(orthodox_festive_rescues, n = 5)
incident_number date_of_call orthodox_festive_day
0 3553091 2009-01-06 Christmas Eve
1 3742091 2009-01-06 Christmas Eve
2 4011091 2009-01-07 Christmas Day
3 4211091 2009-01-07 Christmas Day
4 4306091 2009-01-07 Christmas Day
# Source: spark<?> [?? x 6]
incident_number date_of_call call_month_day adjusted_date adjusted_month_day
<chr> <date> <chr> <date> <chr>
1 3553091 2009-01-06 Jan-06 2008-12-24 Dec-24
2 3742091 2009-01-06 Jan-06 2008-12-24 Dec-24
3 4011091 2009-01-07 Jan-07 2008-12-25 Dec-25
4 4211091 2009-01-07 Jan-07 2008-12-25 Dec-25
5 4306091 2009-01-07 Jan-07 2008-12-25 Dec-25
# ℹ more rows
# ℹ 1 more variable: orthodox_festive_day <chr>
As before, we can get the counts:
(orthodox_festive_rescues
.groupBy("date_of_call", "orthodox_festive_day")
.count()
.orderBy("date_of_call")
.limit(10)
.toPandas())
orthodox_festive_rescues %>%
group_by(date_of_call,orthodox_festive_day) %>%
summarize(count = n()) %>%
arrange(date_of_call) %>%
print(n = 10)
date_of_call orthodox_festive_day count
0 2009-01-06 Christmas Eve 2
1 2009-01-07 Christmas Day 4
2 2009-01-14 New Years Day 1
3 2010-01-06 Christmas Eve 2
4 2010-01-07 Christmas Day 2
5 2010-01-14 New Years Day 2
6 2011-01-06 Christmas Eve 1
7 2012-01-06 Christmas Eve 1
8 2012-01-07 Christmas Day 2
9 2012-01-14 New Years Day 1
# Source: spark<?> [?? x 3]
# Groups: date_of_call
# Ordered by: date_of_call
date_of_call orthodox_festive_day count
<date> <chr> <dbl>
1 2009-01-06 Christmas Eve 2
2 2009-01-07 Christmas Day 4
3 2009-01-14 New Years Day 1
4 2010-01-06 Christmas Eve 2
5 2010-01-07 Christmas Day 2
6 2010-01-14 New Years Day 2
7 2011-01-06 Christmas Eve 1
8 2012-01-06 Christmas Eve 1
9 2012-01-07 Christmas Day 2
10 2012-01-14 New Years Day 1
# ℹ more rows
interval with times and weeks#
Weeks can also be used with interval; note that there is no native PySpark function for adding weeks. Time periods, e.g. hours, can also be used with interval, which may be useful when working with timestamps. As with any function, it is recommended to write unit tests when working with dates; the Pytest for PySpark repository gives some examples of how to unit test PySpark code.
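As a minimal sketch, using a small hypothetical DataFrame created just for illustration, weeks can be added to a date and hours subtracted from a timestamp in the same way:
# Hypothetical DataFrame with a date string and a timestamp string
example = spark.createDataFrame(
    [("2009-12-25", "2009-12-25 03:01:00")],
    ["date_string", "timestamp_string"])
example = (example
    .withColumn("date_col", F.to_date("date_string"))
    .withColumn("timestamp_col", F.to_timestamp("timestamp_string"))
    # interval accepts weeks, even though there is no dedicated PySpark function for them
    .withColumn("two_weeks_later", F.expr("date_col + interval 2 weeks"))
    # time periods such as hours also work, which is useful with timestamps
    .withColumn("six_hours_earlier", F.expr("timestamp_col - interval 6 hours")))
example.toPandas()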
Further resources#
Documentation:#
PySpark#
sparklyr#
Other links:#
Yammer discussion of interval: Diogo Marques gives an explanation and examples of interval
Java SimpleDateFormat class: details of valid date formats for F.to_date() and F.date_format()
Acknowledgments#
Special thanks to Diogo Marques for sharing his knowledge of the interval function, and to Vicky Pickering for inspiring this festive tip!