Naming Conflicts in Module Imports#
Importing modules in Python and R can lead to naming conflicts if a function with that name already exists. This article demonstrates why you should be careful when importing modules to ensure that these conflicts do not occur.
A common example in Python is using from pyspark.sql.functions import *
, which will overwrite some built-in Python functions (e.g. sum()
). Instead, it is good practice to use from pyspark.sql import functions as F
, where you prefix the functions with F
, e.g. F.sum()
.
Naming variables#
When writing code, it is important to give your variables sensible names, that are informative but not too long. A good reference on this is the Clean Code section from QA of Code for Analysis and Research. You should avoid using the names of existing built in functions for user-defined variables.
Keywords#
Some words are reserved: for instance, in Python you cannot have a variable called def
, False
or lambda
. These are referred to as keywords and the code will not even compile if you try, raising a SyntaxError
. You can generate a list of these with keyword.kwlist
.
In R, use ?reserved
to get a list of the reserved words.
import keyword
print(keyword.kwlist)
?reserved
['False', 'None', 'True', 'and', 'as', 'assert', 'break', 'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'finally', 'for', 'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or', 'pass', 'raise', 'return', 'try', 'while', 'with', 'yield']
Reserved package:base R Documentation
_R_e_s_e_r_v_e_d _W_o_r_d_s _i_n _R
_D_e_s_c_r_i_p_t_i_o_n:
The reserved words in R's parser are
‘if’ ‘else’ ‘repeat’ ‘while’ ‘function’ ‘for’ ‘in’ ‘next’ ‘break’
‘TRUE’ ‘FALSE’ ‘NULL’ ‘Inf’ ‘NaN’ ‘NA’ ‘NA_integer_’ ‘NA_real_’
‘NA_complex_’ ‘NA_character_’
‘...’ and ‘..1’, ‘..2’ etc, which are used to refer to arguments
passed down from a calling function, see ‘...’.
_D_e_t_a_i_l_s:
Reserved words outside quotes are always parsed to be references
to the objects linked to in the ‘Description’, and hence they are
not allowed as syntactic names (see ‘make.names’). They *are*
allowed as non-syntactic names, e.g. inside backtick quotes.
Built in functions and module imports in Python#
Python Example
You might notice that the Python keyword list is quite short and that some common Python functionality is not listed, for instance, sum()
or round()
. This means that it is possible to overwrite these; obviously this is not good practice and should be avoided.
This can be surprisingly easy to do in PySpark, and can be hard to debug if you do not know the reason for the error.
Python Example#
First, look at the documentation for sum
:
help(sum)
Help on built-in function sum in module builtins:
sum(iterable, start=0, /)
Return the sum of a 'start' value (default: 0) plus an iterable of numbers
When the iterable is empty, return the start value.
This function is intended specifically for use with numeric values and may
reject non-numeric types.
Show that sum
works with a simple example: adding three integers together:
sum([1, 2, 3])
6
Now import the modules we need to use Spark. The recommended way to do this is import pyspark.sql.functions as F
, which means that whenever you want to access a function from this module you prefix it with F
, e.g. F.sum()
. Sometimes the best way to see why something is recommended is to try a different method and show it is a bad idea, in this case, importing all the functions
as *
:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
Attempting to sum the integers will now give an error:
try:
sum([1, 2, 3])
except AttributeError as e:
print(e)
'NoneType' object has no attribute '_jvm'
To see why this error exists, take another look at help(sum)
; we can see that the documentation is different to previously.
help(sum)
Help on function sum in module pyspark.sql.functions:
sum(col)
Aggregate function: returns the sum of all values in the expression.
.. versionadded:: 1.3
So by importing all the PySpark functions we have overwritten some key Python functionality. Note that this would also apply if you imported individual functions, e.g. from pyspark.sql.functions import sum
.
You can also overwrite functions with your own variables, often unintentionally. As an example, first Start a Spark session:
spark = (SparkSession.builder.master("local[2]")
.appName("module-imports")
.getOrCreate())
Create a small DataFrame:
sdf = spark.range(5).withColumn("double_id", col("id") * 2)
sdf.show()
+---+---------+
| id|double_id|
+---+---------+
| 0| 0|
| 1| 2|
| 2| 4|
| 3| 6|
| 4| 8|
+---+---------+
Loop through the columns, using col
as the control variable. This will work, but is not a good idea as it is overwriting col()
from functions
:
for col in sdf.columns:
sdf.select(col).show()
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
+---------+
|double_id|
+---------+
| 0|
| 2|
| 4|
| 6|
| 8|
+---------+
If we try adding another column with col()
then it will not work as we have now reassigned col
to be double_id
:
try:
sdf = sdf.withColumn("triple_id", col("id") * 3)
except TypeError as e:
print(e)
'str' object is not callable
col
'double_id'
Importing the PySpark functions
as F
and using F.col()
solves this problem:
from pyspark.sql import functions as F
sdf = sdf.withColumn("triple_id", F.col("id") * 3)
sdf.show()
+---+---------+---------+
| id|double_id|triple_id|
+---+---------+---------+
| 0| 0| 0|
| 1| 2| 3|
| 2| 4| 6|
| 3| 6| 9|
| 4| 8| 12|
+---+---------+---------+
Built in functions and package imports in R#
R Example
It is advised to use ::
to directly call a function from a package. For instance, there is a filter
function in both stats
and dplyr
; you can specify exactly which to use with dplyr::filter()
or stats::filter()
.
Note that despite being commonly used for an R DataFrame, df
is actually a built-in function for the F distribution. As such, it is not recommended to use df
for DataFrames.
?df
FDist package:stats R Documentation
_T_h_e _F _D_i_s_t_r_i_b_u_t_i_o_n
_D_e_s_c_r_i_p_t_i_o_n:
Density, distribution function, quantile function and random
generation for the F distribution with ‘df1’ and ‘df2’ degrees of
freedom (and optional non-centrality parameter ‘ncp’).
_U_s_a_g_e:
df(x, df1, df2, ncp, log = FALSE)
pf(q, df1, df2, ncp, lower.tail = TRUE, log.p = FALSE)
qf(p, df1, df2, ncp, lower.tail = TRUE, log.p = FALSE)
rf(n, df1, df2, ncp)
_A_r_g_u_m_e_n_t_s:
x, q: vector of quantiles.
p: vector of probabilities.
n: number of observations. If ‘length(n) > 1’, the length is
taken to be the number required.
df1, df2: degrees of freedom. ‘Inf’ is allowed.
ncp: non-centrality parameter. If omitted the central F is
assumed.
log, log.p: logical; if TRUE, probabilities p are given as log(p).
lower.tail: logical; if TRUE (default), probabilities are P[X <= x],
otherwise, P[X > x].
_D_e_t_a_i_l_s:
The F distribution with ‘df1 =’ n1 and ‘df2 =’ n2 degrees of
freedom has density
f(x) = Gamma((n1 + n2)/2) / (Gamma(n1/2) Gamma(n2/2))
(n1/n2)^(n1/2) x^(n1/2 - 1)
(1 + (n1/n2) x)^-(n1 + n2)/2
for x > 0.
It is the distribution of the ratio of the mean squares of n1 and
n2 independent standard normals, and hence of the ratio of two
independent chi-squared variates each divided by its degrees of
freedom. Since the ratio of a normal and the root mean-square of
m independent normals has a Student's t_m distribution, the square
of a t_m variate has a F distribution on 1 and m degrees of
freedom.
The non-central F distribution is again the ratio of mean squares
of independent normals of unit variance, but those in the
numerator are allowed to have non-zero means and ‘ncp’ is the sum
of squares of the means. See Chisquare for further details on
non-central distributions.
_V_a_l_u_e:
‘df’ gives the density, ‘pf’ gives the distribution function ‘qf’
gives the quantile function, and ‘rf’ generates random deviates.
Invalid arguments will result in return value ‘NaN’, with a
warning.
The length of the result is determined by ‘n’ for ‘rf’, and is the
maximum of the lengths of the numerical arguments for the other
functions.
The numerical arguments other than ‘n’ are recycled to the length
of the result. Only the first elements of the logical arguments
are used.
_N_o_t_e:
Supplying ‘ncp = 0’ uses the algorithm for the non-central
distribution, which is not the same algorithm used if ‘ncp’ is
omitted. This is to give consistent behaviour in extreme cases
with values of ‘ncp’ very near zero.
The code for non-zero ‘ncp’ is principally intended to be used for
moderate values of ‘ncp’: it will not be highly accurate,
especially in the tails, for large values.
_S_o_u_r_c_e:
For the central case of ‘df’, computed _via_ a binomial
probability, code contributed by Catherine Loader (see ‘dbinom’);
for the non-central case computed _via_ ‘dbeta’, code contributed
by Peter Ruckdeschel.
For ‘pf’, _via_ ‘pbeta’ (or for large ‘df2’, _via_ ‘pchisq’).
For ‘qf’, _via_ ‘qchisq’ for large ‘df2’, else _via_ ‘qbeta’.
_R_e_f_e_r_e_n_c_e_s:
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
Language_. Wadsworth & Brooks/Cole.
Johnson, N. L., Kotz, S. and Balakrishnan, N. (1995) _Continuous
Univariate Distributions_, volume 2, chapters 27 and 30. Wiley,
New York.
_S_e_e _A_l_s_o:
Distributions for other standard distributions, including ‘dchisq’
for chi-squared and ‘dt’ for Student's t distributions.
_E_x_a_m_p_l_e_s:
## Equivalence of pt(.,nu) with pf(.^2, 1,nu):
x <- seq(0.001, 5, len = 100)
nu <- 4
stopifnot(all.equal(2*pt(x,nu) - 1, pf(x^2, 1,nu)),
## upper tails:
all.equal(2*pt(x, nu, lower=FALSE),
pf(x^2, 1,nu, lower=FALSE)))
## the density of the square of a t_m is 2*dt(x, m)/(2*x)
# check this is the same as the density of F_{1,m}
all.equal(df(x^2, 1, 5), dt(x, 5)/x)
## Identity: qf(2*p - 1, 1, df) == qt(p, df)^2 for p >= 1/2
p <- seq(1/2, .99, length = 50); df <- 10
rel.err <- function(x, y) ifelse(x == y, 0, abs(x-y)/mean(abs(c(x,y))))
quantile(rel.err(qf(2*p - 1, df1 = 1, df2 = df), qt(p, df)^2), .90) # ~= 7e-9
Further Resources#
PySpark Documentation:
sparklyr and tidyverse Documentation:
Python Documentation:
R Documentation: