Synthpop: Synthetic Data in R#
1. Introduction#
Synthetic data consists of artificially generated records that mimic the structure and statistical properties of real-world data, without containing any actual personal or sensitive information. It is widely used for:
Testing and validating data pipelines and statistical models
Protecting privacy and confidentiality in data sharing
Simulating rare or edge-case scenarios
Teaching and training with realistic, non-disclosive datasets
Developing and prototyping systems in non-secure environments
Learning Objectives:
By the end of this section, you will be able to:
Understand the concept and importance of synthetic data
Recognise key use cases for synthetic data in official statistics and research
Learn two main approaches for generating synthetic data in R:
Writing your own code
Using the
synthpop
R package
2. Generating synthetic data in R from scratch#
# Load the R packages
rm(list = ls())
library(dplyr)
library(readr)
library(janitor)
library(magrittr)
2.1 Numerical variables#
In this section, we will generate synthetic numerical variables using different distributions in R. We will use the following distributions:
Uniform Distribution
Normal Distribution
Binomial Distribution
Poisson Distribution
Below are the R codes to generate these variables.
Note that in the interest of reproducible results, we can use the set.seed(n)
function.
set.seed(12345)
1. Uniform Distribution: The uniform distribution generates data where each value within a specified range is equally likely. This is useful for for simulating random allocation of survey respondents to experiemental groups, where each group has an equal chance of selection.
runif()
generates data where each value within the specified range is equally likely. We can visualise this data using the following example:
uniform_dist <- runif(n = 1000,
min = 0,
max = 1)
hist(uniform_dist, main="Uniform Distribution Example", xlab="Value")
summary(uniform_dist)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.001137 0.268837 0.521491 0.514048 0.756667 0.996996
2. Normal Distribution: The normal distribution, or Gaussian distribution, is useful for simulating data that clusters around a mean. This is common in many natural phenomena like modelling adult heights or test scores, which tend to cluster around a mean in population health or education statistics.
rnorm()
generates data that clusters around a specified mean, with a given standard deviation. We explore the generated data using the hist()
and summary()
functions.
normal_dist <- rnorm(n = 1000,
mean = 50,
sd = 4)
hist(normal_dist, main="Normal Distribution Example", xlab="Value")
summary(normal_dist)
Min. 1st Qu. Median Mean 3rd Qu. Max.
36.66 47.27 49.92 49.93 52.73 63.32
3. Poisson Distribution: The Poisson distribution is used for count data, where you are counting the number of events in a fixed interval of time or space. Typical examples is simulating the number of visits to a government website per day or modelling the number of emergency service calls.
rpois()
generates count data, useful for modeling the number of events in a fixed interval.
poisson_dist <- rpois(n = 1000,
lambda = 4)
hist(poisson_dist, main="Poisson Distribution Example", xlab="Number of Events")
summary(poisson_dist)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.000 4.000 3.921 5.000 12.000
4. Binomial Distribution: The binomial distribution is useful for simulating the number of successes in a fixed number of trials, each with the same probability of success. This is suitable for estimating the number of individuals in a survey who respond “yes” to a yes/no question, such as employment status or vaccine uptake.
rbinom()
generates data representing the number of successes in a fixed number of trials, each with the same probability of success.
binomial_dist <- rbinom(n = 1000,
size = 20,
prob = 0.2)
hist(binomial_dist, main="Binomial Distribution Example", xlab="Number of Successes")
summary(binomial_dist)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 3.000 4.000 3.939 5.000 10.000
2.2 Character/factor variables#
In this section, we will generate synthetic character (factor)
variables using random sampling in R. These variables are useful for simulating categorical data, such as demographic information.
1. Random sampling from a vector: Random sampling from a vector allows you to generate categorical data by randomly selecting elements from a specified set. This is useful for creating variables like gender, where each observation is randomly assigned a category.
gender <- sample(x = c("M", "F"),
size = 1000,
replace = TRUE)
table(gender)
prop.table(table(gender))
gender
F M
502 498
gender
F M
0.502 0.498
2. Weighted sampling from a vector: Weighted sampling allows you to generate categorical data with specified probabilities for each category. This is useful for creating variables like marital status, where each category has a different likelihood of being selected.
marriage_status <- sample(x = c("Single", "Married", "Divorced", "Widowed"),
size = 1000,
replace = TRUE,
prob = c(0.35, 0.50, 0.10, 0.05))
table(marriage_status)
prop.table(table(marriage_status))
marriage_status
Divorced Married Single Widowed
105 518 329 48
marriage_status
Divorced Married Single Widowed
0.105 0.518 0.329 0.048
Here is the explanation of the code snippet above:
Setting a Seed (
set.seed
):
set.seed(12345)
: Sets the seed for random number generation. The number12345
can be any integer. Using the same seed will produce the same sequence of random numbers each time the code is run.
Random Sampling (
sample
):
x
: The vector of elements to choose from.size
: The number of observations to generate.replace
: Whether sampling is with replacement (TRUE
) or without replacement (FALSE
).Further reading can be found here: sampling.
Weighted Sampling (
sample
withprob
):
prob
: A vector of probabilities corresponding to the likelihood of each element inx
being selected. The probabilities must sum to1
.Further reading can be found here: sampling.
2.3 Combining data into a synthetic dataset#
Now, we will combine the numerical and categorical data into a single synthetic dataset using the data.table
package:
library(data.table)
synthetic_data1 <- data.table(uniform_dist,
normal_dist,
poisson_dist,
binomial_dist,
gender,
marriage_status)
print(head(synthetic_data1))
summary(synthetic_data1)
str(synthetic_data1)
uniform_dist normal_dist poisson_dist binomial_dist gender marriage_status
<num> <num> <int> <int> <char> <char>
1: 0.7209039 44.31870 3 3 M Single
2: 0.8757732 40.13225 6 4 M Married
3: 0.7609823 51.93886 3 6 M Married
4: 0.8861246 46.24811 5 3 F Married
5: 0.4564810 63.32293 5 3 M Married
6: 0.1663718 49.34822 5 6 F Divorced
uniform_dist normal_dist poisson_dist binomial_dist
Min. :0.001137 Min. :36.66 Min. : 0.000 Min. : 0.000
1st Qu.:0.268837 1st Qu.:47.27 1st Qu.: 2.000 1st Qu.: 3.000
Median :0.521491 Median :49.92 Median : 4.000 Median : 4.000
Mean :0.514048 Mean :49.93 Mean : 3.921 Mean : 3.939
3rd Qu.:0.756667 3rd Qu.:52.73 3rd Qu.: 5.000 3rd Qu.: 5.000
Max. :0.996996 Max. :63.32 Max. :12.000 Max. :10.000
gender marriage_status
Length:1000 Length:1000
Class :character Class :character
Mode :character Mode :character
Classes 'data.table' and 'data.frame': 1000 obs. of 6 variables:
$ uniform_dist : num 0.721 0.876 0.761 0.886 0.456 ...
$ normal_dist : num 44.3 40.1 51.9 46.2 63.3 ...
$ poisson_dist : int 3 6 3 5 5 5 7 2 2 6 ...
$ binomial_dist : int 3 4 6 3 3 6 6 4 7 6 ...
$ gender : chr "M" "M" "M" "F" ...
$ marriage_status: chr "Single" "Married" "Married" "Married" ...
- attr(*, ".internal.selfref")=<externalptr>
Converting Character Variables to Factors
To ensure that the categorical variables are treated appropriately in analyses, we should convert them to factor variables:
# Change gender and marriage_status to factor variables
synthetic_data1$gender <- as.factor(synthetic_data1$gender)
synthetic_data1$marriage_status <- as.factor(synthetic_data1$marriage_status)
# Check the structure of the dataset again
str(synthetic_data1)
Classes 'data.table' and 'data.frame': 1000 obs. of 6 variables:
$ uniform_dist : num 0.721 0.876 0.761 0.886 0.456 ...
$ normal_dist : num 44.3 40.1 51.9 46.2 63.3 ...
$ poisson_dist : int 3 6 3 5 5 5 7 2 2 6 ...
$ binomial_dist : int 3 4 6 3 3 6 6 4 7 6 ...
$ gender : Factor w/ 2 levels "F","M": 2 2 2 1 2 1 2 2 2 1 ...
$ marriage_status: Factor w/ 4 levels "Divorced","Married",..: 3 2 2 2 2 1 2 4 1 2 ...
- attr(*, ".internal.selfref")=<externalptr>
This combined synthetic dataset can now be used for various analyses, simulations, and testing purposes. It provides a comprehensive representation of both numerical and categorical variables, making it suitable for a wide range of applications.
These snippets provide a starting point for generating synthetic categorical data using random and weighted sampling with reproducibility. You can adjust the parameters to fit the specific characteristics of the population you are modeling. For more detailed information on the sample
function, refer to the R documentation.
3. Generating synthetic data using synthpop
#
In the guidiance, we will work with the R package synthpop
(Nowok, Raab, and Dibben 2016), which is one of the most advanced and dedicated packages in R to create synthetic data.
Synthpop generates synthetic data by modeling the relationships in the original dataset and then simulating new data that preserves the statistical properties of the original, while reducing disclosure risk.
3.1 Setting up synthpop
and required packages#
install.packages("synthpop")
install.packages("sparklyr")
This will install synthpop
and its dependencies and sparklyr
from ONS Artifactory.
First clear the workspace and load packages using the library()
function.
rm(list = ls())
library(synthpop)
To get a list of all synthpop
functions, use:
help(package = synthpop)
To quickly access a help file for a specific function, such as the main synthpop
function syn()
, you can type its name preceded by ?
.
?synthpop
?syn
Case 1: Synthesise synthpop
sample data SD2011
#
In most cases, you will be working with your own data, but for practice, you can use the sample data SD2011
provided with the synthpop
package.
Read the data
Read the data you want to synthesise into R
. You can use the synthpop
function read.obs()
to read data from other formats.
NOTE: synthpop
is an R package that generates synthetic data within R. While synthpop
itself does not generate data directly in Spark, the synthetic datasets it produces can be easily transferred to Spark using sparklyr
for scalable post-processing and analysis. Thus, Spark is used for processing and analysing the synthetic data, not for the initial data generation with synthpop
.
Examine your data
Start with a modest number of variables (8-12) to understand synthpop
. If your data have more variables, make a selection. The package is intended for large datasets (at least 500
observations).
Use the codebook.syn()
function to examine the features relevant to synthesising.
In this section, we will guide you through a sample synthesis using the synthpop
package in R. This example uses the SD2011
dataset sample, provided with the synthpop
package and demonstrates how to generate synthetic data with a smaller subset of variables. The SD2011 dataset is a social diagnosis survey from 2011, Poland
In this example, we will load the SD2011 dataset, select a subset of variables, generate a synthetic version using synthpop, and compare the synthetic data to the original.
Step-by-step guide#
Explore the
SD2011
dataset
Here, we will use thehelp()
method to obtain the information about the SD2011,dim()
method to get the size of the dataframe andcodebook.syn()
method to obtain the summary information about variables.
Note: SD2011 is included with the synthpop package. If you cannot find it, load it with data(SD2011). This example was tested in RStudio/VS Code/CDP with synthpop installed.
help(SD2011) # this will give you information about it
# This will give you the dimensions (rows, columns) of the SD2011 dataset
dim(SD2011) # Output: 5000 rows, 35 variables
[1] 5000 35
# Show summary information for a subset of variables; here we show only the first five variables for brevity.
synthpop::codebook.syn(SD2011[, 1:5])$tab
variable class nmiss perctmiss ndistinct details
1 sex factor 0 0.00 2 'MALE' 'FEMALE'
2 age numeric 0 0.00 79 Range: 16 - 97
3 agegr factor 4 0.08 6 See table in labs
4 placesize factor 0 0.00 6 See table in labs
5 region factor 0 0.00 16 See table in labs
Select a subset of variables
After loading the data, inspect it to understand its structure and contents.
You can review a
summary()
of all variables to generate descriptive statistics.
# Select a smaller subset of variables for demonstration
mydata <- SD2011[, c(1, 2, 3, 6, 8, 10, 11)]
# Preview the first few rows of the dataset
head(mydata)
# Get a summary of the dataset
summary(mydata)
sex age agegr edu socprof income marital
1 FEMALE 57 45-59 VOCATIONAL/GRAMMAR RETIRED 800 MARRIED
2 MALE 20 16-24 VOCATIONAL/GRAMMAR PUPIL OR STUDENT 350 SINGLE
3 FEMALE 18 16-24 VOCATIONAL/GRAMMAR PUPIL OR STUDENT NA SINGLE
4 FEMALE 78 65+ PRIMARY/NO EDUCATION RETIRED 900 WIDOWED
5 FEMALE 54 45-59 VOCATIONAL/GRAMMAR SELF-EMPLOYED 1500 MARRIED
6 MALE 20 16-24 SECONDARY PUPIL OR STUDENT -8 SINGLE
sex age agegr edu
MALE :2182 Min. :16.00 16-24: 702 PRIMARY/NO EDUCATION : 962
FEMALE:2818 1st Qu.:32.00 25-34: 726 VOCATIONAL/GRAMMAR :1613
Median :49.00 35-44: 748 SECONDARY :1482
Mean :47.68 45-59:1361 POST-SECONDARY OR HIGHER: 936
3rd Qu.:61.00 60-64: 516 NA's : 7
Max. :97.00 65+ : 943
NA's : 4
socprof income marital
RETIRED :1241 Min. : -8 SINGLE :1253
EMPLOYED IN PRIVATE SECTOR : 994 1st Qu.: 700 MARRIED :2979
EMPLOYED IN PUBLIC SECTOR : 600 Median : 1200 WIDOWED : 531
PUPIL OR STUDENT : 548 Mean : 1411 DIVORCED : 199
OTHER ECONOMICALLY INACTIVE: 444 3rd Qu.: 1800 LEGALLY SEPARATED : 7
(Other) :1140 Max. :16000 DE FACTO SEPARATED: 22
NA's : 33 NA's :683 NA's : 9
Checking for missing values The following code checks for negative and missing values in the income variable.
# Check for negative income values
table(mydata$income[mydata$income < 0], useNA = "ifany")
-8 <NA>
603 683
Synthesise data
Use the
syn()
function from thesynthpop
package to create a synthetic version of your data, using the default settings.Fix the random seed before generating the synthetic data, so that you can compare your output with our output.
set.seed(123)
# Synthesise data, handling -8 as a missing value for income
mysyn <- synthpop::syn(mydata, cont.na = list(income = -8))
Synthesis
-----------
sex age agegr edu socprof income marital
We have now created the object mysyn
, which is of class synds
and contains several pieces of information related to the synthetic data. We will quickly go over the most important output. We can examine the synthesised data using the summary and head methods.
summary(mysyn)
Synthetic object with one synthesis using methods:
sex age agegr edu socprof income marital
"sample" "cart" "cart" "cart" "cart" "cart" "cart"
sex age agegr edu
MALE :2188 Min. :16.00 16-24: 685 PRIMARY/NO EDUCATION : 958
FEMALE:2812 1st Qu.:32.00 25-34: 745 VOCATIONAL/GRAMMAR :1612
Median :49.00 35-44: 757 SECONDARY :1464
Mean :47.62 45-59:1381 POST-SECONDARY OR HIGHER: 964
3rd Qu.:61.00 60-64: 500 NA's : 2
Max. :97.00 65+ : 930
NA's : 2
socprof income marital
RETIRED :1188 Min. : -8 SINGLE :1254
EMPLOYED IN PRIVATE SECTOR :1004 1st Qu.: 700 MARRIED :3022
EMPLOYED IN PUBLIC SECTOR : 600 Median : 1200 WIDOWED : 504
PUPIL OR STUDENT : 528 Mean : 1419 DIVORCED : 183
OTHER ECONOMICALLY INACTIVE: 433 3rd Qu.: 1800 LEGALLY SEPARATED : 10
(Other) :1213 Max. :15000 DE FACTO SEPARATED: 16
NA's : 34 NA's :693 NA's : 11
head(mysyn$syn)
sex age agegr edu socprof income
1 FEMALE 48 45-59 PRIMARY/NO EDUCATION EMPLOYED IN PRIVATE SECTOR 1000
2 FEMALE 57 45-59 PRIMARY/NO EDUCATION UNEMPLOYED -8
3 MALE 68 65+ PRIMARY/NO EDUCATION RETIRED 1200
4 MALE 25 25-34 SECONDARY OTHER ECONOMICALLY INACTIVE NA
5 FEMALE 64 60-64 POST-SECONDARY OR HIGHER RETIRED 1200
6 MALE 73 65+ SECONDARY RETIRED 1300
marital
1 MARRIED
2 MARRIED
3 MARRIED
4 SINGLE
5 MARRIED
6 MARRIED
The compare()
function below produces plots and statistics that help you assess how closely the synthetic data matches the original data for each variable:
The bar plots show the distribution of counts for each value/category, with a legend distinguishing “Observed” (original data) and “Synthetic” (synthetic data).
If the bars for observed and synthetic data are similar in height for each value, this indicates that the synthetic data replicates the original distribution well.
The S_pMSE (Standardized Propensity Mean Squared Error) value, shown on each plot, quantifies the similarity between the observed and synthetic distributions.
A lower S_pMSE value (closer to 0) means the synthetic data closely matches the original data for that variable.
A higher S_pMSE value (>> 1) means there are greater differences between the synthetic and original data.
In summary, look for S_pMSE values near 0 and similar bar heights to conclude that the synthetic data replicates the original data well for a given variable. You can also get the S_pMSE values from the print(mysyn)
method
# Compare the synthetic data with the original data. In most cases, you would need to press "Press return for next variable(s)"
synthpop::compare(mysyn, mydata, stat = "counts")
Calculations done for sex
Calculations done for age
Calculations done for agegr
Calculations done for edu
Calculations done for socprof
Calculations done for income
Calculations done for marital
Comparing counts observed with synthetic
Press return for next variable(s):
Selected utility measures:
pMSE S_pMSE df
sex 0.000000 0.029265 1
age 0.000028 0.554077 4
agegr 0.000042 0.554113 6
edu 0.000083 1.654516 4
socprof 0.000143 1.274998 9
income 0.000104 1.391121 6
marital 0.000084 1.119933 6
This figure shows examples of visuals from using the multi.compare()
function:

Fig. 46 Observed_versus_synthetic_data#
Export synthetic data
# Export synthetic data to CSV format
write.syn(mysyn, filename = "mysyn_SD2011", filetype = "csv")
Synthetic data exported as csv file(s).
Information on synthetic data written to
D:/dapcats_guidance/ons-spark/synthesis_info_mysyn_SD2011.txt
Explore synthetic data
After generating synthetic data using thesynthpop
package, it is important to explore and understand the structure and components of the synthetic data object. This section provides guidance on how to explore the synthetic data object and perform additional comparisons.
I. Retrieve component names
names(mysyn)
[1] "call" "m" "syn" "method"
[5] "visit.sequence" "predictor.matrix" "smoothing" "event"
[9] "denom" "proper" "n" "k"
[13] "rules" "rvalues" "cont.na" "semicont"
[17] "drop.not.used" "drop.pred.only" "models" "seed"
[21] "var.lab" "val.lab" "obs.vars" "numtocat"
[25] "catgroups"
This command retrieves the names of the components within the
mysyn
object, helping you understand its structure.
II. Explanation of some of the components
syn
: The synthetic data set.method
: The methods used for synthesising each variable.predictor.matrix
: The matrix indicating which variables were used as predictors for each synthesised variable.visit.sequence
: The order in which the variables were synthesised.cont.na
: Information about how missing values were handled.rules
: Any rules applied during synthesis.rvalues
: Values used for rules.m
: Number of synthetic datasets created (if multiple).proper
: Indicates if proper synthesis was used.seed
: The seed used for random number generation.call
: The original call to the syn() function.data
: The original data used for synthesis.
III. Inspect key components
mysyn$method
mysyn$predictor.matrix
mysyn$visit.sequence
mysyn$cont.na
mysyn$seed
sex age agegr edu socprof income marital
"sample" "cart" "cart" "cart" "cart" "cart" "cart"
sex age agegr edu socprof income marital
sex 0 0 0 0 0 0 0
age 1 0 0 0 0 0 0
agegr 1 1 0 0 0 0 0
edu 1 1 1 0 0 0 0
socprof 1 1 1 1 0 0 0
income 1 1 1 1 1 0 0
marital 1 1 1 1 1 1 0
sex age agegr edu socprof income marital
1 2 3 4 5 6 7
$sex
[1] NA
$age
[1] NA
$agegr
[1] NA
$edu
[1] NA
$socprof
[1] NA
$income
[1] -8
$marital
[1] NA
[1] 161401295
IV. Additional comparisons
To further validate the synthetic data, you can perform additional comparisons between the synthetic and original data using the multi.compare()
function.
# Additional comparisons
synthpop::multi.compare(mysyn, mydata, var = "marital", by = "sex")
synthpop::multi.compare(mysyn, mydata, var = "income", by = "agegr")
synthpop::multi.compare(mysyn, mydata, var = "income", by = "edu", cont.type = "boxplot")
Plots of marital by sex
Numbers in each plot (observed data):
sex
MALE FEMALE
2182 2818
Plots of income by agegr
Numbers in each plot (observed data):
agegr
16-24 25-34 35-44 45-59 60-64 65+ <NA>
702 726 748 1361 516 943 4
Plots of income by edu
Numbers in each plot (observed data):
edu
PRIMARY/NO EDUCATION VOCATIONAL/GRAMMAR SECONDARY
962 1613 1482
POST-SECONDARY OR HIGHER <NA>
936 7
These figures show examples of visuals from using the multi.compare()
function:

Fig. 47 Distribution_of_marital_variable_by_sex-synthetic_versus_original_data#

Fig. 48 Distribution_of_income_variable_by_agegroup-synthetic_versus_original_data#

Fig. 49 Distribution_of_income_variable_by_educationlevel-synthetic_versus_original_data#
These steps help ensure that the synthetic data accurately reflects the structure and relationships present in the original data, making it suitable for analysis while protecting the privacy of the original data.
Using synthpop
with sparklyr
for scalable synthetic data processing#
You can use synthpop
together with sparklyr
to generate synthetic data in R and then process it at scale with Apache Spark. In the example below, we build on Case 1 by copying the synthetic SD2011 data to Spark and using dplyr
verbs for distributed analysis. We demonstrate how to group the data by sex
and calculate the mean income
, but you can apply any other Spark operations as needed. The mysyn
object (of class synds
) generated earlier is used for this Spark analysis.
install.packages("sparklyr")
library(sparklyr)
sc <- sparklyr::spark_connect(
master = "local[2]",
app_name = "synthpop-spark",
config = sparklyr::spark_config())
# Access the synthetic data generated by synthpop (see Case 1)
synthetic_sd2011 <- mysyn$syn
# Copy the synthetic data frame from R into the Spark session as a Spark table
synthetic_sd2011_tbl <- copy_to(
sc,
synthetic_sd2011,
"synthetic_sd2011",
overwrite = TRUE
)
# Now you can use dplyr verbs with Spark for scalable processing
result_tbl <- synthetic_sd2011_tbl %>%
group_by(sex) %>%
summarise(
mean_income = mean(income, na.rm = TRUE),
count = n()
)
result_tbl %>% collect()
spark_disconnect(sc)
[38;5;246m# A tibble: 2 × 3[39m
sex mean_income count
[3m[38;5;246m<chr>[39m[23m [3m[38;5;246m<dbl>[39m[23m [3m[38;5;246m<dbl>[39m[23m
[38;5;250m1[39m MALE [4m1[24m590. [4m2[24m188
[38;5;250m2[39m FEMALE [4m1[24m276. [4m2[24m812
Case 2: Synthesise census teaching data#
In this section, we will guide you through the process of synthesising another case study as presented by Iain Dove, a census teaching data using the synthpop
package in R. This demo is an alternative to the synthesis of SD2011
in Case 1
. This involves loading the data, creating a subset, synthesising the data, and comparing the synthetic data with the original data.
Step-by-step guide#
Loading the data
# Load the configuration and set the path to the census teaching data
config <- yaml::yaml.load_file("/home/cdsw/ons-spark/config.yaml")
census_2011_path = config$census_2011_teaching_data_path_csv
census_teaching_data <- data.table::fread(census_2011_path, skip = 1) %>%
janitor::clean_names()
Create a subset of the data
As before, we first create a subset of the data to reduce run time.
# Create a subset of the data
small_census_teaching_data <- census_teaching_data[1:100000,]
First synthesis of census data
Let’s now create the synthetic data.
# First synthesis
synthetic_census_teaching_data <- synthpop::syn(data = small_census_teaching_data,
method = "sample",
k = 100000)
Variable residence_type has only one value so its method has been changed to "constant".
Synthesis
-----------
person_id region residence_type family_composition population_base sex age marital_status student country_of_birth
health ethnic_group religion economic_activity occupation industry hours_worked_per_week approximated_social_grade
You have created a synthetic data object synthetic_census_teaching_data
of class synds
, which is a list with a number of components including the synthesised data and information on how they were created.
# Check the class of the synthetic data object
class(synthetic_census_teaching_data)
[1] "synds"
Access the synthetic dataframe
We have now created the object mysyn
, which is of class synds
(the main output class from synthpop, containing the synthetic data and metadata).
To view the synthetic data, use synthetic_census_teaching_data$syn
, and use summary()
to review all variables
# Access the synthetic dataframe
synthetic_census_teaching_data$syn
summary(synthetic_census_teaching_data)
Synthetic object with one synthesis using methods:
person_id region residence_type
"sample" "sample" "constant"
family_composition population_base sex
"sample" "sample" "sample"
age marital_status student
"sample" "sample" "sample"
country_of_birth health ethnic_group
"sample" "sample" "sample"
religion economic_activity occupation
"sample" "sample" "sample"
industry hours_worked_per_week approximated_social_grade
"sample" "sample" "sample"
person_id region residence_type family_composition
Min. :7394484 Length:100000 Length:100000 Min. :-9.000
1st Qu.:7419840 Class :character Class :character 1st Qu.: 2.000
Median :7445549 Mode :character Mode :character Median : 2.000
Mean :7445451 Mean : 2.231
3rd Qu.:7470875 3rd Qu.: 3.000
Max. :7910302 Max. : 6.000
population_base sex age marital_status student
Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.00
1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.00 1st Qu.:1.000 1st Qu.:2.00
Median :1.000 Median :2.000 Median :4.00 Median :2.000 Median :2.00
Mean :1.016 Mean :1.502 Mean :3.93 Mean :1.849 Mean :1.78
3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:6.00 3rd Qu.:2.000 3rd Qu.:2.00
Max. :3.000 Max. :2.000 Max. :8.00 Max. :5.000 Max. :2.00
country_of_birth health ethnic_group religion
Min. :-9.0000 Min. :-9.000 Min. :-9.000 Min. :-9.000
1st Qu.: 1.0000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
Median : 1.0000 Median : 2.000 Median : 1.000 Median : 2.000
Mean : 0.9973 Mean : 1.672 Mean : 1.132 Mean : 2.349
3rd Qu.: 1.0000 3rd Qu.: 2.000 3rd Qu.: 1.000 3rd Qu.: 2.000
Max. : 2.0000 Max. : 5.000 Max. : 5.000 Max. : 9.000
economic_activity occupation industry hours_worked_per_week
Min. :-9.0000 Min. :-9.000 Min. :-9.000 Min. :-9.000
1st Qu.: 1.0000 1st Qu.:-9.000 1st Qu.:-9.000 1st Qu.:-9.000
Median : 1.0000 Median : 4.000 Median : 4.000 Median :-9.000
Mean : 0.6395 Mean : 1.482 Mean : 2.474 Mean :-3.485
3rd Qu.: 5.0000 3rd Qu.: 7.000 3rd Qu.: 8.000 3rd Qu.: 3.000
Max. : 9.0000 Max. : 9.000 Max. :12.000 Max. : 4.000
approximated_social_grade
Min. :-9.000
1st Qu.: 1.000
Median : 2.000
Mean : 0.272
3rd Qu.: 3.000
Max. : 4.000
Comparison with real data
synthpop
contains functions to assess the accuracy of the synthetic data. Univariate measures are quite easy to satisfy, but relationships between multiple variables are much more difficult.
# Compare synthetic data with real data
synthpop::compare(object = synthetic_census_teaching_data,
data = small_census_teaching_data)
Calculations done for person_id
Calculations done for region
Calculations done for residence_type
Calculations done for family_composition
Calculations done for population_base
Only 2 groups produced for sex even after changing method.
Calculations done for sex
Calculations done for age
Calculations done for marital_status
Only 2 groups produced for student even after changing method.
Calculations done for student
Calculations done for country_of_birth
Calculations done for health
Grouping changed from 'quantile' to 'equal' in function numtocat.syn for ethnic_group because only 2 groups were produced
Calculations done for ethnic_group
Calculations done for religion
Calculations done for economic_activity
Calculations done for occupation
Calculations done for industry
Calculations done for hours_worked_per_week
Calculations done for approximated_social_grade
Comparing percentages observed with synthetic
Press return for next variable(s):
Press return for next variable(s):
Press return for next variable(s):
Press return for next variable(s):
Selected utility measures:
pMSE S_pMSE df
person_id 1e-06 0.469578 4
region 3e-06 0.642902 7
residence_type 0e+00 NaN 0
family_composition 0e+00 0.002530 2
population_base 1e-06 1.082039 2
sex 3e-06 4.329842 1
age 5e-06 2.061816 4
marital_status 2e-06 0.783082 4
student 0e+00 0.013090 1
country_of_birth 1e-06 0.752260 2
health 1e-06 0.750641 2
ethnic_group 1e-06 0.780810 2
religion 6e-06 4.579195 2
economic_activity 0e+00 0.189024 3
occupation 4e-06 2.019962 3
industry 0e+00 0.122714 3
hours_worked_per_week 2e-06 0.810361 4
approximated_social_grade 2e-06 0.793122 4
Relationships between multiple variables#
Relationships between multiple variables are more complex to model. The synthpop
default uses a saturated model, which can be challenging for large datasets. Here, we use an input CSV
file to specify which variables are related.
Load the input CSV file
config <- yaml::yaml.load_file("/home/cdsw/ons-spark/config.yaml")
census_relationship_path = config$census_relationship_file_path_csv
input <- read.csv(census_relationship_path,
stringsAsFactors = FALSE,
skip = 1) %>%
dplyr::select(variable, description, predictors)
# Add row names to make it easier
rownames(input) <- input$variable
Create predictor matrix
The predictor matrix tells synthpop
which variables are related and which variables predict each other. If variable X
is used as a predictor for variable Y
, the relationship between X
and Y
will be preserved to some extent.
The predictor matrix should have the following structure:
Each row represents a variable that is being predicted.
Each column represents a variable that is used as a predictor.
A value of 1 in the matrix indicates that the column variable is used as a predictor for the row variable.
A value of 0 indicates that the column variable is not used as a predictor for the row variable.
# Create predictor matrix in appropriate format for synthpop from CSV input
predictor_matrix <- as.data.frame(matrix(0, nrow = nrow(input),
ncol = nrow(input),
dimnames = list(input$variable, input$variable)))
# Use predictors
for (var in input$variable) {
predictor_matrix[var, as.vector((strsplit(input[var, 'predictors'], " "))[[1]])] <- 1
}
Relationships can be added manually (row
variable predicted by column
variable)
Sensible predictors could be Marital Status being predicted by Age
(under 16s not married) hours worked being predicted by Economic Activity
.
# Relationships can be added manually (row variable predicted by column variable)
predictor_matrix['Marital_Status', 'Age'] <- 1
predictor_matrix['Economic_Activity', 'Age'] <- 1
predictor_matrix['Occupation', 'Economic_Activity'] <- 1
predictor_matrix['Hours_worked_per_week', 'Economic_Activity'] <- 1
Synthesise data with relationships
These relationships are used in the synthesis model. The minnumlevels
parameter converts numeric variables with few unique values into factor variables for the synthesis.
# Synthesise data with relationships
synthetic_census_teaching_data_2 <- synthpop::syn(data = small_census_teaching_data,
predictor.matrix = as.matrix(predictor_matrix),
minnumlevels = 10)
Variable(s): region have been changed for synthesis from character to factor.
Variable(s): age, marital_status, student, ethnic_group, religion, economic_activity, occupation, hours_worked_per_week, approximated_social_grade numeric but with only 10 or fewer distinct values turned into factor(s) for synthesis.
Variable residence_type has only one value so its method has been changed to "constant".
Method "cart" is not valid for a variable without predictors (region)
Method has been changed to "sample"
Method "cart" is not valid for a variable without predictors (family_composition)
Method has been changed to "sample"
Method "cart" is not valid for a variable without predictors (population_base)
Method has been changed to "sample"
Method "cart" is not valid for a variable without predictors (sex)
Method has been changed to "sample"
Method "cart" is not valid for a variable without predictors (age)
Method has been changed to "sample"
Method "cart" is not valid for a variable without predictors (country_of_birth)
Method has been changed to "sample"
Method "cart" is not valid for a variable without predictors (health)
Method has been changed to "sample"
Synthesis
-----------
person_id region residence_type family_composition population_base sex age marital_status student country_of_birth
health ethnic_group religion economic_activity occupation industry hours_worked_per_week approximated_social_grade
Comparison with real data#
To compare the synthetic and original data, use:
# Compare synthetic data with real data
synthpop::compare(data = small_census_teaching_data,
object = synthetic_census_teaching_data_2)
Calculations done for person_id
Calculations done for region
Calculations done for residence_type
Calculations done for family_composition
Calculations done for population_base
Only 2 groups produced for sex even after changing method.
Calculations done for sex
Calculations done for age
Calculations done for marital_status
Only 2 groups produced for student even after changing method.
Calculations done for student
Calculations done for country_of_birth
Calculations done for health
Grouping changed from 'quantile' to 'equal' in function numtocat.syn for ethnic_group because only 2 groups were produced
Calculations done for ethnic_group
Calculations done for religion
Calculations done for economic_activity
Calculations done for occupation
Calculations done for industry
Calculations done for hours_worked_per_week
Calculations done for approximated_social_grade
Comparing percentages observed with synthetic
Press return for next variable(s):
Press return for next variable(s):
Press return for next variable(s):
Press return for next variable(s):
Selected utility measures:
pMSE S_pMSE df
person_id 2.0e-06 0.935440 4
region 1.0e-06 0.176441 7
residence_type 0.0e+00 NaN 0
family_composition 1.0e-06 0.593263 2
population_base 0.0e+00 0.217593 2
sex 6.0e-06 10.162976 1
age 4.0e-06 1.633541 4
marital_status 3.0e-06 1.106891 4
student 2.0e-06 3.309619 1
country_of_birth 1.0e-06 0.558046 2
health 0.0e+00 0.111749 2
ethnic_group 1.0e-06 0.891315 2
religion 1.0e-06 0.871088 2
economic_activity 2.0e-06 0.895892 3
occupation 7.9e-05 42.373300 3
industry 3.7e-05 19.723638 3
hours_worked_per_week 1.0e-05 4.089040 4
approximated_social_grade 5.2e-05 20.624287 4
Press enter
on your keyboard for the next variable
You can write your own code to compare the datasets, particularly on variables/statistics of interest, e.g., compare bi-variate counts.
# Define variables to compare
compare_vars <- c("economic_activity", "hours_worked_per_week")
# Frequency table of compare_vars in original data
orig_table <- small_census_teaching_data[, .(orig_count = .N), keyby = compare_vars]
# Convert to factors
orig_table <- orig_table %>%
dplyr::mutate(economic_activity = as.factor(economic_activity),
hours_worked_per_week = as.factor(hours_worked_per_week))
orig_table
Key: <economic_activity, hours_worked_per_week>
economic_activity hours_worked_per_week orig_count
<fctr> <fctr> <int>
1: -9 -9 19948
2: 1 1 2649
3: 1 2 7813
4: 1 3 24310
5: 1 4 4117
6: 2 1 692
7: 2 2 1227
8: 2 3 2966
9: 2 4 1468
10: 3 -9 3388
11: 4 -9 557
12: 4 1 1238
13: 4 2 413
14: 4 3 307
15: 4 4 33
16: 5 -9 16588
17: 6 -9 4114
18: 7 -9 3266
19: 8 -9 3324
20: 9 -9 1582
economic_activity hours_worked_per_week orig_count
# Frequency table of compare_vars in synthetic data with relationships
synth2_table <- as.data.table(synthetic_census_teaching_data_2$syn)[, .(synth2_count = .N), keyby = compare_vars]
synth2_table
# Convert economic_activity in synth2_table to a factor
synth2_table <- synth2_table %>%
dplyr::mutate(economic_activity = as.factor(economic_activity),
hours_worked_per_week = as.factor(hours_worked_per_week))
Key: <economic_activity, hours_worked_per_week>
economic_activity hours_worked_per_week synth2_count
<num> <num> <int>
1: -9 -9 20103
2: 1 1 2602
3: 1 2 8056
4: 1 3 24172
5: 1 4 3966
6: 2 1 715
7: 2 2 1295
8: 2 3 2947
9: 2 4 1505
10: 3 -9 3405
11: 4 -9 375
12: 4 1 1110
13: 4 2 407
14: 4 3 536
15: 4 4 76
16: 5 -9 16446
17: 6 -9 4155
18: 7 -9 3187
19: 8 -9 3326
20: 9 -9 1616
economic_activity hours_worked_per_week synth2_count
# Frequency table of compare_vars in basic synthetic data
synth1_table <- as.data.table(synthetic_census_teaching_data$syn)[, .(synth1_count = .N), keyby = compare_vars]
synth1_table
# Convert to factors
synth1_table <- synth1_table %>%
dplyr::mutate(economic_activity = as.factor(economic_activity),
hours_worked_per_week = as.factor(hours_worked_per_week))
Key: <economic_activity, hours_worked_per_week>
economic_activity hours_worked_per_week synth1_count
<int> <int> <int>
1: -9 -9 10636
2: -9 1 859
3: -9 2 1860
4: -9 3 5453
5: -9 4 1123
6: 1 -9 20467
7: 1 1 1724
8: 1 2 3692
9: 1 3 10691
10: 1 4 2226
11: 2 -9 3378
12: 2 1 311
13: 2 2 612
14: 2 3 1713
15: 2 4 326
16: 3 -9 1775
17: 3 1 164
18: 3 2 342
19: 3 3 891
20: 3 4 179
21: 4 -9 1360
22: 4 1 122
23: 4 2 231
24: 4 3 744
25: 4 4 147
26: 5 -9 8965
27: 5 1 796
28: 5 2 1514
29: 5 3 4595
30: 5 4 974
31: 6 -9 2141
32: 6 1 191
33: 6 2 379
34: 6 3 1145
35: 6 4 239
36: 7 -9 1718
37: 7 1 138
38: 7 2 302
39: 7 3 906
40: 7 4 179
41: 8 -9 1721
42: 8 1 143
43: 8 2 305
44: 8 3 879
45: 8 4 191
46: 9 -9 814
47: 9 1 76
48: 9 2 144
49: 9 3 430
50: 9 4 89
economic_activity hours_worked_per_week synth1_count
# Combine counts from both datasets for comparisons
comparison2_table <- merge(orig_table, synth2_table, all = TRUE)
comparison1_table <- merge(orig_table, synth1_table, all = TRUE)
# Replace NA with 0
comparison1_table <- comparison1_table %>%
dplyr::mutate(orig_count = as.double(orig_count)) %>%
dplyr::mutate(orig_count = if_else(condition = is.na(orig_count),
true = 0,
false = orig_count))
# NAs occur if the combination is not present in one of the datasets, convert these to zero counts
comparison2_table[is.na(orig_count), orig_count := 0]
comparison2_table[is.na(synth2_count), synth2_count := 0]
comparison1_table[is.na(orig_count), orig_count := 0]
comparison1_table[is.na(synth1_count), synth1_count := 0]
# Average absolute distance
comparison2_table[, mean(abs(orig_count - synth2_count))]
comparison1_table[, mean(abs(orig_count - synth1_count))]
# Average percentage difference
comparison2_table[, 100 * mean((abs(orig_count - synth2_count) / orig_count))]
comparison1_table[, 100 * mean((abs(orig_count - synth1_count) / orig_count))]
[1] 89.2
[1] 1987.4
[1] 13.91346
[1] Inf
This guide provides a concise overview of synthesizing census teaching data with relationships between multiple variables using the synthpop
package in R. For more detailed information, refer to the synthpop documentation.
References#
Acknowledgments#
Special thanks to Iain Dove for sharing his knowledge of the Synthetic data, a useful tool for ONS
, and to Wil Roberts, Elisha Mercado and Vicky Pickering for inspiring this tip!