Synthetic Data Generation in Python: Faker and Mimesis#

1. Introduction#

This guide shows how to generate synthetic data in Python. Synthetic data is artificially created information that closely resembles real-world data but contains no actual personal or sensitive information. It is invaluable for:

  • Testing and validating data pipelines and machine learning models

  • Protecting privacy and confidentiality

  • Simulating rare or edge-case scenarios

  • Enabling reproducible research and open data sharing

In this notebook, we demonstrate how to generate synthetic data using two leading Python libraries: Faker and Mimesis. For each example, we show parallel code for both libraries, highlighting their similarities and differences. We also cover best practices, advanced use cases, and how to scale up data generation with PySpark for big data applications.

By the end of this notebook, you will be able to:

  • Generate realistic synthetic data for a variety of domains

  • Choose between Faker and Mimesis based on your needs

  • Understand how to ensure reproducibility and locale-specific data

  • Scale up data generation for large datasets

  • Apply best practices for synthetic data projects

2. Install and setup#

First, install the required libraries and check their versions to ensure compatibility.

!pip install faker mimesis
import faker
from importlib.metadata import version  # pkg_resources is deprecated in favour of importlib.metadata
print("Faker version:", version("Faker"))
Faker version: 37.4.0

3. Setting a random seed#

Setting a random seed ensures reproducibility, so you get the same synthetic data each time you run the code.

from faker import Faker

# Set up Faker
fake_uk = Faker('en_GB')
fake_uk.seed_instance(42)

4. Providers and Locales in Faker and Mimesis#

Both Faker and Mimesis use the concept of providers (modules or classes that generate specific types of data) and locales (to match country or language conventions). Below, we show how to list available providers and generate data for different locales.

4.1 Providers#

You can explore the full list of available providers and their methods in the official Faker and Mimesis documentation, which offers detailed information and usage examples for each one.

print("Faker Providers:")
for provider in fake_uk.providers:
    print("-", provider)
Faker Providers:
- <faker.providers.user_agent.Provider object at 0x00000199DED9E740>
- <faker.providers.ssn.en_GB.Provider object at 0x00000199DED9F670>
- <faker.providers.sbn.Provider object at 0x00000199DED9C610>
- <faker.providers.python.Provider object at 0x00000199DED9C1F0>
- <faker.providers.profile.Provider object at 0x00000199DED9E470>
- <faker.providers.phone_number.en_GB.Provider object at 0x00000199DED9FD60>
- <faker.providers.person.en_GB.Provider object at 0x00000199DED9C070>
- <faker.providers.passport.en_US.Provider object at 0x00000199DED9CAC0>
- <faker.providers.misc.en_US.Provider object at 0x00000199DED9FDF0>
- <faker.providers.lorem.la.Provider object at 0x00000199DED9FE80>
- <faker.providers.job.en_US.Provider object at 0x00000199DED9C820>
- <faker.providers.isbn.en_US.Provider object at 0x00000199DED9C880>
- <faker.providers.internet.en_GB.Provider object at 0x00000199DED9C9A0>
- <faker.providers.geo.en_US.Provider object at 0x00000199DED9C8B0>
- <faker.providers.file.Provider object at 0x00000199DED9F7F0>
- <faker.providers.emoji.Provider object at 0x00000199DED9F8B0>
- <faker.providers.doi.Provider object at 0x00000199DED9F910>
- <faker.providers.date_time.en_US.Provider object at 0x00000199DED9FA00>
- <faker.providers.currency.en_US.Provider object at 0x00000199DED9DB10>
- <faker.providers.credit_card.en_US.Provider object at 0x00000199DED9E5C0>
- <faker.providers.company.en_US.Provider object at 0x00000199DED9E6B0>
- <faker.providers.color.en_US.Provider object at 0x00000199DED9DCC0>
- <faker.providers.barcode.en_US.Provider object at 0x00000199DED9EB00>
- <faker.providers.bank.en_GB.Provider object at 0x00000199DED9DBD0>
- <faker.providers.automotive.en_GB.Provider object at 0x00000199DEDC9AE0>
- <faker.providers.address.en_GB.Provider object at 0x00000199DD272AD0>

4.2 Locales#

Locales allow you to generate data that matches the conventions and formats of different countries and regions. This is useful for internationalisation and region-specific testing.

To demonstrate how locales influence data generation, let us create a simple example that generates person and address data for various regions. Note that both libraries also support language-specific data generation; for example, outputs for Arabic and Japanese locales are written in those languages. For a full list of supported locales, please refer to the official Faker and Mimesis documentation.

The following example generates name, email, and job data for several locales:

print("\nFaker Examples for Multiple Locales:")
locales = ['en_GB', 'en_US', 'ja_JP', 'fr_FR', 'ar']
for loc in locales:
    fake = Faker(loc)
    fake.seed_instance(42)
    print(f"\n{loc.upper()} Example:")
    print("Name:", fake.name())
    print("Email:", fake.email())
    print("Job:", fake.job())
Faker Examples for Multiple Locales:

EN_GB Example:
Name: William Jennings
Email: duncan32@example.net
Job: Civil engineer, consulting

EN_US Example:
Name: Allison Hill
Email: donaldgarcia@example.net
Job: Sports administrator

JA_JP Example:
Name: 佐藤 淳
Email: tomoyatakahashi@example.org
Job: 測量士

FR_FR Example:
Name: Alexandre Traore
Email: patrick32@example.net
Job: boucher

AR Example:
Name: جلاء بنو العريج
Email: dyaalhaby@example.net
Job: رجل مباحث

4.3 Generate user profiles with multiple providers#

Combining multiple providers allows you to create rich, realistic user profiles for testing and simulation.

profile = {
    "Name": fake_uk.name(),
    "Address": fake_uk.address().replace("\n", ", "),
    "Email": fake_uk.email(),
    "Job": fake_uk.job(),
    "Phone": fake_uk.phone_number(),
    "Company": fake_uk.company(),
    "Date of Birth": fake_uk.date_of_birth(minimum_age=18, maximum_age=90)
}
print("Faker User Profile:")
for k, v in profile.items():
    print(f"{k}: {v}")
Faker User Profile:
Name: William Jennings
Address: 2 Sian streets, New Maryton, E3 8ZA
Email: francescaharrison@example.com
Job: English as a second language teacher
Phone: +44151 496 0940
Company: Harrison LLC
Date of Birth: 1972-04-21

4.4 Generating text using Faker and Mimesis#

Generating synthetic text is useful for testing NLP pipelines, populating free-text fields, or simulating survey responses. Here, we show how to generate random sentences, words, paragraphs, and quotes, starting with Faker; Mimesis offers an equivalent Text provider.

fake = Faker()

print('Sentence:', fake.sentence())
print('Paragraph:', fake.paragraph())
print('Text:', fake.text(max_nb_chars=100))
print('Word:', fake.word())
print('Quote:', fake.catch_phrase())
Sentence: Other hotel financial want nation.
Paragraph: Kitchen together begin town draw story capital piece. Full which agency democratic election cold although because.
Text: I president whatever very necessary seat color. Official behind yeah claim focus.
Word: team
Quote: Compatible modular adapter

5. Generating data at scale with PySpark#

For large datasets, you can use Faker or Mimesis to generate records and load them into a Spark DataFrame. The example below uses Faker; Mimesis follows the same pattern.

from pyspark.sql import SparkSession
import os

spark = SparkSession.builder.master("local[2]").appName("SyntheticDataExample").getOrCreate()

n_rows = 1000


fake_uk = Faker('en_GB')
fake_uk.seed_instance(42)

faker_data = [(fake_uk.name(), fake_uk.address().replace("\n", ", "), fake_uk.email()) for _ in range(n_rows)] 
columns = ["Name", "Address", "Email"] 

# Check if in virtual environment
is_in_venv = os.getenv('VIRTUAL_ENV') is not None

if not is_in_venv:
    print(f"is_in_venv: {is_in_venv} | Spark DataFrame to be created from: spark.createDataFrame")
    df_spark = spark.createDataFrame(faker_data, schema=columns)
else:
    print(f"is_in_venv: {is_in_venv} | Spark DataFrame to be created from: Pandas DataFrame -> CSV -> Spark DataFrame")
    import pandas as pd
    df_pd = pd.DataFrame(faker_data, columns=columns)

    file_path = "personal_data_temp.csv"
    df_pd.to_csv(file_path, index=False)
    df_spark = spark.read.csv(file_path, header=True, inferSchema=True)

df_spark.show(5)
spark.stop()
is_in_venv: False | Spark DataFrame to be created from: spark.createDataFrame
+--------------------+--------------------+--------------------+
|                Name|             Address|               Email|
+--------------------+--------------------+--------------------+
|    William Jennings|2 Sian streets, N...|francescaharrison...|
|     Rosemary Wright|654 Robin track, ...|simpsongemma@exam...|
|         Sean Norton|103 Robinson walk...|  rita19@example.net|
|       Brenda Briggs|Studio 4, Lydia i...|iwilkins@example.org|
|Leonard Powell-Mo...|Flat 32G, Green c...|andrea01@example.net|
+--------------------+--------------------+--------------------+
only showing top 5 rows

6. Generating fake population data#

Generating synthetic population data is essential for simulating census datasets, survey microdata, or anonymised samples. In this section, we demonstrate how to create tabular, population-like data with custom categories, such as sex, ethnicity, marital status, and employment status, while also simulating missing values, using the Faker library. The same approach and logic can be applied with Mimesis.

Note: To generate a name consistent with a specific gender in Mimesis, call the Person provider's first_name() and last_name() methods separately, passing the Gender.MALE or Gender.FEMALE enum (from mimesis.enums) for the first name. This ensures that gender-related fields (such as first name and title) are logically consistent within each synthetic record.

from faker import Faker
import numpy as np
import random
import datetime
from pyspark.sql import SparkSession
import os
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

fake = Faker('en_GB')
Faker.seed(42)
random.seed(42)
np.random.seed(42)

n_rows = 1000
shared_address_rows = 10

schema = StructType([
    StructField('ID', StringType(), True),
    StructField('SOURCE_FILE', StringType(), True),
    StructField('TIME_STAMP', StringType(), True),
    StructField('Local_Authority', StringType(), True),
    StructField('Postcode', StringType(), True),
    StructField('Address_Line_1', StringType(), True),
    StructField('Address_Line_2', StringType(), True),
    StructField('Address_Line_3', StringType(), True),
    StructField('Address_Line_4', StringType(), True),
    StructField('Address_Line_5', StringType(), True),
    StructField('Address_Line_6', StringType(), True),
    StructField('Last_Name', StringType(), True),
    StructField('First_Name', StringType(), True),
    StructField('Sex', StringType(), True),
    StructField('Title', StringType(), True),
    StructField('Marital_Status', StringType(), True),
    StructField('DOB', StringType(), True),
    StructField('Ethnicity', StringType(), True),
    StructField('Education_Level', StringType(), True),
    StructField('Nationality', StringType(), True),
    StructField('Income', IntegerType(), True),
    StructField('Employment_Status', StringType(), True),
    StructField('Academic_Degree', StringType(), True),
    StructField('guid', StringType(), True)
])

categories = {
    'sex': ['Male', 'Female'],
    # Per-sex lists allow different categories later; identical for now
    'marital_status': {
        'Male': ['Single', 'Married', 'Divorced', 'Widowed'],
        'Female': ['Single', 'Married', 'Divorced', 'Widowed']
    },
    'education_level': [
        'No Qualifications', 'GCSEs', 'A-Levels', 'Apprenticeship',
        "Bachelor's Degree", "Master's Degree", 'PhD'
    ],
    'ethnicity': ['White', 'Black', 'Asian', 'Mixed', 'Other'],
    'employment_status': ['Employed', 'Unemployed', 'Student', 'Retired'],
    'academic_degree': [
        "None", "Bachelor's Degree", "Master's Degree", "PhD", "Diploma", "Certificate"
    ],
    'nationality': [
        "British", "Irish", "Polish", "Indian", "Pakistani", "Nigerian", "Chinese", "Other"
    ]
}

def maybe_missing(val, p=0.1):
    return val if random.random() > p else None

def random_timestamp():
    # Despite the name, this stamps every row with the current run time;
    # use fake.date_time_between(...) instead for genuinely random timestamps
    return datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')

shared_address = {
    'Local_Authority': fake.county(),
    'Postcode': fake.postcode(),
    'Address_Line_1': str(random.randint(1, 200)),
    'Address_Line_2': fake.street_name(),
    'Address_Line_3': None,
    'Address_Line_4': fake.city(),
    'Address_Line_5': fake.country(),
    'Address_Line_6': 'United Kingdom',
}

columns = [
    'ID', 'SOURCE_FILE', 'TIME_STAMP', 'Local_Authority', 'Postcode', 'Address_Line_1',
    'Address_Line_2', 'Address_Line_3', 'Address_Line_4', 'Address_Line_5', 'Address_Line_6',
    'Last_Name', 'First_Name', 'Sex', 'Title', 'Marital_Status', 'DOB', 'Ethnicity', 'Education_Level',
    'Nationality', 'Income', 'Employment_Status', 'Academic_Degree', 'guid'
]

rows = []
for i in range(n_rows):
    sex = random.choice(categories['sex'])
    first_name = fake.first_name_male() if sex == 'Male' else fake.first_name_female()
    title = random.choice(['Mr', 'Dr', 'Prof']) if sex == 'Male' else random.choice(['Ms', 'Mrs', 'Miss', 'Dr', 'Prof'])
    last_name = fake.last_name()
    marital_status = random.choice(categories['marital_status'][sex])
    if i >= n_rows - shared_address_rows:
        address = shared_address
    else:
        address = {
            'Local_Authority': fake.county(),
            'Postcode': fake.postcode(),
            'Address_Line_1': str(random.randint(1, 200)),
            'Address_Line_2': fake.street_name(),
            'Address_Line_3': None,
            'Address_Line_4': fake.city(),
            'Address_Line_5': fake.country(),
            'Address_Line_6': 'United Kingdom',
        }
    age = random.randint(18, 99)
    dob = datetime.date.today() - datetime.timedelta(days=age*365 + random.randint(0, 364))  # approximate; ignores leap years
    income_val = maybe_missing(random.randint(25000, 75000))
    income_val = int(income_val) if income_val is not None else None
    row = {
        'ID': str(i+1).zfill(len(str(n_rows))),
        'SOURCE_FILE': f"{fake.user_name()}.csv",
        'TIME_STAMP': random_timestamp(),
        **address,
        'Last_Name': last_name,
        'First_Name': first_name,
        'Sex': sex,
        'Title': title,
        'Marital_Status': marital_status,
        'DOB': dob.strftime('%d-%m-%Y'),
        'Ethnicity': maybe_missing(random.choice(categories['ethnicity'])),
        'Education_Level': maybe_missing(random.choice(categories['education_level'])),
        'Nationality': random.choice(categories['nationality']),
        'Income': income_val,
        'Employment_Status': maybe_missing(random.choice(categories['employment_status'])),
        'Academic_Degree': maybe_missing(random.choice(categories['academic_degree'])),
        'guid': fake.uuid4()
    }
    # Ensure all keys are present
    for col in columns:
        if col not in row:
            row[col] = None
    rows.append(row)

spark = SparkSession.builder.appName("FakerData").getOrCreate()
is_in_venv = os.getenv('VIRTUAL_ENV') is not None

if is_in_venv:
    print(f"is_in_venv: {is_in_venv} | Spark DataFrame to be created from: CSV file")
    file_path = "population_data_temp_faker.csv"
    import csv
    with open(file_path, "w", newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    population_df = spark.read.csv(file_path, header=True, schema=schema)
else:
    print(f"is_in_venv: {is_in_venv} | Spark DataFrame to be created from: spark.createDataFrame")
    population_df = spark.createDataFrame(rows, schema=schema)

# To display the first 9 columns and 5 rows
population_df.select(population_df.columns[:9]).show(5)

# Or if you want to show specific columns
cols_to_show = ["ID", "First_Name", "Last_Name", "Title", "Sex", "DOB", "Ethnicity", "Address_Line_1", "Address_Line_2", "Postcode"]
population_df.select(cols_to_show).show(5)
spark.stop()
is_in_venv: False | Spark DataFrame to be created from: spark.createDataFrame
+----+----------------+-------------------+---------------+--------+--------------+---------------+--------------+-----------------+
|  ID|     SOURCE_FILE|         TIME_STAMP|Local_Authority|Postcode|Address_Line_1| Address_Line_2|Address_Line_3|   Address_Line_4|
+----+----------------+-------------------+---------------+--------+--------------+---------------+--------------+-----------------+
|0001|  evansleigh.csv|2025-07-01 14:05:16|   Glasgow City| S37 0GS|            63|    White forge|          null|      Timothyland|
|0002|palmersandra.csv|2025-07-01 14:05:16|  Pembrokeshire|  M9 9HD|            57|    Irene forks|          null|        Jasonbury|
|0003|  wilsonryan.csv|2025-07-01 14:05:16|       Stirling| SN1 8JG|            89|     Lamb route|          null|      Lake Victor|
|0004|      ievans.csv|2025-07-01 14:05:16|       Highland|HD3X 6TF|            60|Davidson summit|          null|     Fionachester|
|0005|     jemma51.csv|2025-07-01 14:05:16|     Lancashire|CW6W 6RZ|            98|     Hill plaza|          null|Lake Geoffreyside|
+----+----------------+-------------------+---------------+--------+--------------+---------------+--------------+-----------------+
only showing top 5 rows

+----+----------+---------+-----+------+----------+---------+--------------+---------------+--------+
|  ID|First_Name|Last_Name|Title|   Sex|       DOB|Ethnicity|Address_Line_1| Address_Line_2|Postcode|
+----+----------+---------+-----+------+----------+---------+--------------+---------------+--------+
|0001|    Howard|   Foster|   Mr|  Male|03-05-1979|     null|            63|    White forge| S37 0GS|
|0002|   Bernard|Wilkinson| Prof|  Male|22-09-1949|    White|            57|    Irene forks|  M9 9HD|
|0003|    Sheila|    Evans|   Ms|Female|12-03-1930|    Other|            89|     Lamb route| SN1 8JG|
|0004|      Ryan|    Kelly|   Dr|  Male|27-12-1994|    Asian|            60|Davidson summit|HD3X 6TF|
|0005|     Peter|   Taylor|   Mr|  Male|21-08-1972|    Asian|            98|     Hill plaza|CW6W 6RZ|
+----+----------+---------+-----+------+----------+---------+--------------+---------------+--------+
only showing top 5 rows

7. Best practices#

  • Set a random seed for reproducibility.

  • Choose the appropriate locale for your use case.

  • Combine multiple providers for richer data.

  • For large datasets, use PySpark or batch processing.

  • Refer to the Faker documentation and Mimesis documentation for more features.

8. Conclusion#

Synthetic data generation is a powerful tool for testing, privacy protection, and simulation. Both Faker and Mimesis offer extensive capabilities for generating realistic, diverse data in Python. By using parallel examples, you can choose the library that best fits your needs and easily adapt your code for different scenarios.

We have demonstrated:

  • The core concepts of providers and locales.

  • How to generate a wide range of data types, including text, user profiles, and population data.

  • How to scale up data generation using PySpark.

  • Best practices for reproducibility and data realism.

Feel free to expand on these examples and adapt them to your own projects.

References and further reading#