Synthetic data distributions (Random, CTGAN, CopulaGAN, etc.)

Synthetic data distributions range from (1) random, to (2) similar to the source, to (3) sophisticated statistical and AI-based distributions. Depending on your needs, any one of these types may be suitable.

1. Random - The simplest distribution: the generated data is not correlated with anything because it is produced from random numbers. This is useful for testing software that consumes data.

2. Similar to source - The frequency curve of the synthetic data set (X axis: value, Y axis: how often the value occurs) "looks like" the frequency curve of the source data set. More specifically, the curves have similar slopes but potentially different Y values for the same X values. This distribution is useful for software testing and for some simple machine learning use cases.

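3. Sophisticated statistical and AI distributions - These include GAN-based models such as CTGAN and CopulaGAN, as well as statistical models such as GaussianCopula. They model tabular data using sophisticated statistical and/or machine learning techniques, with the aim of generating synthetic data that cannot be "identified with the source".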
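The difference between the first two types can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the function names `random_values` and `similar_to_source` and the sample column `source_ages` are made up for this example, and resampling by observed frequency is just one simple way to get a "similar to source" distribution.

```python
import random
from collections import Counter


def random_values(n, low, high, seed=None):
    """Type 1: integers drawn uniformly at random, unrelated to any source data."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(n)]


def similar_to_source(source, n, seed=None):
    """Type 2: resample values so their frequencies roughly follow the source."""
    rng = random.Random(seed)
    counts = Counter(source)          # how often each value occurs in the source
    values = list(counts.keys())
    weights = list(counts.values())   # more frequent values are drawn more often
    return rng.choices(values, weights=weights, k=n)


if __name__ == "__main__":
    source_ages = [25, 25, 25, 30, 30, 41]  # made-up source column
    print(random_values(5, 18, 90, seed=0))
    print(similar_to_source(source_ages, 5, seed=0))
```

The random version can produce any value in the range, while the resampled version only produces values seen in the source and favors the common ones. Neither approach preserves relationships between columns, which is where the third type comes in.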

Synthetic Data Vault (SDV) is an open-source project that originated at MIT and generates data using these concepts.

Python code for generating synthetic data with SDV using the GAN-based models (CTGAN, CopulaGAN) or GaussianCopula

For a single table

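# Note: the sdv.tabular module and the anonymize_fields argument belong to SDV releases before 1.0.
# fieldName1, fieldName2, yourDataFrame and numberOfRowsYouWant are placeholders for your own values.

from sdv.tabular import CTGAN, CopulaGAN, GaussianCopula

# Populate a PII dictionary to tell SDV which type of synthetic data (name, address, SSN, etc.) to generate for each field
piiDict = {}
piiDict[fieldName1] = 'ssn'
piiDict[fieldName2] = 'name'
modelName = 'ctgan'

# Pick the model class by name (case-insensitive)
if modelName.lower() == 'gaussiancopula':
    model = GaussianCopula(anonymize_fields=piiDict)
elif modelName.lower() == 'ctgan':
    model = CTGAN(anonymize_fields=piiDict)
elif modelName.lower() == 'copulagan':
    model = CopulaGAN(anonymize_fields=piiDict)
else:
    raise ValueError('Unknown model name: ' + modelName)

# Learn the distribution of the source data, then sample new rows from it
model.fit(yourDataFrame)
syntheticDataframe = model.sample(num_rows=int(numberOfRowsYouWant))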
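
After sampling, one rough sanity check is to compare how often each value occurs in a column of the source data versus the synthetic data. The sketch below uses only the standard library, with plain lists standing in for DataFrame columns. The helper names `relative_frequencies` and `total_variation_distance` are made up for this example and are not part of SDV.

```python
from collections import Counter


def relative_frequencies(values):
    """Map each distinct value to the fraction of rows in which it occurs."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(source, synthetic):
    """Return 0.0 for identical frequency curves and 1.0 for no overlap at all."""
    p = relative_frequencies(source)
    q = relative_frequencies(synthetic)
    keys = set(p) | set(q)  # values seen in either data set
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


if __name__ == "__main__":
    source_column = ["CA", "CA", "NY", "TX"]     # made-up source values
    synthetic_column = ["CA", "NY", "NY", "TX"]  # made-up synthetic values
    print(total_variation_distance(source_column, synthetic_column))
```

Because the frequencies are normalized, the two data sets can have different row counts. With pandas, the two lists could come from something like yourDataFrame[columnName].tolist() and syntheticDataframe[columnName].tolist(), where columnName is a placeholder for one of your own columns. This check suits categorical or low-cardinality columns; continuous values would need to be binned first.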