Generating deterministic synthetic data

Simple synthetic data generators rely, at their core, on random number generation to create the data. The random numbers are themselves generated from a "seed". If you do not supply one, the seed is often derived from the current system time, so every run produces different values. If you give a random number generator the same seed, you get the same series of random numbers each time.
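As a minimal sketch of the seeding behavior described above, the following uses Python's standard `random` module. The helper name `random_series` and the seed values are illustrative assumptions, not part of any library.

```python
import random


def random_series(seed, count=5):
    """Return `count` pseudo-random floats from a generator seeded with `seed`."""
    # A private generator instance, so the global random state is left untouched
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]


first_run = random_series(seed=10)
second_run = random_series(seed=10)
other_seed = random_series(seed=11)

# Same seed gives the same series; a different seed gives a different one
print(first_run == second_run)
print(first_run == other_seed)
```

Using a separate `random.Random` instance rather than the module-level functions is a design choice here: it keeps the series reproducible even if other code also draws random numbers.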

After a random number is created, the synthetic data is generated as follows:

  1. Integer types - Scale the random number (typically a float between 0 and 1) by the size of an integer range and add the lower bound to get the synthetic value. Choose the range based on your needs (for example, you could take it from a source of data if you have one).
  2. Decimal / float / double - Use the same approach as above, with a range of decimal / float / double values.
  3. Text - Use a library such as Faker to generate the text data, specifying the type of text (such as name, address, phone, etc). To create deterministic synthetic data, maintain an array of values for each type and use a random number generator with the same seed to index into the array of text values.
  4. Dates/Times - Use a base date time and use the random number generator to add an offset in hours. Scale the random number according to how long you want the range of generated date time values to be. Specify the same initial seed to the random number generator to get deterministic synthetic date time values.

    import datetime
    import random

    # Seed the generator so every run produces the same date time values
    random.seed(10)
    start_time = datetime.datetime(2020, 1, 1)
    end_time = datetime.datetime(2030, 1, 1)
    # Number of whole hours between the start and the end of the range
    range_hours = int((end_time - start_time).total_seconds() // 3600)
    for _ in range(10):
        # datetime.datetime keeps the hour offset; datetime.date would drop it
        each_time = start_time + datetime.timedelta(hours=random.randint(0, range_hours))
        print(each_time)
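Items 1 to 3 in the list above can be sketched in the same way. This is only an illustration under stated assumptions: the `NAMES` list, the `make_rows` helper, and the age and balance ranges are hypothetical. The list of names stands in for an array of text values that you might fill once from Faker or from a source data set.

```python
import random

# Hypothetical lookup list of text values. In practice it could be filled once
# from a library such as Faker, or from a source data set, and then stored.
NAMES = ["Alice Smith", "Bob Jones", "Carol White", "Dan Brown", "Eve Black"]


def make_rows(seed, count=5, age_range=(18, 90), balance_range=(0.0, 1000.0)):
    """Build `count` (name, age, balance) rows that depend only on `seed`."""
    rng = random.Random(seed)
    age_low, age_high = age_range
    bal_low, bal_high = balance_range
    rows = []
    for _ in range(count):
        # Integer: scale a float in [0, 1) to the number of integers in the range
        age = age_low + int(rng.random() * (age_high - age_low + 1))
        # Float: scale to the width of the range, then shift by the lower bound
        balance = bal_low + rng.random() * (bal_high - bal_low)
        # Text: turn the random number into an index into the list of values
        name = NAMES[int(rng.random() * len(NAMES))]
        rows.append((name, age, balance))
    return rows


for row in make_rows(seed=10):
    print(row)
```

The standard library also offers `rng.randint(low, high)`, `rng.uniform(low, high)` and `rng.choice(NAMES)` as shortcuts for these three steps. Whichever form you use, the values depend on the order in which they are drawn, so adding or reordering columns changes the generated data even with the same seed.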