Question 1 - How do I create synthetic data myself?
You can generate synthetic data yourself with the following approach.
1. List all the column types in your table / database - integer, long, decimal, string, etc.
2. Use a random number generator as the basis for generating synthetic data
3. Integer types - Scale the generated random number into an integer range. If the generator returns a fraction between 0 and 1, multiply it by the width of the range, add the minimum, and truncate to an integer. The range is decided based on your needs (for example, you could take the range from a source of data if you have one).
4. Decimal / float / double - Use the same approach as above, with range bounds that are decimal / float / double values and without truncating.
5. Text - Use a library such as Faker and specify the type of text (such as name, address, or phone). Alternatively, maintain an array of values for each type and use the same seeded random number generator to index into the array of text values.
6. Dates/Times - Start from a base date-time and use the random number generator to add an offset in hours. Choose the multiplier applied to the random number based on how wide you want the range of generated date-time values to be.
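The steps above can be sketched in a few lines of Python. This is only an illustration under assumptions, not a production generator: the column names (age, balance, name, city, created_at), the ranges, the lookup arrays, and the base date are all made up for the example, and the text columns use the array-indexing approach rather than Faker so that the sketch needs nothing beyond the standard library.

```python
import random
from datetime import datetime, timedelta

# Hypothetical lookup arrays for text columns (a library such as Faker
# could supply these values instead).
FIRST_NAMES = ["Ana", "Bilal", "Chen", "Dara", "Elif"]
CITIES = ["Austin", "Lagos", "Osaka", "Lyon", "Pune"]


def make_row(rng, base_date, int_range=(18, 90), dec_range=(0.0, 5000.0),
             max_offset_hours=24 * 365):
    """Build one synthetic row from a seeded random.Random instance."""
    lo, hi = int_range
    # Integer: scale a fraction in [0, 1) to the inclusive range [lo, hi].
    age = lo + int(rng.random() * (hi - lo + 1))

    # Decimal: same idea, with float bounds and no truncation.
    d_lo, d_hi = dec_range
    balance = round(d_lo + rng.random() * (d_hi - d_lo), 2)

    # Text: use the random number to index into an array of values.
    name = FIRST_NAMES[int(rng.random() * len(FIRST_NAMES))]
    city = CITIES[int(rng.random() * len(CITIES))]

    # Date/time: base date-time plus a random offset in whole hours.
    offset_hours = int(rng.random() * max_offset_hours)
    created_at = base_date + timedelta(hours=offset_hours)

    return {"age": age, "balance": balance, "name": name,
            "city": city, "created_at": created_at}


def make_rows(n, seed=42):
    """Generate n rows; the same seed always yields the same rows."""
    rng = random.Random(seed)
    base_date = datetime(2024, 1, 1)  # assumed base date for the example
    return [make_row(rng, base_date) for _ in range(n)]


if __name__ == "__main__":
    for row in make_rows(3):
        print(row)
```

Using a single seeded generator for every column keeps the output repeatable, which helps when a test needs the same synthetic rows on every run.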
Question 2 - When should I use an existing tool to generate synthetic data?
1. The higher your bar for data privacy and security, the more useful it is to find a tool that is designed to meet that bar.
2. If you have larger databases that require synthetic data, a tool will usually be much faster than a generator you write yourself.
3. Multiple databases with different schemas implies complexity in writing your own generator - a tool will handle all of the schemas and column types
4. Ability to maintain your own synthetic data generator code base - As your organization's needs for synthetic data grow, the effort required to maintain your own code base grows.
Question 3 - What are some technical considerations while generating synthetic data?
1. Data filters and WHERE clauses - While generating synthetic data, it is useful to target certain rows for testing based on filters / WHERE clauses on tables.
2. Data distributions - Depending on your use case (testing, machine learning), you may need the distribution of the synthetic data to be random, to be similar to your source database, or to come from an AI-based model such as a GAN (generative adversarial network).
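Both considerations can be illustrated with a small Python sketch. It is a simplified example under assumptions: the function names, the "tier" column, and the sample values are hypothetical, the filter is a plain Python predicate standing in for a WHERE clause, and "similar to your source database" is approximated by sampling categories in proportion to how often they appear in the source column. A GAN-based approach is out of scope here.

```python
import random
from collections import Counter


def sample_uniform(source_values, n, seed=0):
    """Random distribution: every distinct source value is equally likely."""
    rng = random.Random(seed)
    categories = sorted(set(source_values))
    return [rng.choice(categories) for _ in range(n)]


def sample_like_source(source_values, n, seed=0):
    """Source-like distribution: sample in proportion to source frequencies."""
    rng = random.Random(seed)
    counts = Counter(source_values)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


def filter_rows(rows, predicate):
    """Keep only rows matching a predicate, like a WHERE clause would."""
    return [row for row in rows if predicate(row)]


if __name__ == "__main__":
    # Hypothetical source column: mostly "silver", few "gold".
    source_tiers = ["gold"] * 1 + ["silver"] * 9

    print(Counter(sample_uniform(source_tiers, 1000)))
    print(Counter(sample_like_source(source_tiers, 1000)))

    # Target only the rows a test cares about: WHERE tier = 'gold'.
    rows = [{"id": i, "tier": t}
            for i, t in enumerate(sample_like_source(source_tiers, 20))]
    print(filter_rows(rows, lambda row: row["tier"] == "gold"))
```

Which sampler fits depends on the use case: uniform sampling tends to exercise rare values more often in tests, while frequency-weighted sampling keeps the synthetic column closer in shape to the source.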
Download a free synthetic data generator below that generates synthetic data across SQL Server, Snowflake, PostgreSQL, MySQL, Azure SQL, Amazon Aurora and more.