DevOps for Synthetic Data
Getting the right data sets, and a repeatable, agile process for using data in dev/test workflows, has historically been a manual, labor-intensive effort, due both to the difficulty of creating representative subsets of the data and to the challenge of generating data sets that are statistically similar to real-world data.
With Windocks, your data scientists, dev, and test teams can get golden images of synthetic data based on production data sets faster and earlier in the process, supporting the trend to “shift left.”
Windocks provides a pluggable architecture to automate synthetic data generation:
- Securely access production data sets
- Subset the data to the desired size for dev and test
- Model the data using either Windocks' built-in synthetic data generation (based on SDV) or your own custom code
- Validate the analytical distributions and privacy metrics of the resulting data sets using built-in Windocks metrics or your own custom code
- Deploy synthetic data sets to target SQL instances or Docker database containers, giving dev and test teams immediate access and making experiments reproducible
Efficiency across disciplines
At all phases of the R&D process, Windocks is optimized for efficiency across the disciplines of data governance, data science, dev, test, and ops.
- separation of roles across production database availability, configuration of masking and subset rules, data modeling, data generation, deployment, and test data access
- pluggable architecture for custom code for synthetic data generation and analytical validations
- automate tasks through configuration, not code, utilizing standard Docker APIs and commands
- ad hoc control through a web interface
- integration with source control (git) and build (Jenkins) systems used in the DevOps pipeline, as well as off-the-shelf templates for use with cloud pipelines such as Azure DevOps Pipelines
- native SQL database integration throughout the pipeline to best integrate the efforts of infrastructure engineers, data engineers, and data scientists
- native use of database containers for automation, storage and compute efficiency, as well as avoidance of VM sprawl
Steps to orchestrate synthetic data generation
1. Install Windocks
The initial beta release supports SQL Server and PostgreSQL for Windows. After you have received your beta download link and license key, install Windocks Subset & Synthesize on a Windows machine.
For Linux installations, a DockerHub version will be available in later phases of the beta.
The Windocks server should have network access to the source database which you'd like to subset.
2. Point Windocks at the source database
Best practice often calls for a "data governance" database environment to act as the master for all dev, test, and other lower-level environments. Point Windocks to this data governance environment (often an existing SQL Server instance).
Windocks will connect to this source to subset and synthesize data.
In the free Community Edition, Windocks writes the data subsets and the synthesized data sets back to the source. In the paid editions, Windocks supports writing subsetted and synthesized data sets to other SQL instances or even to Windocks or native Docker containers.
3. Configure a subset/synthesize specification
The specification defines:
- the database in the source instance to subset or generate synthetic data from
- the size of the database subset, specified in percentage of source database size or as maximum number of rows
- the database tables to subset, as well as those to "pass through" to the target data set
- any database fields to treat as "sensitive data" that are not automatically identified by the wizard
- the data columns to which statistical modeling should be applied, based on the source data
- the statistical model for synthetic data generation or whether to simply generate a subset only
This specification can be built via the Web UI or via a JSON specification for automation.
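For illustration, a JSON specification covering the elements above might look like the following sketch. The field names and structure here are hypothetical, not the documented Windocks schema; consult the product documentation for the actual format.

```json
{
  "source": {
    "instance": "SQLPROD01",
    "database": "CustomerDB"
  },
  "subset": {
    "maxPercentOfSource": 10,
    "tables": ["Customers", "Orders"],
    "passThroughTables": ["CountryCodes"]
  },
  "sensitiveFields": ["Customers.SSN", "Customers.Email"],
  "synthesize": {
    "model": "sdv",
    "columns": ["Customers.Age", "Orders.Amount"]
  }
}
```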
4. "Run" the specification
Once you have configured the specification for subsetting and synthetic data, you can run it interactively from the web UI, or you can automate these "runs" using the REST API.
Once the specification is run, you can access the subsetted and/or synthesized data from the target data source.
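To give a feel for scripting a run, the sketch below drives a REST API from Python using only the standard library. The server address, `/runs` endpoint, and request fields are assumptions for illustration, not the documented Windocks API.

```python
import json
import urllib.request

# Hypothetical Windocks server address; adjust for your environment.
WINDOCKS_URL = "http://localhost:3000"

def build_run_request(spec_name):
    """Build the JSON body for triggering a specification run.
    The field names are illustrative, not the documented API."""
    return json.dumps({
        "specification": spec_name,
        "mode": "subset+synthesize",
    }).encode()

def run_specification(spec_name):
    """POST the run request to a hypothetical /runs endpoint
    and return the server's JSON response."""
    req = urllib.request.Request(
        f"{WINDOCKS_URL}/runs",
        data=build_run_request(spec_name),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

A CI job (for example, a Jenkins stage) could call `run_specification("CustomerDB")` after each build, so fresh subsetted or synthesized data lands in the target instance automatically.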
5. Validate the synthetic data set
Prior to widespread delivery, you can validate the synthetic data set manually from the web UI. The Validate screen shows statistics on the original database size versus the subsetted or synthetic data size, as well as statistical profiles of values in both the source and target data.
In addition, Windocks computes a privacy score (indexed from 0 to 1). It does so by fitting an adversarial attacker model on the synthetic data to predict sensitive attributes from “key” attributes, then evaluating that model's accuracy on the real data.
When scripting a run, the REST API can be used to automatically validate the synthetic data set against defined thresholds and return the results of these validations.
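The attacker-based privacy score described above can be sketched in a few lines of pure Python. This is a simplified illustration of the idea (similar in spirit to SDV's categorical privacy metrics), not the Windocks implementation: the attacker memorizes, from the synthetic data, the most common sensitive value per key-attribute combination, and the score is one minus its hit rate on the real rows.

```python
from collections import Counter, defaultdict

def fit_attacker(synthetic_rows):
    """Learn, from synthetic (key, sensitive) pairs, the most common
    sensitive value for each key combination -- a simple adversarial model."""
    by_key = defaultdict(Counter)
    for key, sensitive in synthetic_rows:
        by_key[key][sensitive] += 1
    return {k: c.most_common(1)[0][0] for k, c in by_key.items()}

def privacy_score(synthetic_rows, real_rows):
    """1.0 = attacker never recovers a real sensitive value; 0.0 = always does."""
    attacker = fit_attacker(synthetic_rows)
    hits = sum(1 for key, sensitive in real_rows if attacker.get(key) == sensitive)
    return 1.0 - hits / len(real_rows)

# Toy example: key = (zip_code, age_band), sensitive = diagnosis.
synthetic = [(("94110", "30-39"), "flu"), (("94110", "30-39"), "flu"),
             (("10001", "40-49"), "cold")]
real = [(("94110", "30-39"), "flu"), (("10001", "40-49"), "asthma")]
print(privacy_score(synthetic, real))  # 0.5: the attacker recovers 1 of 2 rows
```

A score near 1 suggests the synthetic data leaks little about individual real records; a score near 0 suggests key attributes strongly reveal sensitive values and the specification should be revisited.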