Database Subsetting and Synthetic Data

Subsetted databases populated with synthetic data are important for cross database analytics.  

Database subsetting is complex due to several challenges:

  1. When selecting a subset of rows from a table, you must also get the rows of related tables.

  2. A relational constraint may span multiple columns (composite keys). In this case, the subset must fetch the rows of related tables by matching on all of the constrained columns together.

  3. Some tables have no foreign key columns of their own but are referenced by foreign keys in other tables.

  4. Cyclical dependencies: table A has a foreign key to table B, and table B has a foreign key back to table A.
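
The four challenges above can be illustrated with a small sketch. This is not LangGrant's implementation; it is a minimal in-memory example, and the table names, the `FOREIGN_KEYS` layout, and the `subset_closure` function are all hypothetical. It follows foreign keys in both directions (to referenced parents and to referencing children), compares composite keys as whole tuples, and uses a visited set so that cycles terminate. It only reads the source rows and never modifies them.

```python
from collections import deque

# Hypothetical source data: table name -> list of rows (dicts).
TABLES = {
    "customers": [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}],
    "orders": [
        {"order_no": 10, "region": "EU", "customer_id": 1},
        {"order_no": 10, "region": "US", "customer_id": 2},
    ],
    "order_lines": [
        {"line": 1, "order_no": 10, "region": "EU"},
        {"line": 2, "order_no": 10, "region": "US"},
    ],
    # departments and employees reference each other (a cycle).
    "departments": [{"id": 1, "manager_id": 1}],
    "employees": [{"id": 1, "dept_id": 1}],
}

# Hypothetical layout: (child table, child columns, parent table, parent columns).
FOREIGN_KEYS = [
    ("orders", ("customer_id",), "customers", ("id",)),
    # Composite key: both columns must match together.
    ("order_lines", ("order_no", "region"), "orders", ("order_no", "region")),
    ("employees", ("dept_id",), "departments", ("id",)),
    ("departments", ("manager_id",), "employees", ("id",)),
]


def key(row, cols):
    """Return the values of the given columns as a tuple."""
    return tuple(row[c] for c in cols)


def subset_closure(tables, foreign_keys, seeds):
    """Return {table: set of row indexes} reachable from the seed rows."""
    selected = {name: set() for name in tables}
    queue = deque()
    for table, idx in seeds:
        if idx not in selected[table]:
            selected[table].add(idx)
            queue.append((table, idx))

    while queue:
        table, idx = queue.popleft()
        row = tables[table][idx]
        lookups = []
        for child, child_cols, parent, parent_cols in foreign_keys:
            if table == child:   # follow the key to the referenced parent rows
                lookups.append((parent, parent_cols, key(row, child_cols)))
            if table == parent:  # pull in child rows that reference this row
                lookups.append((child, child_cols, key(row, parent_cols)))
        for target, cols, value in lookups:
            if None in value:
                continue  # NULL key value: nothing to follow
            for j, other in enumerate(tables[target]):
                # Rows already selected are skipped, so cycles terminate.
                if j not in selected[target] and key(other, cols) == value:
                    selected[target].add(j)
                    queue.append((target, j))
    return selected


result = subset_closure(TABLES, FOREIGN_KEYS, [("customers", 0)])
```

Seeding with the first customer selects that customer's order and only the order line whose `(order_no, region)` pair matches. Matching on `order_no` alone would wrongly pull in both order lines.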

LangGrant solves these challenges with a simple visual interface. Start by specifying a percentage of the source database size, with or without bias controls. Windocks subsetting does not write to the source database (the one being subsetted).

Challenges in cross database analytics

Joining data between databases is challenging because column-level data rarely matches exactly. Fortunately, subsetting can reduce a multi-terabyte database to megabytes in size while retaining full relational integrity. Downsized databases are then populated with synthetic data to give LLMs the safe context they need for LLM-generated cross database analytics.
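
As a rough illustration of populating a downsized table with synthetic data, the sketch below keeps key columns unchanged so that relationships between tables still hold, and replaces every other value with a randomly generated one of the same basic type. This is an assumption about one simple way to do it, not a description of LangGrant's generator; the `synthesize` function and its parameters are hypothetical.

```python
import random
import string


def synthesize(rows, key_columns, seed=0):
    """Return new rows with key columns preserved and other values replaced.

    Hypothetical helper: integers become random integers, and any other
    value becomes a random lowercase string of the same length.
    """
    rng = random.Random(seed)  # seeded so the output is repeatable
    out = []
    for row in rows:
        new_row = {}
        for col, val in row.items():
            if col in key_columns:
                new_row[col] = val  # keep keys so joins still line up
            elif isinstance(val, int):
                new_row[col] = rng.randint(0, 9999)
            else:
                text = str(val)
                new_row[col] = "".join(
                    rng.choice(string.ascii_lowercase) for _ in text
                )
        out.append(new_row)
    return out


customers = [{"id": 1, "name": "Alice Smith", "balance": 1200}]
fake_customers = synthesize(customers, key_columns={"id"})
```

The input rows are left untouched; the function builds new dictionaries rather than editing the originals.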

Cross database joins

An LLM provided with a safe, synthetically populated database context is better able to specify a join strategy. LangGrant supports a range of joins, including fuzzy, distance, and exact match.
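
To make the three join styles concrete, here is a minimal sketch using only the Python standard library. It is not LangGrant's API. The `join`, `exact`, `fuzzy`, and `within_distance` helpers are hypothetical, "fuzzy" is interpreted here as a similarity ratio from `difflib.SequenceMatcher`, and "distance" is interpreted as Levenshtein edit distance. Both interpretations are assumptions.

```python
from difflib import SequenceMatcher


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def exact(a, b):
    """Exact match: values must be identical."""
    return a == b


def fuzzy(threshold=0.8):
    """Fuzzy match: case-insensitive similarity ratio at or above the threshold."""
    def match(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
    return match


def within_distance(max_edits=1):
    """Distance match: case-insensitive edit distance at or below max_edits."""
    def match(a, b):
        return edit_distance(a.lower(), b.lower()) <= max_edits
    return match


def join(left, right, left_col, right_col, match):
    """Nested-loop join that pairs rows whose column values satisfy `match`."""
    return [(l, r) for l in left for r in right
            if match(l[left_col], r[right_col])]


# Hypothetical rows from two different databases.
crm = [{"name": "Acme Corp"}, {"name": "Globex"}]
billing = [{"customer": "ACME Corp."}, {"customer": "Globex"},
           {"customer": "Initech"}]

exact_pairs = join(crm, billing, "name", "customer", exact)
fuzzy_pairs = join(crm, billing, "name", "customer", fuzzy(0.8))
distance_pairs = join(crm, billing, "name", "customer", within_distance(1))
```

With this data the exact join pairs only the two "Globex" rows, while the fuzzy and distance joins also pair "Acme Corp" with "ACME Corp.". A nested loop is fine for small subsetted tables, which is one reason a downsized database is a convenient place to try out join strategies.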

Explore other LangGrant capabilities

Orchestration: automated database context

LEDGE automatically delivers complete database context so that LLMs can comprehend multiple databases simultaneously at scale. Like a skilled engineer, an LLM that understands the databases can contribute to solution design.

Orchestration: analytic plan

LEDGE binds LLMs to deliver accurate analytic plans for user queries. Plans are saved, can be easily validated and modified, and are run to deliver analytics data within minutes of the user query.

Governance

PII safeguards, authorization controls, data residency rules, firewall restrictions, and token-governance policies are built in by design. No sensitive data leaves governed systems.

Plan management

LLM-generated plans are saved, easily reviewed and validated, modified, and executed, making LLM use transparent, explainable, and repeatable.

Database cloning and containers

On-demand database clones with containers provide agent developers with production database copies (with optional masking) for agentic AI dev/test.

Database subsetting and synthetic data 

Database subsetting with synthetic data provides added context for working with complex multi-database environments.
