Generate synthetic data using random generators, algorithms, statistical models, and
Large Language Models (LLMs) to simulate real data for developing and testing solutions
effectively.
Synthetic data can be a powerful tool when developing and testing solutions, especially
when actual data doesnโt yet exist or isnโt allowed to be used. Synthetic, or
artificially generated, data can have many of the characteristics of real data.
Synthetic data is typically created using random generators, algorithms, or statistical
models to simulate the characteristics and distributions of real data. However, this can
be complex to produce or rely on tools with features of varying sophistication. With the
availability of Large Language Models (LLMs), more relevant and schema-specific data may
be generated that considers characteristics expressed in natural language.
Topics
Benefits of Synthetic Data Generation Synthetic data generation enables populating database metadata clones, supporting development, testing, and machine learning projects without using sensitive data from original tables.
Generate Synthetic Data Use DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA function to generate synthetic data and query the data using Select AI actions.
Monitor and Troubleshoot Synthetic Data Generation When generating large amounts of data across many tables, Select AI splits synthetic data generation tasks into smaller chunks and runs tasks in parallel. The status of each chunk is tracked in the SYNTHETIC_DATA$<operation_id>_STATUS table.
Synthetic data generation enables populating
database metadata clones, supporting development, testing, and machine learning projects
without using sensitive data from original tables.
Synthetic Data Generation offers the following benefits:
Populating metadata clones with synthetic data: A metadata
clone replicates the structure of a database or schema without including
actual data. Select AI allows synthetic data generation to populate these
clones, protecting sensitive data while enabling development, testing, and
creating templates. This approach supports performance and scalability
testing.
Starting new projects: When starting a new project,
actual data may not be available. Synthetic data provides realistic samples
to help demonstrate concepts and gain support for project proposals.
Validating user experience: Synthetic data aids in testing
user interfaces by providing diverse data sets to uncover design flaws,
performance, and scalability issues.
Supporting AI and machine learning projects: Synthetic
data is useful for training AI and machine learning models when real data is
unavailable or restricted. LLMs can generate data with specific patterns to
facilitate model training and scoring.
Monitor and Troubleshoot
Synthetic Data Generation ๐
When generating large amounts of data across many tables,
Select AI splits synthetic data generation tasks into smaller chunks and runs tasks in
parallel. The status of each chunk is tracked in the SYNTHETIC_DATA$<operation_id>_STATUS table.
Synthetic data generation operations are logged in the tables
DBA_LOAD_OPERATIONS and USER_LOAD_OPERATIONS. Use
these tables to monitor the DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA
operation. See Track DBMS_CLOUD Load Operations for more details.
After running synthetic data generation in a given session, you can get the
latest <operation_id> from
USER_LOAD_OPERATION using the
following:
SELECT max(id) FROM user_load_operations;
To view the synthetic data operations running in a different session, use the
DBA_LOAD_OPERATIONS view.
View Status of Synthetic Data
Generation
The status table of synthetic data generation operation shows the
progress of each table and its corresponding chunk. The
STATUS_TABLE column in USER_LOAD_OPERATIONS or
DBA_LOAD_OPERATIONS shows the status table name. The table name
is SYNTHETIC_DATA$<operation_id>_STATUS and it has following
columns:
Name
Datatype
Description
ID
NUMBER
Unique identifier of the record.
NAME
VARCHAR2
Qualified name of the table, such as
"ADB_USER"."EMPLOYEES"
BYTES
NUMBER
Desired number of records for this data generation
task
ROWS_LOADED
NUMBER
Actual number of records generated.
CHECKSUM
VARCHAR2
Starting value for the primary key during this data
generation task.
LAST_MODIFIED
TIMESTAMP WITH TIME ZONE
Timestamp indicating when the record was last
modified.
STATUS
VARCHAR2
Status of the data generation task. The valid values
are:
COMPLETED
FAILED
PENDING
SKIPPED
ERROR_CODE
NUMBER
Error code, if the data generation task
fails.
ERROR_MESSAGE
VARCHAR2
Error message provided if the task fails.
END_TIME
TIMESTAMP WITH TIME ZONE
Timestamp marking the end of the data generation
task.
Example: Check the Number of Records Generated for Each Table
To check the number of records generated for each table, issue the
following:
SELECT name, SUM(rows_loaded) FROM synthetic_data$<operation_id>_status group by name;
Query ROWS_LOADED to confirm how many number of rows are
loaded for each chunk, and SUM(ROWS_LOADED) for rows for each
table.
BEGIN
DBMS_CLOUD_AI.GENERATE_SYNTHETIC_DATA(
profile_name => 'GENAI',
object_list => '[{"owner": "ADB_USER", "name": "DIRECTOR","record_count":150},
{"owner": "ADB_USER", "name": "MOVIE_ACTOR","record_count":300},
{"owner": "ADB_USER", "name": "CLASSES", "user_prompt":"all in fall semester","record_count":5},
{"owner": "ADB_USER", "name": "ACTOR","record_count":220},
{"owner": "ADB_USER", "name": "MOVIE","record_count":50}]'
);
END;
/
-- Check loaded rows for each chunk
SQL> SELECT name, rows_loaded FROM synthetic_data$141_status order by name;
NAME ROWS_LOADED
------------------------------------------
"ADB_USER"."ACTOR" 188
"ADB_USER"."ACTOR" 32
"ADB_USER"."CLASSES" 5
"ADB_USER"."DIRECTOR" 150
"ADB_USER"."MOVIE" 50
"ADB_USER"."MOVIE_ACTOR" 38
"ADB_USER"."MOVIE_ACTOR" 114
"ADB_USER"."MOVIE_ACTOR" 148
-- Check loaded rows for each table
SQL> SELECT name, SUM(rows_loaded) FROM synthetic_data$141_status group by name;
NAME SUM(ROWS_LOADED)
------------------------------------------
"ADB_USER"."DIRECTOR" 150
"ADB_USER"."MOVIE_ACTOR" 300
"ADB_USER"."CLASSES" 5
"ADB_USER"."ACTOR" 220
"ADB_USER"."MOVIE" 50