Training Data in Generative AI
Here are guidelines for creating training data for fine-tuning the pretrained models in OCI Generative AI. A custom model can be fine-tuned with only one dataset, which the system automatically splits into 80% training and 20% validation data. The dataset must be a JSONL file containing at least 32 prompt/completion pairs, with each line formatted as: {"prompt": "<your prompt>", "completion": "<expected response>"}. Save the file in an OCI Object Storage bucket and reference it when creating the custom model.
Dataset Requirements
Datasets for training custom models have the following requirements:
- A maximum of one fine-tuning dataset is allowed per custom model. This dataset is randomly split in an 80:20 ratio for training and validation.
- Each file must have at least 32 prompt/completion pair examples.
- The file format is JSONL.
- Each line in the JSONL file has the following format: {"prompt": "<a prompt>", "completion": "<expected response given the prompt>"}\n
- The file must be stored in an OCI Object Storage bucket.
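The requirements above can be sketched as a small helper that writes prompt/completion pairs to a JSONL file. This is a minimal illustration, not part of the OCI tooling: the function name `write_jsonl`, the constant `MIN_EXAMPLES`, and the choice to reject datasets below the minimum before writing are all assumptions made for this example.

```python
import json

# Minimum number of prompt/completion pairs stated in the requirements above.
MIN_EXAMPLES = 32


def write_jsonl(pairs, path):
    """Write (prompt, completion) pairs to a UTF-8 JSONL file, one object per line.

    Hypothetical helper for illustration; returns the number of lines written.
    """
    pairs = list(pairs)
    if len(pairs) < MIN_EXAMPLES:
        raise ValueError(f"need at least {MIN_EXAMPLES} examples, got {len(pairs)}")
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for prompt, completion in pairs:
            record = {"prompt": prompt, "completion": completion}
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(pairs)
```

Building each line with `json.dumps` rather than string formatting keeps quotes and line breaks inside a prompt or completion correctly escaped. The resulting file still has to be uploaded to an Object Storage bucket separately.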
JSONL Format
About JSONL
A JSONL file contains a new JSON value or object on each line. Unlike a regular JSON file, the file isn't evaluated as a whole. Instead, each line is treated as if it is a separate JSON file. This format is ideal for storing a set of inputs in JSON format.
The OCI Generative AI service accepts a JSONL file for fine-tuning custom models in the following format:
{"prompt": "<first prompt>", "completion": "<expected completion given first prompt>"}
{"prompt": "<second prompt>", "completion": "<expected completion given second prompt>"}
. . .
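To illustrate the point that each line is its own JSON document, the following sketch parses JSONL text line by line with Python's standard `json` module. The function name `read_jsonl` and the sample prompts are made up for this example; how the service itself reads the file is not described here.

```python
import json


def read_jsonl(text):
    """Parse JSONL text: every non-empty line is decoded as a separate JSON document."""
    records = []
    # Split on "\n" only, since that is the record separator in JSONL.
    for line in text.split("\n"):
        if line.strip():
            records.append(json.loads(line))
    return records


sample = (
    '{"prompt": "first prompt", "completion": "first completion"}\n'
    '{"prompt": "second prompt", "completion": "second completion"}\n'
)

records = read_jsonl(sample)
print(len(records), records[1]["completion"])  # prints: 2 second completion
```

Passing the whole `sample` string to `json.loads` raises `json.JSONDecodeError`, because two objects in a row do not form a single valid JSON document. That is the practical difference between a JSONL file and a regular JSON file.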
JSONL Example
Ensure that each JSONL dataset file that you create for Generative AI has the following properties:
- The file is UTF-8 encoded.
- Each line item contains a valid JSON object.
- Each JSON object has two properties: "prompt" and "completion".
- Each JSON object is entered in a new line or followed by a newline character (\n).
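The properties listed above can be checked locally before uploading. The following is a minimal sketch under stated assumptions: `validate_jsonl` and `REQUIRED_KEYS` are hypothetical names, and treating any extra or missing property as an error is this example's interpretation of "two properties", not a documented rule of the service.

```python
import json

# Assumption: each object must have exactly these two properties.
REQUIRED_KEYS = {"prompt", "completion"}


def validate_jsonl(data: bytes):
    """Return a list of problems found in raw JSONL bytes; an empty list means no problems."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"file is not valid UTF-8: {exc}"]

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece that follows the final newline

    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(obj, dict):
            problems.append(f"line {number}: expected a JSON object")
        elif set(obj) != REQUIRED_KEYS:
            problems.append(f"line {number}: properties must be 'prompt' and 'completion'")
    return problems
```

A typical use is to read the file in binary mode, for example `validate_jsonl(open("train.jsonl", "rb").read())`, and upload only if the returned list is empty. The file name here is a placeholder.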
After you create the JSONL file, add your dataset to an Object Storage bucket.