Custom Model Datasets
Review the supported datasets for custom models, and how to convert datasets into a supported format.
Allowed Datasets for Custom Text Classification
You can provide labeled data for custom text classification models in two ways:
- Data Labeling projects
- Comma-Separated Value (.csv) files
CSV File Requirements
- The first line must be a header containing the following two column names:
  - text: captures the text to be classified.
  - labels: captures one or more assigned classes. For multi-label classification datasets, specify multiple class names by joining them with the | symbol.
- All lines after the header line contain training records.
- If the file has more than two columns, only the text and labels columns are used to train the model.
- For the CSV file encoding, use UTF-8. When using Excel, save the file as CSV UTF-8 (Comma-delimited) (.csv).
- For the delimiter, use a comma (,).
- For the escape character, use a double quote ("), also known as the Unicode character U+0022.
  For example, in Excel, if you type the following text:
  This is a "double quote" sentence
  The preceding sentence is stored in the CSV as follows:
  "This is a ""double quote"" sentence"
Example CSV file for single label Text Classification:
text,labels
Windows OS -unable to print,Network Printer Failure
Citrix Account frequently locking,Account (Password reset)
Pull print queue not working ,Application Component Disconnect
wifi disable and lan is disconnected at the desktop,Hardware Device Failure
Example CSV file for multi label Text Classification:
Windows OS -unable to print,Network Printer Failure
Pull print queue not working ,Application Component Disconnect|Network Printer Failure
wifi disable and lan is disconnected at the desktop,Hardware Device Failure|Network Connection Issue
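The following sketch shows one way to produce a CSV that meets these requirements with Python's csv module, which handles the UTF-8 encoding, comma delimiter, and double-quote escaping described above. The file name, records, and label names are placeholders, not values required by the service.

# Sketch: write a training CSV for custom text classification (placeholder records).
import csv

records = [
    ("Windows OS -unable to print", "Network Printer Failure"),
    ("Citrix Account frequently locking", "Account (Password reset)"),
    # For multi-label records, join class names with the | symbol:
    ("Pull print queue not working", "Application Component Disconnect|Network Printer Failure"),
]

# newline="" lets the csv module manage line endings; utf-8 satisfies the encoding requirement.
with open("training_data.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter=",", quotechar='"')  # quotes inside fields are doubled automatically
    writer.writerow(["text", "labels"])  # required header row
    writer.writerows(records)            # one training record per line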
Allowed Dataset Formats for Custom NER
You can provide labeled data for custom NER models in two ways:
- Data Labeling projects
- JSON Lines format (.jsonl)
JSON File Requirements
The JSON file doesn't include the training data. Instead, the JSON file is a manifest file that contains labels and pointers (relative paths) to files with unlabeled data.
The file uses the JSON Lines (JSONL) format, where each line is a single JSON object:
- The first line describes the set of labels or classes and the type of annotation file.
- Each subsequent line describes a training record.
- Save all the text files in the same directory as the manifest (.jsonl) file, and reference them by name in the training records.
Schema Definition
- The first line is a header line. It contains a JSON object that describes the file type.
- Any subsequent line contains a JSON object that represents a labeled record.
Header Line Format
| Field | Type | Description |
| --- | --- | --- |
| labelsSet | Array of objects | Each object has a string member, "name", that indicates an entity supported for annotation. List all entities here. |
| annotationFormat | String | Use "ENTITY_EXTRACTION" for NER datasets. |
| datasetFormatDetails | Object | Object with a string member, "formatType", that indicates the type of data being annotated. Set the value of formatType to "TEXT" for Language. |

Example JSON Schema:
{ "labelsSet": [ { "name": "Label1" }, { "name": "Label2" }, { "name": "Label3" }, { "name": "Label4" } ], "annotationFormat": "ENTITY_EXTRACTION", "datasetFormatDetails": { "formatType": "TEXT" } }
Labeled Record Format
| Field | Type | Description |
| --- | --- | --- |
| sourceDetails | Object | Object with a string member, "path", that points to the file being annotated. The file path is relative to the location of the JSON file. |
| annotations | Object | Complex object that describes the annotations. |
| entities | Array (Objects) | A list of the entities identified in the record. |
| entityType | String | The type of entity annotation. For the value, use "TEXTSELECTION" for NER. |
| labels | Array (Objects) | Each object in the array has the member, "label_name", that represents the type of entity identified. |
| textSpan | Object | An object that represents the text span. Contains two required numeric members: "offset" and "length". |

JSON Schema for Labeled Record Format Example:
{ "sourceDetails": { "path": "Complaint3.txt" }, "annotations": [ { "entities": [ { "entityType": "TEXTSELECTION", "labels": [ { "label_name": "Label1" }, { "label_name": "Label2" } ], "textSpan": { "offset": 0, "length": 28 } }, { "entityType": "TEXTSELECTION", "labels": [ { "label_name": "Label1" } ], "textSpan": { "offset": 196, "length": 11 } } ] } ] }
Uploading the Datasets
Upload datasets into Object Storage buckets.
Creating a Bucket
If you have an Object Storage bucket for datasets, then skip this section.
- Open the navigation menu and click Storage. Under Object Storage & Archive Storage, click Buckets.
- Under List Scope, in the Compartment list, click the name of the compartment where you want to create a bucket. You must already have permission to add Object Storage resources to this compartment.
- Click Create Bucket.
- Enter a name for the bucket, unique to the region.
- For other fields, click the Learn More links and then choose options that apply to the data.
- Click Create. By default, buckets have Private Visibility unless you change their visibility after you create them.
You must have unique bucket names within a namespace. While the namespace is region-specific, the namespace name itself is the same in all regions. For example, if the tenancy is assigned a namespace name of <your-namespace>, that is the namespace name in all regions.
You can create a bucket named MyBucket in US West (Phoenix). You can't create another bucket named MyBucket in US West (Phoenix). You can, however, create a bucket named MyBucket in Germany Central (Frankfurt). Because the namespace name is unique to a tenant, other users can create buckets named MyBucket in their own namespaces.
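If you prefer to script bucket creation rather than use the Console, a minimal sketch with the OCI Python SDK follows; the bucket name and compartment OCID are placeholders, and it assumes a configured ~/.oci/config profile.

# Sketch: create a private Object Storage bucket with the OCI Python SDK
# (bucket name and compartment OCID are placeholders).
import oci

config = oci.config.from_file()                       # assumes a configured ~/.oci/config
client = oci.object_storage.ObjectStorageClient(config)

namespace = client.get_namespace().data               # the tenancy's Object Storage namespace
details = oci.object_storage.models.CreateBucketDetails(
    name="my-language-datasets",                      # must be unique within the namespace
    compartment_id="ocid1.compartment.oc1..example",  # placeholder compartment OCID
)
bucket = client.create_bucket(namespace, details).data
print("Created bucket:", bucket.name)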
Adding Data to a Bucket
After you create a bucket, add your datasets to the bucket. If your datasets are already in a bucket, then skip this section.
You store files as objects in buckets. An object is composed of the data itself and metadata about the object.
- Open the navigation menu and click Storage. Under Object Storage & Archive Storage, click Buckets.
- Under List Scope, in the Compartment list, click the name of the compartment that hosts the bucket.
- Click the name of the bucket where you want to add data.
- Click Upload.
- Upload the data.
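As an alternative to uploading through the Console, the following sketch uploads dataset files with the OCI Python SDK; the bucket name and file names are placeholders.

# Sketch: upload dataset files to an existing bucket with the OCI Python SDK
# (bucket and file names are placeholders).
import oci

config = oci.config.from_file()
client = oci.object_storage.ObjectStorageClient(config)
namespace = client.get_namespace().data

bucket_name = "my-language-datasets"
for file_name in ["manifest.jsonl", "Complaint3.txt", "training_data.csv"]:
    with open(file_name, "rb") as f:
        # Keep the object name equal to the file name so the manifest's relative paths still resolve.
        client.put_object(namespace, bucket_name, file_name, f)
    print("Uploaded", file_name)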