Custom Model Datasets

Review the supported datasets for custom models, and how to convert datasets into a supported format.

Allowed Datasets for Custom Text Classification

You can provide labeled data for custom text classification models in two ways:

  • Data Labeling projects
  • Comma-Separated Value (.csv) files

CSV File Requirements

  • The first line must be a header that contains the following two column names:
    • text: the text to be classified.
    • labels: one or more assigned classes. For multi-label classification datasets, specify multiple class names by joining them with the pipe symbol (|).
  • All lines after the header line contain training records.
  • If the file has more than two columns, only the text and labels columns are used to train the model.
  • For the CSV file encoding, use UTF-8. When using Excel, save the file as CSV UTF-8 (Comma delimited) (.csv).
  • For the delimiter, use a comma (,).
  • For the escape character, use a double quote ("), which is Unicode character U+0022. A field that contains double quotes is enclosed in double quotes, and each embedded double quote is doubled.

    For example, in Excel, if you type the following text:

    This is a "double quote" sentence

    The preceding sentence is stored in the CSV as follows:

    "This is a ""double quote"" sentence"

Example CSV file for single-label text classification:

text,labels
Windows OS -unable to print,Network Printer Failure
Citrix Account frequently locking,Account (Password reset)
Pull print queue not working ,Application Component Disconnect
wifi disable and lan is disconnected at the desktop,Hardware Device Failure

Example CSV file for multi-label text classification:

text,labels
Windows OS -unable to print,Network Printer Failure
Pull print queue not working ,Application Component Disconnect|Network Printer Failure
wifi disable and lan is disconnected at the desktop,Hardware Device Failure|Network Connection Issue
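
The CSV requirements above can be met with Python's standard csv module. The following is a minimal sketch, not an official tool: the function name write_dataset, the sample rows, and the "Example Class" label are hypothetical. It writes the required header, joins multiple class names with |, and relies on the csv module to double any embedded double quotes.

```python
import csv
import io


def write_dataset(rows, stream):
    """Write (text, [class names]) pairs as a training CSV to a text stream."""
    writer = csv.writer(
        stream,
        delimiter=",",          # comma delimiter, as required
        quotechar='"',          # U+0022 is used for quoting
        doublequote=True,       # embedded quotes are written as ""
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writerow(["text", "labels"])  # required header line
    for text, labels in rows:
        for name in labels:
            # A class name containing | could not be told apart from two classes.
            if "|" in name:
                raise ValueError(f"class name contains '|': {name!r}")
        writer.writerow([text, "|".join(labels)])


# Hypothetical sample rows: one single-label record and one multi-label record.
rows = [
    ('This is a "double quote" sentence', ["Example Class"]),
    ("Pull print queue not working",
     ["Application Component Disconnect", "Network Printer Failure"]),
]

buffer = io.StringIO()
write_dataset(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# To write a file instead, open it as UTF-8 and let the csv module manage newlines:
# with open("train.csv", "w", encoding="utf-8", newline="") as f:
#     write_dataset(rows, f)
```

Writing through the csv module rather than joining strings by hand avoids quoting mistakes when the text itself contains commas or double quotes.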

Allowed Dataset Formats for Custom NER

You can provide labeled data for custom NER models in two ways:

  • Data Labeling projects
  • JSON Lines (.jsonl) files

JSON File Requirements

The JSON file doesn't include the training data. Instead, the JSON file is a manifest file that contains labels and pointers (relative paths) to files with unlabeled data.

The manifest uses the JSON Lines (JSONL) format, where each line is a single JSON object:

  • The first line in the file describes the set of labels or classes and the type of annotation file.
  • All subsequent lines each describe a training record.
  • Save all the text files in the same directory as the manifest file (.jsonl), and reference each file by name in its training record.

Schema Definition

  1. The first line is a header line. It contains a JSON object that describes the file type.
  2. Any subsequent line contains a JSON object that represents a labeled record.
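
The manifest layout described above can be checked locally before upload. The following is a minimal sketch under stated assumptions: check_manifest is a hypothetical helper, not part of any OCI tool, and it only verifies that each line parses as one JSON object, that the header has a labelsSet member, and that each record's path resolves to a file next to the manifest.

```python
import json
import os
import tempfile


def check_manifest(manifest_path):
    """Return a list of problems found in a JSONL manifest (empty list if none)."""
    base_dir = os.path.dirname(os.path.abspath(manifest_path))
    with open(manifest_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    if not lines:
        return ["manifest is empty"]

    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not a single JSON object")
            continue
        if not isinstance(obj, dict):
            problems.append(f"line {number}: not a JSON object")
        elif number == 1:
            # Header line: must describe the label set.
            if "labelsSet" not in obj:
                problems.append("line 1: header has no labelsSet")
        else:
            # Record line: its path is relative to the manifest's directory.
            path = obj.get("sourceDetails", {}).get("path")
            if not path:
                problems.append(f"line {number}: no sourceDetails.path")
            elif not os.path.isfile(os.path.join(base_dir, path)):
                problems.append(f"line {number}: file not found: {path}")
    return problems


# Demo in a temporary directory; Missing.txt is deliberately absent.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "Complaint3.txt"), "w", encoding="utf-8") as f:
        f.write("sample text")
    manifest_objects = [
        {"labelsSet": [{"name": "Label1"}],
         "annotationFormat": "ENTITY_EXTRACTION",
         "datasetFormatDetails": {"formatType": "TEXT"}},
        {"sourceDetails": {"path": "Complaint3.txt"}, "annotations": []},
        {"sourceDetails": {"path": "Missing.txt"}, "annotations": []},
    ]
    manifest_path = os.path.join(folder, "train.jsonl")
    with open(manifest_path, "w", encoding="utf-8") as f:
        for obj in manifest_objects:
            f.write(json.dumps(obj) + "\n")
    problems = check_manifest(manifest_path)

print(problems)
```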

Header Line Format

Each entry lists the field, its type, and its description:

  • labelsSet (Array of objects): Each object has a string member, "name", that indicates an entity supported for annotation. List all entities here.
  • annotationFormat (String): Use "ENTITY_EXTRACTION" for NER datasets.
  • datasetFormatDetails (Object): Object with a string member, "formatType", that indicates the type of data being annotated. Set the value of formatType to "TEXT" for Language.

Example header line (shown on multiple lines for readability; in the manifest, the object must occupy a single line):

{
  "labelsSet": [
    {
      "name": "Label1"
    },
    {
      "name": "Label2"
    },
    {
      "name": "Label3"
    },
    {
      "name": "Label4"
    }
  ],
  "annotationFormat": "ENTITY_EXTRACTION",
  "datasetFormatDetails": {
    "formatType": "TEXT"
  }
}
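
Because each manifest line must hold one complete JSON object, the header is easier to generate than to type by hand. The following is a minimal sketch: make_header_line is a hypothetical helper name, and the field names and fixed values are taken from the header line format described above. By default, json.dumps emits no line breaks, which is what JSON Lines needs.

```python
import json


def make_header_line(entity_names):
    """Build the manifest header as a single-line JSON string."""
    header = {
        "labelsSet": [{"name": name} for name in entity_names],
        "annotationFormat": "ENTITY_EXTRACTION",          # value for NER datasets
        "datasetFormatDetails": {"formatType": "TEXT"},   # value for Language
    }
    # Without an indent argument, json.dumps returns the object on one line.
    return json.dumps(header)


header_line = make_header_line(["Label1", "Label2", "Label3", "Label4"])
print(header_line)
```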

Labeled Record Format

Each entry lists the field, its type, and its description. Nesting follows the structure of the example record:

  • sourceDetails (Object): Object with a string member, "path", that points to the file being annotated. The file path is relative to the location of the JSON file.
  • annotations (Object): Complex object that describes the annotations.
    • entities (Array of objects): A list of the entities identified in the record.
      • entityType (String): The type of entity annotation. For the value, use "TEXTSELECTION" for NER.
      • labels (Array of objects): Each object in the array has the member "label_name", which represents the type of entity identified.
      • textSpan (Object): An object that represents the text span. Contains two required numeric members: "offset" and "length".

Example labeled record (shown on multiple lines for readability; in the manifest, the object must occupy a single line):

{
  "sourceDetails": {
    "path": "Complaint3.txt"
  },
  "annotations": [
    {
      "entities": [
        {
          "entityType": "TEXTSELECTION",
          "labels": [
            {
              "label_name": "Label1"
            },
            {
              "label_name": "Label2"
            }
          ],
          "textSpan": {
            "offset": 0,
            "length": 28
          }
        },
        {
          "entityType": "TEXTSELECTION",
          "labels": [
            {
              "label_name": "Label1"
            }
          ],
          "textSpan": {
            "offset": 196,
            "length": 11
          }
        }
      ]
    }
  ]
}
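
Computing offset and length by hand is error-prone, so a record line can be built from the text and the phrases to annotate. The following is a minimal sketch: make_record, the sample sentence, and the file name Ticket1.txt are hypothetical. It also assumes that offsets are zero-based and counted in characters as Python indexes strings; this document does not state the unit, so verify that assumption before relying on it.

```python
import json


def make_record(path, text, spans):
    """Build one labeled-record line.

    spans is a list of (phrase, [label names]); the first occurrence of each
    phrase in text is annotated.
    """
    entities = []
    for phrase, label_names in spans:
        offset = text.find(phrase)  # assumption: zero-based character offset
        if offset < 0:
            raise ValueError(f"phrase not found in text: {phrase!r}")
        entities.append({
            "entityType": "TEXTSELECTION",
            "labels": [{"label_name": name} for name in label_names],
            "textSpan": {"offset": offset, "length": len(phrase)},
        })
    record = {
        "sourceDetails": {"path": path},
        "annotations": [{"entities": entities}],
    }
    return json.dumps(record)  # one line, as JSON Lines requires


# Hypothetical sample text; in practice this is the content of the file at path.
text = "Network printer on floor 3 is offline."
record_line = make_record(
    "Ticket1.txt",
    text,
    [("Network printer", ["Label1", "Label2"]), ("floor 3", ["Label1"])],
)
print(record_line)
```

Slicing the text with each span, as in text[offset:offset + length], is a quick way to confirm that a record selects the intended phrase.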

Uploading the Datasets

Upload datasets into Object Storage buckets.

Note

Alternatively, you can create datasets using the OCI Data Labeling service.

Creating a Bucket

If you already have an Object Storage bucket for datasets, then skip this section.

  1. Open the navigation menu and click Storage. Under Object Storage & Archive Storage, click Buckets.
  2. Under List Scope, in the Compartment list, click the name of the compartment where you want to create a bucket. You must already have permission to add Object Storage resources to this compartment.
  3. Click Create Bucket.
  4. Enter a name for the bucket, unique to the region.
  5. For other fields, click the Learn More links and then choose options that apply to the data.
  6. Click Create. By default, buckets have private visibility unless you change their visibility after you create them.

Note

Bucket names must be unique within a namespace. While the namespace is region-specific, the namespace name itself is the same in all regions. For example, if the tenancy is assigned a namespace name of <your-namespace>, then that is the namespace name in all regions.

You can create a bucket named MyBucket in US West (Phoenix). You can't create another bucket named MyBucket in US West (Phoenix). You can, however, create a bucket named MyBucket in Germany Central (Frankfurt). Because the namespace name is unique to a tenant, other users can create buckets named MyBucket in their own namespaces.

Adding Data to a Bucket

After you create a bucket, add your datasets to the bucket. If your datasets are already in a bucket, then skip this section.

You store files as objects in buckets. An object is composed of the data itself and metadata about the object.

  1. Open the navigation menu and click Storage. Under Object Storage & Archive Storage, click Buckets.
  2. Under List Scope, in the Compartment list, click the name of the compartment that hosts the bucket.
  3. Click the name of the bucket where you want to add data.
  4. Click Upload.
  5. Upload the data.
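
As an alternative to the Console steps above, the upload can be scripted. The following is a minimal sketch, not an official procedure: upload_directory is a hypothetical helper that accepts any client object exposing put_object(namespace_name, bucket_name, object_name, put_object_body), which is the call shape of ObjectStorageClient in the OCI Python SDK. The bucket name and directory shown in the comments are placeholders. Object names are the bare file names, which keeps the manifest and its text files side by side so that relative paths in the manifest continue to resolve.

```python
import os


def upload_directory(client, namespace, bucket_name, directory):
    """Upload every regular file in directory as an object named after the file."""
    uploaded = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        if not os.path.isfile(full_path):
            continue  # this sketch skips subdirectories
        with open(full_path, "rb") as body:
            client.put_object(namespace, bucket_name, name, body)
        uploaded.append(name)
    return uploaded


# Assumed usage with the OCI Python SDK installed and ~/.oci/config set up:
#
#   import oci
#   config = oci.config.from_file()
#   client = oci.object_storage.ObjectStorageClient(config)
#   namespace = client.get_namespace().data
#   upload_directory(client, namespace, "MyBucket", "dataset")
```

Passing the client in as a parameter keeps the helper easy to exercise with a stand-in object before it is pointed at a real bucket.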