Document custom models are intended for Document Understanding users without a data science
background.
Overview
By creating a dataset and instructing Document Understanding to train a model
on it, you can have a custom model ready for your scenario. Custom key value
extraction requires a set of documents labeled with the fields you want the
trained model to extract, for example, company code, date, or total. Custom
document classification requires a set of documents with the document class
annotated for each document, for example, job application, recommendation
letter, or background check report.
Tools to Create the Dataset
The key to building a useful custom model is preparing and training it with a good
dataset. We recommend that you create and label the dataset using OCI
Data Labeling. Here is an outline of the steps to
take:
1. Collect enough documents that match the distribution of the intended
   application.
2. Select the correct annotation format for the custom model that you want. All
   Document Understanding models are supported under the Document annotation
   format, using key-value annotations for custom key value extraction, or
   single-label classification for custom document classification.
3. Label all instances of the fields or document classes that occur in the
   sourced dataset.
Include expected variations in the training dataset
If you expect variation, include at least one example of each
variation in the training dataset. For example, if you expect that some
employee application forms don't have a completed reference phone
number field, include one example where all the fields are filled out
and one where all fields except the reference phone number field are
filled out.
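The variation check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of Document Understanding or OCI Data Labeling: the in-memory document records, the `fields` dictionary, and the function name `missing_variations` are all assumptions made for the example.

```python
from collections import Counter


def missing_variations(documents, optional_field):
    """Return which variations ("filled", "empty") of optional_field
    have no example in the training documents."""
    counts = Counter()
    for doc in documents:
        # A missing key and an empty string both count as "empty".
        filled = bool(doc["fields"].get(optional_field))
        counts["filled" if filled else "empty"] += 1
    return [v for v in ("filled", "empty") if counts[v] == 0]


# Hypothetical summary of labeled documents (not an OCI export format).
training_docs = [
    {"name": "app_001.pdf",
     "fields": {"applicant name": "Mary Joe Smith",
                "reference phone number": "555-0100"}},
    {"name": "app_002.pdf",
     "fields": {"applicant name": "Mary Joe Smith",
                "reference phone number": ""}},
]

print(missing_variations(training_docs, "reference phone number"))  # []
```

An empty result means both variations are represented. If only fully completed forms were collected, the function would return `["empty"]`.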
Make the dataset size larger than the minimum
Custom key value extraction requires a minimum of five documents, and
custom document classification requires a minimum of 10 documents.
Increasing the dataset size generally improves model performance. The following table
shows the recommended minimum numbers of documents based on targeted
accuracy, variation in documents, and document types:
Recommended Number of Documents by Type and Accuracy for Custom Key Value Extraction
All documents of a class have the same template. For example, the invoice class can contain documents from one shop or organization.
15
All documents are labeled. The number of documents mentioned is for a single class. For example, if a dataset has 5 classes to be classified and if the recommended number of documents is 15, then the total number of documents is 75 (15*5).
Digital/Scan/Mobile
75%
Documents of a class have various templates. For example, the invoice class can contain documents from various shops or organizations.
20
All documents are labeled. The number of documents mentioned is for a single class. For example, if a dataset has 5 classes to be classified and if the recommended number of documents is 15, then the total number of documents is 75 (15*5).
Digital/Scan/Mobile
80%
Documents of a class have various templates. For example, the invoice class can contain documents from various shops or organizations.
25
All documents are labeled. The number of documents mentioned is for a single class. For example, if a dataset has 5 classes to be classified and if the recommended number of documents is 15, then the total number of documents is 75 (15*5).
Digital/Scan/Mobile
90%
Documents of a class have various templates. For example, the invoice class can contain documents from various shops or organizations.
35
All documents are labeled. The number of documents mentioned is for a single class. For example, if a dataset has 5 classes to be classified and if the recommended number of documents is 15, then the total number of documents is 75 (15*5).
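The stated minimums and the per-class arithmetic can be checked with a short Python sketch. The model type keys and function names are hypothetical and chosen for illustration. Only the numbers (5, 10, and the 15 × 5 example) come from the text above.

```python
# Minimum dataset sizes stated in this section.
# The dictionary keys are hypothetical names, not OCI identifiers.
MINIMUM_DOCUMENTS = {
    "key_value_extraction": 5,
    "document_classification": 10,
}


def total_documents(per_class, num_classes):
    """Recommended counts are per class, so multiply by the class count."""
    return per_class * num_classes


def meets_minimum(model_type, dataset_size):
    """True if the dataset reaches the stated minimum for the model type."""
    return dataset_size >= MINIMUM_DOCUMENTS[model_type]


print(total_documents(15, 5))                        # 75
print(meets_minimum("document_classification", 8))   # False
print(meets_minimum("key_value_extraction", 8))      # True
```

Meeting the minimum only allows training to start. The recommended counts in the table are the better planning target.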
Guidelines for Annotating Data
A custom model is only as good as the quality of the training documents and
annotations used to train it. The following are guidelines for creating a useful
custom model:
Annotate the documents consistently and correctly
Imagine you're creating a custom model for an employee application and
want to extract the applicant's name with the custom model. If you
expect the first and last name to be extracted, annotate all words
related to the full name, for example, Mary Joe Smith, as the applicant
name in the training documents. If the applicant name field is present
in all the documents, annotate it on all the documents. Skipping
annotations on training documents or partially annotating a field
adversely affects the quality of the model.
Annotate both field names and field values
To help the model learn better, annotate both the key names and their
associated values. For example, to extract the applicant name from a
document, create two labels: applicant name field and applicant name
value. On the training document, annotate the field name as applicant
name field and the answer, for example, Mary Joe Smith, as applicant
name value.
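Both guidelines can be spot-checked before training with a small script. The following Python sketch assumes a simplified, hypothetical record layout (a document name plus a list of annotations with a `label` key). It does not represent the OCI Data Labeling export format, and `find_annotation_gaps` is an illustrative helper, not a service API.

```python
def find_annotation_gaps(documents, required_labels):
    """Map each document name to the required labels it is missing.

    Documents that carry every required label are left out of the result.
    """
    gaps = {}
    for doc in documents:
        present = {annotation["label"] for annotation in doc["annotations"]}
        missing = [label for label in required_labels if label not in present]
        if missing:
            gaps[doc["name"]] = missing
    return gaps


# Both the key label and the value label are required on every document.
required = ["applicant name field", "applicant name value"]

# Hypothetical annotated documents.
docs = [
    {"name": "app_001.pdf", "annotations": [
        {"label": "applicant name field", "text": "Applicant name"},
        {"label": "applicant name value", "text": "Mary Joe Smith"},
    ]},
    {"name": "app_002.pdf", "annotations": [
        {"label": "applicant name value", "text": "Mary Joe Smith"},
    ]},
]

print(find_annotation_gaps(docs, required))
# {'app_002.pdf': ['applicant name field']}
```

A check like this only finds labels that are missing entirely. It can't tell whether an annotation covers the full value, such as all three words of Mary Joe Smith, so a manual review of a sample of documents is still worthwhile.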