Pretrained Document AI Models

Vision provides pretrained document AI models that allow you to organize and extract text and structure from business documents.

Pretrained models let you use AI with no data science experience. Simply provide an image-based document to the Vision service and get back information about your document without having to create your own model.
Important

The AnalyzeDocument and DocumentJob capabilities in Vision are moving to a new service, Document Understanding. The following features are impacted:
  • Table detection
  • Document classification
  • Receipt key-value extraction
  • Document OCR
These features are available in Vision until January 1, 2024. After that date, they are available only in Document Understanding.

Use Cases

Pretrained document AI models let you automate back-office operations and process receipts more accurately.

Intelligent search
Enrich image-based files with metadata, including document type and key fields, for easier retrieval.
Expense reporting
Extract the required information from receipts to automate business workflows. For example, employee expense reporting, spending compliance, and reimbursement.
Downstream Natural Language Processing (NLP)
Extract text from PDF files and organize it as the input for NLP, either in tables or in words and lines.
Loyalty points capture
Automate loyalty points calculations from receipts, based on the number of items or the total amount paid.

Supported Formats

Vision supports several document formats.

Documents can be uploaded either from a local file or Oracle Cloud Infrastructure Object Storage. They can be in the following formats:
  • JPEG
  • PDF
  • PNG
  • TIFF

Pretrained Models

Optical Character Recognition (OCR)

Vision can detect and recognize text in a document. Language classification first identifies the language of the document, then OCR draws bounding boxes around the printed or handwritten text it locates in an image and digitizes that text.

If you provide a PDF that contains text, Vision locates and extracts that text, and returns bounding boxes for each piece of text it identifies. Text detection can be used with Document AI or Image Analysis models.

Vision provides a confidence score for each text grouping. The confidence score is a decimal number from 0 to 1. Scores closer to 1 indicate higher confidence in the extracted text, while scores closer to 0 indicate lower confidence.

Note

OCR support is limited to English. If you know that the text in your images is in English, set the language to Eng.
Supported features are:
  • Word extraction
  • Text line extraction
  • Confidence score
  • Bounding polygons
  • Single request
  • Batch request
Limitations are:
  • Although Language classification identifies multiple languages, OCR is limited to English.
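As a sketch of how these features can be requested together, the helper below assembles an AnalyzeDocument-style request body for a document in Object Storage. The field names follow the request format shown in the examples in this documentation; the helper function itself is hypothetical, not part of the service.

```python
# Hypothetical helper: assembles an AnalyzeDocument request body for a
# document stored in OCI Object Storage, combining text detection with
# language classification. Field names follow Vision's request format.
def build_ocr_request(compartment_id, namespace, bucket, obj, max_languages=5):
    return {
        "analyzeDocumentDetails": {
            "compartmentId": compartment_id,
            "document": {
                "namespaceName": namespace,
                "bucketName": bucket,
                "objectName": obj,
                "source": "OBJECT_STORAGE",
            },
            "features": [
                {"featureType": "TEXT_DETECTION"},
                {"featureType": "LANGUAGE_CLASSIFICATION",
                 "maxResults": max_languages},
            ],
        }
    }
```

The resulting dictionary can be serialized to JSON and sent as the request body of a single or batch request.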
OCR Example

An example of OCR use in Vision.

Input document
Figure 1. OCR Input
Receipt from a fictitious cafe, including two line items, tax, subtotal and total amounts.
API Request:
{ "analyzeDocumentDetails":
 { "compartmentId": "",
   "document": { "namespaceName": "",
   "bucketName": "",
   "objectName": "",
   "source": "OBJECT_STORAGE" },
  "features":
             [ { "featureType": "TEXT_DETECTION" },
               { "featureType": "LANGUAGE_CLASSIFICATION",
                 "maxResults": 5 } ]
 } 
}
Output:
Figure 2. OCR Output
The receipt with all the fields identified
API Response:
{ "documentMetadata":
 { "pageCount": 1,
   "mimeType": "image/jpeg" },
   "pages":
           [ { "pageNumber": 1,
               "dimensions":
                            { "width": 361, 
                              "height": 600,
                              "unit": "PIXEL" },
                              "detectedLanguages":
                                                  [ { "languageCode": "ENG",
                                                      "confidence": 0.9999994 },
                                                    { "languageCode": "ARA", 
                                                      "confidence": 4.7619238e-7 },
                                                    { "languageCode": "NLD",
                                                      "confidence": 7.2325456e-8 },
                                                    { "languageCode": "CHI_SIM",
                                                      "confidence": 3.0645523e-8 },
                                                    { "languageCode": "ITA",
                                                      "confidence": 8.6900076e-10 } ],
                              "words":
                                                  [ { "text": "Example",
                                                      "confidence": 0.99908227,
                                                      "boundingPolygon":
                                                                        { "normalizedVertices": 
                                                                                               [ { "x": 0.0664819944598338, 
                                                                                                   "y": 0.011666666666666667 },
                                                                                                 { "x": 0.22160664819944598,
                                                                                                   "y": 0.011666666666666667 },
                                                                                                 { "x": 0.22160664819944598,
                                                                                                   "y": 0.035 },
                                                                                                 { "x": 0.0664819944598338,
                                                                                                   "y": 0.035 } ]
                                                                        } ... "detectedLanguages":
                                                                                                [ { "languageCode": "ENG", 
                                                                                                     "confidence": 0.9999994 } ], ...
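A response shaped like the one above can be post-processed to keep only words the model is confident about. The following sketch (a hypothetical helper, assuming the documented `pages` → `words` → `text`/`confidence` layout) filters extracted words by a confidence threshold:

```python
# Illustrative sketch: collect word text from an AnalyzeDocument-style
# response, keeping only words whose confidence meets the threshold.
def high_confidence_words(response, threshold=0.9):
    words = []
    for page in response.get("pages", []):
        for word in page.get("words", []):
            if word.get("confidence", 0.0) >= threshold:
                words.append(word["text"])
    return words
```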

Document Classification

Document classification identifies the type of a document.

Vision provides a list of possible document types for the analyzed document, each with a confidence score. The confidence score is a decimal number from 0 to 1. Scores closer to 1 indicate higher confidence in the classification, while scores closer to 0 indicate lower confidence. The list of possible document types is:
  • Invoice
  • Receipt
  • Resume or CV
  • Tax form
  • Driver's license
  • Passport
  • Bank statement
  • Check
  • Payslip
  • Other
Supported features are:
  • Classify document
  • Confidence score
  • Single request
  • Batch request
Document Classification Example

An example of document classification use in Vision.

Input document
Figure 3. Document Classification Input
Receipt from a fictitious cafe, including two line items, tax, subtotal and total amounts.
API Request:
{ "analyzeDocumentDetails":
 { "compartmentId": "",
   "document":
              { "namespaceName": "",
                "bucketName": "",
                "objectName": "",
                "source": "OBJECT_STORAGE" },
   "features": 
              [ { "featureType":
                  "DOCUMENT_CLASSIFICATION",
                  "maxResults": 5 } ]
 } 
}
Output:
API Response:
{ "documentMetadata":
 { "pageCount": 1,
   "mimeType": "image/jpeg" },
  "pages":
          [ { "pageNumber": 1,
              "dimensions": 
                           { "width": 361,
                             "height": 600,
                             "unit": "PIXEL" },
              "detectedDocumentTypes":
                                      [ { "documentType": "RECEIPT",
                                          "confidence": 1 },
                                        { "documentType": "TAX_FORM",
                                          "confidence": 6.465067e-9 },
                                        { "documentType": "CHECK",
                                          "confidence": 6.031838e-9 },
                                        { "documentType": "BANK_STATEMENT",
                                          "confidence": 5.413888e-9 },
                                        { "documentType": "PASSPORT",
                                          "confidence": 1.5554872e-9 } ],
 ...
               "detectedDocumentTypes":
                                      [ { "documentType": "RECEIPT",
                                          "confidence": 1 } ], ...
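Because each candidate type carries a confidence score, a caller typically keeps only the highest-scoring entry. A minimal sketch, assuming the `detectedDocumentTypes` layout shown in the response above:

```python
# Sketch: choose the most likely document type from a page's
# detectedDocumentTypes list, or None when no candidates are present.
def top_document_type(page):
    candidates = page.get("detectedDocumentTypes", [])
    if not candidates:
        return None
    best = max(candidates, key=lambda d: d["confidence"])
    return best["documentType"]
```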

Table Extraction

Table extraction can be used to identify tables in a document and extract their contents. For example, if a PDF receipt contains a table that includes the taxes and total amount, Vision identifies the table and extracts its structure.

Vision provides the number of rows and columns in the table and the contents of each table cell. Each cell has a confidence score, a decimal number from 0 to 1. Scores closer to 1 indicate higher confidence in the extracted text, while scores closer to 0 indicate lower confidence.

Supported features are:
  • Table extraction for tables with and without borders
  • Bounding polygons
  • Confidence score
  • Single request
  • Batch request
Limitations are:
  • English language only
Table Extraction Example

An example of table extraction use in Vision.

Input document
Figure 4. Table Extraction Input
Fictitious balance sheet for eight quarters
API Request:
{ "analyzeDocumentDetails":
 { "compartmentId": "",
   "document": 
              { "namespaceName": "",
                "bucketName": "",
                "objectName": "",
                "source": "OBJECT_STORAGE" },
   "features": 
              [ { "featureType": "TABLE_DETECTION" } ]
 } 
}
Output:
Figure 5. Table Extraction Output
The balance sheet with cell, column header and row identifier highlighted
API Response:
{ "documentMetadata":
 { "pageCount": 1,
   "mimeType": "application/pdf" },
  "pages":
          [ { "pageNumber": 1,
              "dimensions": 
                           { "width": 2575, 
                             "height": 1013,
                             "unit": "PIXEL" },
 ... 
  "tables":
           [ { "rowCount": 15,
               "columnCount": 9,
               "bodyRows":
                          [ { "cells":
                                      [ { "text": "Qtr1-12",
                                          "rowIndex": 0,
                                          "columnIndex": 1,
                                          "confidence": 0.92011595,
                                          "boundingPolygon":
                                                            { "normalizedVertices": 
                                                                                   [ { "x": 0.2532038834951456,
                                                                                       "y": 0.022704837117472853 },
                                                                                     { "x": 0.3005825242718447,
                                                                                       "y": 0.022704837117472853 },
                                                                                     { "x": 0.3005825242718447,
                                                                                       "y": 0.05330700888450148 },
                                                                                     { "x": 0.2532038834951456,
                                                                                       "y": 0.05330700888450148 } ]
                                                             },
                                                               "wordIndexes": [ 0 ] },
                                        { "text": "Qtr2-12",
                                          "rowIndex": 0,
                                          "columnIndex": 2,
                                          "confidence": 0.919653,
                                          "boundingPolygon":
                                                           { "normalizedVertices":
                                                                                   [ { "x": 0.33048543689320387,
                                                                                       "y": 0.022704837117472853 },
                                                                                     { "x": 0.3724271844660194,
                                                                                       "y": 0.022704837117472853 },
                                                                                     { "x": 0.3724271844660194,
                                                                                       "y": 0.05330700888450148 },
                                                                                     { "x": 0.33048543689320387,
                                                                                       "y": 0.05330700888450148 } ]
                                                          }, "wordIndexes": [ 1 ] },
 ...
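Since every cell carries its own `rowIndex` and `columnIndex`, a table like the one in the response above can be rebuilt as a row-major grid. The following is an illustrative sketch, not a service API, assuming the documented `rowCount`/`columnCount`/`bodyRows` layout:

```python
# Sketch: arrange extracted cells into a row-major grid of cell text,
# using the table's rowCount/columnCount and each cell's indices.
# Cells the service did not return are left as empty strings.
def cells_to_grid(table):
    grid = [["" for _ in range(table["columnCount"])]
            for _ in range(table["rowCount"])]
    for row in table.get("bodyRows", []):
        for cell in row.get("cells", []):
            grid[cell["rowIndex"]][cell["columnIndex"]] = cell["text"]
    return grid
```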

Key Value Extraction (Receipts)

Key value extraction can be used to identify values for predefined keys in a receipt. For example, if a receipt includes a merchant name, merchant address, or merchant phone number, Vision can identify these values and return them as key value pairs.

Supported features are:
  • Extract values for predefined key value pairs
  • Bounding polygons
  • Single request
  • Batch request
Limitations:
  • Supports receipts in English only.
Supported fields are:
MerchantName
The name of the merchant issuing the receipt.
MerchantPhoneNumber
The telephone number of the merchant.
MerchantAddress
The address of the merchant.
TransactionDate
The date the receipt was issued.
TransactionTime
The time the receipt was issued.
Total
The total amount of the receipt, after all charges and taxes have been applied.
Subtotal
The subtotal before taxes.
Tax
Any sales taxes.
Tip
The amount of tip given by the purchaser.
The supported line item information is:
ItemName
Name of the item.
ItemPrice
Unit price of the item.
ItemQuantity
The number of each item purchased.
ItemTotalPrice
The total price of the line item.
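One relationship among these line-item fields is that the total price of a line is ordinarily the unit price multiplied by the quantity. As a sketch of a hypothetical post-processing check (not part of the service), extracted line items can be validated against that expectation:

```python
# Hypothetical check: confirm an extracted line item's total equals its
# unit price times its quantity, within a small rounding tolerance.
def line_item_consistent(item, tolerance=0.01):
    expected = item["ItemPrice"] * item["ItemQuantity"]
    return abs(expected - item["ItemTotalPrice"]) <= tolerance
```

A tolerance is used because extracted monetary values may be rounded to two decimal places.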
Key Value Extraction (Receipts) Example

An example of key value extraction use in Vision.

Input document
Figure 6. Key Value Extraction (Receipts) Input
Receipt from a fictitious cafe, including two line items, tax, subtotal and total amounts.
API Request:
{ "analyzeDocumentDetails":
 { "compartmentId": "",
   "document":
              { "namespaceName": "",
                "bucketName": "",
                "objectName": "",
                "source": "OBJECT_STORAGE" },
   "features":
              [ { "featureType": "KEY_VALUE_DETECTION" } ]
 } 
}
Output:
Figure 7. Key Value Extraction (Receipts) Output
The fictitious receipt with only specific lines and fields highlighted
API Response:
{ "documentMetadata":
                     { "pageCount": 1,
                       "mimeType": "image/jpeg" },
                       "pages":
                               [ { "pageNumber": 1, 
                                   "dimensions":
                                                { "width": 361,
                                                  "height": 600,
                                                  "unit": "PIXEL" },
 ...
                                   "documentFields":
                                                     [ { "fieldType": "KEY_VALUE",
                                                         "fieldLabel":
                                                                      { "name": "MerchantName" },
                                                         "fieldValue":
                                                                      { "valueType": "STRING",
                                                                        "boundingPolygon":
                                                                                          { "normalizedVertices":
                                                                                                                 [ { "x": 0.0664819944598338,
                                                                                                                     "y": 0.011666666666666667 },
                                                                                                                   { "x": 0.3157894736842105,
                                                                                                                     "y": 0.011666666666666667 },
                                                                                                                   { "x": 0.3157894736842105,
                                                                                                                     "y": 0.035 },
                                                                                                                   { "x": 0.0664819944598338,
                                                                                                                     "y": 0.035 } ]
                                                                                           },
                                                                        "wordIndexes":
                                                                                      [ 0, 1 ],
                                                                        "value": "Example cafe" } },
 ...
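The `documentFields` list in a response like the one above can be flattened into an ordinary dictionary of field names to values. A minimal sketch, assuming the documented `fieldLabel.name` / `fieldValue.value` layout (the helper itself is hypothetical):

```python
# Sketch: flatten the documentFields list from a key-value extraction
# response into a plain {field name: value} dictionary.
def fields_to_dict(page):
    result = {}
    for field in page.get("documentFields", []):
        name = field["fieldLabel"]["name"]
        value = field["fieldValue"].get("value")
        result[name] = value
    return result
```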

Optical Character Recognition (OCR) PDF

OCR PDF generates a searchable PDF file in your Object Storage. For example, Vision can take a PDF file with text and images, and return a PDF file where you can search for the text in the PDF.

Supported features:
  • Generate searchable PDF
  • Single request
  • Batch request
OCR PDF Example

An example of OCR PDF use in Vision.

Input
Figure 8. OCR PDF Input
Page from a PDF document
API Request:
{ "analyzeDocumentDetails":
 { "compartmentId": "",
   "document":
              { "source": "INLINE",
                "data": "......" },
   "features":
              [ { "featureType": "TEXT_DETECTION",
                  "generateSearchablePdf": true } ]
 } 
}
Output:
Searchable PDF.
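The `INLINE` source in the request above carries the document bytes base64-encoded in the `data` field. As an illustrative sketch (the helper name is hypothetical), a local PDF can be wrapped for an inline request like this:

```python
import base64

# Sketch: wrap raw PDF bytes as an INLINE document payload, with the
# bytes base64-encoded in the "data" field as the request expects.
def inline_document(pdf_bytes):
    return {"source": "INLINE",
            "data": base64.b64encode(pdf_bytes).decode("ascii")}
```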

Using the Pretrained Document AI Models

Vision provides pretrained models that let you extract insights from your documents without needing data scientists.

You need the following before using a pretrained model:

  • A paid tenancy account in Oracle Cloud Infrastructure.

  • Familiarity with Oracle Cloud Infrastructure Object Storage.

You can call the pretrained Document AI models as a batch request using the REST API, SDKs, or CLI. You can call them as a single request using the Console, REST API, SDKs, or CLI.

See the Limits section for information on what is allowed in batch requests.