Create Entities

Here's how you create an entity.

To create an entity:
  1. Click Entities in the side navbar.
  2. Click Add Entity, and then enter the name and select the type. The dialog's fields reflect the entity type. For example, for regular expression entities, you can add the expression. For Value List entities, you add the values and synonyms.
    If your skill supports multiple languages through Digital Assistant's native language support, then you need to add the foreign-language counterparts for the Value List entity's values and synonyms.

    Because these values need to map to the corresponding value from the primary language (the Primary Language Value), you need to select the primary value before you add its secondary language counterpart. For example, if you've added French as a secondary language to a skill whose primary language is English, you first select small as the Primary Language Value and then add petite.

  3. As an optional step, enter a description. You might use the description to spell out the entity, like the pizza toppings for a PizzaTopping entity. This description is not retained when you add the entity to a composite bag.
  4. You can add the following functions, which are optional. They can be overwritten if you add the entity to a composite bag.
    • If a value list entity has a long list of values, but you want to show users only a few options at a time, you can set the pagination for these values by entering a number in the Enumeration Range Size field, or by defining an Apache FreeMarker expression that evaluates to this number. For example, you can define an expression that returns enum values based on the channel.

      When you set this property to 0, the skill won't output a list at all, but will still validate the user input against an entity value.

      If you set this number to one lower than the total number of values defined for this entity, then the Resolve Entities component displays a Show More button to accompany each full set of values. If you use a Common Response component to resolve the entity, then you can configure the Show More button yourself.
      You can change the Show More button text using the showMoreLabel property that belongs to the Resolve Entities and Common Response components.

    • Add an error message for invalid user input. Use an Apache FreeMarker expression that includes the system.entityToResolve.value.userInput property. For example, '${system.entityToResolve.value.userInput!'This'}' is not a valid pizza type.
    • To allow users to pick more than one value from a value list entity, switch on Multiple Values. When you switch this on, the values display as a numbered list.
      Switching this option off displays the values as a list of options, which allows only a single choice.
    • Switching on Fuzzy Match increases the chances of the user input matching a value, particularly when your values don’t have a lot of synonyms. Fuzzy matching uses word stemming to identify matches from the user input. Switching off fuzzy matching enforces strict matching, meaning that the user input must be an exact match to the values and synonyms; "cars" won’t match a value called "car", nor will "manager" match a "development manager" value.
    • For skills that are configured with a translation service, entity matching is based on the translation of the input. If you switch on Match Original Value, the original input is also considered in entity matching, which could be useful for matching values that are untranslatable.
    • To force a user to select a single value, switch on Prompt for Disambiguation and add a disambiguation prompt. By default, this message is Please select one value of <item name>, but you can replace this with one made up solely of text (You can only order one pizza at a time. Which pizza do you want to order?) or a combination of text and FreeMarker expressions. For example:
      "I found multiple dates: <#list system.entityToResolve.value.disambiguationValues.Date as date>${date.date?number_to_date}<#sep> and </#list>. Which date should I use as expense date?"
    • Define a validation rule using a FreeMarker expression.
      Note

      You can only add prompts, disambiguation, and validation for built-in entities when they belong to a composite bag.
  5. Click Create.
  6. Next steps:
    1. Add the entity to an intent. This informs the skill of the values that it needs to extract from the user input during the language processing. See Add Entities to Intents.
    2. In the dialog flow, declare a variable for the entity.
    3. Access the variable values using Apache FreeMarker expressions. See Built-In FreeMarker Array Operations.
    4. Click Validate and review the validation messages for errors related to entity event handlers (if used), potential problems like multiple values in a value list entity sharing the same synonym, and for guidance on applying best practices such as adding multiple prompts to make the skill more engaging.
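
The pagination behavior described for the Enumeration Range Size field in step 4 can be pictured with a small sketch. This is only an illustration of the described behavior, not the Resolve Entities component's implementation. The function name paginate_values and its start parameter are hypothetical.

```python
from typing import List, Tuple


def paginate_values(values: List[str], range_size: int, start: int = 0) -> Tuple[List[str], bool]:
    """Return (page, show_more) for one page of entity values.

    range_size stands in for the Enumeration Range Size field (an assumption
    made for illustration). A size of 0 means that no list is shown at all.
    """
    if range_size < 0:
        raise ValueError("range_size must be zero or greater")
    if range_size == 0:
        return [], False
    page = values[start:start + range_size]
    # A "Show More" option is only needed when values remain after this page.
    show_more = start + range_size < len(values)
    return page, show_more


toppings = ["Cheese", "Mushroom", "Olives", "Peppers", "Onions"]
first_page, more = paginate_values(toppings, 2)
print(first_page, more)
```

With five values and a range size of four (one lower than the total), the first page holds four values and a further page remains, which corresponds to the Show More case described above.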

Value List Entities for Multiple Languages

When you have a skill that is targeted to multiple languages and which uses Digital Assistant's native language support, you can set values for each language in the skill. For each entity value in a skill's primary language, you should designate a corresponding value in each additional language.

Tip:

To ensure that your skill consistently outputs responses in the detected language, always include useFullEntityMatches: true in Common Response, Resolve Entities, and Match Entity states. As described in Add Natively-Supported Languages to a Skill, setting this property to true (the default) returns the entity value as an object whose properties differentiate the primary language from the detected language. When referenced in Apache FreeMarker expressions, these properties ensure that the appropriate language displays in the skill's message text and labels.

Word Stemming Support in Fuzzy Match

Starting with Release 22.10, fuzzy matching for value list entities is based on word stemming, where a value match is based on the lexical root of the word. In previous versions, fuzzy matching was enabled through partial matching and auto correct. While this approach was tolerant of typos in the user input, including transposed words, it could also result in matches to more than one value within the value list entity. With stemming, this scatter is eliminated: matches are based on the word order of the user input, so either a single match is made, or none at all. For example, "Lovers Veggie" would not result in any matches, but "Veggie Lover" would match to the Veggie Lovers value of a pizza type entity. (Note that "Lover" is stemmed.) Stop words, such as articles and prepositions, are ignored in extracted values, as are special characters. For example, both "Veggie the Lover" and "Veggie////Lover" would match the Veggie Lovers value.
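
To make the matching rules above concrete, here is a minimal sketch of order-sensitive matching on stemmed tokens. It is not the product's algorithm: the toy_stem function (which only strips a trailing "s"), the STOP_WORDS list, and the whole-input comparison are simplifying assumptions chosen to reproduce the examples in this section.

```python
import re
from typing import List, Optional

# Assumption: a tiny stop word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "for", "to", "in", "on", "with"}


def toy_stem(word: str) -> str:
    """Crude stand-in for a real stemmer: drop one trailing 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(text: str) -> List[str]:
    """Lowercase, drop special characters and stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(token) for token in tokens if token not in STOP_WORDS]


def match_value(user_input: str, values: List[str]) -> Optional[str]:
    """Return the first value whose stemmed tokens equal the input's, in order."""
    tokens = normalize(user_input)
    for value in values:
        if tokens == normalize(value):
            return value
    return None


PIZZA_TYPES = ["Veggie Lovers", "Meat Lovers", "Cheese"]
for phrase in ["Veggie Lover", "Lovers Veggie", "Veggie the Lover", "Veggie////Lover"]:
    print(phrase, "->", match_value(phrase, PIZZA_TYPES))
```

Because the token lists are compared in order, the transposed input "Lovers Veggie" produces no match, while the other three inputs resolve to Veggie Lovers.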

Create ML Entities

ML Entities are a model-driven approach to entity extraction. Like intents, you create ML Entities from training utterances – likely the same training utterances that you used to build your intents. For ML Entities, however, you annotate the words in the training utterances that correspond to an entity.

To get started, you can annotate some of the training data yourself, but as is the case for intents, you can develop a more varied (and therefore robust) training set by crowd sourcing it. As noted in the training guidelines, robust entity detection requires anywhere from 500 to 5000 occurrences of each ML entity throughout the training set. Also, if the intent training data is already expansive, then you may want to crowd source it rather than annotate each utterance yourself. In either case, you should analyze your training data to find out if the entities are evenly represented and if the entity values are sufficiently varied. With the annotations complete, you then train the model, then test it. After reviewing the entities detected in the test runs, you can continue to update the corpus and retrain to improve the accuracy.

To create an ML Entity:
  1. Click + Add Entity.
  2. Complete the Create Entity dialog. Keep in mind that the Name and Description appear in the crowd worker pages for Entity Annotation Jobs.
    • Enter a name that identifies the annotated content. A unique name helps crowd workers.
    • Enter a description. Although this is an optional property, crowd workers use it, along with the Name property, to differentiate entities.
    • Choose ML Entity from the list.
  3. Switch on Exclude System Entity Matches when the training annotations contain names, locations, numbers, or other content that could potentially clash with system entity values. Setting this option prevents the model from extracting system entity values that are within the input that's resolved to this ML entity. It enforces a boundary around this input so that the model recognizes it only as an ML entity value and does not parse it further for system entity values. You can set this option for composite bag entities that reference ML entities.
  4. Click Create.
  5. Click +Value List Entities to associate this entity with up to five Value List Entities. This is optional, but associating an ML Entity with a Value List Entity combines the contextual extraction of the ML Entity and the context-agnostic extraction of the Value List Entity.
  6. Click the DataSet tab. This page lists all the utterances for each ML Entity in your skill, which include the utterances that you've added yourself to bootstrap the entity, those submitted from crowd sourcing jobs, and those that have been imported as JSON objects. From this page, you can add utterances manually or in bulk by uploading a JSON file. You can also manage the utterances from this page by editing them (including annotating or re-annotating them), or by deleting, importing, and exporting them.
    • Add utterances manually:
      • Click Add Utterance. After you've added the utterance, click Edit Annotations to open the Entity List.
        Note

        You can only add one utterance at a time. If you want to add utterances in bulk, you can either add them through an Entity Annotation job, or you can upload a JSON file.
      • Highlight the text relevant to the ML Entity, then complete the labeling by selecting the ML Entity from the Entity List. You can remove an annotation by clicking x in the label.

    • Add utterances from a JSON file. This JSON file contains a list of utterance objects.
      [
        {
          "Utterance": {
            "utterance": "I expensed $35.64 for group lunch at Joe's on 4/7/21",
            "languageTag": "en",
            "entities": [
              {
                "entityValue": "Joe's",
                "entityName": "VendorName",
                "beginOffset": 37,
                "endOffset": 42
              }
            ]
          }
        },
        {
          "Utterance": {
            "utterance": "Give me my $30 for Coffee Klatch on 7/20",
            "languageTag": "en",
            "entities": [
              {
                "entityName": "VendorName",
                "beginOffset": 19,
                "endOffset": 32
              }
            ]
          }
        }
      ]
      You can upload it by clicking More > Import to retrieve it from your local system.
      The entities object describes the ML entities that have been identified within the utterance. Although the preceding example illustrates a single entities object for each utterance, an utterance may contain multiple ML entities which means multiple entities objects:
      [
        {
          "Utterance": {
            "utterance": "I want this and that",
            "languageTag": "en",
            "entities": [
              {
                "entityName": "ML_This",
                "beginOffset": 7,
                "endOffset": 11
              },
              {
                "entityName": "ML_That",
                "beginOffset": 16,
                "endOffset": 20
              }
            ]
          }
        },
        {
          "Utterance": {
            "utterance": "I want less of this and none of that",
            "languageTag": "en",
            "entities": [
              {
                "entityName": "ML_This",
                "beginOffset": 15,
                "endOffset": 19
              },
              {
                "entityName": "ML_That",
                "beginOffset": 32,
                "endOffset": 36
              }
            ]
          }
        }
      ]
      entityName identifies the ML Entity itself and entityValue identifies the text labeled for the entity. entityValue is an optional key that you can use to validate the labeled text against changes made to the utterance. The label itself is identified by the beginOffset and endOffset properties, which represent the offsets of the characters that begin and end the label. These offsets are counted by character, not by word, starting from the first character of the utterance, which occupies offsets 0 to 1. In the preceding examples, endOffset is therefore the offset just past the last labeled character: "this" in "I want this and that" begins at 7 and ends at 11.
      Note

      You can't create the ML Entities from this JSON. They must exist before you upload the file.
      If you don't want to determine the offsets, you can leave the entities object undefined and then apply the labels after you upload the JSON file.
      [
        {
          "Utterance": {
            "utterance": "I expensed $35.64 for group lunch at Joe's on 4/7/21",
            "languageTag": "en",
            "entities": []
          }
        },
        {
          "Utterance": {
            "utterance": "Give me my $30 for Coffee Klatch on 7/20",
            "languageTag": "en",
            "entities": []
          }
        }
      ]
      The system checks for duplicates to prevent redundant entries. Only changes made to the entities definition in the JSON file are applied. If an utterance has been changed in the JSON file, then it's considered a new utterance.
    • Edit an annotated utterance:
      • Click Edit to remove the annotation.
        Note

        A modified utterance is considered a new (unannotated) utterance.
      • Click Edit Annotations to open the Entity List.
      • Highlight the text, then select an ML Entity from the Entity List.
      • If you need to remove an annotation, click x in the label.
  7. When you've completed annotating the utterances, click Train to update both Trainer Tm and the Entity model.
  8. Test the recognition by entering a test phrase in the Utterance Tester, ideally one with a value not found in any training data. Check the results to find out if the model detected the correct ML Entity and if the text has been labeled correctly and completely.
  9. Associate the ML Entity with an intent.
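
If you'd rather not count characters by hand for the JSON file described in step 6, the offsets can be derived from the labeled text. The following is a minimal sketch: the annotate helper is hypothetical, it expects the labels in the order in which they appear in the utterance, and it assumes that offsets count characters the same way that Python string indexes do.

```python
import json
from typing import Dict, List, Tuple


def annotate(utterance: str, labels: List[Tuple[str, str]], language_tag: str = "en") -> Dict:
    """Build one utterance object from (entityName, labeled text) pairs."""
    entities = []
    search_from = 0
    for entity_name, text in labels:
        begin = utterance.find(text, search_from)
        if begin == -1:
            raise ValueError(f"{text!r} not found in utterance")
        end = begin + len(text)  # offset just past the last labeled character
        entities.append({
            "entityValue": text,
            "entityName": entity_name,
            "beginOffset": begin,
            "endOffset": end,
        })
        search_from = end  # later labels are searched for after this one
    return {
        "Utterance": {
            "utterance": utterance,
            "languageTag": language_tag,
            "entities": entities,
        }
    }


objects = [
    annotate("I expensed $35.64 for group lunch at Joe's on 4/7/21", [("VendorName", "Joe's")]),
    annotate("I want this and that", [("ML_This", "this"), ("ML_That", "that")]),
]
print(json.dumps(objects, indent=2))
```

The generated offsets agree with the hand-written examples in step 6 (37 and 42 for Joe's, 7 and 11 for "this", 16 and 20 for "that").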

Exclude System Entity Matches

Switching on Exclude System Entity Matches prevents the model from replacing previously extracted system entity values with competing values found within the boundaries of an ML entity. With this option enabled, "Create a meeting on Monday to discuss the Tuesday deliverable" keeps the DATE_TIME and ML entity values separate by resolving the applicable DATE_TIME entity (Monday) and ignoring "Tuesday" in the text that's recognized as the ML entity ("discuss the Tuesday deliverable").

When this option is disabled, the skill instead resolves two DATE_TIME entity values, Monday and Tuesday. Clashing values like these diminish the user experience by updating a previously slotted entity value with an unintended value or by interjecting a disambiguation prompt that interrupts the flow of the conversation.
Note

You can set the Exclude System Entity Matches option for composite bag entities that reference an ML entity.
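
The boundary behavior described in this section can be modeled as a span filter. The sketch below is only a conceptual illustration, not how the model works internally. The Span type, the exclude_system_matches function, and the MeetingTopic entity name are hypothetical.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    entity: str
    value: str
    begin: int
    end: int  # exclusive


def exclude_system_matches(ml_spans: List[Span], system_spans: List[Span]) -> List[Span]:
    """Keep only the system entity spans that lie outside every ML entity span."""
    kept = []
    for system_span in system_spans:
        inside_ml = any(
            ml.begin <= system_span.begin and system_span.end <= ml.end
            for ml in ml_spans
        )
        if not inside_ml:
            kept.append(system_span)
    return kept


utterance = "Create a meeting on Monday to discuss the Tuesday deliverable"


def span_of(entity: str, text: str) -> Span:
    begin = utterance.find(text)
    return Span(entity, text, begin, begin + len(text))


ml_spans = [span_of("MeetingTopic", "discuss the Tuesday deliverable")]
system_spans = [span_of("DATE_TIME", "Monday"), span_of("DATE_TIME", "Tuesday")]

print([s.value for s in exclude_system_matches(ml_spans, system_spans)])
```

With the option modeled as enabled, only Monday survives as a DATE_TIME value. Passing an empty list of ML spans models the disabled case, in which both Monday and Tuesday are kept.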

Import Value List Entities from a CSV File

Rather than creating your entities one at a time, you can create entire sets of them when you import a CSV file containing the entity definitions.

This CSV file contains columns for the entity name (entity), the entity value (value), and any synonyms (synonyms). You can create this file from scratch, or you can reuse or repurpose a CSV that has been created from an export.

Whether you're starting anew or using an exported file, you need to be mindful of the version of the skill that you're importing to because of the format and content changes for native language support that were introduced in Version 20.12. Although you can import a CSV from a prior release into a 20.12 skill without incident in most cases, there are still some compatibility issues that you may need to address. But before that, let's take a look at the format of a pre-20.12 file. This file is divided into the following columns: entity, value, and synonyms. For example:
entity,value,synonyms
PizzaSize,Large,lrg:lrge:big
PizzaSize,Medium,med
PizzaSize,Small,little
For skills created with, or upgraded to, Version 20.12, the import files have language tags appended to the value and synonyms column headers. For example, if the skill's primary native language is English (en), then the value and synonyms columns are en:value and en:synonyms:
entity,en:value,en:synonyms
PizzaSize,Large,lrg:lrge:big
PizzaSize,Medium,med
PizzaSize,Small,
PizzaSize,Extra Large,XL
CSVs that support multiple native languages require additional sets of value and synonyms columns for each secondary language. If a native English language skill's secondary language is French (fr), then the CSV has fr:value and fr:synonyms columns as counterparts to the en columns:
entity,en:value,en:synonyms,fr:value,fr:synonyms
PizzaSize,Large,lrg:lrge:big,grande,grde:g
PizzaSize,Medium,med,moyenne,moy
PizzaSize,Small,,petite,p
PizzaSize,Extra Large,XL,pizza extra large,
Here are some things to note if you plan to import CSVs across versions:
  • If you import a pre-20.12 CSV into a 20.12 skill (including those that support native languages or use translation services), the values and synonyms are imported as the primary language's values and synonyms.
  • All entity values for both the primary and secondary languages must be unique within an entity, so you can't import a CSV if the same value has been defined more than once for a single entity. Duplicate values may occur in pre-20.12 versions, where values can be considered unique because of variations in letter casing. This is not true for 20.12, where casing is more strictly enforced. For example, you can't import a CSV if it has both PizzaSize, Small and PizzaSize, SMALL. If you plan to upgrade to Version 20.12, you must first resolve all entity values that are the same, but differentiated only by letter casing, before performing the upgrade.
  • Primary language support applies to skills created using Version 20.12 and higher, so you must first remove language tags and any secondary language entries before you can import a Version 20.12 CSV into a skill created with a prior version.
When you import a 20.12 CSV into a 20.12 skill:
  • You can import a multi-lingual CSV into skills that do not use native language support, including those that use translation services.
  • If you import a multi-lingual CSV into a skill that supports native languages or uses translation services, then only rows that provide a valid value for the primary language are imported. The rest are ignored.
With these caveats in mind, here's how you create entities through an import:
  1. Click Entities in the side navbar.

  2. Click More, choose Import Value list entities, and then select the .csv file from your local system.

  3. Add the entity or entities to an intent (or to an entity list and then to an intent).
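
Before importing, you may want to check a CSV for values that differ only by letter casing, as described in the notes above. The following is a minimal sketch under stated assumptions: the find_case_duplicates function is hypothetical, and pooling the values from every language column of an entity is one reading of the uniqueness rule, not a documented check.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple


def find_case_duplicates(csv_text: str) -> Dict[Tuple[str, str], List[str]]:
    """Map (entity, case-folded value) to its spellings when it occurs more than once."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    # Matches "value" (pre-20.12) and tagged columns such as "en:value".
    value_columns = [h for h in headers if h == "value" or h.endswith(":value")]
    seen: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for row in reader:
        for column in value_columns:
            value = (row.get(column) or "").strip()
            if value:
                seen[(row["entity"], value.casefold())].append(value)
    return {key: spellings for key, spellings in seen.items() if len(spellings) > 1}


sample = (
    "entity,en:value,en:synonyms\n"
    "PizzaSize,Small,little\n"
    "PizzaSize,SMALL,\n"
    "PizzaSize,Large,lrg:lrge:big\n"
)
print(find_case_duplicates(sample))
```

An empty result suggests that the file has no values that clash on casing. Any entries that are returned identify the rows to resolve before the import or upgrade.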

Export Value List Entities to a CSV File

You can export the values and synonyms in a CSV file for reuse in another skill. The exported CSVs share the same format as the CSVs used for creating entities through imports in that they contain entity, value, and synonyms columns. These CSVs have release-specific requirements, which can impact their reuse.
  • The CSVs exported from skills created with, or upgraded to, Version 20.12 are equipped for native language support through the primary (and sometimes secondary) language tags that are appended to the value and synonyms columns. For example, the CSV in the following snippet has a set of value and synonyms columns for the skill's primary language, English (en) and another set for its secondary language, French (fr):
    entity,en:value,en:synonyms,fr:value,fr:synonyms
    The primary language tags are included in all 20.12 CSVs regardless of native language support. They are present in skills that are not intended to perform any type of translation (native or through a translation service) and in skills that use translation services.
  • The CSVs exported from skills running on versions prior to 20.12 have the entity, value, and synonyms columns, but no language tags.
To export value list entities:
  1. Click Entities in the side navbar.

  2. Click More, choose Export Value list entities and then save the file.

    The exported .csv file is named for your skill. If you plan to import this file into a skill on a different version (from a Version 20.12 skill to a prior version, or the reverse), then you may need to perform some of the edits described in Import Intents from a CSV File.
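
One of the edits mentioned for reusing a Version 20.12 export in a skill from a prior version is removing the language tags and any secondary language entries. A minimal sketch of that edit follows. The strip_language_tags function is hypothetical, and skipping rows that have no primary language value is an assumption rather than documented behavior.

```python
import csv
import io


def strip_language_tags(csv_text: str, primary: str = "en") -> str:
    """Keep only the primary language columns and rename them to value and synonyms."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    value_column = f"{primary}:value"
    synonyms_column = f"{primary}:synonyms"
    if value_column not in headers:
        raise ValueError(f"missing column {value_column!r}")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["entity", "value", "synonyms"])
    for row in reader:
        value = (row.get(value_column) or "").strip()
        if not value:
            continue  # assumption: rows without a primary value are dropped
        synonyms = (row.get(synonyms_column) or "").strip()
        writer.writerow([row["entity"], value, synonyms])
    return out.getvalue()


exported = (
    "entity,en:value,en:synonyms,fr:value,fr:synonyms\n"
    "PizzaSize,Large,lrg:lrge:big,grande,grde:g\n"
    "PizzaSize,Medium,med,moyenne,moy\n"
    "PizzaSize,Small,,petite,p\n"
    "PizzaSize,Extra Large,XL,pizza extra large,\n"
)
print(strip_language_tags(exported))
```

The output has the three untagged columns (entity, value, and synonyms) of a pre-20.12 file.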

Create Dynamic Entities

Dynamic entity values are managed through the endpoints of the Dynamic Entities API that are described in the REST API for Oracle Digital Assistant. To add, modify, and delete the entity values and synonyms, you must first create a dynamic entity to generate the entityId that's used in the REST calls.

To create the dynamic entity:
  1. Click + Entity.
  2. Choose Dynamic Entities from the Type list.
  3. If the backend service is unavailable or hasn't yet pushed any values, or if you do not maintain the service, click + Value to add mock values that you can use for testing purposes. Typically, you would add these static values before the dynamic entity infrastructure is in place. These values are lost when you clone, version, or export a skill. After you provision the entity values through the API, you can overwrite, or retain, these values (though in most cases you would overwrite them).
  4. Click Create.

Tip:

If the API refreshes the entity values as you're testing the conversation, click Reset to restart the conversation.
A couple of notes for service developers:
  • You can query for the dynamic entities configured for a skill using the generated entityId with the botId. You include these values in the calls to create the push requests and objects that update the entity values.
  • An entity cannot have more than 150,000 values. To reduce the likelihood of exceeding this limit when you're dealing with large amounts of data, send PATCH requests with your deletions before you send PATCH requests with your additions.
Note

Dynamic entities are only supported on instances of Oracle Digital Assistant that were provisioned on Oracle Cloud Infrastructure (sometimes referred to as the Generation 2 cloud infrastructure). If your instance is provisioned on the Oracle Cloud Platform (as are all version 19.4.1 instances), then you can't use this feature.
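
The advice to send deletions before additions can be checked with simple arithmetic before any request is made. The sketch below does not call the Dynamic Entities API and does not show its request format. The plan_patches function and its batch labels are hypothetical, and it assumes that every deletion removes an existing value and every addition is a new value.

```python
from typing import List, Tuple

MAX_VALUES = 150_000  # limit on the number of values per entity, as stated above


def plan_patches(current_count: int, deletions: List[str], additions: List[str]) -> List[Tuple[str, List[str]]]:
    """Return the batches in sending order: deletions first, then additions."""
    final_count = current_count - len(deletions) + len(additions)
    if final_count > MAX_VALUES:
        raise ValueError(f"update would leave {final_count} values, above the limit")
    plan = []
    if deletions:
        plan.append(("delete", deletions))
    if additions:
        plan.append(("add", additions))
    return plan


# An entity that is one value below the limit: sending the three additions
# first would briefly reach 150,002 values, while deleting first stays within it.
plan = plan_patches(149_999, deletions=["old-1", "old-2"], additions=["new-1", "new-2", "new-3"])
print([operation for operation, _ in plan])
```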

Guidelines for Creating ML Entities

Here's a general approach to creating an ML Entity.
  1. Create concise ML Entities. The ML Entity definition is at the base of a useful training set, so clarity is key in terms of its name and description, which help crowd workers annotate utterances.

    Because crowd workers rely on the ML Entity descriptions and names, you must ensure that your ML Entities are easily distinguishable from each other, especially when there's potential overlap. If the differences are not clear to you, it's likely that crowd workers will be confused. For example, the Merchant and Account Type entities may be difficult to differentiate in some cases. In "Transfer $100 from my savings account to Pacific Gas and Electric," you can clearly label "savings" as Account Type and Pacific Gas and Electric as Merchant. However, the boundary between the two can be blurred in sentences like "Need to send money to John, transfer $100 from my savings to his checking account." Is "checking account" an Account type, or a Merchant name? In this case, you may decide that any recipient should always be a merchant name rather than an account type.

  2. In preparation of crowd sourcing the training utterances, consider the typical user input for different entity extraction contexts. For example, can the value be extracted in the user's initial message (initial utterance context), or is it extracted from responses to the skill's prompts (slot utterance context)?
    Context: Initial utterance context
    Description: A message that's usually well-structured and includes ML Entity values. For an expense reporting skill, for example, the utterance would include a value that the model can detect for an ML Entity called Merchant.
    Example utterance: Create an expense for team dinner at John's Pasta Shop for $85 on May 3

    Context: Slot utterance context
    Description: A user message that provides the ML Entity in response to a prompt, either because of conversation design (the skill prompts with "Who is the merchant?") or to slot a value because it hasn't been provided by a previously submitted response. In other circumstances, the ML Entity value may have already been provided, but may be included in other user messages in the same conversation. For example, the skill might prompt users to provide additional expense details or describe the image of an uploaded receipt.
    Example utterances:
    • Merchant is John's Pasta Shop.
    • Team dinner. Amount $85. John's Pasta Shop.
    • Description is TurboTaxi from home to CMH airport.
    • Grandiose Shack Hotel receipt for cloud symposium
  3. Gather your training and testing data.
    • If you already have a sufficient collection of utterances, you may want to assess them for entity distribution and entity value diversity before you launch an Entity Annotation job.
    • If you don't have enough training data, or if you're starting from scratch, launch an Intent Paraphrasing Job. To gather viable (and abundant) utterances for training and testing, integrate the entity context into the job by creating tasks for each intent. To gather diverse phrases, consider breaking down each intent by conversation context.
    • For the task's prompt, provide crowd workers context and ask them, "How would you respond?" or "What would you say?" Use the accompanying hints to provide examples and to illustrate different contexts. For example:
      Prompt: You're talking to an expense reporting bot, and you want to create an expense. What would be the first thing you would say?
      Hint: Ensure that the merchant name is in the utterance. You might say something like, "Create an expense for team dinner at John's Pasta Shop for $85 on May 3."
      This task asks for phrases that not only initiate the conversation, but also include a merchant name. You might also want utterances that reflect responses prompted by the skill when the user doesn't provide a value. For example, "Merchant is John's Pasta Shop" in response to the skill's "Who is the merchant?" prompt.
      Prompt: You've submitted an expense to an expense reporting bot, but didn't provide a merchant name. How would you respond?
      Hint: Identify the merchant. For example, "Merchant is John's Pasta Shop."

      Prompt: You've uploaded an image of a receipt to an expense reporting bot. It's now asking you to describe the receipt. How would you respond?
      Hint: Identify the merchant's name on the receipt. For example: "Grandiose Shack Hotel receipt for cloud symposium."
      To test for false positives (words and phrases that the model should not identify as ML Entities), you may also want to collect "negative examples". These utterances do not include an ML Entity value.
      Context: Initial utterance context
      Example utterance: Pay me back for Tuesday's dinner

      Context: Slot utterance context
      Example utterances:
      • Pos presentation dinner. Amount $50. 4 people.
      • Description xerox lunch for 5
      • Hotel receipt for interview stay
    • Gather a large training set by setting an appropriate number of paraphrases per intent. For the model to generalize successfully, your data set must contain somewhere between 500 and 5000 occurrences for each ML entity. Ideally, you should avoid the low end of this range.
  4. Once the crowd workers have completed the job (or have completed enough utterances that you can cancel the job), you can either add the utterances, or launch an Intent Validation job to verify them. You can also download the results to your local system for additional review.
  5. Reserve about 20% of the utterances for testing. To create CSVs for the Utterance Tester from the downloaded CSVs for Intent Paraphrasing and Intent Validation jobs:
    • For Intent Paraphrasing jobs: transfer the contents in the result column (the utterances provided by crowd workers) to the utterance column in the Utterance Tester CSV. Transfer the contents of the intentName column to the expectedIntent column in the Utterance Tester CSV.
    • For Intent Validation jobs: transfer the contents in the prompt column (the utterances provided by crowd workers) to the utterance column in the Utterance Tester CSV. Transfer the contents of the intentName column to the expectedIntent column in the Utterance Tester CSV.
  6. Add the remaining utterances to a CSV file with a single column, utterance. Create an Entity Annotation Job by uploading this CSV. Because workers are labeling the entity values, they will likely classify negative utterances as "I'm not sure" or "None of the entities apply."
  7. After the Entity Annotation job is complete, you can add the results, or you can launch an Entity Validation job to verify the labeling. Only the utterances that workers deem correct in an Entity Validation job can be added to the corpus.

    Tip:

    You can add, remove, or adjust the annotation labels in the Dataset tab of the Entities page.
  8. Click Train and select Entity to train the entity model.
  9. Run test cases to evaluate entity recognition using the utterances that you reserved from the Intent Paraphrasing job. You can divide up these utterances into different test suites to test different behaviors (unknown values, punctuation that may not be present in the training data, false positives, and so on). Because there may be a large number of these utterances, you can create test suites by uploading a CSV into the Utterance Tester.

    Note

    The Utterance Tester only displays entity labels for passing test cases. Use a Quick Test instead to view the labels for utterances that resolve below the confidence threshold.
  10. Use the results to refine the data set. Iteratively add, remove, or edit the training utterances until test run results indicate the model is effectively identifying ML Entities.
    Note

    To prevent inadvertent entity matches that degrade the user experience, switch on Exclude System Entity Matches if the training data contains names, locations, or numbers.
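
Steps 5 and 6 can be scripted once you've downloaded the results of an Intent Paraphrasing job. The following is a minimal sketch that reserves about 20% of the rows for the Utterance Tester and writes the rest to a single-column CSV for an Entity Annotation job. The column names come from the steps above, while the split_and_convert function, the fixed random seed, and the assumption that no further columns are required are illustrative only.

```python
import csv
import io
import random
from typing import Tuple


def split_and_convert(paraphrasing_csv: str, test_fraction: float = 0.2, seed: int = 0) -> Tuple[str, str]:
    """Return (Utterance Tester CSV, Entity Annotation job CSV) as text."""
    rows = list(csv.DictReader(io.StringIO(paraphrasing_csv)))
    random.Random(seed).shuffle(rows)  # seeded so that the split is repeatable
    test_count = round(len(rows) * test_fraction)
    test_rows, remaining_rows = rows[:test_count], rows[test_count:]

    tester = io.StringIO()
    tester_writer = csv.writer(tester, lineterminator="\n")
    tester_writer.writerow(["utterance", "expectedIntent"])
    for row in test_rows:
        # result -> utterance, intentName -> expectedIntent
        tester_writer.writerow([row["result"], row["intentName"]])

    annotation = io.StringIO()
    annotation_writer = csv.writer(annotation, lineterminator="\n")
    annotation_writer.writerow(["utterance"])
    for row in remaining_rows:
        annotation_writer.writerow([row["result"]])

    return tester.getvalue(), annotation.getvalue()


downloaded = "result,intentName\n" + "".join(
    f"sample utterance {i},CreateExpense\n" for i in range(10)
)
tester_csv, annotation_csv = split_and_convert(downloaded)
print(tester_csv)
```

For results downloaded from an Intent Validation job, the same approach applies with the prompt column in place of the result column.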

ML Entity Training Guidelines

The model generalizes an entity using both the context around a word (or words) and the lexical information about the word itself. For the model to generalize effectively, we recommend that the number of annotations per entity range somewhere between 500 and 5000. You may already have a training set that’s both large enough and has the variation of entity values that you’d expect from end users. If this is the case, you can launch an Entity Annotation job and then incorporate the results into the training data. However, if you don’t have enough training data, or if the data that you do have lacks sufficient coverage for all the ML entities, then you can collect utterances from crowd-sourced Intent Paraphrasing jobs.

Whatever the source, the distribution of entity values should reflect your general idea of the values that the model may encounter. To adequately train the model:
  • Do not overuse the same entity values in your training data. Repetitive entity values in your training data prevent the model from generalizing on unknown values. For example, you expect the ML Entity to recognize a variety of values, but the entity is represented by only 10-20 different values in your training set. In this case, the model will not generalize, even if there are two or three thousand annotations.
  • Vary the number of words for each entity value. If you expect users to input entity values that are three-to-five words long, but your training data is annotated with one- or two-word entity values, then the model may fail to identify the entity as the number of words increases. In some cases, it may only partially identify the entity. The model assumes the entity boundary from the utterances that you've provided. If you've trained the model on values with one or two words, then it assumes the entity boundary is only one or two words long. Adding entities with more words enables the model to recognize longer entity boundaries.
  • Utterance length should reflect your use case and the anticipated user input. You can train the model to detect entities for messages of varying lengths by collecting both short and long utterances. The utterances can even have multiple phrases. If you expect short utterances that reflect the slot-filling context, then gather your sample data accordingly. Likewise, if you're anticipating utterances for the initial context scenario, then the training set should contain complete phrases.
  • Include punctuation. If entity names require special characters, such as '-' and '/', include them in the entity values in the training data.
  • Ensure that all ML Entities are equally represented in your training data. An unbalanced training set has too many instances of one entity and too few of another. The models produced from unbalanced training sets sometimes fail to detect the entity with too few instances and over-predict for the entities with disproportionately high instances. This leads to false positives.
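
To assess a data set against these guidelines, you can tally the annotations for each ML Entity. The following is a minimal sketch that reads utterance objects in the JSON format used for ML Entity imports and reports the annotation count, the number of distinct values, and the spread of value lengths in words. The entity_stats function and the shape of its report are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def entity_stats(utterance_objects: List[dict]) -> Dict[str, dict]:
    """Summarize the annotations per ML Entity across a list of utterance objects."""
    annotation_counts: Counter = Counter()
    distinct_values = defaultdict(set)
    word_lengths = defaultdict(Counter)
    for item in utterance_objects:
        utterance = item["Utterance"]
        text = utterance["utterance"]
        for entity in utterance.get("entities", []):
            name = entity["entityName"]
            # Recover the labeled text from the character offsets.
            value = text[entity["beginOffset"]:entity["endOffset"]]
            annotation_counts[name] += 1
            distinct_values[name].add(value.casefold())
            word_lengths[name][len(value.split())] += 1
    return {
        name: {
            "annotations": annotation_counts[name],
            "distinct_values": len(distinct_values[name]),
            "words_per_value": dict(word_lengths[name]),
        }
        for name in annotation_counts
    }


data = [
    {"Utterance": {"utterance": "I expensed $35.64 for group lunch at Joe's on 4/7/21",
                   "languageTag": "en",
                   "entities": [{"entityName": "VendorName", "beginOffset": 37, "endOffset": 42}]}},
    {"Utterance": {"utterance": "Give me my $30 for Coffee Klatch on 7/20",
                   "languageTag": "en",
                   "entities": [{"entityName": "VendorName", "beginOffset": 19, "endOffset": 32}]}},
]
print(entity_stats(data))
```

A low ratio of distinct values to annotations points to overused values, a narrow words_per_value spread points to entity boundaries that are too uniform, and large differences in the annotation counts between entities point to an unbalanced set.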

ML Entity Testing Guidelines

Before you train your skill, you should reserve about 20% of unannotated utterances to find out how the model generalizes when presented with utterances or entity values that are not part of its training data. This set of utterances may not be your only testing set, depending on the behaviors you want to evaluate. For example:
  • Use only slot context utterances to find out how well the model predicts entities with less context.
  • Use utterances with "unknown" values to find out how well the model generalizes with values that are not present in the training data.
  • Use utterances without ML Entities to find out if the model detects any false positives.
  • Use utterances that contain ML Entity values with punctuation to find out how well the model performs with unusual entity values.