Language detects, classifies, and provides options to de-identify personally identifiable information (PII) in unstructured text.
Use Cases
Detecting and curating private information in user feedback
Many organizations collect user feedback through various channels such as product reviews, return requests, support tickets, and feedback forums. With automatic detection of PII entities, the Language PII detection service lets you proactively warn users about sharing private data, and lets applications take measures such as anonymizing or masking posted feedback before storing it.
Scanning object storage for presence of sensitive data
Cloud storage solutions such as OCI Object Storage are widely used by employees to store business documents in locations that are either locally controlled or shared by many teams. Ensuring that such shared locations don't store private information such as employee names, demographics, and payroll information requires automatic scanning of all the documents for the presence of PII. The OCI Language PII model provides a batch API to process many text documents at scale.
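As a rough illustration of the batch scanning use case, the following sketch groups document texts into batches, giving each document a unique key so that results can be matched back to their inputs. The helper name `make_batches`, the default `batch_size`, and the key format are assumptions made for this example; check the service limits for the actual maximum number of documents per request.

```python
from typing import Dict, Iterable, Iterator, List


def make_batches(texts: Iterable[str], batch_size: int = 10,
                 language_code: str = "en") -> Iterator[List[Dict[str, str]]]:
    """Yield lists of document dicts, at most batch_size documents per list.

    batch_size is an illustrative value, not a documented service limit.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[Dict[str, str]] = []
    for index, text in enumerate(texts, start=1):
        # Each document gets a unique key so results can be matched to inputs.
        batch.append({"key": str(index), "text": text,
                      "languageCode": language_code})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


batches = list(make_batches(["first file", "second file", "third file"],
                            batch_size=2))
print(len(batches), "batches")
```

Each yielded list has the same shape as the `documents` array of a PII detection request.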
Supported Entities
The following table describes the entity types that PII detection can extract.
| Entity Type | Description |
| --- | --- |
| PERSON | Person name |
| ADDRESS | Address |
| AGE | Age |
| DATE_TIME | Date or time |
| SSN_OR_TAXPAYER | Social security number or taxpayer ID (US) |
| EMAIL | Email |
| PASSPORT_NUMBER_US | Passport number (US) |
| TELEPHONE_NUMBER | Telephone or fax (US) |
| DRIVER_ID_US | Driver identification number (US) |
| BANK_ACCOUNT_NUMBER | Bank account number (US) |
| BANK_SWIFT | Bank account (SWIFT) |
| BANK_ROUTING | Bank routing number |
| CREDIT_DEBIT_NUMBER | Credit or debit card number |
| IP_ADDRESS | IP address (both IPv4 and IPv6) |
| MAC_ADDRESS | MAC address |
The following are secret types:
| Entity Type | Description |
| --- | --- |
| COOKIE | Website cookie |
| XSRF TOKEN | Cross-Site Request Forgery (XSRF) token |
| AUTH_BASIC | Basic authentication |
| AUTH_BEARER | Bearer authentication |
| JSON_WEB_TOKEN | JSON Web Token |
| PRIVATE_KEY | Cryptographic private key |
| PUBLIC_KEY | Cryptographic public key |
The following are OCI account credentials, the authentication information required to access and manage resources within OCI. These credentials let users, applications, and services authenticate securely when interacting with OCI services and resources.
| Entity Type | Description |
| --- | --- |
| OCI_OCID_USER | OCI user |
| OCI_OCID_TENANCY | Tenancy OCID (Oracle Cloud Identifier) |
| OCI_SMTP_USERNAME | SMTP (Simple Mail Transfer Protocol) username |
| OCI_OCID_REFERENCE | OCID reference |
| OCI_FINGERPRINT | OCI fingerprint |
| OCI_CREDENTIAL | OCI auth token, OAuth credential, and SMTP credential |
| OCI_PRE_AUTH_REQUEST | OCI pre-authenticated request |
| OCI_STORAGE_SIGNED_URL | OCI storage signed URL |
| OCI_CUSTOMER_SECRET_KEY | OCI customer secret key |
| OCI_ACCESS_KEY | OCI access keys or security credentials |
Examples
Input Text:
Hello Support Team,
I am reaching out to seek help with my credit card number
5111 1111 1111 1118 expiring on 11/23. There was a
suspicious transaction on 12-Aug-2022 which I reported by
calling from my mobile number +1 (650) 555-0190 also I
emailed from my email id sarah.jones1234@hotmail.com. Would
you please let me know the refund status?
Regards,
Sarah
Output Text Masked with "*":
Hello Support Team, I am reaching out to seek help with my
credit card number ******************* expiring on *****.
There was a suspicious transaction on *********** which I
reported by calling from my mobile number ** **************
also I emailed from my email id ***************************.
Would you please let me know the refund status? Regards,
*****
The JSON for the example is:
Sample Request
POST https://<region-url>/20210101/actions/batchDetectLanguagePiiEntities
API Request format:
{
  "documents": [
    {
      "languageCode": "en",
      "key": "1",
      "text": "Hello Support Team, I am reaching out to seek help with my credit card number 5111 1111 1111 1118 expiring on 11/23. There was a suspicious transaction on 12-Aug-2022 which I reported by calling from my mobile number +1 (650) 555-0190 also I emailed from my email id sarah.jones1234@hotmail.com. Would you please let me know the refund status? Regards, Sarah"
    }
  ],
  "compartmentId": "ocid1.tenancy.oc1..aaaaaaaadany3y6wdh3u3jcodcmm42ehsdno525pzyavtjbpy72eyxcu5f7q",
  "masking": {
    "ALL": {
      "mode": "MASK",
      "isUnmaskedFromEnd": true,
      "leaveCharactersUnmasked": 4
    }
  }
}
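For applications that assemble this request body in code, a minimal Python sketch follows. It only builds and serializes a dictionary with the same fields as the sample request; it doesn't call the service. The function name `build_pii_request`, its parameters, and the `<compartment-ocid>` placeholder are assumptions made for this example.

```python
import json
from typing import Any, Dict, List


def build_pii_request(texts: List[str], compartment_id: str,
                      leave_unmasked: int = 4,
                      unmasked_from_end: bool = True) -> Dict[str, Any]:
    """Build a request body shaped like the sample request (hypothetical helper)."""
    return {
        "documents": [
            # Keys are sequential strings, as in the sample ("1", "2", ...).
            {"languageCode": "en", "key": str(i), "text": text}
            for i, text in enumerate(texts, start=1)
        ],
        "compartmentId": compartment_id,
        "masking": {
            "ALL": {
                "mode": "MASK",
                "isUnmaskedFromEnd": unmasked_from_end,
                "leaveCharactersUnmasked": leave_unmasked,
            }
        },
    }


body = build_pii_request(["My email is sarah.jones1234@hotmail.com"],
                         compartment_id="<compartment-ocid>")
print(json.dumps(body, indent=2))
```

The serialized output can be used as the body of the POST request shown in the sample.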
Response JSON:
{
  "documents": [
    {
      "key": "1",
      "entities": [
        {
          "offset": 79,
          "length": 19,
          "type": "CREDIT_DEBIT_NUMBER",
          "text": "5111 1111 1111 1118",
          "score": 0.75,
          "isCustom": false
        },
        {
          "offset": 111,
          "length": 5,
          "type": "DATE_TIME",
          "text": "11/23",
          "score": 0.9992455840110779,
          "isCustom": false
        },
        {
          "offset": 156,
          "length": 11,
          "type": "DATE_TIME",
          "text": "12-Aug-2022",
          "score": 0.998766303062439,
          "isCustom": false
        },
        {
          "offset": 218,
          "length": 2,
          "type": "TELEPHONE_NUMBER",
          "text": "+1",
          "score": 0.6941494941711426,
          "isCustom": false
        },
        {
          "offset": 221,
          "length": 14,
          "type": "TELEPHONE_NUMBER",
          "text": "(650) 555-0190",
          "score": 0.9527066349983215,
          "isCustom": false
        },
        {
          "offset": 268,
          "length": 27,
          "type": "EMAIL",
          "text": "sarah.jones1234@hotmail.com",
          "score": 0.95,
          "isCustom": false
        },
        {
          "offset": 354,
          "length": 5,
          "type": "PERSON",
          "text": "Sarah",
          "score": 0.9918518662452698,
          "isCustom": false
        }
      ],
      "languageCode": "en",
      "maskedText": "Hello Support Team, \nI am reaching out to seek help with my credit card number ***************1118 expiring on *1/23. There was a suspicious transaction on *******2022 which I reported by calling from my mobile number +1 **********0190 also I emailed from my email id ***********************.com. Would you please let me know the refund status?\nRegards,\n*arah"
    }
  ],
  "errors": []
}
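The partial masking in this response is consistent with keeping the last `leaveCharactersUnmasked` characters of each entity and masking the rest, with entities no longer than that limit (such as "+1") left unchanged. The following Python sketch reproduces that behavior locally. It is an interpretation inferred from the sample, not the service's implementation, and the function names `mask_entity` and `apply_masking` are assumptions made for this example.

```python
from typing import Dict, List


def mask_entity(text: str, leave_unmasked: int = 4, from_end: bool = True,
                mask_char: str = "*") -> str:
    """Mask all but leave_unmasked characters of one entity's text."""
    if leave_unmasked <= 0:
        return mask_char * len(text)
    if len(text) <= leave_unmasked:
        return text  # assumption: short entities stay unmasked, like "+1"
    hidden = mask_char * (len(text) - leave_unmasked)
    if from_end:
        return hidden + text[-leave_unmasked:]
    return text[:leave_unmasked] + hidden


def apply_masking(text: str, entities: List[Dict], **options) -> str:
    """Mask each entity span, given as character offset and length."""
    result = text
    # Right-to-left keeps offsets valid even if a replacement changes length.
    for entity in sorted(entities, key=lambda e: e["offset"], reverse=True):
        start = entity["offset"]
        end = start + entity["length"]
        result = result[:start] + mask_entity(result[start:end], **options) + result[end:]
    return result


sample = "Call (650) 555-0190 or write to sarah.jones1234@hotmail.com"
entities = [
    {"offset": 5, "length": 14, "type": "TELEPHONE_NUMBER"},
    {"offset": 32, "length": 27, "type": "EMAIL"},
]
print(apply_masking(sample, entities))
```

Spaces and punctuation inside an entity are masked like any other character, which matches the masked credit card number in the sample response.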
Configuring PII or PHI Text Output
In the Language service, you can configure the PII/PHI output when analyzing text.
1. In the PII or PHI section, click Configure in the Output section.
2. Select PII from the dropdown.
3. Select from the following:
   - Mask: Select to include or exclude entities.
   - Anonymization exclusion list: Enter entities to exclude from the UI output and the SDK output.
   - Include excluded entities from masking in detected entities: Select to include the entity that was excluded from the output in the UI, but to continue to include the entity in the SDK output.
   - Masking character: The character used to mask the input text.
   - Replace: Replace PII entities with a given sequence of characters.
   - Remove: Remove PII entities from the output.
4. Click Save changes.
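To make the difference between the mask, replace, and remove options concrete, the following Python sketch applies each one to a short text with a single detected entity. It is a local illustration only. The mode name `MASK` appears in the sample request, while `REPLACE` and `REMOVE`, the `deidentify` function, and the `<REDACTED>` replacement string are assumptions made for this example.

```python
from typing import Dict, List


def deidentify(text: str, entities: List[Dict], mode: str = "MASK",
               mask_char: str = "*", replace_with: str = "<REDACTED>") -> str:
    """Apply one de-identification mode to every entity span in text."""
    result = text
    # Right-to-left so that length changes don't shift earlier offsets.
    for entity in sorted(entities, key=lambda e: e["offset"], reverse=True):
        start = entity["offset"]
        end = start + entity["length"]
        if mode == "MASK":
            substitute = mask_char * entity["length"]
        elif mode == "REPLACE":
            substitute = replace_with
        elif mode == "REMOVE":
            substitute = ""
        else:
            raise ValueError(f"unknown mode: {mode}")
        result = result[:start] + substitute + result[end:]
    return result


text = "Regards, Sarah"
spans = [{"offset": 9, "length": 5, "type": "PERSON"}]
for mode in ("MASK", "REPLACE", "REMOVE"):
    print(mode, "->", repr(deidentify(text, spans, mode)))
```

Masking preserves the length of the text, whereas replacing and removing change it, which matters if downstream code relies on the original offsets.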
PII Rules
Custom PII Rules
| Key | Type | Description |
| --- | --- | --- |
| ruleId | String | Unique identifier for the rule. |
| regex | String | Regular expression pattern to match custom data types. For example, ([A-Z]{5}[0-9]{4}[A-Z]{1}) to match a PAN card number. |
| type | String | Name of the entity type to match. For example, PAN_CARD. |
| prefix | List<String> | Words or phrases to look for within maxDistance of the regex-detected word. |
| suffix | List<String> | Words or phrases to look for within maxDistance of the regex-detected word. |
| isCaseSensitive | Boolean | Whether matching treats uppercase and lowercase letters as distinct: true means case sensitive, false means case insensitive. |
| maxDistance | Integer | Maximum allowed distance, in characters, between the prefix or suffix and the matched pattern, so that the pattern is found close to the prefix or suffix. |
| priority | Integer | Priority of the rule, from 1 to 50, where 1 is the highest priority. For example, if two rules have the same regex but different prefix and suffix, the rule with the higher priority is considered. |
| regexOnly | Boolean | If true, removes model-detected entities that match the rule regex. For example, in the sentence "I am 25 years old and he is 11 months old," with the suffix set to ["years"]: if regexOnly is true, only 25 is detected, because the suffix "months" doesn't match the specified suffix "years"; if regexOnly is false, both 25 and 11 are detected, 25 from the rule (due to the suffix "years") and 11 from the model. |
| filterEntityTypes | List<String> | OCI entity types to filter. For example, [PERSON, AGE] filters the entity types PERSON and AGE out of model detections. If set to [ALL], all model-detected entities are filtered out, so detection is regex based only and ignores predefined model entities. |
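The interaction of regex, suffix, and maxDistance in the age example can be sketched locally in Python. This is an illustration under stated assumptions, not the service's matching logic: the function `apply_rule` is hypothetical, distance is measured as a window of maxDistance characters directly before or after the match, a rule that lists both prefix and suffix requires both, and the sample values for ruleId and maxDistance are made up. The sketch doesn't model priority, regexOnly, or filterEntityTypes, because those depend on model detections.

```python
import re
from typing import Dict, List


def apply_rule(text: str, rule: Dict) -> List[Dict]:
    """Return regex matches that have the required prefix/suffix nearby."""
    case_sensitive = rule.get("isCaseSensitive", False)  # assumed default
    flags = 0 if case_sensitive else re.IGNORECASE
    max_distance = rule.get("maxDistance", 20)  # assumed default

    def norm(value: str) -> str:
        return value if case_sensitive else value.lower()

    prefixes = rule.get("prefix", [])
    suffixes = rule.get("suffix", [])
    found = []
    for match in re.finditer(rule["regex"], text, flags):
        # Context windows of max_distance characters around the match.
        before = norm(text[max(0, match.start() - max_distance):match.start()])
        after = norm(text[match.end():match.end() + max_distance])
        prefix_ok = not prefixes or any(norm(p) in before for p in prefixes)
        suffix_ok = not suffixes or any(norm(s) in after for s in suffixes)
        if prefix_ok and suffix_ok:
            found.append({"type": rule["type"], "text": match.group(0),
                          "offset": match.start(),
                          "length": match.end() - match.start()})
    return found


age_rule = {
    "ruleId": "age-by-suffix",  # hypothetical identifier
    "regex": r"\b\d+\b",
    "type": "AGE",
    "suffix": ["years"],
    "isCaseSensitive": False,
    "maxDistance": 10,
}
hits = apply_rule("I am 25 years old and he is 11 months old", age_rule)
print([hit["text"] for hit in hits])
```

Only 25 is returned here, because "years" appears within 10 characters after it, while 11 is followed by "months". This corresponds to the regexOnly set to true case in the table, where no model detections are added.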