Harvesting Object Storage Files as Logical Data Entities

A data lake typically contains many files that together represent a single data set. File naming conventions usually indicate which files belong to the same logical data entity.

You can group multiple Object Storage files into logical data entities in Data Catalog using filename patterns. A logical data entity is like any other data entity and can be used for search and discovery. Using logical data entities, you can organize your data lake content meaningfully and prevent the explosion of data entities and attributes in your data catalog.

Typical tasks you perform while harvesting Object Storage files as logical data entities:

  1. Create a pattern.
  2. Assign the pattern to an Object Storage data asset.
  3. Harvest the data asset.
  4. View harvested logical data entities.

Understanding Logical Data Entities

Consider the following set of files:

myserv/20191205_yny_myIOTSensor.json
myserv/20191105_yny_myIOTSensor.json
myserv/20191005_yny_myIOTSensor.json
myserv/20190905_yny_myIOTSensor.json
myserv/20191005_hyd_my2ndIOTSensor.json
myserv/20190905_hyd_my2ndIOTSensor.json
myserv/20191005_bom_my3rdIOTSensor.json
myserv/20190905_bom_my3rdIOTSensor.json
myserv/somerandomfile_2019AUG05.json

If you harvest these files from your Oracle Object Storage data source without creating filename patterns, Data Catalog creates nine individual data entities in your data catalog, one for each file. With hundreds of files in your data source, you would end up with hundreds of data entities in your data catalog.

Using filename patterns, you can group the example set of files into logical data entities. Any files that are not matched are created as separate File type data entities.
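
The effect of such a grouping can be sketched outside Data Catalog with a plain Java regular expression. This is only an illustration and not how the service is implemented: the class name, the pattern, the named group `entity`, and the decision to group by the text that follows the date prefix are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogicalEntityGrouping {
    // The example files from this section.
    static final List<String> SAMPLE = List.of(
        "myserv/20191205_yny_myIOTSensor.json",
        "myserv/20191105_yny_myIOTSensor.json",
        "myserv/20191005_yny_myIOTSensor.json",
        "myserv/20190905_yny_myIOTSensor.json",
        "myserv/20191005_hyd_my2ndIOTSensor.json",
        "myserv/20190905_hyd_my2ndIOTSensor.json",
        "myserv/20191005_bom_my3rdIOTSensor.json",
        "myserv/20190905_bom_my3rdIOTSensor.json",
        "myserv/somerandomfile_2019AUG05.json");

    // Hypothetical pattern: an 8-digit date prefix, then the text used as the group name.
    static final Pattern FILE_PATTERN =
        Pattern.compile("myserv/\\d{8}_(?<entity>\\w+)\\.json");

    // Groups paths by the captured name; an unmatched path becomes its own group.
    static Map<String, List<String>> group(List<String> paths) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String path : paths) {
            Matcher m = FILE_PATTERN.matcher(path);
            String key = m.matches() ? m.group("entity") : path;
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(path);
        }
        return groups;
    }

    public static void main(String[] args) {
        group(SAMPLE).forEach((name, files) ->
            System.out.println(name + " <- " + files.size() + " file(s)"));
    }
}
```

Under these assumptions, the nine example files collapse into three groups (`yny_myIOTSensor`, `hyd_my2ndIOTSensor`, and `bom_my3rdIOTSensor`), while `myserv/somerandomfile_2019AUG05.json` does not match and stays on its own.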

Understanding Expressions

In Data Catalog, a filename pattern is defined using expressions.

An expression can have one or more components that you separate using a delimiter. Each component specifies a matching rule for the pattern. Filename patterns are created using Java regular expressions. You specify the regular expression that should be used to group your files into required logical data entities.
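
The idea of delimiter-separated components, each acting as a matching rule, can be sketched as follows. This is a simplified illustration rather than the Data Catalog parser: it assumes that `/` is the delimiter, that each component is matched against the corresponding path segment, and that no component regular expression itself contains `/`. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class ComponentMatcher {
    // Assumption: "/" separates the components of the expression and the segments of the path.
    static boolean matches(String expression, String path) {
        String[] components = expression.split("/");
        String[] segments = path.split("/");
        if (components.length != segments.length) {
            return false;
        }
        for (int i = 0; i < components.length; i++) {
            // Each component is a Java regular expression that must match its whole segment.
            if (!Pattern.matches(components[i], segments[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String expression = "myserv/\\d{8}_\\w+\\.json";
        System.out.println(matches(expression, "myserv/20191205_yny_myIOTSensor.json"));
        System.out.println(matches(expression, "myserv/somerandomfile_2019AUG05.json"));
    }
}
```

Here the first component is static text and the second is a regular expression, so the dated sensor file matches and the file without an 8-digit date prefix does not.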

You can specify qualifiers that are used when the expression is parsed. The following qualifiers are available:

  • bucketName: Use this qualifier to specify that the bucket name should be derived from the path that matches the given expression. Use the bucketName qualifier only once in an expression, and always as the first component. The bucketName qualifier value can be static text or an expression.
  • logicalEntity: Use this qualifier to specify that the logical data entity name should be derived from the path that matches the given expression. You can use logicalEntity multiple times in an expression. The logicalEntity qualifier values can consist of static text or expressions.
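
How the two qualifiers could drive name extraction is sketched below. Several things here are assumptions and not taken from this section: the `{qualifier:regex}` notation, treating `myserv` as the bucket name, and concatenating the values of multiple logicalEntity qualifiers in order. The sketch rewrites each qualifier into a Java named capturing group; it does not support a `}` inside a qualifier value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QualifierExpression {
    // Assumed notation: {bucketName:regex} or {logicalEntity:regex}, with no '}' in the regex.
    private static final Pattern QUALIFIER =
        Pattern.compile("\\{(bucketName|logicalEntity):([^}]+)\\}");

    private final Pattern compiled;
    private final List<String> entityGroups = new ArrayList<>();
    private boolean hasBucket;

    QualifierExpression(String expression) {
        Matcher m = QUALIFIER.matcher(expression);
        StringBuilder regex = new StringBuilder();
        int last = 0;
        while (m.find()) {
            regex.append(expression, last, m.start()); // copy text between qualifiers unchanged
            String groupName;
            if (m.group(1).equals("bucketName")) {
                // bucketName is allowed once, and only as the first component.
                if (hasBucket || m.start() != 0) {
                    throw new IllegalArgumentException("bucketName must appear once, first");
                }
                hasBucket = true;
                groupName = "bucketName";
            } else {
                // Java group names must be unique, so repeated qualifiers are numbered.
                groupName = "logicalEntity" + entityGroups.size();
                entityGroups.add(groupName);
            }
            regex.append("(?<").append(groupName).append('>').append(m.group(2)).append(')');
            last = m.end();
        }
        regex.append(expression.substring(last));
        compiled = Pattern.compile(regex.toString());
    }

    // Returns {bucketName, logicalEntityName}, or null when the path does not match.
    String[] resolve(String path) {
        Matcher m = compiled.matcher(path);
        if (!m.matches()) {
            return null;
        }
        StringBuilder entity = new StringBuilder();
        for (String group : entityGroups) {
            entity.append(m.group(group)); // assumption: multiple values are concatenated
        }
        return new String[] { hasBucket ? m.group("bucketName") : null, entity.toString() };
    }

    public static void main(String[] args) {
        QualifierExpression expr =
            new QualifierExpression("{bucketName:myserv}/\\d{8}_{logicalEntity:\\w+}\\.json");
        String[] result = expr.resolve("myserv/20191005_hyd_my2ndIOTSensor.json");
        System.out.println(result[0] + " / " + result[1]);
    }
}
```

In this sketch, the bucketName value is static text and the logicalEntity value is an expression, which mirrors the two kinds of qualifier values described above.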