Table Function Operators Reference

The table function operator provides a list of built-in table functions that you can use in a data flow to return data in the form of tables.

See Table Function Operator.

Here's a list of the built-in table functions that are supported by the table function operator.

Cache

Persists and caches a dataset according to the given memory storage level.

Returns a new dataset in the cached resilient distributed dataset (RDD).

Parameter Description
level

The memory storage to use:

MEMORY_ONLY: Store as deserialized Java objects in the JVM. If memory is insufficient, some RDD partitions are not cached, but are recomputed as and when needed.

MEMORY_AND_DISK: Store as deserialized Java objects in the JVM. If memory is insufficient, some RDD partitions are stored on disk, and read from disk as and when needed.

MEMORY_ONLY_SER: Store as deserialized Java objects.

MEMORY_AND_DISK_SER: Store as deserialized Java objects. Partitions that don't fit are stored on disk.

DISK_ONLY: Store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc: Similar to other levels but each partition is replicated on two cluster nodes.

Cross tabulation analysis

Computes a pair-wise frequency table or contingency table from distinct values of the given two columns.

The number of distinct values for each column must be less than 1e4.

Returns a dataframe that contains the contingency table. In the contingency table:

  • The name of the first column is col1_col2, where col1 is the first column and col2 is the second column before transformation. The first column of each row are the distinct values of col1.
  • The other column names are the distinct values of col2.
  • The counts are returned as the long type.
  • Pairs without occurrences are given zero as the count.
  • The maximum number of non-zero pairs is 1e6.
  • Null elements are replaced by null.
Parameter Description
col1

Name of the first column. See the note after this table for limitations.

Distinct items from col1 make the first item of each row in the returned dataframe.

col2

Name of the second column. See the note after this table for limitations.

Distinct items from co2 make the column names in the returned dataframe.

Note

Only the following characters are allowed in a column name:

  • lower and upper case letters a-z A-Z
  • numbers 0-9
  • underscore _

Example

Suppose col1 is age and col2 is salary.

Before transformation:

+---+------+
|age|salary|
+---+------+
|1  |100   |
|1  |200   |
|2  |100   |
|2  |100   |
|2  |300   |
|3  |200   |
|3  |300   |
+---+------+

After transformation:

+----------+---+---+---+
|age_salary|100|200|300|
+----------+---+---+---+
|2         |2  |0  |1  |
|1         |1  |1  |0  |
|3         |0  |1  |1  |
+----------+---+---+---+
Cube

Generates a multi-dimensional cube of all possible combinations using the given columns list and the calculations in the given aggregate function expressions.

The following aggregate functions are supported in the expressions:

  • AVG
  • COUNT
  • MEAN
  • MAX
  • MIN
  • SUM

Returns a dataframe after the cube transformation.

Parameter Description
cubeColumns

The list of columns, separated by commas, by which to generate the multi-dimensional cube.

aggExpressions

The aggregate function expressions to perform on the columns.

For example: salary -> avg, age -> max

Example

Before transformation:

+----------+-----+-------+------+---+
|department|group|region |salary|age|
+----------+-----+-------+------+---+
|Eng       |A    |local  |1000  |10 |
|Art       |B    |foreign|2000  |11 |
|Sport     |C    |local  |1500  |13 |
|Eng       |A    |foreign|800   |19 |
|Art       |A    |foreign|3000  |30 |
|Eng       |E    |foreign|1000  |21 |
|Art       |D    |local  |1000  |32 |
|Sport     |C    |local  |1000  |28 |
+----------+-----+-------+------+---+

If cubeColumns is department, region and aggExpressions issalary -> avg, age -> max (compute the average salary and maximum age), the after transformation cube is:

+----------+-------+-----------------+--------+
|department|region |avg(salary)      |max(age)|
+----------+-------+-----------------+--------+
|Eng       |foreign|900.0            |21      |
|Sport     |null   |1250.0           |28      |
|Art       |null   |2000.0           |32      |
|Eng       |null   |933.3333333333334|21      |
|null      |local  |1125.0           |32      |
|Art       |foreign|2500.0           |30      |
|null      |foreign|1700.0           |30      |
|Eng       |local  |1000.0           |10      |
|null      |null   |1412.5           |32      |
|Sport     |local  |1250.0           |28      |
|Art       |local  |1000.0           |32      |
+----------+-------+-----------------+--------+
Data generation

Generates synthetic data using the given file format and content. The supported file formats are:

  • CSV
  • JSON

Returns a data entity. The synthetic data source can be used in pipeline unit-tests where real data is not used or moved.

Parameter Description
format

The type of file. Supported values are:

  • CSV
  • JSON
content

The file content for the given format.

Example

A CSV content example:

co1,co2,co3
1,B,3
A,2,C

Data generation result:

+---+---+---+
|co1|co2|co3|
+---+---+---+
|1  |B  |3  |
+---+---+---+
|A  |2  |C  |
+---+---+---+
Deduplication

Finds duplicates in the given column or columns list and returns a new dataset with the duplicate rows removed.

Parameter Description
columns

A column name or a comma-separated list of column names.

Example

Before transformation:

+---+---+
|a  |b  |
+---+---+
|1  |2  |
|3  |4  |
|1  |3  |
|2  |4  |
+---+---+

If columns is a, the dataset after transformation is:

+---+---+
|a  |b  |
+---+---+
|1  |2  |
|3  |4  |
|2  |4  |
+---+---+
Frequent items

Finds frequent items in the given column or columns list using the given minimum frequency.

The frequent element count algorithm proposed by Karl et al. is used to find frequent items. False positives are possible.

Returns a dataframe with an array of frequent items for each column.

Parameter Description
freqItems

A column name or a comma-separated list of column names for which you want to find frequent items.

support

Minimum frequency for an item to be considered frequent.

The value must be greater than 1e-4 (decimal value of 0.0001).

For example, 0.6 represents 60%. That is, an item that appears in at least 60% of the rows is a frequent item.

Example

Before transformation:

+-------+--------------+
|user_id|favorite_fruit|
+-------+--------------+
|      1|         apple|
|      2|        banana|
|      3|         apple|
|      4|        orange|
|      5|        banana|
|      6|        banana|
|      7|         apple|
+-------+--------------+

If freqItems is favorite_fruit and support is 0.6, the dataframe returned after transformation is:

+------------------------+
|favorite_fruit_freqItems|
+------------------------+
|                [banana]|
+------------------------+
Na Drop

Removes rows that contain null or NaN values in the given column or columns list.

Returns a dataframe that excludes the rows.

Parameter Description
how

How to determine if a row is to be dropped. Supported values are:

  • any: Drop the row that contains any null or NaN values in the specified cols.
  • all: Drop the row only if every specified cols is null or NaN for that row.

The how parameter is ignored if the minNonNulls parameter is specified.

cols

A column name or a comma-separated list of column names.

minNonNulls

The minimum number of non-null and non-NaN values that a row can contain.

Drops the rows that contain less than the specified minimum.

The minNonNulls parameter overrides the how parameter.

Example

Before transformation:

+----+-------+----+
|id  |name   |age |
+----+-------+----+
|1   |Alice  |25  |
|2   |null   |28  |
|null|Bob    |null|
|4   |Charlie|30  |
+----+-------+----+

If how is any and cols is name, the dataframe returned after transformation is:

+----+-------+----+
|id  |name   |age |
+----+-------+----+
|1   |Alice  |25  |
|null|Bob    |null|
|4   |Charlie|30  |
+----+-------+----+

When you use the how parameter with the value all, a row is dropped only if all its values are null. For example, before transformation:

+----+--------+----+
|  id|    name| age|
+----+--------+----+
|   1|John Doe|  30|
|   2|    null|null|
|   3|Jane Doe|  25|
|   4|Jake Doe|null|
|null|    null|null|
+----+--------+----+

After transformation:

+---+--------+----+
| id|    name| age|
+---+--------+----+
|  1|John Doe|  30|
|  2|    null|null|
|  3|Jane Doe|  25|
|  4|Jake Doe|null|
+---+--------+----+
Na Fill

Replaces null values and returns a dataframe using the replaced values.

Parameter Description
replacement

The key-value map to use for replacing null values. The key is a column name. The value is a replacement value.

The key points to the replacement value. For example: name -> Tom means replace null values in the name column with Tom.

A replacement value is cast to the column's data type. A replacement value must be of the following type:

  • Int
  • Long
  • Float
  • Double
  • String
  • Boolean

Example

Before transformation:

+----+-------+----+
|id  |name   |age |
+----+-------+----+
|1   |Alice  |25  |
|2   |null   |28  |
|null|Bob    |null|
|4   |Charlie|30  |
+----+-------+----+

If the key-value map is as follows:

id -> 3

name -> unknown

age -> 10

After transformation:

+---+-------+---+
|id |name   |age|
+---+-------+---+
|1  |Alice  |25 |
|2  |unknown|28 |
|3  |Bob    |10 |
|4  |Charlie|30 |
+---+-------+---+
Na Replace

Replaces a value with another value in the given column or columns by matching keys in the given key and replacement value map.

Returns a new dataframe that contains the replaced transformation.

Parameter Description
cols

A column name or a comma-separated list of column names on which to apply replacement values.

If cols is *, then replacement is applied on all String, Numeric, or Boolean columns.

replacement

The replacement key-value map to use. The key is a value to replace. The value is the replacement value. The map value can have nulls.

The key points to the replacement value. For example: Alice -> Tom means replace Alice with Tom.

The key and replacement value pair must have the same type. Only the following types are supported:

  • Boolean
  • Double
  • String

Example

Before transformation:

+----+-------+----+
|id  |name   |age |
+----+-------+----+
|1   |Alice  |25  |
|2   |null   |28  |
|null|Bob    |null|
|4   |Charlie|30  |
+----+-------+----+

If cols is name, and replacement is Alice -> Tom, the dataframe after transformation is:

+----+-------+----+
|id  |name   |age |
+----+-------+----+
|1   |Tom    |25  |
|2   |null   |28  |
|null|Bob    |null|
|4   |Charlie|30  |
+----+-------+----+
Repartition

Divides a dataset by the given columns into the specified number of partitions.

Returns a new dataset that's hash partitioned. The exact number of partition as specified is returned.

Parameter Description
partitionColumns

A column name or comma-separated list of column names by which the dataset is partitioned.

numberOfPartitions

The number of partitions to create.

Rollup

Generates a multi-dimensional rollup of possible combinations using the given columns list and the calculations in the given aggregate function expressions.

The following aggregate functions are supported in the expressions:

  • AVG
  • COUNT
  • MEAN
  • MAX
  • MIN
  • SUM

Returns a dataframe after the rollup transformation.

Parameter Description
rollupColumns

The list of columns, separated by commas, by which to generate the multi-dimensional rollup.

aggExpressions

The aggregate function expressions to perform on the columns.

For example: salary -> avg, age -> max

Example

Before transformation:

+----------+-----+-------+------+---+
|department|group|region |salary|age|
+----------+-----+-------+------+---+
|Eng       |A    |local  |1000  |10 |
|Art       |B    |foreign|2000  |11 |
|Sport     |C    |local  |1500  |13 |
|Eng       |A    |foreign|800   |19 |
|Art       |A    |foreign|3000  |30 |
|Eng       |E    |foreign|1000  |21 |
|Art       |D    |local  |1000  |32 |
|Sport     |C    |local  |1000  |28 |
+----------+-----+-------+------+---+

If rollupColumns is department, region and aggExpressions is salary -> avg, age -> max (compute the average salary and maximum age), the after transformation rollup is:

+----------+-------+-----------------+--------+
|department|region |avg(salary)      |max(age)|
+----------+-------+-----------------+--------+
|Eng       |foreign|900.0            |21      |
|Sport     |null   |1250.0           |28      |
|Art       |null   |2000.0           |32      |
|Eng       |null   |933.3333333333334|21      |
|Art       |foreign|2500.0           |30      |
|Eng       |local  |1000.0           |10      |
|null      |null   |1412.5           |32      |
|Sport     |local  |1250.0           |28      |
|Art       |local  |1000.0           |32      |
+----------+-------+-----------------+--------+
Spark SQL

Runs Spark SQL queries on incoming data, by first creating temporary views using the given table names.

Parameter Description
SQL

The SQL statement or script to run.

Example: select * from table1

tableName

A table name or comma-separated list of table names by which Spark creates temporary tables.

Example: table1, table2

Stratified sampling

Generates a stratified sample without replacement based on the given sampling fraction for each stratum.

Returns a new dataframe that represents the stratified sample.

Parameter Description
strata_col

The column that defines the strata.

fractions

The sampling fraction for each stratum, from 0.0 to 1.0.

For example, 0.1 returns 10% of the rows, and 1.0 returns 100% of the rows.

If a fraction is not specified, zero is assumed.

sample_size

If fractions is not specified, specify a portion of the dataset to be sampled, from 0.0 to 1.0.

seed

Use any random number in seed to return the same sample every time, keeping the result consistent.

Example

Before transformation:

+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  1|    2|
|  2|    1|
|  2|    1|
|  2|    3|
|  3|    2|
|  3|    3|
+---+-----+

If strata_col is key, and fractions is the following:

1 -> 1.0, 3 -> 0.5

After transformation:

+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  1|    2|
|  3|    2|
+---+-----+
Summary

Computes the given statistics for numeric and string columns.

If no statistics are provided, all the following are computed:

  • COUNT
  • MEAN
  • STDDEV
  • MIN
  • approximate quartiles (percentiles at 25%, 50%, and 75%)
  • MAX
Parameter Description
statistics

A comma-separated list of statistics.

Supported values are:

  • COUNT
  • MEAN
  • STDDEV
  • MIN
  • one or more arbitrary approximate percentiles specified as a percentage (for example, 75%)
  • MAX

Example: count, mean, stddev, min, 27%, 41%, 95%, max

Example

Before transformation:

+----------+------+-----------------+--------+
|department|gender|avg(salary)      |max(age)|
+----------+------+-----------------+--------+
|Eng       |female|900.0            |21      |
|Sport     |null  |1250.0           |28      |
|Art       |null  |2000.0           |32      |
|Eng       |null  |933.3333333333334|21      |
|Art       |female|2500.0           |30      |
|Eng       |male  |1000.0           |10      |
|null      |null  |1412.5           |32      |
|Sport     |male  |1250.0           |28      |
|Art       |male  |1000.0           |32      |
+----------+------+-----------------+--------+

Suppose statistics is count, mean, stddev, min, 27%, 41%, 95%, max.

After transformation:

+-------+----------+-----+------+----------------+----------------+
|summary|department|group|gender|salary          |age             |
+-------+----------+-----+------+----------------+----------------+
|count  |8         |8    |8     |8               |8               |
|mean   |null      |null |null  |1412.5          |20.5            |
|stddev |null      |null |null  |749.166203188585|8.76682056718072|
|min    |Art       |A    |female|800             |10              |
|27%    |null      |null |null  |1000            |13              |
|41%    |null      |null |null  |1000            |19              |
|95%    |null      |null |null  |3000            |32              |
|max    |Sport     |E    |male  |3000            |32              |
+-------+----------+-----+------+----------------+----------------+