Working with Notebooks
Oracle Big Data Service uses the Big Data Studio notebook application as its notebook interface and coding environment.
Creating a Notebook
You can perform many actions in a notebook. For information about some of them, see Overview of the Notebooks User Interface.
After creating a notebook, you can create paragraphs in the notebook. See Creating Paragraphs.
Overview of the Notebooks User Interface
Below are some of the most common screen elements you'll see when working within a notebook. Mouse over icons in the user interface to see their functions.
Icon | Description |
---|---|
Modify Notebook | Modifies details of a notebook, such as name, description, and tags. |
Run Paragraphs | Executes all paragraphs in a notebook in sequential order. |
Invalidate Session | Resets any connection or code executed in a notebook. |
Delete Notebook | Deletes a notebook. |
Hide/Show Code | Hides or shows the code section in all paragraphs in a notebook. |
Hide/Show Result | Hides or shows the results section in all paragraphs in a notebook. |
Readonly/Write | Sets the notebook to read-only or write mode. |
Hide/Show Panel | Shows or hides the paragraph settings bar commands, results toolbar, and settings dialog for a selected paragraph in a panel to the right of the notebook. |
Versioning | Creates a version of a notebook or displays existing versions. |
Clear Result | Clears results for all paragraphs in a notebook. |
Clear Paragraph Dependencies | Removes all defined paragraph dependencies. |
Open as Iframe | Opens a notebook in an iframe, which allows the notebook to be embedded inside another web page. |
Share Notebook | Shares a notebook. |
Clone Notebook | Creates a copy of a notebook. |
Export Notebook | Exports a notebook to your computer as a .dsnb file. |
Zeppelin/Jupyter | Sets the preferred layout of a notebook (Zeppelin or Jupyter layout format). |
Default Template/another template | Applies the overall look and feel of the notebook using the default template or another template. |
Importing a Notebook
You can import notebooks from your local computer into the Big Data Studio notebook application. This includes Jupyter notebooks.
- Access the notebook application. See Accessing Big Data Studio.
- On the Notebooks page, click the Import Notebooks icon in the upper right.
- Drag and drop or browse for the notebook you want to import. Valid file extensions are .json, .dsnb, and .ipynb.
- Click Import.
Searching for a Notebook
You can search for notebooks by title, description, or tag.
- Access the notebook application. See Accessing Big Data Studio.
- On the Notebooks page, enter text in the search box or click a tag. Clicking a tag displays all notebooks with that tag.
Cloning a Notebook
When you clone a notebook, you make a copy of it.
Supported Interpreters
The Big Data Studio notebook application provides interpreters that execute code in different languages. The interpreters listed below are supported.
Each interpreter has a set of properties that can be customized by an administrator and applied across all notebooks. See Configuring Interpreters.
Start a paragraph with %interpreter as shown below, and then write your paragraph with the corresponding syntax.
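For instance, a paragraph that uses the python interpreter could look like the following minimal sketch (the code itself is illustrative):

```
%python
# A minimal paragraph using the python interpreter (illustrative code)
print("Hello from Big Data Studio")
```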
The Spark and PySpark interpreters support YARN client mode. Running the interpreters in YARN client mode means that a Spark driver is started by Big Data Studio and Spark application commands are sent to YARN node managers to be run. This way YARN manages memory instead of the cluster node.
Type | Usage |
---|---|
graalvm | Specify the language type and then use the syntax for that language: %python-graalvm, %js-graalvm, %java-graalvm, %ruby-graalvm |
jdbc (default) | Database settings must be configured correctly. Administrators should go to Interpreters > jdbc > jdbc(default) in the Big Data Studio web UI and verify or configure the required settings. |
jdbc (Cloud SQL) | To use this interpreter, Cloud SQL must be added to the cluster. See Adding Cloud SQL. Cloud SQL enables you to use SQL to query your big data sources. See Using Cloud SQL with Big Data. Administrators must also change the default password used with the interpreter. |
jdbc (Hive) | On unsecured clusters, no additional configuration is needed. On secured clusters, additional configuration is required. |
md (markdown) | |
pyspark | |
python | Cluster administrators can install other Python packages on the cluster host. Use SSH to connect to the cluster, and then use pip to install the packages. When the installed libraries are located in /usr/bin, /usr/lib, or /usr/lib64, they become available to the module search path of the Python interpreter in the notebook. |
rserveinterpreter | Cluster administrators can install other R packages on the cluster host. Use SSH to connect to the cluster, and then install packages with R CMD INSTALL [options] package_name.tar.gz or, from an R session, install.packages(url_for_packages, NULL). Once installed, the new libraries are available to the rserve interpreter in the notebook. |
spark | |
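As an illustration of the python interpreter note above, a cluster administrator might install an additional package roughly like this (the package name is illustrative):

```
# On a cluster node (connect over SSH first), install a package system-wide
# so its libraries land under /usr/lib and are visible to the notebook's
# python interpreter. Package name is illustrative.
sudo pip install pandas
```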
Creating Paragraphs
Paragraphs are used in notebooks to interactively explore data. Paragraphs contain executable code that returns results in many different formats.
You can use different interpreters in the same notebook. Interpreter settings can be configured by an administrator. See Configuring Interpreters.
To create a paragraph:
For information about some of the actions you can perform in paragraphs, see Overview of Paragraphs User Interface.
Overview of Paragraphs User Interface
Below are some of the most common screen elements you'll see when working with paragraphs. Mouse over icons in the user interface to see their functions.
Icon | Description |
---|---|
Execute Paragraph | Executes the code or query in a paragraph. |
Enter Dependency Mode | Adds or removes dependent paragraphs. |
Comments | Adds comments to a paragraph. |
Expand/Collapse | Expands a paragraph and shows the paragraph in full-screen mode, or collapses that view. |
Show/Hide Line Numbers | Shows or hides line numbers in the code in a paragraph (applies only to the code section). |
Visibility | Manages the visibility settings in a paragraph. This controls how a paragraph can be viewed by the author and other users who have access to the notebook. |
Settings | Provides a number of additional actions for the paragraph. |
Accessing HDFS from Spark and PySpark
To access HDFS in a notebook and read and write to HDFS, you need to grant access to your folders and files to the user that the Big Data Studio notebook application will access HDFS as.
When Big Data Studio accesses HDFS (and other Hadoop cluster services), these users are used:
- interpreteruser is the user and group used with unsecured clusters.
- datastudio is the user and group used with secured clusters (Kerberos-enabled).
Note that these users are not the users that are used to access the Big Data Studio web UI.
On a secured cluster, you must obtain a Kerberos ticket for your user before running any hdfs commands. For example, run kinit oracle for the user oracle.
You can grant read and write access to HDFS folders. You can also remove that access.
You can grant read access to specific HDFS files. You can also remove that access.
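A minimal sketch of granting and removing that access with HDFS ACL commands, assuming a secured cluster (user datastudio; use interpreteruser on unsecured clusters) and illustrative paths:

```
# Grant read/write access to a folder and its contents
hdfs dfs -setfacl -R -m user:datastudio:rwx /user/oracle/mydata
hdfs dfs -setfacl -R -m default:user:datastudio:rwx /user/oracle/mydata

# Grant read access to a specific file
hdfs dfs -setfacl -m user:datastudio:r-- /user/oracle/mydata/data.csv

# Remove the access again
hdfs dfs -setfacl -R -x user:datastudio /user/oracle/mydata
```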
The following examples show how to use PySpark with HDFS. You need access to HDFS as described in the previous topics. Make sure you're using the %pyspark interpreter.
Get a Spark Session
You need a Spark session to work with HDFS.
```
%pyspark

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("SparkSessionName").getOrCreate()
```
Read in HDFS
To read using SparkSession:

```
%pyspark

df_load = sparkSession.read.csv('/hdfs/full/path/to/folder')
df_load.show()
```
Write to HDFS
Use a DataFrame to write to HDFS. To write using SparkSession:

```
%pyspark

data = [('One', 1), ('Two', 2), ('Three', 3)]

# Create DataFrame
df = sparkSession.createDataFrame(data)

# Write into HDFS
# /hdfs/full/path/to/folder is created when df.write is executed
df.write.mode("overwrite").csv("/hdfs/full/path/to/folder")
```
The df.write.mode options specify the behavior of the save operation when data already exists:
- append: Appends the contents of this DataFrame to existing data.
- overwrite: Overwrites existing data.
- ignore: Ignores this operation if data already exists.
- error or errorifexists: Throws an exception if data already exists.
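For example, re-running the write above with append instead of overwrite adds the new rows to whatever is already stored in the folder (path is illustrative):

```
%pyspark

# Append to existing data instead of overwriting it
df.write.mode("append").csv("/hdfs/full/path/to/folder")
```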
Example 1: Write to a Directory
This example writes to a /tmp directory to which all HDFS users have access.

```
%pyspark

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("pyspark-write").getOrCreate()

# Create data
data = [('One', 1), ('Two', 2), ('Three', 3), ('Four', 4), ('Five', 5)]
df = sparkSession.createDataFrame(data)

# Write into HDFS
df.write.mode("overwrite").csv("/tmp/testpysparkcsv")
```
Example 2: Read the Information
This example reads the information in the /tmp/testpysparkcsv directory created in the previous example.

```
%pyspark

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("pyspark-read").getOrCreate()

# Read from HDFS
df_load = sparkSession.read.csv('/tmp/testpysparkcsv')
df_load.show()
```
Accessing Hive from Spark and PySpark
To access Hive in a notebook and read and write to Hive, you need to grant access to your folders and files to the user that the Big Data Studio notebook application will access Hive as.
When Big Data Studio accesses Hive, these users are used:
- interpreteruser is the user and group used with unsecured clusters.
- datastudio is the user and group used with secured clusters (Kerberos-enabled).
Note that these users are not the users that are used to access the Big Data Studio web UI.
You can enable Big Data Studio to access all Hive tables by adding the datastudio user to the hive admin group. The user performing these steps must be logged into a node of the cluster as the root user.

In secured CDH clusters, the datastudio user and group can be granted more fine-grained access to specific tables by creating and using Sentry roles. For more information, see Hive SQL Syntax for Use with Sentry in the Cloudera documentation.

To grant access to Hive tables:
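A minimal sketch of the grant, mirroring the removal commands shown below: it adds the notebook user to the hive group on every node and is run as root on a cluster node.

```
# Unsecured clusters
dcli -C gpasswd -a interpreteruser hive

# Secured clusters
dcli -C gpasswd -a datastudio hive
```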
To remove access to Hive, remove the user from the group that has access. In this example, the group is hive.

Unsecured clusters:

```
dcli -C gpasswd -d interpreteruser hive
```

Secured clusters:

```
dcli -C gpasswd -d datastudio hive
```
The following examples show how to use PySpark with Hive. You need access to Hive as described in the previous topics. Make sure you're using the %pyspark interpreter.
Use spark.sql to Work with Hive
You use spark.sql to work with Hive. A spark.sql function call looks like this:

```
%pyspark

# Run the SQL instruction in quiet mode
spark.sql("SQL instruction")

# Show some output
spark.sql("SQL instruction").show()
```
Example of SQL instructions that do not show information:
%pyspark spark.sql ("DROP TABLE IF EXISTS hive_table") spark.sql("CREATE TABLE IF NOT EXISTS hive_table (number int, Ordinal_Number string, Cardinal_Number string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' ") spark.sql("load data inpath '/tmp/pysparktestfile.csv' into table pyspark_numbers_from_file") spark.sql("insert into table pyspark_numbers_from_file2 select * from pyspark_numbers_from_file") spark.sql("CREATE TABLE IF NOT EXISTS pyspark_numbers (number int, Ordinal_Number string, Cardinal_Number string) USING com.databricks.spark.csv OPTIONS \ (path \"/full/path/hdfs/dataframe/folder\", header \"false\")")
Example of SQL instructions that show information:
%pyspark spark.sql("show databases").show() spark.sql("show tables").show() spark.sql("describe hive_table ").show() spark.sql("select * hive_Table limit 5").show()
Example 1: Create a Table
%pyspark spark.sql("CREATE TABLE IF NOT EXISTS pyspark_test_table ( eid int, name String, salary String, destination String)") spark.sql("describe pyspark_test_table ").show()
Example 2: Remove a Table
%pyspark spark.sql ("DROP TABLE IF EXISTS pyspark_test_table")
Example 3: Create a File with 9000 Rows
```
%pyspark

from pyspark.sql import SparkSession
import re

single_digit = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
ten_to_nineteen = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
two_digits = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
ten_power = ["hundred", "thousand"]
data = []

def cardinalNum(number):
    if len(str(number)) == 1:
        return single_digit[number]
    elif len(str(number)) == 2:
        if str(number)[0] == "1":
            return ten_to_nineteen[int(str(number)[1])]
        else:
            return two_digits[int(str(number)[0])] + " " + single_digit[int(str(number)[1])]
    elif len(str(number)) == 3:
        return single_digit[int(str(number)[0])] + " " + ten_power[0] + " " + cardinalNum(int(str(number)[1:3]))
    elif len(str(number)) == 4:
        return cardinalNum(int(str(number)[0:1])) + " " + ten_power[1] + " " + cardinalNum(int(str(number)[1:4]))
    else:
        return str(number)

def ordinalNum(number):
    if re.match(".*1[0-9]$", str(number)):
        return str(number) + 'th'
    elif re.match(".*[04-9]$", str(number)):
        return str(number) + 'th'
    elif re.match(".*1$", str(number)):
        return str(number) + 'st'
    elif re.match(".*2$", str(number)):
        return str(number) + 'nd'
    elif re.match(".*3$", str(number)):
        return str(number) + 'rd'
    else:
        return str(number)

sparkSession = SparkSession.builder.appName("pyspark-write").getOrCreate()

# Create data
for number in range(1, 9001):
    tmpdata = [(number), (ordinalNum(number)), (cardinalNum(number))]
    data.append(tmpdata)
    # print(str(number) + ',' + ordinalNum(number) + ',' + cardinalNum(number))

df = sparkSession.createDataFrame(data)

# Write into HDFS
df.write.mode("overwrite").csv("/tmp/testpyspark")
```
Example 4: Load from the DataFrame Folder
%pyspark spark.sql("CREATE TABLE IF NOT EXISTS pyspark_numbers (number int, Ordinal_Number string, Cardinal_Number string) USING com.databricks.spark.csv OPTIONS \ (path \"/tmp/testpyspark\", header \"false\")") spark.sql ("describe pyspark_numbers").show() spark.sql ("select * from pyspark_numbers").show() spark.sql("select count(*) from pyspark_numbers").show()
Example 5: Load from a .csv File

```
%pyspark

spark.sql("CREATE TABLE IF NOT EXISTS pyspark_numbers_from_file (number int, Ordinal_Number string, Cardinal_Number string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' ")
spark.sql("load data inpath '/tmp/pysparktestfile.csv' into table pyspark_numbers_from_file")
```
To create the pysparktestfile.csv file, copy the file generated in Example 3 above, for example:

```
# Command listing files
hdfs dfs -ls /tmp/testpyspark/

# Output listing files
Found 2 items
-rw-r--r--   3 datastudio supergroup       0 2020-08-21 16:50 /tmp/testpyspark/_SUCCESS
-rw-r--r--   3 datastudio supergroup  416250 2020-08-21 16:50 /tmp/testpyspark/part-00000-5bf7c92a-e2ad-4f79-802e-c84c0c3b4cc0-c000.csv

# Command copying file
hdfs dfs -cp -p /tmp/testpyspark/part-00000-5bf7c92a-e2ad-4f79-802e-c84c0c3b4cc0-c000.csv /tmp/pysparktestfile.csv
```
Example 6: Insert from One Table to Another
%pyspark spark.sql("CREATE TABLE IF NOT EXISTS pyspark_numbers_new (number int, Ordinal_Number string, Cardinal_Number string) ") spark.sql("insert into table pyspark_numbers_new select * from pyspark_numbers_from_file") spark.sql("select * from pyspark_numbers_new").show() spark.sql("select count(*) from pyspark_numbers_new").show() spark.sql("select count(*) from pyspark_numbers_from_file").show()
Creating Versions of a Notebook
You can create a version of a notebook, which serves as a snapshot of the notebook when the version was created.
- Access the notebook application. See Accessing Big Data Studio.
- From the Notebooks page, open the notebook you want to version.
- Click Versioning in the upper left.
- Select Create Version to create a new version of the notebook, or View Version History to view and access versions that have already been created.
Exporting a Notebook
You can export notebooks from the Big Data Studio notebook application to your local computer. This includes Jupyter notebooks.
- Access the notebook application. See Accessing Big Data Studio.
- On the Notebooks page, click the Select Notebooks icon.
- Select the notebook you want to export.
- Click the Export Notebooks icon in the upper right.
- Change settings if desired, and then click Export. The file will be exported as a .dsnb file.
- Choose from export options that display, and then click OK to proceed with the export.
Deleting a Notebook
You can delete notebooks for which you have delete permissions.
- Access the notebook application. See Accessing Big Data Studio.
- On the Notebooks page, click the Select Notebooks icon.
- Select the notebook you want to delete.
- Click the Delete Notebooks icon in the upper right, and then click Delete to confirm the action.
You can also delete a notebook by opening the notebook and clicking the Delete Notebook icon at the top.
Keyboard Shortcuts for Notebooks
You can use keyboard shortcuts to perform certain actions, such as select paragraphs and clone and export notebooks.
To see a list of available shortcuts, open a notebook, click your user name in the top right, and select Keyboard Shortcuts from the menu. You can open an overview of all shortcuts, or search for shortcuts. If you're on a page that doesn't have any shortcuts, you won't see the Keyboard Shortcuts option.
You can also search for shortcuts by pressing ctrl+shift+f.