Get Started with a Non-Highly Available ODH Big Data Cluster
You use an Oracle Cloud Infrastructure account to create a non-highly available
Big Data cluster with Oracle Distribution including Apache Hadoop.
You can create Big Data clusters with options for node shapes and storage sizes.
Select these options based on your use case and performance needs. In this workshop, you
create a non-HA cluster and assign small shapes to the nodes. This cluster is perfect for
testing applications.
This simple non-HA cluster has the following profile:
Nodes: 1 Master node, 1 Utility node, and 3 Worker nodes.
Master and Utility Nodes Shape: VM.Standard2.4 shape for the Master and Utility nodes. This shape provides 4 CPUs and 60 GB of memory.
Worker Nodes Shape: VM.Standard2.1 shape for the Worker nodes in the cluster. This shape provides 1 CPU and 15 GB of memory.
Storage Size: 150 GB of block storage for the Master, Utility, and Worker nodes.
Before You Begin
To successfully perform this tutorial, you must have the following:
SSH encryption keys to connect to your compute instances or nodes. To create the keys:
Open a terminal window:
MacOS or Linux: Open a terminal window in the directory where you
want to store your keys.
Windows: Right-click on the directory where you want to store
your keys and select Git Bash Here.
Note
If you are using Windows Subsystem for Linux (WSL), ensure that the
directory for the keys is directly on your Linux machine and not in a
/mnt folder (Windows file system).
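For example, you can generate the key pair with ssh-keygen. This is a sketch that assumes you name the key big-data-key, the key name used later in this workshop:
ssh-keygen -t rsa -b 2048 -N "" -f big-data-key
This creates the private key big-data-key and the public key big-data-key.pub in the current directory.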
If your username is in the Administrators group, then skip this section.
Otherwise, have your administrator add the following policy to your compartment:
allow group <the-group-your-username-belongs-to> to manage all-resources in compartment <your-compartment-name>
With this privilege, you can manage all resources in your compartment,
essentially giving you administrative rights in that compartment.
In the Configure VCN and Subnets section, keep the default values for
the CIDR blocks:
VCN CIDR BLOCK: 10.0.0.0/16
PUBLIC SUBNET CIDR BLOCK: 10.0.0.0/24
PRIVATE SUBNET CIDR BLOCK: 10.0.1.0/24
Note
Notice the public and private subnets have different network
addresses.
For DNS RESOLUTION, select the check box for USE DNS HOSTNAMES IN THIS VCN.
Note
If you plan to use host names instead of IP addresses through the VCN DNS
or a third-party DNS, then select this check box. This choice cannot be
changed after the VCN is created.
Click Next.
The Review and Create dialog is displayed (not shown here) confirming
all the values you just entered plus names of network components, and DNS
information.
Example:
DNS Label for VCN: trainingvcn
DNS Domain Name for VCN: trainingvcn.oraclevcn.com
DNS Label for Public Subnet: sub07282019130
DNS Label for Private Subnet: sub07282019131
Click Create to create your VCN.
The Creating Resources dialog is displayed (not shown here) showing
all VCN components being created.
Click View Virtual Cloud Network and view your new VCN details.
To access the nodes with ssh, the Start VCN wizard
automatically opens port 22 on your public subnet. To open other ports, you must add
ingress rules to your VCN's security list.
In this section, to allow access to Apache Ambari, you add an ingress rule to your
public subnet.
Open the navigation menu and click
Networking. Then click Virtual Cloud Networks.
Select the VCN you created with the wizard.
Click <your-public-subnet-name>.
In the Security Lists section, click the Default Security List
link.
Click Add Ingress Rules.
Fill in the ingress rule as follows:
Stateless: Clear the check box.
Source Type: CIDR
Source CIDR: 0.0.0.0/0
IP Protocol: TCP
Source Port Range: (leave blank)
Destination Port Range: 7183
Description: Access Apache Ambari on the first utility node.
Click Add Ingress Rule.
Confirm that the rule is displayed in the list of Ingress Rules.
Open the navigation menu and click Analytics and AI. Under Data Lake, click Big Data Service.
Open the Compartment drop-down list in the List Scope section and select the <your-compartment-name> compartment that you created in the Get Started tutorial. Example: training-compartment.
Click Create Cluster.
On the Create Cluster panel, provide cluster details:
Cluster Name: training-cluster.
Cluster Admin Password: Enter a cluster admin password of your choice, such as Training123. Important: You need this password to sign in to Apache Ambari and to perform certain actions on the cluster through the Console.
Confirm Cluster Admin Password: Confirm your password.
Secure & Highly Available (HA): Clear this check box for a non-HA cluster.
A non-HA cluster lacks the full Hadoop security stack,
including HDFS Transparent Encryption, Kerberos, and Apache Sentry.
This setting can't be changed for the life of the cluster.
Cluster Version: ODH <latest-version>.
In the Hadoop Nodes > Master/Utility Nodes section, provide the
following details:
Choose Instance Type: Virtual Machine.
Choose Master/Utility Node Shape: VM.Standard2.4.
Block Storage size per Master/Utility Node (in GB): 150 GB.
Number of Master & Utility Nodes (Read-Only): Since you are creating a non-HA cluster, this field shows 2 nodes: 1 Master node and 1 Utility node. Note: For an HA cluster, this field shows 4 nodes: 2 Master nodes and 2 Utility nodes.
In the Hadoop Nodes > Worker Nodes section, provide the following
details:
Choose Instance Type: Virtual Machine.
Choose Worker Node Shape: VM.Standard2.1.
Block Storage size per Worker Node (in GB): 150 GB.
Number of Worker Nodes: 3. Three is the minimum allowed for a cluster.
In the Network Setting > Cluster Private Network section, provide the
following details:
CIDR BLOCK: 10.1.0.0/24. This CIDR block assigns a range of 256 contiguous IP addresses, 10.1.0.0 to 10.1.0.255. These IP addresses are available for the cluster's private network that BDS creates for the cluster. This private network is created in the Oracle tenancy, not in your customer tenancy, and is used exclusively for private communication among the nodes of the cluster. No other traffic travels over this network, it isn't accessible by outside hosts, and you can't modify it once it's created. All ports are open on this private network.
Note
Use the CIDR block above instead of the CIDR block range that is already displayed, to avoid any possible overlap of IP addresses with the CIDR block range of the training-vcn VCN that you created in Lab 1.
In the Network Setting > Customer Network section, provide the following
details:
Choose VCN in training-compartment: training-vcn. This is the VCN that you created in the Get Started tutorial. The VCN must contain a regional subnet.
Note
Make sure that
training-compartment is selected; if
it's not, click the Change Compartment link, and then search
for and select your
training-compartment.
Choose Regional Subnet in training-compartment: Public Subnet-training-vcn. This is the public subnet that was created for you when you created your training-vcn VCN in the Get Started tutorial.
Networking Options: Deploy Oracle-managed Service gateway and NAT gateway (Quick Start). This option simplifies your network configuration by allowing Oracle to provide and manage these communication gateways for private use by the cluster. These gateways are created in the Oracle tenancy and can't be modified after the cluster is created.
Note
Select the Use the gateways in your
selected Customer VCN (Customizable) option if you want
more control over the networking configuration.
In the Additional Options > SSH public key section, click Choose SSH
public key file.
Add the public key you created in the Get Started tutorial. Your public key has
a .pub extension. Example:
big-data-key.pub
Click Create Cluster. The Clusters page is re-displayed. The
state of the cluster is initially Creating.
If you are using a Free Trial account to run this workshop, Oracle recommends that
you delete the BDS cluster when you complete the workshop to avoid unnecessary
charges.
The process of creating the cluster takes approximately one hour to complete. You can
monitor the cluster creation progress as follows:
To view the cluster's details, click training-cluster
in the Name column to display the Cluster Details page.
The Cluster Information tab displays the cluster's general and
network information.
The List of cluster nodes section displays the
following information for each node in the cluster: Name, status, type,
shape, private IP address, and date and time of creation.
Note
The name of a node is the concatenation of the first seven letters of the cluster's name, trainin, followed by two letters representing the node type: mn for a Master node, un for a Utility node, and wn for a Worker node. The numeric suffix indicates the node's order within its type, such as 0, 1, and 2.
To view the details of a node, click the node's name link in the Name
column. For example, click the traininmn0 master node in
the Name column to display the Node Details page.
The Node Information tab displays the node's general information and
the network information.
The Node Metrics section at the bottom of the Node Details page
is displayed after the cluster is provisioned. It displays the
following charts: CPU Utilization, Memory Utilization,
Network Bytes In, Network Bytes Out, and Disk
Utilization. You can hover over any chart to get additional
details.
Click the Cluster Details link in the breadcrumbs at the top of the page
to re-display the Cluster Details page.
In the Resources section on the left, click Work Requests.
The Work Requests section on the page displays the status of the cluster
creation and other details such as the Operation, Status, %
Complete, Accepted, Started, and Finished. Click
the CREATE_BDS name link in the Operation column.
The CREATE_BDS page displays the work request information, logs,
and errors, if any.
Click the Clusters link in the breadcrumbs at the top of the page to
re-display the Clusters page.
Once the training-cluster cluster is created
successfully, the status changes to Active.
Lab 3. Add Oracle Cloud SQL to the Cluster
You add Oracle Cloud SQL to a cluster so that you can use SQL to query your big data
sources. When you add Cloud SQL support to a cluster, a query server node is added and big
data cell servers are created on all worker nodes.
Note
Cloud SQL is not included with Big Data Service. You must pay an extra fee for
using Cloud SQL.
On the Clusters page, on the row for
training-cluster, click the Actions
button.
From the context menu, select Add Cloud SQL.
In the Add Cloud SQL dialog box, provide the following
information:
Query Server Node Shape: Select
VM.Standard2.4.
Query Server Node Block Storage (In GB): Enter
1000.
Cluster Admin Password: Enter your cluster administration
password that you chose when you created the cluster such as
Training123.
Click Add. The Clusters page is re-displayed. The status of the
training-cluster is now Updating and the
number of nodes in the cluster is increased by 1.
Click the training-cluster name link in the Name column to
display the Cluster Details page. Scroll-down the page to the List of
cluster nodes section. The newly added Cloud SQL node,
traininqs0, is displayed.
Click the Cloud SQL Information tab to display information about the
new Cloud SQL node.
Click Work Requests in the Resources section. In the Work
Requests section, the ADD_CLOUD_SQL operation is
displayed along with the status of the operation and percent completed. Click
the ADD_CLOUD_SQL link.
The Work Request Details page displays the status, logs, and errors (if
any) of adding the Cloud SQL node to the cluster.
Click the Clusters link in the breadcrumbs at the top of the page to
re-display the Clusters page. Once the Cloud SQL node is successfully
added to the cluster, the cluster's state changes to Active and the
number of nodes in the cluster is now increased by 1.
Lab 4. Map the Private IP Addresses to Public IP Addresses
Big Data Service nodes are assigned private IP addresses by default, which aren't accessible from the public internet. To work with the cluster, you must make its nodes reachable. In this workshop, you map the private IP addresses of the nodes in the cluster to public IP addresses to make them publicly available on the internet. We assume that making the IP addresses public is an acceptable security risk.
Note
A bastion host, VPN Connect, or OCI FastConnect provides a more private and secure option than making the IP addresses public.
In this lab, you use Oracle Cloud Infrastructure Cloud Shell, which is a web
browser-based terminal accessible from the Oracle Cloud Console.
Open the navigation menu and click Analytics and AI. Under Data Lake, click Big Data Service.
On the Clusters page, click the training-cluster
link in the Name column to display the Cluster Details page.
In the Cluster Information tab, in the Customer Network
Information section, click the Copy link next to Subnet
OCID. Next, paste that OCID to an editor or a file, as you will need it
later in this workshop.
On the same page, in the List of cluster nodes section, in the IP
Address column, find the private IP addresses for the utility node
traininun0, the master node
traininmn0, and the Cloud SQL node
traininqs0. Save the IP addresses as you will
need them in later tasks.
A utility node generally contains utilities used for accessing the cluster. Making
the utility nodes in the cluster publicly available makes services that run on the utility
nodes available from the internet.
In this task, you set three variables using the export
command. These variables are used in the oci network command
that you use to map the private IP address of the utility node to a new public IP
address.
On the Oracle Cloud Console banner at the top of the page, click
Cloud Shell. It may take a few moments to connect and
authenticate you.
To change the Cloud Shell background color theme from dark to light, click
Settings
on the Cloud Shell banner, and then select Theme > Light from the
Settings menu.
At the $ command line prompt, enter the following command. The
display-name is an optional
descriptive name that will be attached to the reserved public IP address that
will be created for you. Press the [Enter] key to run
the command.
$ export DISPLAY_NAME="traininun0-public-ip"
At the $ command line prompt, enter the following command. Substitute
subnet-ocid with your own
subnet-ocid that you identified in Task 1
of this step. Press the [Enter] key to run the command.
$ export SUBNET_OCID="subnet-ocid"
At the $ command line prompt, enter the following command. The private
IP address is the ip address that is assigned to the node that you want to map.
In this case, provide the utility node's private IP address that you identified
in Task 1. Press the [Enter] key to run the
command.
$ export PRIVATE_IP="ip-address"
At the $ command line prompt, enter the following command exactly as
it's shown below without any line breaks, or click Copy to
copy the command, and then paste it on the command line. Press the
[Enter] key to run the command.
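The following is a sketch of a command that creates the reserved public IP address and maps it to the node's private IP address. It assumes the OCI CLI and jq, both of which are available in Cloud Shell, and the DISPLAY_NAME, SUBNET_OCID, and PRIVATE_IP variables set above; the two nested oci network private-ip list calls look up the private IP's compartment OCID and its OCID.
oci network public-ip create --display-name $DISPLAY_NAME --lifetime "RESERVED" --compartment-id `oci network private-ip list --subnet-id $SUBNET_OCID --ip-address $PRIVATE_IP | jq -r '.data[0]."compartment-id"'` --private-ip-id `oci network private-ip list --subnet-id $SUBNET_OCID --ip-address $PRIVATE_IP | jq -r '.data[0].id'`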
In the output returned, find the value of the ip-address field. This is the new reserved public IP address that is mapped to the private IP address of your utility node.
To view the newly created reserved public IP address in the console, click the
Navigation menu and navigate to Networking. In the IP
Management section, click Reserved IPs. The new reserved public
IP address is displayed in the Reserved Public IP Addresses page. If you specified a descriptive name as explained earlier, that name appears in the Name column; otherwise, a name such as publicipnnnnnnnnn is generated.
In this task, you set two variables using the export command.
These variables are used in the oci network command that you use to
map the private IP address of the master node to a new public IP address. You have
done similar work in the previous task.
At the $ command line prompt, enter the following command. The
display-name is an optional
descriptive name that will be attached to the reserved public IP address that
will be created for you. Press the [Enter] key to run
the command.
$ export DISPLAY_NAME="traininmn0-public-ip"
You already set the SUBNET_OCID variable to your own subnet-ocid value in the previous task. You don't need to set this variable again.
At the $ command line prompt, enter the following command. The private
IP address is the ip address that is assigned to the node that you want to map.
In this case, provide the first master node's private IP address that you
identified in Task 1. Press the [Enter] key to
run the command.
$ export PRIVATE_IP="ip-address"
At the $ command line prompt, enter the following command exactly as
it's shown below without any line breaks, or click Copy to
copy the command, and then paste it on the command line. Press the
[Enter] key to run the command.
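The mapping command has the same form as the sketch in the previous task, this time using the DISPLAY_NAME, SUBNET_OCID, and PRIVATE_IP values for the master node:
oci network public-ip create --display-name $DISPLAY_NAME --lifetime "RESERVED" --compartment-id `oci network private-ip list --subnet-id $SUBNET_OCID --ip-address $PRIVATE_IP | jq -r '.data[0]."compartment-id"'` --private-ip-id `oci network private-ip list --subnet-id $SUBNET_OCID --ip-address $PRIVATE_IP | jq -r '.data[0].id'`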
In the output returned, find the value of the ip-address field. This is the new reserved public IP address that is mapped to the private IP address of your master node.
To view the newly created reserved public IP address in the console, click the
Navigation menu and navigate to Networking. In the IP
Management section, click Reserved IPs. The new reserved public
IP address is displayed in the Reserved Public IP Addresses page. If you specified a descriptive name as explained earlier, that name appears in the Name column; otherwise, a name such as publicipnnnnnnnnn is generated.
In this task, you set two variables using the export command.
Next, you use the oci network command to map the private IP address
of the Cloud SQL node to a new public IP address.
In the Cloud Shell, at the $ command line prompt, enter the
following command.
$ export DISPLAY_NAME="traininqs0"
You already set the SUBNET_OCID variable to your own subnet-ocid value in the previous task. You don't need to set this variable again.
At the $ command line prompt, enter the following command. The private
IP address is the ip address that is assigned to the node that you want to map.
In this case, provide the Cloud SQL node's private IP address that you
identified in Task 1.
export PRIVATE_IP="ip-address"
At the $ command line prompt, enter the following command exactly as
it's shown below without any line breaks, or click Copy to
copy the command, and then paste it on the command line.
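Again, a sketch of the mapping command, using the variables set for the Cloud SQL node:
oci network public-ip create --display-name $DISPLAY_NAME --lifetime "RESERVED" --compartment-id `oci network private-ip list --subnet-id $SUBNET_OCID --ip-address $PRIVATE_IP | jq -r '.data[0]."compartment-id"'` --private-ip-id `oci network private-ip list --subnet-id $SUBNET_OCID --ip-address $PRIVATE_IP | jq -r '.data[0].id'`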
In the output returned, find the value of the ip-address field. This is the new reserved public IP address that is mapped to the private IP address of your Cloud SQL node.
To view the newly created reserved public IP address in the console, click the
Navigation menu and navigate to Networking. In the IP
Management section, click Reserved IPs. The new reserved public
IP address is displayed in the Reserved Public IP Addresses page.
In this task, you edit a public IP address using both the Cloud Console and
the Cloud Shell.
Click the Navigation menu and navigate to Networking. In the
IP Management section, click Reserved IPs. The new reserved
public IP addresses that you created in this step are displayed in the
Reserved Public IP Addresses page.
Change the name of the public IP address associated with the Cloud SQL node
from traininqs0 to
traininqs0-public-ip. On the row for
traininqs0, click the Actions button, and then
select Rename from the context menu.
In the Rename dialog box, in the RESERVED PUBLIC IP NAME field,
enter traininqs0-public-ip, and then click Save
Changes.
Do not delete any of your public IP addresses as you need them
for this tutorial.
Lab 5. Use Apache Ambari to Access the Cluster
In this task, you use Apache Ambari to access the cluster. In a Big Data cluster,
Apache Ambari runs on the first utility node, traininun0. You use
the reserved public IP address that is associated with traininun0
that you created in Task 2 of Lab 4.
Open a Web browser window and enter the following URL. Substitute
ip-address with your own
ip-address that is associated with
the utility node in your cluster, traininun0, which you created previously. To view your reserved public IP address in the
console, click the Navigation menu and navigate to Networking. In
the IP Management section, click Reserved IPs. The reserved public
IP address is displayed in the Reserved Public IP Addresses page.
https://<ip-address>:7183
On the login screen, enter the following information:
username: admin
password: password you specified when you
created the cluster
Click Sign In.
From the Dashboard, note the name of the cluster at the top right and
the services running on the cluster from the left navigation.
Click Hosts. The hosts of the cluster are displayed. Hosts are
configured with one or more components, each of which corresponds to a service.
The component indicates which daemon, also known as service, runs on the host.
Typically, a host will run multiple components in support of the various
services running in the cluster.
Drill-down on the components associated with the master node in the cluster,
traininmn0.
The services and components that are running on the master node are displayed, such as the HDFS NameNode, Spark3 History Server, YARN Registry DNS, and YARN ResourceManager, among other services.
Exit Apache Ambari. From the User drop-down menu, select Sign
out.
In this task, you connect to the cluster's master node using SSH as user
opc (the default Oracle Public Cloud user).
When you created a cluster, you used your SSH public key to create the nodes. In this
section, you use the matching private key to connect to the master node.
Open the
navigation menu and click Networking. Under IP Management, click
Reserved IPs.
Copy the reserved public IP address of traininmn0-public-ip. You use this IP address to connect to your master node.
Open a terminal or a Command Prompt window using an application such as Git Bash.
Change into the directory where you stored the SSH encryption keys you created
in the Get Started tutorial.
Connect to your node with the ssh command, as shown in the sketch below. Your private key has no extension. Example: big-data-key.
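A minimal sketch, assuming your private key file big-data-key is in the current directory; substitute your master node's reserved public IP address for the placeholder:
ssh -i big-data-key opc@<master-node-public-ip>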
Because you identified your public key when you created the nodes, this
command logs you into your instance. You can now issue sudo commands to
create a Linux administrator.
Create the training Linux administrator user and the OS group supergroup. Assign supergroup, the superuser group, as the primary group for training, and hdfs, hadoop, and hive as its secondary groups.
The opc user has sudo
privileges on the cluster which allows it to switch to the root
user in order to run privileged commands. Change to the root
user as follows:
sudo bash
The dcli utility allows you to run the command that you
specify across each node of the cluster. The syntax for the
dcli utility is as follows:
dcli [option] [command]
Use the -C option to run the specified command on all
the nodes in the cluster.
Enter the following command at the # prompt to create
the OS group supergroup which is defined as the
superuser group in hadoop.
dcli -C "groupadd supergroup"
Enter the following command at the # prompt to create
the training administrator user and add it to the listed
groups on each node in the training-cluster. The
useradd linux command creates the new
training user and adds it to the specified
groups.
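A sketch of the command, consistent with the description that follows; the -d and -m options, which create a home directory for the new user at /home/training, are an assumption:
dcli -C "useradd -d /home/training -m -g supergroup -G hdfs,hadoop,hive training"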
The preceding command creates a new user named
training on each node of the cluster. The
-g option assigns the
supergroup group as the primary group for
training. The -G
option assigns the hdfs,
hadoop, and hive groups
as the secondary groups for training.
Note
Since the training user is part of the hive
group, it is considered an administrator for Sentry.
Use the Linux id command to confirm the creation of the new user and to list its group memberships.
id training
You can now access HDFS using the new training
administrator user on any node in the cluster such as the first master
node in this example. Change to the training user as
follows:
sudo -su training
Use the Linux id command to confirm that you are now connected as the training user.
id
Perform a file listing of HDFS as the training user using the
following command:
hadoop fs -ls /
Lab 7. Upload Data to HDFS and Object Storage
In this step, you download and run two sets of scripts.
First, you will download and run the Hadoop Distributed File System (HDFS) scripts to
download data from Citi Bikes NYC to a new local directory on your master node
in your BDS cluster. The HDFS scripts manipulate some of the downloaded data files, and then upload them to new HDFS directories. The HDFS scripts also create Hive databases and tables that you will query using Hue.
Second, you will download and run the object storage scripts to download data from Citi
Bikes NYC to your local directory using OCI Cloud Shell. The object storage scripts upload the data to a new bucket in Object Storage. See the Data License Agreement for information about the Citi Bikes NYC data license agreement.
Open the navigation menu and click
Identity & Security. Under Identity, click
Compartments.
In the list of compartments, search for the training-compartment. In the
row for the compartment, in the OCID column, hover over the OCID link and
then click Copy. Next, paste that OCID to an editor or a file, so that
you can retrieve it later in this step.
Click the Navigation menu and navigate to Networking > Reserved
IPs. The Reserved Public IP Addresses page is displayed. In the
List Scope on the left pane, make sure that your
training-compartment is selected.
In the row for the traininmn0-public-ip reserved IP address, copy
the reserved public IP address associated with the master node in the
Reserved Public IP column. Next, paste that IP address to an editor
or a file, so that you can retrieve it later in this step. You might need this
IP address to ssh to the master node, if you didn't save your ssh connection in
step 5.
Log in as the training Hadoop Administrator user that
you created in Step 5.
sudo -su training
Use the id command to confirm that you are connected as the
training Hadoop Administrator user.
id
Use the cd command to change the working directory to that of
the training user. Use the pwd command
to confirm that you are in the training working
directory:
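Assuming the training user's home directory is /home/training:
cd /home/training
pwd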
In this task, you download two scripts that will set up your HDFS environment and
download the HDFS dataset from Citibike System Data. The scripts and a
randomized weather data file are stored in a public bucket in Object Storage.
The Citi Bikes detailed trip data files (in zipped format) are first downloaded to a
new local directory. Next, the files are unzipped, and the header row is removed
from each file. Finally, the updated files are uploaded to a new
/data/biketrips HDFS directory. Next, a new
bikes Hive database is created with two Hive tables.
bikes.trips_ext is an external table defined over
the source data. The bikes.trips table is created from this
source; it is a partitioned table that stores the data in Parquet format. The tables
are populated with data from the .csv files in the
/data/biketrips directory.
The stations data file is downloaded from the station information page and then manipulated. The updated file is then uploaded to a new /data/stations HDFS directory.
The weather data is downloaded from a public bucket in Object Storage. Next, the
header row in the file is removed. The updated file is then uploaded to a new
/data/weather HDFS directory. Next, a new
weather Hive database and
weather.weather_ext table are created and populated with data from the weather-newark-airport.csv file in the /data/weather directory.
Run the following command to download the env.sh script
from a public bucket in Object Storage to the training
working directory. You will use this script to set up your HDFS
environment.
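A sketch of the download command; the workshop's public bucket URL is shown here only as a placeholder:
wget <public-bucket-URL>/env.sh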
Run the following command to download the
download-all-hdfs-data.sh script from a public
bucket in Object Storage to the training working
directory. You will run this script to download the dataset to your local
working directory. The script will then upload this data to HDFS.
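Similarly, a sketch with a placeholder URL:
wget <public-bucket-URL>/download-all-hdfs-data.sh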
Add the execute privilege to the two downloaded .sh
files as follows:
chmod +x *.sh
Display the content of the env.sh file using the
cat command. This file sets the target local and HDFS
directories.
cat env.sh
Note
You will download the data from Citi Bikes NYC to the new
Downloads local target directory as
specified in the env.sh file. You will upload the
data from the local Downloads directory to the
following new HDFS directories under the new /data
HDFS directory as specified in the env.sh and the
HDFS scripts: biketrips,
stations, and
weather.
Use the cat command to display the content of the
download-all-hdfs-data.sh script. This script
downloads the download-citibikes-hdfs.sh and
download-weather-hdfs.sh scripts to the local
training working directory, adds execute
privilege on both of those scripts, and then runs the two scripts.
cat download-all-hdfs-data.sh
The download-citibikes-hdfs.sh script does the
following:
Runs the env.sh script to set up your local and
HDFS target directories.
Downloads the stations information from the Citi Bike Web site to the
local Downloads target directory.
Creates a new /data/stations HDFS directory, and
then copies the stations.json file to this HDFS
directory.
Downloads the bike rental data files (the zipped .csv
files) from Citi Bikes NYC to the local
Downloads target directory.
Unzips the bike rental files, and removes the header row from each
file.
Creates a new /data/biketrips HDFS directory, and then uploads the updated .csv files to this HDFS directory.
Creates the bikes Hive database with two Hive tables.
bikes.trips_ext is an external table
defined over the source data. bikes.trips is
created from this source; it is a partitioned table that stores the data
in Parquet format. The tables are populated with data from the
.csv files in the
/data/biketrips directory.
The download-weather-hdfs.sh script provides a random
weather data set for Newark Liberty Airport in New Jersey. It does the
following:
Runs the env.sh script to set up your local and HDFS
target directories.
Downloads the weather-newark-airport.csv file from a public bucket in Object Storage to the local Downloads target directory.
Removes the header row from the file.
Creates a new /data/weather HDFS directory, and
then uploads the weather-newark-airport.csv file to
this HDFS directory.
Creates the weather Hive database and the
weather.weather_ext Hive table. It then populates
the table with the weather data from the
weather-newark-airport.csv file in the local
Downloads directory.
Run the download-all-hdfs-data.sh script as
follows:
./download-all-hdfs-data.sh
Text messages will scroll on the screen. After a minute or so, the Weather
data loaded and Done messages are displayed on the screen.
Navigate to the local Downloads directory, and then use
the ls -l command to display the downloaded trips, stations,
and weather data files.
cd Downloads
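Then list the downloaded files:
ls -l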
Use the head command to display the first two records
from the stations.json file.
head -2 stations.json | jq
Use the head command to display the first 10 records
from the weather-newark-airport.csv file.
head weather-newark-airport.csv
Use the following commands to display the HDFS directories that were created,
and list their contents.
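A sketch of the listing commands, using the HDFS directories created by the scripts:
hadoop fs -ls /data
hadoop fs -ls /data/biketrips
hadoop fs -ls /data/stations
hadoop fs -ls /data/weather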
Hit the [Enter] key on your keyboard to execute the last command
above.
Use the following command to display the first 5 rows from the uploaded
JC-201902-citibike-tripdata.csv file in the
/data/biketrips HDFS folder. Remember, the
header row for this uploaded .csv file was removed when you ran
the download-citibikes-hdfs.sh script.
hadoop fs -cat /data/biketrips/JC-201902-citibike-tripdata.csv | head -5
In this task, you will download two scripts. The scripts and a randomized weather
data file are stored in a public bucket in Object Storage.
The scripts that you download for this section will set up your Object Storage
environment and download the object storage dataset from Citi
Bikes NYC.
On the Oracle Cloud Console banner at the top of the page, click
Cloud Shell. It may take a few moments to connect and authenticate
you.
Click Copy to copy the following command. Right-click your mouse, select
Paste, and then paste it on the command line. You will use this
script to set up your environment for the Object Store data. Press the
[Enter] key to run the command.
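A sketch of the download command, with the workshop's public bucket URL shown only as a placeholder:
wget <public-bucket-URL>/env.sh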
Edit the downloaded env.sh file using the vi
Editor (or an editor of your choice) as follows:
vi env.sh
To input and edit text, press the [i] key on your keyboard (insert mode)
at the current cursor position. The INSERT keyword is displayed at the
bottom of the file to indicate that you can now make your edits to this file.
Scroll-down to the line that you want to edit. Copy your training-compartment
OCID value that you identified in Task 1, and then paste it
between the " " in the export
COMPARTMENT_OCID="" command.
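After the edit, the line should look like the following, with your compartment's OCID in place of the illustrative placeholder value:
export COMPARTMENT_OCID="ocid1.compartment.oc1..<your-compartment-unique-id>"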
Note
You will upload the Object Store data to the
training bucket as specified in the
env.sh file.
Press the [Esc] key on your keyboard, enter :wq,
and then press the [Enter] key on your keyboard to save your changes and
quit vi.
At the $ command line prompt, click Copy to copy the following
command, and then paste it on the command line. You will run this script to
download the dataset to your local working directory. You will then upload this
data to a new object in a new bucket. Press the [Enter] key to run the
command.
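Again, a sketch with a placeholder URL:
wget <public-bucket-URL>/download-all-objstore.sh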
Add the execute privilege to the two downloaded .sh
files as follows:
chmod +x *.sh
Use the cat command to display the content of this
script. This script runs the env.sh script, downloads the
download-citibikes-objstore.sh and
download-weather-objstore.sh scripts, adds
execute privilege to both of those scripts, and then runs the two scripts.
cat download-all-objstore.sh
You can use the cat command to display the content of this script. The download-all-objstore.sh script runs the env.sh script, which sets the environment. The script copies some of the data from Citi Bikes NYC, along with a randomized weather data file stored in a public bucket in Object Storage, to your local Cloud Shell directory and to new objects in a new bucket named training, as specified in the env.sh script. The training bucket will contain the following new objects:
The weather object which stores the weather
data.
The stations object which stores the stations
data.
The biketrips object which stores the bike trips
data.
Run the download-all-objstore.sh script as
follows:
./download-all-objstore.sh
Text messages will scroll on the screen. After a minute or so, the Done message, along with the location of the data (both compartment and bucket), is displayed on the screen.
Navigate to the local Downloads directory to display
the downloaded data files.
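For example, assuming the same Downloads target directory specified in env.sh:
cd Downloads
ls -l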
Click the Navigation menu and navigate to Storage. In the
Object Storage & Archive Storage section, click Buckets.
The Buckets page is displayed. In the List Scope on the left pane,
make sure that your training-compartment is selected. In the list of
available buckets, the newly created training bucket is displayed in the
Name column. Click the training link.
The Bucket Details page for the training bucket is displayed.
Scroll-down to the Objects section to display the newly created
biketrips, stations, and weather objects.
To display the data files in an object such as the biketrips object, click Expand next to the object's name. The files contained in that object are displayed. To collapse the list of files, click Collapse next to the object's name.
To display the first 1KB of the file's content (in read-only mode), click the
Actions button on the row for the file, and then select View
Object Details from the context menu.
Note
To view all the data in a file, select Download from the context
menu, and then double-click the downloaded file to open it using its native
application, MS-Excel (.csv) in this example.
You use the Clusters and Cluster Details pages to maintain your
clusters.
Open the navigation menu and click Analytics and AI. Under Data Lake, click Big Data Service.
On the Clusters page, on the row for
training-cluster, click the Actions
button. You can use the context menu to view the cluster's details, add nodes,
add block storage, add Cloud SQL, rename the cluster, remove Cloud SQL (if it's
already added), and terminate the Big Data cluster.
Alternatively, you can click the training-cluster link
in the Name column to display the Cluster Details page. You can
use the buttons at the top of the page to do the following:
Add nodes to the cluster.
Add block storage.
Add Cloud SQL.
Change shape.
Use the More Actions drop-down list to rename the cluster, move a
resource from the current compartment to a different compartment, add
tags, remove Cloud SQL (if it's already added), and terminate the Big Data cluster.
Note
Oracle Cloud Infrastructure Tagging allows you to add metadata to
resources, which enables you to define keys and values and associate them with
resources. You can use the tags to organize and list resources based on your
business needs.
You can monitor the cluster's metrics and the metrics of any of its
nodes.
From the Clusters page, click training-cluster
in the Name column to display the Cluster Details page.
Scroll-down the Cluster Details page. In the Resources section on
the left, click Cluster Metrics.
The Cluster Metrics section shows the various metrics such as HDFS
Space Used, HDFS Space Free, Yarn Jobs Completed, and
Spark Jobs Completed. You can adjust the Start time, End time,
Interval, Statistic, and Options fields, as desired.
In the Resources section on the left, click Nodes (7).
In the List of cluster nodes section, click any node name link to
display its metrics. Click the traininmn0 master node in
the Name column.
In the Node Details page, scroll-down to the Node Metrics
section. This section is displayed at the bottom of the Node Details page
only after the cluster is successfully provisioned. It displays the
following charts: CPU Utilization, Memory Utilization, Network
Bytes In, Network Bytes Out, and Disk Utilization. You can
hover over any chart to get additional details.
Click the Cluster Details link in the breadcrumbs at the top of the page
to re-display the Cluster Details page.
Lab 9. Clean Up Tutorial Resources
You can delete the resources that you created in this workshop. If you want to run
the labs in this workshop again, perform these clean up tasks.
If you want to list the resources in your training-compartment,
you can use the Tenancy Explorer page. From the Navigation menu, navigate
to Governance & Administration. In the Governance section, click
Tenancy Explorer. On the Tenancy Explorer page, in the Search
compartments field, type training, and then select
training-compartment from the list of compartments. The
resources in the training-compartment are displayed.
Open the navigation menu and click Analytics and AI. Under Data Lake, click Big Data Service.
On the Clusters page, on the row for
training-cluster, click the Actions
button, and then select Terminate Big Data Cluster from the context
menu.
A confirmation message box is displayed. Enter the name of the cluster, and then click Terminate. The status of the cluster in the State column changes from Active to Deleting. It can take up to 30 minutes before the cluster is deleted.
To view the status of the deletion process, click the cluster's name link in
the Name column to display the Cluster Details page. In the
Resources section at the bottom left-hand side of the page, click
Work Requests. In the Work Requests section, you can see the
% Complete information.
For additional details on the deletion process, click DELETE_BDS in the Operation column. The DELETE_BDS page displays the logs and errors, if any.
Click the Clusters link in the breadcrumbs to return to the
Clusters page. When the cluster is successfully deleted, the status
of the cluster in the State column changes from Deleting to
Deleted.
Open the navigation menu and click
Identity & Security. Under Identity, click
Policies.
Click the Actions button associated with the
training-admin-policy policy, and then select Delete from the
context menu. A confirmation message box is displayed, click
Delete.
Click the Actions button associated with the training-bds-policy
policy, and then select Delete from the context menu. A confirmation
message box is displayed, click Delete.
To delete a VCN, it must first be empty and have no related resources or attached
gateways such as internet gateway, dynamic routing gateway, and so on. To delete a VCN's
subnets, they must first be empty too.
Open the navigation menu and click
Networking. Then click Virtual Cloud Networks.
From the list of available VCNs in your compartment, click the
training-vcn name link in the Name column. The Virtual
Cloud Network Details page is displayed.
In the Subnets section, click the Actions button associated with
Private Subnet-training-vcn. Select Terminate from the context
menu. A confirmation message box is displayed. Click Terminate.
In the Subnets section, click the Actions button associated with
Public Subnet-training-vcn. Select Terminate from the context
menu. A confirmation message box is displayed. Click Terminate.
In the Resources section on the left pane, click Route
Tables.
In the Routes Tables section, click the Actions button associated
with Route Table for Private Subnet-training-vcn. Select Terminate
from the context menu. A confirmation message box is displayed. Click
Terminate.
In the Routes Tables section, click the Default Route Table for
training-vcn link in the Name column. The Route Table
Details page is displayed. In the Route Rules section, click the
Actions icon associated with Internet Gateway-training-vcn.
Select Remove from the context menu. A confirmation message box is
displayed. Click Remove. Click training-vcn in the breadcrumbs to
return to the training-vcn page.
In the Resources section on the left pane, click Internet
Gateways. In the Internet Gateways section, click the
Actions button associated with Internet Gateway-training-vcn.
Select Terminate from the context menu. A confirmation message box is
displayed. Click Terminate.
In the Resources section on the left pane, click Security Lists.
In the Security Lists section, click the Actions button associated
with Security List for Private Subnet-training-vcn. Select
Terminate from the context menu. A confirmation message box is
displayed. Click Terminate.
In the Resources section on the left pane, click NAT Gateways. In
the NAT Gateways section, click the Actions button associated with
NAT Gateway-training-vcn. Select Terminate from the context
menu. A confirmation message box is displayed. Click Terminate.
In the Resources section on the left pane, click Service
Gateways. In the Service Gateways section, click the Actions
button associated with Service Gateway-training-vcn. Select
Terminate from the context menu. A confirmation message box is
displayed. Click Terminate.
At the top of the page, click Terminate to terminate your VCN. A
Terminate Virtual Cloud Network window is displayed. After less than
a minute, the Terminate All button is enabled. To delete your VCN, click
Terminate All.
When the termination operation is completed successfully, a Virtual Cloud
Network termination complete message is displayed in the window. Click
Close.
Click the Navigation menu and navigate to Networking. In the
IP Management section, click Reserved IPs. The Reserved
Public IP Addresses page is displayed.
In the List Scope on the left pane, make sure that your
training-compartment is selected.
In this workshop, you have created three reserved IP addresses:
traininmn0-public-ip,
traininqs0-public-ip, and
traininun0-public-ip.
Click the Actions button associated with
traininmn0-public-ip. Select Terminate
from the context menu. A confirmation message box is displayed. Click
Terminate.
Click the Actions button associated with
traininqs0-public-ip. Select Terminate
from the context menu. A confirmation message box is displayed. Click
Terminate.
Click the Actions button associated with
traininun0-public-ip. Select Terminate
from the context menu. A confirmation message box is displayed. Click
Terminate.
Before you can delete a bucket that contains objects, you must delete all the objects
in the bucket first.
Click the Navigation menu and navigate to Storage. In the
Object Storage & Archive Storage section, click Buckets.
The Buckets page is displayed. In the List Scope on the left pane,
make sure that your training-compartment is selected. In the list of
available buckets, the newly created training bucket is displayed in the
Name column. Click the training link.
The Bucket Details page for the training bucket is displayed.
Scroll-down to the Objects section.
On the row for the biketrips object, click the Actions button,
and then select Delete Folder from the context menu.
A confirmation message box is displayed. Enter biketrips in the Type
the folder name to confirm deletion text box, and then click
Delete. The object is deleted and the Bucket Details page is
re-displayed.
On the row for the stations object, click the Actions button, and
then select Delete Folder from the context menu.
A confirmation message box is displayed. Enter stations in the Type
the folder name to confirm deletion text box, and then click
Delete. The object is deleted and the Bucket Details page is
re-displayed.
On the row for the weather object, click the Actions button, and
then select Delete Folder from the context menu.
A confirmation message box is displayed. Enter weather in the Type
the folder name to confirm deletion text box, and then click
Delete. The object is deleted and the Bucket Details page is
re-displayed.
Scroll up the page, and then click the Delete button. A confirmation
message box is displayed. Click Delete. The bucket is deleted and the
Buckets page is re-displayed.
Open the navigation menu and click
Identity & Security. Under Identity, click
Compartments.
From the list of available compartments, search for your
training-compartment.
On the Compartments page, click the Actions button associated
with training-compartment. Select Delete from the context
menu.
A confirmation message box is displayed. Click Delete. The status of the
deleted compartment changes from Active to Deleting until the
compartment is successfully deleted. You can click on the compartment name link
in the Name column to display the status of this operation.