Get Started with a Non-Highly Available ODH Big Data Cluster
You use an Oracle Cloud Infrastructure account to create a non-highly available
Big Data cluster with Oracle Distribution including Apache Hadoop.
You can create Big Data clusters with options for node shapes and storage sizes.
Select these options based on your use case and performance needs. In this workshop, you
create a non-HA cluster and assign small shapes to the nodes. This cluster is perfect for
testing applications.
This simple non-HA cluster has the following profile:
Nodes: 1 Master node, 1 Utility node, and 3 Worker nodes.
Master and Utility Nodes Shape: VM.Standard2.4 shape for the Master and Utility nodes. This shape provides 4 CPUs and 60 GB of memory.
Worker Nodes Shape: VM.Standard2.1 shape for the Worker nodes in the cluster. This shape provides 1 CPU and 15 GB of memory.
Storage Size: 150 GB of block storage for the Master, Utility, and Worker nodes.
Before You Begin
To successfully perform this tutorial, you must have the following:
SSH encryption keys to connect to your compute instances or nodes. To create the keys:
Open a terminal window:
MacOS or Linux: Open a terminal window in the directory where you
want to store your keys.
Windows: Right-click on the directory where you want to store
your keys and select Git Bash Here.
Note
If you are using Windows Subsystem for Linux (WSL), ensure that the
directory for the keys is directly on your Linux machine and not in a
/mnt folder (Windows file system).
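For example, you can generate the key pair with ssh-keygen. This is a sketch that assumes you name the key big-data-key, the key name used later in this workshop:
ssh-keygen -t rsa -b 2048 -N "" -f big-data-key
This creates the private key big-data-key and the public key big-data-key.pub in the current directory.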
If your username is in the Administrators group, then skip this section.
Otherwise, have your administrator add the following policy to your compartment:
allow group <the-group-your-username-belongs-to> to manage all-resources in compartment <your-compartment-name>
With this privilege, you can manage all resources in your compartment,
essentially giving you administrative rights in that compartment.
In the Configure VCN and Subnets section, keep the default values for
the CIDR blocks:
VCN CIDR BLOCK: 10.0.0.0/16
PUBLIC SUBNET CIDR BLOCK: 10.0.0.0/24
PRIVATE SUBNET CIDR BLOCK: 10.0.1.0/24
Note
Notice the public and private subnets have different network
addresses.
For DNS RESOLUTION, select the check box for USE DNS HOSTNAMES IN THIS VCN.
Note
If you plan to use host names instead of IP addresses through the VCN DNS
or a third-party DNS, then select this check box. This choice cannot be
changed after the VCN is created.
Click Next.
The Review and Create dialog is displayed (not shown here) confirming
all the values you just entered plus names of network components, and DNS
information.
Example:
DNS Label for VCN: trainingvcn
DNS Domain Name for VCN: trainingvcn.oraclevcn.com
DNS Label for Public Subnet: sub07282019130
DNS Label for Private Subnet: sub07282019131
Click Create to create your VCN.
The Creating Resources dialog is displayed (not shown here) showing
all VCN components being created.
Click View Virtual Cloud Network and view your new VCN details.
To access the nodes with ssh, the Start VCN wizard
automatically opens port 22 on your public subnet. To open other ports, you must add
ingress rules to your VCN's security list.
In this section, to allow access to Apache Ambari, you add an ingress rule to your
public subnet.
Open the navigation menu and click
Networking. Then click Virtual Cloud Networks.
Select the VCN you created with the wizard.
Click <your-public-subnet-name>.
In the Security Lists section, click the Default Security List
link.
Click Add Ingress Rules.
Fill in the ingress rule as follows:
Stateless: Clear the check box.
Source Type: CIDR
Source CIDR: 0.0.0.0/0
IP Protocol: TCP
Source Port Range: (leave blank)
Destination Port Range: 7183
Description: Access Apache Ambari on the first utility node.
Click Add Ingress Rule.
Confirm that the rule is displayed in the list of Ingress Rules.
Open the navigation menu and click Analytics and AI. Under Data Lake, click Big Data Service.
Open the Compartment drop-down list in the List Scope section and select the <your-compartment-name> compartment that you created in the Get Started tutorial. Example: training-compartment.
Click Create Cluster.
On the Create Cluster panel, provide cluster details:
Cluster Name: training-cluster.
Cluster Admin Password: Enter a cluster admin password of your choice, such as Training123. Important: You need this password to sign in to Apache Ambari and to perform certain actions on the cluster through the Console.
Confirm Cluster Admin Password: Confirm your password.
Secure & Highly Available (HA): Clear this check box for a non-HA cluster.
A non-HA cluster lacks the full Hadoop security stack,
including HDFS Transparent Encryption, Kerberos, and Apache Sentry.
This setting can't be changed for the life of the cluster.
Cluster Version: ODH <latest-version>.
In the Hadoop Nodes > Master/Utility Nodes section, provide the
following details:
Choose Instance Type: Virtual Machine.
Choose Master/Utility Node Shape: VM.Standard2.4.
Block Storage size per Master/Utility Node (in GB): 150 GB.
Number of Master & Utility Nodes (Read-Only): Since you are creating a non-HA cluster, this field shows 2 nodes: 1 Master node and 1 Utility node. Note: For an HA cluster, this field shows 4 nodes: 2 Master nodes and 2 Utility nodes.
In the Hadoop Nodes > Worker Nodes section, provide the following
details:
Choose Instance Type: Virtual Machine.
Choose Worker Node Shape: VM.Standard2.1.
Block Storage size per Worker Node (in GB): 150 GB.
Number of Worker Nodes: 3. Three is the minimum allowed for a cluster.
In the Network Setting > Cluster Private Network section, provide the
following details:
CIDR BLOCK: 10.1.0.0/24. This CIDR block assigns a range of 256 contiguous IP addresses, 10.1.0.0 to 10.1.0.255. These IP addresses are available for the cluster's private network that BDS creates for the cluster. This private network is created in the Oracle tenancy, not in your customer tenancy, and is used exclusively for private communication among the nodes of the cluster. No other traffic travels over this network, it isn't accessible by outside hosts, and you can't modify it once it's created. All ports are open on this private network.
Note
Use the CIDR block above instead of the CIDR block range that is already displayed, to avoid any possible overlap of IP addresses with the CIDR block range of the training-vcn VCN that you created in Lab 1.
In the Network Setting > Customer Network section, provide the following
details:
Choose VCN in training-compartment: training-vcn. This is the VCN that you created in the Get Started tutorial. The VCN must contain a regional subnet.
Note
Make sure that
training-compartment is selected; if
it's not, click the Change Compartment link, and then search
for and select your
training-compartment.
Choose Regional Subnet in training-compartment: Public Subnet-training-vcn. This is the public subnet that was created for you when you created your training-vcn VCN in the Get Started tutorial.
Networking Options: Deploy Oracle-managed Service gateway and NAT gateway (Quick Start). This option simplifies your network configuration by allowing Oracle to provide and manage these communication gateways for private use by the cluster. These gateways are created in the Oracle tenancy and can't be modified after the cluster is created.
Note
Select the Use the gateways in your
selected Customer VCN (Customizable) option if you want
more control over the networking configuration.
In the Additional Options > SSH public key section, click Choose SSH
public key file.
Add the public key you created in the Get Started tutorial. Your public key has
a .pub extension. Example:
big-data-key.pub
Click Create Cluster. The Clusters page is re-displayed. The
state of the cluster is initially Creating.
If you are using a Free Trial account to run this workshop, Oracle recommends that
you delete the BDS cluster when you complete the workshop to avoid unnecessary
charges.
The process of creating the cluster takes approximately one hour to complete. You can
monitor the cluster creation progress as follows:
To view the cluster's details, click training-cluster
in the Name column to display the Cluster Details page.
The Cluster Information tab displays the cluster's general and
network information.
The List of cluster nodes section displays the
following information for each node in the cluster: Name, status, type,
shape, private IP address, and date and time of creation.
Note
The name of a node is the concatenation of the first seven letters of the cluster's name, trainin, followed by two letters representing the node type: mn for a Master node, un for a Utility node, and wn for a Worker node. The numeric suffix indicates the node's order within its type, such as 0, 1, and 2.
To view the details of a node, click the node's name link in the Name
column. For example, click the traininmn0 master node in
the Name column to display the Node Details page.
The Node Information tab displays the node's general information and
the network information.
The Node Metrics section at the bottom of the Node Details page
is displayed after the cluster is provisioned. It displays the
following charts: CPU Utilization, Memory Utilization,
Network Bytes In, Network Bytes Out, and Disk
Utilization. You can hover over any chart to get additional
details.
Click the Cluster Details link in the breadcrumbs at the top of the page
to re-display the Cluster Details page.
In the Resources section on the left, click Work Requests.
The Work Requests section on the page displays the status of the cluster
creation and other details such as the Operation, Status, %
Complete, Accepted, Started, and Finished. Click
the CREATE_BDS name link in the Operation column.
The CREATE_BDS page displays the work request information, logs,
and errors, if any.
Click the Clusters link in the breadcrumbs at the top of the page to
re-display the Clusters page.
Once the training-cluster cluster is created
successfully, the status changes to Active.
Lab 3. Add Oracle Cloud SQL to the Cluster
You add Oracle Cloud SQL to a cluster so that you can use SQL to query your big data
sources. When you add Cloud SQL support to a cluster, a query server node is added and big
data cell servers are created on all worker nodes.
Note
Cloud SQL is not included with Big Data Service. You must pay an extra fee for
using Cloud SQL.
On the Clusters page, on the row for
training-cluster, click the Actions
button.
From the context menu, select Add Cloud SQL.
In the Add Cloud SQL dialog box, provide the following
information:
Query Server Node Shape: Select
VM.Standard2.4.
Query Server Node Block Storage (In GB): Enter
1000.
Cluster Admin Password: Enter your cluster administration
password that you chose when you created the cluster such as
Training123.
Click Add. The Clusters page is re-displayed. The status of the
training-cluster is now Updating and the
number of nodes in the cluster is increased by 1.
Click the training-cluster name link in the Name column to
display the Cluster Details page. Scroll-down the page to the List of
cluster nodes section. The newly added Cloud SQL node,
traininqs0, is displayed.
Click the Cloud SQL Information tab to display information about the
new Cloud SQL node.
Click Work Requests in the Resources section. In the Work
Requests section, the ADD_CLOUD_SQL operation is
displayed along with the status of the operation and percent completed. Click
the ADD_CLOUD_SQL link.
The Work Request Details page displays the status, logs, and errors (if
any) of adding the Cloud SQL node to the cluster.
Click the Clusters link in the breadcrumbs at the top of the page to
re-display the Clusters page. Once the Cloud SQL node is successfully
added to the cluster, the cluster's state changes to Active and the
number of nodes in the cluster is now increased by 1.
Lab 4. Map the Private IP Addresses to Public IP Addresses
Big Data Service nodes are assigned private IP addresses by default, which aren't accessible from the public internet. To work with the cluster, you must make its nodes reachable. In this workshop, you map the private IP addresses of the nodes in the cluster to public IP addresses to make them publicly available on the internet. We assume that making the IP addresses public is an acceptable security risk.
Note
A bastion host, VPN Connect, or OCI FastConnect provides a more private and secure option than making the IP addresses public.
In this lab, you use Oracle Cloud Infrastructure Cloud Shell, which is a web
browser-based terminal accessible from the Oracle Cloud Console.
Open the navigation menu and click Analytics and AI. Under Data Lake, click Big Data Service.
On the Clusters page, click the training-cluster
link in the Name column to display the Cluster Details page.
In the Cluster Information tab, in the Customer Network
Information section, click the Copy link next to Subnet
OCID. Next, paste that OCID to an editor or a file, as you will need it
later in this workshop.
On the same page, in the List of cluster nodes section, in the IP
Address column, find the private IP addresses for the utility node
traininun0, the master node
traininmn0, and the Cloud SQL node
traininqs0. Save the IP addresses as you will
need them in later tasks.
A utility node generally contains utilities used for accessing the cluster. Making
the utility nodes in the cluster publicly available makes services that run on the utility
nodes available from the internet.
In this task, you set three variables using the export
command. These variables are used in the oci network command
that you use to map the private IP address of the utility node to a new public IP
address.
On the Oracle Cloud Console banner at the top of the page, click
Cloud Shell. It may take a few moments to connect and
authenticate you.
To change the Cloud Shell background color theme from dark to light, click
Settings
on the Cloud Shell banner, and then select Theme > Light from the
Settings menu.
At the $ command line prompt, enter the following command. The
display-name is an optional
descriptive name that will be attached to the reserved public IP address that
will be created for you. Press the [Enter] key to run
the command.
$ export DISPLAY_NAME="traininun0-public-ip"
At the $ command line prompt, enter the following command. Substitute
subnet-ocid with your own
subnet-ocid that you identified in Task 1
of this step. Press the [Enter] key to run the command.
$ export SUBNET_OCID="subnet-ocid"
At the $ command line prompt, enter the following command. The private
IP address is the ip address that is assigned to the node that you want to map.
In this case, provide the utility node's private IP address that you identified
in Task 1. Press the [Enter] key to run the
command.
$ export PRIVATE_IP="ip-address"
At the $ command line prompt, enter the following command exactly as
it's shown below without any line breaks, or click Copy to
copy the command, and then paste it on the command line. Press the
[Enter] key to run the command.
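The following is a sketch of a command that creates the reserved public IP address and maps it to the node's private IP address. It assumes the OCI CLI and jq, both of which are available in Cloud Shell, and the DISPLAY_NAME, SUBNET_OCID, and PRIVATE_IP variables set above; the two nested oci network private-ip list calls look up the private IP's compartment OCID and its OCID.
oci network public-ip create --display-name $DISPLAY_NAME --lifetime "RESERVED" --compartment-id `oci network private-ip list --subnet-id $SUBNET_OCID --ip-address $PRIVATE_IP | jq -r '.data[0]."compartment-id"'` --private-ip-id `oci network private-ip list --subnet-id $SUBNET_OCID --ip-address $PRIVATE_IP | jq -r '.data[0].id'`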
In the output returned, find the value of the ip-address field. This is the new reserved public IP address that is mapped to the private IP address of your utility node.
To view the newly created reserved public IP address in the console, click the
Navigation menu and navigate to Networking. In the IP
Management section, click Reserved IPs. The new reserved public
IP address is displayed in the Reserved Public IP Addresses page. If you specified a descriptive name as explained earlier, that name appears in the Name column; otherwise, a name such as publicipnnnnnnnnn is generated.
In this task, you set two variables using the export command.
These variables are used in the oci network command that you use to
map the private IP address of the master node to a new public IP address. You have
done similar work in the previous task.
At the $ command line prompt, enter the following command. The
display-name is an optional
descriptive name that will be attached to the reserved public IP address that
will be created for you. Press the [Enter] key to run
the command.
$ export DISPLAY_NAME="traininmn0-public-ip"
You already set the SUBNET_OCID variable to your own subnet-ocid value in the previous task. You don't need to set this variable again.
At the $ command line prompt, enter the following command. The private
IP address is the ip address that is assigned to the node that you want to map.
In this case, provide the first master node's private IP address that you
identified in Task 1. Press the [Enter] key to
run the command.
$ export PRIVATE_IP="ip-address"
At the $ command line prompt, enter the following command exactly as
it's shown below without any line breaks, or click Copy to
copy the command, and then paste it on the command line. Press the
[Enter] key to run the command.
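The mapping command has the same form as the sketch in the previous task, this time using the DISPLAY_NAME, SUBNET_OCID, and PRIVATE_IP values for the master node:
oci network public-ip create --display-name $DISPLAY_NAME --lifetime "RESERVED" --compartment-id `oci network private-ip list --subnet-id $SUBNET_OCID --ip-address $PRIVATE_IP | jq -r '.data[0]."compartment-id"'` --private-ip-id `oci network private-ip list --subnet-id $SUBNET_OCID --ip-address $PRIVATE_IP | jq -r '.data[0].id'`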
In the output returned, find the value of the ip-address field. This is the new reserved public IP address that is mapped to the private IP address of your master node.
To view the newly created reserved public IP address in the console, click the
Navigation menu and navigate to Networking. In the IP
Management section, click Reserved IPs. The new reserved public
IP address is displayed in the Reserved Public IP Addresses page. If you specified a descriptive name as explained earlier, that name appears in the Name column; otherwise, a name such as publicipnnnnnnnnn is generated.
In this task, you set two variables using the export command.
Next, you use the oci network command to map the private IP address
of the Cloud SQL node to a new public IP address.
In the Cloud Shell, at the $ command line prompt, enter the
following command.
$ export DISPLAY_NAME="traininqs0"
You already set the SUBNET_OCID variable to your own subnet-ocid value in the previous task. You don't need to set this variable again.
At the $ command line prompt, enter the following command. The private
IP address is the ip address that is assigned to the node that you want to map.
In this case, provide the Cloud SQL node's private IP address that you
identified in Task 1.
export PRIVATE_IP="ip-address"
At the $ command line prompt, enter the following command exactly as
it's shown below without any line breaks, or click Copy to
copy the command, and then paste it on the command line.
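Again, a sketch of the mapping command, using the variables set for the Cloud SQL node:
oci network public-ip create --display-name $DISPLAY_NAME --lifetime "RESERVED" --compartment-id `oci network private-ip list --subnet-id $SUBNET_OCID --ip-address $PRIVATE_IP | jq -r '.data[0]."compartment-id"'` --private-ip-id `oci network private-ip list --subnet-id $SUBNET_OCID --ip-address $PRIVATE_IP | jq -r '.data[0].id'`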
In the output returned, find the value of the ip-address field. This is the new reserved public IP address that is mapped to the private IP address of your Cloud SQL node.
To view the newly created reserved public IP address in the console, click the
Navigation menu and navigate to Networking. In the IP
Management section, click Reserved IPs. The new reserved public
IP address is displayed in the Reserved Public IP Addresses page.
In this task, you edit a public IP address using both the Cloud Console and
the Cloud Shell.
Click the Navigation menu and navigate to Networking. In the
IP Management section, click Reserved IPs. The new reserved
public IP addresses that you created in this step are displayed in the
Reserved Public IP Addresses page.
Change the name of the public IP address associated with the Cloud SQL node
from traininqs0 to
traininqs0-public-ip. On the row for
traininqs0, click the Actions button, and then
select Rename from the context menu.
In the Rename dialog box, in the RESERVED PUBLIC IP NAME field,
enter traininqs0-public-ip, and then click Save
Changes.
Do not delete any of your public IP addresses as you need them
for this tutorial.
Lab 5. Use Apache Ambari to Access the Cluster
In this task, you use Apache Ambari to access the cluster. In a Big Data cluster,
Apache Ambari runs on the first utility node, traininun0. You use
the reserved public IP address that is associated with traininun0
that you created in Task 2 of Lab 4.
Open a Web browser window and enter the following URL. Substitute
ip-address with your own
ip-address that is associated with
the utility node in your cluster, traininun0, which you created previously. To view your reserved public IP address in the
console, click the Navigation menu and navigate to Networking. In
the IP Management section, click Reserved IPs. The reserved public
IP address is displayed in the Reserved Public IP Addresses page.
https://<ip-address>:7183
On the login screen, enter the following information:
username: admin
password: password you specified when you
created the cluster
Click Sign In.
From the Dashboard, note the name of the cluster at the top right and
the services running on the cluster from the left navigation.
Click Hosts. The hosts of the cluster are displayed. Hosts are
configured with one or more components, each of which corresponds to a service.
The component indicates which daemon, also known as service, runs on the host.
Typically, a host will run multiple components in support of the various
services running in the cluster.
Drill-down on the components associated with the master node in the cluster,
traininmn0.
The services and components that are running on the master node are displayed, such as the HDFS NameNode, Spark3 History Server, YARN Registry DNS, and YARN ResourceManager, among other services.
Exit Apache Ambari. From the User drop-down menu, select Sign
out.
In this task, you connect to the cluster's master node using SSH as user
opc (the default Oracle Public Cloud user).
When you created a cluster, you used your SSH public key to create the nodes. In this
section, you use the matching private key to connect to the master node.
Open the
navigation menu and click Networking. Under IP Management, click
Reserved IPs.
Copy the reserved public IP address of traininmn0-public-ip. You use this IP address to connect to your master node.
Open a terminal or a Command Prompt window using an application such as Git Bash.
Change into the directory where you stored the SSH encryption keys you created
in the Get Started tutorial.
Connect to your node with the ssh command, as shown in the sketch below. Your private key has no extension. Example: big-data-key.
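A minimal sketch, assuming your private key file big-data-key is in the current directory; substitute your master node's reserved public IP address for the placeholder:
ssh -i big-data-key opc@<master-node-public-ip>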
Because you identified your public key when you created the nodes, this
command logs you into your instance. You can now issue sudo commands to
create a Linux administrator.
Create the training Linux administrator user and the OS group supergroup. Assign supergroup, the superuser group, as the primary group for training, and hdfs, hadoop, and hive as its secondary groups.
The opc user has sudo
privileges on the cluster which allows it to switch to the root
user in order to run privileged commands. Change to the root
user as follows:
sudo bash
The dcli utility allows you to run the command that you
specify across each node of the cluster. The syntax for the
dcli utility is as follows:
dcli [option] [command]
Use the -C option to run the specified command on all
the nodes in the cluster.
Enter the following command at the # prompt to create
the OS group supergroup which is defined as the
superuser group in hadoop.
dcli -C "groupadd supergroup"
Enter the following command at the # prompt to create
the training administrator user and add it to the listed
groups on each node in the training-cluster. The
useradd linux command creates the new
training user and adds it to the specified
groups.
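A sketch of the command, consistent with the description that follows; the -d and -m options, which create a home directory for the new user at /home/training, are an assumption:
dcli -C "useradd -d /home/training -m -g supergroup -G hdfs,hadoop,hive training"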
The preceding command creates a new user named
training on each node of the cluster. The
-g option assigns the
supergroup group as the primary group for
training. The -G
option assigns the hdfs,
hadoop, and hive groups
as the secondary groups for training.
Note
Since the training user is part of the hive
group, it is considered an administrator for Sentry.
Use the Linux id command to confirm the creation of the new user and to list its group memberships.
id training
You can now access HDFS using the new training
administrator user on any node in the cluster such as the first master
node in this example. Change to the training user as
follows:
sudo -su training
Use the Linux id command to confirm that you are now connected as the training user.
id
Perform a file listing of HDFS as the training user using the
following command:
hadoop fs -ls /
Lab 7. Upload Data to HDFS and Object Storage
In this step, you download and run two sets of scripts.
First, you will download and run the Hadoop Distributed File System (HDFS) scripts to
download data from Citi Bikes NYC to a new local directory on your master node
in your BDS cluster. The HDFS scripts manipulate some of the downloaded data files, and then upload them to new HDFS directories. The HDFS scripts also create Hive databases and tables that you will query using Hue.
Second, you will download and run the object storage scripts to download data from Citi
Bikes NYC to your local directory using OCI Cloud Shell. The object storage scripts upload the data to a new bucket in Object Storage. See the Data License Agreement for information about the Citi Bikes NYC data license agreement.
Open the navigation menu and click
Identity & Security. Under Identity, click
Compartments.
In the list of compartments, search for the training-compartment. In the
row for the compartment, in the OCID column, hover over the OCID link and
then click Copy. Next, paste that OCID to an editor or a file, so that
you can retrieve it later in this step.
Click the Navigation menu and navigate to Networking > Reserved
IPs. The Reserved Public IP Addresses page is displayed. In the
List Scope on the left pane, make sure that your
training-compartment is selected.
In the row for the traininmn0-public-ip reserved IP address, copy
the reserved public IP address associated with the master node in the
Reserved Public IP column. Next, paste that IP address to an editor
or a file, so that you can retrieve it later in this step. You might need this
IP address to ssh to the master node, if you didn't save your ssh connection in
step 5.
Log in as the training Hadoop Administrator user that
you created in Step 5.
sudo -su training
Use the id command to confirm that you are connected as the
training Hadoop Administrator user.
id
Use the cd command to change the working directory to that of
the training user. Use the pwd command
to confirm that you are in the training working
directory:
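Assuming the training user's home directory is /home/training:
cd /home/training
pwd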
In this task, you download two scripts that will set up your HDFS environment and
download the HDFS dataset from Citibike System Data. The scripts and a
randomized weather data file are stored in a public bucket in Object Storage.
The Citi Bikes detailed trip data files (in zipped format) are first downloaded to a
new local directory. Next, the files are unzipped, and the header row is removed
from each file. Finally, the updated files are uploaded to a new
/data/biketrips HDFS directory. Next, a new
bikes Hive database is created with two Hive tables.
bikes.trips_ext is an external table defined over
the source data. The bikes.trips table is created from this
source; it is a partitioned table that stores the data in Parquet format. The tables
are populated with data from the .csv files in the
/data/biketrips directory.
The stations data file is downloaded from the station information page and then manipulated. The updated file is then uploaded to a new /data/stations HDFS directory.
The weather data is downloaded from a public bucket in Object Storage. Next, the
header row in the file is removed. The updated file is then uploaded to a new
/data/weather HDFS directory. Next, a new
weather Hive database and
weather.weather_ext table are created and populated with data from the weather-newark-airport.csv file in the /data/weather directory.
Run the following command to download the env.sh script
from a public bucket in Object Storage to the training
working directory. You will use this script to set up your HDFS
environment.
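A sketch of the download command; the workshop's public bucket URL is shown here only as a placeholder:
wget <public-bucket-URL>/env.sh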
Run the following command to download the
download-all-hdfs-data.sh script from a public
bucket in Object Storage to the training working
directory. You will run this script to download the dataset to your local
working directory. The script will then upload this data to HDFS.
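Similarly, a sketch with a placeholder URL:
wget <public-bucket-URL>/download-all-hdfs-data.sh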
Add the execute privilege to the two downloaded .sh
files as follows:
chmod +x *.sh
Display the content of the env.sh file using the
cat command. This file sets the target local and HDFS
directories.
cat env.sh
Note
You will download the data from Citi Bikes NYC to the new
Downloads local target directory as
specified in the env.sh file. You will upload the
data from the local Downloads directory to the
following new HDFS directories under the new /data
HDFS directory as specified in the env.sh and the
HDFS scripts: biketrips,
stations, and
weather.
Use the cat command to display the content of the
download-all-hdfs-data.sh script. This script
downloads the download-citibikes-hdfs.sh and
download-weather-hdfs.sh scripts to the local
training working directory, adds execute
privilege on both of those scripts, and then runs the two scripts.
cat download-all-hdfs-data.sh
The download-citibikes-hdfs.sh script does the
following:
Runs the env.sh script to set up your local and
HDFS target directories.
Downloads the stations information from the Citi Bike Web site to the
local Downloads target directory.
Creates a new /data/stations HDFS directory, and
then copies the stations.json file to this HDFS
directory.
Downloads the bike rental data files (the zipped .csv
files) from Citi Bikes NYC to the local
Downloads target directory.
Unzips the bike rental files, and removes the header row from each
file.
Creates a new /data/biketrips HDFS directory, and then uploads the updated .csv files to this HDFS directory.
Creates the bikes Hive database with two Hive tables.
bikes.trips_ext is an external table
defined over the source data. bikes.trips is
created from this source; it is a partitioned table that stores the data
in Parquet format. The tables are populated with data from the
.csv files in the
/data/biketrips directory.
The download-weather-hdfs.sh script provides a random
weather data set for Newark Liberty Airport in New Jersey. It does the
following:
Runs the env.sh script to set up your local and HDFS
target directories.
Downloads the weather-newark-airport.csv file from a public bucket in Object Storage to the local Downloads target directory.
Removes the header row from the file.
Creates a new /data/weather HDFS directory, and
then uploads the weather-newark-airport.csv file to
this HDFS directory.
Creates the weather Hive database and the
weather.weather_ext Hive table. It then populates
the table with the weather data from the
weather-newark-airport.csv file in the local
Downloads directory.
Run the download-all-hdfs-data.sh script as
follows:
./download-all-hdfs-data.sh
Text messages will scroll on the screen. After a minute or so, the Weather
data loaded and Done messages are displayed on the screen.
Navigate to the local Downloads directory, and then use
the ls -l command to display the downloaded trips, stations,
and weather data files.
cd Downloads
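Then list the downloaded files:
ls -l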
Use the head command to display the first two records
from the stations.json file.
head -2 stations.json | jq
Use the head command to display the first 10 records
from the weather-newark-airport.csv file.
head weather-newark-airport.csv
Use the following commands to display the HDFS directories that were created,
and list their contents.
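A sketch of the listing commands, using the HDFS directories created by the scripts:
hadoop fs -ls /data
hadoop fs -ls /data/biketrips
hadoop fs -ls /data/stations
hadoop fs -ls /data/weather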
Hit the [Enter] key on your keyboard to execute the last command
above.
Use the following command to display the first 5 rows from the uploaded
JC-201902-citibike-tripdata.csv file in the
/data/biketrips HDFS folder. Remember, the
header row for this uploaded .csv file was removed when you ran
the download-citibikes-hdfs.sh script.
hadoop fs -cat /data/biketrips/JC-201902-citibike-tripdata.csv | head -5
In this task, you will download two scripts. The scripts and a randomized weather
data file are stored in a public bucket in Object Storage.
The scripts that you download for this section will set up your Object Storage
environment and download the object storage dataset from Citi
Bikes NYC.
On the Oracle Cloud Console banner at the top of the page, click
Cloud Shell. It may take a few moments to connect and authenticate
you.
Click Copy to copy the following command. Right-click your mouse, select
Paste, and then paste it on the command line. You will use this
script to set up your environment for the Object Store data. Press the
[Enter] key to run the command.
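A sketch of the download command, with the workshop's public bucket URL shown only as a placeholder:
wget <public-bucket-URL>/env.sh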
Edit the downloaded env.sh file using the vi
Editor (or an editor of your choice) as follows:
vi env.sh
To input and edit text, press the [i] key on your keyboard (insert mode)
at the current cursor position. The INSERT keyword is displayed at the
bottom of the file to indicate that you can now make your edits to this file.
Scroll-down to the line that you want to edit. Copy your training-compartment
OCID value that you identified in Task 1, and then paste it
between the " " in the export
COMPARTMENT_OCID="" command.
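After the edit, the line should look like the following, with your compartment's OCID in place of the illustrative placeholder value:
export COMPARTMENT_OCID="ocid1.compartment.oc1..<your-compartment-unique-id>"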
Note
You will upload the Object Store data to the
training bucket as specified in the
env.sh file.
Press the [Esc] key on your keyboard, enter :wq,
and then press the [Enter] key on your keyboard to save your changes and
quit vi.
At the $ command line prompt, click Copy to copy the following
command, and then paste it on the command line. You will run this script to
download the dataset to your local working directory. You will then upload this
data to a new object in a new bucket. Press the [Enter] key to run the
command.
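Again, a sketch with a placeholder URL:
wget <public-bucket-URL>/download-all-objstore.sh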
Add the execute privilege to the two downloaded .sh
files as follows:
chmod +x *.sh
Use the cat command to display the content of this
script. This script runs the env.sh script, downloads the
download-citibikes-objstore.sh and
download-weather-objstore.sh scripts, adds
execute privilege to both of those scripts, and then runs the two scripts.
cat download-all-objstore.sh
You can use the cat command to display the content of this script. The download-all-objstore.sh script runs the env.sh script, which sets the environment. The script copies some of the data from Citi Bikes NYC, along with a randomized weather data file stored in a public bucket in Object Storage, to your local Cloud Shell directory and to new objects in a new bucket named training, as specified in the env.sh script. The training bucket will contain the following new objects:
The weather object which stores the weather
data.
The stations object which stores the stations
data.
The biketrips object which stores the bike trips
data.
Run the download-all-objstore.sh script as
follows:
./download-all-objstore.sh
Text messages will scroll on the screen. After a minute or so, the Done message, along with the location of the data (both compartment and bucket), is displayed on the screen.
Navigate to the local Downloads directory to display
the downloaded data files.
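For example, assuming the same Downloads target directory specified in env.sh:
cd Downloads
ls -l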
Click the Navigation menu and navigate to Storage. In the
Object Storage & Archive Storage section, click Buckets.
The Buckets page is displayed. In the List Scope on the left pane,
make sure that your training-compartment is selected. In the list of
available buckets, the newly created training bucket is displayed in the
Name column. Click the training link.
The Bucket Details page for the training bucket is displayed.
Scroll-down to the Objects section to display the newly created
biketrips, stations, and weather objects.
To display the data files in an object such as the biketrips object, click Expand next to the object's name. The files contained in that object are displayed. To collapse the list of files, click Collapse next to the object's name.
To display the first 1KB of the file's content (in read-only mode), click the
Actions button on the row for the file, and then select View
Object Details from the context menu.
Note
To view all the data in a file, select Download from the context
menu, and then double-click the downloaded file to open it using its native
application, MS-Excel (.csv) in this example.
You use the Clusters and Cluster Details pages to maintain your
clusters.
Open the navigation menu and click Analytics and AI. Under Data Lake, click Big Data Service.
On the Clusters page, on the row for
training-cluster, click the Actions
button. You can use the context menu to view the cluster's details, add nodes,
add block storage, add Cloud SQL, rename the cluster, remove Cloud SQL (if it's
already added), and terminate the Big Data cluster.
Alternatively, you can click the training-cluster link
in the Name column to display the Cluster Details page. You can
use the buttons at the top of the page to do the following:
Add nodes to the cluster.
Add block storage.
Add Cloud SQL.
Change shape.
Use the More Actions drop-down list to rename the cluster, move a
resource from the current compartment to a different compartment, add
tags, remove Cloud SQL (if it's already added), and terminate the Big Data cluster.
Note
Oracle Cloud Infrastructure Tagging allows you to add metadata to
resources, which enables you to define keys and values and associate them with
resources. You can use the tags to organize and list resources based on your
business needs.
You can monitor the cluster's metrics and the metrics of any of its
nodes.
From the Clusters page, click training-cluster
in the Name column to display the Cluster Details page.
Scroll-down the Cluster Details page. In the Resources section on
the left, click Cluster Metrics.
The Cluster Metrics section shows the various metrics such as HDFS
Space Used, HDFS Space Free, Yarn Jobs Completed, and
Spark Jobs Completed. You can adjust the Start time, End time,
Interval, Statistic, and Options fields, as desired.
In the Resources section on the left, click Nodes (7).
In the List of cluster nodes section, click any node name link to
display its metrics. Click the traininmn0 master node in
the Name column.
In the Node Details page, scroll-down to the Node Metrics
section. This section is displayed at the bottom of the Node Details page
only after the cluster is successfully provisioned. It displays the
following charts: CPU Utilization, Memory Utilization, Network
Bytes In, Network Bytes Out, and Disk Utilization. You can
hover over any chart to get additional details.
Click the Cluster Details link in the breadcrumbs at the top of the page
to re-display the Cluster Details page.
Lab 9. Clean Up Tutorial Resources
You can delete the resources that you created in this workshop. If you want to run
the labs in this workshop again, perform these clean up tasks.
If you want to list the resources in your training-compartment,
you can use the Tenancy Explorer page. From the Navigation menu, navigate
to Governance & Administration. In the Governance section, click
Tenancy Explorer. On the Tenancy Explorer page, in the Search
compartments field, type training, and then select
training-compartment from the list of compartments. The
resources in the training-compartment are displayed.
Open the navigation menu and click Analytics and AI. Under Data Lake, click Big Data Service.
On the Clusters page, on the row for
training-cluster, click the Actions
button, and then select Terminate Big Data Cluster from the context
menu.
A confirmation message box is displayed. Enter the name of the cluster, and then click Terminate. The status of the cluster in the State column changes from Active to Deleting. It can take up to 30 minutes before the cluster is deleted.
To view the status of the deletion process, click the cluster's name link in
the Name column to display the Cluster Details page. In the
Resources section at the bottom left-hand side of the page, click
Work Requests. In the Work Requests section, you can see the
% Complete information.
For additional details on the deletion process, click DELETE_BDS in the Operation column. The DELETE_BDS page displays the logs and errors, if any.
Click the Clusters link in the breadcrumbs to return to the
Clusters page. When the cluster is successfully deleted, the status
of the cluster in the State column changes from Deleting to
Deleted.
Open the navigation menu and click
Identity & Security. Under Identity, click
Policies.
Click the Actions button associated with the
training-admin-policy policy, and then select Delete from the
context menu. A confirmation message box is displayed, click
Delete.
Click the Actions button associated with the training-bds-policy
policy, and then select Delete from the context menu. A confirmation
message box is displayed, click Delete.
To delete a VCN, it must first be empty and have no related resources or attached
gateways such as internet gateway, dynamic routing gateway, and so on. To delete a VCN's
subnets, they must first be empty too.
Open the navigation menu and click
Networking. Then click Virtual Cloud Networks.
From the list of available VCNs in your compartment, click the
training-vcn name link in the Name column. The Virtual
Cloud Network Details page is displayed.
In the Subnets section, click the Actions button associated with
Private Subnet-training-vcn. Select Terminate from the context
menu. A confirmation message box is displayed. Click Terminate.
In the Subnets section, click the Actions button associated with
Public Subnet-training-vcn. Select Terminate from the context
menu. A confirmation message box is displayed. Click Terminate.
In the Resources section on the left pane, click Route
Tables.
In the Routes Tables section, click the Actions button associated
with Route Table for Private Subnet-training-vcn. Select Terminate
from the context menu. A confirmation message box is displayed. Click
Terminate.
In the Routes Tables section, click the Default Route Table for
training-vcn link in the Name column. The Route Table
Details page is displayed. In the Route Rules section, click the
Actions icon associated with Internet Gateway-training-vcn.
Select Remove from the context menu. A confirmation message box is
displayed. Click Remove. Click training-vcn in the breadcrumbs to
return to the training-vcn page.
In the Resources section on the left pane, click Internet
Gateways. In the Internet Gateways section, click the
Actions button associated with Internet Gateway-training-vcn.
Select Terminate from the context menu. A confirmation message box is
displayed. Click Terminate.
In the Resources section on the left pane, click Security Lists.
In the Security Lists section, click the Actions button associated
with Security List for Private Subnet-training-vcn. Select
Terminate from the context menu. A confirmation message box is
displayed. Click Terminate.
In the Resources section on the left pane, click NAT Gateways. In
the NAT Gateways section, click the Actions button associated with
NAT Gateway-training-vcn. Select Terminate from the context
menu. A confirmation message box is displayed. Click Terminate.
In the Resources section on the left pane, click Service
Gateways. In the Service Gateways section, click the Actions
button associated with Service Gateway-training-vcn. Select
Terminate from the context menu. A confirmation message box is
displayed. Click Terminate.
At the top of the page, click Terminate to terminate your VCN. A
Terminate Virtual Cloud Network window is displayed. After less than
a minute, the Terminate All button is enabled. To delete your VCN, click
Terminate All.
When the termination operation is completed successfully, a Virtual Cloud
Network termination complete message is displayed in the window. Click
Close.
Click the Navigation menu and navigate to Networking. In the
IP Management section, click Reserved IPs. The Reserved
Public IP Addresses page is displayed.
In the List Scope on the left pane, make sure that your
training-compartment is selected.
In this workshop, you have created three reserved IP addresses:
traininmn0-public-ip,
traininqs0-public-ip, and
traininun0-public-ip.
Click the Actions button associated with
traininmn0-public-ip. Select Terminate
from the context menu. A confirmation message box is displayed. Click
Terminate.
Click the Actions button associated with
traininqs0-public-ip. Select Terminate
from the context menu. A confirmation message box is displayed. Click
Terminate.
Click the Actions button associated with
traininun0-public-ip. Select Terminate
from the context menu. A confirmation message box is displayed. Click
Terminate.
Before you can delete a bucket that contains objects, you must delete all the objects
in the bucket first.
Click the Navigation menu and navigate to Storage. In the
Object Storage & Archive Storage section, click Buckets.
The Buckets page is displayed. In the List Scope on the left pane,
make sure that your training-compartment is selected. In the list of
available buckets, the newly created training bucket is displayed in the
Name column. Click the training link.
The Bucket Details page for the training bucket is displayed.
Scroll-down to the Objects section.
On the row for the biketrips object, click the Actions button,
and then select Delete Folder from the context menu.
A confirmation message box is displayed. Enter biketrips in the Type
the folder name to confirm deletion text box, and then click
Delete. The object is deleted and the Bucket Details page is
re-displayed.
On the row for the stations object, click the Actions button, and
then select Delete Folder from the context menu.
A confirmation message box is displayed. Enter stations in the Type
the folder name to confirm deletion text box, and then click
Delete. The object is deleted and the Bucket Details page is
re-displayed.
On the row for the weather object, click the Actions button, and
then select Delete Folder from the context menu.
A confirmation message box is displayed. Enter weather in the Type
the folder name to confirm deletion text box, and then click
Delete. The object is deleted and the Bucket Details page is
re-displayed.
Scroll up the page, and then click the Delete button. A confirmation
message box is displayed. Click Delete. The bucket is deleted and the
Buckets page is re-displayed.
Open the navigation menu and click
Identity & Security. Under Identity, click
Compartments.
From the list of available compartments, search for your
training-compartment.
On the Compartments page, click the Actions button associated
with training-compartment. Select Delete from the context
menu.
A confirmation message box is displayed. Click Delete. The status of the
deleted compartment changes from Active to Deleting until the
compartment is successfully deleted. You can click on the compartment name link
in the Name column to display the status of this operation.