Get Started with a Non-Highly Available ODH Big Data Cluster

You use an Oracle Cloud Infrastructure account to create a non-highly available Big Data cluster with Oracle Distribution including Apache Hadoop (ODH).

You can create Big Data clusters with options for node shapes and storage sizes. Select these options based on your use case and performance needs. In this workshop, you create a non-HA cluster and assign small shapes to the nodes. This cluster is well suited to testing applications.

This simple non-HA cluster has the following profile:

  • Nodes: 1 Master node, 1 Utility node, and 3 Worker nodes.

  • Master and Utility Node Shape: VM.Standard2.4 shape for the Master and Utility nodes. This shape provides 4 OCPUs and 60 GB of memory.

  • Worker Node Shape: VM.Standard2.1 shape for the Worker nodes in the cluster. This shape provides 1 OCPU and 15 GB of memory.

  • Storage Size: 150 GB block storage for the Master, Utility, and Worker nodes.

Figure: Graphical representation of the non-HA cluster nodes

Before You Begin

To successfully perform this tutorial, you must have the following:

  • An Oracle Cloud Infrastructure account with permission to create the resources used in this workshop.

Lab 1. Set Up OCI Resources Needed for Big Data Clusters

In this lab, you use an Oracle Cloud Infrastructure account to prepare the resources needed to create a Big Data cluster.
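
If you prefer the command line, the following is a minimal Cloud Shell sketch of the same setup. The compartment name training-compartment is the one used throughout this workshop; the VCN name, CIDR block, and placeholder OCIDs are illustrative assumptions, and the console wizard used in the lab also creates related resources (subnets, gateways, and security lists) that this sketch omits.

    # Create a compartment for the workshop resources
    oci iam compartment create \
      --compartment-id <tenancy-ocid> \
      --name training-compartment \
      --description "Compartment for the Big Data Service workshop"

    # Create a VCN for the cluster (display name and CIDR are illustrative)
    oci network vcn create \
      --compartment-id <training-compartment-ocid> \
      --display-name training-vcn \
      --cidr-block 10.0.0.0/16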

Lab 2. Create a non-HA ODH Cluster

Create a non-HA cluster and monitor the steps.
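
You monitor the creation steps from the cluster's Work Requests section in the console. As a rough alternative sketch, you can also poll the cluster state from Cloud Shell with the OCI CLI; the --query expression below is illustrative.

    # List Big Data Service clusters in the compartment with their lifecycle state
    oci bds instance list \
      --compartment-id <training-compartment-ocid> \
      --query 'data[*].{name:"display-name", state:"lifecycle-state"}' \
      --output table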

Lab 3. Add Oracle Cloud SQL to the Cluster

You add Oracle Cloud SQL to a cluster so that you can use SQL to query your big data sources. When you add Cloud SQL support to a cluster, a query server node is added and big data cell servers are created on all worker nodes.

Note

Cloud SQL is not included with Big Data Service. You must pay an extra fee for using Cloud SQL.
  1. On the Clusters page, on the row for training-cluster, click the Actions button.
  2. From the context menu, select Add Cloud SQL.
  3. In the Add Cloud SQL dialog box, provide the following information:
    • Query Server Node Shape: Select VM.Standard2.4.
    • Query Server Node Block Storage (In GB): Enter 1000.
    • Cluster Admin Password: Enter the cluster administration password that you chose when you created the cluster, such as Training123.
  4. Click Add. The Clusters page is re-displayed. The status of the training-cluster is now Updating, and the number of nodes in the cluster increases by 1.
  5. Click the training-cluster name link in the Name column to display the Cluster Details page. Scroll down the page to the List of cluster nodes section. The newly added Cloud SQL node, traininqs0, is displayed.
  6. Click the Cloud SQL Information tab to display information about the new Cloud SQL node.
  7. Click Work Requests in the Resources section. In the Work Requests section, the ADD_CLOUD_SQL operation is displayed along with the status of the operation and percent completed. Click the ADD_CLOUD_SQL link.
  8. The Work Request Details page displays the status, logs, and errors (if any) of adding the Cloud SQL node to the cluster.
  9. Click the Clusters link in the breadcrumbs at the top of the page to re-display the Clusters page. Once the Cloud SQL node is successfully added to the cluster, the cluster's state changes to Active and the number of nodes in the cluster is now increased by 1.
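
As an optional cross-check, you can list the cluster nodes from Cloud Shell and confirm that the new query server node appears. This is a sketch; the node field names shown in the --query expression are assumptions and may differ slightly across CLI versions.

    # Show the nodes of the cluster, including the newly added Cloud SQL query server node
    oci bds instance get \
      --bds-instance-id <training-cluster-ocid> \
      --query 'data.nodes[*].{name:"display-name", type:"node-type", ip:"ip-address"}' \
      --output table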

Lab 4. Map the Private IP Addresses to Public IP Addresses

Big Data Service nodes are by default assigned private IP addresses, which aren't accessible from the public internet.

You must make the nodes in the cluster reachable before you can connect to them. In this workshop, you map the private IP addresses of the nodes in the cluster to public IP addresses to make them publicly available on the internet. This workshop assumes that making the IP addresses public is an acceptable security risk.
Note

Using a bastion host, VPN Connect, or OCI FastConnect provides a more private and secure option than making the IP addresses public.

In this lab, you use Oracle Cloud Infrastructure Cloud Shell, which is a web browser-based terminal accessible from the Oracle Cloud Console.
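
The lab walks through these Cloud Shell steps in detail. As a rough sketch, mapping one node looks like the following, where the subnet OCID, compartment OCID, and the node's private IP address come from your own environment and the display name is an illustrative assumption.

    # Find the OCID of the private IP object attached to the utility node
    oci network private-ip list \
      --subnet-id <cluster-subnet-ocid> \
      --ip-address <traininun0-private-ip>

    # Create a reserved public IP and map it to that private IP
    oci network public-ip create \
      --compartment-id <training-compartment-ocid> \
      --lifetime RESERVED \
      --display-name traininun0-public-ip \
      --private-ip-id <private-ip-ocid>

You repeat the mapping for each node that you want to reach from the internet.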

Lab 5. Use Apache Ambari to Access the Cluster

In this lab, you use Apache Ambari to access the cluster. In a Big Data cluster, Apache Ambari runs on the first utility node, traininun0. You use the reserved public IP address associated with traininun0, which you created in Task 2 of Lab 4.

Before you can access Apache Ambari on the utility node using a web browser, you must have opened the port associated with the service and mapped the private IP address to a public IP address.
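
If you are not sure whether the port is open, you can review the ingress rules on the cluster subnet's security list from Cloud Shell before continuing; this quick check is not part of the lab steps.

    # List the security lists in the VCN and review their ingress rules for TCP port 7183
    oci network security-list list \
      --compartment-id <training-compartment-ocid> \
      --vcn-id <training-vcn-ocid>
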
  1. Open a web browser window and enter the following URL. Replace ip-address with the reserved public IP address associated with the utility node in your cluster, traininun0, which you created previously. To view your reserved public IP address in the console, click the Navigation menu and navigate to Networking. In the IP Management section, click Reserved IPs. The reserved public IP address is displayed on the Reserved Public IP Addresses page.
    https://<ip-address>:7183
  2. On the login screen, enter the following information:
    • username: admin
    • password: password you specified when you created the cluster
    Click Sign In.
  3. From the Dashboard, note the name of the cluster at the top right and the services running on the cluster from the left navigation.
  4. Click Hosts. The hosts of the cluster are displayed. Each host is configured with one or more components, and each component corresponds to a service: it indicates which daemon (service) runs on that host. Typically, a host runs multiple components in support of the various services in the cluster.
  5. Drill down into the components associated with the master node in the cluster, traininmn0.
    The services and components running on the master node are displayed, such as the HDFS NameNode, Spark3 History Server, YARN Registry DNS, and YARN ResourceManager, among others.
  6. Exit Apache Ambari. From the User drop-down menu, select Sign out.
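
If the Ambari sign-in page does not load, a quick connectivity check against the Ambari REST API from Cloud Shell can help isolate the problem. This is a sketch; the -k flag skips certificate verification, which is typically needed because the cluster presents a self-signed certificate.

    # Query the Ambari REST API for the list of clusters; a JSON response confirms
    # that the port mapping and credentials are working
    curl -k -u admin:<your-cluster-admin-password> https://<ip-address>:7183/api/v1/clusters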

Lab 6. Create a Hadoop Administrator User

Lab 7. Upload Data to HDFS and Object Storage

In this step, you download and run two sets of scripts.

First, you download and run the Hadoop Distributed File System (HDFS) scripts to download data from Citi Bikes NYC to a new local directory on the master node of your BDS cluster. The HDFS scripts manipulate some of the downloaded data files and then upload them to new HDFS directories. The HDFS scripts also create Hive databases and tables, which you will query using Hue.
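
The scripts bundle steps along the following lines. This is only an illustrative sketch, run on the master node: the download URL, directory names, file name, and table definition are placeholders rather than the ones used by the actual lab scripts, and the HiveServer2 connection URL is an assumption.

    # Download a data file to the local file system (placeholder URL)
    curl -L -o bike-trips.csv https://<citibike-data-url>/<file>.csv

    # Stage the file in a new HDFS directory
    hdfs dfs -mkdir -p /data/biketrips
    hdfs dfs -put bike-trips.csv /data/biketrips/

    # Expose the uploaded data as an external Hive table
    beeline -u "jdbc:hive2://<hiveserver2-host>:10000" -e "
      CREATE DATABASE IF NOT EXISTS bikes;
      CREATE EXTERNAL TABLE IF NOT EXISTS bikes.trips (
        trip_duration INT,
        start_station STRING,
        end_station   STRING
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/data/biketrips';"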

Second, you download and run the object storage scripts to download data from Citi Bikes NYC to your local directory using OCI Cloud Shell. The object storage scripts upload the data to a new bucket in Object Storage. See the Citi Bikes NYC Data License Agreement for licensing information.
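
In rough outline, the object storage side does something like the following from Cloud Shell; the bucket name, object name, and file name are placeholders rather than the ones created by the actual lab scripts.

    # Create a bucket and upload a downloaded data file to it
    oci os bucket create \
      --compartment-id <training-compartment-ocid> \
      --name training-bucket

    oci os object put \
      --bucket-name training-bucket \
      --file bike-trips.csv \
      --name biketrips/bike-trips.csv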

Lab 8. Manage your Cluster

Lab 9. Clean Up Tutorial Resources

You can delete the resources that you created in this workshop. If you want to run the labs in this workshop again, perform these clean-up tasks first.

If you want to list the resources in your training-compartment, you can use the Tenancy Explorer page. From the Navigation menu, navigate to Governance & Administration. In the Governance section, click Tenancy Explorer. On the Tenancy Explorer page, in the Search compartments field, type training, and then select training-compartment from the list of compartments. The resources in the training-compartment are displayed.
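
If you prefer to clean up from Cloud Shell, the cluster itself can be deleted with the OCI CLI as sketched below; removing the remaining networking resources and the compartment follows the same pattern. Deletion is irreversible, so double-check the OCIDs first.

    # Delete the Big Data Service cluster (you are prompted to confirm)
    oci bds instance delete --bds-instance-id <training-cluster-ocid>

    # After the compartment's resources are removed, delete the compartment itself
    oci iam compartment delete --compartment-id <training-compartment-ocid>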