If you don't already have a bucket in Object Storage in which to save your input and results, create one with a suitable folder structure. In this example, the folder structure is /output/.
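If you prefer to create the bucket programmatically, a minimal sketch using the OCI Java SDK's Object Storage client might look like the following. The bucket name and compartment OCID are placeholders you must replace, and note that Object Storage has no real folders: the /output/ prefix is created implicitly the first time an object is written under it.
import com.oracle.bmc.ConfigFileReader;
import com.oracle.bmc.Region;
import com.oracle.bmc.auth.AuthenticationDetailsProvider;
import com.oracle.bmc.auth.ConfigFileAuthenticationDetailsProvider;
import com.oracle.bmc.objectstorage.ObjectStorageClient;
import com.oracle.bmc.objectstorage.model.CreateBucketDetails;
import com.oracle.bmc.objectstorage.requests.CreateBucketRequest;
import com.oracle.bmc.objectstorage.requests.GetNamespaceRequest;

public class CreateTutorialBucket {
    public static void main(String[] args) throws Exception {
        AuthenticationDetailsProvider provider =
                new ConfigFileAuthenticationDetailsProvider(ConfigFileReader.parseDefault());
        ObjectStorageClient osClient = new ObjectStorageClient(provider);
        osClient.setRegion(Region.US_PHOENIX_1);
        // Look up the Object Storage namespace of the tenancy
        String namespace = osClient
                .getNamespace(GetNamespaceRequest.builder().build())
                .getValue();
        // Create the bucket that receives the tutorial's results
        osClient.createBucket(CreateBucketRequest.builder()
                .namespaceName(namespace)
                .createBucketDetails(CreateBucketDetails.builder()
                        .name("<bucket-name>")             // placeholder
                        .compartmentId("<compartment-id>") // placeholder
                        .build())
                .build());
        osClient.close();
    }
}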
Then run this code to create and submit the Data Flow run:
import java.io.IOException;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.oracle.bmc.ConfigFileReader;
import com.oracle.bmc.Region;
import com.oracle.bmc.auth.AuthenticationDetailsProvider;
import com.oracle.bmc.auth.ConfigFileAuthenticationDetailsProvider;
import com.oracle.bmc.dataflow.DataFlowClient;
import com.oracle.bmc.dataflow.model.CreateRunDetails;
import com.oracle.bmc.dataflow.requests.CreateRunRequest;
import com.oracle.bmc.dataflow.responses.CreateRunResponse;

public class ETLWithJavaExample {
    private static final Logger logger = LoggerFactory.getLogger(ETLWithJavaExample.class);
    String compartmentId = "<compartment-id>"; // replace with your compartment OCID

    public static void main(String[] args) {
        System.out.println("ETL with JAVA Tutorial");
        new ETLWithJavaExample().createRun();
    }

    public void createRun() {
        ConfigFileReader.ConfigFile configFile;
        // Authenticate using the config from the ~/.oci/config file
        try {
            configFile = ConfigFileReader.parseDefault();
        } catch (IOException ie) {
            logger.error("Need to fix the config for Authentication ", ie);
            return;
        }
        try {
            AuthenticationDetailsProvider provider =
                    new ConfigFileAuthenticationDetailsProvider(configFile);
            // Create a Data Flow client
            DataFlowClient client = new DataFlowClient(provider);
            client.setRegion(Region.US_PHOENIX_1);
            // Build the spark-submit execute string: the main class, the input
            // CSV distributed via --files, the application JAR, and the
            // input/output arguments passed to the job
            String executeString = "--class convert.Convert "
                + "--files oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/kaggle_berlin_airbnb_listings_summary.csv "
                + "oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/oow-lab-2019-java-etl-1.0-SNAPSHOT.jar "
                + "kaggle_berlin_airbnb_listings_summary.csv oci://<bucket-name>@<namespace-name>/output/optimized_listings";
            // Create the run details and submit the run
            CreateRunDetails runDetails = CreateRunDetails.builder()
                    .compartmentId(compartmentId)
                    .displayName("Tutorial_1_ETL_with_JAVA")
                    .execute(executeString)
                    .build();
            CreateRunRequest runRequest =
                    CreateRunRequest.builder().createRunDetails(runDetails).build();
            CreateRunResponse response = client.createRun(runRequest);
            logger.info("Successful run creation for ETL_with_JAVA with OpcRequestID: "
                    + response.getOpcRequestId()
                    + " and Run ID: " + response.getRun().getId());
        } catch (Exception e) {
            logger.error("Exception creating run for ETL_with_JAVA ", e);
        }
    }
}
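The call to createRun only submits the run; it executes asynchronously in Data Flow. If you want the program to wait for the result, a sketch like the following can poll the run's lifecycle state. It assumes the SDK's GetRunRequest and RunLifecycleState types; verify the enum constant names against your SDK version.
import com.oracle.bmc.dataflow.DataFlowClient;
import com.oracle.bmc.dataflow.model.RunLifecycleState;
import com.oracle.bmc.dataflow.requests.GetRunRequest;

// Hypothetical helper: call with the ID returned by client.createRun(...)
static RunLifecycleState waitForRun(DataFlowClient client, String runId)
        throws InterruptedException {
    RunLifecycleState state;
    do {
        Thread.sleep(30_000); // poll every 30 seconds
        state = client.getRun(GetRunRequest.builder().runId(runId).build())
                .getRun().getLifecycleState();
        System.out.println("Run " + runId + " is " + state);
    } while (state == RunLifecycleState.Accepted
            || state == RunLifecycleState.InProgress);
    return state; // Succeeded, Failed, or Canceled
}
Watching the run's progress on the Data Flow Runs page in the Console works just as well; the polling loop simply automates the wait.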
If you have run this tutorial before, delete the contents of the output
directory,
oci://<bucket-name>@<namespace-name>/output/optimized_listings,
to prevent the tutorial from failing; a sketch of how to script the cleanup follows.
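This hypothetical clearOutput helper uses the OCI Java SDK's Object Storage client to list and delete everything under the output prefix. Pagination is ignored for brevity; listObjects returns at most one page of names per call, so a bucket with many output objects needs the list call repeated.
import com.oracle.bmc.objectstorage.ObjectStorageClient;
import com.oracle.bmc.objectstorage.model.ObjectSummary;
import com.oracle.bmc.objectstorage.requests.DeleteObjectRequest;
import com.oracle.bmc.objectstorage.requests.ListObjectsRequest;

// Hypothetical helper: deletes every object under the output prefix
static void clearOutput(ObjectStorageClient osClient, String namespace, String bucket) {
    ListObjectsRequest list = ListObjectsRequest.builder()
            .namespaceName(namespace)
            .bucketName(bucket)
            .prefix("output/optimized_listings")
            .build();
    for (ObjectSummary obj : osClient.listObjects(list).getListObjects().getObjects()) {
        osClient.deleteObject(DeleteObjectRequest.builder()
                .namespaceName(namespace)
                .bucketName(bucket)
                .objectName(obj.getName())
                .build());
    }
}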
Note
To find the compartment-id, open the navigation menu, click Identity, and then click Compartments. The compartments available to you are listed, including the OCID of each.
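If you'd rather find the OCID from code than from the Console, a sketch like this lists the compartments directly under your tenancy using the SDK's Identity client. It assumes the same provider built earlier from ~/.oci/config; the tenancy OCID serves as the root compartment.
import com.oracle.bmc.identity.IdentityClient;
import com.oracle.bmc.identity.model.Compartment;
import com.oracle.bmc.identity.requests.ListCompartmentsRequest;

// Print the name and OCID of each compartment directly under the tenancy root
IdentityClient identityClient = new IdentityClient(provider);
identityClient.setRegion(Region.US_PHOENIX_1);
ListCompartmentsRequest listRequest = ListCompartmentsRequest.builder()
        .compartmentId(provider.getTenantId()) // tenancy OCID = root compartment
        .build();
for (Compartment c : identityClient.listCompartments(listRequest).getItems()) {
    System.out.println(c.getName() + " : " + c.getId());
}
identityClient.close();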
2: Machine Learning with PySpark
Use spark-submit and the Java SDK to carry out machine learning with PySpark.
Complete exercise 1, ETL with Java, before attempting this exercise; its results are used here.
Run the following code:
import java.io.IOException;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.oracle.bmc.ConfigFileReader;
import com.oracle.bmc.Region;
import com.oracle.bmc.auth.AuthenticationDetailsProvider;
import com.oracle.bmc.auth.ConfigFileAuthenticationDetailsProvider;
import com.oracle.bmc.dataflow.DataFlowClient;
import com.oracle.bmc.dataflow.model.CreateRunDetails;
import com.oracle.bmc.dataflow.requests.CreateRunRequest;
import com.oracle.bmc.dataflow.responses.CreateRunResponse;

public class PySparkMLExample {
    private static final Logger logger = LoggerFactory.getLogger(PySparkMLExample.class);
    String compartmentId = "<compartment-id>"; // replace with your compartment OCID

    public static void main(String[] args) {
        System.out.println("ML_PySpark Tutorial");
        new PySparkMLExample().createRun();
    }

    public void createRun() {
        ConfigFileReader.ConfigFile configFile;
        // Authenticate using the config from the ~/.oci/config file
        try {
            configFile = ConfigFileReader.parseDefault();
        } catch (IOException ie) {
            logger.error("Need to fix the config for Authentication ", ie);
            return;
        }
        try {
            AuthenticationDetailsProvider provider =
                    new ConfigFileAuthenticationDetailsProvider(configFile);
            DataFlowClient client = new DataFlowClient(provider);
            client.setRegion(Region.US_PHOENIX_1);
            // The execute string names the PySpark script and passes the
            // optimized listings produced by the ETL exercise as its argument
            String executeString = "oci://oow_2019_dataflow_lab@idehhejtnbtc/oow_2019_dataflow_lab/usercontent/oow_lab_2019_pyspark_ml.py oci://<bucket-name>@<namespace-name>/output/optimized_listings";
            // Create the run details and submit the run
            CreateRunDetails runDetails = CreateRunDetails.builder()
                    .compartmentId(compartmentId)
                    .displayName("Tutorial_3_ML_PySpark")
                    .execute(executeString)
                    .build();
            CreateRunRequest runRequest =
                    CreateRunRequest.builder().createRunDetails(runDetails).build();
            CreateRunResponse response = client.createRun(runRequest);
            logger.info("Successful run creation for ML_PySpark with OpcRequestID: "
                    + response.getOpcRequestId()
                    + " and Run ID: " + response.getRun().getId());
        } catch (Exception e) {
            logger.error("Exception creating run for ML_PySpark ", e);
        }
    }
}
What's Next
Use spark-submit and the CLI in other situations.
You can use spark-submit and the Java SDK to create and run Java, Python, or SQL applications with Data Flow, and explore the results. Data Flow handles all the details of deployment, teardown, log management, security, and UI access. With Data Flow, you focus on developing Spark applications without worrying about the infrastructure.